pgml Python SDK with vector search support #636

Merged

santiatpml merged 21 commits into master from santi-pgml-memory-sdk-python

May 23, 2023

santiatpml commented May 19, 2023

The objective of this SDK is to provide an easy interface to PostgresML's generative AI capabilities. This version supports vector search using multiple models and text splitters.

Quick start instructions are here

santiadavani and others added 13 commits

May 12, 2023 16:25

Python SDK init

89c1223

create collection init

53dde0f

Upsert documents + tests

2d685e8

Creating more tables as part of collection ..

86ccef8

Register models and text splitters

5b5cee4

Refactored run select and added models

7b03e01

Embeddings and vector search

2d9202e

Incremental updates for chunks and embeddings

ea19ecc

Docstrings for all modules

dee6e5b

Minor updates

b7a0495

Added basic readme with quickstart

5186705

Updated readme with PGML_CONNECTION

5c8cf62

Updated readme

5a81918

santiatpml requested review from levkk, montanalow and solidsnack

May 19, 2023 16:04

solidsnack approved these changes

May 19, 2023

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py

		run_create_or_insert_statement(conn, create_schema_statement)
		create_table_statement = (
		"CREATE TABLE IF NOT EXISTS %s (\
		id serial8 PRIMARY KEY,\

Suggested change

	- id serial8 PRIMARY KEY,\
	+ id bigserial PRIMARY KEY,\

bigserial is more idiomatic, but it doesn't matter: serial8 is an alias for the same type.

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py Outdated

		document uuid NOT NULL,\
		metadata jsonb NOT NULL DEFAULT '{}',\
		text text NOT NULL,\
		UNIQUE (document)\

Technically this is a duplicate primary key, since it is unique. Curious why we can't use the id field as the document identifier?

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py Outdated

		run_create_or_insert_statement(conn, create_index_statement, autocommit=True)

		create_index_statement = (
		"CREATE INDEX CONCURRENTLY IF NOT EXISTS \

Don't need to create an index here on document; UNIQUE already creates one automatically, so I think you end up with two indexes on the same field.
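The duplicate-index point can be sketched without a Postgres server. The snippet below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in (SQLite, like PostgreSQL, backs a UNIQUE constraint with an automatically created index), and the table and index names are hypothetical, not the ones in collection.py.

```python
import sqlite3


def list_indexes(conn, table):
    """Return sorted (index_name, is_unique) pairs for a table."""
    rows = conn.execute(f"PRAGMA index_list({table})").fetchall()
    # PRAGMA index_list columns: seq, name, unique, ...
    return sorted((row[1], bool(row[2])) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " document TEXT NOT NULL,"
    " UNIQUE (document))"
)

# The UNIQUE constraint alone already produced one index on document.
before = list_indexes(conn, "documents")

# An explicit index on the same column adds a second, redundant one.
conn.execute(
    "CREATE INDEX IF NOT EXISTS documents_document_idx ON documents (document)"
)
after = list_indexes(conn, "documents")

print(before)
print(after)
```

Every write to the table then has to maintain both indexes, which is the cost this comment is pointing at.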

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py

		)
		run_create_or_insert_statement(conn, create_statement)

		index_statement = (

UNIQUE (task, splitter, model) in the table definition creates a compound index on those three columns. Having 3 additional individual indexes on the same columns may not be necessary.

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py

		created_at timestamptz NOT NULL DEFAULT now(), \
		task text NOT NULL, \
		splitter int8 NOT NULL REFERENCES %s\
		ON DELETE CASCADE\

Are you sure you want to CASCADE? This automatically deletes rows from this table when the row they reference in another table is deleted, which can delete a lot of data accidentally. I generally prefer ON DELETE RESTRICT. The default, NO ACTION, behaves the same way here: you'll get an error when attempting to delete a splitter that's referenced by this table. So you can just remove ON DELETE CASCADE.
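A minimal sketch of the difference, again using Python's built-in sqlite3 module as a stand-in for Postgres. The splitters and transforms tables here are hypothetical simplifications, and SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on; the delete semantics shown are the standard SQL ones that PostgreSQL also follows.

```python
import sqlite3


def delete_splitter(on_delete_clause):
    """Delete a referenced splitter and report (outcome, transforms rows left)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.execute(
        "CREATE TABLE splitters (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE transforms ("
        " id INTEGER PRIMARY KEY,"
        " splitter INTEGER NOT NULL REFERENCES splitters (id) "
        + on_delete_clause
        + ")"
    )
    conn.execute("INSERT INTO splitters (id, name) VALUES (1, 'example')")
    conn.execute("INSERT INTO transforms (splitter) VALUES (1), (1)")
    try:
        conn.execute("DELETE FROM splitters WHERE id = 1")
        outcome = "deleted"
    except sqlite3.IntegrityError:
        # The referencing rows block the delete.
        outcome = "blocked"
    remaining = conn.execute("SELECT count(*) FROM transforms").fetchone()[0]
    conn.close()
    return outcome, remaining


cascade_result = delete_splitter("ON DELETE CASCADE")
restrict_result = delete_splitter("ON DELETE RESTRICT")
default_result = delete_splitter("")  # no clause: the default, NO ACTION

print(cascade_result, restrict_result, default_result)
```

With CASCADE the delete succeeds and silently removes both referencing rows; with RESTRICT or no clause at all the delete raises an error and the rows stay.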

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py Outdated

		self.transforms_table = self.name + ".transforms"
		create_statement = (
		"CREATE TABLE IF NOT EXISTS %s (\
		oid regclass PRIMARY KEY,\

Odd way of referencing another table; I've never seen that done before. Is this compatible with logical replication? That is, if we want to move this data to another database, the oids probably won't match anymore, since they are specific to a Postgres installation.

This should probably be called table_name. An oid is something very Postgres-specific and does not mean a table reference at all. In fact, it can reference a row in a TOAST table or a type in pg_type.

levkk reviewed

May 19, 2023

pgml-sdks/python/pgml/pgml/collection.py Outdated

		"CREATE TABLE IF NOT EXISTS %s ( \
		id serial8 PRIMARY KEY,\
		created_at timestamptz NOT NULL DEFAULT now(),\
		chunk int8 NOT NULL REFERENCES %s\

One convention is to name the column field_id, which tells the reader that it refers to the primary key (usually id) column in another table. A reviewer would then check whether the referenced column has an index on it, which foreign keys need: validating a foreign key should be an index scan, the quickest way to confirm that the value exists in the other table.

		log.info("id key is not present.. hashing")
		document_id = hashlib.md5(text.encode("utf-8")).hexdigest()
		metadata = document
		delete_statement = "DELETE FROM %s WHERE document = %s" % (
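The fallback in the hunk above derives a document identifier from the text when no id key is supplied. A self-contained sketch of that idea follows; document_id_for is a hypothetical helper, the "text" key is an assumption about the document dict, and formatting the digest as a UUID string is this sketch's choice (an MD5 hex digest is 32 hex characters, the same size as the uuid column declared for document).

```python
import hashlib
import uuid


def document_id_for(document):
    """Return the caller-supplied id, or derive one from the document text."""
    if "id" in document:
        return str(document["id"])
    digest = hashlib.md5(document["text"].encode("utf-8")).hexdigest()
    # 32 hex characters map one-to-one onto a UUID value.
    return str(uuid.UUID(hex=digest))


first = document_id_for({"text": "PostgresML runs models in the database."})
second = document_id_for({"text": "PostgresML runs models in the database."})
explicit = document_id_for({"id": 42, "text": "anything"})

print(first == second, explicit)
```

Because the id depends only on the text, upserting the same text twice targets the same row, while any edit to the text yields a new identifier.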

		chunks = text_splitter.create_documents([text])
		for chunk_id, chunk in enumerate(chunks):
		insert_statement = (
		"INSERT INTO %s (document,splitter,chunk_id, chunk) VALUES (%s, %s, %s, %s);"

		model_params = results[0]["parameters"]

		# get all chunks that don't have embeddings
		embeddings_statement = (

		embeddings_table = self._create_or_get_embeddings_table(
		conn, model_id=model_id, splitter_id=splitter_id
		)
		select_statement = "SELECT name, parameters FROM %s WHERE id = %d;" % (

		results = run_select_statement(conn, select_statement)

		model = results[0]["name"]
		query_embeddings = self._get_embeddings(

		query_embeddings = self._get_embeddings(
		conn, query, model_name=model, parameters=query_parameters
		)
		embeddings_table = self._create_or_get_embeddings_table(

		)

		select_statement = (
		"SELECT chunk, 1 - (%s.embedding <=> %s::float8[]::vector) AS score FROM %s ORDER BY score DESC LIMIT %d;"

		for result in results:
		_out = {}
		_out["score"] = result["score"]
		select_statement = "SELECT chunk, document FROM %s WHERE id = %d" % (
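The score in the vector search query is 1 minus the result of pgvector's <=> operator, which is cosine distance, so the score is cosine similarity and ORDER BY score DESC returns the closest chunks first. A pure-Python sketch of the same arithmetic, assuming non-zero vectors; rank_chunks and the sample embeddings are hypothetical and not part of the SDK.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query_embedding, chunk_embeddings, top_k):
    """Score every chunk as 1 - cosine distance and keep the top_k highest."""
    scored = [
        (chunk_id, 1.0 - cosine_distance(query_embedding, embedding))
        for chunk_id, embedding in chunk_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


chunk_embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
ranked = rank_chunks([1.0, 0.0], chunk_embeddings, top_k=2)
print(ranked)
```

Chunk 1 points the same way as the query and ranks first, chunk 3 sits at 45 degrees and ranks second, and the orthogonal chunk 2 is cut off by top_k.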

		self.pool.putconn(conn)
		return Collection(self.pool, name)

		def delete_collection(self, name: str) -> None:
