Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

pgml Python SDK with vector search support#636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
santiatpml merged 21 commits intomasterfromsanti-pgml-memory-sdk-python
May 23, 2023

Conversation

@santiatpml
Copy link
Contributor

Objective of this SDK is to provide an easy interface for PostgresML generative AI capabilities. This version supports vector search using multiple models and text splitters.

Quick start instructions arehere

run_create_or_insert_statement(conn, create_schema_statement)
create_table_statement = (
"CREATE TABLE IF NOT EXISTS %s (\
id serial8 PRIMARY KEY,\
Copy link
Contributor

@levkklevkkMay 19, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
idserial8PRIMARYKEY,\
idbigserialPRIMARYKEY,\

More idiomatic, but doesn't matter.

document uuid NOT NULL,\
metadata jsonb NOT NULL DEFAULT '{}',\
text text NOT NULL,\
UNIQUE (document)\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Duplicate primary key technically, since this is unique. Curious why we can't use theid field as the document identifier?

run_create_or_insert_statement(conn, create_index_statement, autocommit=True)

create_index_statement = (
"CREATE INDEX CONCURRENTLY IF NOT EXISTS \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Don't need to create an index here ondocument,UNIQUE does it already automatically. So I think you end up with two indexes on the same field.

solidsnack and montanalow reacted with thumbs up emoji
)
run_create_or_insert_statement(conn, create_statement)

index_statement = (
Copy link
Contributor

@levkklevkkMay 19, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UNIQUE (task, splitter, model) in the table definition creates a compound index on those three columns. Having 3 additional individual indexes on the same columns may not be necessary.

created_at timestamptz NOT NULL DEFAULT now(), \
task text NOT NULL, \
splitter int8 NOT NULL REFERENCES %s\
ON DELETE CASCADE\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Are you sure you want toCASCADE? This has the effect of automatically deleting rows from this table when the row they are referencing in another table is deleted. This can delete a lot of data accidentally. The preferred way for me generally is toON DELETE RESTRICT which is the default. That way, you'll get an error when attempting to delete a splitter that's referenced by this table. This is default behavior also, so you can just removeON DELETE CASCADE.

self.transforms_table = self.name + ".transforms"
create_statement = (
"CREATE TABLE IF NOT EXISTS %s (\
oid regclass PRIMARY KEY,\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Odd way of referencing another table, I've never seen that done before. Is this compatible with logical replication? I.e. if we want to move this data to another database, the oids probably won't match anymore, since they are specific to a Postgres installation.

Copy link
Contributor

@levkklevkkMay 19, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This should probably be calledtable_name. Anoid is something very Postgres-specific and does not mean a table reference at all. In fact, it can reference a row in a TOAST table or a type inpg_type.

"CREATE TABLE IF NOT EXISTS %s ( \
id serial8 PRIMARY KEY,\
created_at timestamptz NOT NULL DEFAULT now(),\
chunk int8 NOT NULL REFERENCES %s\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The convention is sometimes to usefield_id to explain to the reader that this refers to the primary key (usuallyid) column in another table. The reviewer would look in this case if the column being referenced has an index on it, which is often required for foreign keys: a foreign key validation needs to be an index scan, the quickest way to validate that this value exists in another table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

_id is not just a railsism, it's a popular and significant readability improvement to easily distinguish local fields from foreign keys, which isn't really Hungarian notation. It's part of what-this-field-represents, rather than what-data-type-is-this-field.

It's more similar to the convention many programming languages use by adding* or& to distinguish references and pointers from local objects on the stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A foreign key can be polymorphic, in which case, the naming convention suddenly becomes important. Sure, one can't really enforce naming conventions programmatically, but there is no reason to make something harder to understand.

For example,created_at can be renamed torandom_date and only by looking at the definition of the column one would realize that this is actually a creation timestamp, but we don't do that, we call itcreated_at because we want to be kind to our future selves and to reviewers and other users of our code. An argument can be made for calling variablesabcd and only by reading the code fully, one will truly understand what they do, but we don't do that either.

Rails can be many things that we don't agree with, but the way ActiveRecord handles data is fantastic, and we would be wise to learn from their experience and do something most engineers are familiar and happy to work with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

And you will commonly see code that is forced to name two variables in a context things likedocument_ptr and adocument , to disambiguate the one that has been dereferenced safely. No one would assume that the compiler checks_ptr correctness, but if they ever saw a variable with that extension, next to a variable without the extension, and the truth of the references disagreed with the conventions we've established match expectations, people would say the variable was misnamed and likely to cause bugs or confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We write code for others to read. Code should consistently follow the conventions established in the project. The convention has already been established, and there is no need to revisit that convention, when the suggestion obviously confused multiple team members in multiple places due to not just inconsistency with the greater project scope, but inconsistency within the necessary layers to achieve a solution, and also inconsistency within a single table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Natural joins withUSING is generally useless, because it uses all columns in common, not just the key. In this casecreated_at would be shared but would not be the same, so natural joins won't work.

NoteUSING is reasonably safe from column changes in the joined relations since only the listed columns are combined. NATURAL is considerably more risky since any schema changes to either relation that cause a new matching column name to be present will cause the join to combine that new column as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If you're still not convinced, there is theuser_uuid in your example, which is yet another instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

But similarly, if you really want to useUSING, and not natural joins, it works just as well to call them bothdocument_id, and you can keep that feature of SQL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Calling them bothdocument_id would also be a break with Rails conventions, though...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

You seem confused about the importance of Rails conventions, and to be missing the point that this schema is not an island unto itself, nor is this SDK. The important aspect is that a database and schema in this context are useless without an application layer of logic on top. That application layer will have concerns with ORM that will be simpler to maintain if the database follows a convention of using_id prefixes. The reason you see the pattern replicated so widely, is because it's so frequently eases the mental burden of application developers. DBA's or people working only inside the database may not understand these concepts or concerns, but that doesn't mean they don't exist.

Copy link
Contributor

@levkklevkk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The tables feel a bit overindexed. In Postgres, each index carries a write penalty: for each row that's inserted into the table, each index needs to be updated accordingly. When all (or most) columns are updated, an update to any column requires an update to all indexes (i.e. Postgres can't do HOT, heap-only, updates).

I think I understand the schema overall, although it would be helpful to define it in a separate.sql file which can then be executed when the SDK is used for the first time. Although, we do generate a lot of tables in the fly, so understandably that's not possible for all use cases.

Overall, I think this is great to start with, and might require some optimizations as it's deployed at scale.

@levkk
Copy link
Contributor

It is much better to improve performance later by removing indexes than to try to improve it by adding them...I'm not sure how this kind of thinking about indexes got started.

Experience. Removing an index is dangerous, adding an index is safe.

log.info("id key is not present.. hashing")
document_id = hashlib.md5(text.encode("utf-8")).hexdigest()
metadata = document
delete_statement = "DELETE FROM %s WHERE document = %s" % (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This would be more clear to me if the field name wasdocument_id to match the variable name. I was half expectingdocument to be a text value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Right, so using one convention for names at one layer, and a different convention for names in the next layer down seems confusing, there is good reason to add_id in both layers.

chunks = text_splitter.create_documents([text])
for chunk_id, chunk in enumerate(chunks):
insert_statement = (
"INSERT INTO %s (document,splitter,chunk_id, chunk) VALUES (%s, %s, %s, %s);"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Missing_id suffixes make this hard for me to read.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes,chunk_id is the chunk index for the given document text. For example: if the text "hello world" is split into two chunks then chunk_id 0 will map to "hello" and chunk_id 1 will map to "world".id is the global id of the chunk across all documents and splitters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, this is a perfect example of needing both a local and reference to the same "object" in a single context, and being forced to disambiguate one of them with an_id suffix. The confusing ones in this line aredocument, andsplitter.chunk andchunk_id are clear and easily distinguished, even thoughchunk_id does not in fact refer to a foreign key in achunks table, it's clearly marked as a reference.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

chunk_id is the ordering of the chunk within the document.chunk_index sounds good.

"CREATE TABLE IF NOT EXISTS %s ( \
id serial8 PRIMARY KEY,\
created_at timestamptz NOT NULL DEFAULT now(),\
chunk int8 NOT NULL REFERENCES %s\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

_id is not just a railsism, it's a popular and significant readability improvement to easily distinguish local fields from foreign keys, which isn't really Hungarian notation. It's part of what-this-field-represents, rather than what-data-type-is-this-field.

It's more similar to the convention many programming languages use by adding* or& to distinguish references and pointers from local objects on the stack.

model_params = results[0]["parameters"]

# get all chunks that don't have embeddings
embeddings_statement = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This needs to be refactored into the insert statement to avoid round tripping the vector.

embeddings_table = self._create_or_get_embeddings_table(
conn, model_id=model_id, splitter_id=splitter_id
)
select_statement = "SELECT name, parameters FROM %s WHERE id = %d;" % (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I would include this with a CTE in the embeddings statement as well to avoid the round trip.

results = run_select_statement(conn, select_statement)

model = results[0]["name"]
query_embeddings = self._get_embeddings(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A query builder can be used for the CTE here.

query_embeddings = self._get_embeddings(
conn, query, model_name=model, parameters=query_parameters
)
embeddings_table = self._create_or_get_embeddings_table(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This needs to be cached some how to avoid multiple round trips to the db for this function.

)

select_statement = (
"SELECT chunk, 1 - (%s.embedding <=> %s::float8[]::vector) AS score FROM %s ORDER BY score DESC LIMIT %d;"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This needs to be embedded with all the results queries to avoid n+1 queries

for result in results:
_out = {}
_out["score"] = result["score"]
select_statement = "SELECT chunk, document FROM %s WHERE id = %d" % (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Prefer joins in SQL to round trips from the app.

self.pool.putconn(conn)
return Collection(self.pool, name)

def delete_collection(self, name: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hard delete is a bit of a footgun.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yeah, we could have some archiving mechanism, which simply alters the schema toname + '_archive', and then enable delete or restore on archives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Adding more storage space is cheap. Recovering lost data is expensive and scary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Also, we should support "undo".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

that's what I meant by "restore"

solidsnack reacted with thumbs up emoji
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That does not guarantee people aren't still using the table, if they are ignoring the flag by using poorly written homegrown queries, which is a potentially desirable outcome for projects that outgrow simple SDK designs, and what to leverage the full expressive power of SQL.

solidsnack

This comment was marked as duplicate.

solidsnack

This comment was marked as duplicate.

Copy link
Contributor

@montanalowmontanalow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

🥳

@santiatpmlsantiatpml merged commitcac1a6a intomasterMay 23, 2023
@santiatpmlsantiatpml deleted the santi-pgml-memory-sdk-python branchMay 23, 2023 22:42
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@montanalowmontanalowmontanalow approved these changes

+2 more reviewers

@solidsnacksolidsnacksolidsnack approved these changes

@levkklevkklevkk approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

6 participants

@santiatpml@levkk@solidsnack@montanalow@santiadavani

[8]ページ先頭

©2009-2025 Movatter.jp