This change adds a new wrapper for pgml.embed. Instead of returning a single row, it returns a table, which is very useful for batch-processing strings.

Example usage:

SELECT * FROM pgml.embed2('all-MiniLM-L6-v2', (SELECT array_agg(phrase) FROM (SELECT * FROM phrases LIMIT 10) AS p));

To create the function manually:

CREATE OR REPLACE FUNCTION pgml."embed2"(
    transformer TEXT,
    inputs TEXT[],
    kwargs JSONB DEFAULT '{}'
) RETURNS TABLE (text TEXT, embedding real[])
    LANGUAGE c IMMUTABLE STRICT PARALLEL SAFE
    AS 'MODULE_PATHNAME', 'embed_batch2_wrapper';

ns1000 added 4 commits

November 23, 2023 15:32

Added pyarrow==11.0.0 to requirements to solve issue where postgres would segfault after a client session which used pgml command closes. The issue can be identified in postgres log files with the line 'arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit'

503dc71

Merge branch 'master' of https://github.com/postgresml/postgresml

52e5b8d

Added embed_batch2, which will batch process multiple strings and then return the embeddings as a table

3462868

Merge branch 'master' of https://github.com/postgresml/postgresml

c5c8640

montanalow requested changes

Nov 25, 2023

montanalow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

You can add a migration to create the function for people who upgrade, e.g.
https://github.com/postgresml/postgresml/blob/master/pgml-extension/sql/pgml--2.7.13--2.8.0.sql

pgml-extension/src/api.rs

		@@ -558,6 +558,26 @@ pub fn embed_batch(
		}
		}

		#[cfg(all(feature = "python", not(feature = "use_as_lib")))]
		#[pg_extern(immutable, parallel_safe, name = "embed2")]
		pub fn embed_batch2<'a>(

montanalow commented Nov 25, 2023 (edited)

My rough thoughts, without running the code on some examples.

I think we should name this embed_3 in SQL and embed_batch_3 in Rust, with the goal of establishing this as the 3.0 embed API, as well as a pattern for releasing 3.0 APIs early while we are developing them in an alpha state (with potentially breaking changes, where we completely drop them in 3.1 in favor of the newly established default behavior).

Your example convinces me that batch APIs should return a table, but I think that table's rows should be JSONB with {id, embedding} keys (at least), unless there is a significant performance implication on that front. My thinking is that embedding models are getting more complicated, and some now take JSON rather than TEXT for inputs, including a prompt. It would be nice to have an optional id in the input JSON, and if it's not present, then just return the entire input JSON as the id, which acts just like your TEXT as the key.

A final thought is that kwargs is currently JSONB, which works well with the underlying Python dependencies, but I'd like to structure it as much as possible for the final 3.0. We should find a way to flag this obviously as an alpha API that will be broken and eventually dropped when a final version is available.
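The id fallback described in the comment above can be sketched in a few lines of Python. This is only an illustration of the proposed row shape, not code from the PR: the helper names row_id and to_rows are hypothetical, and the embeddings are assumed to have been computed already by whatever model is in use.

```python
import json


def row_id(item):
    """Return the caller-supplied id if present, else the whole input as the id."""
    if isinstance(item, dict) and "id" in item:
        return item["id"]
    # No explicit id: the entire input (JSON object or plain text) acts as the key.
    return item


def to_rows(inputs, embeddings):
    """Pair each input with its embedding as an {id, embedding} row."""
    if len(inputs) != len(embeddings):
        raise ValueError("inputs and embeddings must have the same length")
    return [
        {"id": row_id(item), "embedding": list(vector)}
        for item, vector in zip(inputs, embeddings)
    ]


if __name__ == "__main__":
    # Hypothetical inputs: one with an id, one without, one plain TEXT value.
    example_inputs = [{"id": 7, "prompt": "a"}, {"prompt": "b"}, "plain text"]
    example_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    for row in to_rows(example_inputs, example_embeddings):
        print(json.dumps(row))
```

With this shape, a plain TEXT input behaves as in the current PR (the text itself is the key), while JSON inputs can carry their own id.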

ns1000 commented Nov 25, 2023

So it turns out that batching is not really necessary to achieve speed. When running on CPU within the Postgres Python VM, you really need to call torch.set_num_threads(1) to get the maximum speed. Leaving it at the default value, which is the number of CPUs, was causing the slowdowns for me. It will still use all the CPUs when threads=1.

I am using a Debian system with Python 3.11 and Postgres 16 to test all this.
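A minimal sketch of the thread setting mentioned above, assuming PyTorch is installed in the Python environment that Postgres loads. The helper name limit_torch_threads is hypothetical, and where such a call would belong inside the extension is not stated in this thread.

```python
def limit_torch_threads(num_threads=1):
    """Cap PyTorch's intra-op CPU threads; return the active count, or None without torch."""
    try:
        import torch
    except ImportError:
        # torch is not available in this interpreter, so there is nothing to configure.
        return None
    torch.set_num_threads(num_threads)
    return torch.get_num_threads()


if __name__ == "__main__":
    print(limit_torch_threads(1))
```

The call would need to run before any embedding work in the same process. Whether it helps on a given machine is worth measuring rather than assuming.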

montanalow commented Nov 25, 2023

> So it turns out that batching is not really necessary to achieve speed. When running on CPU within the Postgres Python VM, you really need to call torch.set_num_threads(1) to get the maximum speed. Leaving it at the default value, which is the number of CPUs, was causing the slowdowns for me. It will still use all the CPUs when threads=1.
> I am using a Debian system with Python 3.11 and Postgres 16 to test all this.

Ah, so this is actually another hit on #1161

Added embed2 which returns a table structure #1186