Separate embedding kwargs into init kwargs and encode kwargs #1555


Conversation

tomaarsen
Contributor

Resolves #1169

Hello!

Pull Request overview

  • Separate embedding kwargs into init kwargs and encode kwargs
  • Introduces support for custom code models via trust_remote_code (e.g. pgml.embed trust_remote_code #1169)
  • Introduces support for private models via token (previously only possible via an environment variable, which FYI is still the recommended approach for security)
  • Introduces support for Matryoshka models such as this Vietnamese one, which was trained such that embeddings can be truncated to smaller sizes with minimal performance loss & much faster retrieval, via truncate_dim.
  • Introduces advanced loading support via model_kwargs/tokenizer_kwargs/config_kwargs. The first is the most useful for inference, e.g. allowing models to be loaded in lower precision for faster inference: model_kwargs={"torch_dtype": "bfloat16"}. See the usage sketch after this list.
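
To make the new init kwargs concrete, here is a minimal usage sketch against Sentence Transformers v3 directly (not via pgml.embed); the model name is a placeholder and, as noted below, I haven't run this snippet:

```python
from sentence_transformers import SentenceTransformer

# Placeholder model name; truncate_dim and model_kwargs require
# sentence-transformers >= 3.0.0.
model = SentenceTransformer(
    "my-org/my-matryoshka-model",
    trust_remote_code=True,                    # allow custom modeling code from the Hub
    truncate_dim=256,                          # keep only the first 256 embedding dimensions
    model_kwargs={"torch_dtype": "bfloat16"},  # load the weights in lower precision
)
embeddings = model.encode(["An example sentence."])
print(embeddings.shape)  # (1, 256) because of truncate_dim
```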

Details

This PR splits kwargs in pgml.embed into two types of kwargs: those for model = SentenceTransformer(..., **kwargs) and those for model.encode(..., **kwargs). This is currently done using a simple filter that checks for kwargs that are only (e.g. trust_remote_code) or primarily (e.g. device) relevant for the initialization.
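
For illustration, the split boils down to something like the following simplified sketch (this is not the exact code in this PR, and the key set here is illustrative, not exhaustive):

```python
# Simplified sketch of the init/encode kwargs split; not the PR's actual code.
# Kwargs whose names appear in INIT_KWARGS go to the SentenceTransformer
# constructor, everything else goes to model.encode().
INIT_KWARGS = {
    "device", "trust_remote_code", "token", "truncate_dim",
    "model_kwargs", "tokenizer_kwargs", "config_kwargs",
}

def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    init_kwargs = {k: v for k, v in kwargs.items() if k in INIT_KWARGS}
    encode_kwargs = {k: v for k, v in kwargs.items() if k not in INIT_KWARGS}
    return init_kwargs, encode_kwargs

init_kwargs, encode_kwargs = split_kwargs({"trust_remote_code": True, "batch_size": 32})
# init_kwargs   -> {"trust_remote_code": True}
# encode_kwargs -> {"batch_size": 32}
```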

I want to give a big preface that I have not tested this (!). I'm afraid my bandwidth is a bit too limited this week for that. Another note is that model_kwargs/tokenizer_kwargs/config_kwargs and truncate_dim were only introduced in Sentence Transformers v3.0.0, whereas this project still seems to be on v2.7. (FYI: ST v3.0 does not introduce breaking changes for inference, so upgrading should be safe.)

  • Tom Aarsen

@montanalow self-requested a review on July 12, 2024 14:41
@montanalow force-pushed the sentence_transformers_init_kwargs branch 2 times, most recently from 470f2d3 to 18be006 on July 12, 2024 14:58
@montanalow force-pushed the sentence_transformers_init_kwargs branch from 18be006 to 465f38d on July 12, 2024 14:59
@montanalow
Contributor

Thanks for the PR. I've added our embedding tests to CI, since we generally don't run the whole transformers suite due to model download times. Confirmed that the trust_remote_code flag now works as expected.

@montanalow merged commit debd9ae into postgresml:master on Jul 12, 2024
@tomaarsen
Contributor, Author

Excellent, thank you for merging & writing some simple tests.

  • Tom Aarsen

