Legacy

The legacy with_embeddings API is for Python only and is deprecated.

Hugging Face

The most popular open-source option is to use the sentence-transformers library, which can be installed via pip.

    pip install sentence-transformers

The example below shows how to use the paraphrase-albert-small-v2 model to generate embeddings for a given document.

    from sentence_transformers import SentenceTransformer

    name = "paraphrase-albert-small-v2"
    model = SentenceTransformer(name)

    # used for both training and querying
    def embed_func(batch):
        return [model.encode(sentence) for sentence in batch]

OpenAI

Another popular alternative is to use an external API like OpenAI's embeddings API.

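    import os

    import openai

    # The client reads the key from the OPENAI_API_KEY environment variable.
    # If the variable is not set, pass the key to the client explicitly.
    if "OPENAI_API_KEY" in os.environ:
        client = openai.OpenAI()
    else:
        client = openai.OpenAI(api_key="sk-...")

    def embed_func(c):
        rs = client.embeddings.create(input=c, model="text-embedding-ada-002")
        # The response is an object, so the records are read from its data attribute
        return [record.embedding for record in rs.data]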

Applying an embedding function to data

You can apply an embedding function to raw data to generate embeddings for each record.

Say you have a pandas DataFrame with a text column that you want embedded. You can use the with_embeddings function to generate the embeddings, then use the result to create a table or append to an existing one.

    import pandas as pd
    from lancedb.embeddings import with_embeddings

    df = pd.DataFrame(
        [
            {"text": "pepperoni"},
            {"text": "pineapple"},
        ]
    )
    data = with_embeddings(embed_func, df)

    # The output is used to create / append to a table.
    # db is assumed to be an existing connection created with lancedb.connect(...)
    tbl = db.create_table("my_table", data=data)

If your data is in a different column, you can specify the column kwarg to with_embeddings.

By default, LanceDB calls the function with batches of 1000 rows. This can be configured using the batch_size parameter to with_embeddings.
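
To make the batching behavior concrete, here is a minimal sketch of how rows might be split into batches before each call to the embedding function. It is not LanceDB's implementation: iter_batches, embed_in_batches, and toy_embed are hypothetical names used only for illustration, and toy_embed stands in for a real model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int = 1000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def embed_in_batches(
    embed_func: Callable[[Sequence[str]], List[List[float]]],
    rows: Sequence[str],
    batch_size: int = 1000,
) -> List[List[float]]:
    """Call `embed_func` once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in iter_batches(rows, batch_size):
        vectors.extend(embed_func(batch))
    return vectors


def toy_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real model: one 2-d vector per string."""
    return [[float(len(text)), float(text.count("p"))] for text in batch]


rows = ["pepperoni", "pineapple", "mushroom"]
# With batch_size=2, toy_embed is called twice: once with two rows, once with one.
vectors = embed_in_batches(toy_embed, rows, batch_size=2)
```

The embedding function always receives a list of inputs and returns one vector per input, so the batch size changes how many calls are made but not the result.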

LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI API call is reliable.
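
For readers unfamiliar with this kind of wrapper, the sketch below shows the retry half of the idea using exponential backoff with jitter. It is an illustration only, not the wrapper LanceDB applies: with_retries, its parameters, and flaky_embed are hypothetical, and rate limiting is not shown.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[..., T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `func` so failed calls are retried with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def wrapper(*args, **kwargs) -> T:
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                # Wait base_delay * 2^(attempt - 1), plus up to 10% random jitter.
                delay = base_delay * 2 ** (attempt - 1)
                sleep(delay + random.uniform(0, 0.1 * delay))
        raise AssertionError("unreachable")

    return wrapper


calls = {"count": 0}


def flaky_embed(batch):
    """Hypothetical embedding call that fails on its first two invocations."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [[float(len(text))] for text in batch]


delays = []
# Record the delays instead of actually sleeping, so the example runs instantly.
reliable_embed = with_retries(flaky_embed, sleep=delays.append)
result = reliable_embed(["pepperoni"])
```

In this run the wrapped function succeeds on the third attempt after two recorded waits. A production wrapper would typically retry only on specific error types instead of every exception.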

Querying using an embedding function

Warning

At query time, you must use the same embedding function you used to vectorize your data. If you use a different embedding function, the embeddings will not reside in the same vector space and the results will be nonsensical.

Python

    query = "What's the best pizza topping?"
    query_vector = embed_func([query])[0]
    results = (
        tbl.search(query_vector)
        .limit(10)
        .to_pandas()
    )

The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
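
To illustrate what "closest" means, and why the warning above matters, here is a brute-force nearest-neighbor sketch in plain Python. It assumes Euclidean (L2) distance and uses two toy embedding functions, embed_a and embed_b, which are hypothetical stand-ins for real models. The search helper is also hypothetical and is not how LanceDB performs search internally.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def embed_a(batch: Sequence[str]) -> List[Vector]:
    """Toy embedding: [count of 'p', count of 'e'] for each string."""
    return [[float(t.count("p")), float(t.count("e"))] for t in batch]


def embed_b(batch: Sequence[str]) -> List[Vector]:
    """A different toy embedding: the same counts in the opposite order."""
    return [[float(t.count("e")), float(t.count("p"))] for t in batch]


def search(
    query_vector: Vector,
    table: Sequence[Tuple[str, Vector]],
    limit: int = 10,
) -> List[str]:
    """Return the texts of the `limit` rows closest to `query_vector` by L2 distance."""
    ranked = sorted(table, key=lambda row: math.dist(query_vector, row[1]))
    return [text for text, _ in ranked[:limit]]


docs = ["pepper", "eel"]
table = list(zip(docs, embed_a(docs)))  # the data is vectorized with embed_a

query = "puppy"
# Same function at query time: the query lands near "pepper", which shares its p's.
same_space = search(embed_a([query])[0], table, limit=1)
# Different function at query time: the vector is in another space.
wrong_space = search(embed_b([query])[0], table, limit=1)
```

Here same_space is ["pepper"], while wrong_space is ["eel"]: the query text is identical, but embedding it with a different function changes which row looks closest.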
