Get Started
Vector embeddings can represent any kind of data, from text to images to audio, which makes them a powerful tool for machine learning practitioners. However, there is no one-size-fits-all solution for generating embeddings: many different libraries and APIs (both commercial and open source) can be used to generate embeddings from structured and unstructured data.
LanceDB supports three methods of working with embeddings:

- You can manually generate embeddings for the data and queries yourself, outside of LanceDB (a minimal sketch of this approach follows this list).
- You can use the built-in embedding functions to embed the data and queries in the background.
- You can define your own custom embedding function that extends the default embedding functions (a toy example is sketched below, after the note on the legacy API).
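For the first approach, you compute vectors with whatever model or service you like and store them as an ordinary column. Below is a minimal sketch, assuming the sentence-transformers package is installed; the model choice and the "manual_words" table name are only for illustration, and any embedding model or API works the same way.

```python
# Minimal sketch of the manual approach: embeddings are computed outside
# LanceDB and stored/queried as a plain "vector" column.
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative model choice
texts = ["hello world", "goodbye world"]

db = lancedb.connect("/tmp/db")
table = db.create_table(
    "manual_words",
    data=[
        {"text": t, "vector": v.tolist()}
        for t, v in zip(texts, model.encode(texts))
    ],
)

# Queries must be embedded with the same model before searching.
query_vector = model.encode("greetings").tolist()
print(table.search(query_vector).limit(1).to_list()[0]["text"])
```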
For Python users, there is also a legacy `with_embeddings` API. It is retained for backwards compatibility and will be removed in a future version.
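For the third option, a custom embedding function subclasses `TextEmbeddingFunction` and registers itself in the same registry that the built-in functions use, so it can then be retrieved with `get_registry().get(...)` just like `openai` or `sentence-transformers`. The sketch below is a toy: the `mock-embedder` name, the `dims` field, and the character-folding logic are assumptions made up for illustration, not a real model.

```python
from lancedb.embeddings import register
from lancedb.embeddings.base import TextEmbeddingFunction


@register("mock-embedder")  # hypothetical registry name chosen for this sketch
class MockEmbedder(TextEmbeddingFunction):
    """Toy embedder that folds character codes into a fixed-size vector."""

    dims: int = 16  # configuration is declared as pydantic-style fields

    def ndims(self):
        # Dimensionality of the vectors returned by generate_embeddings().
        return self.dims

    def generate_embeddings(self, texts):
        # Stand-in for a real model or API call.
        vectors = []
        for text in texts:
            vec = [0.0] * self.dims
            for i, ch in enumerate(text):
                vec[i % self.dims] += float(ord(ch))
            vectors.append(vec)
        return vectors
```

Once registered, `get_registry().get("mock-embedder").create()` returns an instance whose `SourceField()` and `VectorField()` can be used in a `LanceModel` schema exactly as in the examples below.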
Quickstart
To get started with embeddings, you can use the built-in embedding functions.
OpenAI Embedding function
LanceDB registers the OpenAI embedding function in the registry as `openai`. You can pass any supported model name to the `create` method. By default it uses `"text-embedding-ada-002"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Assumes the OPENAI_API_KEY environment variable is set.
db = lancedb.connect("/tmp/db")
func = get_registry().get("openai").create(name="text-embedding-ada-002")


class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()


table = db.create_table("words", schema=Words, mode="overwrite")
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
```typescript
import * as lancedb from "@lancedb/lancedb";
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry, register } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";
import { type Float, Float32, Utf8 } from "apache-arrow";

// databaseDir is the path to your database directory (defined elsewhere in the full example).
const db = await lancedb.connect(databaseDir);
const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-ada-002" }) as EmbeddingFunction;

const wordsSchema = LanceSchema({
  text: func.sourceField(new Utf8()),
  vector: func.vectorField(),
});
const tbl = await db.createEmptyTable("words", wordsSchema, {
  mode: "overwrite",
});
await tbl.add([{ text: "hello world" }, { text: "goodbye world" }]);

const query = "greetings";
const actual = (await tbl.search(query).limit(1).toArray())[0];
```
```rust
use std::{iter::once, sync::Arc};

use arrow_array::{Float64Array, Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::StreamExt;
use lancedb::{
    arrow::IntoArrow,
    connect,
    embeddings::{openai::OpenAIEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},
    query::{ExecutableQuery, QueryBase},
    Result,
};

#[tokio::main]
async fn main() -> Result<()> {
    let tempdir = tempfile::tempdir().unwrap();
    let tempdir = tempdir.path().to_str().unwrap();
    let api_key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY is not set");
    let embedding = Arc::new(OpenAIEmbeddingFunction::new_with_model(
        api_key,
        "text-embedding-3-large",
    )?);

    let db = connect(tempdir).execute().await?;
    db.embedding_registry()
        .register("openai", embedding.clone())?;

    // make_data() builds the source record batches with a "text" column
    // (defined elsewhere in the full example).
    let table = db
        .create_table("vectors", make_data())
        .add_embedding(EmbeddingDefinition::new(
            "text",
            "openai",
            Some("embeddings"),
        ))?
        .execute()
        .await?;

    let query = Arc::new(StringArray::from_iter_values(once("something warm")));
    let query_vector = embedding.compute_query_embeddings(query)?;
    let mut results = table
        .vector_search(query_vector)?
        .limit(1)
        .execute()
        .await?;

    let rb = results.next().await.unwrap()?;
    let out = rb
        .column_by_name("text")
        .unwrap()
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();
    let text = out.iter().next().unwrap().unwrap();
    println!("Closest match: {}", text);
    Ok(())
}
```
Sentence Transformers Embedding function
LanceDB registers the Sentence Transformers embedding function in the registry as `sentence-transformers`. You can pass any supported model name to the `create` method. By default it uses `"sentence-transformers/paraphrase-MiniLM-L6-v2"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("/tmp/db")
model = get_registry().get("sentence-transformers").create(
    name="BAAI/bge-small-en-v1.5", device="cpu"
)


class Words(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()


table = db.create_table("words", schema=Words)
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
TypeScript and Rust examples for the Sentence Transformers embedding function are coming soon.
Embedding function with LanceDB Cloud
Embedding functions are now supported on LanceDB Cloud. Embeddings are generated on the source device and sent to the cloud, so the source device must have the resources needed to generate them. Here's an example using the OpenAI embedding function:
```python
import os

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

os.environ["OPENAI_API_KEY"] = "..."

db = lancedb.connect(
    uri="db://....",
    api_key="sk_...",
    region="us-east-1",
)
func = get_registry().get("openai").create()


class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()


table = db.create_table("words", schema=Words)
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```