Working with tables
A Table is a collection of Records in a LanceDB Database. Tables in Lance have a schema that defines the columns and their types. These schemas can include nested columns and can evolve over time.
This guide will show how to create tables, insert data into them, and update the data.
Creating a LanceDB Table
Initialize a LanceDB connection and create a table
LanceDB allows ingesting data from various sources -dict
,list[dict]
,pd.DataFrame
,pa.Table
or aIterator[pa.RecordBatch]
. Let's take a look at some of the these.
From list of tuples or dictionaries
Note
If the table already exists, LanceDB will raise an error by default.
create_table
supports an optionalexist_ok
parameter. When set to Trueand the table exists, then it simply opens the existing table. The data youpassed in will NOT be appended to the table in that case.
Sometimes you want to make sure that you start fresh. If you want tooverwrite the table, you can pass in mode="overwrite" to the createTable function.
You can create a LanceDB table in JavaScript using an array of records as follows.
const_tbl=awaitdb.createTable("myTable",[{vector:[3.1,4.1],item:"foo",price:10.0},{vector:[5.9,26.5],item:"bar",price:20.0},],{mode:"overwrite"},);
This will infer the schema from the provided data. If you want to explicitly provide a schema, you can useapache-arrow
to declare a schema
constschema=newarrow.Schema([newarrow.Field("vector",newarrow.FixedSizeList(2,newarrow.Field("item",newarrow.Float32(),true),),),newarrow.Field("item",newarrow.Utf8(),true),newarrow.Field("price",newarrow.Float32(),true),]);constdata=[{vector:[3.1,4.1],item:"foo",price:10.0},{vector:[5.9,26.5],item:"bar",price:20.0},];consttbl=awaitdb.createTable("myTable",data,{schema,});
Note
createTable
supports an optionalexistsOk
parameter. When set to trueand the table exists, then it simply opens the existing table. The data youpassed in will NOT be appended to the table in that case.
Sometimes you want to make sure that you start fresh. If you want tooverwrite the table, you can pass in mode: "overwrite" to the createTable function.
consttbl=awaitdb.createTable("myTable",[{vector:[3.1,4.1],item:"foo",price:10.0},{vector:[5.9,26.5],item:"bar",price:20.0},],{writeMode:lancedb.WriteMode.Overwrite},);
This will infer the schema from the provided data. If you want to explicitly provide a schema, you can use apache-arrow to declare a schema
constschema=newarrow.Schema([newarrow.Field("vector",newarrow.FixedSizeList(2,newarrow.Field("item",newarrow.Float32(),true),),),newarrow.Field("item",newarrow.Utf8(),true),newarrow.Field("price",newarrow.Float32(),true),]);constdata=[{vector:[3.1,4.1],item:"foo",price:10.0},{vector:[5.9,26.5],item:"bar",price:20.0},];consttbl=awaitdb.createTable({name:"myTableWithSchema",data,schema,});
Warning
existsOk
is not available invectordb
If the table already exists, vectordb will raise an error by default.You can usewriteMode: WriteMode.Overwrite
to overwrite the table.But this will delete the existing table and create a new one with the same name.
Sometimes you want to make sure that you start fresh.
If you want to overwrite the table, you can pass inwriteMode: lancedb.WriteMode.Overwrite
to the createTable function.
From a Pandas DataFrame
Note
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
Thevector
column needs to be aVector (defined aspyarrow.FixedSizeList) type.
From a Polars DataFrame
LanceDB supportsPolars, a modern, fast DataFrame librarywritten in Rust. Just like in Pandas, the Polars integration is enabled by PyArrowunder the hood. A deeper integration between LanceDB Tables and Polars DataFramesis on the way.
From an Arrow Table
You can also create LanceDB tables directly from Arrow tables.LanceDB supports float16 data type!
importpyarrowaspaimportnumpyasnpdim=16total=2schema=pa.schema([pa.field("vector",pa.list_(pa.float16(),dim)),pa.field("text",pa.string())])data=pa.Table.from_arrays([pa.array([np.random.randn(dim).astype(np.float16)for_inrange(total)],pa.list_(pa.float16(),dim),),pa.array(["foo","bar"]),],["vector","text"],)tbl=db.create_table("f16_tbl",data,schema=schema)
importpolarsasplimportnumpyasnpdim=16total=2schema=pa.schema([pa.field("vector",pa.list_(pa.float16(),dim)),pa.field("text",pa.string())])data=pa.Table.from_arrays([pa.array([np.random.randn(dim).astype(np.float16)for_inrange(total)],pa.list_(pa.float16(),dim),),pa.array(["foo","bar"]),],["vector","text"],)async_tbl=awaitasync_db.create_table("f16_tbl_async",data,schema=schema)
constdb=awaitlancedb.connect(databaseDir);constdim=16;consttotal=10;constf16Schema=newSchema([newField("id",newInt32()),newField("vector",newFixedSizeList(dim,newField("item",newFloat16(),true)),false,),]);constdata=lancedb.makeArrowTable(Array.from(Array(total),(_,i)=>({id:i,vector:Array.from(Array(dim),Math.random),})),{schema:f16Schema},);const_table=awaitdb.createTable("f16_tbl",data);
constdim=16;consttotal=10;constschema=newSchema([newField("id",newInt32()),newField("vector",newFixedSizeList(dim,newField("item",newFloat16(),true)),false,),]);constdata=lancedb.makeArrowTable(Array.from(Array(total),(_,i)=>({id:i,vector:Array.from(Array(dim),Math.random),})),{schema},);consttable=awaitdb.createTable("f16_tbl",data);
From Pydantic Models
When you create an empty table without data, you must specify the table schema.LanceDB supports creating tables by specifying a PyArrow schema or a specializedPydantic model calledLanceModel
.
For example, the following Content model specifies a table with 5 columns:movie_id
,vector
,genres
,title
, andimdb_id
. When you create a table, you canpass the class as the value of theschema
parameter tocreate_table
.Thevector
column is aVector
type, which is a specialized Pydantic type thatcan be configured with the vector dimensions. It is also important to note thatLanceDB only understands subclasses oflancedb.pydantic.LanceModel
(which itself derives frompydantic.BaseModel
).
fromlancedb.pydanticimportVector,LanceModelimportpyarrowaspaclassContent(LanceModel):movie_id:intvector:Vector(128)genres:strtitle:strimdb_id:int@propertydefimdb_url(self)->str:returnf"https://www.imdb.com/title/tt{self.imdb_id}"async_tbl=awaitasync_db.create_table("movielens_small_async",schema=Content)
Nested schemas
Sometimes your data model may contain nested objects.For example, you may want to store the document stringand the document source name as a nested Document object:
This can be used as the type of a LanceDB table column:
This creates a struct column called "document" that has two subfieldscalled "content" and "source":
In [28]: tbl.schemaOut[28]:id: string not nullvector: fixed_size_list<item: float>[1536] not null child 0, item: floatdocument: struct<content: string not null, source: string not null> not null child 0, content: string not null child 1, source: string not null
Validators
Note that neither Pydantic nor PyArrow automatically validates that input datais of the correct timezone, but this is easy to add as a custom field validator:
fromdatetimeimportdatetimefromzoneinfoimportZoneInfofromlancedb.pydanticimportLanceModelfrompydanticimportField,field_validator,ValidationError,ValidationInfotzname="America/New_York"tz=ZoneInfo(tzname)classTestModel(LanceModel):dt_with_tz:datetime=Field(json_schema_extra={"tz":tzname})@field_validator('dt_with_tz')@classmethoddeftz_must_match(cls,dt:datetime)->datetime:assertdt.tzinfo==tzreturndtok=TestModel(dt_with_tz=datetime.now(tz))try:TestModel(dt_with_tz=datetime.now(ZoneInfo("Asia/Shanghai")))assert0==1,"this should raise ValidationError"exceptValidationError:print("A ValidationError was raised.")pass
When you run this code it should print "A ValidationError was raised."
Pydantic custom types
LanceDB does NOT yet support converting pydantic custom types. If this is something you need,please file a feature request on theLanceDB Github repo.
Using Iterators / Writing Large Datasets
It is recommended to use iterators to add large datasets in batches when creating your table in one go. This does not create multiple versions of your dataset unlike manually adding batches usingtable.add()
LanceDB additionally supports PyArrow'sRecordBatch
Iterators or other generators producing supported data types.
Here's an example using usingRecordBatch
iterator for creating tables.
importpyarrowaspadefmake_batches():foriinrange(5):yieldpa.RecordBatch.from_arrays([pa.array([[3.1,4.1,5.1,6.1],[5.9,26.5,4.7,32.8]],pa.list_(pa.float32(),4),),pa.array(["foo","bar"]),pa.array([10.0,20.0]),],["vector","item","price"],)schema=pa.schema([pa.field("vector",pa.list_(pa.float32(),4)),pa.field("item",pa.utf8()),pa.field("price",pa.float32()),])db.create_table("batched_tale",make_batches(),schema=schema)
importpyarrowaspadefmake_batches():foriinrange(5):yieldpa.RecordBatch.from_arrays([pa.array([[3.1,4.1,5.1,6.1],[5.9,26.5,4.7,32.8]],pa.list_(pa.float32(),4),),pa.array(["foo","bar"]),pa.array([10.0,20.0]),],["vector","item","price"],)schema=pa.schema([pa.field("vector",pa.list_(pa.float32(),4)),pa.field("item",pa.utf8()),pa.field("price",pa.float32()),])awaitasync_db.create_table("batched_table",make_batches(),schema=schema)
You can also use iterators of other types like Pandas DataFrame or Pylists directly in the above example.
Open existing tables
If you forget the name of your table, you can always get a listing of all table names.
Then, you can open any existing tables.
Creating empty table
You can create an empty table for scenarios where you want to add data to the table later. An example would be when you want to collect data from a stream/external file and then add it to a table in batches.
An empty table can be initialized via a PyArrow schema.
Alternatively, you can also use Pydantic to specify the schema for the empty table. Note that we do notdirectly importpydantic
but instead uselancedb.pydantic
which is a subclass ofpydantic.BaseModel
that has been extended to support LanceDB specific types likeVector
.
Once the empty table has been created, you can add data to it via the various methods listed in theAdding to a table section.
Adding to a table
After a table has been created, you can always add more data to it using theadd
method
You can add any of the valid data structures accepted by LanceDB table, i.e,dict
,list[dict]
,pd.DataFrame
, orIterator[pa.RecordBatch]
. Below are some examples.
Add a Pandas DataFrame
Add a Polars DataFrame
Add an Iterator
You can also add a large dataset batch in one go using Iterator of any supported data types.
Add a PyArrow table
If you have data coming in as a PyArrow table, you can add it directly to the LanceDB table.
Add a Pydantic Model
Assuming that a table has been created with the correct schema as shownabove, you can add data items that are valid Pydantic models to the table.
Ingesting Pydantic models with LanceDB embedding API
When using LanceDB's embedding API, you can add Pydantic models directly to the table. LanceDB will automatically convert thevector
field to a vector before adding it to the table. You need to specify the default value ofvector
field as None to allow LanceDB to automatically vectorize the data.
importlancedbfromlancedb.pydanticimportVector,LanceModelfromlancedb.embeddingsimportget_registryembed_fcn=get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")classSchema(LanceModel):text:str=embed_fcn.SourceField()vector:Vector(embed_fcn.ndims())=embed_fcn.VectorField(default=None)tbl=db.create_table("my_table_with_embedding",schema=Schema,mode="overwrite")models=[Schema(text="hello"),Schema(text="world")]tbl.add(models)
importlancedbfromlancedb.pydanticimportVector,LanceModelfromlancedb.embeddingsimportget_registryembed_fcn=get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")classSchema(LanceModel):text:str=embed_fcn.SourceField()vector:Vector(embed_fcn.ndims())=embed_fcn.VectorField(default=None)async_tbl=awaitasync_db.create_table("my_table_async_with_embedding",schema=Schema,mode="overwrite")models=[Schema(text="hello"),Schema(text="world")]awaitasync_tbl.add(models)
Upserting into a table
Upserting lets you insert new rows or update existing rows in a table. To upsertin LanceDB, use the merge insert API.
table=db.create_table("users",[{"id":0,"name":"Alice"},{"id":1,"name":"Bob"},],)new_users=[{"id":1,"name":"Bobby"},{"id":2,"name":"Charlie"},]res=(table.merge_insert("id").when_matched_update_all().when_not_matched_insert_all().execute(new_users))table.count_rows()# 3res# {'num_inserted_rows': 1, 'num_updated_rows': 1, 'num_deleted_rows': 0}
table=awaitdb.create_table("users",[{"id":0,"name":"Alice"},{"id":1,"name":"Bob"},],)new_users=[{"id":1,"name":"Bobby"},{"id":2,"name":"Charlie"},]res=await(table.merge_insert("id").when_matched_update_all().when_not_matched_insert_all().execute(new_users))awaittable.count_rows()# 3res# MergeResult(version=2, num_updated_rows=1,# num_inserted_rows=1, num_deleted_rows=0)
consttable=awaitdb.createTable("users",[{id:0,name:"Alice"},{id:1,name:"Bob"},]);constnewUsers=[{id:1,name:"Bobby"},{id:2,name:"Charlie"},];awaittable.mergeInsert("id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute(newUsers);awaittable.countRows();// 3
Read more in the guide onmerge insert.
Deleting from a table
Use thedelete()
method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. This can delete any number of rows that match the filter.
Deleting row with specific column value
data=[{"x":1,"vector":[1,2]},{"x":2,"vector":[3,4]},{"x":3,"vector":[5,6]},]async_db=awaitlancedb.connect_async(uri)async_tbl=awaitasync_db.create_table("delete_row_async",data)awaitasync_tbl.to_pandas()# x vector# 0 1 [1.0, 2.0]# 1 2 [3.0, 4.0]# 2 3 [5.0, 6.0]awaitasync_tbl.delete("x = 2")awaitasync_tbl.to_pandas()# x vector# 0 1 [1.0, 2.0]# 1 3 [5.0, 6.0]
Delete from a list of values
Updating a table
This can be used to update zero to all rows depending on how many rows match the where clause. The update queries follow the form of a SQL UPDATE statement. Thewhere
parameter is a SQL filter that matches on the metadata columns. Thevalues
orvalues_sql
parameters are used to provide the new values for the columns.
Parameter | Type | Description |
---|---|---|
where | str | The SQL where clause to use when updating rows. For example,'x = 2' or'x IN (1, 2, 3)' . The filter must not be empty, or it will error. |
values | dict | The values to update. The keys are the column names and the values are the values to set. |
values_sql | dict | The values to update. The keys are the column names and the values are the SQL expressions to set. For example,{'x': 'x + 1'} will increment the value of thex column by 1. |
SQL syntax
SeeSQL filters for more information on the supported SQL syntax.
Warning
Updating nested columns is not yet supported.
API Reference:lancedb.table.Table.update
importlancedbimportpandasaspd# Create a table from a pandas DataFramedata=pd.DataFrame({"x":[1,2,3],"vector":[[1,2],[3,4],[5,6]]})tbl=db.create_table("test_table",data,mode="overwrite")# Update the table where x = 2tbl.update(where="x = 2",values={"vector":[10,10]})# Get the updated table as a pandas DataFramedf=tbl.to_pandas()print(df)
importlancedbimportpandasaspd# Create a table from a pandas DataFramedata=pd.DataFrame({"x":[1,2,3],"vector":[[1,2],[3,4],[5,6]]})async_tbl=awaitasync_db.create_table("update_table_async",data)# Update the table where x = 2awaitasync_tbl.update({"vector":[10,10]},where="x = 2")# Get the updated table as a pandas DataFramedf=awaitasync_tbl.to_pandas()# Print the DataFrameprint(df)
Output
API Reference:lancedb.Table.update
API Reference:vectordb.Table.update
Updating using a sql query
Thevalues
parameter is used to provide the new values for the columns as literal values. You can also use thevalues_sql
/valuesSql
parameter to provide SQL expressions for the new values. For example, you can usevalues_sql="x + 1"
to increment the value of thex
column by 1.
Note
When rows are updated, they are moved out of the index. The row will still show up in ANN queries, but the query will not be as fast as it would be if the row was in the index. If you update a large proportion of rows, consider rebuilding the index afterwards.
Drop a table
Use thedrop_table()
method on the database to remove a table.
This permanently removes the table and is not recoverable, unlike deleting rows. By default, if the table does not exist an exception is raised. To suppress this, you can pass inignore_missing=True
.
Changing schemas
While tables must have a schema specified when they are created, you canchange the schema over time. There's three methods to alter the schema ofa table:
add_columns
: Add new columns to the tablealter_columns
: Alter the name, nullability, or data type of a columndrop_columns
: Drop columns from the table
Adding new columns
You can add new columns to the table with theadd_columns
method. New columnsare filled with values based on a SQL expression. For example, you can add a newcolumny
to the table, fill it with the value ofx * 2
and set the expecteddata type for it.
API Reference:lancedb.table.Table.add_columns
If you want to fill it with null, you can usecast(NULL as <data_type>)
asthe SQL expression to fill the column with nulls, while controlling the datatype of the column. Available data types are base on theDataFusion data types.You can use any of the SQL types, such asBIGINT
:
Using Arrow data types and thearrow_typeof
function is not yet supported.
Altering existing columns
You can alter the name, nullability, or data type of a column with thealter_columns
method.
Changing the name or nullability of a column just updates the metadata. Becauseof this, it's a fast operation. Changing the data type of a column requiresrewriting the column, which can be a heavy operation.
API Reference:lancedb.table.Table.alter_columns
Dropping columns
You can drop columns from the table with thedrop_columns
method. This willwill remove the column from the schema.
API Reference:lancedb.table.Table.drop_columns
Handling bad vectors
In LanceDB Python, you can use theon_bad_vectors
parameter to choose howinvalid vector values are handled. Invalid vectors are vectors that are not validbecause:
- They are the wrong dimension
- They contain NaN values
- They are null but are on a non-nullable field
By default, LanceDB will raise an error if it encounters a bad vector. You canalso choose one of the following options:
drop
: Ignore rows with bad vectorsfill
: Replace bad values (NaNs) or missing values (too few dimensions) with the fill value specified in thefill_value
parameter. An input like[1.0, NaN, 3.0]
will be replaced with[1.0, 0.0, 3.0]
iffill_value=0.0
.null
: Replace bad vectors with null (only works if the column is nullable). A bad vector[1.0, NaN, 3.0]
will be replaced withnull
if the column is nullable. If the vector column is non-nullable, then bad vectors will cause an error
Consistency
In LanceDB OSS, users can set theread_consistency_interval
parameter on connections to achieve different levels of read consistency. This parameter determines how frequently the database synchronizes with the underlying storage system to check for updates made by other processes. If another process updates a table, the database will not see the changes until the next synchronization.
There are three possible settings forread_consistency_interval
:
- Unset (default): The database does not check for updates to tables made by other processes. This provides the best query performance, but means that clients may not see the most up-to-date data. This setting is suitable for applications where the data does not change during the lifetime of the table reference.
- Zero seconds (Strong consistency): The database checks for updates on every read. This provides the strongest consistency guarantees, ensuring that all clients see the latest committed data. However, it has the most overhead. This setting is suitable when consistency matters more than having high QPS.
- Custom interval (Eventual consistency): The database checks for updates at a custom interval, such as every 5 seconds. This provides eventual consistency, allowing for some lag between write and read operations. Performance wise, this is a middle ground between strong consistency and no consistency check. This setting is suitable for applications where immediate consistency is not critical, but clients should see updated data eventually.
Consistency in LanceDB Cloud
This is only tune-able in LanceDB OSS. In LanceDB Cloud, readers are always eventually consistent.
To set strong consistency, usetimedelta(0)
:
For eventual consistency, use a customtimedelta
:
By default, aTable
will never check for updates from other writers. To manually check for updates you can usecheckout_latest
:
What's next?
Learn the best practices on creating an ANN index and getting the most out of it.