Pandas and PyArrow
Because Lance is built on top of Apache Arrow, LanceDB is tightly integrated with the Python data ecosystem, including Pandas and PyArrow. The sequence of steps in a typical workflow is shown below.
Create dataset
First, we need to connect to a LanceDB database.
We can load a Pandas DataFrame into LanceDB directly.
Similar to the pyarrow.dataset.write_dataset() function, LanceDB's db.create_table() accepts data in a variety of forms.
If you have a dataset that is larger than memory, you can create a table from an Iterator[pyarrow.RecordBatch] to lazily load the data:
    from typing import Iterable

    import pyarrow as pa


    def make_batches() -> Iterable[pa.RecordBatch]:
        # Yield five batches of two rows each; batches are produced lazily.
        for i in range(5):
            yield pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]]),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            )


    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )

    # Sync API
    table = db.create_table("iterable_table", data=make_batches(), schema=schema)

    # Async API (run inside a coroutine)
    await async_db.create_table(
        "iterable_table_async", data=make_batches(), schema=schema
    )
Detailed instructions for creating a LanceDB dataset are in the Getting Started and API sections.
Vector search
We can now perform similarity search via the LanceDB Python API.
If you have a simple filter, it's faster to provide a where clause to LanceDB's search method. For more complex filters or aggregations, you can always resort to using the underlying DataFrame methods after performing a search.
    # Sync API
    # The assertions assume the table holds exactly one "foo" row (price 10.0)
    # and one "bar" row (price 20.0).

    # Apply the filter via LanceDB
    results = table.search([100, 100]).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = table.search([100, 100]).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Async API (run inside a coroutine)

    # Apply the filter via LanceDB
    results = await (await async_tbl.search([100, 100])).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = await (await async_tbl.search([100, 100])).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"