Pandas and PyArrow

Because Lance is built on top of Apache Arrow, LanceDB is tightly integrated with the Python data ecosystem, including Pandas and PyArrow. The sequence of steps in a typical workflow is shown below.

Create dataset

First, we need to connect to a LanceDB database.

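Sync API:

    import lancedb

    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

Async API:

    import lancedb

    # Top-level `await` needs an async context, such as a notebook
    # or the body of an `async def` function.
    uri = "data/sample-lancedb"
    async_db = await lancedb.connect_async(uri)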

We can load a Pandas DataFrame into LanceDB directly.

Sync API:

    import pandas as pd

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    table = db.create_table("pd_table", data=data)

Async API:

    import pandas as pd

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    await async_db.create_table("pd_table_async", data=data)

Similar to the pyarrow.dataset.write_dataset() function, LanceDB's db.create_table() accepts data in a variety of forms.

If you have a dataset that is larger than memory, you can create a table with an Iterator[pyarrow.RecordBatch] to lazily load the data:

Sync API:

    from typing import Iterable

    import pyarrow as pa

    def make_batches() -> Iterable[pa.RecordBatch]:
        for i in range(5):
            yield pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]]),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            )

    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )
    table = db.create_table("iterable_table", data=make_batches(), schema=schema)

Async API:

    from typing import Iterable

    import pyarrow as pa

    def make_batches() -> Iterable[pa.RecordBatch]:
        for i in range(5):
            yield pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]]),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            )

    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )
    await async_db.create_table(
        "iterable_table_async", data=make_batches(), schema=schema
    )

You will find detailed instructions for creating a LanceDB dataset in the Getting Started and API sections.

Vector search

We can now perform similarity search via the LanceDB Python API.

Sync API:

    # Open the table previously created.
    table = db.open_table("pd_table")

    query_vector = [100, 100]
    # Pandas DataFrame
    df = table.search(query_vector).limit(1).to_pandas()
    print(df)

Async API:

    # Open the table previously created.
    async_tbl = await async_db.open_table("pd_table_async")

    query_vector = [100, 100]
    # Pandas DataFrame
    df = await (await async_tbl.search(query_vector)).limit(1).to_pandas()
    print(df)

            vector item  price    _distance
    0  [5.9, 26.5]  bar   20.0  14257.05957
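The _distance value in the output is consistent with a squared Euclidean (L2) distance between the query vector and the stored vector, assuming that is the metric in effect for this table. The pure-Python sketch below recomputes it by hand for both rows; the helper squared_l2 is a hypothetical name, not part of LanceDB.

```python
query_vector = [100.0, 100.0]
stored = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}


def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


distances = {item: squared_l2(query_vector, vec) for item, vec in stored.items()}

# The row with the smallest distance is the nearest neighbor.
nearest = min(distances, key=distances.get)
print(nearest, round(distances[nearest], 2))
```

For "bar" this gives 94.1² + 73.5² ≈ 14257.06, which agrees with the printed 14257.05957 up to 32-bit floating-point rounding.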

If you have a simple filter, it's faster to provide a where clause to LanceDB's search method. For more complex filters or aggregations, you can always resort to using the underlying DataFrame methods after performing a search.

Sync API:

    # Apply the filter via LanceDB
    results = table.search([100, 100]).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = table.search([100, 100]).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

Async API:

    # Apply the filter via LanceDB
    results = await (await async_tbl.search([100, 100])).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = await (await async_tbl.search([100, 100])).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"
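As an illustration of the more complex filters and aggregations mentioned above, the sketch below works on a hand-built DataFrame that stands in for the result of a search. The rows and _distance values are assumptions chosen to mirror the sample data, and only standard pandas methods are used.

```python
import pandas as pd

# Stand-in for the DataFrame returned by a search followed by to_pandas().
results = pd.DataFrame(
    {
        "item": ["bar", "foo"],
        "price": [20.0, 10.0],
        "_distance": [14257.06, 18586.42],
    }
)

# Compound filter: a price threshold combined with a string predicate.
cheap_f_items = results[
    (results["price"] < 15) & results["item"].str.startswith("f")
]

# Aggregation: mean price and smallest distance across all results.
summary = results.agg({"price": "mean", "_distance": "min"})
print(cheap_f_items)
print(summary)
```

Filtering in pandas happens after the rows have been fetched, so it suits result sets that are already small.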
