
This documentation site is deprecated. Please visit our new documentation site at lancedb.com/docs for the latest information.

Pandas and PyArrow

Because Lance is built on top of Apache Arrow, LanceDB is tightly integrated with the Python data ecosystem, including Pandas and PyArrow. The sequence of steps in a typical workflow is shown below.

Create dataset

First, we need to connect to a LanceDB database.

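    # Synchronous API
    import lancedb

    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

    # Asynchronous API (the await must run inside a coroutine or an async-aware notebook)
    import lancedb

    uri = "data/sample-lancedb"
    async_db = await lancedb.connect_async(uri)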

We can load a Pandas DataFrame into LanceDB directly.

    # Synchronous API
    import pandas as pd

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    table = db.create_table("pd_table", data=data)

    # Asynchronous API
    import pandas as pd

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    await async_db.create_table("pd_table_async", data=data)

Similar to the pyarrow.dataset.write_dataset() function, LanceDB's db.create_table() accepts data in a variety of forms.

If you have a dataset that is larger than memory, you can create a table with an Iterator[pyarrow.RecordBatch] to lazily load the data:

    # Synchronous API
    from typing import Iterable

    import pyarrow as pa

    def make_batches() -> Iterable[pa.RecordBatch]:
        # Yield five identical two-row batches, one at a time.
        for i in range(5):
            yield pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]]),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            )

    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )
    table = db.create_table("iterable_table", data=make_batches(), schema=schema)

    # Asynchronous API (make_batches and schema are defined exactly as above)
    await async_db.create_table(
        "iterable_table_async", data=make_batches(), schema=schema
    )

You will find detailed instructions for creating a LanceDB dataset in the Getting Started and API sections.

Vector search

We can now perform similarity search via the LanceDB Python API.

    # Synchronous API
    # Open the table previously created.
    table = db.open_table("pd_table")

    query_vector = [100, 100]
    # Return the nearest neighbor as a Pandas DataFrame.
    df = table.search(query_vector).limit(1).to_pandas()
    print(df)

    # Asynchronous API
    # Open the table previously created.
    async_tbl = await async_db.open_table("pd_table_async")

    query_vector = [100, 100]
    # Return the nearest neighbor as a Pandas DataFrame.
    df = await (await async_tbl.search(query_vector)).limit(1).to_pandas()
    print(df)

Output:

            vector item  price    _distance
    0  [5.9, 26.5]  bar   20.0  14257.05957
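The `_distance` value in the output is consistent with a squared Euclidean (L2) distance between the query and the stored vector. The plain-Python sketch below reproduces it by hand, assuming that metric. The helper `squared_l2` is a hypothetical name and not part of LanceDB.

```python
query_vector = [100.0, 100.0]
stored = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

distances = {item: squared_l2(query_vector, vec) for item, vec in stored.items()}
nearest = min(distances, key=distances.get)

print(nearest, round(distances[nearest], 2))
```

Small differences in the last digits relative to the output above are expected if the stored vectors use 32-bit floats.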

If you have a simple filter, it's faster to provide a where clause to LanceDB's search method. For more complex filters or aggregations, you can always fall back to the underlying DataFrame methods after performing a search.

    # Synchronous API
    # Apply the filter via LanceDB
    results = table.search([100, 100]).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = table.search([100, 100]).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Asynchronous API
    # Apply the filter via LanceDB
    results = await (await async_tbl.search([100, 100])).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = await (await async_tbl.search([100, 100])).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"
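As an illustration of the "more complex filters or aggregations" case, the sketch below post-processes a search result with ordinary Pandas operations. The DataFrame is a hand-built stand-in with the same columns as the output of `to_pandas()`, not real query output, so the example runs without a database.

```python
import pandas as pd

# Hand-built stand-in for a search result; values mirror the sample table.
df = pd.DataFrame(
    {
        "vector": [[5.9, 26.5], [3.1, 4.1]],
        "item": ["bar", "foo"],
        "price": [20.0, 10.0],
        "_distance": [14257.06, 18586.42],
    }
)

# Filter with a boolean mask, as in the Pandas branch of the example above.
cheap = df[df["price"] < 15]

# Aggregations are plain DataFrame methods as well.
mean_price = df["price"].mean()
closest_item = df.sort_values("_distance").iloc[0]["item"]
```

Note that a Pandas filter only sees the rows the search actually returned, so a where clause is the safer choice when the filter should influence which rows are retrieved.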
