
Python API Reference

This section contains the API reference for the Python API. There is a synchronous and an asynchronous API client.

The general flow of using the API is:

  1. Use lancedb.connect or lancedb.connect_async to connect to a database.
  2. Use the returned lancedb.DBConnection or lancedb.AsyncConnection to create or open tables.
  3. Use the returned lancedb.table.Table or lancedb.AsyncTable to query or modify tables.
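
A minimal sketch of this three-step flow, assuming a local database path and illustrative data:

import lancedb

db = lancedb.connect("~/.lancedb")                                           # 1. connect
table = db.create_table("vectors", data=[{"vector": [0.1, 0.2], "id": 1}])  # 2. create or open a table
results = table.search([0.1, 0.2]).limit(1).to_pandas()                     # 3. query it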

Installation

pip install lancedb

The following methods describe the synchronous API client. There is also an asynchronous API client.
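
A minimal sketch of the asynchronous client, assuming a local database path (the awaited calls mirror the synchronous ones):

import asyncio
import lancedb

async def main():
    # connect_async returns an AsyncConnection; its methods are awaited
    db = await lancedb.connect_async("~/.lancedb")
    print(await db.table_names())

asyncio.run(main())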

Connections (Synchronous)

lancedb.connect

connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, client_config: Union[ClientConfig, Dict[str, Any], None] = None, storage_options: Optional[Dict[str, str]] = None, **kwargs: Any) -> DBConnection

Connect to a LanceDB database.

Parameters:

  • uri (URI) –

    The uri of the database.

  • api_key (Optional[str], default: None) –

    If present, connect to LanceDB Cloud. Otherwise, connect to a database on the file system or cloud storage. Can be set via the environment variable LANCEDB_API_KEY.

  • region (str, default:'us-east-1') –

    The region to use for LanceDB Cloud.

  • host_override (Optional[str], default:None) –

    The override url for LanceDB Cloud.

  • read_consistency_interval (Optional[timedelta], default: None) –

    (For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.

  • client_config (Union[ClientConfig, Dict[str, Any], None], default: None) –

    Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.

  • storage_options (Optional[Dict[str, str]], default: None) –

    Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/

Examples:

For a local directory, provide a path for the database:

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")

For object storage, use a URI prefix:

>>>db=lancedb.connect("s3://my-bucket/lancedb",...storage_options={"aws_access_key_id":"***"})

Connect to LanceDB cloud:

>>>db=lancedb.connect("db://my_database",api_key="ldb_...",...client_config={"retry_config":{"retries":5}})

Returns:

  • DBConnection –

    A connection to a LanceDB database.

Source code in lancedb/__init__.py
def connect(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    client_config: Union[ClientConfig, Dict[str, Any], None] = None,
    storage_options: Optional[Dict[str, str]] = None,
    **kwargs: Any,
) -> DBConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If presented, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    client_config: ClientConfig or dict, optional
        Configuration options for the LanceDB Cloud HTTP client. If a dict, then
        the keys are the attributes of the ClientConfig class. If None, then the
        default configuration is used.
    storage_options: dict, optional
        Additional options for the storage backend. See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Examples
    --------
    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb",
    ...                      storage_options={"aws_access_key_id": "***"})

    Connect to LanceDB cloud:

    >>> db = lancedb.connect("db://my_database", api_key="ldb_...",
    ...                      client_config={"retry_config": {"retries": 5}})

    Returns
    -------
    conn : DBConnection
        A connection to a LanceDB database.
    """
    if isinstance(uri, str) and uri.startswith("db://"):
        if api_key is None:
            api_key = os.environ.get("LANCEDB_API_KEY")
        if api_key is None:
            raise ValueError(f"api_key is required to connected LanceDB cloud: {uri}")
        if isinstance(request_thread_pool, int):
            request_thread_pool = ThreadPoolExecutor(request_thread_pool)
        return RemoteDBConnection(
            uri,
            api_key,
            region,
            host_override,
            # TODO: remove this (deprecation warning downstream)
            request_thread_pool=request_thread_pool,
            client_config=client_config,
            storage_options=storage_options,
            **kwargs,
        )

    if kwargs:
        raise ValueError(f"Unknown keyword arguments: {kwargs}")
    return LanceDBConnection(
        uri,
        read_consistency_interval=read_consistency_interval,
        storage_options=storage_options,
    )

lancedb.db.DBConnection

Bases: EnforceOverrides

An active LanceDB connection interface.

Source code in lancedb/db.py
classDBConnection(EnforceOverrides):"""An active LanceDB connection interface."""@abstractmethoddeftable_names(self,page_token:Optional[str]=None,limit:int=10)->Iterable[str]:"""List all tables in this database, in sorted order        Parameters        ----------        page_token: str, optional            The token to use for pagination. If not present, start from the beginning.            Typically, this token is last table name from the previous page.            Only supported by LanceDb Cloud.        limit: int, default 10            The size of the page to return.            Only supported by LanceDb Cloud.        Returns        -------        Iterable of str        """pass@abstractmethoddefcreate_table(self,name:str,data:Optional[DATA]=None,schema:Optional[Union[pa.Schema,LanceModel]]=None,mode:str="create",exist_ok:bool=False,on_bad_vectors:str="error",fill_value:float=0.0,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None,*,storage_options:Optional[Dict[str,str]]=None,data_storage_version:Optional[str]=None,enable_v2_manifest_paths:Optional[bool]=None,)->Table:"""Create a [Table][lancedb.table.Table] in the database.        Parameters        ----------        name: str            The name of the table.        data: The data to initialize the table, *optional*            User must provide at least one of `data` or `schema`.            Acceptable types are:            - list-of-dict            - pandas.DataFrame            - pyarrow.Table or pyarrow.RecordBatch        schema: The schema of the table, *optional*            Acceptable types are:            - pyarrow.Schema            - [LanceModel][lancedb.pydantic.LanceModel]        mode: str; default "create"            The mode to use when creating the table.            Can be either "create" or "overwrite".            By default, if the table already exists, an exception is raised.            If you want to overwrite the table, use mode="overwrite".        exist_ok: bool, default False            If a table by the same name already exists, then raise an exception            if exist_ok=False. If exist_ok=True, then open the existing table;            it will not add the provided data but will validate against any            schema that's specified.        on_bad_vectors: str, default "error"            What to do if any of the vectors are not the same size or contains NaNs.            One of "error", "drop", "fill".        fill_value: float            The value to use when filling vectors. Only used if on_bad_vectors="fill".        storage_options: dict, optional            Additional options for the storage backend. Options already set on the            connection will be inherited by the table, but can be overridden here.            See available options at            <https://lancedb.github.io/lancedb/guides/storage/>        data_storage_version: optional, str, default "stable"            Deprecated.  Set `storage_options` when connecting to the database and set            `new_table_data_storage_version` in the options.        enable_v2_manifest_paths: optional, bool, default False            Deprecated.  Set `storage_options` when connecting to the database and set            `new_table_enable_v2_manifest_paths` in the options.        Returns        -------        LanceTable            A reference to the newly created table.        !!! note            The vector index won't be created by default.            To create the index, call the `create_index` method on the table.        
Examples        --------        Can create with list of tuples or dictionaries:        >>> import lancedb        >>> db = lancedb.connect("./.lancedb")        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]        >>> db.create_table("my_table", data)        LanceTable(name='my_table', version=1, ...)        >>> db["my_table"].head()        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: double        long: double        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        You can also pass a pandas DataFrame:        >>> import pandas as pd        >>> data = pd.DataFrame({        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],        ...    "lat": [45.5, 40.1],        ...    "long": [-122.7, -74.1]        ... })        >>> db.create_table("table2", data)        LanceTable(name='table2', version=1, ...)        >>> db["table2"].head()        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: double        long: double        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        Data is converted to Arrow before being written to disk. For maximum        control over how data is saved, either provide the PyArrow schema to        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.        >>> import pyarrow as pa        >>> custom_schema = pa.schema([        ...   pa.field("vector", pa.list_(pa.float32(), 2)),        ...   pa.field("lat", pa.float32()),        ...   pa.field("long", pa.float32())        ... ])        >>> db.create_table("table3", data, schema = custom_schema)        LanceTable(name='table3', version=1, ...)        >>> db["table3"].head()        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: float        long: float        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:        >>> import pyarrow as pa        >>> def make_batches():        ...     for i in range(5):        ...         yield pa.RecordBatch.from_arrays(        ...             [        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],        ...                     pa.list_(pa.float32(), 2)),        ...                 pa.array(["foo", "bar"]),        ...                 pa.array([10.0, 20.0]),        ...             ],        ...             ["vector", "item", "price"],        ...         )        >>> schema=pa.schema([        ...     pa.field("vector", pa.list_(pa.float32(), 2)),        ...     pa.field("item", pa.utf8()),        ...     pa.field("price", pa.float32()),        ... ])        >>> db.create_table("table4", make_batches(), schema=schema)        LanceTable(name='table4', version=1, ...)        """raiseNotImplementedErrordef__getitem__(self,name:str)->LanceTable:returnself.open_table(name)defopen_table(self,name:str,*,storage_options:Optional[Dict[str,str]]=None,index_cache_size:Optional[int]=None,)->Table:"""Open a Lance Table in the database.        Parameters        ----------        name: str            The name of the table.        
index_cache_size: int, default 256            Set the size of the index cache, specified as a number of entries            The exact meaning of an "entry" will depend on the type of index:            * IVF - there is one entry for each IVF partition            * BTREE - there is one entry for the entire index            This cache applies to the entire opened table, across all indices.            Setting this value higher will increase performance on larger datasets            at the expense of more RAM        storage_options: dict, optional            Additional options for the storage backend. Options already set on the            connection will be inherited by the table, but can be overridden here.            See available options at            <https://lancedb.github.io/lancedb/guides/storage/>        Returns        -------        A LanceTable object representing the table.        """raiseNotImplementedErrordefdrop_table(self,name:str):"""Drop a table from the database.        Parameters        ----------        name: str            The name of the table.        """raiseNotImplementedErrordefrename_table(self,cur_name:str,new_name:str):"""Rename a table in the database.        Parameters        ----------        cur_name: str            The current name of the table.        new_name: str            The new name of the table.        """raiseNotImplementedErrordefdrop_database(self):"""        Drop database        This is the same thing as dropping all the tables        """raiseNotImplementedErrordefdrop_all_tables(self):"""        Drop all tables from the database        """raiseNotImplementedError@propertydefuri(self)->str:returnself._uri

table_names abstractmethod

table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str]

List all tables in this database, in sorted order

Parameters:

  • page_token (Optional[str], default: None) –

    The token to use for pagination. If not present, start from the beginning. Typically, this token is the last table name from the previous page. Only supported by LanceDB Cloud.

  • limit (int, default: 10) –

    The size of the page to return. Only supported by LanceDB Cloud.

Returns:

  • Iterable of str
Source code in lancedb/db.py
@abstractmethod
def table_names(
    self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    page_token: str, optional
        The token to use for pagination. If not present, start from the beginning.
        Typically, this token is last table name from the previous page.
        Only supported by LanceDb Cloud.
    limit: int, default 10
        The size of the page to return.
        Only supported by LanceDb Cloud.

    Returns
    -------
    Iterable of str
    """
    pass
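
On LanceDB Cloud, page_token and limit can be combined to walk the full table list; a hedged sketch (the database URI, API key, and page size are illustrative):

import lancedb

db = lancedb.connect("db://my_database", api_key="ldb_...")

names, token = [], None
while True:
    page = list(db.table_names(page_token=token, limit=10))
    if not page:
        break
    names.extend(page)
    token = page[-1]  # the next token is typically the last name of the previous page
print(names)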

create_table abstractmethod

create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None, *, storage_options: Optional[Dict[str, str]] = None, data_storage_version: Optional[str] = None, enable_v2_manifest_paths: Optional[bool] = None) -> Table

Create a Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • data (Optional[DATA], default: None) –

    User must provide at least one of data or schema. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • schema (Optional[Union[Schema, LanceModel]], default: None) –

    The schema of the table. Acceptable types are:

    • pyarrow.Schema

    • LanceModel

  • mode (str, default: 'create') –

    The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

  • exist_ok (bool, default: False) –

    If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

  • on_bad_vectors (str, default: 'error') –

    What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".

  • fill_value (float, default:0.0) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • storage_options (Optional[Dict[str, str]], default: None) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

  • data_storage_version (Optional[str], default: None) –

    Deprecated. Set storage_options when connecting to the database and set new_table_data_storage_version in the options.

  • enable_v2_manifest_paths (Optional[bool], default: None) –

    Deprecated. Set storage_options when connecting to the database and set new_table_enable_v2_manifest_paths in the options.

Returns:

  • LanceTable

    A reference to the newly created table.

Note

    The vector index won't be created by default. To create the index, call the create_index method on the table.

Examples:

You can create a table from a list of tuples or dictionaries:

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>data=[{"vector":[1.1,1.2],"lat":45.5,"long":-122.7},...{"vector":[0.2,1.8],"lat":40.1,"long":-74.1}]>>>db.create_table("my_table",data)LanceTable(name='my_table', version=1, ...)>>>db["my_table"].head()pyarrow.Tablevector: fixed_size_list<item: float>[2]  child 0, item: floatlat: doublelong: double----vector: [[[1.1,1.2],[0.2,1.8]]]lat: [[45.5,40.1]]long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(name='table2', version=1, ...)
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(name='table3', version=1, ...)
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create a table from an Iterable[pa.RecordBatch]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(name='table4', version=1, ...)
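
The mode and exist_ok parameters control what happens when a table with the same name already exists; a hedged sketch with illustrative data and table names:

import lancedb

db = lancedb.connect("./.lancedb")
data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7}]

db.create_table("demo", data)                    # raises if "demo" already exists
db.create_table("demo", data, mode="overwrite")  # replaces the existing table
db.create_table("demo", data, exist_ok=True)     # opens the existing table; data is not added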
Source code in lancedb/db.py
@abstractmethoddefcreate_table(self,name:str,data:Optional[DATA]=None,schema:Optional[Union[pa.Schema,LanceModel]]=None,mode:str="create",exist_ok:bool=False,on_bad_vectors:str="error",fill_value:float=0.0,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None,*,storage_options:Optional[Dict[str,str]]=None,data_storage_version:Optional[str]=None,enable_v2_manifest_paths:Optional[bool]=None,)->Table:"""Create a [Table][lancedb.table.Table] in the database.    Parameters    ----------    name: str        The name of the table.    data: The data to initialize the table, *optional*        User must provide at least one of `data` or `schema`.        Acceptable types are:        - list-of-dict        - pandas.DataFrame        - pyarrow.Table or pyarrow.RecordBatch    schema: The schema of the table, *optional*        Acceptable types are:        - pyarrow.Schema        - [LanceModel][lancedb.pydantic.LanceModel]    mode: str; default "create"        The mode to use when creating the table.        Can be either "create" or "overwrite".        By default, if the table already exists, an exception is raised.        If you want to overwrite the table, use mode="overwrite".    exist_ok: bool, default False        If a table by the same name already exists, then raise an exception        if exist_ok=False. If exist_ok=True, then open the existing table;        it will not add the provided data but will validate against any        schema that's specified.    on_bad_vectors: str, default "error"        What to do if any of the vectors are not the same size or contains NaNs.        One of "error", "drop", "fill".    fill_value: float        The value to use when filling vectors. Only used if on_bad_vectors="fill".    storage_options: dict, optional        Additional options for the storage backend. Options already set on the        connection will be inherited by the table, but can be overridden here.        See available options at        <https://lancedb.github.io/lancedb/guides/storage/>    data_storage_version: optional, str, default "stable"        Deprecated.  Set `storage_options` when connecting to the database and set        `new_table_data_storage_version` in the options.    enable_v2_manifest_paths: optional, bool, default False        Deprecated.  Set `storage_options` when connecting to the database and set        `new_table_enable_v2_manifest_paths` in the options.    Returns    -------    LanceTable        A reference to the newly created table.    !!! note        The vector index won't be created by default.        To create the index, call the `create_index` method on the table.    Examples    --------    Can create with list of tuples or dictionaries:    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},    ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]    >>> db.create_table("my_table", data)    LanceTable(name='my_table', version=1, ...)    >>> db["my_table"].head()    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: double    long: double    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    You can also pass a pandas DataFrame:    >>> import pandas as pd    >>> data = pd.DataFrame({    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],    ...    "lat": [45.5, 40.1],    ...    "long": [-122.7, -74.1]    ... })    >>> db.create_table("table2", data)    LanceTable(name='table2', version=1, ...)    
>>> db["table2"].head()    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: double    long: double    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    Data is converted to Arrow before being written to disk. For maximum    control over how data is saved, either provide the PyArrow schema to    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.    >>> import pyarrow as pa    >>> custom_schema = pa.schema([    ...   pa.field("vector", pa.list_(pa.float32(), 2)),    ...   pa.field("lat", pa.float32()),    ...   pa.field("long", pa.float32())    ... ])    >>> db.create_table("table3", data, schema = custom_schema)    LanceTable(name='table3', version=1, ...)    >>> db["table3"].head()    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: float    long: float    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:    >>> import pyarrow as pa    >>> def make_batches():    ...     for i in range(5):    ...         yield pa.RecordBatch.from_arrays(    ...             [    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],    ...                     pa.list_(pa.float32(), 2)),    ...                 pa.array(["foo", "bar"]),    ...                 pa.array([10.0, 20.0]),    ...             ],    ...             ["vector", "item", "price"],    ...         )    >>> schema=pa.schema([    ...     pa.field("vector", pa.list_(pa.float32(), 2)),    ...     pa.field("item", pa.utf8()),    ...     pa.field("price", pa.float32()),    ... ])    >>> db.create_table("table4", make_batches(), schema=schema)    LanceTable(name='table4', version=1, ...)    """raiseNotImplementedError

open_table

open_table(name: str, *, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> Table

Open a Lance Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • index_cache_size (Optional[int], default:None) –

    Set the size of the index cache, specified as a number of entries

    The exact meaning of an "entry" will depend on the type of index:* IVF - there is one entry for each IVF partition* BTREE - there is one entry for the entire index

    This cache applies to the entire opened table, across all indices.Setting this value higher will increase performance on larger datasetsat the expense of more RAM

  • storage_options (Optional[Dict[str, str]], default: None) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

Returns:

  • A LanceTable object representing the table.
Source code in lancedb/db.py
def open_table(
    self,
    name: str,
    *,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> Table:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Returns
    -------
    A LanceTable object representing the table.
    """
    raise NotImplementedError
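
A hedged sketch of opening an existing table with a larger index cache (the table name and cache size are illustrative):

import lancedb

db = lancedb.connect("./.lancedb")
tbl = db.open_table("my_table", index_cache_size=512)  # more cache entries, more RAM
print(tbl.schema)        # the table's Arrow schema
print(tbl.count_rows())  # number of rows currently in the table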

drop_table

drop_table(name: str)

Drop a table from the database.

Parameters:

  • name (str) –

    The name of the table.

Source code in lancedb/db.py
def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    raise NotImplementedError

rename_table

rename_table(cur_name: str, new_name: str)

Rename a table in the database.

Parameters:

  • cur_name (str) –

    The current name of the table.

  • new_name (str) –

    The new name of the table.

Source code in lancedb/db.py
def rename_table(self, cur_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    cur_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    raise NotImplementedError

drop_database

drop_database()

Drop database. This is the same thing as dropping all the tables.

Source code in lancedb/db.py
def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    raise NotImplementedError

drop_all_tables

drop_all_tables()

Drop all tables from the database

Source code in lancedb/db.py
def drop_all_tables(self):
    """
    Drop all tables from the database
    """
    raise NotImplementedError
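
A hedged sketch tying the table-management calls together (table names are placeholders, and support for individual operations can vary by backend):

import lancedb

db = lancedb.connect("./.lancedb")
print(db.table_names())                 # list what exists
db.rename_table("my_table", "events")   # rename (support may vary by backend)
db.drop_table("table2")                 # remove a single table
db.drop_all_tables()                    # remove everything in the database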

Tables (Synchronous)

lancedb.table.Table

Bases:ABC

A Table is a collection of Records in a LanceDB Database.

Examples:

Create using DBConnection.create_table (more examples in that method's documentation).

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",data=[{"vector":[1.1,1.2],"b":2}])>>>table.head()pyarrow.Tablevector: fixed_size_list<item: float>[2]  child 0, item: floatb: int64----vector: [[[1.1,1.2]]]b: [[2]]

Can append new data with Table.add().

>>>table.add([{"vector":[0.5,1.3],"b":4}])AddResult(version=2)

Can query the table with Table.search.

>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See Table.create_index.
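
A hedged sketch of building a vector index with the parameters shown in the source below (the partition and sub-vector counts are illustrative and should be tuned to the dataset):

table.create_index(
    metric="cosine",            # "l2", "cosine", "dot", or "hamming"
    num_partitions=256,         # number of IVF partitions
    num_sub_vectors=96,         # number of PQ sub-vectors
    vector_column_name="vector",
)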

Source code in lancedb/table.py
classTable(ABC):"""    A Table is a collection of Records in a LanceDB Database.    Examples    --------    Create using [DBConnection.create_table][lancedb.DBConnection.create_table]    (more examples in that method's documentation).    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])    >>> table.head()    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    b: int64    ----    vector: [[[1.1,1.2]]]    b: [[2]]    Can append new data with [Table.add()][lancedb.table.Table.add].    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])    AddResult(version=2)    Can query the table with [Table.search][lancedb.table.Table.search].    >>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()       b      vector  _distance    0  4  [0.5, 1.3]       0.82    1  2  [1.1, 1.2]       1.13    Search queries are much faster when an index is created. See    [Table.create_index][lancedb.table.Table.create_index].    """@property@abstractmethoddefname(self)->str:"""The name of this Table"""raiseNotImplementedError@property@abstractmethoddefversion(self)->int:"""The version of this Table"""raiseNotImplementedError@property@abstractmethoddefschema(self)->pa.Schema:"""The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)        of this Table        """raiseNotImplementedError@property@abstractmethoddeftags(self)->Tags:"""Tag management for the table.        Similar to Git, tags are a way to add metadata to a specific version of the        table.        .. warning::            Tagged versions are exempted from the :py:meth:`cleanup_old_versions()`            process.            To remove a version that has been tagged, you must first            :py:meth:`~Tags.delete` the associated tag.        Examples        --------        .. code-block:: python            table = db.open_table("my_table")            table.tags.create("v2-prod-20250203", 10)            tags = table.tags.list()        """raiseNotImplementedErrordef__len__(self)->int:"""The number of rows in this Table"""returnself.count_rows(None)@property@abstractmethoddefembedding_functions(self)->Dict[str,EmbeddingFunctionConfig]:"""        Get a mapping from vector column name to it's configured embedding function.        """@abstractmethoddefcount_rows(self,filter:Optional[str]=None)->int:"""        Count the number of rows in the table.        Parameters        ----------        filter: str, optional            A SQL where clause to filter the rows to count.        """raiseNotImplementedErrordefto_pandas(self)->"pandas.DataFrame":"""Return the table as a pandas DataFrame.        Returns        -------        pd.DataFrame        """returnself.to_arrow().to_pandas()@abstractmethoddefto_arrow(self)->pa.Table:"""Return the table as a pyarrow Table.        Returns        -------        pa.Table        """raiseNotImplementedErrordefcreate_index(self,metric="l2",num_partitions=256,num_sub_vectors=96,vector_column_name:str=VECTOR_COLUMN_NAME,replace:bool=True,accelerator:Optional[str]=None,index_cache_size:Optional[int]=None,*,index_type:VectorIndexType="IVF_PQ",wait_timeout:Optional[timedelta]=None,num_bits:int=8,max_iterations:int=50,sample_rate:int=256,m:int=20,ef_construction:int=300,):"""Create an index on the table.        Parameters        ----------        metric: str, default "l2"            The distance metric to use when creating the index.            
Valid values are "l2", "cosine", "dot", or "hamming".            l2 is euclidean distance.            Hamming is available only for binary vectors.        num_partitions: int, default 256            The number of IVF partitions to use when creating the index.            Default is 256.        num_sub_vectors: int, default 96            The number of PQ sub-vectors to use when creating the index.            Default is 96.        vector_column_name: str, default "vector"            The vector column name to create the index.        replace: bool, default True            - If True, replace the existing index if it exists.            - If False, raise an error if duplicate index exists.        accelerator: str, default None            If set, use the given accelerator to create the index.            Only support "cuda" for now.        index_cache_size : int, optional            The size of the index cache in number of entries. Default value is 256.        num_bits: int            The number of bits to encode sub-vectors. Only used with the IVF_PQ index.            Only 4 and 8 are supported.        wait_timeout: timedelta, optional            The timeout to wait if indexing is asynchronous.        """raiseNotImplementedErrordefdrop_index(self,name:str)->None:"""        Drop an index from the table.        Parameters        ----------        name: str            The name of the index to drop.        Notes        -----        This does not delete the index from disk, it just removes it from the table.        To delete the index, run [optimize][lancedb.table.Table.optimize]        after dropping the index.        Use [list_indices][lancedb.table.Table.list_indices] to find the names of        the indices.        """raiseNotImplementedErrordefwait_for_index(self,index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None:"""        Wait for indexing to complete for the given index names.        This will poll the table until all the indices are fully indexed,        or raise a timeout exception if the timeout is reached.        Parameters        ----------        index_names: str            The name of the indices to poll        timeout: timedelta            Timeout to wait for asynchronous indexing. The default is 5 minutes.        """raiseNotImplementedError@abstractmethoddefstats(self)->TableStatistics:"""        Retrieve table and fragment statistics.        """raiseNotImplementedError@abstractmethoddefcreate_scalar_index(self,column:str,*,replace:bool=True,index_type:ScalarIndexType="BTREE",wait_timeout:Optional[timedelta]=None,):"""Create a scalar index on a column.        Parameters        ----------        column : str            The column to be indexed.  Must be a boolean, integer, float,            or string column.        replace : bool, default True            Replace the existing index if it exists.        index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"            The type of index to create.        wait_timeout: timedelta, optional            The timeout to wait if indexing is asynchronous.        Examples        --------        Scalar indices, like vector indices, can be used to speed up scans.  A scalar        index can speed up scans that contain filter expressions on the indexed column.        
For example, the following scan will be faster if the column ``my_col`` has        a scalar index:        >>> import lancedb # doctest: +SKIP        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP        >>> img_table = db.open_table("images") # doctest: +SKIP        >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP        ...                                  prefilter=True).to_pandas()        Scalar indices can also speed up scans containing a vector search and a        prefilter:        >>> import lancedb # doctest: +SKIP        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP        >>> img_table = db.open_table("images") # doctest: +SKIP        >>> img_table.search([1, 2, 3, 4], vector_column_name="vector") # doctest: +SKIP        ...     .where("my_col != 7", prefilter=True)        ...     .to_pandas()        Scalar indices can only speed up scans for basic filters using        equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set        membership (e.g. `my_col IN (0, 1, 2)`)        Scalar indices can be used if the filter contains multiple indexed columns and        the filter criteria are AND'd or OR'd together        (e.g. ``my_col < 0 AND other_col> 100``)        Scalar indices may be used if the filter contains non-indexed columns but,        depending on the structure of the filter, they may not be usable.  For example,        if the column ``not_indexed`` does not have a scalar index then the filter        ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on        ``my_col``.        """raiseNotImplementedErrordefcreate_fts_index(self,field_names:Union[str,List[str]],*,ordering_field_names:Optional[Union[str,List[str]]]=None,replace:bool=False,writer_heap_size:Optional[int]=1024*1024*1024,use_tantivy:bool=False,tokenizer_name:Optional[str]=None,with_position:bool=False,# tokenizer configs:base_tokenizer:BaseTokenizerType="simple",language:str="English",max_token_length:Optional[int]=40,lower_case:bool=True,stem:bool=True,remove_stop_words:bool=True,ascii_folding:bool=True,ngram_min_length:int=3,ngram_max_length:int=3,prefix_only:bool=False,wait_timeout:Optional[timedelta]=None,):"""Create a full-text search index on the table.        Warning - this API is highly experimental and is highly likely to change        in the future.        Parameters        ----------        field_names: str or list of str            The name(s) of the field to index.            can be only str if use_tantivy=True for now.        replace: bool, default False            If True, replace the existing index if it exists. Note that this is            not yet an atomic operation; the index will be temporarily            unavailable while the new index is being created.        writer_heap_size: int, default 1GB            Only available with use_tantivy=True        ordering_field_names:            A list of unsigned type fields to index to optionally order            results on at search time.            only available with use_tantivy=True        tokenizer_name: str, default "default"            The tokenizer to use for the index. Can be "raw", "default" or the 2 letter            language code followed by "_stem". So for english it would be "en_stem".            For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html        use_tantivy: bool, default False            If True, use the legacy full-text search implementation based on tantivy.            
If False, use the new full-text search implementation based on lance-index.        with_position: bool, default False            Only available with use_tantivy=False            If False, do not store the positions of the terms in the text.            This can reduce the size of the index and improve indexing speed.            But it will raise an exception for phrase queries.        base_tokenizer : str, default "simple"            The base tokenizer to use for tokenization. Options are:            - "simple": Splits text by whitespace and punctuation.            - "whitespace": Split text by whitespace, but not punctuation.            - "raw": No tokenization. The entire text is treated as a single token.            - "ngram": N-Gram tokenizer.        language : str, default "English"            The language to use for tokenization.        max_token_length : int, default 40            The maximum token length to index. Tokens longer than this length will be            ignored.        lower_case : bool, default True            Whether to convert the token to lower case. This makes queries            case-insensitive.        stem : bool, default True            Whether to stem the token. Stemming reduces words to their root form.            For example, in English "running" and "runs" would both be reduced to "run".        remove_stop_words : bool, default True            Whether to remove stop words. Stop words are common words that are often            removed from text before indexing. For example, in English "the" and "and".        ascii_folding : bool, default True            Whether to fold ASCII characters. This converts accented characters to            their ASCII equivalent. For example, "café" would be converted to "cafe".        ngram_min_length: int, default 3            The minimum length of an n-gram.        ngram_max_length: int, default 3            The maximum length of an n-gram.        prefix_only: bool, default False            Whether to only index the prefix of the token for ngram tokenizer.        wait_timeout: timedelta, optional            The timeout to wait if indexing is asynchronous.        """raiseNotImplementedError@abstractmethoddefadd(self,data:DATA,mode:AddMode="append",on_bad_vectors:OnBadVectorsType="error",fill_value:float=0.0,)->AddResult:"""Add more data to the [Table](Table).        Parameters        ----------        data: DATA            The data to insert into the table. Acceptable types are:            - list-of-dict            - pandas.DataFrame            - pyarrow.Table or pyarrow.RecordBatch        mode: str            The mode to use when writing the data. Valid values are            "append" and "overwrite".        on_bad_vectors: str, default "error"            What to do if any of the vectors are not the same size or contains NaNs.            One of "error", "drop", "fill".        fill_value: float, default 0.            The value to use when filling vectors. Only used if on_bad_vectors="fill".        Returns        -------        AddResult            An object containing the new version number of the table after adding data.        """raiseNotImplementedErrordefmerge_insert(self,on:Union[str,Iterable[str]])->LanceMergeInsertBuilder:"""        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]        that can be used to create a "merge insert" operation        This operation can add rows, update rows, and remove rows all in a single        transaction. 
It is a very generic tool that can be used to create        behaviors like "insert if not exists", "update or insert (i.e. upsert)",        or even replace a portion of existing data with new data (e.g. replace        all data where month="january")        The merge insert operation works by combining new data from a        **source table** with existing data in a **target table** by using a        join.  There are three categories of records.        "Matched" records are records that exist in both the source table and        the target table. "Not matched" records exist only in the source table        (e.g. these are new data) "Not matched by source" records exist only        in the target table (this is old data)        The builder returned by this method can be used to customize what        should happen for each category of data.        Please note that the data may appear to be reordered as part of this        operation.  This is because updated rows will be deleted from the        dataset and then reinserted at the end with the new values.        Parameters        ----------        on: Union[str, Iterable[str]]            A column (or columns) to join on.  This is how records from the            source table and target table are matched.  Typically this is some            kind of key or id column.        Examples        --------        >>> import lancedb        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", data)        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})        >>> # Perform a "upsert" operation        >>> res = table.merge_insert("a")     \\        ...      .when_matched_update_all()     \\        ...      .when_not_matched_insert_all() \\        ...      .execute(new_data)        >>> res        MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)        >>> # The order of new rows is non-deterministic since we use        >>> # a hash-join as part of this operation and so we sort here        >>> table.to_arrow().sort_by("a").to_pandas()           a  b        0  1  b        1  2  x        2  3  y        3  4  z        """# noqa: E501on=[on]ifisinstance(on,str)elselist(iter(on))returnLanceMergeInsertBuilder(self,on)@abstractmethoddefsearch(self,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType="auto",ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->LanceQueryBuilder:"""Create a search query to find the nearest neighbors        of the given query vector. We currently support [vector search][search]        and [full-text search][experimental-full-text-search].        All query options are defined in        [LanceQueryBuilder][lancedb.query.LanceQueryBuilder].        Examples        --------        >>> import lancedb        >>> db = lancedb.connect("./.lancedb")        >>> data = [        ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},        ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},        ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}        ... ]        >>> table = db.create_table("my_table", data)        >>> query = [0.4, 1.4, 2.4]        >>> (table.search(query)        ...     .where("original_width > 1000", prefilter=True)        ...     .select(["caption", "original_width", "vector"])        ...     
.limit(2)        ...     .to_pandas())          caption  original_width           vector  _distance        0     foo            2000  [0.5, 3.4, 1.3]   5.220000        1    test            3000  [0.3, 6.2, 2.6]  23.089996        Parameters        ----------        query: list/np.ndarray/str/PIL.Image.Image, default None            The targetted vector to search for.            - *default None*.            Acceptable types are: list, np.ndarray, PIL.Image.Image            - If None then the select/where/limit clauses are applied to filter            the table        vector_column_name: str, optional            The name of the vector column to search.            The vector column needs to be a pyarrow fixed size list type            - If not specified then the vector column is inferred from            the table schema            - If the table has multiple vector columns then the *vector_column_name*            needs to be specified. Otherwise, an error is raised.        query_type: str            *default "auto"*.            Acceptable types are: "vector", "fts", "hybrid", or "auto"            - If "auto" then the query type is inferred from the query;                - If `query` is a list/np.ndarray then the query type is                "vector";                - If `query` is a PIL.Image.Image then either do vector search,                or raise an error if no corresponding embedding function is found.            - If `query` is a string, then the query type is "vector" if the            table has embedding functions else the query type is "fts"        Returns        -------        LanceQueryBuilder            A query builder object representing the query.            Once executed, the query returns            - selected columns            - the vector            - and also the "_distance" column which is the distance between the query            vector and the returned vector.        """raiseNotImplementedError@abstractmethoddef_execute_query(self,query:Query,*,batch_size:Optional[int]=None,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:...@abstractmethoddef_explain_plan(self,query:Query,verbose:Optional[bool]=False)->str:...@abstractmethoddef_analyze_plan(self,query:Query)->str:...@abstractmethoddef_do_merge(self,merge:LanceMergeInsertBuilder,new_data:DATA,on_bad_vectors:OnBadVectorsType,fill_value:float,)->MergeResult:...@abstractmethoddefdelete(self,where:str)->DeleteResult:"""Delete rows from the table.        This can be used to delete a single row, many rows, all rows, or        sometimes no rows (if your predicate matches nothing).        Parameters        ----------        where: str            The SQL where clause to use when deleting rows.            - For example, 'x = 2' or 'x IN (1, 2, 3)'.            The filter must not be empty, or it will error.        Returns        -------        DeleteResult            An object containing the new version number of the table after deletion.        Examples        --------        >>> import lancedb        >>> data = [        ...    {"x": 1, "vector": [1.0, 2]},        ...    {"x": 2, "vector": [3.0, 4]},        ...    {"x": 3, "vector": [5.0, 6]}        ... 
]        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", data)        >>> table.to_pandas()           x      vector        0  1  [1.0, 2.0]        1  2  [3.0, 4.0]        2  3  [5.0, 6.0]        >>> table.delete("x = 2")        DeleteResult(version=2)        >>> table.to_pandas()           x      vector        0  1  [1.0, 2.0]        1  3  [5.0, 6.0]        If you have a list of values to delete, you can combine them into a        stringified list and use the `IN` operator:        >>> to_remove = [1, 5]        >>> to_remove = ", ".join([str(v) for v in to_remove])        >>> to_remove        '1, 5'        >>> table.delete(f"x IN ({to_remove})")        DeleteResult(version=3)        >>> table.to_pandas()           x      vector        0  3  [5.0, 6.0]        """raiseNotImplementedError@abstractmethoddefupdate(self,where:Optional[str]=None,values:Optional[dict]=None,*,values_sql:Optional[Dict[str,str]]=None,)->UpdateResult:"""        This can be used to update zero to all rows depending on how many        rows match the where clause. If no where clause is provided, then        all rows will be updated.        Either `values` or `values_sql` must be provided. You cannot provide        both.        Parameters        ----------        where: str, optional            The SQL where clause to use when updating rows. For example, 'x = 2'            or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.        values: dict, optional            The values to update. The keys are the column names and the values            are the values to set.        values_sql: dict, optional            The values to update, expressed as SQL expression strings. These can            reference existing columns. For example, {"x": "x + 1"} will increment            the x column by 1.        Returns        -------        UpdateResult            - rows_updated: The number of rows that were updated            - version: The new version number of the table after the update        Examples        --------        >>> import lancedb        >>> import pandas as pd        >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", data)        >>> table.to_pandas()           x      vector        0  1  [1.0, 2.0]        1  2  [3.0, 4.0]        2  3  [5.0, 6.0]        >>> table.update(where="x = 2", values={"vector": [10.0, 10]})        UpdateResult(rows_updated=1, version=2)        >>> table.to_pandas()           x        vector        0  1    [1.0, 2.0]        1  3    [5.0, 6.0]        2  2  [10.0, 10.0]        >>> table.update(values_sql={"x": "x + 1"})        UpdateResult(rows_updated=3, version=3)        >>> table.to_pandas()           x        vector        0  2    [1.0, 2.0]        1  4    [5.0, 6.0]        2  3  [10.0, 10.0]        """raiseNotImplementedError@abstractmethoddefcleanup_old_versions(self,older_than:Optional[timedelta]=None,*,delete_unverified:bool=False,)->"CleanupStats":"""        Clean up old versions of the table, freeing disk space.        Parameters        ----------        older_than: timedelta, default None            The minimum age of the version to delete. If None, then this defaults            to two weeks.        delete_unverified: bool, default False            Because they may be part of an in-progress transaction, files newer            than 7 days old are not deleted by default. 
If you are sure that            there are no in-progress transactions, then you can set this to True            to delete all files older than `older_than`.        Returns        -------        CleanupStats            The stats of the cleanup operation, including how many bytes were            freed.        See Also        --------        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive            optimization operation that includes cleanup as well as other operations.        Notes        -----        This function is not available in LanceDb Cloud (since LanceDB        Cloud manages cleanup for you automatically)        """@abstractmethoddefcompact_files(self,*args,**kwargs):"""        Run the compaction process on the table.        This can be run after making several small appends to optimize the table        for faster reads.        Arguments are passed onto Lance's        [compact_files][lance.dataset.DatasetOptimizer.compact_files].        For most cases, the default should be fine.        See Also        --------        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive            optimization operation that includes cleanup as well as other operations.        Notes        -----        This function is not available in LanceDB Cloud (since LanceDB        Cloud manages compaction for you automatically)        """@abstractmethoddefoptimize(self,*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain:bool=False,):"""        Optimize the on-disk data and indices for better performance.        Modeled after ``VACUUM`` in PostgreSQL.        Optimization covers three operations:         * Compaction: Merges small files into larger ones         * Prune: Removes old versions of the dataset         * Index: Optimizes the indices, adding new data to existing indices        Parameters        ----------        cleanup_older_than: timedelta, optional default 7 days            All files belonging to versions older than this will be removed.  Set            to 0 days to remove all versions except the latest.  The latest version            is never removed.        delete_unverified: bool, default False            Files leftover from a failed transaction may appear to be part of an            in-progress operation (e.g. appending new data) and these files will not            be deleted unless they are at least 7 days old. If delete_unverified is True            then these files will be deleted regardless of their age.        retrain: bool, default False            If True, retrain the vector indices, this would refine the IVF clustering            and quantization, which may improve the search accuracy. It's faster than            re-creating the index from scratch, so it's recommended to try this first,            when the data distribution has changed significantly.        Experimental API        ----------------        The optimization process is undergoing active development and may change.        Our goal with these changes is to improve the performance of optimization and        reduce the complexity.        That being said, it is essential today to run optimize if you want the best        performance.  It should be stable and safe to use in production, but it our        hope that the API may be simplified (or not even need to be called) in the        future.        The frequency an application shoudl call optimize is based on the frequency of        data modifications.  
If data is frequently added, deleted, or updated then        optimize should be run frequently.  A good rule of thumb is to run optimize if        you have added or modified 100,000 or more records or run more than 20 data        modification operations.        """@abstractmethoddeflist_indices(self)->Iterable[IndexConfig]:"""        List all indices that have been created with        [Table.create_index][lancedb.table.Table.create_index]        """@abstractmethoddefindex_stats(self,index_name:str)->Optional[IndexStatistics]:"""        Retrieve statistics about an index        Parameters        ----------        index_name: str            The name of the index to retrieve statistics for        Returns        -------        IndexStatistics or None            The statistics about the index. Returns None if the index does not exist.        """@abstractmethoddefadd_columns(self,transforms:Dict[str,str]|pa.Field|List[pa.Field]|pa.Schema):"""        Add new columns with defined values.        Parameters        ----------        transforms: Dict[str, str], pa.Field, List[pa.Field], pa.Schema            A map of column name to a SQL expression to use to calculate the            value of the new column. These expressions will be evaluated for            each row in the table, and can reference existing columns.            Alternatively, a pyarrow Field or Schema can be provided to add            new columns with the specified data types. The new columns will            be initialized with null values.        Returns        -------        AddColumnsResult            version: the new version number of the table after adding columns.        """@abstractmethoddefalter_columns(self,*alterations:Iterable[Dict[str,str]]):"""        Alter column names and nullability.        Parameters        ----------        alterations : Iterable[Dict[str, Any]]            A sequence of dictionaries, each with the following keys:            - "path": str                The column path to alter. For a top-level column, this is the name.                For a nested column, this is the dot-separated path, e.g. "a.b.c".            - "rename": str, optional                The new name of the column. If not specified, the column name is                not changed.            - "data_type": pyarrow.DataType, optional               The new data type of the column. Existing values will be casted               to this type. If not specified, the column data type is not changed.            - "nullable": bool, optional                Whether the column should be nullable. If not specified, the column                nullability is not changed. Only non-nullable columns can be changed                to nullable. Currently, you cannot change a nullable column to                non-nullable.        Returns        -------        AlterColumnsResult            version: the new version number of the table after the alteration.        """@abstractmethoddefdrop_columns(self,columns:Iterable[str])->DropColumnsResult:"""        Drop columns from the table.        Parameters        ----------        columns : Iterable[str]            The names of the columns to drop.        Returns        -------        DropColumnsResult            version: the new version number of the table dropping the columns.        """@abstractmethoddefcheckout(self,version:Union[int,str]):"""        Checks out a specific version of the Table        Any read operation on the table will now access the data at the checked out        version. 
As a consequence, calling this method will disable any read consistency        interval that was previously set.        This is a read-only operation that turns the table into a sort of "view"        or "detached head".  Other table instances will not be affected.  To make the        change permanent you can use the `[Self::restore]` method.        Any operation that modifies the table will fail while the table is in a checked        out state.        Parameters        ----------        version: int | str,            The version to check out. A version number (`int`) or a tag            (`str`) can be provided.        To return the table to a normal state use `[Self::checkout_latest]`        """@abstractmethoddefcheckout_latest(self):"""        Ensures the table is pointing at the latest version        This can be used to manually update a table when the read_consistency_interval        is None        It can also be used to undo a `[Self::checkout]` operation        """@abstractmethoddefrestore(self,version:Optional[Union[int,str]]=None):"""Restore a version of the table. This is an in-place operation.        This creates a new version where the data is equivalent to the        specified previous version. Data is not copied (as of python-v0.2.1).        Parameters        ----------        version : int or str, default None            The version number or version tag to restore.            If unspecified then restores the currently checked out version.            If the currently checked out version is the            latest version then this is a no-op.        """@abstractmethoddeflist_versions(self)->List[Dict[str,Any]]:"""List all versions of the table"""@cached_propertydef_dataset_uri(self)->str:return_table_uri(self._conn.uri,self.name)def_get_fts_index_path(self)->Tuple[str,pa_fs.FileSystem,bool]:from.remote.tableimportRemoteTableifisinstance(self,RemoteTable)orget_uri_scheme(self._dataset_uri)!="file":return("",None,False)path=join_uri(self._dataset_uri,"_indices","fts")fs,path=fs_from_uri(path)index_exists=fs.get_file_info(path).type!=pa_fs.FileType.NotFoundreturn(path,fs,index_exists)@abstractmethoddefuses_v2_manifest_paths(self)->bool:"""        Check if the table is using the new v2 manifest paths.        Returns        -------        bool            True if the table is using the new v2 manifest paths, False otherwise.        """@abstractmethoddefmigrate_v2_manifest_paths(self):"""        Migrate the manifest paths to the new format.        This will update the manifest to use the new v2 format for paths.        This function is idempotent, and can be run multiple times without        changing the state of the object store.        !!! danger            This should not be run while other concurrent operations are happening.            And it should also run until completion before resuming other operations.        You can use        [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]        to check if the table is already using the new path style.        """

nameabstractmethodproperty

name:str

The name of this Table

versionabstractmethodproperty

version:int

The version of this Table

schemaabstractmethodproperty

schema:Schema

The Arrow Schema of this Table

tagsabstractmethodproperty

tags:Tags

Tag management for the table.

Similar to Git, tags are a way to add metadata to a specific version of the table.

Warning

Tagged versions are exempted from the cleanup_old_versions() process. To remove a version that has been tagged, you must first Tags.delete the associated tag.

Examples:

table = db.open_table("my_table")
table.tags.create("v2-prod-20250203", 10)
tags = table.tags.list()

embedding_functionsabstractmethodproperty

embedding_functions:Dict[str,EmbeddingFunctionConfig]

Get a mapping from vector column name to its configured embedding function.

__len__

__len__()->int

The number of rows in this Table

Source code inlancedb/table.py
def__len__(self)->int:"""The number of rows in this Table"""returnself.count_rows(None)

count_rowsabstractmethod

count_rows(filter:Optional[str]=None)->int

Count the number of rows in the table.

Parameters:

  • filter (Optional[str], default:None) –

    A SQL where clause to filter the rows to count.
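
A brief sketch (the integer column x is assumed for illustration):

n_total = table.count_rows()          # count every row
n_match = table.count_rows("x > 2")   # count only rows matching a SQL where clause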

Source code inlancedb/table.py
@abstractmethoddefcount_rows(self,filter:Optional[str]=None)->int:"""    Count the number of rows in the table.    Parameters    ----------    filter: str, optional        A SQL where clause to filter the rows to count.    """raiseNotImplementedError

to_pandas

to_pandas()->'pandas.DataFrame'

Return the table as a pandas DataFrame.

Returns:

  • DataFrame
Source code inlancedb/table.py
defto_pandas(self)->"pandas.DataFrame":"""Return the table as a pandas DataFrame.    Returns    -------    pd.DataFrame    """returnself.to_arrow().to_pandas()

to_arrowabstractmethod

to_arrow()->Table

Return the table as a pyarrow Table.

Returns:

Source code inlancedb/table.py
@abstractmethoddefto_arrow(self)->pa.Table:"""Return the table as a pyarrow Table.    Returns    -------    pa.Table    """raiseNotImplementedError

create_index

create_index(metric='l2',num_partitions=256,num_sub_vectors=96,vector_column_name:str=VECTOR_COLUMN_NAME,replace:bool=True,accelerator:Optional[str]=None,index_cache_size:Optional[int]=None,*,index_type:VectorIndexType='IVF_PQ',wait_timeout:Optional[timedelta]=None,num_bits:int=8,max_iterations:int=50,sample_rate:int=256,m:int=20,ef_construction:int=300)

Create an index on the table.

Parameters:

  • metric

The distance metric to use when creating the index. Valid values are "l2", "cosine", "dot", or "hamming". l2 is Euclidean distance. Hamming is available only for binary vectors.

  • num_partitions

The number of IVF partitions to use when creating the index. Default is 256.

  • num_sub_vectors

The number of PQ sub-vectors to use when creating the index. Default is 96.

  • vector_column_name (str, default:VECTOR_COLUMN_NAME) –

    The vector column name to create the index.

  • replace (bool, default:True) –
    • If True, replace the existing index if it exists.

    • If False, raise an error if duplicate index exists.

  • accelerator (Optional[str], default:None) –

If set, use the given accelerator to create the index. Only "cuda" is supported for now.

  • index_cache_size (int, default:None) –

    The size of the index cache in number of entries. Default value is 256.

  • num_bits (int, default:8) –

The number of bits to encode sub-vectors. Only used with the IVF_PQ index. Only 4 and 8 are supported.

  • wait_timeout (Optional[timedelta], default:None) –

    The timeout to wait if indexing is asynchronous.
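
A minimal sketch of a typical call; the parameter values below are illustrative, not recommendations, and assume the table has a column named "vector":

table.create_index(
    metric="cosine",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name="vector",
    replace=True,
)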

Source code inlancedb/table.py
defcreate_index(self,metric="l2",num_partitions=256,num_sub_vectors=96,vector_column_name:str=VECTOR_COLUMN_NAME,replace:bool=True,accelerator:Optional[str]=None,index_cache_size:Optional[int]=None,*,index_type:VectorIndexType="IVF_PQ",wait_timeout:Optional[timedelta]=None,num_bits:int=8,max_iterations:int=50,sample_rate:int=256,m:int=20,ef_construction:int=300,):"""Create an index on the table.    Parameters    ----------    metric: str, default "l2"        The distance metric to use when creating the index.        Valid values are "l2", "cosine", "dot", or "hamming".        l2 is euclidean distance.        Hamming is available only for binary vectors.    num_partitions: int, default 256        The number of IVF partitions to use when creating the index.        Default is 256.    num_sub_vectors: int, default 96        The number of PQ sub-vectors to use when creating the index.        Default is 96.    vector_column_name: str, default "vector"        The vector column name to create the index.    replace: bool, default True        - If True, replace the existing index if it exists.        - If False, raise an error if duplicate index exists.    accelerator: str, default None        If set, use the given accelerator to create the index.        Only support "cuda" for now.    index_cache_size : int, optional        The size of the index cache in number of entries. Default value is 256.    num_bits: int        The number of bits to encode sub-vectors. Only used with the IVF_PQ index.        Only 4 and 8 are supported.    wait_timeout: timedelta, optional        The timeout to wait if indexing is asynchronous.    """raiseNotImplementedError

drop_index

drop_index(name:str)->None

Drop an index from the table.

Parameters:

  • name (str) –

    The name of the index to drop.

Notes

This does not delete the index from disk, it just removes it from the table. To delete the index, run optimize after dropping the index.

Use list_indices to find the names of the indices.
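
An illustrative sequence; the index name "vector_idx" is hypothetical (use list_indices to find the real one):

table.drop_index("vector_idx")
table.optimize()  # removes the dropped index files from disk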

Source code inlancedb/table.py
defdrop_index(self,name:str)->None:"""    Drop an index from the table.    Parameters    ----------    name: str        The name of the index to drop.    Notes    -----    This does not delete the index from disk, it just removes it from the table.    To delete the index, run [optimize][lancedb.table.Table.optimize]    after dropping the index.    Use [list_indices][lancedb.table.Table.list_indices] to find the names of    the indices.    """raiseNotImplementedError

wait_for_index

wait_for_index(index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None

Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.

Parameters:

  • index_names (Iterable[str]) –

The names of the indices to poll

  • timeout (timedelta, default:timedelta(seconds=300)) –

    Timeout to wait for asynchronous indexing. The default is 5 minutes.
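
A sketch of polling for asynchronous indexing; the index name "my_vector_idx" is hypothetical:

from datetime import timedelta

table.wait_for_index(["my_vector_idx"], timeout=timedelta(minutes=10))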

Source code inlancedb/table.py
defwait_for_index(self,index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None:"""    Wait for indexing to complete for the given index names.    This will poll the table until all the indices are fully indexed,    or raise a timeout exception if the timeout is reached.    Parameters    ----------    index_names: str        The name of the indices to poll    timeout: timedelta        Timeout to wait for asynchronous indexing. The default is 5 minutes.    """raiseNotImplementedError

statsabstractmethod

stats()->TableStatistics

Retrieve table and fragment statistics.

Source code inlancedb/table.py
@abstractmethoddefstats(self)->TableStatistics:"""    Retrieve table and fragment statistics.    """raiseNotImplementedError

create_scalar_indexabstractmethod

create_scalar_index(column:str,*,replace:bool=True,index_type:ScalarIndexType='BTREE',wait_timeout:Optional[timedelta]=None)

Create a scalar index on a column.

Parameters:

  • column (str) –

The column to be indexed. Must be a boolean, integer, float, or string column.

  • replace (bool, default:True) –

    Replace the existing index if it exists.

  • index_type (ScalarIndexType, default:'BTREE') –

    The type of index to create.

  • wait_timeout (Optional[timedelta], default:None) –

    The timeout to wait if indexing is asynchronous.

Examples:

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

>>>importlancedb>>>db=lancedb.connect("/data/lance")>>>img_table=db.open_table("images")>>>my_df=img_table.search().where("my_col = 7",...prefilter=True).to_pandas()

Scalar indices can also speed up scans containing a vector search and aprefilter:

>>>importlancedb>>>db=lancedb.connect("/data/lance")>>>img_table=db.open_table("images")>>>img_table.search([1,2,3,4],vector_column_name="vector")....where("my_col != 7",prefilter=True)....to_pandas()

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2))

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100)

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

Source code inlancedb/table.py
@abstractmethoddefcreate_scalar_index(self,column:str,*,replace:bool=True,index_type:ScalarIndexType="BTREE",wait_timeout:Optional[timedelta]=None,):"""Create a scalar index on a column.    Parameters    ----------    column : str        The column to be indexed.  Must be a boolean, integer, float,        or string column.    replace : bool, default True        Replace the existing index if it exists.    index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"        The type of index to create.    wait_timeout: timedelta, optional        The timeout to wait if indexing is asynchronous.    Examples    --------    Scalar indices, like vector indices, can be used to speed up scans.  A scalar    index can speed up scans that contain filter expressions on the indexed column.    For example, the following scan will be faster if the column ``my_col`` has    a scalar index:    >>> import lancedb # doctest: +SKIP    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP    >>> img_table = db.open_table("images") # doctest: +SKIP    >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP    ...                                  prefilter=True).to_pandas()    Scalar indices can also speed up scans containing a vector search and a    prefilter:    >>> import lancedb # doctest: +SKIP    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP    >>> img_table = db.open_table("images") # doctest: +SKIP    >>> img_table.search([1, 2, 3, 4], vector_column_name="vector") # doctest: +SKIP    ...     .where("my_col != 7", prefilter=True)    ...     .to_pandas()    Scalar indices can only speed up scans for basic filters using    equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set    membership (e.g. `my_col IN (0, 1, 2)`)    Scalar indices can be used if the filter contains multiple indexed columns and    the filter criteria are AND'd or OR'd together    (e.g. ``my_col < 0 AND other_col> 100``)    Scalar indices may be used if the filter contains non-indexed columns but,    depending on the structure of the filter, they may not be usable.  For example,    if the column ``not_indexed`` does not have a scalar index then the filter    ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on    ``my_col``.    """raiseNotImplementedError

create_fts_index

create_fts_index(field_names:Union[str,List[str]],*,ordering_field_names:Optional[Union[str,List[str]]]=None,replace:bool=False,writer_heap_size:Optional[int]=1024*1024*1024,use_tantivy:bool=False,tokenizer_name:Optional[str]=None,with_position:bool=False,base_tokenizer:BaseTokenizerType='simple',language:str='English',max_token_length:Optional[int]=40,lower_case:bool=True,stem:bool=True,remove_stop_words:bool=True,ascii_folding:bool=True,ngram_min_length:int=3,ngram_max_length:int=3,prefix_only:bool=False,wait_timeout:Optional[timedelta]=None)

Create a full-text search index on the table.

Warning - this API is highly experimental and is highly likely to change in the future.

Parameters:

  • field_names (Union[str,List[str]]) –

The name(s) of the field to index. For now, this can only be a single str if use_tantivy=True.

  • replace (bool, default:False) –

If True, replace the existing index if it exists. Note that this is not yet an atomic operation; the index will be temporarily unavailable while the new index is being created.

  • writer_heap_size (Optional[int], default:1024 * 1024 * 1024) –

    Only available with use_tantivy=True

  • ordering_field_names (Optional[Union[str,List[str]]], default:None) –

A list of unsigned type fields to index to optionally order results on at search time. Only available with use_tantivy=True.

  • tokenizer_name (Optional[str], default:None) –

The tokenizer to use for the index. Can be "raw", "default" or the two-letter language code followed by "_stem". So for English it would be "en_stem". For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html

  • use_tantivy (bool, default:False) –

If True, use the legacy full-text search implementation based on tantivy. If False, use the new full-text search implementation based on lance-index.

  • with_position (bool, default:False) –

Only available with use_tantivy=False. If False, do not store the positions of the terms in the text. This can reduce the size of the index and improve indexing speed. But it will raise an exception for phrase queries.

  • base_tokenizer (str, default:"simple") –

The base tokenizer to use for tokenization. Options are:
- "simple": Splits text by whitespace and punctuation.
- "whitespace": Split text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
- "ngram": N-Gram tokenizer.

  • language (str, default:"English") –

    The language to use for tokenization.

  • max_token_length (int, default:40) –

The maximum token length to index. Tokens longer than this length will be ignored.

  • lower_case (bool, default:True) –

Whether to convert the token to lower case. This makes queries case-insensitive.

  • stem (bool, default:True) –

Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".

  • remove_stop_words (bool, default:True) –

Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".

  • ascii_folding (bool, default:True) –

Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".

  • ngram_min_length (int, default:3) –

    The minimum length of an n-gram.

  • ngram_max_length (int, default:3) –

    The maximum length of an n-gram.

  • prefix_only (bool, default:False) –

    Whether to only index the prefix of the token for ngram tokenizer.

  • wait_timeout (Optional[timedelta], default:None) –

    The timeout to wait if indexing is asynchronous.
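
A minimal sketch, assuming the table has a text column named "caption":

table.create_fts_index("caption", use_tantivy=False)
hits = table.search("rainy day", query_type="fts").limit(5).to_pandas()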

Source code inlancedb/table.py
defcreate_fts_index(self,field_names:Union[str,List[str]],*,ordering_field_names:Optional[Union[str,List[str]]]=None,replace:bool=False,writer_heap_size:Optional[int]=1024*1024*1024,use_tantivy:bool=False,tokenizer_name:Optional[str]=None,with_position:bool=False,# tokenizer configs:base_tokenizer:BaseTokenizerType="simple",language:str="English",max_token_length:Optional[int]=40,lower_case:bool=True,stem:bool=True,remove_stop_words:bool=True,ascii_folding:bool=True,ngram_min_length:int=3,ngram_max_length:int=3,prefix_only:bool=False,wait_timeout:Optional[timedelta]=None,):"""Create a full-text search index on the table.    Warning - this API is highly experimental and is highly likely to change    in the future.    Parameters    ----------    field_names: str or list of str        The name(s) of the field to index.        can be only str if use_tantivy=True for now.    replace: bool, default False        If True, replace the existing index if it exists. Note that this is        not yet an atomic operation; the index will be temporarily        unavailable while the new index is being created.    writer_heap_size: int, default 1GB        Only available with use_tantivy=True    ordering_field_names:        A list of unsigned type fields to index to optionally order        results on at search time.        only available with use_tantivy=True    tokenizer_name: str, default "default"        The tokenizer to use for the index. Can be "raw", "default" or the 2 letter        language code followed by "_stem". So for english it would be "en_stem".        For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html    use_tantivy: bool, default False        If True, use the legacy full-text search implementation based on tantivy.        If False, use the new full-text search implementation based on lance-index.    with_position: bool, default False        Only available with use_tantivy=False        If False, do not store the positions of the terms in the text.        This can reduce the size of the index and improve indexing speed.        But it will raise an exception for phrase queries.    base_tokenizer : str, default "simple"        The base tokenizer to use for tokenization. Options are:        - "simple": Splits text by whitespace and punctuation.        - "whitespace": Split text by whitespace, but not punctuation.        - "raw": No tokenization. The entire text is treated as a single token.        - "ngram": N-Gram tokenizer.    language : str, default "English"        The language to use for tokenization.    max_token_length : int, default 40        The maximum token length to index. Tokens longer than this length will be        ignored.    lower_case : bool, default True        Whether to convert the token to lower case. This makes queries        case-insensitive.    stem : bool, default True        Whether to stem the token. Stemming reduces words to their root form.        For example, in English "running" and "runs" would both be reduced to "run".    remove_stop_words : bool, default True        Whether to remove stop words. Stop words are common words that are often        removed from text before indexing. For example, in English "the" and "and".    ascii_folding : bool, default True        Whether to fold ASCII characters. This converts accented characters to        their ASCII equivalent. For example, "café" would be converted to "cafe".    ngram_min_length: int, default 3        The minimum length of an n-gram.    
ngram_max_length: int, default 3        The maximum length of an n-gram.    prefix_only: bool, default False        Whether to only index the prefix of the token for ngram tokenizer.    wait_timeout: timedelta, optional        The timeout to wait if indexing is asynchronous.    """raiseNotImplementedError

addabstractmethod

add(data:DATA,mode:AddMode='append',on_bad_vectors:OnBadVectorsType='error',fill_value:float=0.0)->AddResult

Add more data to the Table.

Parameters:

  • data (DATA) –

    The data to insert into the table. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • mode (AddMode, default:'append') –

The mode to use when writing the data. Valid values are "append" and "overwrite".

  • on_bad_vectors (OnBadVectorsType, default:'error') –

What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default:0.0) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

Returns:

  • AddResult

    An object containing the new version number of the table after adding data.
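
A short sketch, assuming the table already has an integer column x and a 2-dimensional vector column; rows is hypothetical data:

res = table.add([{"x": 10, "vector": [1.0, 2.0]}])            # append a single row
res = table.add(rows, on_bad_vectors="fill", fill_value=0.0)  # fill any bad vectors instead of erroring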

Source code inlancedb/table.py
@abstractmethoddefadd(self,data:DATA,mode:AddMode="append",on_bad_vectors:OnBadVectorsType="error",fill_value:float=0.0,)->AddResult:"""Add more data to the [Table](Table).    Parameters    ----------    data: DATA        The data to insert into the table. Acceptable types are:        - list-of-dict        - pandas.DataFrame        - pyarrow.Table or pyarrow.RecordBatch    mode: str        The mode to use when writing the data. Valid values are        "append" and "overwrite".    on_bad_vectors: str, default "error"        What to do if any of the vectors are not the same size or contains NaNs.        One of "error", "drop", "fill".    fill_value: float, default 0.        The value to use when filling vectors. Only used if on_bad_vectors="fill".    Returns    -------    AddResult        An object containing the new version number of the table after adding data.    """raiseNotImplementedError

merge_insert

merge_insert(on:Union[str,Iterable[str]])->LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table andthe target table. "Not matched" records exist only in the source table(e.g. these are new data) "Not matched by source" records exist onlyin the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

  • on (Union[str,Iterable[str]]) –

A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

Examples:

>>>importlancedb>>>data=pa.table({"a":[2,1,3],"b":["a","b","c"]})>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",data)>>>new_data=pa.table({"a":[2,3,4],"b":["x","y","z"]})>>># Perform a "upsert" operation>>>res=table.merge_insert("a")     \....when_matched_update_all()     \....when_not_matched_insert_all() \....execute(new_data)>>>resMergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)>>># The order of new rows is non-deterministic since we use>>># a hash-join as part of this operation and so we sort here>>>table.to_arrow().sort_by("a").to_pandas()   a  b0  1  b1  2  x2  3  y3  4  z
Source code inlancedb/table.py
defmerge_insert(self,on:Union[str,Iterable[str]])->LanceMergeInsertBuilder:"""    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]    that can be used to create a "merge insert" operation    This operation can add rows, update rows, and remove rows all in a single    transaction. It is a very generic tool that can be used to create    behaviors like "insert if not exists", "update or insert (i.e. upsert)",    or even replace a portion of existing data with new data (e.g. replace    all data where month="january")    The merge insert operation works by combining new data from a    **source table** with existing data in a **target table** by using a    join.  There are three categories of records.    "Matched" records are records that exist in both the source table and    the target table. "Not matched" records exist only in the source table    (e.g. these are new data) "Not matched by source" records exist only    in the target table (this is old data)    The builder returned by this method can be used to customize what    should happen for each category of data.    Please note that the data may appear to be reordered as part of this    operation.  This is because updated rows will be deleted from the    dataset and then reinserted at the end with the new values.    Parameters    ----------    on: Union[str, Iterable[str]]        A column (or columns) to join on.  This is how records from the        source table and target table are matched.  Typically this is some        kind of key or id column.    Examples    --------    >>> import lancedb    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data)    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})    >>> # Perform a "upsert" operation    >>> res = table.merge_insert("a")     \\    ...      .when_matched_update_all()     \\    ...      .when_not_matched_insert_all() \\    ...      .execute(new_data)    >>> res    MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)    >>> # The order of new rows is non-deterministic since we use    >>> # a hash-join as part of this operation and so we sort here    >>> table.to_arrow().sort_by("a").to_pandas()       a  b    0  1  b    1  2  x    2  3  y    3  4  z    """# noqa: E501on=[on]ifisinstance(on,str)elselist(iter(on))returnLanceMergeInsertBuilder(self,on)

searchabstractmethod

search(query:Optional[Union[VEC,str,'PIL.Image.Image',Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType='auto',ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None)->LanceQueryBuilder

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search.

All query options are defined in LanceQueryBuilder.

Examples:

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>data=[...{"original_width":100,"caption":"bar","vector":[0.1,2.3,4.5]},...{"original_width":2000,"caption":"foo","vector":[0.5,3.4,1.3]},...{"original_width":3000,"caption":"test","vector":[0.3,6.2,2.6]}...]>>>table=db.create_table("my_table",data)>>>query=[0.4,1.4,2.4]>>>(table.search(query)....where("original_width > 1000",prefilter=True)....select(["caption","original_width","vector"])....limit(2)....to_pandas())  caption  original_width           vector  _distance0     foo            2000  [0.5, 3.4, 1.3]   5.2200001    test            3000  [0.3, 6.2, 2.6]  23.089996

Parameters:

  • query (Optional[Union[VEC,str, 'PIL.Image.Image',Tuple,FullTextQuery]], default:None) –

The targeted vector to search for.

  • default None. Acceptable types are: list, np.ndarray, PIL.Image.Image

  • If None then the select/where/limit clauses are applied to filter the table

  • vector_column_name (Optional[str], default:None) –

    The name of the vector column to search.

    The vector column needs to be a pyarrow fixed size list type

  • If not specified then the vector column is inferred from the table schema

  • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

  • query_type (QueryType, default:'auto') –

    default "auto".Acceptable types are: "vector", "fts", "hybrid", or "auto"

    • If "auto" then the query type is inferred from the query;

    • If query is a list/np.ndarray then the query type is "vector";

    • If query is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found.

  • If query is a string, then the query type is "vector" if the table has embedding functions, else the query type is "fts"

Returns:

  • LanceQueryBuilder

A query builder object representing the query. Once executed, the query returns

    • selected columns

    • the vector

  • and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code inlancedb/table.py
@abstractmethoddefsearch(self,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType="auto",ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->LanceQueryBuilder:"""Create a search query to find the nearest neighbors    of the given query vector. We currently support [vector search][search]    and [full-text search][experimental-full-text-search].    All query options are defined in    [LanceQueryBuilder][lancedb.query.LanceQueryBuilder].    Examples    --------    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> data = [    ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},    ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},    ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}    ... ]    >>> table = db.create_table("my_table", data)    >>> query = [0.4, 1.4, 2.4]    >>> (table.search(query)    ...     .where("original_width > 1000", prefilter=True)    ...     .select(["caption", "original_width", "vector"])    ...     .limit(2)    ...     .to_pandas())      caption  original_width           vector  _distance    0     foo            2000  [0.5, 3.4, 1.3]   5.220000    1    test            3000  [0.3, 6.2, 2.6]  23.089996    Parameters    ----------    query: list/np.ndarray/str/PIL.Image.Image, default None        The targetted vector to search for.        - *default None*.        Acceptable types are: list, np.ndarray, PIL.Image.Image        - If None then the select/where/limit clauses are applied to filter        the table    vector_column_name: str, optional        The name of the vector column to search.        The vector column needs to be a pyarrow fixed size list type        - If not specified then the vector column is inferred from        the table schema        - If the table has multiple vector columns then the *vector_column_name*        needs to be specified. Otherwise, an error is raised.    query_type: str        *default "auto"*.        Acceptable types are: "vector", "fts", "hybrid", or "auto"        - If "auto" then the query type is inferred from the query;            - If `query` is a list/np.ndarray then the query type is            "vector";            - If `query` is a PIL.Image.Image then either do vector search,            or raise an error if no corresponding embedding function is found.        - If `query` is a string, then the query type is "vector" if the        table has embedding functions else the query type is "fts"    Returns    -------    LanceQueryBuilder        A query builder object representing the query.        Once executed, the query returns        - selected columns        - the vector        - and also the "_distance" column which is the distance between the query        vector and the returned vector.    """raiseNotImplementedError

deleteabstractmethod

delete(where:str)->DeleteResult

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

  • where (str) –

    The SQL where clause to use when deleting rows.

    • For example, 'x = 2' or 'x IN (1, 2, 3)'.

    The filter must not be empty, or it will error.

Returns:

  • DeleteResult

    An object containing the new version number of the table after deletion.

Examples:

>>>importlancedb>>>data=[...{"x":1,"vector":[1.0,2]},...{"x":2,"vector":[3.0,4]},...{"x":3,"vector":[5.0,6]}...]>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",data)>>>table.to_pandas()   x      vector0  1  [1.0, 2.0]1  2  [3.0, 4.0]2  3  [5.0, 6.0]>>>table.delete("x = 2")DeleteResult(version=2)>>>table.to_pandas()   x      vector0  1  [1.0, 2.0]1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>>to_remove=[1,5]>>>to_remove=", ".join([str(v)forvinto_remove])>>>to_remove'1, 5'>>>table.delete(f"x IN ({to_remove})")DeleteResult(version=3)>>>table.to_pandas()   x      vector0  3  [5.0, 6.0]
Source code inlancedb/table.py
@abstractmethoddefdelete(self,where:str)->DeleteResult:"""Delete rows from the table.    This can be used to delete a single row, many rows, all rows, or    sometimes no rows (if your predicate matches nothing).    Parameters    ----------    where: str        The SQL where clause to use when deleting rows.        - For example, 'x = 2' or 'x IN (1, 2, 3)'.        The filter must not be empty, or it will error.    Returns    -------    DeleteResult        An object containing the new version number of the table after deletion.    Examples    --------    >>> import lancedb    >>> data = [    ...    {"x": 1, "vector": [1.0, 2]},    ...    {"x": 2, "vector": [3.0, 4]},    ...    {"x": 3, "vector": [5.0, 6]}    ... ]    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data)    >>> table.to_pandas()       x      vector    0  1  [1.0, 2.0]    1  2  [3.0, 4.0]    2  3  [5.0, 6.0]    >>> table.delete("x = 2")    DeleteResult(version=2)    >>> table.to_pandas()       x      vector    0  1  [1.0, 2.0]    1  3  [5.0, 6.0]    If you have a list of values to delete, you can combine them into a    stringified list and use the `IN` operator:    >>> to_remove = [1, 5]    >>> to_remove = ", ".join([str(v) for v in to_remove])    >>> to_remove    '1, 5'    >>> table.delete(f"x IN ({to_remove})")    DeleteResult(version=3)    >>> table.to_pandas()       x      vector    0  3  [5.0, 6.0]    """raiseNotImplementedError

updateabstractmethod

update(where:Optional[str]=None,values:Optional[dict]=None,*,values_sql:Optional[Dict[str,str]]=None)->UpdateResult

This can be used to update zero to all rows depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.

Either values or values_sql must be provided. You cannot provide both.

Parameters:

  • where (Optional[str], default:None) –

The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.

  • values (Optional[dict], default:None) –

The values to update. The keys are the column names and the values are the values to set.

  • values_sql (Optional[Dict[str,str]], default:None) –

The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.

Returns:

  • UpdateResult
    • rows_updated: The number of rows that were updated
    • version: The new version number of the table after the update

Examples:

>>>importlancedb>>>importpandasaspd>>>data=pd.DataFrame({"x":[1,2,3],"vector":[[1.0,2],[3,4],[5,6]]})>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",data)>>>table.to_pandas()   x      vector0  1  [1.0, 2.0]1  2  [3.0, 4.0]2  3  [5.0, 6.0]>>>table.update(where="x = 2",values={"vector":[10.0,10]})UpdateResult(rows_updated=1, version=2)>>>table.to_pandas()   x        vector0  1    [1.0, 2.0]1  3    [5.0, 6.0]2  2  [10.0, 10.0]>>>table.update(values_sql={"x":"x + 1"})UpdateResult(rows_updated=3, version=3)>>>table.to_pandas()   x        vector0  2    [1.0, 2.0]1  4    [5.0, 6.0]2  3  [10.0, 10.0]
Source code inlancedb/table.py
@abstractmethoddefupdate(self,where:Optional[str]=None,values:Optional[dict]=None,*,values_sql:Optional[Dict[str,str]]=None,)->UpdateResult:"""    This can be used to update zero to all rows depending on how many    rows match the where clause. If no where clause is provided, then    all rows will be updated.    Either `values` or `values_sql` must be provided. You cannot provide    both.    Parameters    ----------    where: str, optional        The SQL where clause to use when updating rows. For example, 'x = 2'        or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.    values: dict, optional        The values to update. The keys are the column names and the values        are the values to set.    values_sql: dict, optional        The values to update, expressed as SQL expression strings. These can        reference existing columns. For example, {"x": "x + 1"} will increment        the x column by 1.    Returns    -------    UpdateResult        - rows_updated: The number of rows that were updated        - version: The new version number of the table after the update    Examples    --------    >>> import lancedb    >>> import pandas as pd    >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data)    >>> table.to_pandas()       x      vector    0  1  [1.0, 2.0]    1  2  [3.0, 4.0]    2  3  [5.0, 6.0]    >>> table.update(where="x = 2", values={"vector": [10.0, 10]})    UpdateResult(rows_updated=1, version=2)    >>> table.to_pandas()       x        vector    0  1    [1.0, 2.0]    1  3    [5.0, 6.0]    2  2  [10.0, 10.0]    >>> table.update(values_sql={"x": "x + 1"})    UpdateResult(rows_updated=3, version=3)    >>> table.to_pandas()       x        vector    0  2    [1.0, 2.0]    1  4    [5.0, 6.0]    2  3  [10.0, 10.0]    """raiseNotImplementedError

cleanup_old_versionsabstractmethod

cleanup_old_versions(older_than:Optional[timedelta]=None,*,delete_unverified:bool=False)->'CleanupStats'

Clean up old versions of the table, freeing disk space.

Parameters:

  • older_than (Optional[timedelta], default:None) –

The minimum age of the version to delete. If None, then this defaults to two weeks.

  • delete_unverified (bool, default:False) –

Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than older_than.

Returns:

  • CleanupStats

The stats of the cleanup operation, including how many bytes were freed.

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages cleanup for you automatically)
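
A minimal sketch (LanceDB OSS only; the retention window is illustrative):

from datetime import timedelta

stats = table.cleanup_old_versions(older_than=timedelta(days=30))  # stats reports how many bytes were freed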

Source code inlancedb/table.py
@abstractmethoddefcleanup_old_versions(self,older_than:Optional[timedelta]=None,*,delete_unverified:bool=False,)->"CleanupStats":"""    Clean up old versions of the table, freeing disk space.    Parameters    ----------    older_than: timedelta, default None        The minimum age of the version to delete. If None, then this defaults        to two weeks.    delete_unverified: bool, default False        Because they may be part of an in-progress transaction, files newer        than 7 days old are not deleted by default. If you are sure that        there are no in-progress transactions, then you can set this to True        to delete all files older than `older_than`.    Returns    -------    CleanupStats        The stats of the cleanup operation, including how many bytes were        freed.    See Also    --------    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive        optimization operation that includes cleanup as well as other operations.    Notes    -----    This function is not available in LanceDb Cloud (since LanceDB    Cloud manages cleanup for you automatically)    """

compact_filesabstractmethod

compact_files(*args,**kwargs)

Run the compaction process on the table. This can be run after making several small appends to optimize the table for faster reads.

Arguments are passed on to Lance's compact_files (lance.dataset.DatasetOptimizer.compact_files). For most cases, the default should be fine.

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages compaction for you automatically)

Source code inlancedb/table.py
@abstractmethoddefcompact_files(self,*args,**kwargs):"""    Run the compaction process on the table.    This can be run after making several small appends to optimize the table    for faster reads.    Arguments are passed onto Lance's    [compact_files][lance.dataset.DatasetOptimizer.compact_files].    For most cases, the default should be fine.    See Also    --------    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive        optimization operation that includes cleanup as well as other operations.    Notes    -----    This function is not available in LanceDB Cloud (since LanceDB    Cloud manages compaction for you automatically)    """

optimizeabstractmethod

optimize(*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain:bool=False)

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

  • Compaction: Merges small files into larger ones
  • Prune: Removes old versions of the dataset
  • Index: Optimizes the indices, adding new data to existing indices

Parameters:

  • cleanup_older_than (Optional[timedelta], default:None) –

All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.

  • delete_unverified (bool, default:False) –

Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.

  • retrain (bool, default:False) –

If True, retrain the vector indices. This refines the IVF clustering and quantization, which may improve the search accuracy. It's faster than re-creating the index from scratch, so it's recommended to try this first when the data distribution has changed significantly.

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.

How frequently an application should call optimize depends on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
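
A minimal sketch (the retention window here is illustrative):

from datetime import timedelta

table.optimize(cleanup_older_than=timedelta(days=0))  # compact, prune all but the latest version, and update indices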

Source code inlancedb/table.py
@abstractmethoddefoptimize(self,*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain:bool=False,):"""    Optimize the on-disk data and indices for better performance.    Modeled after ``VACUUM`` in PostgreSQL.    Optimization covers three operations:     * Compaction: Merges small files into larger ones     * Prune: Removes old versions of the dataset     * Index: Optimizes the indices, adding new data to existing indices    Parameters    ----------    cleanup_older_than: timedelta, optional default 7 days        All files belonging to versions older than this will be removed.  Set        to 0 days to remove all versions except the latest.  The latest version        is never removed.    delete_unverified: bool, default False        Files leftover from a failed transaction may appear to be part of an        in-progress operation (e.g. appending new data) and these files will not        be deleted unless they are at least 7 days old. If delete_unverified is True        then these files will be deleted regardless of their age.    retrain: bool, default False        If True, retrain the vector indices, this would refine the IVF clustering        and quantization, which may improve the search accuracy. It's faster than        re-creating the index from scratch, so it's recommended to try this first,        when the data distribution has changed significantly.    Experimental API    ----------------    The optimization process is undergoing active development and may change.    Our goal with these changes is to improve the performance of optimization and    reduce the complexity.    That being said, it is essential today to run optimize if you want the best    performance.  It should be stable and safe to use in production, but it our    hope that the API may be simplified (or not even need to be called) in the    future.    The frequency an application shoudl call optimize is based on the frequency of    data modifications.  If data is frequently added, deleted, or updated then    optimize should be run frequently.  A good rule of thumb is to run optimize if    you have added or modified 100,000 or more records or run more than 20 data    modification operations.    """

list_indicesabstractmethod

list_indices()->Iterable[IndexConfig]

List all indices that have been created with Table.create_index

Source code inlancedb/table.py
@abstractmethoddeflist_indices(self)->Iterable[IndexConfig]:"""    List all indices that have been created with    [Table.create_index][lancedb.table.Table.create_index]    """

index_statsabstractmethod

index_stats(index_name:str)->Optional[IndexStatistics]

Retrieve statistics about an index

Parameters:

  • index_name (str) –

    The name of the index to retrieve statistics for

Returns:

  • IndexStatistics or None

    The statistics about the index. Returns None if the index does not exist.
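
A brief sketch; "my_idx" is a hypothetical index name:

for idx in table.list_indices():
    print(idx)
stats = table.index_stats("my_idx")  # None if no such index exists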

Source code inlancedb/table.py
@abstractmethoddefindex_stats(self,index_name:str)->Optional[IndexStatistics]:"""    Retrieve statistics about an index    Parameters    ----------    index_name: str        The name of the index to retrieve statistics for    Returns    -------    IndexStatistics or None        The statistics about the index. Returns None if the index does not exist.    """

add_columnsabstractmethod

add_columns(transforms:Dict[str,str]|Field|List[Field]|Schema)

Add new columns with defined values.

Parameters:

  • transforms (Dict[str,str] |Field |List[Field] |Schema) –

A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, a pyarrow Field or Schema can be provided to add new columns with the specified data types. The new columns will be initialized with null values.

Returns:

  • AddColumnsResult

    version: the new version number of the table after adding columns.

Source code inlancedb/table.py
@abstractmethoddefadd_columns(self,transforms:Dict[str,str]|pa.Field|List[pa.Field]|pa.Schema):"""    Add new columns with defined values.    Parameters    ----------    transforms: Dict[str, str], pa.Field, List[pa.Field], pa.Schema        A map of column name to a SQL expression to use to calculate the        value of the new column. These expressions will be evaluated for        each row in the table, and can reference existing columns.        Alternatively, a pyarrow Field or Schema can be provided to add        new columns with the specified data types. The new columns will        be initialized with null values.    Returns    -------    AddColumnsResult        version: the new version number of the table after adding columns.    """
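
A minimal sketch of both forms, assuming an existing table with a numeric price column (all names below are hypothetical):

import lancedb
import pyarrow as pa

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

# New column computed per row from a SQL expression over existing columns
tbl.add_columns({"price_with_tax": "price * 1.2"})

# New, null-initialized column with an explicit data type
tbl.add_columns(pa.field("notes", pa.string()))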

alter_columnsabstractmethod

alter_columns(*alterations:Iterable[Dict[str,str]])

Alter column names and nullability.

Parameters:

  • alterations (Iterable[Dict[str,Any]], default:()) –

    A sequence of dictionaries, each with the following keys:

    • "path": str
      The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".

    • "rename": str, optional
      The new name of the column. If not specified, the column name is not changed.

    • "data_type": pyarrow.DataType, optional
      The new data type of the column. Existing values will be casted to this type. If not specified, the column data type is not changed.

    • "nullable": bool, optional
      Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Returns:

  • AlterColumnsResult

    version: the new version number of the table after the alteration.

Source code inlancedb/table.py
@abstractmethoddefalter_columns(self,*alterations:Iterable[Dict[str,str]]):"""    Alter column names and nullability.    Parameters    ----------    alterations : Iterable[Dict[str, Any]]        A sequence of dictionaries, each with the following keys:        - "path": str            The column path to alter. For a top-level column, this is the name.            For a nested column, this is the dot-separated path, e.g. "a.b.c".        - "rename": str, optional            The new name of the column. If not specified, the column name is            not changed.        - "data_type": pyarrow.DataType, optional           The new data type of the column. Existing values will be casted           to this type. If not specified, the column data type is not changed.        - "nullable": bool, optional            Whether the column should be nullable. If not specified, the column            nullability is not changed. Only non-nullable columns can be changed            to nullable. Currently, you cannot change a nullable column to            non-nullable.    Returns    -------    AlterColumnsResult        version: the new version number of the table after the alteration.    """
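
For example, continuing with the hypothetical price column from the previous sketch, a rename plus a nullability change followed by a cast might look like:

import lancedb
import pyarrow as pa

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

# Rename "price" to "cost" and allow nulls in one alteration
tbl.alter_columns({"path": "price", "rename": "cost", "nullable": True})

# Cast the renamed column to float32
tbl.alter_columns({"path": "cost", "data_type": pa.float32()})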

drop_columnsabstractmethod

drop_columns(columns:Iterable[str])->DropColumnsResult

Drop columns from the table.

Parameters:

  • columns (Iterable[str]) –

    The names of the columns to drop.

Returns:

  • DropColumnsResult

    version: the new version number of the table dropping the columns.

Source code inlancedb/table.py
@abstractmethoddefdrop_columns(self,columns:Iterable[str])->DropColumnsResult:"""    Drop columns from the table.    Parameters    ----------    columns : Iterable[str]        The names of the columns to drop.    Returns    -------    DropColumnsResult        version: the new version number of the table dropping the columns.    """
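
A short sketch, dropping the hypothetical columns added above and inspecting the returned version:

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

result = tbl.drop_columns(["notes", "price_with_tax"])
print(result.version)   # version of the table after the drop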

checkoutabstractmethod

checkout(version:Union[int,str])

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the restore method.

Any operation that modifies the table will fail while the table is in a checked out state.

Parameters:

  • version (Union[int,str]) –

    The version to check out. A version number (int) or a tag (str) can be provided.

To return the table to a normal state use checkout_latest.
Source code inlancedb/table.py
@abstractmethoddefcheckout(self,version:Union[int,str]):"""    Checks out a specific version of the Table    Any read operation on the table will now access the data at the checked out    version. As a consequence, calling this method will disable any read consistency    interval that was previously set.    This is a read-only operation that turns the table into a sort of "view"    or "detached head".  Other table instances will not be affected.  To make the    change permanent you can use the `[Self::restore]` method.    Any operation that modifies the table will fail while the table is in a checked    out state.    Parameters    ----------    version: int | str,        The version to check out. A version number (`int`) or a tag        (`str`) can be provided.    To return the table to a normal state use `[Self::checkout_latest]`    """
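
A sketch of a typical checkout flow; the table name and version number are hypothetical:

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

versions = tbl.list_versions()
print(versions)                 # inspect the available versions

tbl.checkout(1)                 # hypothetical version number; read-only view
print(tbl.count_rows())         # reads now see that version's data
tbl.checkout_latest()           # return the table to the latest version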

checkout_latestabstractmethod

checkout_latest()

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a checkout operation.

Source code inlancedb/table.py
@abstractmethoddefcheckout_latest(self):"""    Ensures the table is pointing at the latest version    This can be used to manually update a table when the read_consistency_interval    is None    It can also be used to undo a `[Self::checkout]` operation    """

restoreabstractmethod

restore(version:Optional[Union[int,str]]=None)

Restore a version of the table. This is an in-place operation.

This creates a new version where the data is equivalent to the specified previous version. Data is not copied (as of python-v0.2.1).

Parameters:

  • version (int orstr, default:None) –

    The version number or version tag to restore. If unspecified then restores the currently checked out version. If the currently checked out version is the latest version then this is a no-op.

Source code inlancedb/table.py
@abstractmethoddefrestore(self,version:Optional[Union[int,str]]=None):"""Restore a version of the table. This is an in-place operation.    This creates a new version where the data is equivalent to the    specified previous version. Data is not copied (as of python-v0.2.1).    Parameters    ----------    version : int or str, default None        The version number or version tag to restore.        If unspecified then restores the currently checked out version.        If the currently checked out version is the        latest version then this is a no-op.    """
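
For example, to permanently roll back to an earlier version (the version number and table name are hypothetical):

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

# Restore a specific version directly
tbl.restore(5)

# Or check out a version first and then make it permanent
tbl.checkout(5)
tbl.restore()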

list_versionsabstractmethod

list_versions()->List[Dict[str,Any]]

List all versions of the table

Source code inlancedb/table.py
@abstractmethoddeflist_versions(self)->List[Dict[str,Any]]:"""List all versions of the table"""

uses_v2_manifest_pathsabstractmethod

uses_v2_manifest_paths()->bool

Check if the table is using the new v2 manifest paths.

Returns:

  • bool

    True if the table is using the new v2 manifest paths, False otherwise.

Source code inlancedb/table.py
@abstractmethoddefuses_v2_manifest_paths(self)->bool:"""    Check if the table is using the new v2 manifest paths.    Returns    -------    bool        True if the table is using the new v2 manifest paths, False otherwise.    """

migrate_v2_manifest_pathsabstractmethod

migrate_v2_manifest_paths()

Migrate the manifest paths to the new format.

This will update the manifest to use the new v2 format for paths.

This function is idempotent, and can be run multiple times without changing the state of the object store.

Danger

This should not be run while other concurrent operations are happening. It should also run to completion before resuming other operations.

You can use Table.uses_v2_manifest_paths to check if the table is already using the new path style.

Source code inlancedb/table.py
@abstractmethoddefmigrate_v2_manifest_paths(self):"""    Migrate the manifest paths to the new format.    This will update the manifest to use the new v2 format for paths.    This function is idempotent, and can be run multiple times without    changing the state of the object store.    !!! danger        This should not be run while other concurrent operations are happening.        And it should also run until completion before resuming other operations.    You can use    [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]    to check if the table is already using the new path style.    """
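
A guarded sketch of the migration, run only when nothing else is touching the table (the table name is hypothetical):

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

if not tbl.uses_v2_manifest_paths():
    # Safe only while no concurrent readers or writers are active (see warning above)
    tbl.migrate_v2_manifest_paths()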

Querying (Synchronous)

lancedb.query.Query

Bases:BaseModel

A LanceDB Query

Queries are constructed by the Table.search method. This class is a Python representation of the query. Normally you will not need to interact with this class directly. You can build up a query and execute it using collection methods such as to_batches(), to_arrow(), to_pandas(), etc.

However, you can use the to_query_object() method to get the underlying query object. This can be useful for serializing a query or using it in a different context.

Attributes:

  • filter (Optional[str]) –

    sql filter to refine the query with

  • limit (Optional[int]) –

    The limit on the number of results to return. If this is a vector or FTS query, then this is required. If this is a plain SQL query, then this is optional.

  • offset (Optional[int]) –

    The offset to start fetching results from

    This is ignored for vector / FTS search (will be None).

  • columns (Optional[Union[List[str],Dict[str,str]]]) –

    which columns to return in the results

    This can be a list of column names or a dictionary. If it is a dictionary, then the keys are the column names and the values are SQL expressions to use to calculate the result.

    If this is None then all columns are returned. This can be expensive.

  • with_row_id (Optional[bool]) –

    if True then include the row id in the results

  • vector (Optional[Union[List[float],List[List[float]],Array,List[Array]]]) –

    the vector to search for, if this is a vector search or hybrid search. It will be None for full text search and plain SQL filtering.

  • vector_column (Optional[str]) –

    the name of the vector column to use for vector search

    If this is None then a default vector column will be used.

  • distance_type (Optional[str]) –

    the distance type to use for vector search

    This can be l2 (default), cosine and dot. See metric definitions for more details.

    If this is not a vector search this will be None.

  • postfilter (bool) –

    if True then apply the filter after vector / FTS search. This is ignored for plain SQL filtering.

  • nprobes (Optional[int]) –

    The number of IVF partitions to search. If this is None then a default number of partitions will be used.

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

    Will be None if this is not a vector search.

  • refine_factor (Optional[int]) –

    Refine the results by reading extra elements and re-ranking them in memory.

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

    Will be None if this is not a vector search.

  • lower_bound (Optional[float]) –

    The lower bound for distance search

    Only results with a distance greater than or equal to this value will be returned.

    This will only be set on vector search.

  • upper_bound (Optional[float]) –

    The upper bound for distance search

    Only results with a distance less than or equal to this value will be returned.

    This will only be set on vector search.

  • ef (Optional[int]) –

    The size of the nearest neighbor list maintained during HNSW search

    This will only be set on vector search.

  • full_text_query (Optional[Union[str,dict]]) –

    The full text search query

    This can be a string or a dictionary. A dictionary will be used to search multiple columns. The keys are the column names and the values are the search queries.

    This will only be set on FTS or hybrid queries.

  • fast_search (Optional[bool]) –

    Skip a flat search of unindexed data. This will improve search performance but search results will not include unindexed data.

    The default is False

Source code inlancedb/query.py
classQuery(pydantic.BaseModel):"""A LanceDB Query    Queries are constructed by the `Table.search` method.  This class is a    python representation of the query.  Normally you will not need to interact    with this class directly.  You can build up a query and execute it using    collection methods such as `to_batches()`, `to_arrow()`, `to_pandas()`,    etc.    However, you can use the `to_query()` method to get the underlying query object.    This can be useful for serializing a query or using it in a different context.    Attributes    ----------    filter : Optional[str]        sql filter to refine the query with    limit : Optional[int]        The limit on the number of results to return.  If this is a vector or FTS query,        then this is required.  If this is a plain SQL query, then this is optional.    offset: Optional[int]        The offset to start fetching results from        This is ignored for vector / FTS search (will be None).    columns : Optional[Union[List[str], Dict[str, str]]]        which columns to return in the results        This can be a list of column names or a dictionary.  If it is a dictionary,        then the keys are the column names and the values are sql expressions to        use to calculate the result.        If this is None then all columns are returned.  This can be expensive.    with_row_id : Optional[bool]        if True then include the row id in the results    vector : Optional[Union[List[float], List[List[float]], pa.Array, List[pa.Array]]]        the vector to search for, if this a vector search or hybrid search.  It will        be None for full text search and plain SQL filtering.    vector_column : Optional[str]        the name of the vector column to use for vector search        If this is None then a default vector column will be used.    distance_type : Optional[str]        the distance type to use for vector search        This can be l2 (default), cosine and dot.  See [metric definitions][search] for        more details.        If this is not a vector search this will be None.    postfilter : bool        if True then apply the filter after vector / FTS search.  This is ignored for        plain SQL filtering.    nprobes : Optional[int]        The number of IVF partitions to search.  If this is None then a default        number of partitions will be used.        - A higher number makes search more accurate but also slower.        - See discussion in [Querying an ANN Index][querying-an-ann-index] for          tuning advice.        Will be None if this is not a vector search.    refine_factor : Optional[int]        Refine the results by reading extra elements and re-ranking them in memory.        - A higher number makes search more accurate but also slower.        - See discussion in [Querying an ANN Index][querying-an-ann-index] for          tuning advice.        Will be None if this is not a vector search.    lower_bound : Optional[float]        The lower bound for distance search        Only results with a distance greater than or equal to this value        will be returned.        This will only be set on vector search.    upper_bound : Optional[float]        The upper bound for distance search        Only results with a distance less than or equal to this value        will be returned.        This will only be set on vector search.    ef : Optional[int]        The size of the nearest neighbor list maintained during HNSW search        This will only be set on vector search.    
full_text_query : Optional[Union[str, dict]]        The full text search query        This can be a string or a dictionary.  A dictionary will be used to search        multiple columns.  The keys are the column names and the values are the        search queries.        This will only be set on FTS or hybrid queries.    fast_search: Optional[bool]        Skip a flat search of unindexed data. This will improve        search performance but search results will not include unindexed data.        The default is False    """# The name of the vector column to use for vector search.vector_column:Optional[str]=None# vector to search for## Note: today this will be floats on the sync path and pa.Array on the async# path though in the future we should unify this to pa.Array everywherevector:Annotated[Optional[Union[List[float],List[List[float]],pa.Array,List[pa.Array]]],ensure_vector_query,]=None# sql filter to refine the query withfilter:Optional[str]=None# if True then apply the filter after vector searchpostfilter:Optional[bool]=None# full text search queryfull_text_query:Optional[FullTextSearchQuery]=None# top k results to returnlimit:Optional[int]=None# distance type to use for vector searchdistance_type:Optional[str]=None# which columns to return in the resultscolumns:Optional[Union[List[str],Dict[str,str]]]=None# minimum number of IVF partitions to search## If None then a default value (20) will be used.minimum_nprobes:Optional[int]=None# maximum number of IVF partitions to search## If None then a default value (20) will be used.## If 0 then no limit will be applied and all partitions could be searched# if needed to satisfy the limit.maximum_nprobes:Optional[int]=None# lower bound for distance searchlower_bound:Optional[float]=None# upper bound for distance searchupper_bound:Optional[float]=None# multiplier for the number of results to inspect for rerankingrefine_factor:Optional[int]=None# if true, include the row id in the resultswith_row_id:Optional[bool]=None# offset to start fetching results fromoffset:Optional[int]=None# if true, will only search the indexed datafast_search:Optional[bool]=None# size of the nearest neighbor list maintained during HNSW searchef:Optional[int]=None# Bypass the vector index and use a brute force searchbypass_vector_index:Optional[bool]=None@classmethoddeffrom_inner(cls,req:PyQueryRequest)->Self:query=cls()query.limit=req.limitquery.offset=req.offsetquery.filter=req.filterquery.full_text_query=req.full_text_searchquery.columns=req.selectquery.with_row_id=req.with_row_idquery.vector_column=req.columnquery.vector=req.query_vectorquery.distance_type=req.distance_typequery.minimum_nprobes=req.minimum_nprobesquery.maximum_nprobes=req.maximum_nprobesquery.lower_bound=req.lower_boundquery.upper_bound=req.upper_boundquery.ef=req.efquery.refine_factor=req.refine_factorquery.bypass_vector_index=req.bypass_vector_indexquery.postfilter=req.postfilterifreq.full_text_searchisnotNone:query.full_text_query=FullTextSearchQuery(columns=None,query=req.full_text_search,)returnquery# This tells pydantic to allow custom types (needed for the `vector` query since# pa.Array wouln't be allowed otherwise)ifPYDANTIC_VERSION.major<2:# Pydantic 1.x compatclassConfig:arbitrary_types_allowed=Trueelse:model_config={"arbitrary_types_allowed":True}
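
As a sketch of how this class shows up in practice, a builder produced by Table.search can be converted into its serializable Query form via to_query_object(); the table, data, and column names below are hypothetical:

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

builder = tbl.search([0.1, 0.2]).where("b > 2").limit(5)
query = builder.to_query_object()           # a Query instance
print(query.limit, query.filter, query.vector_column)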

lancedb.query.LanceQueryBuilder

Bases:ABC

An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.

Source code inlancedb/query.py
classLanceQueryBuilder(ABC):"""An abstract query builder. Subclasses are defined for vector search,    full text search, hybrid, and plain SQL filtering.    """@classmethoddefcreate(cls,table:"Table",query:Optional[Union[np.ndarray,str,"PIL.Image.Image",Tuple]],query_type:str,vector_column_name:str,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,fast_search:bool=None,)->Self:"""        Create a query builder based on the given query and query type.        Parameters        ----------        table: Table            The table to query.        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]            The query to use. If None, an empty query builder is returned            which performs simple SQL filtering.        query_type: str            The type of query to perform. One of "vector", "fts", "hybrid", or "auto".            If "auto", the query type is inferred based on the query.        vector_column_name: str            The name of the vector column to use for vector search.        fast_search: bool            Skip flat search of unindexed data.        """# Check hybrid search first as it supports empty query patternifquery_type=="hybrid":# hybrid fts and vector queryreturnLanceHybridQueryBuilder(table,query,vector_column_name,fts_columns=fts_columns)ifqueryisNone:returnLanceEmptyQueryBuilder(table)# remember the string query for reranking purposestr_query=queryifisinstance(query,str)elseNone# convert "auto" query_type to "vector", "fts"# or "hybrid" and convert the query to vector if neededquery,query_type=cls._resolve_query(table,query,query_type,vector_column_name)ifquery_type=="hybrid":returnLanceHybridQueryBuilder(table,query,vector_column_name,fts_columns=fts_columns)ifisinstance(query,(str,FullTextQuery)):# ftsreturnLanceFtsQueryBuilder(table,query,ordering_field_name=ordering_field_name,fts_columns=fts_columns,)ifisinstance(query,list):query=np.array(query,dtype=np.float32)elifisinstance(query,np.ndarray):query=query.astype(np.float32)else:raiseTypeError(f"Unsupported query type:{type(query)}")returnLanceVectorQueryBuilder(table,query,vector_column_name,str_query,fast_search)@classmethoddef_resolve_query(cls,table,query,query_type,vector_column_name):# If query_type is fts, then query must be a string.# otherwise raise TypeErrorifquery_type=="fts":ifnotisinstance(query,(str,FullTextQuery)):raiseTypeError(f"'fts' query must be a string or FullTextQuery:{type(query)}")returnquery,query_typeelifquery_type=="vector":query=cls._query_to_vector(table,query,vector_column_name)returnquery,query_typeelifquery_type=="auto":ifisinstance(query,(list,np.ndarray)):returnquery,"vector"else:conf=table.embedding_functions.get(vector_column_name)ifconfisnotNone:query=conf.function.compute_query_embeddings_with_retry(query)[0]returnquery,"vector"else:returnquery,"fts"else:raiseValueError(f"Invalid query_type, must be 'vector', 'fts', or 'auto':{query_type}")@classmethoddef_query_to_vector(cls,table,query,vector_column_name):ifisinstance(query,(list,np.ndarray)):returnqueryconf=table.embedding_functions.get(vector_column_name)ifconfisnotNone:returnconf.function.compute_query_embeddings_with_retry(query)[0]else:msg=f"No embedding function 
for{vector_column_name}"raiseValueError(msg)def__init__(self,table:"Table"):self._table=tableself._limit=Noneself._offset=Noneself._columns=Noneself._where=Noneself._postfilter=Noneself._with_row_id=Noneself._vector=Noneself._text=Noneself._ef=Noneself._bypass_vector_index=None@deprecation.deprecated(deprecated_in="0.3.1",removed_in="0.4.0",current_version=__version__,details="Use to_pandas() instead",)defto_df(self)->"pd.DataFrame":"""        *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*        Execute the query and return the results as a pandas DataFrame.        In addition to the selected columns, LanceDB also returns a vector        and also the "_distance" column which is the distance between the query        vector and the returned vector.        """returnself.to_pandas()defto_pandas(self,flatten:Optional[Union[int,bool]]=None,*,timeout:Optional[timedelta]=None,)->"pd.DataFrame":"""        Execute the query and return the results as a pandas DataFrame.        In addition to the selected columns, LanceDB also returns a vector        and also the "_distance" column which is the distance between the query        vector and the returned vector.        Parameters        ----------        flatten: Optional[Union[int, bool]]            If flatten is True, flatten all nested columns.            If flatten is an integer, flatten the nested columns up to the            specified depth.            If unspecified, do not flatten the nested columns.        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        """tbl=flatten_columns(self.to_arrow(timeout=timeout),flatten)returntbl.to_pandas()@abstractmethoddefto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:"""        Execute the query and return the results as an        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).        In addition to the selected columns, LanceDB also returns a vector        and also the "_distance" column which is the distance between the query        vector and the returned vectors.        Parameters        ----------        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        """raiseNotImplementedError@abstractmethoddefto_batches(self,/,batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:"""        Execute the query and return the results as a pyarrow        [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)        Parameters        ----------        batch_size: int            The maximum number of selected records in a RecordBatch object.        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        """raiseNotImplementedErrordefto_list(self,*,timeout:Optional[timedelta]=None)->List[dict]:"""        Execute the query and return the results as a list of dictionaries.        Each list entry is a dictionary with the selected column names as keys,        or all table columns if `select` is not called. The vector and the "_distance"        fields are returned whether or not they're explicitly selected.        Parameters        ----------        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        
"""returnself.to_arrow(timeout=timeout).to_pylist()defto_pydantic(self,model:Type[LanceModel],*,timeout:Optional[timedelta]=None)->List[LanceModel]:"""Return the table as a list of pydantic models.        Parameters        ----------        model: Type[LanceModel]            The pydantic model to use.        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        Returns        -------        List[LanceModel]        """return[model(**{k:vfork,vinrow.items()ifkinmodel.field_names()})forrowinself.to_arrow(timeout=timeout).to_pylist()]defto_polars(self,*,timeout:Optional[timedelta]=None)->"pl.DataFrame":"""        Execute the query and return the results as a Polars DataFrame.        In addition to the selected columns, LanceDB also returns a vector        and also the "_distance" column which is the distance between the query        vector and the returned vector.        Parameters        ----------        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        """importpolarsasplreturnpl.from_arrow(self.to_arrow(timeout=timeout))deflimit(self,limit:Union[int,None])->Self:"""Set the maximum number of results to return.        Parameters        ----------        limit: int            The maximum number of results to return.            The default query limit is 10 results.            For ANN/KNN queries, you must specify a limit.            For plain searches, all records are returned if limit not set.            *WARNING* if you have a large dataset, setting            the limit to a large number, e.g. the table size,            can potentially result in reading a            large amount of data into memory and cause            out of memory issues.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """iflimitisNoneorlimit<=0:ifisinstance(self,LanceVectorQueryBuilder):raiseValueError("Limit is required for ANN/KNN queries")else:self._limit=Noneelse:self._limit=limitreturnselfdefoffset(self,offset:int)->Self:"""Set the offset for the results.        Parameters        ----------        offset: int            The offset to start fetching results from.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """ifoffsetisNoneoroffset<=0:self._offset=0else:self._offset=offsetreturnselfdefselect(self,columns:Union[list[str],dict[str,str]])->Self:"""Set the columns to return.        Parameters        ----------        columns: list of str, or dict of str to str default None            List of column names to be fetched.            Or a dictionary of column names to SQL expressions.            All columns are fetched if None or unspecified.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """ifisinstance(columns,list)orisinstance(columns,dict):self._columns=columnselse:raiseValueError("columns must be a list or a dictionary")returnselfdefwhere(self,where:str,prefilter:bool=True)->Self:"""Set the where clause.        Parameters        ----------        where: str            The where clause which is a valid SQL where clause. See            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_            for valid SQL expressions.        
prefilter: bool, default True            If True, apply the filter before vector search, otherwise the            filter is applied on the result of vector search.            This feature is **EXPERIMENTAL** and may be removed and modified            without warning in the future.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """self._where=whereself._postfilter=notprefilterreturnselfdefwith_row_id(self,with_row_id:bool)->Self:"""Set whether to return row ids.        Parameters        ----------        with_row_id: bool            If True, return _rowid column in the results.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """self._with_row_id=with_row_idreturnselfdefexplain_plan(self,verbose:Optional[bool]=False)->str:"""Return the execution plan for this query.        Examples        --------        >>> import lancedb        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])        >>> query = [100, 100]        >>> plan = table.search(query).explain_plan(True)        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]        GlobalLimitExec: skip=0, fetch=10          FilterExec: _distance@2 IS NOT NULL            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]              KNNVectorDistance: metric=l2                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false        Parameters        ----------        verbose : bool, default False            Use a verbose output format.        Returns        -------        plan : str        """# noqa: E501returnself._table._explain_plan(self.to_query_object(),verbose=verbose)defanalyze_plan(self)->str:"""        Run the query and return its execution plan with runtime metrics.        This returns detailed metrics for each step, such as elapsed time,        rows processed, bytes read, and I/O stats. It is useful for debugging        and performance tuning.        Examples        --------        >>> import lancedb        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])        >>> query = [100, 100]        >>> plan = table.search(query).analyze_plan()        >>> print(plan)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE        AnalyzeExec verbose=true, metrics=[]          ProjectionExec: expr=[...], metrics=[...]            GlobalLimitExec: skip=0, fetch=10, metrics=[...]              FilterExec: _distance@2 IS NOT NULL,              metrics=[output_rows=..., elapsed_compute=...]                SortExec: TopK(fetch=10), expr=[...],                preserve_partitioning=[...],                metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]                  KNNVectorDistance: metric=l2,                  metrics=[output_rows=..., elapsed_compute=..., output_batches=...]                    LanceScan: uri=..., projection=[vector], row_id=true,                    row_addr=false, ordered=false,                    metrics=[output_rows=..., elapsed_compute=...,                    bytes_read=..., iops=..., requests=...]        Returns        -------        plan : str            The physical query execution plan with runtime metrics.        
"""returnself._table._analyze_plan(self.to_query_object())defvector(self,vector:Union[np.ndarray,list])->Self:"""Set the vector to search for.        Parameters        ----------        vector: np.ndarray or list            The vector to search for.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """raiseNotImplementedErrordeftext(self,text:str|FullTextQuery)->Self:"""Set the text to search for.        Parameters        ----------        text: str | FullTextQuery            If a string, it is treated as a MatchQuery.            If a FullTextQuery object, it is used directly.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """raiseNotImplementedError@abstractmethoddefrerank(self,reranker:Reranker)->Self:"""Rerank the results using the specified reranker.        Parameters        ----------        reranker: Reranker            The reranker to use.        Returns        -------        The LanceQueryBuilder object.        """raiseNotImplementedError@abstractmethoddefto_query_object(self)->Query:"""Return a serializable representation of the query        Returns        -------        Query            The serializable representation of the query        """raiseNotImplementedError

createclassmethod

create(table:'Table',query:Optional[Union[ndarray,str,'PIL.Image.Image',Tuple]],query_type:str,vector_column_name:str,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,fast_search:bool=None)->Self

Create a query builder based on the given query and query type.

Parameters:

  • table ('Table') –

    The table to query.

  • query (Optional[Union[ndarray,str, 'PIL.Image.Image',Tuple]]) –

    The query to use. If None, an empty query builder is returned which performs simple SQL filtering.

  • query_type (str) –

    The type of query to perform. One of "vector", "fts", "hybrid", or "auto". If "auto", the query type is inferred based on the query.

  • vector_column_name (str) –

    The name of the vector column to use for vector search.

  • fast_search (bool, default:None) –

    Skip flat search of unindexed data.

Source code inlancedb/query.py
@classmethoddefcreate(cls,table:"Table",query:Optional[Union[np.ndarray,str,"PIL.Image.Image",Tuple]],query_type:str,vector_column_name:str,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,fast_search:bool=None,)->Self:"""    Create a query builder based on the given query and query type.    Parameters    ----------    table: Table        The table to query.    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]        The query to use. If None, an empty query builder is returned        which performs simple SQL filtering.    query_type: str        The type of query to perform. One of "vector", "fts", "hybrid", or "auto".        If "auto", the query type is inferred based on the query.    vector_column_name: str        The name of the vector column to use for vector search.    fast_search: bool        Skip flat search of unindexed data.    """# Check hybrid search first as it supports empty query patternifquery_type=="hybrid":# hybrid fts and vector queryreturnLanceHybridQueryBuilder(table,query,vector_column_name,fts_columns=fts_columns)ifqueryisNone:returnLanceEmptyQueryBuilder(table)# remember the string query for reranking purposestr_query=queryifisinstance(query,str)elseNone# convert "auto" query_type to "vector", "fts"# or "hybrid" and convert the query to vector if neededquery,query_type=cls._resolve_query(table,query,query_type,vector_column_name)ifquery_type=="hybrid":returnLanceHybridQueryBuilder(table,query,vector_column_name,fts_columns=fts_columns)ifisinstance(query,(str,FullTextQuery)):# ftsreturnLanceFtsQueryBuilder(table,query,ordering_field_name=ordering_field_name,fts_columns=fts_columns,)ifisinstance(query,list):query=np.array(query,dtype=np.float32)elifisinstance(query,np.ndarray):query=query.astype(np.float32)else:raiseTypeError(f"Unsupported query type:{type(query)}")returnLanceVectorQueryBuilder(table,query,vector_column_name,str_query,fast_search)

to_df

to_df()->'pd.DataFrame'

Deprecated alias for to_pandas(). Please use to_pandas() instead.

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code inlancedb/query.py
@deprecation.deprecated(deprecated_in="0.3.1",removed_in="0.4.0",current_version=__version__,details="Use to_pandas() instead",)defto_df(self)->"pd.DataFrame":"""    *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*    Execute the query and return the results as a pandas DataFrame.    In addition to the selected columns, LanceDB also returns a vector    and also the "_distance" column which is the distance between the query    vector and the returned vector.    """returnself.to_pandas()

to_pandas

to_pandas(flatten:Optional[Union[int,bool]]=None,*,timeout:Optional[timedelta]=None)->'pd.DataFrame'

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Parameters:

  • flatten (Optional[Union[int,bool]], default:None) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
defto_pandas(self,flatten:Optional[Union[int,bool]]=None,*,timeout:Optional[timedelta]=None,)->"pd.DataFrame":"""    Execute the query and return the results as a pandas DataFrame.    In addition to the selected columns, LanceDB also returns a vector    and also the "_distance" column which is the distance between the query    vector and the returned vector.    Parameters    ----------    flatten: Optional[Union[int, bool]]        If flatten is True, flatten all nested columns.        If flatten is an integer, flatten the nested columns up to the        specified depth.        If unspecified, do not flatten the nested columns.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """tbl=flatten_columns(self.to_arrow(timeout=timeout),flatten)returntbl.to_pandas()

to_arrowabstractmethod

to_arrow(*,timeout:Optional[timedelta]=None)->Table

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
@abstractmethoddefto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:"""    Execute the query and return the results as an    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).    In addition to the selected columns, LanceDB also returns a vector    and also the "_distance" column which is the distance between the query    vector and the returned vectors.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """raiseNotImplementedError

to_batchesabstractmethod

to_batches(batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None)->RecordBatchReader

Execute the query and return the results as a pyarrow RecordBatchReader

Parameters:

  • batch_size (Optional[int], default:None) –

    The maximum number of selected records in a RecordBatch object.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
@abstractmethoddefto_batches(self,/,batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:"""    Execute the query and return the results as a pyarrow    [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)    Parameters    ----------    batch_size: int        The maximum number of selected records in a RecordBatch object.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """raiseNotImplementedError
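
A brief sketch of streaming results in batches; the table name, query vector, and batch size are hypothetical:

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

reader = tbl.search([0.1, 0.2]).limit(1000).to_batches(batch_size=100)
for batch in reader:
    print(batch.num_rows)   # each item is a pyarrow.RecordBatch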

to_list

to_list(*,timeout:Optional[timedelta]=None)->List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
defto_list(self,*,timeout:Optional[timedelta]=None)->List[dict]:"""    Execute the query and return the results as a list of dictionaries.    Each list entry is a dictionary with the selected column names as keys,    or all table columns if `select` is not called. The vector and the "_distance"    fields are returned whether or not they're explicitly selected.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """returnself.to_arrow(timeout=timeout).to_pylist()

to_pydantic

to_pydantic(model:Type[LanceModel],*,timeout:Optional[timedelta]=None)->List[LanceModel]

Return the table as a list of pydantic models.

Parameters:

  • model (Type[LanceModel]) –

    The pydantic model to use.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Returns:

Source code inlancedb/query.py
defto_pydantic(self,model:Type[LanceModel],*,timeout:Optional[timedelta]=None)->List[LanceModel]:"""Return the table as a list of pydantic models.    Parameters    ----------    model: Type[LanceModel]        The pydantic model to use.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    Returns    -------    List[LanceModel]    """return[model(**{k:vfork,vinrow.items()ifkinmodel.field_names()})forrowinself.to_arrow(timeout=timeout).to_pylist()]
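
A minimal sketch using a LanceModel from lancedb.pydantic; the model fields must match columns of the (hypothetical) table:

import lancedb
from lancedb.pydantic import LanceModel, Vector

class Item(LanceModel):
    vector: Vector(2)   # 2-dimensional vector column
    b: int

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

for item in tbl.search([0.4, 0.4]).limit(2).to_pydantic(Item):
    print(item.b)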

to_polars

to_polars(*,timeout:Optional[timedelta]=None)->'pl.DataFrame'

Execute the query and return the results as a Polars DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
defto_polars(self,*,timeout:Optional[timedelta]=None)->"pl.DataFrame":"""    Execute the query and return the results as a Polars DataFrame.    In addition to the selected columns, LanceDB also returns a vector    and also the "_distance" column which is the distance between the query    vector and the returned vector.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """importpolarsasplreturnpl.from_arrow(self.to_arrow(timeout=timeout))

limit

limit(limit:Union[int,None])->Self

Set the maximum number of results to return.

Parameters:

  • limit (Union[int, None]) –

    The maximum number of results to return. The default query limit is 10 results. For ANN/KNN queries, you must specify a limit. For plain searches, all records are returned if limit is not set. WARNING: if you have a large dataset, setting the limit to a large number, e.g. the table size, can potentially result in reading a large amount of data into memory and cause out of memory issues.

Returns:

Source code inlancedb/query.py
deflimit(self,limit:Union[int,None])->Self:"""Set the maximum number of results to return.    Parameters    ----------    limit: int        The maximum number of results to return.        The default query limit is 10 results.        For ANN/KNN queries, you must specify a limit.        For plain searches, all records are returned if limit not set.        *WARNING* if you have a large dataset, setting        the limit to a large number, e.g. the table size,        can potentially result in reading a        large amount of data into memory and cause        out of memory issues.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """iflimitisNoneorlimit<=0:ifisinstance(self,LanceVectorQueryBuilder):raiseValueError("Limit is required for ANN/KNN queries")else:self._limit=Noneelse:self._limit=limitreturnself

offset

offset(offset:int)->Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Returns:

Source code inlancedb/query.py
defoffset(self,offset:int)->Self:"""Set the offset for the results.    Parameters    ----------    offset: int        The offset to start fetching results from.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """ifoffsetisNoneoroffset<=0:self._offset=0else:self._offset=offsetreturnself

select

select(columns:Union[list[str],dict[str,str]])->Self

Set the columns to return.

Parameters:

  • columns (Union[list[str],dict[str,str]]) –

    List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

Returns:

Source code inlancedb/query.py
defselect(self,columns:Union[list[str],dict[str,str]])->Self:"""Set the columns to return.    Parameters    ----------    columns: list of str, or dict of str to str default None        List of column names to be fetched.        Or a dictionary of column names to SQL expressions.        All columns are fetched if None or unspecified.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """ifisinstance(columns,list)orisinstance(columns,dict):self._columns=columnselse:raiseValueError("columns must be a list or a dictionary")returnself

where

where(where:str,prefilter:bool=True)->Self

Set the where clause.

Parameters:

  • where (str) –

    The where clause which is a valid SQL where clause. See Lance filter pushdown (https://lancedb.github.io/lance/read_and_write.html#filter-push-down) for valid SQL expressions.

  • prefilter (bool, default:True) –

    If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.

Returns:

Source code inlancedb/query.py
defwhere(self,where:str,prefilter:bool=True)->Self:"""Set the where clause.    Parameters    ----------    where: str        The where clause which is a valid SQL where clause. See        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_        for valid SQL expressions.    prefilter: bool, default True        If True, apply the filter before vector search, otherwise the        filter is applied on the result of vector search.        This feature is **EXPERIMENTAL** and may be removed and modified        without warning in the future.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """self._where=whereself._postfilter=notprefilterreturnself
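
Putting the builder methods together, a filtered vector search might look like the following sketch; the data and column names are hypothetical:

import lancedb

db = lancedb.connect("~/.lancedb")
tbl = db.open_table("my_table")

results = (
    tbl.search([0.4, 0.4])
    .where("b < 10", prefilter=True)   # filter before the vector search
    .select(["b", "vector"])
    .limit(2)
    .to_list()
)
print(results)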

with_row_id

with_row_id(with_row_id:bool)->Self

Set whether to return row ids.

Parameters:

  • with_row_id (bool) –

    If True, return _rowid column in the results.

Returns:

Source code inlancedb/query.py
defwith_row_id(self,with_row_id:bool)->Self:"""Set whether to return row ids.    Parameters    ----------    with_row_id: bool        If True, return _rowid column in the results.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """self._with_row_id=with_row_idreturnself

explain_plan

explain_plan(verbose:Optional[bool]=False)->str

Return the execution plan for this query.

Examples:

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",[{"vector":[99.0,99]}])>>>query=[100,100]>>>plan=table.search(query).explain_plan(True)>>>print(plan)ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]GlobalLimitExec: skip=0, fetch=10  FilterExec: _distance@2 IS NOT NULL    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]      KNNVectorDistance: metric=l2        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
defexplain_plan(self,verbose:Optional[bool]=False)->str:"""Return the execution plan for this query.    Examples    --------    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])    >>> query = [100, 100]    >>> plan = table.search(query).explain_plan(True)    >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]    GlobalLimitExec: skip=0, fetch=10      FilterExec: _distance@2 IS NOT NULL        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]          KNNVectorDistance: metric=l2            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false    Parameters    ----------    verbose : bool, default False        Use a verbose output format.    Returns    -------    plan : str    """# noqa: E501returnself._table._explain_plan(self.to_query_object(),verbose=verbose)

analyze_plan

analyze_plan()->str

Run the query and return its execution plan with runtime metrics.

This returns detailed metrics for each step, such as elapsed time, rows processed, bytes read, and I/O stats. It is useful for debugging and performance tuning.

Examples:

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",[{"vector":[99.0,99]}])>>>query=[100,100]>>>plan=table.search(query).analyze_plan()>>>print(plan)AnalyzeExec verbose=true, metrics=[]  ProjectionExec: expr=[...], metrics=[...]    GlobalLimitExec: skip=0, fetch=10, metrics=[...]      FilterExec: _distance@2 IS NOT NULL,      metrics=[output_rows=..., elapsed_compute=...]        SortExec: TopK(fetch=10), expr=[...],        preserve_partitioning=[...],        metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]          KNNVectorDistance: metric=l2,          metrics=[output_rows=..., elapsed_compute=..., output_batches=...]            LanceScan: uri=..., projection=[vector], row_id=true,            row_addr=false, ordered=false,            metrics=[output_rows=..., elapsed_compute=...,            bytes_read=..., iops=..., requests=...]

Returns:

  • plan (str) –

    The physical query execution plan with runtime metrics.

Source code inlancedb/query.py
defanalyze_plan(self)->str:"""    Run the query and return its execution plan with runtime metrics.    This returns detailed metrics for each step, such as elapsed time,    rows processed, bytes read, and I/O stats. It is useful for debugging    and performance tuning.    Examples    --------    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])    >>> query = [100, 100]    >>> plan = table.search(query).analyze_plan()    >>> print(plan)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    AnalyzeExec verbose=true, metrics=[]      ProjectionExec: expr=[...], metrics=[...]        GlobalLimitExec: skip=0, fetch=10, metrics=[...]          FilterExec: _distance@2 IS NOT NULL,          metrics=[output_rows=..., elapsed_compute=...]            SortExec: TopK(fetch=10), expr=[...],            preserve_partitioning=[...],            metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]              KNNVectorDistance: metric=l2,              metrics=[output_rows=..., elapsed_compute=..., output_batches=...]                LanceScan: uri=..., projection=[vector], row_id=true,                row_addr=false, ordered=false,                metrics=[output_rows=..., elapsed_compute=...,                bytes_read=..., iops=..., requests=...]    Returns    -------    plan : str        The physical query execution plan with runtime metrics.    """returnself._table._analyze_plan(self.to_query_object())

vector

vector(vector:Union[ndarray,list])->Self

Set the vector to search for.

Parameters:

  • vector (Union[ndarray,list]) –

    The vector to search for.

Returns:

Source code inlancedb/query.py
defvector(self,vector:Union[np.ndarray,list])->Self:"""Set the vector to search for.    Parameters    ----------    vector: np.ndarray or list        The vector to search for.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """raiseNotImplementedError

text

text(text:str|FullTextQuery)->Self

Set the text to search for.

Parameters:

  • text (str |FullTextQuery) –

    If a string, it is treated as a MatchQuery. If a FullTextQuery object, it is used directly.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
deftext(self,text:str|FullTextQuery)->Self:"""Set the text to search for.    Parameters    ----------    text: str | FullTextQuery        If a string, it is treated as a MatchQuery.        If a FullTextQuery object, it is used directly.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """raiseNotImplementedError

rerank abstractmethod

rerank(reranker:Reranker)->Self

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

Returns:

  • The LanceQueryBuilder object.
Source code inlancedb/query.py
@abstractmethoddefrerank(self,reranker:Reranker)->Self:"""Rerank the results using the specified reranker.    Parameters    ----------    reranker: Reranker        The reranker to use.    Returns    -------    The LanceQueryBuilder object.    """raiseNotImplementedError

to_query_object abstractmethod

to_query_object()->Query

Return a serializable representation of the query

Returns:

  • Query

    The serializable representation of the query

Source code inlancedb/query.py
@abstractmethoddefto_query_object(self)->Query:"""Return a serializable representation of the query    Returns    -------    Query        The serializable representation of the query    """raiseNotImplementedError

lancedb.query.LanceVectorQueryBuilder

Bases:LanceQueryBuilder

Examples:

>>>importlancedb>>>data=[{"vector":[1.1,1.2],"b":2},...{"vector":[0.5,1.3],"b":4},...{"vector":[0.4,0.4],"b":6},...{"vector":[0.4,0.4],"b":10}]>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",data=data)>>>(table.search([0.4,0.4])....distance_type("cosine")....where("b < 10")....select(["b","vector"])....limit(2)....to_pandas())   b      vector  _distance0  6  [0.4, 0.4]   0.0000001  2  [1.1, 1.2]   0.000944
Source code inlancedb/query.py
classLanceVectorQueryBuilder(LanceQueryBuilder):"""    Examples    --------    >>> import lancedb    >>> data = [{"vector": [1.1, 1.2], "b": 2},    ...         {"vector": [0.5, 1.3], "b": 4},    ...         {"vector": [0.4, 0.4], "b": 6},    ...         {"vector": [0.4, 0.4], "b": 10}]    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data=data)    >>> (table.search([0.4, 0.4])    ...       .distance_type("cosine")    ...       .where("b < 10")    ...       .select(["b", "vector"])    ...       .limit(2)    ...       .to_pandas())       b      vector  _distance    0  6  [0.4, 0.4]   0.000000    1  2  [1.1, 1.2]   0.000944    """def__init__(self,table:"Table",query:Union[np.ndarray,list,"PIL.Image.Image"],vector_column:str,str_query:Optional[str]=None,fast_search:bool=None,):super().__init__(table)self._query=queryself._distance_type=Noneself._minimum_nprobes=Noneself._maximum_nprobes=Noneself._lower_bound=Noneself._upper_bound=Noneself._refine_factor=Noneself._vector_column=vector_columnself._postfilter=Noneself._reranker=Noneself._str_query=str_queryself._fast_search=fast_searchdefmetric(self,metric:Literal["l2","cosine","dot"])->LanceVectorQueryBuilder:"""Set the distance metric to use.        This is an alias for distance_type() and may be deprecated in the future.        Please use distance_type() instead.        Parameters        ----------        metric: "l2" or "cosine" or "dot"            The distance metric to use. By default "l2" is used.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """returnself.distance_type(metric)defdistance_type(self,distance_type:Literal["l2","cosine","dot"])->"LanceVectorQueryBuilder":"""Set the distance metric to use.        When performing a vector search we try and find the "nearest" vectors according        to some kind of distance metric. This parameter controls which distance metric        to use.        Note: if there is a vector index then the distance type used MUST match the        distance type used to train the vector index. If this is not done then the        results will be invalid.        Parameters        ----------        distance_type: "l2" or "cosine" or "dot"            The distance metric to use. By default "l2" is used.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._distance_type=distance_type.lower()returnselfdefnprobes(self,nprobes:int)->LanceVectorQueryBuilder:"""Set the number of probes to use.        Higher values will yield better recall (more likely to find vectors if        they exist) at the expense of latency.        See discussion in [Querying an ANN Index][querying-an-ann-index] for        tuning advice.        This method sets both the minimum and maximum number of probes to the same        value. See `minimum_nprobes` and `maximum_nprobes` for more fine-grained        control.        Parameters        ----------        nprobes: int            The number of probes to use.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._minimum_nprobes=nprobesself._maximum_nprobes=nprobesreturnselfdefminimum_nprobes(self,minimum_nprobes:int)->LanceVectorQueryBuilder:"""Set the minimum number of probes to use.        See `nprobes` for more details.        These partitions will be searched on every vector query and will increase recall        at the expense of latency.        
"""self._minimum_nprobes=minimum_nprobesreturnselfdefmaximum_nprobes(self,maximum_nprobes:int)->LanceVectorQueryBuilder:"""Set the maximum number of probes to use.        See `nprobes` for more details.        If this value is greater than `minimum_nprobes` then the excess partitions        will be searched only if we have not found enough results.        This can be useful when there is a narrow filter to allow these queries to        spend more time searching and avoid potential false negatives.        If this value is 0 then no limit will be applied and all partitions could be        searched if needed to satisfy the limit.        """self._maximum_nprobes=maximum_nprobesreturnselfdefdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceVectorQueryBuilder:"""Set the distance range to use.        Only rows with distances within range [lower_bound, upper_bound)        will be returned.        Parameters        ----------        lower_bound: Optional[float]            The lower bound of the distance range.        upper_bound: Optional[float]            The upper bound of the distance range.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._lower_bound=lower_boundself._upper_bound=upper_boundreturnselfdefef(self,ef:int)->LanceVectorQueryBuilder:"""Set the number of candidates to consider during search.        Higher values will yield better recall (more likely to find vectors if        they exist) at the expense of latency.        This only applies to the HNSW-related index.        The default value is 1.5 * limit.        Parameters        ----------        ef: int            The number of candidates to consider during search.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._ef=efreturnselfdefrefine_factor(self,refine_factor:int)->LanceVectorQueryBuilder:"""Set the refine factor to use, increasing the number of vectors sampled.        As an example, a refine factor of 2 will sample 2x as many vectors as        requested, re-ranks them, and returns the top half most relevant results.        See discussion in [Querying an ANN Index][querying-an-ann-index] for        tuning advice.        Parameters        ----------        refine_factor: int            The refine factor to use.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._refine_factor=refine_factorreturnselfdefto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:"""        Execute the query and return the results as an        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).        In addition to the selected columns, LanceDB also returns a vector        and also the "_distance" column which is the distance between the query        vector and the returned vectors.        Parameters        ----------        timeout: Optional[timedelta]            The maximum time to wait for the query to complete.            If None, wait indefinitely.        
"""returnself.to_batches(timeout=timeout).read_all()defto_query_object(self)->Query:"""        Build a Query object        This can be used to serialize a query        """vector=self._queryifisinstance(self._query,list)elseself._query.tolist()ifisinstance(vector[0],np.ndarray):vector=[v.tolist()forvinvector]returnQuery(vector=vector,filter=self._where,postfilter=self._postfilter,limit=self._limit,distance_type=self._distance_type,columns=self._columns,minimum_nprobes=self._minimum_nprobes,maximum_nprobes=self._maximum_nprobes,lower_bound=self._lower_bound,upper_bound=self._upper_bound,refine_factor=self._refine_factor,vector_column=self._vector_column,with_row_id=self._with_row_id,offset=self._offset,fast_search=self._fast_search,ef=self._ef,bypass_vector_index=self._bypass_vector_index,)defto_batches(self,/,batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:"""        Execute the query and return the result as a RecordBatchReader object.        Parameters        ----------        batch_size: int            The maximum number of selected records in a RecordBatch object.        timeout: timedelta, default None            The maximum time to wait for the query to complete.            If None, wait indefinitely.        Returns        -------        pa.RecordBatchReader        """vector=self._queryifisinstance(self._query,list)elseself._query.tolist()ifisinstance(vector[0],np.ndarray):vector=[v.tolist()forvinvector]query=self.to_query_object()result_set=self._table._execute_query(query,batch_size=batch_size,timeout=timeout)ifself._rerankerisnotNone:rs_table=result_set.read_all()result_set=self._reranker.rerank_vector(self._str_query,rs_table)check_reranker_result(result_set)# convert result_set back to RecordBatchReaderresult_set=pa.RecordBatchReader.from_batches(result_set.schema,result_set.to_batches())returnresult_setdefwhere(self,where:str,prefilter:bool=None)->LanceVectorQueryBuilder:"""Set the where clause.        Parameters        ----------        where: str            The where clause which is a valid SQL where clause. See            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_            for valid SQL expressions.        prefilter: bool, default True            If True, apply the filter before vector search, otherwise the            filter is applied on the result of vector search.        Returns        -------        LanceQueryBuilder            The LanceQueryBuilder object.        """self._where=whereifprefilterisnotNone:self._postfilter=notprefilterreturnselfdefrerank(self,reranker:Reranker,query_string:Optional[str]=None)->LanceVectorQueryBuilder:"""Rerank the results using the specified reranker.        Parameters        ----------        reranker: Reranker            The reranker to use.        query_string: Optional[str]            The query to use for reranking. This needs to be specified explicitly here            as the query used for vector search may already be vectorized and the            reranker requires a string query.            This is only required if the query used for vector search is not a string.            Note: This doesn't yet support the case where the query is multimodal or a            list of vectors.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._reranker=rerankerifself._str_queryisNoneandquery_stringisNone:raiseValueError("""                The query used for vector search is not a string.      
          In this case, the reranker query needs to be specified explicitly.                """)ifquery_stringisnotNoneandnotisinstance(query_string,str):raiseValueError("Reranking currently only supports string queries")self._str_query=query_stringifquery_stringisnotNoneelseself._str_queryifreranker.score=="all":self.with_row_id(True)returnselfdefbypass_vector_index(self)->LanceVectorQueryBuilder:"""        If this is called then any vector index is skipped        An exhaustive (flat) search will be performed.  The query vector will        be compared to every vector in the table.  At high scales this can be        expensive.  However, this is often still useful.  For example, skipping        the vector index can give you ground truth results which you can use to        calculate your recall to select an appropriate value for nprobes.        Returns        -------        LanceVectorQueryBuilder            The LanceVectorQueryBuilder object.        """self._bypass_vector_index=Truereturnself

metric

metric(metric:Literal['l2','cosine','dot'])->LanceVectorQueryBuilder

Set the distance metric to use.

This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.

Parameters:

  • metric (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defmetric(self,metric:Literal["l2","cosine","dot"])->LanceVectorQueryBuilder:"""Set the distance metric to use.    This is an alias for distance_type() and may be deprecated in the future.    Please use distance_type() instead.    Parameters    ----------    metric: "l2" or "cosine" or "dot"        The distance metric to use. By default "l2" is used.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """returnself.distance_type(metric)

distance_type

distance_type(distance_type:Literal['l2','cosine','dot'])->'LanceVectorQueryBuilder'

Set the distance metric to use.

When performing a vector search we try and find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defdistance_type(self,distance_type:Literal["l2","cosine","dot"])->"LanceVectorQueryBuilder":"""Set the distance metric to use.    When performing a vector search we try and find the "nearest" vectors according    to some kind of distance metric. This parameter controls which distance metric    to use.    Note: if there is a vector index then the distance type used MUST match the    distance type used to train the vector index. If this is not done then the    results will be invalid.    Parameters    ----------    distance_type: "l2" or "cosine" or "dot"        The distance metric to use. By default "l2" is used.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._distance_type=distance_type.lower()returnself
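
For illustration only, a minimal sketch (the table name and data are made up) showing the distance type being set on the query builder. If the table has a vector index, use the same distance type the index was trained with:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> tbl = db.create_table("distance_type_demo",
...                       [{"vector": [0.1, 0.2]}, {"vector": [0.9, 0.8]}],
...                       mode="overwrite")
>>> tbl.search([0.1, 0.2]).distance_type("cosine").limit(1).to_pandas()  # doctest: +SKIP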

nprobes

nprobes(nprobes:int)->LanceVectorQueryBuilder

Set the number of probes to use.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

See discussion in Querying an ANN Index for tuning advice.

This method sets both the minimum and maximum number of probes to the same value. See minimum_nprobes and maximum_nprobes for more fine-grained control.

Parameters:

  • nprobes (int) –

    The number of probes to use.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defnprobes(self,nprobes:int)->LanceVectorQueryBuilder:"""Set the number of probes to use.    Higher values will yield better recall (more likely to find vectors if    they exist) at the expense of latency.    See discussion in [Querying an ANN Index][querying-an-ann-index] for    tuning advice.    This method sets both the minimum and maximum number of probes to the same    value. See `minimum_nprobes` and `maximum_nprobes` for more fine-grained    control.    Parameters    ----------    nprobes: int        The number of probes to use.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._minimum_nprobes=nprobesself._maximum_nprobes=nprobesreturnself
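
As a rough, illustrative sketch (assuming tbl is a Table with an IVF-style vector index already built, as in the sketch above), raising nprobes trades latency for recall:

>>> (tbl.search([0.4, 0.4])   # doctest: +SKIP
...     .nprobes(20)          # search 20 partitions instead of the default
...     .limit(10)
...     .to_pandas())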

minimum_nprobes

minimum_nprobes(minimum_nprobes:int)->LanceVectorQueryBuilder

Set the minimum number of probes to use.

See nprobes for more details.

These partitions will be searched on every vector query and will increase recall at the expense of latency.

Source code inlancedb/query.py
defminimum_nprobes(self,minimum_nprobes:int)->LanceVectorQueryBuilder:"""Set the minimum number of probes to use.    See `nprobes` for more details.    These partitions will be searched on every vector query and will increase recall    at the expense of latency.    """self._minimum_nprobes=minimum_nprobesreturnself

maximum_nprobes

maximum_nprobes(maximum_nprobes:int)->LanceVectorQueryBuilder

Set the maximum number of probes to use.

See nprobes for more details.

If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.

This can be useful when there is a narrow filter, allowing these queries to spend more time searching and avoid potential false negatives.

If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.

Source code inlancedb/query.py
defmaximum_nprobes(self,maximum_nprobes:int)->LanceVectorQueryBuilder:"""Set the maximum number of probes to use.    See `nprobes` for more details.    If this value is greater than `minimum_nprobes` then the excess partitions    will be searched only if we have not found enough results.    This can be useful when there is a narrow filter to allow these queries to    spend more time searching and avoid potential false negatives.    If this value is 0 then no limit will be applied and all partitions could be    searched if needed to satisfy the limit.    """self._maximum_nprobes=maximum_nprobesreturnself
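
A hedged sketch (tbl and the filter column name are illustrative) combining the two settings so a narrowly filtered query may search extra partitions only when the first pass does not return enough rows:

>>> (tbl.search([0.4, 0.4])      # doctest: +SKIP
...     .where("category = 'rare'", prefilter=True)
...     .minimum_nprobes(20)     # always search at least 20 partitions
...     .maximum_nprobes(100)    # search up to 100 if results are still short
...     .limit(10)
...     .to_pandas())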

distance_range

distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceVectorQueryBuilder

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default:None) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default:None) –

    The upper bound of the distance range.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceVectorQueryBuilder:"""Set the distance range to use.    Only rows with distances within range [lower_bound, upper_bound)    will be returned.    Parameters    ----------    lower_bound: Optional[float]        The lower bound of the distance range.    upper_bound: Optional[float]        The upper bound of the distance range.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._lower_bound=lower_boundself._upper_bound=upper_boundreturnself
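
For example (illustrative only; the threshold is arbitrary and assumes tbl as above), keep only matches whose distance is below 0.5:

>>> (tbl.search([0.4, 0.4])               # doctest: +SKIP
...     .distance_range(upper_bound=0.5)
...     .limit(10)
...     .to_pandas())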

ef

ef(ef:int)->LanceVectorQueryBuilder

Set the number of candidates to consider during search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

This only applies to the HNSW-related index. The default value is 1.5 * limit.

Parameters:

  • ef (int) –

    The number of candidates to consider during search.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defef(self,ef:int)->LanceVectorQueryBuilder:"""Set the number of candidates to consider during search.    Higher values will yield better recall (more likely to find vectors if    they exist) at the expense of latency.    This only applies to the HNSW-related index.    The default value is 1.5 * limit.    Parameters    ----------    ef: int        The number of candidates to consider during search.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._ef=efreturnself
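
A minimal sketch, assuming tbl has an HNSW index; a larger ef considers more candidates per query:

>>> (tbl.search([0.4, 0.4])   # doctest: +SKIP
...     .ef(50)               # default would be roughly 1.5 * limit
...     .limit(10)
...     .to_pandas())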

refine_factor

refine_factor(refine_factor:int)->LanceVectorQueryBuilder

Set the refine factor to use, increasing the number of vectors sampled.

As an example, a refine factor of 2 will sample 2x as many vectors as requested, re-rank them, and return the top half most relevant results.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

  • refine_factor (int) –

    The refine factor to use.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defrefine_factor(self,refine_factor:int)->LanceVectorQueryBuilder:"""Set the refine factor to use, increasing the number of vectors sampled.    As an example, a refine factor of 2 will sample 2x as many vectors as    requested, re-ranks them, and returns the top half most relevant results.    See discussion in [Querying an ANN Index][querying-an-ann-index] for    tuning advice.    Parameters    ----------    refine_factor: int        The refine factor to use.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._refine_factor=refine_factorreturnself
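
For example (illustrative, assuming tbl as above), fetch twice the requested candidates and re-rank them with exact distances before returning the top 10:

>>> (tbl.search([0.4, 0.4])   # doctest: +SKIP
...     .nprobes(20)
...     .refine_factor(2)     # read 2x the candidates, keep the best 10
...     .limit(10)
...     .to_pandas())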

to_arrow

to_arrow(*,timeout:Optional[timedelta]=None)->Table

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns the vector and the "_distance" column, which is the distance between the query vector and the returned vectors.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code inlancedb/query.py
defto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:"""    Execute the query and return the results as an    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).    In addition to the selected columns, LanceDB also returns a vector    and also the "_distance" column which is the distance between the query    vector and the returned vectors.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If None, wait indefinitely.    """returnself.to_batches(timeout=timeout).read_all()

to_query_object

to_query_object()->Query

Build a Query object

This can be used to serialize a query

Source code inlancedb/query.py
defto_query_object(self)->Query:"""    Build a Query object    This can be used to serialize a query    """vector=self._queryifisinstance(self._query,list)elseself._query.tolist()ifisinstance(vector[0],np.ndarray):vector=[v.tolist()forvinvector]returnQuery(vector=vector,filter=self._where,postfilter=self._postfilter,limit=self._limit,distance_type=self._distance_type,columns=self._columns,minimum_nprobes=self._minimum_nprobes,maximum_nprobes=self._maximum_nprobes,lower_bound=self._lower_bound,upper_bound=self._upper_bound,refine_factor=self._refine_factor,vector_column=self._vector_column,with_row_id=self._with_row_id,offset=self._offset,fast_search=self._fast_search,ef=self._ef,bypass_vector_index=self._bypass_vector_index,)

to_batches

to_batches(batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None)->RecordBatchReader

Execute the query and return the result as a RecordBatchReader object.

Parameters:

  • batch_size (Optional[int], default:None) –

    The maximum number of selected records in a RecordBatch object.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Returns:

  • RecordBatchReader

Source code inlancedb/query.py
defto_batches(self,/,batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:"""    Execute the query and return the result as a RecordBatchReader object.    Parameters    ----------    batch_size: int        The maximum number of selected records in a RecordBatch object.    timeout: timedelta, default None        The maximum time to wait for the query to complete.        If None, wait indefinitely.    Returns    -------    pa.RecordBatchReader    """vector=self._queryifisinstance(self._query,list)elseself._query.tolist()ifisinstance(vector[0],np.ndarray):vector=[v.tolist()forvinvector]query=self.to_query_object()result_set=self._table._execute_query(query,batch_size=batch_size,timeout=timeout)ifself._rerankerisnotNone:rs_table=result_set.read_all()result_set=self._reranker.rerank_vector(self._str_query,rs_table)check_reranker_result(result_set)# convert result_set back to RecordBatchReaderresult_set=pa.RecordBatchReader.from_batches(result_set.schema,result_set.to_batches())returnresult_set
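
A sketch of streaming results instead of materializing them all at once (tbl and the batch handling are illustrative):

>>> reader = tbl.search([0.4, 0.4]).limit(10_000).to_batches(1_000)  # doctest: +SKIP
>>> for batch in reader:                                             # doctest: +SKIP
...     print(batch.num_rows)  # each item is a pyarrow.RecordBatch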

where

where(where:str,prefilter:bool=None)->LanceVectorQueryBuilder

Set the where clause.

Parameters:

  • where (str) –

    The where clause which is a valid SQL where clause. See Lance filter pushdown (https://lancedb.github.io/lance/read_and_write.html#filter-push-down) for valid SQL expressions.

  • prefilter (bool, default:None) –

    If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defwhere(self,where:str,prefilter:bool=None)->LanceVectorQueryBuilder:"""Set the where clause.    Parameters    ----------    where: str        The where clause which is a valid SQL where clause. See        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_        for valid SQL expressions.    prefilter: bool, default True        If True, apply the filter before vector search, otherwise the        filter is applied on the result of vector search.    Returns    -------    LanceQueryBuilder        The LanceQueryBuilder object.    """self._where=whereifprefilterisnotNone:self._postfilter=notprefilterreturnself
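
For example (reusing the "b" column from the class example above), filtering before the vector search so the limit applies only to rows that already satisfy the predicate:

>>> (tbl.search([0.4, 0.4])              # doctest: +SKIP
...     .where("b < 10", prefilter=True)
...     .limit(2)
...     .to_pandas())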

rerank

rerank(reranker:Reranker,query_string:Optional[str]=None)->LanceVectorQueryBuilder

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

  • query_string (Optional[str], default:None) –

    The query to use for reranking. This needs to be specified explicitly here as the query used for vector search may already be vectorized and the reranker requires a string query. This is only required if the query used for vector search is not a string. Note: This doesn't yet support the case where the query is multimodal or a list of vectors.

Returns:

  • LanceVectorQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defrerank(self,reranker:Reranker,query_string:Optional[str]=None)->LanceVectorQueryBuilder:"""Rerank the results using the specified reranker.    Parameters    ----------    reranker: Reranker        The reranker to use.    query_string: Optional[str]        The query to use for reranking. This needs to be specified explicitly here        as the query used for vector search may already be vectorized and the        reranker requires a string query.        This is only required if the query used for vector search is not a string.        Note: This doesn't yet support the case where the query is multimodal or a        list of vectors.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._reranker=rerankerifself._str_queryisNoneandquery_stringisNone:raiseValueError("""            The query used for vector search is not a string.            In this case, the reranker query needs to be specified explicitly.            """)ifquery_stringisnotNoneandnotisinstance(query_string,str):raiseValueError("Reranking currently only supports string queries")self._str_query=query_stringifquery_stringisnotNoneelseself._str_queryifreranker.score=="all":self.with_row_id(True)returnself
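
A hedged sketch: because the search query here is a vector, query_string must be passed for the reranker. CrossEncoderReranker is used purely for illustration and assumes its optional dependencies (e.g. sentence-transformers) are installed; tbl is assumed as above:

>>> from lancedb.rerankers import CrossEncoderReranker   # doctest: +SKIP
>>> (tbl.search([0.4, 0.4])                              # doctest: +SKIP
...     .rerank(CrossEncoderReranker(), query_string="what the vector represents")
...     .limit(10)
...     .to_pandas())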

bypass_vector_index

bypass_vector_index()->LanceVectorQueryBuilder

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Returns:

  • LanceVectorQueryBuilder –

    The LanceVectorQueryBuilder object.

Source code inlancedb/query.py
defbypass_vector_index(self)->LanceVectorQueryBuilder:"""    If this is called then any vector index is skipped    An exhaustive (flat) search will be performed.  The query vector will    be compared to every vector in the table.  At high scales this can be    expensive.  However, this is often still useful.  For example, skipping    the vector index can give you ground truth results which you can use to    calculate your recall to select an appropriate value for nprobes.    Returns    -------    LanceVectorQueryBuilder        The LanceVectorQueryBuilder object.    """self._bypass_vector_index=Truereturnself
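
An illustrative recall check (tbl assumed as above): run the same query through the index and as an exhaustive scan, then compare the returned row ids. with_row_id is used so both results carry the _rowid column:

>>> q = [0.4, 0.4]
>>> ann = tbl.search(q).nprobes(20).with_row_id(True).limit(10).to_arrow()             # doctest: +SKIP
>>> flat = tbl.search(q).bypass_vector_index().with_row_id(True).limit(10).to_arrow()  # doctest: +SKIP
>>> recall = len(set(ann["_rowid"].to_pylist()) & set(flat["_rowid"].to_pylist())) / 10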

lancedb.query.LanceFtsQueryBuilder

Bases:LanceQueryBuilder

A builder for full text search for LanceDB.

Source code inlancedb/query.py
classLanceFtsQueryBuilder(LanceQueryBuilder):"""A builder for full text search for LanceDB."""def__init__(self,table:"Table",query:str|FullTextQuery,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,):super().__init__(table)self._query=queryself._phrase_query=Falseself.ordering_field_name=ordering_field_nameself._reranker=Noneifisinstance(fts_columns,str):fts_columns=[fts_columns]self._fts_columns=fts_columnsdefphrase_query(self,phrase_query:bool=True)->LanceFtsQueryBuilder:"""Set whether to use phrase query.        Parameters        ----------        phrase_query: bool, default True            If True, then the query will be wrapped in quotes and            double quotes replaced by single quotes.        Returns        -------        LanceFtsQueryBuilder            The LanceFtsQueryBuilder object.        """self._phrase_query=phrase_queryreturnselfdefto_query_object(self)->Query:returnQuery(columns=self._columns,filter=self._where,limit=self._limit,postfilter=self._postfilter,with_row_id=self._with_row_id,full_text_query=FullTextSearchQuery(query=self._query,columns=self._fts_columns),offset=self._offset,)defto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:path,fs,exist=self._table._get_fts_index_path()ifexist:returnself.tantivy_to_arrow()query=self._queryifself._phrase_query:ifisinstance(query,str):ifnotquery.startswith('"')ornotquery.endswith('"'):query=f'"{query}"'elifisinstance(query,FullTextQuery)andnotisinstance(query,PhraseQuery):raiseTypeError("Please use PhraseQuery for phrase queries.")query=self.to_query_object()results=self._table._execute_query(query,timeout=timeout)results=results.read_all()ifself._rerankerisnotNone:results=self._reranker.rerank_fts(self._query,results)check_reranker_result(results)returnresultsdefto_batches(self,/,batch_size:Optional[int]=None,timeout:Optional[timedelta]=None):raiseNotImplementedError("to_batches on an FTS query")deftantivy_to_arrow(self)->pa.Table:try:importtantivyexceptImportError:raiseImportError("Please install tantivy-py `pip install tantivy` to use the full text search feature."# noqa: E501)from.ftsimportsearch_index# get the index pathpath,fs,exist=self._table._get_fts_index_path()# check if the index existifnotexist:raiseFileNotFoundError("Fts index does not exist. 
""Please first call table.create_fts_index(['<field_names>']) to ""create the fts index.")# Check that we are on local filesystemifnotisinstance(fs,pa_fs.LocalFileSystem):raiseNotImplementedError("Tantivy-based full text search ""is only supported on the local filesystem")# open the indexindex=tantivy.Index.open(path)# get the scores and doc idsquery=self._queryifself._phrase_query:query=query.replace('"',"'")query=f'"{query}"'limit=self._limitifself._limitisnotNoneelse10row_ids,scores=search_index(index,query,limit,ordering_field=self.ordering_field_name)iflen(row_ids)==0:empty_schema=pa.schema([pa.field("_score",pa.float32())])returnpa.Table.from_batches([],schema=empty_schema)scores=pa.array(scores)output_tbl=self._table.to_lance().take(row_ids,columns=self._columns)output_tbl=output_tbl.append_column("_score",scores)# this needs to match vector search results which are uint64row_ids=pa.array(row_ids,type=pa.uint64())ifself._whereisnotNone:tmp_name="__lancedb__duckdb__indexer__"output_tbl=output_tbl.append_column(tmp_name,pa.array(range(len(output_tbl))))try:# TODO would be great to have Substrait generate pyarrow compute# expressions or conversely have pyarrow support SQL expressions# using Substraitimportduckdbindexer=duckdb.sql(f"SELECT{tmp_name} FROM output_tbl WHERE{self._where}").to_arrow_table()[tmp_name]output_tbl=output_tbl.take(indexer).drop([tmp_name])row_ids=row_ids.take(indexer)exceptImportError:importtempfileimportlance# TODO Use "memory://" instead once that's supportedwithtempfile.TemporaryDirectory()astmp:ds=lance.write_dataset(output_tbl,tmp)output_tbl=ds.to_table(filter=self._where)indexer=output_tbl[tmp_name]row_ids=row_ids.take(indexer)output_tbl=output_tbl.drop([tmp_name])ifself._with_row_id:output_tbl=output_tbl.append_column("_rowid",row_ids)ifself._rerankerisnotNone:output_tbl=self._reranker.rerank_fts(self._query,output_tbl)returnoutput_tbldefrerank(self,reranker:Reranker)->LanceFtsQueryBuilder:"""Rerank the results using the specified reranker.        Parameters        ----------        reranker: Reranker            The reranker to use.        Returns        -------        LanceFtsQueryBuilder            The LanceQueryBuilder object.        """self._reranker=rerankerifreranker.score=="all":self.with_row_id(True)returnself

phrase_query

phrase_query(phrase_query:bool=True)->LanceFtsQueryBuilder

Set whether to use phrase query.

Parameters:

  • phrase_query (bool, default:True) –

    If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.

Returns:

  • LanceFtsQueryBuilder –

    The LanceFtsQueryBuilder object.

Source code inlancedb/query.py
defphrase_query(self,phrase_query:bool=True)->LanceFtsQueryBuilder:"""Set whether to use phrase query.    Parameters    ----------    phrase_query: bool, default True        If True, then the query will be wrapped in quotes and        double quotes replaced by single quotes.    Returns    -------    LanceFtsQueryBuilder        The LanceFtsQueryBuilder object.    """self._phrase_query=phrase_queryreturnself
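
An illustrative sketch, assuming a text column named "text" with a full-text index already created via create_fts_index (column and query are made up):

>>> tbl.create_fts_index("text")                 # doctest: +SKIP
>>> (tbl.search("old tower", query_type="fts")   # doctest: +SKIP
...     .phrase_query(True)   # match the exact phrase rather than either term
...     .limit(5)
...     .to_pandas())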

rerank

rerank(reranker:Reranker)->LanceFtsQueryBuilder

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

Returns:

  • LanceFtsQueryBuilder –

    The LanceQueryBuilder object.

Source code inlancedb/query.py
defrerank(self,reranker:Reranker)->LanceFtsQueryBuilder:"""Rerank the results using the specified reranker.    Parameters    ----------    reranker: Reranker        The reranker to use.    Returns    -------    LanceFtsQueryBuilder        The LanceQueryBuilder object.    """self._reranker=rerankerifreranker.score=="all":self.with_row_id(True)returnself

lancedb.query.LanceHybridQueryBuilder

Bases:LanceQueryBuilder

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses reciprocal rank fusion score for reranking.

To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.
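
An illustrative sketch (column names and data are assumptions) that sets the vector and text parts explicitly, which avoids needing an embedding function to vectorize the string query; an FTS index on the text column is assumed:

>>> from lancedb.rerankers import RRFReranker   # doctest: +SKIP
>>> (tbl.search(query_type="hybrid")            # doctest: +SKIP
...     .vector([0.4, 0.4])
...     .text("a text query")
...     .limit(5)
...     .rerank(RRFReranker())                  # RRF is also the default reranker
...     .to_pandas())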

Source code inlancedb/query.py
classLanceHybridQueryBuilder(LanceQueryBuilder):"""    A query builder that performs hybrid vector and full text search.    Results are combined and reranked based on the specified reranker.    By default, the results are reranked using the RRFReranker, which    uses reciprocal rank fusion score for reranking.    To make the vector and fts results comparable, the scores are normalized.    Instead of normalizing scores, the `normalize` parameter can be set to "rank"    in the `rerank` method to convert the scores to ranks and then normalize them.    """def__init__(self,table:"Table",query:Optional[Union[str,FullTextQuery]]=None,vector_column:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,):super().__init__(table)self._query=queryself._vector_column=vector_columnself._fts_columns=fts_columnsself._norm=Noneself._reranker=Noneself._minimum_nprobes=Noneself._maximum_nprobes=Noneself._refine_factor=Noneself._distance_type=Noneself._phrase_query=Noneself._lower_bound=Noneself._upper_bound=Nonedef_validate_query(self,query,vector=None,text=None):ifqueryisnotNoneand(vectorisnotNoneortextisnotNone):raiseValueError("You can either provide a string query in search() method""or set `vector()` and `text()` explicitly for hybrid search.""But not both.")vector_query=vectorifvectorisnotNoneelsequeryifnotisinstance(vector_query,(str,list,np.ndarray)):raiseValueError("Vector query must be either a string or a vector")text_query=textorqueryiftext_queryisNone:raiseValueError("Text query must be provided for hybrid search.")ifnotisinstance(text_query,(str,FullTextQuery)):raiseValueError("Text query must be a string or FullTextQuery")returnvector_query,text_querydefphrase_query(self,phrase_query:bool=None)->LanceHybridQueryBuilder:"""Set whether to use phrase query.        Parameters        ----------        phrase_query: bool, default True            If True, then the query will be wrapped in quotes and            double quotes replaced by single quotes.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        
"""self._phrase_query=phrase_queryreturnselfdefto_query_object(self)->Query:raiseNotImplementedError("to_query_object not yet supported on a hybrid query")defto_arrow(self,*,timeout:Optional[timedelta]=None)->pa.Table:self._create_query_builders()withThreadPoolExecutor()asexecutor:fts_future=executor.submit(self._fts_query.with_row_id(True).to_arrow,timeout=timeout)vector_future=executor.submit(self._vector_query.with_row_id(True).to_arrow,timeout=timeout)fts_results=fts_future.result()vector_results=vector_future.result()returnself._combine_hybrid_results(fts_results=fts_results,vector_results=vector_results,norm=self._norm,fts_query=self._fts_query._query,reranker=self._reranker,limit=self._limit,with_row_ids=self._with_row_id,)@staticmethoddef_combine_hybrid_results(fts_results:pa.Table,vector_results:pa.Table,norm:str,fts_query:str,reranker,limit:int,with_row_ids:bool,)->pa.Table:ifnorm=="rank":vector_results=LanceHybridQueryBuilder._rank(vector_results,"_distance")fts_results=LanceHybridQueryBuilder._rank(fts_results,"_score")original_distances=Noneoriginal_scores=Noneoriginal_distance_row_ids=Noneoriginal_score_row_ids=None# normalize the scores to be between 0 and 1, 0 being most relevant# We check whether the results (vector and FTS) are empty, because when# they are, they often are missing the _rowid column, which causes an errorifvector_results.num_rows>0:distance_i=vector_results.column_names.index("_distance")original_distances=vector_results.column(distance_i)original_distance_row_ids=vector_results.column("_rowid")vector_results=vector_results.set_column(distance_i,vector_results.field(distance_i),LanceHybridQueryBuilder._normalize_scores(original_distances),)# In fts higher scores represent relevance. Not inverting them here as# rerankers might need to preserve this score to support `return_score="all"`iffts_results.num_rows>0:score_i=fts_results.column_names.index("_score")original_scores=fts_results.column(score_i)original_score_row_ids=fts_results.column("_rowid")fts_results=fts_results.set_column(score_i,fts_results.field(score_i),LanceHybridQueryBuilder._normalize_scores(original_scores),)results=reranker.rerank_hybrid(fts_query,vector_results,fts_results)check_reranker_result(results)if"_distance"inresults.column_namesandoriginal_distancesisnotNone:# restore the original distancesindices=pc.index_in(results["_rowid"],original_distance_row_ids,skip_nulls=True)original_distances=pc.take(original_distances,indices)distance_i=results.column_names.index("_distance")results=results.set_column(distance_i,"_distance",original_distances)if"_score"inresults.column_namesandoriginal_scoresisnotNone:# restore the original scoresindices=pc.index_in(results["_rowid"],original_score_row_ids,skip_nulls=True)original_scores=pc.take(original_scores,indices)score_i=results.column_names.index("_score")results=results.set_column(score_i,"_score",original_scores)results=results.slice(length=limit)ifnotwith_row_ids:results=results.drop(["_rowid"])returnresultsdefto_batches(self,/,batch_size:Optional[int]=None,timeout:Optional[timedelta]=None):raiseNotImplementedError("to_batches not yet supported on a hybrid query")@staticmethoddef_rank(results:pa.Table,column:str,ascending:bool=True):iflen(results)==0:returnresults# Get the _score column from resultsscores=results.column(column).to_numpy()sort_indices=np.argsort(scores)ifnotascending:sort_indices=sort_indices[::-1]ranks=np.empty_like(sort_indices)ranks[sort_indices]=np.arange(len(scores))+1# replace the _score column with the 
ranks_score_idx=results.column_names.index(column)results=results.set_column(_score_idx,column,pa.array(ranks,type=pa.float32()))returnresults@staticmethoddef_normalize_scores(scores:pa.Array,invert=False)->pa.Array:iflen(scores)==0:returnscores# normalize the scores by subtracting the min and dividing by the maxmin,max=pc.min_max(scores).values()rng=pc.subtract(max,min)ifnotpc.equal(rng,pa.scalar(0.0)).as_py():scores=pc.divide(pc.subtract(scores,min),rng)elifnotpc.equal(max,pa.scalar(0.0)).as_py():# If rng is 0, then we at least want the scores to be 0scores=pc.subtract(scores,min)ifinvert:scores=pc.subtract(1,scores)returnscoresdefrerank(self,reranker:Reranker=RRFReranker(),normalize:str="score",)->LanceHybridQueryBuilder:"""        Rerank the hybrid search results using the specified reranker. The reranker        must be an instance of Reranker class.        Parameters        ----------        reranker: Reranker, default RRFReranker()            The reranker to use. Must be an instance of Reranker class.        normalize: str, default "score"            The method to normalize the scores. Can be "rank" or "score". If "rank",            the scores are converted to ranks and then normalized. If "score", the            scores are normalized directly.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """ifnormalizenotin["rank","score"]:raiseValueError("normalize must be 'rank' or 'score'.")ifrerankerandnotisinstance(reranker,Reranker):raiseValueError("reranker must be an instance of Reranker class.")self._norm=normalizeself._reranker=rerankerifreranker.score=="all":self.with_row_id(True)returnselfdefnprobes(self,nprobes:int)->LanceHybridQueryBuilder:"""        Set the number of probes to use for vector search.        Higher values will yield better recall (more likely to find vectors if        they exist) at the expense of latency.        Parameters        ----------        nprobes: int            The number of probes to use.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """self._minimum_nprobes=nprobesself._maximum_nprobes=nprobesreturnselfdefminimum_nprobes(self,minimum_nprobes:int)->LanceHybridQueryBuilder:"""Set the minimum number of probes to use.        See `nprobes` for more details.        """self._minimum_nprobes=minimum_nprobesreturnselfdefmaximum_nprobes(self,maximum_nprobes:int)->LanceHybridQueryBuilder:"""Set the maximum number of probes to use.        See `nprobes` for more details.        """self._maximum_nprobes=maximum_nprobesreturnselfdefdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceHybridQueryBuilder:"""        Set the distance range to use.        Only rows with distances within range [lower_bound, upper_bound)        will be returned.        Parameters        ----------        lower_bound: Optional[float]            The lower bound of the distance range.        upper_bound: Optional[float]            The upper bound of the distance range.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """self._lower_bound=lower_boundself._upper_bound=upper_boundreturnselfdefef(self,ef:int)->LanceHybridQueryBuilder:"""        Set the number of candidates to consider during search.        Higher values will yield better recall (more likely to find vectors if        they exist) at the expense of latency.        
This only applies to the HNSW-related index.        The default value is 1.5 * limit.        Parameters        ----------        ef: int            The number of candidates to consider during search.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """self._ef=efreturnselfdefmetric(self,metric:Literal["l2","cosine","dot"])->LanceHybridQueryBuilder:"""Set the distance metric to use.        This is an alias for distance_type() and may be deprecated in the future.        Please use distance_type() instead.        Parameters        ----------        metric: "l2" or "cosine" or "dot"            The distance metric to use. By default "l2" is used.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """returnself.distance_type(metric)defdistance_type(self,distance_type:Literal["l2","cosine","dot"])->"LanceHybridQueryBuilder":"""Set the distance metric to use.        When performing a vector search we try and find the "nearest" vectors according        to some kind of distance metric. This parameter controls which distance metric        to use.        Note: if there is a vector index then the distance type used MUST match the        distance type used to train the vector index. If this is not done then the        results will be invalid.        Parameters        ----------        distance_type: "l2" or "cosine" or "dot"            The distance metric to use. By default "l2" is used.        Returns        -------        LanceVectorQueryBuilder            The LanceQueryBuilder object.        """self._distance_type=distance_type.lower()returnselfdefrefine_factor(self,refine_factor:int)->LanceHybridQueryBuilder:"""        Refine the vector search results by reading extra elements and        re-ranking them in memory.        Parameters        ----------        refine_factor: int            The refine factor to use.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """self._refine_factor=refine_factorreturnselfdefvector(self,vector:Union[np.ndarray,list])->LanceHybridQueryBuilder:self._vector=vectorreturnselfdeftext(self,text:str|FullTextQuery)->LanceHybridQueryBuilder:self._text=textreturnselfdefbypass_vector_index(self)->LanceHybridQueryBuilder:"""        If this is called then any vector index is skipped        An exhaustive (flat) search will be performed.  The query vector will        be compared to every vector in the table.  At high scales this can be        expensive.  However, this is often still useful.  For example, skipping        the vector index can give you ground truth results which you can use to        calculate your recall to select an appropriate value for nprobes.        Returns        -------        LanceHybridQueryBuilder            The LanceHybridQueryBuilder object.        """self._bypass_vector_index=Truereturnselfdefexplain_plan(self,verbose:Optional[bool]=False)->str:"""Return the execution plan for this query.        
Examples        --------        >>> import lancedb        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])        >>> query = [100, 100]        >>> plan = table.search(query).explain_plan(True)        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]        GlobalLimitExec: skip=0, fetch=10          FilterExec: _distance@2 IS NOT NULL            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]              KNNVectorDistance: metric=l2                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false        Parameters        ----------        verbose : bool, default False            Use a verbose output format.        Returns        -------        plan : str        """# noqa: E501self._create_query_builders()results=["Vector Search Plan:"]results.append(self._table._explain_plan(self._vector_query.to_query_object(),verbose=verbose))results.append("FTS Search Plan:")results.append(self._table._explain_plan(self._fts_query.to_query_object(),verbose=verbose))return"\n".join(results)defanalyze_plan(self):"""Execute the query and display with runtime metrics.        Returns        -------        plan : str        """self._create_query_builders()results=["Vector Search Plan:"]results.append(self._table._analyze_plan(self._vector_query.to_query_object()))results.append("FTS Search Plan:")results.append(self._table._analyze_plan(self._fts_query.to_query_object()))return"\n".join(results)def_create_query_builders(self):"""Set up and configure the vector and FTS query builders."""vector_query,fts_query=self._validate_query(self._query,self._vector,self._text)self._fts_query=LanceFtsQueryBuilder(self._table,fts_query,fts_columns=self._fts_columns)vector_query=self._query_to_vector(self._table,vector_query,self._vector_column)self._vector_query=LanceVectorQueryBuilder(self._table,vector_query,self._vector_column)# Apply common configurationsifself._limit:self._vector_query.limit(self._limit)self._fts_query.limit(self._limit)ifself._columns:self._vector_query.select(self._columns)self._fts_query.select(self._columns)ifself._where:self._vector_query.where(self._where,self._postfilter)self._fts_query.where(self._where,self._postfilter)ifself._with_row_id:self._vector_query.with_row_id(True)self._fts_query.with_row_id(True)ifself._phrase_query:self._fts_query.phrase_query(True)ifself._distance_type:self._vector_query.metric(self._distance_type)ifself._minimum_nprobes:self._vector_query.minimum_nprobes(self._minimum_nprobes)ifself._maximum_nprobesisnotNone:self._vector_query.maximum_nprobes(self._maximum_nprobes)ifself._refine_factor:self._vector_query.refine_factor(self._refine_factor)ifself._ef:self._vector_query.ef(self._ef)ifself._bypass_vector_index:self._vector_query.bypass_vector_index()ifself._lower_boundorself._upper_bound:self._vector_query.distance_range(lower_bound=self._lower_bound,upper_bound=self._upper_bound)ifself._rerankerisNone:self._reranker=RRFReranker()

phrase_query

phrase_query(phrase_query:bool=None)->LanceHybridQueryBuilder

Set whether to use phrase query.

Parameters:

  • phrase_query (bool, default:None) –

    If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defphrase_query(self,phrase_query:bool=None)->LanceHybridQueryBuilder:"""Set whether to use phrase query.    Parameters    ----------    phrase_query: bool, default True        If True, then the query will be wrapped in quotes and        double quotes replaced by single quotes.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._phrase_query=phrase_queryreturnself

rerank

rerank(reranker:Reranker=RRFReranker(),normalize:str='score')->LanceHybridQueryBuilder

Rerank the hybrid search results using the specified reranker. The reranker must be an instance of the Reranker class.

Parameters:

  • reranker (Reranker, default:RRFReranker()) –

    The reranker to use. Must be an instance of Reranker class.

  • normalize (str, default:'score') –

    The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defrerank(self,reranker:Reranker=RRFReranker(),normalize:str="score",)->LanceHybridQueryBuilder:"""    Rerank the hybrid search results using the specified reranker. The reranker    must be an instance of Reranker class.    Parameters    ----------    reranker: Reranker, default RRFReranker()        The reranker to use. Must be an instance of Reranker class.    normalize: str, default "score"        The method to normalize the scores. Can be "rank" or "score". If "rank",        the scores are converted to ranks and then normalized. If "score", the        scores are normalized directly.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """ifnormalizenotin["rank","score"]:raiseValueError("normalize must be 'rank' or 'score'.")ifrerankerandnotisinstance(reranker,Reranker):raiseValueError("reranker must be an instance of Reranker class.")self._norm=normalizeself._reranker=rerankerifreranker.score=="all":self.with_row_id(True)returnself

nprobes

nprobes(nprobes:int)->LanceHybridQueryBuilder

Set the number of probes to use for vector search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

Parameters:

  • nprobes (int) –

    The number of probes to use.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defnprobes(self,nprobes:int)->LanceHybridQueryBuilder:"""    Set the number of probes to use for vector search.    Higher values will yield better recall (more likely to find vectors if    they exist) at the expense of latency.    Parameters    ----------    nprobes: int        The number of probes to use.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._minimum_nprobes=nprobesself._maximum_nprobes=nprobesreturnself
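
For example, assuming an IVF-based vector index exists on the table (a sketch; the value and query text are illustrative):

# Probe more index partitions for better recall at the cost of latency.
results = (
    table.search("renewable energy policy", query_type="hybrid")
    .nprobes(20)
    .limit(10)
    .to_pandas()
)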

minimum_nprobes

minimum_nprobes(minimum_nprobes:int)->LanceHybridQueryBuilder

Set the minimum number of probes to use.

See nprobes for more details.

Source code inlancedb/query.py
defminimum_nprobes(self,minimum_nprobes:int)->LanceHybridQueryBuilder:"""Set the minimum number of probes to use.    See `nprobes` for more details.    """self._minimum_nprobes=minimum_nprobesreturnself

maximum_nprobes

maximum_nprobes(maximum_nprobes:int)->LanceHybridQueryBuilder

Set the maximum number of probes to use.

See nprobes for more details.

Source code inlancedb/query.py
defmaximum_nprobes(self,maximum_nprobes:int)->LanceHybridQueryBuilder:"""Set the maximum number of probes to use.    See `nprobes` for more details.    """self._maximum_nprobes=maximum_nprobesreturnself
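
Minimum and maximum probes can be combined, for instance to start with a small number of partitions and let the search widen when it cannot fill the requested limit (a sketch; the values are illustrative and the exact widening behavior depends on the index and any filters):

results = (
    table.search("coffee roasting", query_type="hybrid")
    .minimum_nprobes(10)   # always probe at least 10 partitions
    .maximum_nprobes(50)   # probe up to 50 when more candidates are needed
    .limit(10)
    .to_pandas()
)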

distance_range

distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceHybridQueryBuilder

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default:None) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default:None) –

    The upper bound of the distance range.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceHybridQueryBuilder:"""    Set the distance range to use.    Only rows with distances within range [lower_bound, upper_bound)    will be returned.    Parameters    ----------    lower_bound: Optional[float]        The lower bound of the distance range.    upper_bound: Optional[float]        The upper bound of the distance range.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._lower_bound=lower_boundself._upper_bound=upper_boundreturnself
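
For example, to keep only matches whose distance falls in a known range (a sketch; the bounds are illustrative and are interpreted in the configured distance metric):

results = (
    table.search("solar panels", query_type="hybrid")
    .distance_range(lower_bound=0.0, upper_bound=0.5)
    .limit(10)
    .to_pandas()
)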

ef

ef(ef:int)->LanceHybridQueryBuilder

Set the number of candidates to consider during search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

This only applies to the HNSW-related index. The default value is 1.5 * limit.

Parameters:

  • ef (int) –

    The number of candidates to consider during search.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defef(self,ef:int)->LanceHybridQueryBuilder:"""    Set the number of candidates to consider during search.    Higher values will yield better recall (more likely to find vectors if    they exist) at the expense of latency.    This only applies to the HNSW-related index.    The default value is 1.5 * limit.    Parameters    ----------    ef: int        The number of candidates to consider during search.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._ef=efreturnself
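
For example, when the table has an HNSW index (a sketch; the value and query text are illustrative):

# Consider more candidates per query for higher recall at higher latency.
results = (
    table.search("jazz standards", query_type="hybrid")
    .ef(64)
    .limit(10)
    .to_pandas()
)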

metric

metric(metric:Literal['l2','cosine','dot'])->LanceHybridQueryBuilder

Set the distance metric to use.

This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.

Parameters:

  • metric (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defmetric(self,metric:Literal["l2","cosine","dot"])->LanceHybridQueryBuilder:"""Set the distance metric to use.    This is an alias for distance_type() and may be deprecated in the future.    Please use distance_type() instead.    Parameters    ----------    metric: "l2" or "cosine" or "dot"        The distance metric to use. By default "l2" is used.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """returnself.distance_type(metric)

distance_type

distance_type(distance_type:Literal['l2','cosine','dot'])->'LanceHybridQueryBuilder'

Set the distance metric to use.

When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defdistance_type(self,distance_type:Literal["l2","cosine","dot"])->"LanceHybridQueryBuilder":"""Set the distance metric to use.    When performing a vector search we try and find the "nearest" vectors according    to some kind of distance metric. This parameter controls which distance metric    to use.    Note: if there is a vector index then the distance type used MUST match the    distance type used to train the vector index. If this is not done then the    results will be invalid.    Parameters    ----------    distance_type: "l2" or "cosine" or "dot"        The distance metric to use. By default "l2" is used.    Returns    -------    LanceVectorQueryBuilder        The LanceQueryBuilder object.    """self._distance_type=distance_type.lower()returnself
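
For example (a sketch; it assumes the vector index, if any, was also built with cosine distance):

results = (
    table.search("mountain trails", query_type="hybrid")
    .distance_type("cosine")
    .limit(10)
    .to_pandas()
)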

refine_factor

refine_factor(refine_factor:int)->LanceHybridQueryBuilder

Refine the vector search results by reading extra elements and re-ranking them in memory.

Parameters:

  • refine_factor (int) –

    The refine factor to use.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defrefine_factor(self,refine_factor:int)->LanceHybridQueryBuilder:"""    Refine the vector search results by reading extra elements and    re-ranking them in memory.    Parameters    ----------    refine_factor: int        The refine factor to use.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._refine_factor=refine_factorreturnself
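
For example, a refine factor of 10 reads roughly ten times the requested rows from the index and re-ranks them in memory (a sketch; the value and query text are illustrative):

results = (
    table.search("orbital mechanics", query_type="hybrid")
    .refine_factor(10)
    .limit(10)
    .to_pandas()
)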

bypass_vector_index

bypass_vector_index()->LanceHybridQueryBuilder

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code inlancedb/query.py
defbypass_vector_index(self)->LanceHybridQueryBuilder:"""    If this is called then any vector index is skipped    An exhaustive (flat) search will be performed.  The query vector will    be compared to every vector in the table.  At high scales this can be    expensive.  However, this is often still useful.  For example, skipping    the vector index can give you ground truth results which you can use to    calculate your recall to select an appropriate value for nprobes.    Returns    -------    LanceHybridQueryBuilder        The LanceHybridQueryBuilder object.    """self._bypass_vector_index=Truereturnself
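
For example, a flat search can produce ground-truth results when tuning nprobes (a sketch; the query and limit are illustrative):

# Exhaustive search: the query vector is compared against every row.
ground_truth = (
    table.search("gradient descent", query_type="hybrid")
    .bypass_vector_index()
    .limit(10)
    .to_pandas()
)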

explain_plan

explain_plan(verbose:Optional[bool]=False)->str

Return the execution plan for this query.

Examples:

>>>importlancedb>>>db=lancedb.connect("./.lancedb")>>>table=db.create_table("my_table",[{"vector":[99.0,99]}])>>>query=[100,100]>>>plan=table.search(query).explain_plan(True)>>>print(plan)ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]GlobalLimitExec: skip=0, fetch=10  FilterExec: _distance@2 IS NOT NULL    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]      KNNVectorDistance: metric=l2        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
defexplain_plan(self,verbose:Optional[bool]=False)->str:"""Return the execution plan for this query.    Examples    --------    >>> import lancedb    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])    >>> query = [100, 100]    >>> plan = table.search(query).explain_plan(True)    >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]    GlobalLimitExec: skip=0, fetch=10      FilterExec: _distance@2 IS NOT NULL        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]          KNNVectorDistance: metric=l2            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false    Parameters    ----------    verbose : bool, default False        Use a verbose output format.    Returns    -------    plan : str    """# noqa: E501self._create_query_builders()results=["Vector Search Plan:"]results.append(self._table._explain_plan(self._vector_query.to_query_object(),verbose=verbose))results.append("FTS Search Plan:")results.append(self._table._explain_plan(self._fts_query.to_query_object(),verbose=verbose))return"\n".join(results)

analyze_plan

analyze_plan()

Execute the query and display with runtime metrics.

Returns:

  • plan (str) –
Source code inlancedb/query.py
defanalyze_plan(self):"""Execute the query and display with runtime metrics.    Returns    -------    plan : str    """self._create_query_builders()results=["Vector Search Plan:"]results.append(self._table._analyze_plan(self._vector_query.to_query_object()))results.append("FTS Search Plan:")results.append(self._table._analyze_plan(self._fts_query.to_query_object()))return"\n".join(results)
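
Both plan-inspection methods can be applied to the same hybrid query, roughly as follows (a sketch; the query text is illustrative):

query = table.search("machine learning", query_type="hybrid").limit(10)

# Show the planned execution for the vector and FTS sub-queries...
print(query.explain_plan(verbose=True))

# ...or execute them and report per-operator runtime metrics.
print(query.analyze_plan())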

Embeddings

lancedb.embeddings.registry.EmbeddingFunctionRegistry

This is a singleton class used to register embedding functions and fetch them by name. It also handles serializing and deserializing. You can implement your own embedding function by subclassing EmbeddingFunction or TextEmbeddingFunction and registering it with the registry.

NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array, pa.ChunkedArray, np.ndarray]

Examples:

>>>registry=EmbeddingFunctionRegistry.get_instance()>>>@registry.register("my-embedding-function")...classMyEmbeddingFunction(EmbeddingFunction):...defndims(self)->int:...return128......defcompute_query_embeddings(self,query:str,*args,**kwargs):...returnself.compute_source_embeddings(query,*args,**kwargs)......defcompute_source_embeddings(self,texts,*args,**kwargs):...return[np.random.rand(self.ndims())for_inrange(len(texts))]...>>>registry.get("my-embedding-function")<class 'lancedb.embeddings.registry.MyEmbeddingFunction'>
Source code inlancedb/embeddings/registry.py
classEmbeddingFunctionRegistry:"""    This is a singleton class used to register embedding functions    and fetch them by name. It also handles serializing and deserializing.    You can implement your own embedding function by subclassing EmbeddingFunction    or TextEmbeddingFunction and registering it with the registry.    NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array,          pa.ChunkedArray, np.ndarray]    Examples    --------    >>> registry = EmbeddingFunctionRegistry.get_instance()    >>> @registry.register("my-embedding-function")    ... class MyEmbeddingFunction(EmbeddingFunction):    ...     def ndims(self) -> int:    ...         return 128    ...    ...     def compute_query_embeddings(self, query: str, *args, **kwargs):    ...         return self.compute_source_embeddings(query, *args, **kwargs)    ...    ...     def compute_source_embeddings(self, texts, *args, **kwargs):    ...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]    ...    >>> registry.get("my-embedding-function")    <class 'lancedb.embeddings.registry.MyEmbeddingFunction'>    """@classmethoddefget_instance(cls):return__REGISTRY__def__init__(self):self._functions={}self._variables={}defregister(self,alias:str=None):"""        This creates a decorator that can be used to register        an EmbeddingFunction.        Parameters        ----------        alias : Optional[str]            a human friendly name for the embedding function. If not            provided, the class name will be used.        """# This is a decorator for a class that inherits from BaseModel# It adds the class to the registrydefdecorator(cls):ifnotissubclass(cls,EmbeddingFunction):raiseTypeError("Must be a subclass of EmbeddingFunction")ifcls.__name__inself._functions:raiseKeyError(f"{cls.__name__} was already registered")key=aliasorcls.__name__self._functions[key]=clscls.__embedding_function_registry_alias__=aliasreturnclsreturndecoratordefreset(self):"""        Reset the registry to its initial state        """self._functions={}defget(self,name:str):"""        Fetch an embedding function class by name        Parameters        ----------        name : str            The name of the embedding function to fetch            Either the alias or the class name if no alias was provided            during registration        """returnself._functions[name]defparse_functions(self,metadata:Optional[Dict[bytes,bytes]])->Dict[str,"EmbeddingFunctionConfig"]:"""        Parse the metadata from an arrow table and        return a mapping of the vector column to the        embedding function and source column        Parameters        ----------        metadata : Optional[Dict[bytes, bytes]]            The metadata from an arrow table. Note that            the keys and values are bytes (pyarrow api)        Returns        -------        functions : dict            A mapping of vector column name to embedding function.            An empty dict is returned if input is None or does not            contain b"embedding_functions".        
"""ifmetadataisNone:return{}# Look at both bytes and string keys, since we might use eitherserialized=metadata.get(b"embedding_functions",metadata.get("embedding_functions"))ifserializedisNone:return{}raw_list=json.loads(serialized.decode("utf-8"))return{obj["vector_column"]:EmbeddingFunctionConfig(vector_column=obj["vector_column"],source_column=obj["source_column"],function=self.get(obj["name"])(**obj["model"]),)forobjinraw_list}deffunction_to_metadata(self,conf:"EmbeddingFunctionConfig"):"""        Convert the given embedding function and source / vector column configs        into a config dictionary that can be serialized into arrow metadata        """func=conf.functionname=getattr(func,"__embedding_function_registry_alias__",func.__class__.__name__)json_data=func.safe_model_dump()return{"name":name,"model":json_data,"source_column":conf.source_column,"vector_column":conf.vector_column,}defget_table_metadata(self,func_list):"""        Convert a list of embedding functions and source / vector configs        into a config dictionary that can be serialized into arrow metadata        """iffunc_listisNoneorlen(func_list)==0:returnNonejson_data=[self.function_to_metadata(func)forfuncinfunc_list]# Note that metadata dictionary values must be bytes# so we need to json dump then utf8 encodemetadata=json.dumps(json_data,indent=2).encode("utf-8")return{"embedding_functions":metadata}defset_var(self,name:str,value:str)->None:"""        Set a variable. These can be accessed in embedding configuration using        the syntax `$var:variable_name`. If they are not set, an error will be        thrown letting you know which variable is missing. If you want to supply        a default value, you can add an additional part in the configuration        like so: `$var:variable_name:default_value`. Default values can be        used for runtime configurations that are not sensitive, such as        whether to use a GPU for inference.        The name must not contain a colon. Default values can contain colons.        """if":"inname:raiseValueError("Variable names cannot contain colons")self._variables[name]=valuedefget_var(self,name:str)->str:"""        Get a variable.        """returnself._variables[name]

register

register(alias:str=None)

This creates a decorator that can be used to register an EmbeddingFunction.

Parameters:

  • alias (Optional[str], default:None) –

    A human-friendly name for the embedding function. If not provided, the class name will be used.

Source code inlancedb/embeddings/registry.py
defregister(self,alias:str=None):"""    This creates a decorator that can be used to register    an EmbeddingFunction.    Parameters    ----------    alias : Optional[str]        a human friendly name for the embedding function. If not        provided, the class name will be used.    """# This is a decorator for a class that inherits from BaseModel# It adds the class to the registrydefdecorator(cls):ifnotissubclass(cls,EmbeddingFunction):raiseTypeError("Must be a subclass of EmbeddingFunction")ifcls.__name__inself._functions:raiseKeyError(f"{cls.__name__} was already registered")key=aliasorcls.__name__self._functions[key]=clscls.__embedding_function_registry_alias__=aliasreturnclsreturndecorator

reset

reset()

Reset the registry to its initial state

Source code inlancedb/embeddings/registry.py
defreset(self):"""    Reset the registry to its initial state    """self._functions={}

get

get(name:str)

Fetch an embedding function class by name

Parameters:

  • name (str) –

    The name of the embedding function to fetch. Either the alias or the class name if no alias was provided during registration.

Source code inlancedb/embeddings/registry.py
defget(self,name:str):"""    Fetch an embedding function class by name    Parameters    ----------    name : str        The name of the embedding function to fetch        Either the alias or the class name if no alias was provided        during registration    """returnself._functions[name]

parse_functions

parse_functions(metadata:Optional[Dict[bytes,bytes]])->Dict[str,EmbeddingFunctionConfig]

Parse the metadata from an arrow table and return a mapping of the vector column to the embedding function and source column.

Parameters:

  • metadata (Optional[Dict[bytes,bytes]]) –

    The metadata from an arrow table. Note that the keys and values are bytes (pyarrow api).

Returns:

  • functions (dict) –

    A mapping of vector column name to embedding function. An empty dict is returned if input is None or does not contain b"embedding_functions".

Source code inlancedb/embeddings/registry.py
defparse_functions(self,metadata:Optional[Dict[bytes,bytes]])->Dict[str,"EmbeddingFunctionConfig"]:"""    Parse the metadata from an arrow table and    return a mapping of the vector column to the    embedding function and source column    Parameters    ----------    metadata : Optional[Dict[bytes, bytes]]        The metadata from an arrow table. Note that        the keys and values are bytes (pyarrow api)    Returns    -------    functions : dict        A mapping of vector column name to embedding function.        An empty dict is returned if input is None or does not        contain b"embedding_functions".    """ifmetadataisNone:return{}# Look at both bytes and string keys, since we might use eitherserialized=metadata.get(b"embedding_functions",metadata.get("embedding_functions"))ifserializedisNone:return{}raw_list=json.loads(serialized.decode("utf-8"))return{obj["vector_column"]:EmbeddingFunctionConfig(vector_column=obj["vector_column"],source_column=obj["source_column"],function=self.get(obj["name"])(**obj["model"]),)forobjinraw_list}

function_to_metadata

function_to_metadata(conf:EmbeddingFunctionConfig)

Convert the given embedding function and source / vector column configs into a config dictionary that can be serialized into arrow metadata.

Source code inlancedb/embeddings/registry.py
deffunction_to_metadata(self,conf:"EmbeddingFunctionConfig"):"""    Convert the given embedding function and source / vector column configs    into a config dictionary that can be serialized into arrow metadata    """func=conf.functionname=getattr(func,"__embedding_function_registry_alias__",func.__class__.__name__)json_data=func.safe_model_dump()return{"name":name,"model":json_data,"source_column":conf.source_column,"vector_column":conf.vector_column,}

get_table_metadata

get_table_metadata(func_list)

Convert a list of embedding functions and source / vector configs into a config dictionary that can be serialized into arrow metadata.

Source code inlancedb/embeddings/registry.py
defget_table_metadata(self,func_list):"""    Convert a list of embedding functions and source / vector configs    into a config dictionary that can be serialized into arrow metadata    """iffunc_listisNoneorlen(func_list)==0:returnNonejson_data=[self.function_to_metadata(func)forfuncinfunc_list]# Note that metadata dictionary values must be bytes# so we need to json dump then utf8 encodemetadata=json.dumps(json_data,indent=2).encode("utf-8")return{"embedding_functions":metadata}

set_var

set_var(name:str,value:str)->None

Set a variable. These can be accessed in embedding configuration using the syntax $var:variable_name. If they are not set, an error will be thrown letting you know which variable is missing. If you want to supply a default value, you can add an additional part in the configuration like so: $var:variable_name:default_value. Default values can be used for runtime configurations that are not sensitive, such as whether to use a GPU for inference.

The name must not contain a colon. Default values can contain colons.

Source code inlancedb/embeddings/registry.py
defset_var(self,name:str,value:str)->None:"""    Set a variable. These can be accessed in embedding configuration using    the syntax `$var:variable_name`. If they are not set, an error will be    thrown letting you know which variable is missing. If you want to supply    a default value, you can add an additional part in the configuration    like so: `$var:variable_name:default_value`. Default values can be    used for runtime configurations that are not sensitive, such as    whether to use a GPU for inference.    The name must not contain a colon. Default values can contain colons.    """if":"inname:raiseValueError("Variable names cannot contain colons")self._variables[name]=value
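
For example, a sensitive value such as an API key can be supplied through a variable instead of being hardcoded in the embedding configuration (a sketch; the variable names and the placeholder key are illustrative):

from lancedb.embeddings import get_registry

registry = get_registry()
registry.set_var("openai_key", "sk-...")  # placeholder value

openai_func = registry.get("openai").create(
    name="text-embedding-3-small",
    api_key="$var:openai_key",   # resolved from the registry variable
)

# Non-sensitive settings may supply a default after a second colon.
st_func = registry.get("sentence-transformers").create(device="$var:device:cpu")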

get_var

get_var(name:str)->str

Get a variable.

Source code inlancedb/embeddings/registry.py
defget_var(self,name:str)->str:"""    Get a variable.    """returnself._variables[name]

lancedb.embeddings.base.EmbeddingFunctionConfig

Bases:BaseModel

This model encapsulates the configuration for an embedding function in a lancedb table. It holds the embedding function, the source column, and the vector column.

Source code inlancedb/embeddings/base.py
classEmbeddingFunctionConfig(BaseModel):"""    This model encapsulates the configuration for a embedding function    in a lancedb table. It holds the embedding function, the source column,    and the vector column    """vector_column:strsource_column:strfunction:EmbeddingFunction

lancedb.embeddings.base.EmbeddingFunction

Bases:BaseModel,ABC

An ABC for embedding functions.

All concrete embedding functions must implement the following methods:

  1. compute_query_embeddings(), which takes a query and returns a list of embeddings
  2. compute_source_embeddings(), which returns a list of embeddings for the source column. For text data, the two will be the same. For multi-modal data, the source column might be images and the vector column might be text.
  3. ndims(), which returns the number of dimensions of the vector column

Source code inlancedb/embeddings/base.py
classEmbeddingFunction(BaseModel,ABC):"""    An ABC for embedding functions.    All concrete embedding functions must implement the following methods:    1. compute_query_embeddings() which takes a query and returns a list of embeddings    2. compute_source_embeddings() which returns a list of embeddings for       the source column    For text data, the two will be the same. For multi-modal data, the source column    might be images and the vector column might be text.    3. ndims() which returns the number of dimensions of the vector column    """__slots__=("__weakref__",)# pydantic 1.x compatibilitymax_retries:int=(7# Setting 0 disables retires. Maybe this should not be enabled by default,)_ndims:int=PrivateAttr()_original_args:dict=PrivateAttr()@classmethoddefcreate(cls,**kwargs):"""        Create an instance of the embedding function        """resolved_kwargs=cls.__resolveVariables(kwargs)instance=cls(**resolved_kwargs)instance._original_args=kwargsreturninstance@classmethoddef__resolveVariables(cls,args:dict)->dict:"""        Resolve variables in the args        """from.registryimportEmbeddingFunctionRegistrynew_args=copy.deepcopy(args)registry=EmbeddingFunctionRegistry.get_instance()sensitive_keys=cls.sensitive_keys()fork,vinnew_args.items():ifisinstance(v,str)andnotv.startswith("$var:")andkinsensitive_keys:exc=ValueError(f"Sensitive key '{k}' cannot be set to a hardcoded value")add_note(exc,"Help: Use $var: to set sensitive keys to variables")raiseexcifisinstance(v,str)andv.startswith("$var:"):parts=v[5:].split(":",maxsplit=1)iflen(parts)==1:try:new_args[k]=registry.get_var(parts[0])exceptKeyError:exc=ValueError("Variable '{}' not found in registry".format(parts[0]))add_note(exc,"Help: Variables are reset in new Python sessions. ""Use `registry.set_var` to set variables.",)raiseexcelse:name,default=partstry:new_args[k]=registry.get_var(name)exceptKeyError:new_args[k]=defaultreturnnew_args@staticmethoddefsensitive_keys()->List[str]:"""        Return a list of keys that are sensitive and should not be allowed        to be set to hardcoded values in the config. For example, API keys.        """return[]@abstractmethoddefcompute_query_embeddings(self,*args,**kwargs)->list[Union[np.array,None]]:"""        Compute the embeddings for a given user query        Returns        -------        A list of embeddings for each input. The embedding of each input can be None        when the embedding is not valid.        """pass@abstractmethoddefcompute_source_embeddings(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for the source column in the database        Returns        -------        A list of embeddings for each input. The embedding of each input can be None        when the embedding is not valid.        """passdefcompute_query_embeddings_with_retry(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for a given user query with retries        Returns        -------        A list of embeddings for each input. The embedding of each input can be None        when the embedding is not valid.        """returnretry_with_exponential_backoff(self.compute_query_embeddings,max_retries=self.max_retries)(*args,**kwargs,)defcompute_source_embeddings_with_retry(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for the source column in the database with retries.        Returns        -------        A list of embeddings for each input. The embedding of each input can be None        when the embedding is not valid.        
"""returnretry_with_exponential_backoff(self.compute_source_embeddings,max_retries=self.max_retries)(*args,**kwargs)defsanitize_input(self,texts:TEXT)->Union[List[str],np.ndarray]:"""        Sanitize the input to the embedding function.        """ifisinstance(texts,str):texts=[texts]elifisinstance(texts,pa.Array):texts=texts.to_pylist()elifisinstance(texts,pa.ChunkedArray):texts=texts.combine_chunks().to_pylist()returntextsdefsafe_model_dump(self):ifnothasattr(self,"_original_args"):raiseValueError("EmbeddingFunction was not created with EmbeddingFunction.create()")returnself._original_args@abstractmethoddefndims(self)->int:"""        Return the dimensions of the vector column        """passdefSourceField(self,**kwargs):"""        Creates a pydantic Field that can automatically annotate        the source column for this embedding function        """returnField(json_schema_extra={"source_column_for":self},**kwargs)defVectorField(self,**kwargs):"""        Creates a pydantic Field that can automatically annotate        the target vector column for this embedding function        """returnField(json_schema_extra={"vector_column_for":self},**kwargs)def__eq__(self,__value:object)->bool:ifnothasattr(__value,"__dict__"):returnFalsereturnvars(self)==vars(__value)def__hash__(self)->int:returnhash(frozenset(vars(self).items()))

create classmethod

create(**kwargs)

Create an instance of the embedding function

Source code inlancedb/embeddings/base.py
@classmethoddefcreate(cls,**kwargs):"""    Create an instance of the embedding function    """resolved_kwargs=cls.__resolveVariables(kwargs)instance=cls(**resolved_kwargs)instance._original_args=kwargsreturninstance

__resolveVariables classmethod

__resolveVariables(args:dict)->dict

Resolve variables in the args

Source code inlancedb/embeddings/base.py
@classmethoddef__resolveVariables(cls,args:dict)->dict:"""    Resolve variables in the args    """from.registryimportEmbeddingFunctionRegistrynew_args=copy.deepcopy(args)registry=EmbeddingFunctionRegistry.get_instance()sensitive_keys=cls.sensitive_keys()fork,vinnew_args.items():ifisinstance(v,str)andnotv.startswith("$var:")andkinsensitive_keys:exc=ValueError(f"Sensitive key '{k}' cannot be set to a hardcoded value")add_note(exc,"Help: Use $var: to set sensitive keys to variables")raiseexcifisinstance(v,str)andv.startswith("$var:"):parts=v[5:].split(":",maxsplit=1)iflen(parts)==1:try:new_args[k]=registry.get_var(parts[0])exceptKeyError:exc=ValueError("Variable '{}' not found in registry".format(parts[0]))add_note(exc,"Help: Variables are reset in new Python sessions. ""Use `registry.set_var` to set variables.",)raiseexcelse:name,default=partstry:new_args[k]=registry.get_var(name)exceptKeyError:new_args[k]=defaultreturnnew_args

sensitive_keys staticmethod

sensitive_keys()->List[str]

Return a list of keys that are sensitive and should not be allowed to be set to hardcoded values in the config. For example, API keys.

Source code inlancedb/embeddings/base.py
@staticmethoddefsensitive_keys()->List[str]:"""    Return a list of keys that are sensitive and should not be allowed    to be set to hardcoded values in the config. For example, API keys.    """return[]

compute_query_embeddings abstractmethod

compute_query_embeddings(*args,**kwargs)->list[Union[array,None]]

Compute the embeddings for a given user query

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
@abstractmethoddefcompute_query_embeddings(self,*args,**kwargs)->list[Union[np.array,None]]:"""    Compute the embeddings for a given user query    Returns    -------    A list of embeddings for each input. The embedding of each input can be None    when the embedding is not valid.    """pass

compute_source_embeddings abstractmethod

compute_source_embeddings(*args,**kwargs)->list[Union[array,None]]

Compute the embeddings for the source column in the database

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
@abstractmethoddefcompute_source_embeddings(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for the source column in the database    Returns    -------    A list of embeddings for each input. The embedding of each input can be None    when the embedding is not valid.    """pass

compute_query_embeddings_with_retry

compute_query_embeddings_with_retry(*args,**kwargs)->list[Union[array,None]]

Compute the embeddings for a given user query with retries

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
defcompute_query_embeddings_with_retry(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for a given user query with retries    Returns    -------    A list of embeddings for each input. The embedding of each input can be None    when the embedding is not valid.    """returnretry_with_exponential_backoff(self.compute_query_embeddings,max_retries=self.max_retries)(*args,**kwargs,)

compute_source_embeddings_with_retry

compute_source_embeddings_with_retry(*args,**kwargs)->list[Union[array,None]]

Compute the embeddings for the source column in the database with retries.

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
defcompute_source_embeddings_with_retry(self,*args,**kwargs)->list[Union[np.array,None]]:"""Compute the embeddings for the source column in the database with retries.    Returns    -------    A list of embeddings for each input. The embedding of each input can be None    when the embedding is not valid.    """returnretry_with_exponential_backoff(self.compute_source_embeddings,max_retries=self.max_retries)(*args,**kwargs)

sanitize_input

sanitize_input(texts:TEXT)->Union[List[str],ndarray]

Sanitize the input to the embedding function.

Source code inlancedb/embeddings/base.py
defsanitize_input(self,texts:TEXT)->Union[List[str],np.ndarray]:"""    Sanitize the input to the embedding function.    """ifisinstance(texts,str):texts=[texts]elifisinstance(texts,pa.Array):texts=texts.to_pylist()elifisinstance(texts,pa.ChunkedArray):texts=texts.combine_chunks().to_pylist()returntexts

ndims abstractmethod

ndims()->int

Return the dimensions of the vector column

Source code inlancedb/embeddings/base.py
@abstractmethoddefndims(self)->int:"""    Return the dimensions of the vector column    """pass

SourceField

SourceField(**kwargs)

Creates a pydantic Field that can automatically annotate the source column for this embedding function.

Source code inlancedb/embeddings/base.py
defSourceField(self,**kwargs):"""    Creates a pydantic Field that can automatically annotate    the source column for this embedding function    """returnField(json_schema_extra={"source_column_for":self},**kwargs)

VectorField

VectorField(**kwargs)

Creates a pydantic Field that can automatically annotate the target vector column for this embedding function.

Source code inlancedb/embeddings/base.py
defVectorField(self,**kwargs):"""    Creates a pydantic Field that can automatically annotate    the target vector column for this embedding function    """returnField(json_schema_extra={"vector_column_for":self},**kwargs)

lancedb.embeddings.base.TextEmbeddingFunction

Bases:EmbeddingFunction

A callable ABC for embedding functions that take text as input

Source code inlancedb/embeddings/base.py
classTextEmbeddingFunction(EmbeddingFunction):"""    A callable ABC for embedding functions that take text as input    """defcompute_query_embeddings(self,query:str,*args,**kwargs)->list[Union[np.array,None]]:returnself.compute_source_embeddings(query,*args,**kwargs)defcompute_source_embeddings(self,texts:TEXT,*args,**kwargs)->list[Union[np.array,None]]:texts=self.sanitize_input(texts)returnself.generate_embeddings(texts)@abstractmethoddefgenerate_embeddings(self,texts:Union[List[str],np.ndarray],*args,**kwargs)->list[Union[np.array,None]]:"""Generate the embeddings for the given texts"""pass

generate_embeddings abstractmethod

generate_embeddings(texts:Union[List[str],ndarray],*args,**kwargs)->list[Union[array,None]]

Generate the embeddings for the given texts

Source code inlancedb/embeddings/base.py
@abstractmethoddefgenerate_embeddings(self,texts:Union[List[str],np.ndarray],*args,**kwargs)->list[Union[np.array,None]]:"""Generate the embeddings for the given texts"""pass
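
Putting this together, a custom text embedding function only needs ndims() and generate_embeddings(). The following toy subclass is a sketch; the alias and the random vectors are purely illustrative:

import numpy as np
from lancedb.embeddings import get_registry
from lancedb.embeddings.base import TextEmbeddingFunction

@get_registry().register("random-text")
class RandomTextEmbeddings(TextEmbeddingFunction):
    """Toy embedding function that returns random vectors."""

    def ndims(self) -> int:
        return 16

    def generate_embeddings(self, texts, *args, **kwargs):
        # One fixed-size vector per sanitized input text.
        return [np.random.rand(self.ndims()) for _ in texts]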

lancedb.embeddings.sentence_transformers.SentenceTransformerEmbeddings

Bases:TextEmbeddingFunction

An embedding function that uses the sentence-transformers library

https://huggingface.co/sentence-transformers

Parameters:

  • name (str, default:'all-MiniLM-L6-v2') –

    The name of the model to use.

  • device (str, default:'cpu') –

    The device to use for the model.

  • normalize (bool, default:True) –

    Whether to normalize the embeddings.

  • trust_remote_code (bool, default:True) –

    Whether to trust the remote code.

Source code inlancedb/embeddings/sentence_transformers.py
@register("sentence-transformers")classSentenceTransformerEmbeddings(TextEmbeddingFunction):"""    An embedding function that uses the sentence-transformers library    https://huggingface.co/sentence-transformers    Parameters    ----------    name: str, default "all-MiniLM-L6-v2"        The name of the model to use.    device: str, default "cpu"        The device to use for the model    normalize: bool, default True        Whether to normalize the embeddings    trust_remote_code: bool, default True        Whether to trust the remote code    """name:str="all-MiniLM-L6-v2"device:str="cpu"normalize:bool=Truetrust_remote_code:bool=Truedef__init__(self,**kwargs):super().__init__(**kwargs)self._ndims=None@propertydefembedding_model(self):"""        Get the sentence-transformers embedding model specified by the        name, device, and trust_remote_code. This is cached so that the        model is only loaded once per process.        """returnself.get_embedding_model()defndims(self):ifself._ndimsisNone:self._ndims=len(self.generate_embeddings("foo")[0])returnself._ndimsdefgenerate_embeddings(self,texts:Union[List[str],np.ndarray])->List[np.array]:"""        Get the embeddings for the given texts        Parameters        ----------        texts: list[str] or np.ndarray (of str)            The texts to embed        """returnself.embedding_model.encode(list(texts),convert_to_numpy=True,normalize_embeddings=self.normalize,).tolist()@weak_lru(maxsize=1)defget_embedding_model(self):"""        Get the sentence-transformers embedding model specified by the        name, device, and trust_remote_code. This is cached so that the        model is only loaded once per process.        TODO: use lru_cache instead with a reasonable/configurable maxsize        """sentence_transformers=attempt_import_or_raise("sentence_transformers","sentence-transformers")returnsentence_transformers.SentenceTransformer(self.name,device=self.device,trust_remote_code=self.trust_remote_code)

embedding_model property

embedding_model

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

generate_embeddings

generate_embeddings(texts:Union[List[str],ndarray])->List[array]

Get the embeddings for the given texts

Parameters:

  • texts (Union[List[str],ndarray]) –

    The texts to embed

Source code inlancedb/embeddings/sentence_transformers.py
defgenerate_embeddings(self,texts:Union[List[str],np.ndarray])->List[np.array]:"""    Get the embeddings for the given texts    Parameters    ----------    texts: list[str] or np.ndarray (of str)        The texts to embed    """returnself.embedding_model.encode(list(texts),convert_to_numpy=True,normalize_embeddings=self.normalize,).tolist()

get_embedding_model

get_embedding_model()

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

TODO: use lru_cache instead with a reasonable/configurable maxsize

Source code inlancedb/embeddings/sentence_transformers.py
@weak_lru(maxsize=1)defget_embedding_model(self):"""    Get the sentence-transformers embedding model specified by the    name, device, and trust_remote_code. This is cached so that the    model is only loaded once per process.    TODO: use lru_cache instead with a reasonable/configurable maxsize    """sentence_transformers=attempt_import_or_raise("sentence_transformers","sentence-transformers")returnsentence_transformers.SentenceTransformer(self.name,device=self.device,trust_remote_code=self.trust_remote_code)

lancedb.embeddings.openai.OpenAIEmbeddings

Bases:TextEmbeddingFunction

An embedding function that uses the OpenAI API

https://platform.openai.com/docs/guides/embeddings

This can also be used for open source models that are compatible with the OpenAI API.

Notes

If you're running an Ollama server locally, you can just override the base_url parameter and provide the Ollama embedding model you want to use (https://ollama.com/library):

fromlancedb.embeddingsimportget_registryopenai=get_registry().get("openai")embedding_function=openai.create(name="<ollama-embedding-model-name>",base_url="http://localhost:11434",)
Source code inlancedb/embeddings/openai.py
@register("openai")classOpenAIEmbeddings(TextEmbeddingFunction):"""    An embedding function that uses the OpenAI API    https://platform.openai.com/docs/guides/embeddings    This can also be used for open source models that    are compatible with the OpenAI API.    Notes    -----    If you're running an Ollama server locally,    you can just override the `base_url` parameter    and provide the Ollama embedding model you want    to use (https://ollama.com/library):    ```python    from lancedb.embeddings import get_registry    openai = get_registry().get("openai")    embedding_function = openai.create(        name="<ollama-embedding-model-name>",        base_url="http://localhost:11434",        )    ```    """name:str="text-embedding-ada-002"dim:Optional[int]=Nonebase_url:Optional[str]=Nonedefault_headers:Optional[dict]=Noneorganization:Optional[str]=Noneapi_key:Optional[str]=None# Set true to use Azure OpenAI APIuse_azure:bool=Falsedefndims(self):returnself._ndims@staticmethoddefsensitive_keys():return["api_key"]@staticmethoddefmodel_names():return["text-embedding-ada-002","text-embedding-3-large","text-embedding-3-small",]@cached_propertydef_ndims(self):ifself.name=="text-embedding-ada-002":return1536elifself.name=="text-embedding-3-large":returnself.dimor3072elifself.name=="text-embedding-3-small":returnself.dimor1536else:raiseValueError(f"Unknown model name{self.name}")defgenerate_embeddings(self,texts:Union[List[str],"np.ndarray"])->List["np.array"]:"""        Get the embeddings for the given texts        Parameters        ----------        texts: list[str] or np.ndarray (of str)            The texts to embed        """openai=attempt_import_or_raise("openai")valid_texts=[]valid_indices=[]foridx,textinenumerate(texts):iftext:valid_texts.append(text)valid_indices.append(idx)# TODO retry, rate limit, token limittry:kwargs={"input":valid_texts,"model":self.name,}ifself.name!="text-embedding-ada-002":kwargs["dimensions"]=self.dimrs=self._openai_client.embeddings.create(**kwargs)valid_embeddings={idx:v.embeddingforv,idxinzip(rs.data,valid_indices)}exceptopenai.BadRequestError:logging.exception("Bad request:%s",texts)return[None]*len(texts)exceptException:logging.exception("OpenAI embeddings error")raisereturn[valid_embeddings.get(idx,None)foridxinrange(len(texts))]@cached_propertydef_openai_client(self):openai=attempt_import_or_raise("openai")kwargs={}ifself.base_url:kwargs["base_url"]=self.base_urlifself.default_headers:kwargs["default_headers"]=self.default_headersifself.organization:kwargs["organization"]=self.organizationifself.api_key:kwargs["api_key"]=self.api_keyifself.use_azure:returnopenai.AzureOpenAI(**kwargs)else:returnopenai.OpenAI(**kwargs)

generate_embeddings

generate_embeddings(texts:Union[List[str],ndarray])->List[array]

Get the embeddings for the given texts

Parameters:

  • texts (Union[List[str],ndarray]) –

    The texts to embed

Source code inlancedb/embeddings/openai.py
defgenerate_embeddings(self,texts:Union[List[str],"np.ndarray"])->List["np.array"]:"""    Get the embeddings for the given texts    Parameters    ----------    texts: list[str] or np.ndarray (of str)        The texts to embed    """openai=attempt_import_or_raise("openai")valid_texts=[]valid_indices=[]foridx,textinenumerate(texts):iftext:valid_texts.append(text)valid_indices.append(idx)# TODO retry, rate limit, token limittry:kwargs={"input":valid_texts,"model":self.name,}ifself.name!="text-embedding-ada-002":kwargs["dimensions"]=self.dimrs=self._openai_client.embeddings.create(**kwargs)valid_embeddings={idx:v.embeddingforv,idxinzip(rs.data,valid_indices)}exceptopenai.BadRequestError:logging.exception("Bad request:%s",texts)return[None]*len(texts)exceptException:logging.exception("OpenAI embeddings error")raisereturn[valid_embeddings.get(idx,None)foridxinrange(len(texts))]

lancedb.embeddings.open_clip.OpenClipEmbeddings

Bases:EmbeddingFunction

An embedding function that uses the OpenClip API for multi-modal text-to-image search.

https://github.com/mlfoundations/open_clip

Source code inlancedb/embeddings/open_clip.py
@register("open-clip")classOpenClipEmbeddings(EmbeddingFunction):"""    An embedding function that uses the OpenClip API    For multi-modal text-to-image search    https://github.com/mlfoundations/open_clip    """name:str="ViT-B-32"pretrained:str="laion2b_s34b_b79k"device:str="cpu"batch_size:int=64normalize:bool=True_model=PrivateAttr()_preprocess=PrivateAttr()_tokenizer=PrivateAttr()def__init__(self,*args,**kwargs):super().__init__(*args,**kwargs)open_clip=attempt_import_or_raise("open_clip","open-clip")model,_,preprocess=open_clip.create_model_and_transforms(self.name,pretrained=self.pretrained)model.to(self.device)self._model,self._preprocess=model,preprocessself._tokenizer=open_clip.get_tokenizer(self.name)self._ndims=Nonedefndims(self):ifself._ndimsisNone:self._ndims=self.generate_text_embeddings("foo").shape[0]returnself._ndimsdefcompute_query_embeddings(self,query:Union[str,"PIL.Image.Image"],*args,**kwargs)->List[np.ndarray]:"""        Compute the embeddings for a given user query        Parameters        ----------        query : Union[str, PIL.Image.Image]            The query to embed. A query can be either text or an image.        """ifisinstance(query,str):return[self.generate_text_embeddings(query)]else:PIL=attempt_import_or_raise("PIL","pillow")ifisinstance(query,PIL.Image.Image):return[self.generate_image_embedding(query)]else:raiseTypeError("OpenClip supports str or PIL Image as query")defgenerate_text_embeddings(self,text:str)->np.ndarray:torch=attempt_import_or_raise("torch")text=self.sanitize_input(text)text=self._tokenizer(text)text.to(self.device)withtorch.no_grad():text_features=self._model.encode_text(text.to(self.device))ifself.normalize:text_features/=text_features.norm(dim=-1,keepdim=True)returntext_features.cpu().numpy().squeeze()defsanitize_input(self,images:IMAGES)->Union[List[bytes],np.ndarray]:"""        Sanitize the input to the embedding function.        """ifisinstance(images,(str,bytes)):images=[images]elifisinstance(images,pa.Array):images=images.to_pylist()elifisinstance(images,pa.ChunkedArray):images=images.combine_chunks().to_pylist()returnimagesdefcompute_source_embeddings(self,images:IMAGES,*args,**kwargs)->List[np.array]:"""        Get the embeddings for the given images        """images=self.sanitize_input(images)embeddings=[]foriinrange(0,len(images),self.batch_size):j=min(i+self.batch_size,len(images))batch=images[i:j]embeddings.extend(self._parallel_get(batch))returnembeddingsdef_parallel_get(self,images:Union[List[str],List[bytes]])->List[np.ndarray]:"""        Issue concurrent requests to retrieve the image data        """withconcurrent.futures.ThreadPoolExecutor()asexecutor:futures=[executor.submit(self.generate_image_embedding,image)forimageinimages]return[future.result()forfutureintqdm(futures)]defgenerate_image_embedding(self,image:Union[str,bytes,"PIL.Image.Image"])->np.ndarray:"""        Generate the embedding for a single image        Parameters        ----------        image : Union[str, bytes, PIL.Image.Image]            The image to embed. If the image is a str, it is treated as a uri.            If the image is bytes, it is treated as the raw image bytes.        
"""torch=attempt_import_or_raise("torch")# TODO handle retry and errors for httpsimage=self._to_pil(image)image=self._preprocess(image).unsqueeze(0)withtorch.no_grad():returnself._encode_and_normalize_image(image)def_to_pil(self,image:Union[str,bytes]):PIL=attempt_import_or_raise("PIL","pillow")ifisinstance(image,bytes):returnPIL.Image.open(io.BytesIO(image))ifisinstance(image,PIL.Image.Image):returnimageelifisinstance(image,str):parsed=urlparse.urlparse(image)# TODO handle drive letter on windows.ifparsed.scheme=="file":returnPIL.Image.open(parsed.path)elifparsed.scheme=="":returnPIL.Image.open(imageifos.name=="nt"elseparsed.path)elifparsed.scheme.startswith("http"):returnPIL.Image.open(io.BytesIO(url_retrieve(image)))else:raiseNotImplementedError("Only local and http(s) urls are supported")def_encode_and_normalize_image(self,image_tensor:"torch.Tensor"):"""        encode a single image tensor and optionally normalize the output        """image_features=self._model.encode_image(image_tensor.to(self.device))ifself.normalize:image_features/=image_features.norm(dim=-1,keepdim=True)returnimage_features.cpu().numpy().squeeze()

compute_query_embeddings

compute_query_embeddings(query:Union[str,Image],*args,**kwargs)->List[ndarray]

Compute the embeddings for a given user query

Parameters:

  • query (Union[str,Image]) –

    The query to embed. A query can be either text or an image.

Source code inlancedb/embeddings/open_clip.py
defcompute_query_embeddings(self,query:Union[str,"PIL.Image.Image"],*args,**kwargs)->List[np.ndarray]:"""    Compute the embeddings for a given user query    Parameters    ----------    query : Union[str, PIL.Image.Image]        The query to embed. A query can be either text or an image.    """ifisinstance(query,str):return[self.generate_text_embeddings(query)]else:PIL=attempt_import_or_raise("PIL","pillow")ifisinstance(query,PIL.Image.Image):return[self.generate_image_embedding(query)]else:raiseTypeError("OpenClip supports str or PIL Image as query")

sanitize_input

sanitize_input(images:IMAGES)->Union[List[bytes],ndarray]

Sanitize the input to the embedding function.

Source code inlancedb/embeddings/open_clip.py
defsanitize_input(self,images:IMAGES)->Union[List[bytes],np.ndarray]:"""    Sanitize the input to the embedding function.    """ifisinstance(images,(str,bytes)):images=[images]elifisinstance(images,pa.Array):images=images.to_pylist()elifisinstance(images,pa.ChunkedArray):images=images.combine_chunks().to_pylist()returnimages

compute_source_embeddings

compute_source_embeddings(images:IMAGES,*args,**kwargs)->List[array]

Get the embeddings for the given images

Source code inlancedb/embeddings/open_clip.py
defcompute_source_embeddings(self,images:IMAGES,*args,**kwargs)->List[np.array]:"""    Get the embeddings for the given images    """images=self.sanitize_input(images)embeddings=[]foriinrange(0,len(images),self.batch_size):j=min(i+self.batch_size,len(images))batch=images[i:j]embeddings.extend(self._parallel_get(batch))returnembeddings

generate_image_embedding

generate_image_embedding(image:Union[str,bytes,Image])->ndarray

Generate the embedding for a single image

Parameters:

  • image (Union[str,bytes,Image]) –

    The image to embed. If the image is a str, it is treated as a uri. If the image is bytes, it is treated as the raw image bytes.

Source code inlancedb/embeddings/open_clip.py
defgenerate_image_embedding(self,image:Union[str,bytes,"PIL.Image.Image"])->np.ndarray:"""    Generate the embedding for a single image    Parameters    ----------    image : Union[str, bytes, PIL.Image.Image]        The image to embed. If the image is a str, it is treated as a uri.        If the image is bytes, it is treated as the raw image bytes.    """torch=attempt_import_or_raise("torch")# TODO handle retry and errors for httpsimage=self._to_pil(image)image=self._preprocess(image).unsqueeze(0)withtorch.no_grad():returnself._encode_and_normalize_image(image)

Context

lancedb.context.contextualize

contextualize(raw_df:'pd.DataFrame')->Contextualizer

Create a Contextualizer object for the given DataFrame.

Used to create context windows. Context windows are rolling subsets of text data.

The input text column should already be separated into rows that will be the unit of the window. So to create a context window over tokens, start with a DataFrame with one token per row. To create a context window over sentences, start with a DataFrame with one sentence per row.

Examples:

>>>fromlancedb.contextimportcontextualize>>>importpandasaspd>>>data=pd.DataFrame({...'token':['The','quick','brown','fox','jumped','over',...'the','lazy','dog','I','love','sandwiches'],...'document_id':[1,1,1,1,1,1,1,1,1,2,2,2]...})

window determines how many rows to include in each window. In our case this is how many tokens, but depending on the input data, it could be sentences, paragraphs, messages, etc.

>>>contextualize(data).window(3).stride(1).text_col('token').to_pandas()                token  document_id0     The quick brown            11     quick brown fox            12    brown fox jumped            13     fox jumped over            14     jumped over the            15       over the lazy            16        the lazy dog            17          lazy dog I            18          dog I love            19   I love sandwiches            210    love sandwiches            2>>>(contextualize(data).window(7).stride(1).min_window_size(7)....text_col('token').to_pandas())                                  token  document_id0   The quick brown fox jumped over the            11  quick brown fox jumped over the lazy            12    brown fox jumped over the lazy dog            13        fox jumped over the lazy dog I            14       jumped over the lazy dog I love            15   over the lazy dog I love sandwiches            1

stride determines how many rows to skip between each window start. This can be used to reduce the total number of windows generated.

>>>contextualize(data).window(4).stride(2).text_col('token').to_pandas()                    token  document_id0     The quick brown fox            12   brown fox jumped over            14    jumped over the lazy            16          the lazy dog I            18   dog I love sandwiches            110        love sandwiches            2

groupby determines how to group the rows. For example, we would like to have context windows that don't cross document boundaries. In this case, we can pass document_id as the group by.

>>>(contextualize(data)....window(4).stride(2).text_col('token').groupby('document_id')....to_pandas())                   token  document_id0    The quick brown fox            12  brown fox jumped over            14   jumped over the lazy            16           the lazy dog            19      I love sandwiches            2

min_window_size determines the minimum size of the context windows that are generated. This can be used to trim the last few context windows which have size less than min_window_size. By default, context windows of size 1 are skipped.

>>>(contextualize(data)....window(6).stride(3).text_col('token').groupby('document_id')....to_pandas())                             token  document_id0  The quick brown fox jumped over            13     fox jumped over the lazy dog            16                     the lazy dog            19                I love sandwiches            2
>>>(contextualize(data)....window(6).stride(3).min_window_size(4).text_col('token')....groupby('document_id')....to_pandas())                             token  document_id0  The quick brown fox jumped over            13     fox jumped over the lazy dog            1
Source code inlancedb/context.py
defcontextualize(raw_df:"pd.DataFrame")->Contextualizer:"""Create a Contextualizer object for the given DataFrame.    Used to create context windows. Context windows are rolling subsets of text    data.    The input text column should already be separated into rows that will be the    unit of the window. So to create a context window over tokens, start with    a DataFrame with one token per row. To create a context window over sentences,    start with a DataFrame with one sentence per row.    Examples    --------    >>> from lancedb.context import contextualize    >>> import pandas as pd    >>> data = pd.DataFrame({    ...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',    ...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],    ...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]    ... })    ``window`` determines how many rows to include in each window. In our case    this how many tokens, but depending on the input data, it could be sentences,    paragraphs, messages, etc.    >>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()                    token  document_id    0     The quick brown            1    1     quick brown fox            1    2    brown fox jumped            1    3     fox jumped over            1    4     jumped over the            1    5       over the lazy            1    6        the lazy dog            1    7          lazy dog I            1    8          dog I love            1    9   I love sandwiches            2    10    love sandwiches            2    >>> (contextualize(data).window(7).stride(1).min_window_size(7)    ...   .text_col('token').to_pandas())                                      token  document_id    0   The quick brown fox jumped over the            1    1  quick brown fox jumped over the lazy            1    2    brown fox jumped over the lazy dog            1    3        fox jumped over the lazy dog I            1    4       jumped over the lazy dog I love            1    5   over the lazy dog I love sandwiches            1    ``stride`` determines how many rows to skip between each window start. This can    be used to reduce the total number of windows generated.    >>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()                        token  document_id    0     The quick brown fox            1    2   brown fox jumped over            1    4    jumped over the lazy            1    6          the lazy dog I            1    8   dog I love sandwiches            1    10        love sandwiches            2    ``groupby`` determines how to group the rows. For example, we would like to have    context windows that don't cross document boundaries. In this case, we can    pass ``document_id`` as the group by.    >>> (contextualize(data)    ...     .window(4).stride(2).text_col('token').groupby('document_id')    ...     .to_pandas())                       token  document_id    0    The quick brown fox            1    2  brown fox jumped over            1    4   jumped over the lazy            1    6           the lazy dog            1    9      I love sandwiches            2    ``min_window_size`` determines the minimum size of the context windows    that are generated.This can be used to trim the last few context windows    which have size less than ``min_window_size``.    By default context windows of size 1 are skipped.    >>> (contextualize(data)    ...     .window(6).stride(3).text_col('token').groupby('document_id')    ...     
.to_pandas())                                 token  document_id    0  The quick brown fox jumped over            1    3     fox jumped over the lazy dog            1    6                     the lazy dog            1    9                I love sandwiches            2    >>> (contextualize(data)    ...     .window(6).stride(3).min_window_size(4).text_col('token')    ...     .groupby('document_id')    ...     .to_pandas())                                 token  document_id    0  The quick brown fox jumped over            1    3     fox jumped over the lazy dog            1    """returnContextualizer(raw_df)

lancedb.context.Contextualizer

Create context windows from a DataFrame. See lancedb.context.contextualize.

Source code inlancedb/context.py
classContextualizer:"""Create context windows from a DataFrame.    See [lancedb.context.contextualize][].    """def__init__(self,raw_df):self._text_col=Noneself._groupby=Noneself._stride=Noneself._window=Noneself._min_window_size=2self._raw_df=raw_dfdefwindow(self,window:int)->Contextualizer:"""Set the window size. i.e., how many rows to include in each window.        Parameters        ----------        window: int            The window size.        """self._window=windowreturnselfdefstride(self,stride:int)->Contextualizer:"""Set the stride. i.e., how many rows to skip between each window.        Parameters        ----------        stride: int            The stride.        """self._stride=stridereturnselfdefgroupby(self,groupby:str)->Contextualizer:"""Set the groupby column. i.e., how to group the rows.        Windows don't cross groups        Parameters        ----------        groupby: str            The groupby column.        """self._groupby=groupbyreturnselfdeftext_col(self,text_col:str)->Contextualizer:"""Set the text column used to make the context window.        Parameters        ----------        text_col: str            The text column.        """self._text_col=text_colreturnselfdefmin_window_size(self,min_window_size:int)->Contextualizer:"""Set the (optional) min_window_size size for the context window.        Parameters        ----------        min_window_size: int            The min_window_size.        """self._min_window_size=min_window_sizereturnself@deprecation.deprecated(deprecated_in="0.3.1",removed_in="0.4.0",current_version=__version__,details="Use to_pandas() instead",)defto_df(self)->"pd.DataFrame":returnself.to_pandas()defto_pandas(self)->"pd.DataFrame":"""Create the context windows and return a DataFrame."""ifpdisNone:raiseImportError("pandas is required to create context windows using lancedb")ifself._text_colnotinself._raw_df.columns.tolist():raiseMissingColumnError(self._text_col)ifself._windowisNoneorself._window<1:raiseMissingValueError("The value of window is None or less than 1. Specify the ""window size (number of rows to include in each window)")ifself._strideisNoneorself._stride<1:raiseMissingValueError("The value of stride is None or less than 1. Specify the ""stride (number of rows to skip between each window)")defprocess_group(grp):# For each group, create the text rolling window# with values of size >= min_window_sizetext=grp[self._text_col].valuescontexts=grp.iloc[::self._stride,:].copy()windows=[" ".join(text[start_i:min(start_i+self._window,len(grp))])forstart_iinrange(0,len(grp),self._stride)ifstart_i+self._window<=len(grp)orlen(grp)-start_i>=self._min_window_size]# if last few rows droppediflen(windows)<len(contexts):contexts=contexts.iloc[:len(windows)]contexts[self._text_col]=windowsreturncontextsifself._groupbyisNone:returnprocess_group(self._raw_df)# concat result from all groupsreturnpd.concat([process_group(grp)for_,grpinself._raw_df.groupby(self._groupby)])

window

window(window:int)->Contextualizer

Set the window size, i.e., how many rows to include in each window.

Parameters:

  • window (int) –

    The window size.

Source code inlancedb/context.py
defwindow(self,window:int)->Contextualizer:"""Set the window size. i.e., how many rows to include in each window.    Parameters    ----------    window: int        The window size.    """self._window=windowreturnself

stride

stride(stride:int)->Contextualizer

Set the stride, i.e., how many rows to skip between each window.

Parameters:

  • stride (int) –

    The stride.

Source code inlancedb/context.py
defstride(self,stride:int)->Contextualizer:"""Set the stride. i.e., how many rows to skip between each window.    Parameters    ----------    stride: int        The stride.    """self._stride=stridereturnself

groupby

groupby(groupby:str)->Contextualizer

Set the groupby column, i.e., how to group the rows. Windows don't cross groups.

Parameters:

  • groupby (str) –

    The groupby column.

Source code inlancedb/context.py
defgroupby(self,groupby:str)->Contextualizer:"""Set the groupby column. i.e., how to group the rows.    Windows don't cross groups    Parameters    ----------    groupby: str        The groupby column.    """self._groupby=groupbyreturnself

text_col

text_col(text_col:str)->Contextualizer

Set the text column used to make the context window.

Parameters:

  • text_col (str) –

    The text column.

Source code inlancedb/context.py
deftext_col(self,text_col:str)->Contextualizer:"""Set the text column used to make the context window.    Parameters    ----------    text_col: str        The text column.    """self._text_col=text_colreturnself

min_window_size

min_window_size(min_window_size:int)->Contextualizer

Set the (optional) minimum window size for the context window.

Parameters:

  • min_window_size (int) –

    The minimum window size.

Source code inlancedb/context.py
defmin_window_size(self,min_window_size:int)->Contextualizer:"""Set the (optional) min_window_size size for the context window.    Parameters    ----------    min_window_size: int        The min_window_size.    """self._min_window_size=min_window_sizereturnself

to_pandas

to_pandas()->'pd.DataFrame'

Create the context windows and return a DataFrame.

Source code inlancedb/context.py
defto_pandas(self)->"pd.DataFrame":"""Create the context windows and return a DataFrame."""ifpdisNone:raiseImportError("pandas is required to create context windows using lancedb")ifself._text_colnotinself._raw_df.columns.tolist():raiseMissingColumnError(self._text_col)ifself._windowisNoneorself._window<1:raiseMissingValueError("The value of window is None or less than 1. Specify the ""window size (number of rows to include in each window)")ifself._strideisNoneorself._stride<1:raiseMissingValueError("The value of stride is None or less than 1. Specify the ""stride (number of rows to skip between each window)")defprocess_group(grp):# For each group, create the text rolling window# with values of size >= min_window_sizetext=grp[self._text_col].valuescontexts=grp.iloc[::self._stride,:].copy()windows=[" ".join(text[start_i:min(start_i+self._window,len(grp))])forstart_iinrange(0,len(grp),self._stride)ifstart_i+self._window<=len(grp)orlen(grp)-start_i>=self._min_window_size]# if last few rows droppediflen(windows)<len(contexts):contexts=contexts.iloc[:len(windows)]contexts[self._text_col]=windowsreturncontextsifself._groupbyisNone:returnprocess_group(self._raw_df)# concat result from all groupsreturnpd.concat([process_group(grp)for_,grpinself._raw_df.groupby(self._groupby)])

Full text search

lancedb.fts.create_index

create_index(index_path:str,text_fields:List[str],ordering_fields:Optional[List[str]]=None,tokenizer_name:str='default')->Index

Create a new Index (not populated)

Parameters:

  • index_path (str) –

    Path to the index directory

  • text_fields (List[str]) –

    List of text fields to index

  • ordering_fields (Optional[List[str]], default:None) –

    List of unsigned type fields to order by at search time

  • tokenizer_name (str, default:"default") –

    The tokenizer to use

Returns:

  • index (Index) –

    The index object (not yet populated)

Source code inlancedb/fts.py
defcreate_index(index_path:str,text_fields:List[str],ordering_fields:Optional[List[str]]=None,tokenizer_name:str="default",)->tantivy.Index:"""    Create a new Index (not populated)    Parameters    ----------    index_path : str        Path to the index directory    text_fields : List[str]        List of text fields to index    ordering_fields: List[str]        List of unsigned type fields to order by at search time    tokenizer_name : str, default "default"        The tokenizer to use    Returns    -------    index : tantivy.Index        The index object (not yet populated)    """ifordering_fieldsisNone:ordering_fields=[]# Declaring our schema.schema_builder=tantivy.SchemaBuilder()# special field that we'll populate with row_idschema_builder.add_integer_field("doc_id",stored=True)# data fieldsfornameintext_fields:schema_builder.add_text_field(name,stored=True,tokenizer_name=tokenizer_name)ifordering_fields:fornameinordering_fields:schema_builder.add_unsigned_field(name,fast=True)schema=schema_builder.build()os.makedirs(index_path,exist_ok=True)index=tantivy.Index(schema,path=index_path)returnindex
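
For illustration, a minimal sketch of creating an (empty) index; the index path and the "text" column name are placeholders, and the tantivy-based FTS extra must be installed:

from lancedb.fts import create_index

# Build an empty tantivy index that will store a "text" field.
# The path and column name below are placeholder examples.
index = create_index("/tmp/example-fts-index", ["text"])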

lancedb.fts.populate_index

populate_index(index:Index,table:LanceTable,fields:List[str],writer_heap_size:Optional[int]=None,ordering_fields:Optional[List[str]]=None)->int

Populate an index with data from a LanceTable

Parameters:

  • index (Index) –

    The index object

  • table (LanceTable) –

    The table to index

  • fields (List[str]) –

    List of fields to index

  • writer_heap_size (int, default:None) –

    The writer heap size in bytes, defaults to 1GB

Returns:

  • int

    The number of rows indexed

Source code inlancedb/fts.py
defpopulate_index(index:tantivy.Index,table:LanceTable,fields:List[str],writer_heap_size:Optional[int]=None,ordering_fields:Optional[List[str]]=None,)->int:"""    Populate an index with data from a LanceTable    Parameters    ----------    index : tantivy.Index        The index object    table : LanceTable        The table to index    fields : List[str]        List of fields to index    writer_heap_size : int        The writer heap size in bytes, defaults to 1GB    Returns    -------    int        The number of rows indexed    """ifordering_fieldsisNone:ordering_fields=[]writer_heap_size=writer_heap_sizeor1024*1024*1024# first check the fields exist and are string or large string typenested=[]fornameinfields:try:f=table.schema.field(name)# raises KeyError if not foundexceptKeyError:f=resolve_path(table.schema,name)nested.append(name)ifnotpa.types.is_string(f.type)andnotpa.types.is_large_string(f.type):raiseTypeError(f"Field{name} is not a string type")# create a tantivy writerwriter=index.writer(heap_size=writer_heap_size)# write data into indexdataset=table.to_lance()row_id=0max_nested_level=0iflen(nested)>0:max_nested_level=max([len(name.split("."))fornameinnested])forbindataset.to_batches(columns=fields+ordering_fields):ifmax_nested_level>0:b=pa.Table.from_batches([b])for_inrange(max_nested_level-1):b=b.flatten()foriinrange(b.num_rows):doc=tantivy.Document()fornameinfields:value=b[name][i].as_py()ifvalueisnotNone:doc.add_text(name,value)fornameinordering_fields:value=b[name][i].as_py()ifvalueisnotNone:doc.add_unsigned(name,value)ifnotdoc.is_empty:doc.add_integer("doc_id",row_id)writer.add_document(doc)row_id+=1# commit changeswriter.commit()returnrow_id
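
A sketch of populating the index from a local table; the database path, table name, and data are placeholders, and the table is the synchronous OSS LanceTable that populate_index expects:

import lancedb
from lancedb.fts import create_index, populate_index

# Placeholder database and table with a "text" column.
db = lancedb.connect("/tmp/example-lancedb")
table = db.create_table(
    "docs",
    [{"text": "the quick brown fox"}, {"text": "the lazy dog"}],
    mode="overwrite",
)

index = create_index("/tmp/example-fts-index", ["text"])
num_rows = populate_index(index, table, ["text"])  # returns the number of rows indexed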

lancedb.fts.search_index

search_index(index:Index,query:str,limit:int=10,ordering_field=None)->Tuple[Tuple[int],Tuple[float]]

Search an index for a query

Parameters:

  • index (Index) –

    The index object

  • query (str) –

    The query string

  • limit (int, default:10) –

    The maximum number of results to return

Returns:

  • ids_and_score (list[tuple[int],tuple[float]]) –

    A tuple of two tuples, the first containing the document ids and the second containing the scores.

Source code inlancedb/fts.py
defsearch_index(index:tantivy.Index,query:str,limit:int=10,ordering_field=None)->Tuple[Tuple[int],Tuple[float]]:"""    Search an index for a query    Parameters    ----------    index : tantivy.Index        The index object    query : str        The query string    limit : int        The maximum number of results to return    Returns    -------    ids_and_score: list[tuple[int], tuple[float]]        A tuple of two tuples, the first containing the document ids        and the second containing the scores    """searcher=index.searcher()query=index.parse_query(query)# get top resultsifordering_field:results=searcher.search(query,limit,order_by_field=ordering_field)else:results=searcher.search(query,limit)ifresults.count==0:returntuple(),tuple()returntuple(zip(*[(searcher.doc(doc_address)["doc_id"][0],score)forscore,doc_addressinresults.hits]))
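
Continuing the sketch above, searching the populated index returns parallel tuples of row ids and scores:

from lancedb.fts import search_index

# `index` is the populated index from the previous example.
row_ids, scores = search_index(index, "quick fox", limit=5)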

Utilities

lancedb.schema.vector

vector(dimension:int,value_type:DataType=pa.float32())->DataType

A helper function to create a vector type.

Parameters:

  • dimension (int) –

    The dimension of the vector.

  • value_type (DataType, default:float32()) –

    The type of the value in the vector.

Returns:

  • A PyArrow DataType for vectors.

Examples:

>>> import pyarrow as pa
>>> import lancedb
>>> schema = pa.schema([
...     pa.field("id", pa.int64()),
...     pa.field("vector", lancedb.vector(756)),
... ])
Source code inlancedb/schema.py
defvector(dimension:int,value_type:pa.DataType=pa.float32())->pa.DataType:"""A help function to create a vector type.    Parameters    ----------    dimension: The dimension of the vector.    value_type: pa.DataType, optional        The type of the value in the vector.    Returns    -------    A PyArrow DataType for vectors.    Examples    --------    >>> import pyarrow as pa    >>> import lancedb    >>> schema = pa.schema([    ...     pa.field("id", pa.int64()),    ...     pa.field("vector", lancedb.vector(756)),    ... ])    """returnpa.list_(value_type,dimension)

lancedb.merge.LanceMergeInsertBuilder

Bases:object

Builder for a LanceDB merge insert operation

See merge_insert for more context.

Source code inlancedb/merge.py
classLanceMergeInsertBuilder(object):"""Builder for a LanceDB merge insert operation    See [`merge_insert`][lancedb.table.Table.merge_insert] for    more context    """def__init__(self,table:"Table",on:List[str]):# noqa: F821# Do not put a docstring here.  This method should be hidden# from API docs.  Users should use merge_insert to create# this object.self._table=tableself._on=onself._when_matched_update_all=Falseself._when_matched_update_all_condition=Noneself._when_not_matched_insert_all=Falseself._when_not_matched_by_source_delete=Falseself._when_not_matched_by_source_condition=Noneself._timeout=Nonedefwhen_matched_update_all(self,*,where:Optional[str]=None)->LanceMergeInsertBuilder:"""        Rows that exist in both the source table (new data) and        the target table (old data) will be updated, replacing        the old row with the corresponding matching row.        If there are multiple matches then the behavior is undefined.        Currently this causes multiple copies of the row to be created        but that behavior is subject to change.        """self._when_matched_update_all=Trueself._when_matched_update_all_condition=wherereturnselfdefwhen_not_matched_insert_all(self)->LanceMergeInsertBuilder:"""        Rows that exist only in the source table (new data) should        be inserted into the target table.        """self._when_not_matched_insert_all=Truereturnselfdefwhen_not_matched_by_source_delete(self,condition:Optional[str]=None)->LanceMergeInsertBuilder:"""        Rows that exist only in the target table (old data) will be        deleted.  An optional condition can be provided to limit what        data is deleted.        Parameters        ----------        condition: Optional[str], default None            If None then all such rows will be deleted.  Otherwise the            condition will be used as an SQL filter to limit what rows            are deleted.        """self._when_not_matched_by_source_delete=TrueifconditionisnotNone:self._when_not_matched_by_source_condition=conditionreturnselfdefexecute(self,new_data:DATA,on_bad_vectors:str="error",fill_value:float=0.0,timeout:Optional[timedelta]=None,)->MergeInsertResult:"""        Executes the merge insert operation        Nothing is returned but the [`Table`][lancedb.table.Table] is updated        Parameters        ----------        new_data: DATA            New records which will be matched against the existing records            to potentially insert or update into the table.  This parameter            can be anything you use for [`add`][lancedb.table.Table.add]        on_bad_vectors: str, default "error"            What to do if any of the vectors are not the same size or contains NaNs.            One of "error", "drop", "fill".        fill_value: float, default 0.            The value to use when filling vectors. Only used if on_bad_vectors="fill".        timeout: Optional[timedelta], default None            Maximum time to run the operation before cancelling it.            By default, there is a 30-second timeout that is only enforced after the            first attempt. This is to prevent spending too long retrying to resolve            conflicts. For example, if a write attempt takes 20 seconds and fails,            the second attempt will be cancelled after 10 seconds, hitting the            30-second timeout. However, a write that takes one hour and succeeds on the            first attempt will not be cancelled.            When this is set, the timeout is enforced on all attempts, including            the first.  
      Returns        -------        MergeInsertResult            version: the new version number of the table after doing merge insert.        """iftimeoutisnotNone:self._timeout=timeoutreturnself._table._do_merge(self,new_data,on_bad_vectors,fill_value)

when_matched_update_all

when_matched_update_all(*,where:Optional[str]=None)->LanceMergeInsertBuilder

Rows that exist in both the source table (new data) and the target table (old data) will be updated, replacing the old row with the corresponding matching row.

If there are multiple matches then the behavior is undefined. Currently this causes multiple copies of the row to be created, but that behavior is subject to change.

Source code inlancedb/merge.py
defwhen_matched_update_all(self,*,where:Optional[str]=None)->LanceMergeInsertBuilder:"""    Rows that exist in both the source table (new data) and    the target table (old data) will be updated, replacing    the old row with the corresponding matching row.    If there are multiple matches then the behavior is undefined.    Currently this causes multiple copies of the row to be created    but that behavior is subject to change.    """self._when_matched_update_all=Trueself._when_matched_update_all_condition=wherereturnself

when_not_matched_insert_all

when_not_matched_insert_all()->LanceMergeInsertBuilder

Rows that exist only in the source table (new data) should be inserted into the target table.

Source code inlancedb/merge.py
defwhen_not_matched_insert_all(self)->LanceMergeInsertBuilder:"""    Rows that exist only in the source table (new data) should    be inserted into the target table.    """self._when_not_matched_insert_all=Truereturnself

when_not_matched_by_source_delete

when_not_matched_by_source_delete(condition:Optional[str]=None)->LanceMergeInsertBuilder

Rows that exist only in the target table (old data) will be deleted. An optional condition can be provided to limit what data is deleted.

Parameters:

  • condition (Optional[str], default:None) –

    If None then all such rows will be deleted. Otherwise the condition will be used as an SQL filter to limit what rows are deleted.

Source code inlancedb/merge.py
defwhen_not_matched_by_source_delete(self,condition:Optional[str]=None)->LanceMergeInsertBuilder:"""    Rows that exist only in the target table (old data) will be    deleted.  An optional condition can be provided to limit what    data is deleted.    Parameters    ----------    condition: Optional[str], default None        If None then all such rows will be deleted.  Otherwise the        condition will be used as an SQL filter to limit what rows        are deleted.    """self._when_not_matched_by_source_delete=TrueifconditionisnotNone:self._when_not_matched_by_source_condition=conditionreturnself

execute

execute(new_data:DATA,on_bad_vectors:str='error',fill_value:float=0.0,timeout:Optional[timedelta]=None)->MergeInsertResult

Executes the merge insert operation

The Table is updated in place and a MergeInsertResult describing the new table version is returned.

Parameters:

  • new_data (DATA) –

    New records which will be matched against the existing records to potentially insert or update into the table. This parameter can be anything you use for add.

  • on_bad_vectors (str, default:'error') –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default:0.0) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • timeout (Optional[timedelta], default:None) –

    Maximum time to run the operation before cancelling it.

    By default, there is a 30-second timeout that is only enforced after the first attempt. This is to prevent spending too long retrying to resolve conflicts. For example, if a write attempt takes 20 seconds and fails, the second attempt will be cancelled after 10 seconds, hitting the 30-second timeout. However, a write that takes one hour and succeeds on the first attempt will not be cancelled.

    When this is set, the timeout is enforced on all attempts, including the first.

Returns:

  • MergeInsertResult

    version: the new version number of the table after doing merge insert.

Source code inlancedb/merge.py
defexecute(self,new_data:DATA,on_bad_vectors:str="error",fill_value:float=0.0,timeout:Optional[timedelta]=None,)->MergeInsertResult:"""    Executes the merge insert operation    Nothing is returned but the [`Table`][lancedb.table.Table] is updated    Parameters    ----------    new_data: DATA        New records which will be matched against the existing records        to potentially insert or update into the table.  This parameter        can be anything you use for [`add`][lancedb.table.Table.add]    on_bad_vectors: str, default "error"        What to do if any of the vectors are not the same size or contains NaNs.        One of "error", "drop", "fill".    fill_value: float, default 0.        The value to use when filling vectors. Only used if on_bad_vectors="fill".    timeout: Optional[timedelta], default None        Maximum time to run the operation before cancelling it.        By default, there is a 30-second timeout that is only enforced after the        first attempt. This is to prevent spending too long retrying to resolve        conflicts. For example, if a write attempt takes 20 seconds and fails,        the second attempt will be cancelled after 10 seconds, hitting the        30-second timeout. However, a write that takes one hour and succeeds on the        first attempt will not be cancelled.        When this is set, the timeout is enforced on all attempts, including        the first.    Returns    -------    MergeInsertResult        version: the new version number of the table after doing merge insert.    """iftimeoutisnotNone:self._timeout=timeoutreturnself._table._do_merge(self,new_data,on_bad_vectors,fill_value)
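
A sketch of the typical flow: the builder is obtained from Table.merge_insert, configured with the when_* methods, and then executed with the new data. The database path, the "id" key column, and the sample rows are placeholders:

import lancedb

db = lancedb.connect("/tmp/example-merge")
table = db.create_table(
    "users",
    [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}],
    mode="overwrite",
)

new_data = [{"id": 2, "name": "Bobby"}, {"id": 3, "name": "Carol"}]
result = (
    table.merge_insert("id")          # match rows on the "id" column
    .when_matched_update_all()        # update rows that already exist
    .when_not_matched_insert_all()    # insert rows that are new
    .execute(new_data)
)
# Row 2 is updated to "Bobby" and row 3 ("Carol") is inserted.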

Integrations

Pydantic

lancedb.pydantic.pydantic_to_schema

pydantic_to_schema(model:Type[BaseModel])->Schema

Convert a Pydantic Model to a PyArrow Schema.

Parameters:

  • model (Type[BaseModel]) –

    The Pydantic BaseModel to convert to Arrow Schema.

Returns:

Examples:

>>> from typing import List, Optional
>>> import pydantic
>>> from lancedb.pydantic import pydantic_to_schema, Vector
>>> class FooModel(pydantic.BaseModel):
...     id: int
...     s: str
...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
...     li: List[int]
...
>>> schema = pydantic_to_schema(FooModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("s", pa.utf8(), False),
...     pa.field("vec", pa.list_(pa.float32(), 1536)),
...     pa.field("li", pa.list_(pa.int64()), False),
... ])
Source code inlancedb/pydantic.py
defpydantic_to_schema(model:Type[pydantic.BaseModel])->pa.Schema:"""Convert a [Pydantic Model][pydantic.BaseModel] to a       [PyArrow Schema][pyarrow.Schema].    Parameters    ----------    model : Type[pydantic.BaseModel]        The Pydantic BaseModel to convert to Arrow Schema.    Returns    -------    pyarrow.Schema        The Arrow Schema    Examples    --------    >>> from typing import List, Optional    >>> import pydantic    >>> from lancedb.pydantic import pydantic_to_schema, Vector    >>> class FooModel(pydantic.BaseModel):    ...     id: int    ...     s: str    ...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]    ...     li: List[int]    ...    >>> schema = pydantic_to_schema(FooModel)    >>> assert schema == pa.schema([    ...     pa.field("id", pa.int64(), False),    ...     pa.field("s", pa.utf8(), False),    ...     pa.field("vec", pa.list_(pa.float32(), 1536)),    ...     pa.field("li", pa.list_(pa.int64()), False),    ... ])    """fields=_pydantic_model_to_fields(model)returnpa.schema(fields)

lancedb.pydantic.vector

vector(dim:int,value_type:DataType=pa.float32())
Source code inlancedb/pydantic.py
defvector(dim:int,value_type:pa.DataType=pa.float32()):# TODO: remove in future releasefromwarningsimportwarnwarn("lancedb.pydantic.vector() is deprecated, use lancedb.pydantic.Vector instead.""This function will be removed in future release",DeprecationWarning,)returnVector(dim,value_type)

lancedb.pydantic.LanceModel

Bases:BaseModel

A Pydantic Model base class that can be converted to a LanceDB Table.

Examples:

>>> import lancedb
>>> from lancedb.pydantic import LanceModel, Vector
>>>
>>> class TestModel(LanceModel):
...     name: str
...     vector: Vector(2)
...
>>> db = lancedb.connect("./example")
>>> table = db.create_table("test", schema=TestModel)
>>> table.add([
...     TestModel(name="test", vector=[1.0, 2.0])
... ])
AddResult(version=2)
>>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
[TestModel(name='test', vector=FixedSizeList(dim=2))]
Source code inlancedb/pydantic.py
classLanceModel(pydantic.BaseModel):"""    A Pydantic Model base class that can be converted to a LanceDB Table.    Examples    --------    >>> import lancedb    >>> from lancedb.pydantic import LanceModel, Vector    >>>    >>> class TestModel(LanceModel):    ...     name: str    ...     vector: Vector(2)    ...    >>> db = lancedb.connect("./example")    >>> table = db.create_table("test", schema=TestModel)    >>> table.add([    ...     TestModel(name="test", vector=[1.0, 2.0])    ... ])    AddResult(version=2)    >>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)    [TestModel(name='test', vector=FixedSizeList(dim=2))]    """@classmethoddefto_arrow_schema(cls):"""        Get the Arrow Schema for this model.        """schema=pydantic_to_schema(cls)functions=cls.parse_embedding_functions()iflen(functions)>0:# Prevent circular importfrom.embeddingsimportEmbeddingFunctionRegistrymetadata=EmbeddingFunctionRegistry.get_instance().get_table_metadata(functions)schema=schema.with_metadata(metadata)returnschema@classmethoddeffield_names(cls)->List[str]:"""        Get the field names of this model.        """returnlist(cls.safe_get_fields().keys())@classmethoddefsafe_get_fields(cls):ifPYDANTIC_VERSION.major<2:returncls.__fields__returncls.model_fields@classmethoddefparse_embedding_functions(cls)->List["EmbeddingFunctionConfig"]:"""        Parse the embedding functions from this model.        """from.embeddingsimportEmbeddingFunctionConfigvec_and_function=[]forname,field_infoincls.safe_get_fields().items():func=get_extras(field_info,"vector_column_for")iffuncisnotNone:vec_and_function.append([name,func])configs=[]forvec,funcinvec_and_function:forsource,field_infoincls.safe_get_fields().items():src_func=get_extras(field_info,"source_column_for")ifsrc_funcisfunc:# note we can't use == here since the function is a pydantic# model so two instances of the same function are ==, so if you# have multiple vector columns from multiple sources, both will# be mapped to the same source column# GH594configs.append(EmbeddingFunctionConfig(source_column=source,vector_column=vec,function=func))returnconfigs

to_arrow_schema classmethod

to_arrow_schema()

Get the Arrow Schema for this model.

Source code inlancedb/pydantic.py
@classmethoddefto_arrow_schema(cls):"""    Get the Arrow Schema for this model.    """schema=pydantic_to_schema(cls)functions=cls.parse_embedding_functions()iflen(functions)>0:# Prevent circular importfrom.embeddingsimportEmbeddingFunctionRegistrymetadata=EmbeddingFunctionRegistry.get_instance().get_table_metadata(functions)schema=schema.with_metadata(metadata)returnschema

field_names classmethod

field_names()->List[str]

Get the field names of this model.

Source code inlancedb/pydantic.py
@classmethoddeffield_names(cls)->List[str]:"""    Get the field names of this model.    """returnlist(cls.safe_get_fields().keys())

parse_embedding_functions classmethod

parse_embedding_functions()->List['EmbeddingFunctionConfig']

Parse the embedding functions from this model.

Source code inlancedb/pydantic.py
@classmethoddefparse_embedding_functions(cls)->List["EmbeddingFunctionConfig"]:"""    Parse the embedding functions from this model.    """from.embeddingsimportEmbeddingFunctionConfigvec_and_function=[]forname,field_infoincls.safe_get_fields().items():func=get_extras(field_info,"vector_column_for")iffuncisnotNone:vec_and_function.append([name,func])configs=[]forvec,funcinvec_and_function:forsource,field_infoincls.safe_get_fields().items():src_func=get_extras(field_info,"source_column_for")ifsrc_funcisfunc:# note we can't use == here since the function is a pydantic# model so two instances of the same function are ==, so if you# have multiple vector columns from multiple sources, both will# be mapped to the same source column# GH594configs.append(EmbeddingFunctionConfig(source_column=source,vector_column=vec,function=func))returnconfigs

Reranking

lancedb.rerankers.linear_combination.LinearCombinationReranker

Bases:Reranker

Reranks the results using a linear combination of the scores from the vector and FTS search. For missing scores, fill with the fill value.

Parameters:

  • weight (float, default:0.7) –

    The weight to give to the vector score. Must be between 0 and 1.

  • fill (float, default:1.0) –

    The score to give to results that are only in one of the two result sets. This is treated as a penalty, so a higher value means a lower score. TODO: We should just hardcode this; it's pretty confusing as we invert scores to calculate the final score.

  • return_score (str, default:"relevance") –

    The type of score to return. Options are "relevance" or "all". If "relevance", will return only the relevance score. If "all", will return all scores from the vector and FTS search along with the relevance score.

Source code inlancedb/rerankers/linear_combination.py
classLinearCombinationReranker(Reranker):"""    Reranks the results using a linear combination of the scores from the    vector and FTS search. For missing scores, fill with `fill` value.    Parameters    ----------    weight : float, default 0.7        The weight to give to the vector score. Must be between 0 and 1.    fill : float, default 1.0        The score to give to results that are only in one of the two result sets.        This is treated as penalty, so a higher value means a lower score.        TODO: We should just hardcode this--        its pretty confusing as we invert scores to calculate final score    return_score : str, default "relevance"        opntions are "relevance" or "all"        The type of score to return. If "relevance", will return only the relevance        score. If "all", will return all scores from the vector and FTS search along        with the relevance score.    """def__init__(self,weight:float=0.7,fill:float=1.0,return_score="relevance"):ifweight<0orweight>1:raiseValueError("weight must be between 0 and 1.")super().__init__(return_score)self.weight=weightself.fill=filldefrerank_hybrid(self,query:str,# noqa: F821vector_results:pa.Table,fts_results:pa.Table,):combined_results=self.merge_results(vector_results,fts_results,self.fill)returncombined_resultsdefmerge_results(self,vector_results:pa.Table,fts_results:pa.Table,fill:float):# If one is empty then return the other and add _relevance_score# column equal the existing vector or fts scoreiflen(vector_results)==0:results=fts_results.append_column("_relevance_score",pa.array(fts_results["_score"],type=pa.float32()),)ifself.score=="relevance":results=self._keep_relevance_score(results)elifself.score=="all":results=results.append_column("_distance",pa.array([nan]*len(fts_results),type=pa.float32()),)returnresultsiflen(fts_results)==0:# invert the distance to relevance scoreresults=vector_results.append_column("_relevance_score",pa.array([self._invert_score(distance)fordistanceinvector_results["_distance"].to_pylist()],type=pa.float32(),),)ifself.score=="relevance":results=self._keep_relevance_score(results)elifself.score=="all":results=results.append_column("_score",pa.array([nan]*len(vector_results),type=pa.float32()),)returnresultsresults=defaultdict()forvector_resultinvector_results.to_pylist():results[vector_result["_rowid"]]=vector_resultforfts_resultinfts_results.to_pylist():row_id=fts_result["_rowid"]ifrow_idinresults:results[row_id]["_score"]=fts_result["_score"]else:results[row_id]=fts_resultcombined_list=[]forrow_id,resultinresults.items():vector_score=self._invert_score(result.get("_distance",fill))fts_score=result.get("_score",fill)result["_relevance_score"]=self._combine_score(vector_score,fts_score)combined_list.append(result)relevance_score_schema=pa.schema([pa.field("_relevance_score",pa.float32()),])combined_schema=pa.unify_schemas([vector_results.schema,fts_results.schema,relevance_score_schema])tbl=pa.Table.from_pylist(combined_list,schema=combined_schema).sort_by([("_relevance_score","descending")])ifself.score=="relevance":tbl=self._keep_relevance_score(tbl)returntbldef_combine_score(self,vector_score,fts_score):# these scores represent distancereturn1-(self.weight*vector_score+(1-self.weight)*fts_score)def_invert_score(self,dist:float):# Invert the score between relevance and distancereturn1-dist
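
For illustration, a minimal sketch that calls rerank_hybrid directly on two small in-memory result tables; the _rowid, _distance, and _score columns stand in for what LanceDB's vector and FTS searches normally return. In practice the reranker is usually passed to a hybrid search query rather than called by hand:

import pyarrow as pa
from lancedb.rerankers.linear_combination import LinearCombinationReranker

# Stand-in result sets mimicking vector-search and FTS output.
vector_results = pa.table({
    "_rowid": pa.array([0, 1], type=pa.uint64()),
    "text": ["a note about cats", "a note about dogs"],
    "_distance": pa.array([0.12, 0.48], type=pa.float32()),
})
fts_results = pa.table({
    "_rowid": pa.array([1, 2], type=pa.uint64()),
    "text": ["a note about dogs", "a note about birds"],
    "_score": pa.array([0.9, 0.3], type=pa.float32()),
})

reranker = LinearCombinationReranker(weight=0.7)
combined = reranker.rerank_hybrid("dogs", vector_results, fts_results)
print(combined.column("_relevance_score"))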

lancedb.rerankers.cohere.CohereReranker

Bases:Reranker

Reranks the results using the Cohere Rerank API. https://docs.cohere.com/docs/rerank-guide

Parameters:

  • model_name (str, default:"rerank-english-v2.0") –

    The name of the cross encoder model to use. Available cohere models are:- rerank-english-v2.0- rerank-multilingual-v2.0

  • column (str, default:"text") –

    The name of the column to use as input to the cross encoder model.

  • top_n (Optional[int], default:None) –

    The number of results to return. If None, will return all results.

Source code inlancedb/rerankers/cohere.py
classCohereReranker(Reranker):"""    Reranks the results using the Cohere Rerank API.    https://docs.cohere.com/docs/rerank-guide    Parameters    ----------    model_name : str, default "rerank-english-v2.0"        The name of the cross encoder model to use. Available cohere models are:        - rerank-english-v2.0        - rerank-multilingual-v2.0    column : str, default "text"        The name of the column to use as input to the cross encoder model.    top_n : str, default None        The number of results to return. If None, will return all results.    """def__init__(self,model_name:str="rerank-english-v3.0",column:str="text",top_n:Union[int,None]=None,return_score="relevance",api_key:Union[str,None]=None,):super().__init__(return_score)self.model_name=model_nameself.column=columnself.top_n=top_nself.api_key=api_key@cached_propertydef_client(self):cohere=attempt_import_or_raise("cohere")# ensure version is at least 0.5.0ifhasattr(cohere,"__version__")andVersion(cohere.__version__)<Version("0.5.0"):raiseValueError(f"cohere version must be at least 0.5.0, found{cohere.__version__}")ifos.environ.get("COHERE_API_KEY")isNoneandself.api_keyisNone:raiseValueError("COHERE_API_KEY not set. Either set it in your environment or\                pass it as `api_key` argument to the CohereReranker.")returncohere.Client(os.environ.get("COHERE_API_KEY")orself.api_key)def_rerank(self,result_set:pa.Table,query:str):result_set=self._handle_empty_results(result_set)iflen(result_set)==0:returnresult_setdocs=result_set[self.column].to_pylist()response=self._client.rerank(query=query,documents=docs,top_n=self.top_n,model=self.model_name,)results=(response.results)# returns list (text, idx, relevance) attributes sorted descending by scoreindices,scores=list(zip(*[(result.index,result.relevance_score)forresultinresults]))# tuplesresult_set=result_set.take(list(indices))# add the scoresresult_set=result_set.append_column("_relevance_score",pa.array(scores,type=pa.float32()))returnresult_setdefrerank_hybrid(self,query:str,vector_results:pa.Table,fts_results:pa.Table,):ifself.score=="all":combined_results=self._merge_and_keep_scores(vector_results,fts_results)else:combined_results=self.merge_results(vector_results,fts_results)combined_results=self._rerank(combined_results,query)ifself.score=="relevance":combined_results=self._keep_relevance_score(combined_results)returncombined_resultsdefrerank_vector(self,query:str,vector_results:pa.Table):vector_results=self._rerank(vector_results,query)ifself.score=="relevance":vector_results=vector_results.drop_columns(["_distance"])returnvector_resultsdefrerank_fts(self,query:str,fts_results:pa.Table):fts_results=self._rerank(fts_results,query)ifself.score=="relevance":fts_results=fts_results.drop_columns(["_score"])returnfts_results
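
A construction sketch; it assumes the cohere package is installed and a COHERE_API_KEY is available (or passed as api_key). The resulting object exposes the same rerank_hybrid / rerank_vector / rerank_fts methods shown above:

from lancedb.rerankers.cohere import CohereReranker

# Requires the `cohere` package and a COHERE_API_KEY in the environment
# (or pass api_key="..." explicitly); top_n=5 keeps only the best 5 results.
reranker = CohereReranker(model_name="rerank-english-v3.0", column="text", top_n=5)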

lancedb.rerankers.colbert.ColbertReranker

Bases:AnswerdotaiRerankers

Reranks the results using the ColBERT model.

Parameters:

  • model_name (str, default:"colbert" (colbert-ir/colbert-v2.0)) –

    The name of the cross encoder model to use.

  • column (str, default:"text") –

    The name of the column to use as input to the cross encoder model.

  • return_score (str, default:"relevance") –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • **kwargs

    Additional keyword arguments to pass to the model, for example, 'device'. See AnswerDotAI/rerankers for more information.

Source code inlancedb/rerankers/colbert.py
classColbertReranker(AnswerdotaiRerankers):"""    Reranks the results using the ColBERT model.    Parameters    ----------    model_name : str, default "colbert" (colbert-ir/colbert-v2.0)        The name of the cross encoder model to use.    column : str, default "text"        The name of the column to use as input to the cross encoder model.    return_score : str, default "relevance"        options are "relevance" or "all". Only "relevance" is supported for now.    **kwargs        Additional keyword arguments to pass to the model, for example, 'device'.        See AnswerDotAI/rerankers for more information.    """def__init__(self,model_name:str="colbert-ir/colbertv2.0",column:str="text",return_score="relevance",**kwargs,):super().__init__(model_type="colbert",model_name=model_name,column=column,return_score=return_score,**kwargs,)
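
A construction sketch; it assumes the AnswerDotAI rerankers package (and a backend such as PyTorch) is installed. The device keyword below is an example of the extra arguments that are forwarded to the underlying model:

from lancedb.rerankers.colbert import ColbertReranker

# The ColBERT model weights are typically fetched on first use (assumption);
# "cpu" is an example device forwarded to the underlying rerankers model.
reranker = ColbertReranker(column="text", device="cpu")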

lancedb.rerankers.cross_encoder.CrossEncoderReranker

Bases:Reranker

Reranks the results using a cross encoder model. The cross encoder model is used to score the query and each result. The results are then sorted by the score.

Parameters:

  • model_name (str, default:"cross-encoder/ms-marco-TinyBERT-L-6") –

    The name of the cross encoder model to use. See the sentence transformers documentation for a list of available models.

  • column (str, default:"text") –

    The name of the column to use as input to the cross encoder model.

  • device (str, default:None) –

    The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".

  • return_score (str, default:"relevance") –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • trust_remote_code (bool, default:True) –

    If True, will trust the remote code to be safe. If False, will not trust the remote code and will not run it.

Source code inlancedb/rerankers/cross_encoder.py
classCrossEncoderReranker(Reranker):"""    Reranks the results using a cross encoder model. The cross encoder model is    used to score the query and each result. The results are then sorted by the score.    Parameters    ----------    model_name : str, default "cross-encoder/ms-marco-TinyBERT-L-6"        The name of the cross encoder model to use. See the sentence transformers        documentation for a list of available models.    column : str, default "text"        The name of the column to use as input to the cross encoder model.    device : str, default None        The device to use for the cross encoder model. If None, will use "cuda"        if available, otherwise "cpu".    return_score : str, default "relevance"        options are "relevance" or "all". Only "relevance" is supported for now.    trust_remote_code : bool, default True        If True, will trust the remote code to be safe. If False, will not trust        the remote code and will not run it    """def__init__(self,model_name:str="cross-encoder/ms-marco-TinyBERT-L-6",column:str="text",device:Union[str,None]=None,return_score="relevance",trust_remote_code:bool=True,):super().__init__(return_score)torch=attempt_import_or_raise("torch")self.model_name=model_nameself.column=columnself.device=deviceself.trust_remote_code=trust_remote_codeifself.deviceisNone:self.device="cuda"iftorch.cuda.is_available()else"cpu"@cached_propertydefmodel(self):sbert=attempt_import_or_raise("sentence_transformers")# Allows overriding the automatically selected devicecross_encoder=sbert.CrossEncoder(self.model_name,device=self.device,trust_remote_code=self.trust_remote_code,)returncross_encoderdef_rerank(self,result_set:pa.Table,query:str):result_set=self._handle_empty_results(result_set)iflen(result_set)==0:returnresult_setpassages=result_set[self.column].to_pylist()cross_inp=[[query,passage]forpassageinpassages]cross_scores=self.model.predict(cross_inp)result_set=result_set.append_column("_relevance_score",pa.array(cross_scores,type=pa.float32()))returnresult_setdefrerank_hybrid(self,query:str,vector_results:pa.Table,fts_results:pa.Table,):ifself.score=="all":combined_results=self._merge_and_keep_scores(vector_results,fts_results)else:combined_results=self.merge_results(vector_results,fts_results)combined_results=self._rerank(combined_results,query)# sort the results by _scoreifself.score=="relevance":combined_results=self._keep_relevance_score(combined_results)combined_results=combined_results.sort_by([("_relevance_score","descending")])returncombined_resultsdefrerank_vector(self,query:str,vector_results:pa.Table):vector_results=self._rerank(vector_results,query)ifself.score=="relevance":vector_results=vector_results.drop_columns(["_distance"])vector_results=vector_results.sort_by([("_relevance_score","descending")])returnvector_resultsdefrerank_fts(self,query:str,fts_results:pa.Table):fts_results=self._rerank(fts_results,query)ifself.score=="relevance":fts_results=fts_results.drop_columns(["_score"])fts_results=fts_results.sort_by([("_relevance_score","descending")])returnfts_results
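
A sketch assuming sentence-transformers (and torch) are installed; the default model is downloaded on first use. Here the reranker scores a small in-memory table directly via rerank_vector; the "text" column and distances are placeholders:

import pyarrow as pa
from lancedb.rerankers.cross_encoder import CrossEncoderReranker

reranker = CrossEncoderReranker(device="cpu")  # defaults to "cuda" when available

# Stand-in vector-search results containing the "text" column the model scores.
vector_results = pa.table({
    "text": ["a note about cats", "a note about dogs"],
    "_distance": pa.array([0.12, 0.48], type=pa.float32()),
})
reranked = reranker.rerank_vector("dogs", vector_results)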

lancedb.rerankers.openai.OpenaiReranker

Bases:Reranker

Reranks the results using the OpenAI API. WARNING: This is a prompt-based reranker that uses a chat model, not a dedicated reranker API. It should be treated as experimental.

Parameters:

  • model_name (str, default:"gpt-4-turbo-preview") –

    The name of the chat model to use.

  • column (str, default:"text") –

    The name of the column to use as input to the cross encoder model.

  • return_score (str, default:"relevance") –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • api_key (str, default:None) –

    The API key to use. If None, will use the OPENAI_API_KEY environment variable.

Source code inlancedb/rerankers/openai.py
classOpenaiReranker(Reranker):"""    Reranks the results using the OpenAI API.    WARNING: This is a prompt based reranker that uses chat model that is    not a dedicated reranker API. This should be treated as experimental.    Parameters    ----------    model_name : str, default "gpt-4-turbo-preview"        The name of the cross encoder model to use.    column : str, default "text"        The name of the column to use as input to the cross encoder model.    return_score : str, default "relevance"        options are "relevance" or "all". Only "relevance" is supported for now.    api_key : str, default None        The API key to use. If None, will use the OPENAI_API_KEY environment variable.    """def__init__(self,model_name:str="gpt-4-turbo-preview",column:str="text",return_score="relevance",api_key:Optional[str]=None,):super().__init__(return_score)self.model_name=model_nameself.column=columnself.api_key=api_keydef_rerank(self,result_set:pa.Table,query:str):result_set=self._handle_empty_results(result_set)iflen(result_set)==0:returnresult_setdocs=result_set[self.column].to_pylist()response=self._client.chat.completions.create(model=self.model_name,response_format={"type":"json_object"},temperature=0,messages=[{"role":"system","content":"You are an expert relevance ranker. Given a list of\                        documents and a query, your job is to determine the relevance\                        each document is for answering the query. Your output is JSON,\                        which is a list of documents. Each document has two fields,\                        content and relevance_score.  relevance_score is from 0.0 to\                        1.0 indicating the relevance of the text to the given query.\                        Make sure to include all documents in the response.",},{"role":"user","content":f"Query:{query} Docs:{docs}"},],)results=json.loads(response.choices[0].message.content)["documents"]docs,scores=list(zip(*[(result["content"],result["relevance_score"])forresultinresults]))# tuples# replace the self.column column with the docsresult_set=result_set.drop(self.column)result_set=result_set.append_column(self.column,pa.array(docs,type=pa.string()))# add the scoresresult_set=result_set.append_column("_relevance_score",pa.array(scores,type=pa.float32()))returnresult_setdefrerank_hybrid(self,query:str,vector_results:pa.Table,fts_results:pa.Table,):ifself.score=="all":combined_results=self._merge_and_keep_scores(vector_results,fts_results)else:combined_results=self.merge_results(vector_results,fts_results)combined_results=self._rerank(combined_results,query)ifself.score=="relevance":combined_results=self._keep_relevance_score(combined_results)combined_results=combined_results.sort_by([("_relevance_score","descending")])returncombined_resultsdefrerank_vector(self,query:str,vector_results:pa.Table):vector_results=self._rerank(vector_results,query)ifself.score=="relevance":vector_results=vector_results.drop_columns(["_distance"])vector_results=vector_results.sort_by([("_relevance_score","descending")])returnvector_resultsdefrerank_fts(self,query:str,fts_results:pa.Table):fts_results=self._rerank(fts_results,query)ifself.score=="relevance":fts_results=fts_results.drop_columns(["_score"])fts_results=fts_results.sort_by([("_relevance_score","descending")])returnfts_results@cached_propertydef_client(self):openai=attempt_import_or_raise("openai")# TODO: force version or handle versions < 1.0ifos.environ.get("OPENAI_API_KEY")isNoneandself.api_keyisNone:raiseValueError("OPENAI_API_KEY 
not set. Either set it in your environment or\                pass it as `api_key` argument to the CohereReranker.")returnopenai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY")orself.api_key)
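
A construction sketch; it assumes the openai package is installed and an OPENAI_API_KEY is available (or passed as api_key). Given its experimental, prompt-based nature, the output should be checked carefully:

from lancedb.rerankers.openai import OpenaiReranker

# Requires the `openai` package and an OPENAI_API_KEY in the environment
# (or pass api_key="..." explicitly).
reranker = OpenaiReranker(model_name="gpt-4-turbo-preview", column="text")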

Connections (Asynchronous)

Connections represent a connection to a LanceDB database and can be used to create, list, or open tables.

lancedb.connect_async async

connect_async(uri:URI,*,api_key:Optional[str]=None,region:str='us-east-1',host_override:Optional[str]=None,read_consistency_interval:Optional[timedelta]=None,client_config:Optional[Union[ClientConfig,Dict[str,Any]]]=None,storage_options:Optional[Dict[str,str]]=None)->AsyncConnection

Connect to a LanceDB database.

Parameters:

  • uri (URI) –

    The uri of the database.

  • api_key (Optional[str], default:None) –

    If present, connect to LanceDB cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via environment variable LANCEDB_API_KEY.

  • region (str, default:'us-east-1') –

    The region to use for LanceDB Cloud.

  • host_override (Optional[str], default:None) –

    The override url for LanceDB Cloud.

  • read_consistency_interval (Optional[timedelta], default:None) –

    (For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.

  • client_config (Optional[Union[ClientConfig,Dict[str,Any]]], default:None) –

    Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.

  • storage_options (Optional[Dict[str,str]], default:None) –

    Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/

Examples:

>>> import lancedb
>>> async def doctest_example():
...     # For a local directory, provide a path to the database
...     db = await lancedb.connect_async("~/.lancedb")
...     # For object storage, use a URI prefix
...     db = await lancedb.connect_async("s3://my-bucket/lancedb",
...                                      storage_options={
...                                          "aws_access_key_id": "***"})
...     # Connect to LanceDB cloud
...     db = await lancedb.connect_async("db://my_database", api_key="ldb_...",
...                                      client_config={
...                                          "retry_config": {"retries": 5}})

Returns:

Source code inlancedb/__init__.py
asyncdefconnect_async(uri:URI,*,api_key:Optional[str]=None,region:str="us-east-1",host_override:Optional[str]=None,read_consistency_interval:Optional[timedelta]=None,client_config:Optional[Union[ClientConfig,Dict[str,Any]]]=None,storage_options:Optional[Dict[str,str]]=None,)->AsyncConnection:"""Connect to a LanceDB database.    Parameters    ----------    uri: str or Path        The uri of the database.    api_key: str, optional        If present, connect to LanceDB cloud.        Otherwise, connect to a database on file system or cloud storage.        Can be set via environment variable `LANCEDB_API_KEY`.    region: str, default "us-east-1"        The region to use for LanceDB Cloud.    host_override: str, optional        The override url for LanceDB Cloud.    read_consistency_interval: timedelta, default None        (For LanceDB OSS only)        The interval at which to check for updates to the table from other        processes. If None, then consistency is not checked. For performance        reasons, this is the default. For strong consistency, set this to        zero seconds. Then every read will check for updates from other        processes. As a compromise, you can set this to a non-zero timedelta        for eventual consistency. If more than that interval has passed since        the last check, then the table will be checked for updates. Note: this        consistency only applies to read operations. Write operations are        always consistent.    client_config: ClientConfig or dict, optional        Configuration options for the LanceDB Cloud HTTP client. If a dict, then        the keys are the attributes of the ClientConfig class. If None, then the        default configuration is used.    storage_options: dict, optional        Additional options for the storage backend. See available options at        <https://lancedb.github.io/lancedb/guides/storage/>    Examples    --------    >>> import lancedb    >>> async def doctest_example():    ...     # For a local directory, provide a path to the database    ...     db = await lancedb.connect_async("~/.lancedb")    ...     # For object storage, use a URI prefix    ...     db = await lancedb.connect_async("s3://my-bucket/lancedb",    ...                                      storage_options={    ...                                          "aws_access_key_id": "***"})    ...     # Connect to LanceDB cloud    ...     db = await lancedb.connect_async("db://my_database", api_key="ldb_...",    ...                                      client_config={    ...                                          "retry_config": {"retries": 5}})    Returns    -------    conn : AsyncConnection        A connection to a LanceDB database.    """ifread_consistency_intervalisnotNone:read_consistency_interval_secs=read_consistency_interval.total_seconds()else:read_consistency_interval_secs=Noneifisinstance(client_config,dict):client_config=ClientConfig(**client_config)returnAsyncConnection(awaitlancedb_connect(sanitize_uri(uri),api_key,region,host_override,read_consistency_interval_secs,client_config,storage_options,))

lancedb.db.AsyncConnection

Bases:object

An active LanceDB connection

To obtain a connection you can use the connect_async function.

This could be a native connection (using lance) or a remote connection (e.g. for connecting to LanceDB Cloud).

Local connections do not currently hold any open resources, but they may do so in the future (for example, for a shared cache or connections to catalog services). Remote connections represent an open connection to the remote server. The close method can be used to release any underlying resources eagerly. The connection can also be used as a context manager.

Connections can be shared on multiple threads and are expected to be long-lived. Connections can also be used as a context manager; however, in many cases a single connection can be used for the lifetime of the application, so this is often not needed. Closing a connection is optional. If it is not closed, then it will be automatically closed when the connection object is deleted.

Examples:

>>> import lancedb
>>> async def doctest_example():
...   with await lancedb.connect_async("/tmp/my_dataset") as conn:
...     # do something with the connection
...     pass
...   # conn is closed here
Source code inlancedb/db.py
classAsyncConnection(object):"""An active LanceDB connection    To obtain a connection you can use the [connect_async][lancedb.connect_async]    function.    This could be a native connection (using lance) or a remote connection (e.g. for    connecting to LanceDb Cloud)    Local connections do not currently hold any open resources but they may do so in the    future (for example, for shared cache or connections to catalog services) Remote    connections represent an open connection to the remote server.  The    [close][lancedb.db.AsyncConnection.close] method can be used to release any    underlying resources eagerly.  The connection can also be used as a context manager.    Connections can be shared on multiple threads and are expected to be long lived.    Connections can also be used as a context manager, however, in many cases a single    connection can be used for the lifetime of the application and so this is often    not needed.  Closing a connection is optional.  If it is not closed then it will    be automatically closed when the connection object is deleted.    Examples    --------    >>> import lancedb    >>> async def doctest_example():    ...   with await lancedb.connect_async("/tmp/my_dataset") as conn:    ...     # do something with the connection    ...     pass    ...   # conn is closed here    """def__init__(self,connection:LanceDbConnection):self._inner=connectiondef__repr__(self):returnself._inner.__repr__()def__enter__(self):returnselfdef__exit__(self,*_):self.close()defis_open(self):"""Return True if the connection is open."""returnself._inner.is_open()defclose(self):"""Close the connection, releasing any underlying resources.        It is safe to call this method multiple times.        Any attempt to use the connection after it is closed will result in an error."""self._inner.close()@propertydefuri(self)->str:returnself._inner.uriasyncdeftable_names(self,*,start_after:Optional[str]=None,limit:Optional[int]=None)->Iterable[str]:"""List all tables in this database, in sorted order        Parameters        ----------        start_after: str, optional            If present, only return names that come lexicographically after the supplied            value.            This can be combined with limit to implement pagination by setting this to            the last table name from the previous page.        limit: int, default 10            The number of results to return.        Returns        -------        Iterable of str        """returnawaitself._inner.table_names(start_after=start_after,limit=limit)asyncdefcreate_table(self,name:str,data:Optional[DATA]=None,schema:Optional[Union[pa.Schema,LanceModel]]=None,mode:Optional[Literal["create","overwrite"]]=None,exist_ok:Optional[bool]=None,on_bad_vectors:Optional[str]=None,fill_value:Optional[float]=None,storage_options:Optional[Dict[str,str]]=None,*,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None,)->AsyncTable:"""Create an [AsyncTable][lancedb.table.AsyncTable] in the database.        Parameters        ----------        name: str            The name of the table.        data: The data to initialize the table, *optional*            User must provide at least one of `data` or `schema`.            
Acceptable types are:            - list-of-dict            - pandas.DataFrame            - pyarrow.Table or pyarrow.RecordBatch        schema: The schema of the table, *optional*            Acceptable types are:            - pyarrow.Schema            - [LanceModel][lancedb.pydantic.LanceModel]        mode: Literal["create", "overwrite"]; default "create"            The mode to use when creating the table.            Can be either "create" or "overwrite".            By default, if the table already exists, an exception is raised.            If you want to overwrite the table, use mode="overwrite".        exist_ok: bool, default False            If a table by the same name already exists, then raise an exception            if exist_ok=False. If exist_ok=True, then open the existing table;            it will not add the provided data but will validate against any            schema that's specified.        on_bad_vectors: str, default "error"            What to do if any of the vectors are not the same size or contains NaNs.            One of "error", "drop", "fill".        fill_value: float            The value to use when filling vectors. Only used if on_bad_vectors="fill".        storage_options: dict, optional            Additional options for the storage backend. Options already set on the            connection will be inherited by the table, but can be overridden here.            See available options at            <https://lancedb.github.io/lancedb/guides/storage/>        Returns        -------        AsyncTable            A reference to the newly created table.        !!! note            The vector index won't be created by default.            To create the index, call the `create_index` method on the table.        Examples        --------        Can create with list of tuples or dictionaries:        >>> import lancedb        >>> async def doctest_example():        ...     db = await lancedb.connect_async("./.lancedb")        ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},        ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]        ...     my_table = await db.create_table("my_table", data)        ...     print(await my_table.query().limit(5).to_arrow())        >>> import asyncio        >>> asyncio.run(doctest_example())        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: double        long: double        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        You can also pass a pandas DataFrame:        >>> import pandas as pd        >>> data = pd.DataFrame({        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],        ...    "lat": [45.5, 40.1],        ...    "long": [-122.7, -74.1]        ... })        >>> async def pandas_example():        ...     db = await lancedb.connect_async("./.lancedb")        ...     my_table = await db.create_table("table2", data)        ...     print(await my_table.query().limit(5).to_arrow())        >>> asyncio.run(pandas_example())        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: double        long: double        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        Data is converted to Arrow before being written to disk. For maximum        control over how data is saved, either provide the PyArrow schema to        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.        
>>> import pyarrow as pa        >>> custom_schema = pa.schema([        ...   pa.field("vector", pa.list_(pa.float32(), 2)),        ...   pa.field("lat", pa.float32()),        ...   pa.field("long", pa.float32())        ... ])        >>> async def with_schema():        ...     db = await lancedb.connect_async("./.lancedb")        ...     my_table = await db.create_table("table3", data, schema = custom_schema)        ...     print(await my_table.query().limit(5).to_arrow())        >>> asyncio.run(with_schema())        pyarrow.Table        vector: fixed_size_list<item: float>[2]          child 0, item: float        lat: float        long: float        ----        vector: [[[1.1,1.2],[0.2,1.8]]]        lat: [[45.5,40.1]]        long: [[-122.7,-74.1]]        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:        >>> import pyarrow as pa        >>> def make_batches():        ...     for i in range(5):        ...         yield pa.RecordBatch.from_arrays(        ...             [        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],        ...                     pa.list_(pa.float32(), 2)),        ...                 pa.array(["foo", "bar"]),        ...                 pa.array([10.0, 20.0]),        ...             ],        ...             ["vector", "item", "price"],        ...         )        >>> schema=pa.schema([        ...     pa.field("vector", pa.list_(pa.float32(), 2)),        ...     pa.field("item", pa.utf8()),        ...     pa.field("price", pa.float32()),        ... ])        >>> async def iterable_example():        ...     db = await lancedb.connect_async("./.lancedb")        ...     await db.create_table("table4", make_batches(), schema=schema)        >>> asyncio.run(iterable_example())        """metadata=Noneifembedding_functionsisnotNone:# If we passed in embedding functions explicitly# then we'll override any schema metadata that# may was implicitly specified by the LanceModel schemaregistry=EmbeddingFunctionRegistry.get_instance()metadata=registry.get_table_metadata(embedding_functions)# Defining defaults here and not in function prototype.  In the future# these defaults will move into rust so better to keep them as None.ifon_bad_vectorsisNone:on_bad_vectors="error"iffill_valueisNone:fill_value=0.0data,schema=sanitize_create_table(data,schema,metadata,on_bad_vectors,fill_value)validate_schema(schema)ifexist_okisNone:exist_ok=FalseifmodeisNone:mode="create"ifmode=="create"andexist_ok:mode="exist_ok"ifdataisNone:new_table=awaitself._inner.create_empty_table(name,mode,schema,storage_options=storage_options,)else:data=data_to_reader(data,schema)new_table=awaitself._inner.create_table(name,mode,data,storage_options=storage_options,)returnAsyncTable(new_table)asyncdefopen_table(self,name:str,storage_options:Optional[Dict[str,str]]=None,index_cache_size:Optional[int]=None,)->AsyncTable:"""Open a Lance Table in the database.        Parameters        ----------        name: str            The name of the table.        storage_options: dict, optional            Additional options for the storage backend. Options already set on the            connection will be inherited by the table, but can be overridden here.            
See available options at            <https://lancedb.github.io/lancedb/guides/storage/>        index_cache_size: int, default 256            Set the size of the index cache, specified as a number of entries            The exact meaning of an "entry" will depend on the type of index:            * IVF - there is one entry for each IVF partition            * BTREE - there is one entry for the entire index            This cache applies to the entire opened table, across all indices.            Setting this value higher will increase performance on larger datasets            at the expense of more RAM        Returns        -------        A LanceTable object representing the table.        """table=awaitself._inner.open_table(name,storage_options,index_cache_size)returnAsyncTable(table)asyncdefrename_table(self,old_name:str,new_name:str):"""Rename a table in the database.        Parameters        ----------        old_name: str            The current name of the table.        new_name: str            The new name of the table.        """awaitself._inner.rename_table(old_name,new_name)asyncdefdrop_table(self,name:str,*,ignore_missing:bool=False):"""Drop a table from the database.        Parameters        ----------        name: str            The name of the table.        ignore_missing: bool, default False            If True, ignore if the table does not exist.        """try:awaitself._inner.drop_table(name)exceptValueErrorase:ifnotignore_missing:raiseeiff"Table '{name}' was not found"notinstr(e):raiseeasyncdefdrop_all_tables(self):"""Drop all tables from the database."""awaitself._inner.drop_all_tables()@deprecation.deprecated(deprecated_in="0.15.1",removed_in="0.17",current_version=__version__,details="Use drop_all_tables() instead",)asyncdefdrop_database(self):"""        Drop database        This is the same thing as dropping all the tables        """awaitself._inner.drop_all_tables()

is_open

is_open()

Return True if the connection is open.

Source code in lancedb/db.py
def is_open(self):
    """Return True if the connection is open."""
    return self._inner.is_open()

close

close()

Close the connection, releasing any underlying resources.

It is safe to call this method multiple times.

Any attempt to use the connection after it is closed will result in an error.

Source code in lancedb/db.py
def close(self):
    """Close the connection, releasing any underlying resources.

    It is safe to call this method multiple times.

    Any attempt to use the connection after it is closed will result in an error."""
    self._inner.close()

table_names async

table_names(*,start_after:Optional[str]=None,limit:Optional[int]=None)->Iterable[str]

List all tables in this database, in sorted order

Parameters:

  • start_after (Optional[str], default:None) –

    If present, only return names that come lexicographically after the supplied value.

    This can be combined with limit to implement pagination by setting this to the last table name from the previous page, as sketched after the source listing below.

  • limit (Optional[int], default:None) –

    The number of results to return.

Returns:

  • Iterable of str
Source code in lancedb/db.py
async def table_names(
    self, *, start_after: Optional[str] = None, limit: Optional[int] = None
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    start_after: str, optional
        If present, only return names that come lexicographically after the supplied
        value.

        This can be combined with limit to implement pagination by setting this to
        the last table name from the previous page.
    limit: int, default 10
        The number of results to return.

    Returns
    -------
    Iterable of str
    """
    return await self._inner.table_names(start_after=start_after, limit=limit)
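For example, pagination can be implemented by passing the last name of the previous page as start_after. The following is a minimal sketch (the database path and page size are illustrative placeholders, not part of the API):

import asyncio
import lancedb

async def list_all_tables_paginated():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    page_size = 10  # illustrative page size
    names = list(await db.table_names(limit=page_size))
    while names:
        for name in names:
            print(name)
        # Use the last name of the previous page as the cursor for the next page
        names = list(await db.table_names(start_after=names[-1], limit=page_size))

asyncio.run(list_all_tables_paginated())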

create_table async

create_table(name:str,data:Optional[DATA]=None,schema:Optional[Union[Schema,LanceModel]]=None,mode:Optional[Literal['create','overwrite']]=None,exist_ok:Optional[bool]=None,on_bad_vectors:Optional[str]=None,fill_value:Optional[float]=None,storage_options:Optional[Dict[str,str]]=None,*,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None)->AsyncTable

Create anAsyncTable in the database.

Parameters:

  • name (str) –

    The name of the table.

  • data (Optional[DATA], default:None) –

    User must provide at least one of data or schema. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • schema (Optional[Union[Schema,LanceModel]], default:None) –

    Acceptable types are:

    • pyarrow.Schema

    • LanceModel

  • mode (Optional[Literal['create', 'overwrite']], default:None) –

    The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

  • exist_ok (Optional[bool], default:None) –

    If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

  • on_bad_vectors (Optional[str], default:None) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (Optional[float], default:None) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • storage_options (Optional[Dict[str,str]], default:None) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

Returns:

  • AsyncTable

    A reference to the newly created table.

  • Note

    The vector index won't be created by default. To create the index, call the create_index method on the table.

Examples:

You can create a table from a list of tuples or dictionaries:

>>> import lancedb
>>> async def doctest_example():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
...     my_table = await db.create_table("my_table", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(doctest_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> async def pandas_example():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table2", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(pandas_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> async def with_schema():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table3", data, schema = custom_schema)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(with_schema())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create a table from an Iterable[pa.RecordBatch]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> async def iterable_example():
...     db = await lancedb.connect_async("./.lancedb")
...     await db.create_table("table4", make_batches(), schema=schema)
>>> asyncio.run(iterable_example())
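Because at least one of data or schema is required, a table can also be created empty by passing only a schema and adding data later. The following is a minimal sketch (the table name "table5" and the fields are placeholders):

import asyncio
import lancedb
import pyarrow as pa

async def create_empty_table():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    empty_schema = pa.schema([
        pa.field("vector", pa.list_(pa.float32(), 2)),
        pa.field("item", pa.utf8()),
    ])
    # No data is written yet; only the schema is registered
    tbl = await db.create_table("table5", schema=empty_schema)
    print(await tbl.count_rows())  # 0 rows initially

asyncio.run(create_empty_table())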
Source code in lancedb/db.py
asyncdefcreate_table(self,name:str,data:Optional[DATA]=None,schema:Optional[Union[pa.Schema,LanceModel]]=None,mode:Optional[Literal["create","overwrite"]]=None,exist_ok:Optional[bool]=None,on_bad_vectors:Optional[str]=None,fill_value:Optional[float]=None,storage_options:Optional[Dict[str,str]]=None,*,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None,)->AsyncTable:"""Create an [AsyncTable][lancedb.table.AsyncTable] in the database.    Parameters    ----------    name: str        The name of the table.    data: The data to initialize the table, *optional*        User must provide at least one of `data` or `schema`.        Acceptable types are:        - list-of-dict        - pandas.DataFrame        - pyarrow.Table or pyarrow.RecordBatch    schema: The schema of the table, *optional*        Acceptable types are:        - pyarrow.Schema        - [LanceModel][lancedb.pydantic.LanceModel]    mode: Literal["create", "overwrite"]; default "create"        The mode to use when creating the table.        Can be either "create" or "overwrite".        By default, if the table already exists, an exception is raised.        If you want to overwrite the table, use mode="overwrite".    exist_ok: bool, default False        If a table by the same name already exists, then raise an exception        if exist_ok=False. If exist_ok=True, then open the existing table;        it will not add the provided data but will validate against any        schema that's specified.    on_bad_vectors: str, default "error"        What to do if any of the vectors are not the same size or contains NaNs.        One of "error", "drop", "fill".    fill_value: float        The value to use when filling vectors. Only used if on_bad_vectors="fill".    storage_options: dict, optional        Additional options for the storage backend. Options already set on the        connection will be inherited by the table, but can be overridden here.        See available options at        <https://lancedb.github.io/lancedb/guides/storage/>    Returns    -------    AsyncTable        A reference to the newly created table.    !!! note        The vector index won't be created by default.        To create the index, call the `create_index` method on the table.    Examples    --------    Can create with list of tuples or dictionaries:    >>> import lancedb    >>> async def doctest_example():    ...     db = await lancedb.connect_async("./.lancedb")    ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},    ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]    ...     my_table = await db.create_table("my_table", data)    ...     print(await my_table.query().limit(5).to_arrow())    >>> import asyncio    >>> asyncio.run(doctest_example())    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: double    long: double    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    You can also pass a pandas DataFrame:    >>> import pandas as pd    >>> data = pd.DataFrame({    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],    ...    "lat": [45.5, 40.1],    ...    "long": [-122.7, -74.1]    ... })    >>> async def pandas_example():    ...     db = await lancedb.connect_async("./.lancedb")    ...     my_table = await db.create_table("table2", data)    ...     
print(await my_table.query().limit(5).to_arrow())    >>> asyncio.run(pandas_example())    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: double    long: double    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    Data is converted to Arrow before being written to disk. For maximum    control over how data is saved, either provide the PyArrow schema to    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.    >>> import pyarrow as pa    >>> custom_schema = pa.schema([    ...   pa.field("vector", pa.list_(pa.float32(), 2)),    ...   pa.field("lat", pa.float32()),    ...   pa.field("long", pa.float32())    ... ])    >>> async def with_schema():    ...     db = await lancedb.connect_async("./.lancedb")    ...     my_table = await db.create_table("table3", data, schema = custom_schema)    ...     print(await my_table.query().limit(5).to_arrow())    >>> asyncio.run(with_schema())    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    lat: float    long: float    ----    vector: [[[1.1,1.2],[0.2,1.8]]]    lat: [[45.5,40.1]]    long: [[-122.7,-74.1]]    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:    >>> import pyarrow as pa    >>> def make_batches():    ...     for i in range(5):    ...         yield pa.RecordBatch.from_arrays(    ...             [    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],    ...                     pa.list_(pa.float32(), 2)),    ...                 pa.array(["foo", "bar"]),    ...                 pa.array([10.0, 20.0]),    ...             ],    ...             ["vector", "item", "price"],    ...         )    >>> schema=pa.schema([    ...     pa.field("vector", pa.list_(pa.float32(), 2)),    ...     pa.field("item", pa.utf8()),    ...     pa.field("price", pa.float32()),    ... ])    >>> async def iterable_example():    ...     db = await lancedb.connect_async("./.lancedb")    ...     await db.create_table("table4", make_batches(), schema=schema)    >>> asyncio.run(iterable_example())    """metadata=Noneifembedding_functionsisnotNone:# If we passed in embedding functions explicitly# then we'll override any schema metadata that# may was implicitly specified by the LanceModel schemaregistry=EmbeddingFunctionRegistry.get_instance()metadata=registry.get_table_metadata(embedding_functions)# Defining defaults here and not in function prototype.  In the future# these defaults will move into rust so better to keep them as None.ifon_bad_vectorsisNone:on_bad_vectors="error"iffill_valueisNone:fill_value=0.0data,schema=sanitize_create_table(data,schema,metadata,on_bad_vectors,fill_value)validate_schema(schema)ifexist_okisNone:exist_ok=FalseifmodeisNone:mode="create"ifmode=="create"andexist_ok:mode="exist_ok"ifdataisNone:new_table=awaitself._inner.create_empty_table(name,mode,schema,storage_options=storage_options,)else:data=data_to_reader(data,schema)new_table=awaitself._inner.create_table(name,mode,data,storage_options=storage_options,)returnAsyncTable(new_table)

open_table async

open_table(name:str,storage_options:Optional[Dict[str,str]]=None,index_cache_size:Optional[int]=None)->AsyncTable

Open a Lance Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • storage_options (Optional[Dict[str,str]], default:None) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

  • index_cache_size (Optional[int], default:None) –

    Set the size of the index cache, specified as a number of entries.

    The exact meaning of an "entry" depends on the type of index:

    • IVF - there is one entry for each IVF partition

    • BTREE - there is one entry for the entire index

    This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM. A usage sketch follows the source listing below.

Returns:

  • An AsyncTable object representing the table.
Source code in lancedb/db.py
async def open_table(
    self,
    name: str,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> AsyncTable:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM

    Returns
    -------
    A LanceTable object representing the table.
    """
    table = await self._inner.open_table(name, storage_options, index_cache_size)
    return AsyncTable(table)
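As a usage sketch, opening an existing table with a larger index cache might look like the following (the table name and cache size are illustrative; 256 entries is the documented default):

import asyncio
import lancedb

async def open_with_larger_cache():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    # A larger index cache can help on bigger datasets, at the cost of RAM
    table = await db.open_table("my_table", index_cache_size=512)
    print(await table.count_rows())

asyncio.run(open_with_larger_cache())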

rename_table async

rename_table(old_name:str,new_name:str)

Rename a table in the database.

Parameters:

  • old_name (str) –

    The current name of the table.

  • new_name (str) –

    The new name of the table.

Source code in lancedb/db.py
async def rename_table(self, old_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    old_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    await self._inner.rename_table(old_name, new_name)

drop_table async

drop_table(name:str,*,ignore_missing:bool=False)

Drop a table from the database.

Parameters:

  • name (str) –

    The name of the table.

  • ignore_missing (bool, default:False) –

    If True, ignore if the table does not exist.

Source code in lancedb/db.py
async def drop_table(self, name: str, *, ignore_missing: bool = False):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    ignore_missing: bool, default False
        If True, ignore if the table does not exist.
    """
    try:
        await self._inner.drop_table(name)
    except ValueError as e:
        if not ignore_missing:
            raise e
        if f"Table '{name}' was not found" not in str(e):
            raise e
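For example, passing ignore_missing=True makes the call a no-op when the table has already been removed. A minimal sketch (the table name is a placeholder):

import asyncio
import lancedb

async def drop_if_exists():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    # Does not raise even if "old_table" does not exist
    await db.drop_table("old_table", ignore_missing=True)

asyncio.run(drop_if_exists())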

drop_all_tables async

drop_all_tables()

Drop all tables from the database.

Source code in lancedb/db.py
async def drop_all_tables(self):
    """Drop all tables from the database."""
    await self._inner.drop_all_tables()

drop_database async

drop_database()

Drop the database. This is the same thing as dropping all the tables.

Source code in lancedb/db.py
@deprecation.deprecated(
    deprecated_in="0.15.1",
    removed_in="0.17",
    current_version=__version__,
    details="Use drop_all_tables() instead",
)
async def drop_database(self):
    """
    Drop database

    This is the same thing as dropping all the tables
    """
    await self._inner.drop_all_tables()
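Because drop_database is deprecated, new code should call drop_all_tables directly. A minimal sketch (the database path is a placeholder):

import asyncio
import lancedb

async def clear_database():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    # Preferred over the deprecated drop_database()
    await db.drop_all_tables()

asyncio.run(clear_database())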

Tables (Asynchronous)

Tables hold your actual data as a collection of records / rows.

lancedb.table.AsyncTable

An AsyncTable is a collection of Records in a LanceDB Database.

An AsyncTable can be obtained from the AsyncConnection.create_table and AsyncConnection.open_table methods.

An AsyncTable object is expected to be long lived and reused for multiple operations. AsyncTable objects will cache a certain amount of index data in memory. This cache will be freed when the Table is garbage collected. To eagerly free the cache you can call the close method. Once the AsyncTable is closed, it cannot be used for any further operations.

An AsyncTable can also be used as a context manager, and will automatically close when the context is exited. Closing a table is optional. If you do not close the table, it will be closed when the AsyncTable object is garbage collected.

Examples:

Create using AsyncConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> async def create_a_table():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "b": 2}]
...     table = await db.create_table("my_table", data=data)
...     print(await table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(create_a_table())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with AsyncTable.add().

>>> async def add_to_table():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
>>> asyncio.run(add_to_table())

Can query the table with AsyncTable.vector_search.

>>> async def search_table_for_vector():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     results = (
...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
...     )
...     print(results)
>>> asyncio.run(search_table_for_vector())
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See AsyncTable.create_index.
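For example, building a vector index before searching might look like the following sketch. It uses the default index configuration; the database path, table name, and "vector" column name are assumptions, and index training generally requires a reasonably large number of rows:

import asyncio
import lancedb

async def index_then_search():
    db = await lancedb.connect_async("./.lancedb")  # placeholder path
    table = await db.open_table("my_table")
    # Build an index on the vector column with the default configuration
    await table.create_index("vector")
    results = await table.vector_search([0.4, 0.4]).limit(5).to_pandas()
    print(results)

asyncio.run(index_then_search())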

Source code in lancedb/table.py
classAsyncTable:"""    An AsyncTable is a collection of Records in a LanceDB Database.    An AsyncTable can be obtained from the    [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] methods.    An AsyncTable object is expected to be long lived and reused for multiple    operations. AsyncTable objects will cache a certain amount of index data in memory.    This cache will be freed when the Table is garbage collected.  To eagerly free the    cache you can call the [close][lancedb.AsyncTable.close] method.  Once the    AsyncTable is closed, it cannot be used for any further operations.    An AsyncTable can also be used as a context manager, and will automatically close    when the context is exited.  Closing a table is optional.  If you do not close the    table, it will be closed when the AsyncTable object is garbage collected.    Examples    --------    Create using [AsyncConnection.create_table][lancedb.AsyncConnection.create_table]    (more examples in that method's documentation).    >>> import lancedb    >>> async def create_a_table():    ...     db = await lancedb.connect_async("./.lancedb")    ...     data = [{"vector": [1.1, 1.2], "b": 2}]    ...     table = await db.create_table("my_table", data=data)    ...     print(await table.query().limit(5).to_arrow())    >>> import asyncio    >>> asyncio.run(create_a_table())    pyarrow.Table    vector: fixed_size_list<item: float>[2]      child 0, item: float    b: int64    ----    vector: [[[1.1,1.2]]]    b: [[2]]    Can append new data with [AsyncTable.add()][lancedb.table.AsyncTable.add].    >>> async def add_to_table():    ...     db = await lancedb.connect_async("./.lancedb")    ...     table = await db.open_table("my_table")    ...     await table.add([{"vector": [0.5, 1.3], "b": 4}])    >>> asyncio.run(add_to_table())    Can query the table with    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search].    >>> async def search_table_for_vector():    ...     db = await lancedb.connect_async("./.lancedb")    ...     table = await db.open_table("my_table")    ...     results = (    ...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()    ...     )    ...     print(results)    >>> asyncio.run(search_table_for_vector())       b      vector  _distance    0  4  [0.5, 1.3]       0.82    1  2  [1.1, 1.2]       1.13    Search queries are much faster when an index is created. See    [AsyncTable.create_index][lancedb.table.AsyncTable.create_index].    """def__init__(self,table:LanceDBTable):"""Create a new AsyncTable object.        You should not create AsyncTable objects directly.        Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and        [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain        Table objects."""self._inner=tabledef__repr__(self):returnself._inner.__repr__()def__enter__(self):returnselfdef__exit__(self,*_):self.close()defis_open(self)->bool:"""Return True if the table is open."""returnself._inner.is_open()defclose(self):"""Close the table and free any resources associated with it.        It is safe to call this method multiple times.        
Any attempt to use the table after it has been closed will raise an error."""returnself._inner.close()@propertydefname(self)->str:"""The name of the table."""returnself._inner.name()asyncdefschema(self)->pa.Schema:"""The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)        of this Table        """returnawaitself._inner.schema()asyncdefembedding_functions(self)->Dict[str,EmbeddingFunctionConfig]:"""        Get the embedding functions for the table        Returns        -------        funcs: Dict[str, EmbeddingFunctionConfig]            A mapping of the vector column to the embedding function            or empty dict if not configured.        """schema=awaitself.schema()returnEmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)asyncdefcount_rows(self,filter:Optional[str]=None)->int:"""        Count the number of rows in the table.        Parameters        ----------        filter: str, optional            A SQL where clause to filter the rows to count.        """returnawaitself._inner.count_rows(filter)asyncdefhead(self,n=5)->pa.Table:"""        Return the first `n` rows of the table.        Parameters        ----------        n: int, default 5            The number of rows to return.        """returnawaitself.query().limit(n).to_arrow()defquery(self)->AsyncQuery:"""        Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used        to search the table.        Use methods on the returned query to control query behavior.  The query        can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],        [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.        """returnAsyncQuery(self._inner.query())asyncdefto_pandas(self)->"pd.DataFrame":"""Return the table as a pandas DataFrame.        Returns        -------        pd.DataFrame        """return(awaitself.to_arrow()).to_pandas()asyncdefto_arrow(self)->pa.Table:"""Return the table as a pyarrow Table.        Returns        -------        pa.Table        """returnawaitself.query().to_arrow()asyncdefcreate_index(self,column:str,*,replace:Optional[bool]=None,config:Optional[Union[IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS]]=None,wait_timeout:Optional[timedelta]=None,):"""Create an index to speed up queries        Indices can be created on vector columns or scalar columns.        Indices on vector columns will speed up vector searches.        Indices on scalar columns will speed up filtering (in both        vector and non-vector searches)        Parameters        ----------        column: str            The column to index.        replace: bool, default True            Whether to replace the existing index            If this is false, and another index already exists on the same columns            and the same name, then an error will be returned.  This is true even if            that index is out of date.            The default is True        config: default None            For advanced configuration you can specify the type of index you would            like to create.   You can also specify index-specific parameters when            creating an index object.        wait_timeout: timedelta, optional            The timeout to wait if indexing is asynchronous.        
"""ifconfigisnotNone:ifnotisinstance(config,(IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS)):raiseTypeError("config must be an instance of IvfPq, HnswPq, HnswSq, BTree,"" Bitmap, LabelList, or FTS")try:awaitself._inner.create_index(column,index=config,replace=replace,wait_timeout=wait_timeout)exceptValueErrorase:if"not support the requested language"instr(e):supported_langs=", ".join(lang_mapping.values())help_msg=f"Supported languages:{supported_langs}"add_note(e,help_msg)raiseeasyncdefdrop_index(self,name:str)->None:"""        Drop an index from the table.        Parameters        ----------        name: str            The name of the index to drop.        Notes        -----        This does not delete the index from disk, it just removes it from the table.        To delete the index, run [optimize][lancedb.table.AsyncTable.optimize]        after dropping the index.        Use [list_indices][lancedb.table.AsyncTable.list_indices] to find the names        of the indices.        """awaitself._inner.drop_index(name)asyncdefprewarm_index(self,name:str)->None:"""        Prewarm an index in the table.        Parameters        ----------        name: str            The name of the index to prewarm        Notes        -----        This will load the index into memory.  This may reduce the cold-start time for        future queries.  If the index does not fit in the cache then this call may be        wasteful.        """awaitself._inner.prewarm_index(name)asyncdefwait_for_index(self,index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None:"""        Wait for indexing to complete for the given index names.        This will poll the table until all the indices are fully indexed,        or raise a timeout exception if the timeout is reached.        Parameters        ----------        index_names: str            The name of the indices to poll        timeout: timedelta            Timeout to wait for asynchronous indexing. The default is 5 minutes.        """awaitself._inner.wait_for_index(index_names,timeout)asyncdefstats(self)->TableStatistics:"""        Retrieve table and fragment statistics.        """returnawaitself._inner.stats()asyncdefadd(self,data:DATA,*,mode:Optional[Literal["append","overwrite"]]="append",on_bad_vectors:Optional[OnBadVectorsType]=None,fill_value:Optional[float]=None,)->AddResult:"""Add more data to the [Table](Table).        Parameters        ----------        data: DATA            The data to insert into the table. Acceptable types are:            - list-of-dict            - pandas.DataFrame            - pyarrow.Table or pyarrow.RecordBatch        mode: str            The mode to use when writing the data. Valid values are            "append" and "overwrite".        on_bad_vectors: str, default "error"            What to do if any of the vectors are not the same size or contains NaNs.            One of "error", "drop", "fill", "null".        fill_value: float, default 0.            The value to use when filling vectors. Only used if on_bad_vectors="fill".        
"""schema=awaitself.schema()ifon_bad_vectorsisNone:on_bad_vectors="error"iffill_valueisNone:fill_value=0.0data=_sanitize_data(data,schema,metadata=schema.metadata,on_bad_vectors=on_bad_vectors,fill_value=fill_value,allow_subschema=True,)ifisinstance(data,pa.Table):data=data.to_reader()returnawaitself._inner.add(data,modeor"append")defmerge_insert(self,on:Union[str,Iterable[str]])->LanceMergeInsertBuilder:"""        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]        that can be used to create a "merge insert" operation        This operation can add rows, update rows, and remove rows all in a single        transaction. It is a very generic tool that can be used to create        behaviors like "insert if not exists", "update or insert (i.e. upsert)",        or even replace a portion of existing data with new data (e.g. replace        all data where month="january")        The merge insert operation works by combining new data from a        **source table** with existing data in a **target table** by using a        join.  There are three categories of records.        "Matched" records are records that exist in both the source table and        the target table. "Not matched" records exist only in the source table        (e.g. these are new data) "Not matched by source" records exist only        in the target table (this is old data)        The builder returned by this method can be used to customize what        should happen for each category of data.        Please note that the data may appear to be reordered as part of this        operation.  This is because updated rows will be deleted from the        dataset and then reinserted at the end with the new values.        Parameters        ----------        on: Union[str, Iterable[str]]            A column (or columns) to join on.  This is how records from the            source table and target table are matched.  Typically this is some            kind of key or id column.        Examples        --------        >>> import lancedb        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", data)        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})        >>> # Perform a "upsert" operation        >>> res = table.merge_insert("a")     \\        ...      .when_matched_update_all()     \\        ...      .when_not_matched_insert_all() \\        ...      
.execute(new_data)        >>> res        MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)        >>> # The order of new rows is non-deterministic since we use        >>> # a hash-join as part of this operation and so we sort here        >>> table.to_arrow().sort_by("a").to_pandas()           a  b        0  1  b        1  2  x        2  3  y        3  4  z        """# noqa: E501on=[on]ifisinstance(on,str)elselist(iter(on))returnLanceMergeInsertBuilder(self,on)@overloadasyncdefsearch(self,query:Optional[str]=None,vector_column_name:Optional[str]=None,query_type:Literal["auto"]=...,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->Union[AsyncHybridQuery,AsyncFTSQuery,AsyncVectorQuery]:...@overloadasyncdefsearch(self,query:Optional[str]=None,vector_column_name:Optional[str]=None,query_type:Literal["hybrid"]=...,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->AsyncHybridQuery:...@overloadasyncdefsearch(self,query:Optional[Union[VEC,"PIL.Image.Image",Tuple]]=None,vector_column_name:Optional[str]=None,query_type:Literal["auto"]=...,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->AsyncVectorQuery:...@overloadasyncdefsearch(self,query:Optional[str]=None,vector_column_name:Optional[str]=None,query_type:Literal["fts"]=...,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->AsyncFTSQuery:...@overloadasyncdefsearch(self,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:Literal["vector"]=...,ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->AsyncVectorQuery:...asyncdefsearch(self,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType="auto",ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->Union[AsyncHybridQuery,AsyncFTSQuery,AsyncVectorQuery]:"""Create a search query to find the nearest neighbors        of the given query vector. We currently support [vector search][search]        and [full-text search][experimental-full-text-search].        All query options are defined in [AsyncQuery][lancedb.query.AsyncQuery].        Parameters        ----------        query: list/np.ndarray/str/PIL.Image.Image, default None            The targetted vector to search for.            - *default None*.            Acceptable types are: list, np.ndarray, PIL.Image.Image            - If None then the select/where/limit clauses are applied to filter            the table        vector_column_name: str, optional            The name of the vector column to search.            The vector column needs to be a pyarrow fixed size list type            - If not specified then the vector column is inferred from            the table schema            - If the table has multiple vector columns then the *vector_column_name*            needs to be specified. Otherwise, an error is raised.        query_type: str            *default "auto"*.            
Acceptable types are: "vector", "fts", "hybrid", or "auto"            - If "auto" then the query type is inferred from the query;                - If `query` is a list/np.ndarray then the query type is                "vector";                - If `query` is a PIL.Image.Image then either do vector search,                or raise an error if no corresponding embedding function is found.            - If `query` is a string, then the query type is "vector" if the              table has embedding functions else the query type is "fts"        Returns        -------        LanceQueryBuilder            A query builder object representing the query.        """defis_embedding(query):returnisinstance(query,(list,np.ndarray,pa.Array,pa.ChunkedArray))asyncdefget_embedding_func(vector_column_name:Optional[str],query_type:QueryType,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]],)->Tuple[str,EmbeddingFunctionConfig]:ifisinstance(query,FullTextQuery):query_type="fts"schema=awaitself.schema()vector_column_name=infer_vector_column_name(schema=schema,query_type=query_type,query=query,vector_column_name=vector_column_name,)funcs=EmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)func=funcs.get(vector_column_name)iffuncisNone:error=ValueError(f"Column '{vector_column_name}' has no registered ""embedding function.")iflen(funcs)>0:add_note(error,"Embedding functions are registered for columns: "f"{list(funcs.keys())}",)else:add_note(error,"No embedding functions are registered for any columns.")raiseerrorreturnvector_column_name,funcasyncdefmake_embedding(embedding,query):ifembeddingisnotNone:loop=asyncio.get_running_loop()# This function is likely to block, since it either calls an expensive# function or makes an HTTP request to an embeddings REST API.return(awaitloop.run_in_executor(None,embedding.function.compute_query_embeddings_with_retry,query,))[0]else:returnNoneifquery_type=="auto":# Infer the query type.ifis_embedding(query):vector_query=queryquery_type="vector"elifisinstance(query,FullTextQuery):query_type="fts"elifisinstance(query,str):try:(indices,(vector_column_name,embedding_conf),)=awaitasyncio.gather(self.list_indices(),get_embedding_func(vector_column_name,"auto",query),)exceptValueErrorase:if"Column"instr(e)and"has no registered embedding function"instr(e):# If the column has no registered embedding function,# then it's an FTS query.query_type="fts"else:raiseeelse:ifembedding_confisnotNone:vector_query=awaitmake_embedding(embedding_conf,query)ifany(i.columns[0]==embedding_conf.source_columnandi.index_type=="FTS"foriinindices):query_type="hybrid"else:query_type="vector"else:query_type="fts"else:# it's an image or something else embeddable.query_type="vector"elifquery_type=="vector":ifis_embedding(query):vector_query=queryelse:vector_column_name,embedding_conf=awaitget_embedding_func(vector_column_name,query_type,query)vector_query=awaitmake_embedding(embedding_conf,query)elifquery_type=="hybrid":ifis_embedding(query):raiseValueError("Hybrid search requires a text 
query")else:vector_column_name,embedding_conf=awaitget_embedding_func(vector_column_name,query_type,query)vector_query=awaitmake_embedding(embedding_conf,query)ifquery_type=="vector":builder=self.query().nearest_to(vector_query)ifvector_column_name:builder=builder.column(vector_column_name)returnbuilderelifquery_type=="fts":returnself.query().nearest_to_text(query,columns=fts_columns)elifquery_type=="hybrid":builder=self.query().nearest_to(vector_query)ifvector_column_name:builder=builder.column(vector_column_name)returnbuilder.nearest_to_text(query,columns=fts_columns)else:raiseValueError(f"Unknown query type: '{query_type}'")defvector_search(self,query_vector:Union[VEC,Tuple],)->AsyncVectorQuery:"""        Search the table with a given query vector.        This is a convenience method for preparing a vector query and        is the same thing as calling `nearestTo` on the builder returned        by `query`.  Seer [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more        details.        """returnself.query().nearest_to(query_vector)def_sync_query_to_async(self,query:Query)->AsyncHybridQuery|AsyncFTSQuery|AsyncVectorQuery|AsyncQuery:async_query=self.query()ifquery.limitisnotNone:async_query=async_query.limit(query.limit)ifquery.offsetisnotNone:async_query=async_query.offset(query.offset)ifquery.columns:async_query=async_query.select(query.columns)ifquery.filter:async_query=async_query.where(query.filter)ifquery.fast_search:async_query=async_query.fast_search()ifquery.with_row_id:async_query=async_query.with_row_id()ifquery.vector:async_query=async_query.nearest_to(query.vector).distance_range(query.lower_bound,query.upper_bound)ifquery.distance_typeisnotNone:async_query=async_query.distance_type(query.distance_type)ifquery.minimum_nprobesisnotNone:async_query=async_query.minimum_nprobes(query.minimum_nprobes)ifquery.maximum_nprobesisnotNone:async_query=async_query.maximum_nprobes(query.maximum_nprobes)ifquery.refine_factorisnotNone:async_query=async_query.refine_factor(query.refine_factor)ifquery.vector_column:async_query=async_query.column(query.vector_column)ifquery.ef:async_query=async_query.ef(query.ef)ifquery.bypass_vector_index:async_query=async_query.bypass_vector_index()ifquery.postfilter:async_query=async_query.postfilter()ifquery.full_text_query:async_query=async_query.nearest_to_text(query.full_text_query.query,query.full_text_query.columns)returnasync_queryasyncdef_execute_query(self,query:Query,*,batch_size:Optional[int]=None,timeout:Optional[timedelta]=None,)->pa.RecordBatchReader:# The sync table calls into this method, so we need to map the# query to the async version of the query and run that here. 
This is only# used for that code path right now.async_query=self._sync_query_to_async(query)returnawaitasync_query.to_batches(max_batch_length=batch_size,timeout=timeout)asyncdef_explain_plan(self,query:Query,verbose:Optional[bool])->str:# This method is used by the sync tableasync_query=self._sync_query_to_async(query)returnawaitasync_query.explain_plan(verbose)asyncdef_analyze_plan(self,query:Query)->str:# This method is used by the sync tableasync_query=self._sync_query_to_async(query)returnawaitasync_query.analyze_plan()asyncdef_do_merge(self,merge:LanceMergeInsertBuilder,new_data:DATA,on_bad_vectors:OnBadVectorsType,fill_value:float,)->MergeResult:schema=awaitself.schema()ifon_bad_vectorsisNone:on_bad_vectors="error"iffill_valueisNone:fill_value=0.0data=_sanitize_data(new_data,schema,metadata=schema.metadata,on_bad_vectors=on_bad_vectors,fill_value=fill_value,allow_subschema=True,)ifisinstance(data,pa.Table):data=pa.RecordBatchReader.from_batches(data.schema,data.to_batches())returnawaitself._inner.execute_merge_insert(data,dict(on=merge._on,when_matched_update_all=merge._when_matched_update_all,when_matched_update_all_condition=merge._when_matched_update_all_condition,when_not_matched_insert_all=merge._when_not_matched_insert_all,when_not_matched_by_source_delete=merge._when_not_matched_by_source_delete,when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,timeout=merge._timeout,),)asyncdefdelete(self,where:str)->DeleteResult:"""Delete rows from the table.        This can be used to delete a single row, many rows, all rows, or        sometimes no rows (if your predicate matches nothing).        Parameters        ----------        where: str            The SQL where clause to use when deleting rows.            - For example, 'x = 2' or 'x IN (1, 2, 3)'.            The filter must not be empty, or it will error.        Examples        --------        >>> import lancedb        >>> data = [        ...    {"x": 1, "vector": [1.0, 2]},        ...    {"x": 2, "vector": [3.0, 4]},        ...    {"x": 3, "vector": [5.0, 6]}        ... ]        >>> db = lancedb.connect("./.lancedb")        >>> table = db.create_table("my_table", data)        >>> table.to_pandas()           x      vector        0  1  [1.0, 2.0]        1  2  [3.0, 4.0]        2  3  [5.0, 6.0]        >>> table.delete("x = 2")        DeleteResult(version=2)        >>> table.to_pandas()           x      vector        0  1  [1.0, 2.0]        1  3  [5.0, 6.0]        If you have a list of values to delete, you can combine them into a        stringified list and use the `IN` operator:        >>> to_remove = [1, 5]        >>> to_remove = ", ".join([str(v) for v in to_remove])        >>> to_remove        '1, 5'        >>> table.delete(f"x IN ({to_remove})")        DeleteResult(version=3)        >>> table.to_pandas()           x      vector        0  3  [5.0, 6.0]        """returnawaitself._inner.delete(where)asyncdefupdate(self,updates:Optional[Dict[str,Any]]=None,*,where:Optional[str]=None,updates_sql:Optional[Dict[str,str]]=None,)->UpdateResult:"""        This can be used to update zero to all rows in the table.        If a filter is provided with `where` then only rows matching the        filter will be updated.  Otherwise all rows will be updated.        Parameters        ----------        updates: dict, optional            The updates to apply.  The keys should be the name of the column to            update.  The values should be the new values to assign.  
This is            required unless updates_sql is supplied.        where: str, optional            An SQL filter that controls which rows are updated. For example, 'x = 2'            or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be udpated.        updates_sql: dict, optional            The updates to apply, expressed as SQL expression strings.  The keys should            be column names. The values should be SQL expressions.  These can be SQL            literals (e.g. "7" or "'foo'") or they can be expressions based on the            previous value of the row (e.g. "x + 1" to increment the x column by 1)        Returns        -------        UpdateResult            An object containing:            - rows_updated: The number of rows that were updated            - version: The new version number of the table after the update        Examples        --------        >>> import asyncio        >>> import lancedb        >>> import pandas as pd        >>> async def demo_update():        ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})        ...     db = await lancedb.connect_async("./.lancedb")        ...     table = await db.create_table("my_table", data)        ...     # x is [1, 2], vector is [[1, 2], [3, 4]]        ...     await table.update({"vector": [10, 10]}, where="x = 2")        ...     # x is [1, 2], vector is [[1, 2], [10, 10]]        ...     await table.update(updates_sql={"x": "x + 1"})        ...     # x is [2, 3], vector is [[1, 2], [10, 10]]        >>> asyncio.run(demo_update())        """ifupdatesisnotNoneandupdates_sqlisnotNone:raiseValueError("Only one of updates or updates_sql can be provided")ifupdatesisNoneandupdates_sqlisNone:raiseValueError("Either updates or updates_sql must be provided")ifupdatesisnotNone:updates_sql={k:value_to_sql(v)fork,vinupdates.items()}returnawaitself._inner.update(updates_sql,where)asyncdefadd_columns(self,transforms:dict[str,str]|pa.field|List[pa.field]|pa.Schema)->AddColumnsResult:"""        Add new columns with defined values.        Parameters        ----------        transforms: Dict[str, str]            A map of column name to a SQL expression to use to calculate the            value of the new column. These expressions will be evaluated for            each row in the table, and can reference existing columns.            Alternatively, you can pass a pyarrow field or schema to add            new columns with NULLs.        Returns        -------        AddColumnsResult            version: the new version number of the table after adding columns.        """ifisinstance(transforms,pa.Field):transforms=[transforms]ifisinstance(transforms,list)andall({isinstance(f,pa.Field)forfintransforms}):transforms=pa.schema(transforms)ifisinstance(transforms,pa.Schema):returnawaitself._inner.add_columns_with_schema(transforms)else:returnawaitself._inner.add_columns(list(transforms.items()))asyncdefalter_columns(self,*alterations:Iterable[dict[str,Any]])->AlterColumnsResult:"""        Alter column names and nullability.        alterations : Iterable[Dict[str, Any]]            A sequence of dictionaries, each with the following keys:            - "path": str                The column path to alter. For a top-level column, this is the name.                For a nested column, this is the dot-separated path, e.g. "a.b.c".            - "rename": str, optional                The new name of the column. If not specified, the column name is                not changed.            
- "data_type": pyarrow.DataType, optional               The new data type of the column. Existing values will be casted               to this type. If not specified, the column data type is not changed.            - "nullable": bool, optional                Whether the column should be nullable. If not specified, the column                nullability is not changed. Only non-nullable columns can be changed                to nullable. Currently, you cannot change a nullable column to                non-nullable.        Returns        -------        AlterColumnsResult            version: the new version number of the table after the alteration.        """returnawaitself._inner.alter_columns(alterations)asyncdefdrop_columns(self,columns:Iterable[str]):"""        Drop columns from the table.        Parameters        ----------        columns : Iterable[str]            The names of the columns to drop.        """returnawaitself._inner.drop_columns(columns)asyncdefversion(self)->int:"""        Retrieve the version of the table        LanceDb supports versioning.  Every operation that modifies the table increases        version.  As long as a version hasn't been deleted you can `[Self::checkout]`        that version to view the data at that point.  In addition, you can        `[Self::restore]` the version to replace the current table with a previous        version.        """returnawaitself._inner.version()asyncdeflist_versions(self):"""        List all versions of the table        """versions=awaitself._inner.list_versions()forvinversions:ts_nanos=v["timestamp"]v["timestamp"]=datetime.fromtimestamp(ts_nanos//1e9)+timedelta(microseconds=(ts_nanos%1e9)//1e3)returnversionsasyncdefcheckout(self,version:int|str):"""        Checks out a specific version of the Table        Any read operation on the table will now access the data at the checked out        version. As a consequence, calling this method will disable any read consistency        interval that was previously set.        This is a read-only operation that turns the table into a sort of "view"        or "detached head".  Other table instances will not be affected.  To make the        change permanent you can use the `[Self::restore]` method.        Any operation that modifies the table will fail while the table is in a checked        out state.        Parameters        ----------        version: int | str,            The version to check out. A version number (`int`) or a tag            (`str`) can be provided.        To return the table to a normal state use `[Self::checkout_latest]`        """try:awaitself._inner.checkout(version)exceptRuntimeErrorase:if"not found"instr(e):raiseValueError(f"Version{version} no longer exists. Was it cleaned up?")else:raiseasyncdefcheckout_latest(self):"""        Ensures the table is pointing at the latest version        This can be used to manually update a table when the read_consistency_interval        is None        It can also be used to undo a `[Self::checkout]` operation        """awaitself._inner.checkout_latest()asyncdefrestore(self,version:Optional[int|str]=None):"""        Restore the table to the currently checked out version        This operation will fail if checkout has not been called previously        This operation will overwrite the latest version of the table with a        previous version.  Any changes made since the checked out version will        no longer be visible.        
Once the operation concludes the table will no longer be in a checked        out state and the read_consistency_interval, if any, will apply.        """awaitself._inner.restore(version)@propertydeftags(self)->AsyncTags:"""Tag management for the dataset.        Similar to Git, tags are a way to add metadata to a specific version of the        dataset.        .. warning::            Tagged versions are exempted from the            :py:meth:`optimize(cleanup_older_than)` process.            To remove a version that has been tagged, you must first            :py:meth:`~Tags.delete` the associated tag.        """returnAsyncTags(self._inner)asyncdefoptimize(self,*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain=False,)->OptimizeStats:"""        Optimize the on-disk data and indices for better performance.        Modeled after ``VACUUM`` in PostgreSQL.        Optimization covers three operations:         * Compaction: Merges small files into larger ones         * Prune: Removes old versions of the dataset         * Index: Optimizes the indices, adding new data to existing indices        Parameters        ----------        cleanup_older_than: timedelta, optional default 7 days            All files belonging to versions older than this will be removed.  Set            to 0 days to remove all versions except the latest.  The latest version            is never removed.        delete_unverified: bool, default False            Files leftover from a failed transaction may appear to be part of an            in-progress operation (e.g. appending new data) and these files will not            be deleted unless they are at least 7 days old. If delete_unverified is True            then these files will be deleted regardless of their age.        retrain: bool, default False            If True, retrain the vector indices, this would refine the IVF clustering            and quantization, which may improve the search accuracy. It's faster than            re-creating the index from scratch, so it's recommended to try this first,            when the data distribution has changed significantly.        Experimental API        ----------------        The optimization process is undergoing active development and may change.        Our goal with these changes is to improve the performance of optimization and        reduce the complexity.        That being said, it is essential today to run optimize if you want the best        performance.  It should be stable and safe to use in production, but it our        hope that the API may be simplified (or not even need to be called) in the        future.        The frequency an application shoudl call optimize is based on the frequency of        data modifications.  If data is frequently added, deleted, or updated then        optimize should be run frequently.  A good rule of thumb is to run optimize if        you have added or modified 100,000 or more records or run more than 20 data        modification operations.        
"""cleanup_since_ms:Optional[int]=Noneifcleanup_older_thanisnotNone:cleanup_since_ms=round(cleanup_older_than.total_seconds()*1000)returnawaitself._inner.optimize(cleanup_since_ms=cleanup_since_ms,delete_unverified=delete_unverified,retrain=retrain,)asyncdeflist_indices(self)->Iterable[IndexConfig]:"""        List all indices that have been created with Self::create_index        """returnawaitself._inner.list_indices()asyncdefindex_stats(self,index_name:str)->Optional[IndexStatistics]:"""        Retrieve statistics about an index        Parameters        ----------        index_name: str            The name of the index to retrieve statistics for        Returns        -------        IndexStatistics or None            The statistics about the index. Returns None if the index does not exist.        """stats=awaitself._inner.index_stats(index_name)ifstatsisNone:returnNoneelse:returnIndexStatistics(**stats)asyncdefuses_v2_manifest_paths(self)->bool:"""        Check if the table is using the new v2 manifest paths.        Returns        -------        bool            True if the table is using the new v2 manifest paths, False otherwise.        """returnawaitself._inner.uses_v2_manifest_paths()asyncdefmigrate_manifest_paths_v2(self):"""        Migrate the manifest paths to the new format.        This will update the manifest to use the new v2 format for paths.        This function is idempotent, and can be run multiple times without        changing the state of the object store.        !!! danger            This should not be run while other concurrent operations are happening.            And it should also run until completion before resuming other operations.        You can use        [AsyncTable.uses_v2_manifest_paths][lancedb.table.AsyncTable.uses_v2_manifest_paths]        to check if the table is already using the new path style.        """awaitself._inner.migrate_manifest_paths_v2()asyncdefreplace_field_metadata(self,field_name:str,new_metadata:dict[str,str]):"""        Replace the metadata of a field in the schema        Parameters        ----------        field_name: str            The name of the field to replace the metadata for        new_metadata: dict            The new metadata to set        """awaitself._inner.replace_field_metadata(field_name,new_metadata)

nameproperty

name:str

The name of the table.

tagsproperty

tags:AsyncTags

Tag management for the dataset.

Similar to Git, tags are a way to add metadata to a specific version of the dataset.

Warning

Tagged versions are exempted from the optimize(cleanup_older_than) process. To remove a version that has been tagged, you must first delete the associated tag.

__init__

__init__(table:Table)

Create a new AsyncTable object.

You should not create AsyncTable objects directly.

Use AsyncConnection.create_table and AsyncConnection.open_table to obtain Table objects.

Source code inlancedb/table.py
def__init__(self,table:LanceDBTable):"""Create a new AsyncTable object.    You should not create AsyncTable objects directly.    Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain    Table objects."""self._inner=table

is_open

is_open()->bool

Return True if the table is open.

Source code inlancedb/table.py
defis_open(self)->bool:"""Return True if the table is open."""returnself._inner.is_open()

close

close()

Close the table and free any resources associated with it.

It is safe to call this method multiple times.

Any attempt to use the table after it has been closed will raise an error.

Source code inlancedb/table.py
defclose(self):"""Close the table and free any resources associated with it.    It is safe to call this method multiple times.    Any attempt to use the table after it has been closed will raise an error."""returnself._inner.close()

schemaasync

schema()->Schema

The Arrow Schema of this Table.

Source code inlancedb/table.py
asyncdefschema(self)->pa.Schema:"""The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)    of this Table    """returnawaitself._inner.schema()

embedding_functionsasync

embedding_functions()->Dict[str,EmbeddingFunctionConfig]

Get the embedding functions for the table

Returns:

  • funcs (Dict[str,EmbeddingFunctionConfig]) –

A mapping of the vector column to the embedding function, or an empty dict if not configured.

Source code inlancedb/table.py
asyncdefembedding_functions(self)->Dict[str,EmbeddingFunctionConfig]:"""    Get the embedding functions for the table    Returns    -------    funcs: Dict[str, EmbeddingFunctionConfig]        A mapping of the vector column to the embedding function        or empty dict if not configured.    """schema=awaitself.schema()returnEmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)

count_rowsasync

count_rows(filter:Optional[str]=None)->int

Count the number of rows in the table.

Parameters:

  • filter (Optional[str], default:None) –

    A SQL where clause to filter the rows to count.
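For illustration, a minimal sketch of counting rows with and without a filter (the table and column names are hypothetical, not part of the docstring):

import asyncio
import lancedb

async def count_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    total = await table.count_rows()             # count every row
    matching = await table.count_rows("x > 1")   # count rows matching a SQL filter
    print(total, matching)

asyncio.run(count_demo())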

Source code inlancedb/table.py
asyncdefcount_rows(self,filter:Optional[str]=None)->int:"""    Count the number of rows in the table.    Parameters    ----------    filter: str, optional        A SQL where clause to filter the rows to count.    """returnawaitself._inner.count_rows(filter)

headasync

head(n=5)->Table

Return the first n rows of the table.

Parameters:

  • n (int, default: 5) –

    The number of rows to return.

Source code inlancedb/table.py
asyncdefhead(self,n=5)->pa.Table:"""    Return the first `n` rows of the table.    Parameters    ----------    n: int, default 5        The number of rows to return.    """returnawaitself.query().limit(n).to_arrow()

query

query()->AsyncQuery

Returns an AsyncQuery that can be used to search the table.

Use methods on the returned query to control query behavior. The query can be executed with methods like to_arrow, to_pandas and more.
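As an illustrative sketch (the table and column names are hypothetical), a filtered, projected, and limited query built on the returned builder:

import asyncio
import lancedb

async def query_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # Builder methods chain synchronously; execution methods are awaited.
    results = await (
        table.query()
        .where("x > 1")
        .select(["x", "vector"])
        .limit(10)
        .to_arrow()
    )
    print(results)

asyncio.run(query_demo())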

Source code inlancedb/table.py
defquery(self)->AsyncQuery:"""    Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used    to search the table.    Use methods on the returned query to control query behavior.  The query    can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],    [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.    """returnAsyncQuery(self._inner.query())

to_pandasasync

to_pandas()->'pd.DataFrame'

Return the table as a pandas DataFrame.

Returns:

  • DataFrame
Source code inlancedb/table.py
asyncdefto_pandas(self)->"pd.DataFrame":"""Return the table as a pandas DataFrame.    Returns    -------    pd.DataFrame    """return(awaitself.to_arrow()).to_pandas()

to_arrowasync

to_arrow()->Table

Return the table as a pyarrow Table.

Returns:

Source code inlancedb/table.py
asyncdefto_arrow(self)->pa.Table:"""Return the table as a pyarrow Table.    Returns    -------    pa.Table    """returnawaitself.query().to_arrow()

create_indexasync

create_index(column:str,*,replace:Optional[bool]=None,config:Optional[Union[IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS]]=None,wait_timeout:Optional[timedelta]=None)

Create an index to speed up queries

Indices can be created on vector columns or scalar columns. Indices on vector columns will speed up vector searches. Indices on scalar columns will speed up filtering (in both vector and non-vector searches).

Parameters:

  • column (str) –

    The column to index.

  • replace (Optional[bool], default:None) –

    Whether to replace the existing index

If this is false, and another index already exists on the same columns and the same name, then an error will be returned. This is true even if that index is out of date.

    The default is True

  • config (Optional[Union[IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS]], default:None) –

For advanced configuration you can specify the type of index you would like to create. You can also specify index-specific parameters when creating an index object.

  • wait_timeout (Optional[timedelta], default:None) –

    The timeout to wait if indexing is asynchronous.
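For illustration, a minimal sketch creating one scalar index and one vector index (the column names and the IvfPq parameter shown are assumptions, not part of the docstring):

import asyncio
import lancedb
from lancedb.index import BTree, IvfPq

async def index_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # Scalar index to speed up filters on the "x" column
    await table.create_index("x", config=BTree())
    # Vector index on the "vector" column
    await table.create_index("vector", config=IvfPq(distance_type="cosine"))

asyncio.run(index_demo())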

Source code inlancedb/table.py
asyncdefcreate_index(self,column:str,*,replace:Optional[bool]=None,config:Optional[Union[IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS]]=None,wait_timeout:Optional[timedelta]=None,):"""Create an index to speed up queries    Indices can be created on vector columns or scalar columns.    Indices on vector columns will speed up vector searches.    Indices on scalar columns will speed up filtering (in both    vector and non-vector searches)    Parameters    ----------    column: str        The column to index.    replace: bool, default True        Whether to replace the existing index        If this is false, and another index already exists on the same columns        and the same name, then an error will be returned.  This is true even if        that index is out of date.        The default is True    config: default None        For advanced configuration you can specify the type of index you would        like to create.   You can also specify index-specific parameters when        creating an index object.    wait_timeout: timedelta, optional        The timeout to wait if indexing is asynchronous.    """ifconfigisnotNone:ifnotisinstance(config,(IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS)):raiseTypeError("config must be an instance of IvfPq, HnswPq, HnswSq, BTree,"" Bitmap, LabelList, or FTS")try:awaitself._inner.create_index(column,index=config,replace=replace,wait_timeout=wait_timeout)exceptValueErrorase:if"not support the requested language"instr(e):supported_langs=", ".join(lang_mapping.values())help_msg=f"Supported languages:{supported_langs}"add_note(e,help_msg)raisee

drop_indexasync

drop_index(name:str)->None

Drop an index from the table.

Parameters:

  • name (str) –

    The name of the index to drop.

Notes

This does not delete the index from disk, it just removes it from the table. To delete the index, run optimize after dropping the index.

Use list_indices to find the names of the indices.
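A minimal sketch (the index name shown is hypothetical; use list_indices to discover the real names):

import asyncio
import lancedb

async def drop_index_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    for index in await table.list_indices():
        print(index)                   # inspect existing indices and their names
    await table.drop_index("x_idx")    # hypothetical index name
    await table.optimize()             # reclaims the index data on disk

asyncio.run(drop_index_demo())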

Source code inlancedb/table.py
asyncdefdrop_index(self,name:str)->None:"""    Drop an index from the table.    Parameters    ----------    name: str        The name of the index to drop.    Notes    -----    This does not delete the index from disk, it just removes it from the table.    To delete the index, run [optimize][lancedb.table.AsyncTable.optimize]    after dropping the index.    Use [list_indices][lancedb.table.AsyncTable.list_indices] to find the names    of the indices.    """awaitself._inner.drop_index(name)

prewarm_indexasync

prewarm_index(name:str)->None

Prewarm an index in the table.

Parameters:

  • name (str) –

    The name of the index to prewarm

Notes

This will load the index into memory. This may reduce the cold-start time for future queries. If the index does not fit in the cache then this call may be wasteful.

Source code inlancedb/table.py
asyncdefprewarm_index(self,name:str)->None:"""    Prewarm an index in the table.    Parameters    ----------    name: str        The name of the index to prewarm    Notes    -----    This will load the index into memory.  This may reduce the cold-start time for    future queries.  If the index does not fit in the cache then this call may be    wasteful.    """awaitself._inner.prewarm_index(name)

wait_for_indexasync

wait_for_index(index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None

Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.

Parameters:

  • index_names (Iterable[str]) –

The names of the indices to poll.

  • timeout (timedelta, default:timedelta(seconds=300)) –

    Timeout to wait for asynchronous indexing. The default is 5 minutes.
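A sketch of waiting for an asynchronous index build (the index name is an assumption; actual names can be confirmed with list_indices):

import asyncio
from datetime import timedelta
import lancedb

async def wait_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    await table.create_index("vector")
    # Poll until the named indices are fully built, or raise on timeout.
    await table.wait_for_index(["vector_idx"], timeout=timedelta(minutes=10))

asyncio.run(wait_demo())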

Source code inlancedb/table.py
asyncdefwait_for_index(self,index_names:Iterable[str],timeout:timedelta=timedelta(seconds=300))->None:"""    Wait for indexing to complete for the given index names.    This will poll the table until all the indices are fully indexed,    or raise a timeout exception if the timeout is reached.    Parameters    ----------    index_names: str        The name of the indices to poll    timeout: timedelta        Timeout to wait for asynchronous indexing. The default is 5 minutes.    """awaitself._inner.wait_for_index(index_names,timeout)

statsasync

stats()->TableStatistics

Retrieve table and fragment statistics.

Source code inlancedb/table.py
asyncdefstats(self)->TableStatistics:"""    Retrieve table and fragment statistics.    """returnawaitself._inner.stats()

addasync

add(data:DATA,*,mode:Optional[Literal['append','overwrite']]='append',on_bad_vectors:Optional[OnBadVectorsType]=None,fill_value:Optional[float]=None)->AddResult

Add more data to the Table.

Parameters:

  • data (DATA) –

    The data to insert into the table. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • mode (Optional[Literal['append', 'overwrite']], default:'append') –

The mode to use when writing the data. Valid values are "append" and "overwrite".

  • on_bad_vectors (Optional[OnBadVectorsType], default:None) –

What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill", "null".

  • fill_value (Optional[float], default:None) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".
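A minimal sketch of appending rows (the schema and values are hypothetical):

import asyncio
import lancedb

async def add_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    await table.add([
        {"x": 10, "vector": [1.0, 2.0]},
        {"x": 11, "vector": [3.0, 4.0]},
    ])
    # mode="overwrite" would replace the existing rows instead of appending.

asyncio.run(add_demo())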

Source code inlancedb/table.py
asyncdefadd(self,data:DATA,*,mode:Optional[Literal["append","overwrite"]]="append",on_bad_vectors:Optional[OnBadVectorsType]=None,fill_value:Optional[float]=None,)->AddResult:"""Add more data to the [Table](Table).    Parameters    ----------    data: DATA        The data to insert into the table. Acceptable types are:        - list-of-dict        - pandas.DataFrame        - pyarrow.Table or pyarrow.RecordBatch    mode: str        The mode to use when writing the data. Valid values are        "append" and "overwrite".    on_bad_vectors: str, default "error"        What to do if any of the vectors are not the same size or contains NaNs.        One of "error", "drop", "fill", "null".    fill_value: float, default 0.        The value to use when filling vectors. Only used if on_bad_vectors="fill".    """schema=awaitself.schema()ifon_bad_vectorsisNone:on_bad_vectors="error"iffill_valueisNone:fill_value=0.0data=_sanitize_data(data,schema,metadata=schema.metadata,on_bad_vectors=on_bad_vectors,fill_value=fill_value,allow_subschema=True,)ifisinstance(data,pa.Table):data=data.to_reader()returnawaitself._inner.add(data,modeor"append")

merge_insert

merge_insert(on:Union[str,Iterable[str]])->LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation.

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january").

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table andthe target table. "Not matched" records exist only in the source table(e.g. these are new data) "Not matched by source" records exist onlyin the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

  • on (Union[str,Iterable[str]]) –

A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

Examples:

>>> import lancedb
>>> import pyarrow as pa
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> res = table.merge_insert("a")     \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code inlancedb/table.py
defmerge_insert(self,on:Union[str,Iterable[str]])->LanceMergeInsertBuilder:"""    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]    that can be used to create a "merge insert" operation    This operation can add rows, update rows, and remove rows all in a single    transaction. It is a very generic tool that can be used to create    behaviors like "insert if not exists", "update or insert (i.e. upsert)",    or even replace a portion of existing data with new data (e.g. replace    all data where month="january")    The merge insert operation works by combining new data from a    **source table** with existing data in a **target table** by using a    join.  There are three categories of records.    "Matched" records are records that exist in both the source table and    the target table. "Not matched" records exist only in the source table    (e.g. these are new data) "Not matched by source" records exist only    in the target table (this is old data)    The builder returned by this method can be used to customize what    should happen for each category of data.    Please note that the data may appear to be reordered as part of this    operation.  This is because updated rows will be deleted from the    dataset and then reinserted at the end with the new values.    Parameters    ----------    on: Union[str, Iterable[str]]        A column (or columns) to join on.  This is how records from the        source table and target table are matched.  Typically this is some        kind of key or id column.    Examples    --------    >>> import lancedb    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data)    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})    >>> # Perform a "upsert" operation    >>> res = table.merge_insert("a")     \\    ...      .when_matched_update_all()     \\    ...      .when_not_matched_insert_all() \\    ...      .execute(new_data)    >>> res    MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)    >>> # The order of new rows is non-deterministic since we use    >>> # a hash-join as part of this operation and so we sort here    >>> table.to_arrow().sort_by("a").to_pandas()       a  b    0  1  b    1  2  x    2  3  y    3  4  z    """# noqa: E501on=[on]ifisinstance(on,str)elselist(iter(on))returnLanceMergeInsertBuilder(self,on)

searchasync

search(query:Optional[Union[VEC,str,'PIL.Image.Image',Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType='auto',ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None)->Union[AsyncHybridQuery,AsyncFTSQuery,AsyncVectorQuery]

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search.

All query options are defined in AsyncQuery.

Parameters:

  • query (Optional[Union[VEC,str, 'PIL.Image.Image',Tuple,FullTextQuery]], default:None) –

The targeted vector to search for.

  • default None. Acceptable types are: list, np.ndarray, PIL.Image.Image

  • If None then the select/where/limit clauses are applied to filter the table

  • vector_column_name (Optional[str], default:None) –

    The name of the vector column to search.

    The vector column needs to be a pyarrow fixed size list type

  • If not specified then the vector column is inferred from the table schema

  • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

  • query_type (QueryType, default:'auto') –

    default "auto".Acceptable types are: "vector", "fts", "hybrid", or "auto"

    • If "auto" then the query type is inferred from the query;

  • If query is a list/np.ndarray then the query type is "vector";

  • If query is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found.

  • If query is a string, then the query type is "vector" if the table has embedding functions, else the query type is "fts".

Returns:

  • Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery] –

    A query builder object representing the query.
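For illustration, a sketch of a vector search and a full-text search (the table, query values, and the presence of an FTS index are assumptions):

import asyncio
import lancedb

async def search_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # Vector search with an explicit query vector
    vector_query = await table.search([0.5, 1.2])
    hits = await vector_query.limit(5).to_pandas()
    # Full-text search (requires an FTS index on the searched column)
    fts_query = await table.search("puppy", query_type="fts")
    docs = await fts_query.limit(5).to_pandas()
    print(hits, docs)

asyncio.run(search_demo())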

Source code inlancedb/table.py
asyncdefsearch(self,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]]=None,vector_column_name:Optional[str]=None,query_type:QueryType="auto",ordering_field_name:Optional[str]=None,fts_columns:Optional[Union[str,List[str]]]=None,)->Union[AsyncHybridQuery,AsyncFTSQuery,AsyncVectorQuery]:"""Create a search query to find the nearest neighbors    of the given query vector. We currently support [vector search][search]    and [full-text search][experimental-full-text-search].    All query options are defined in [AsyncQuery][lancedb.query.AsyncQuery].    Parameters    ----------    query: list/np.ndarray/str/PIL.Image.Image, default None        The targetted vector to search for.        - *default None*.        Acceptable types are: list, np.ndarray, PIL.Image.Image        - If None then the select/where/limit clauses are applied to filter        the table    vector_column_name: str, optional        The name of the vector column to search.        The vector column needs to be a pyarrow fixed size list type        - If not specified then the vector column is inferred from        the table schema        - If the table has multiple vector columns then the *vector_column_name*        needs to be specified. Otherwise, an error is raised.    query_type: str        *default "auto"*.        Acceptable types are: "vector", "fts", "hybrid", or "auto"        - If "auto" then the query type is inferred from the query;            - If `query` is a list/np.ndarray then the query type is            "vector";            - If `query` is a PIL.Image.Image then either do vector search,            or raise an error if no corresponding embedding function is found.        - If `query` is a string, then the query type is "vector" if the          table has embedding functions else the query type is "fts"    Returns    -------    LanceQueryBuilder        A query builder object representing the query.    
"""defis_embedding(query):returnisinstance(query,(list,np.ndarray,pa.Array,pa.ChunkedArray))asyncdefget_embedding_func(vector_column_name:Optional[str],query_type:QueryType,query:Optional[Union[VEC,str,"PIL.Image.Image",Tuple,FullTextQuery]],)->Tuple[str,EmbeddingFunctionConfig]:ifisinstance(query,FullTextQuery):query_type="fts"schema=awaitself.schema()vector_column_name=infer_vector_column_name(schema=schema,query_type=query_type,query=query,vector_column_name=vector_column_name,)funcs=EmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)func=funcs.get(vector_column_name)iffuncisNone:error=ValueError(f"Column '{vector_column_name}' has no registered ""embedding function.")iflen(funcs)>0:add_note(error,"Embedding functions are registered for columns: "f"{list(funcs.keys())}",)else:add_note(error,"No embedding functions are registered for any columns.")raiseerrorreturnvector_column_name,funcasyncdefmake_embedding(embedding,query):ifembeddingisnotNone:loop=asyncio.get_running_loop()# This function is likely to block, since it either calls an expensive# function or makes an HTTP request to an embeddings REST API.return(awaitloop.run_in_executor(None,embedding.function.compute_query_embeddings_with_retry,query,))[0]else:returnNoneifquery_type=="auto":# Infer the query type.ifis_embedding(query):vector_query=queryquery_type="vector"elifisinstance(query,FullTextQuery):query_type="fts"elifisinstance(query,str):try:(indices,(vector_column_name,embedding_conf),)=awaitasyncio.gather(self.list_indices(),get_embedding_func(vector_column_name,"auto",query),)exceptValueErrorase:if"Column"instr(e)and"has no registered embedding function"instr(e):# If the column has no registered embedding function,# then it's an FTS query.query_type="fts"else:raiseeelse:ifembedding_confisnotNone:vector_query=awaitmake_embedding(embedding_conf,query)ifany(i.columns[0]==embedding_conf.source_columnandi.index_type=="FTS"foriinindices):query_type="hybrid"else:query_type="vector"else:query_type="fts"else:# it's an image or something else embeddable.query_type="vector"elifquery_type=="vector":ifis_embedding(query):vector_query=queryelse:vector_column_name,embedding_conf=awaitget_embedding_func(vector_column_name,query_type,query)vector_query=awaitmake_embedding(embedding_conf,query)elifquery_type=="hybrid":ifis_embedding(query):raiseValueError("Hybrid search requires a text query")else:vector_column_name,embedding_conf=awaitget_embedding_func(vector_column_name,query_type,query)vector_query=awaitmake_embedding(embedding_conf,query)ifquery_type=="vector":builder=self.query().nearest_to(vector_query)ifvector_column_name:builder=builder.column(vector_column_name)returnbuilderelifquery_type=="fts":returnself.query().nearest_to_text(query,columns=fts_columns)elifquery_type=="hybrid":builder=self.query().nearest_to(vector_query)ifvector_column_name:builder=builder.column(vector_column_name)returnbuilder.nearest_to_text(query,columns=fts_columns)else:raiseValueError(f"Unknown query type: '{query_type}'")

vector_search

vector_search(query_vector:Union[VEC,Tuple])->AsyncVectorQuery

Search the table with a given query vector. This is a convenience method for preparing a vector query and is the same thing as calling nearest_to on the builder returned by query. See nearest_to for more details.
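A minimal sketch (the query vector and table are hypothetical):

import asyncio
import lancedb

async def vector_search_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # Equivalent to table.query().nearest_to([0.5, 1.2])
    results = await table.vector_search([0.5, 1.2]).limit(5).to_arrow()
    print(results)

asyncio.run(vector_search_demo())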

Source code inlancedb/table.py
defvector_search(self,query_vector:Union[VEC,Tuple],)->AsyncVectorQuery:"""    Search the table with a given query vector.    This is a convenience method for preparing a vector query and    is the same thing as calling `nearestTo` on the builder returned    by `query`.  Seer [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more    details.    """returnself.query().nearest_to(query_vector)

deleteasync

delete(where:str)->DeleteResult

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

  • where (str) –

    The SQL where clause to use when deleting rows.

    • For example, 'x = 2' or 'x IN (1, 2, 3)'.

    The filter must not be empty, or it will error.

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1.0, 2]},
...    {"x": 2, "vector": [3.0, 4]},
...    {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
DeleteResult(version=2)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
DeleteResult(version=3)
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
Source code inlancedb/table.py
asyncdefdelete(self,where:str)->DeleteResult:"""Delete rows from the table.    This can be used to delete a single row, many rows, all rows, or    sometimes no rows (if your predicate matches nothing).    Parameters    ----------    where: str        The SQL where clause to use when deleting rows.        - For example, 'x = 2' or 'x IN (1, 2, 3)'.        The filter must not be empty, or it will error.    Examples    --------    >>> import lancedb    >>> data = [    ...    {"x": 1, "vector": [1.0, 2]},    ...    {"x": 2, "vector": [3.0, 4]},    ...    {"x": 3, "vector": [5.0, 6]}    ... ]    >>> db = lancedb.connect("./.lancedb")    >>> table = db.create_table("my_table", data)    >>> table.to_pandas()       x      vector    0  1  [1.0, 2.0]    1  2  [3.0, 4.0]    2  3  [5.0, 6.0]    >>> table.delete("x = 2")    DeleteResult(version=2)    >>> table.to_pandas()       x      vector    0  1  [1.0, 2.0]    1  3  [5.0, 6.0]    If you have a list of values to delete, you can combine them into a    stringified list and use the `IN` operator:    >>> to_remove = [1, 5]    >>> to_remove = ", ".join([str(v) for v in to_remove])    >>> to_remove    '1, 5'    >>> table.delete(f"x IN ({to_remove})")    DeleteResult(version=3)    >>> table.to_pandas()       x      vector    0  3  [5.0, 6.0]    """returnawaitself._inner.delete(where)

updateasync

update(updates:Optional[Dict[str,Any]]=None,*,where:Optional[str]=None,updates_sql:Optional[Dict[str,str]]=None)->UpdateResult

This can be used to update zero to all rows in the table.

If a filter is provided with where then only rows matching the filter will be updated. Otherwise all rows will be updated.

Parameters:

  • updates (Optional[Dict[str,Any]], default:None) –

The updates to apply. The keys should be the name of the column to update. The values should be the new values to assign. This is required unless updates_sql is supplied.

  • where (Optional[str], default:None) –

An SQL filter that controls which rows are updated. For example, 'x = 2' or 'x IN (1, 2, 3)'. Only rows that satisfy this filter will be updated.

  • updates_sql (Optional[Dict[str,str]], default:None) –

The updates to apply, expressed as SQL expression strings. The keys should be column names. The values should be SQL expressions. These can be SQL literals (e.g. "7" or "'foo'") or they can be expressions based on the previous value of the row (e.g. "x + 1" to increment the x column by 1).

Returns:

  • UpdateResult

An object containing:

  • rows_updated: The number of rows that were updated

  • version: The new version number of the table after the update

Examples:

>>> import asyncio
>>> import lancedb
>>> import pandas as pd
>>> async def demo_update():
...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.create_table("my_table", data)
...     # x is [1, 2], vector is [[1, 2], [3, 4]]
...     await table.update({"vector": [10, 10]}, where="x = 2")
...     # x is [1, 2], vector is [[1, 2], [10, 10]]
...     await table.update(updates_sql={"x": "x + 1"})
...     # x is [2, 3], vector is [[1, 2], [10, 10]]
>>> asyncio.run(demo_update())
Source code inlancedb/table.py
asyncdefupdate(self,updates:Optional[Dict[str,Any]]=None,*,where:Optional[str]=None,updates_sql:Optional[Dict[str,str]]=None,)->UpdateResult:"""    This can be used to update zero to all rows in the table.    If a filter is provided with `where` then only rows matching the    filter will be updated.  Otherwise all rows will be updated.    Parameters    ----------    updates: dict, optional        The updates to apply.  The keys should be the name of the column to        update.  The values should be the new values to assign.  This is        required unless updates_sql is supplied.    where: str, optional        An SQL filter that controls which rows are updated. For example, 'x = 2'        or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be udpated.    updates_sql: dict, optional        The updates to apply, expressed as SQL expression strings.  The keys should        be column names. The values should be SQL expressions.  These can be SQL        literals (e.g. "7" or "'foo'") or they can be expressions based on the        previous value of the row (e.g. "x + 1" to increment the x column by 1)    Returns    -------    UpdateResult        An object containing:        - rows_updated: The number of rows that were updated        - version: The new version number of the table after the update    Examples    --------    >>> import asyncio    >>> import lancedb    >>> import pandas as pd    >>> async def demo_update():    ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})    ...     db = await lancedb.connect_async("./.lancedb")    ...     table = await db.create_table("my_table", data)    ...     # x is [1, 2], vector is [[1, 2], [3, 4]]    ...     await table.update({"vector": [10, 10]}, where="x = 2")    ...     # x is [1, 2], vector is [[1, 2], [10, 10]]    ...     await table.update(updates_sql={"x": "x + 1"})    ...     # x is [2, 3], vector is [[1, 2], [10, 10]]    >>> asyncio.run(demo_update())    """ifupdatesisnotNoneandupdates_sqlisnotNone:raiseValueError("Only one of updates or updates_sql can be provided")ifupdatesisNoneandupdates_sqlisNone:raiseValueError("Either updates or updates_sql must be provided")ifupdatesisnotNone:updates_sql={k:value_to_sql(v)fork,vinupdates.items()}returnawaitself._inner.update(updates_sql,where)

add_columnsasync

add_columns(transforms:dict[str,str]|field|List[field]|Schema)->AddColumnsResult

Add new columns with defined values.

Parameters:

  • transforms (dict[str,str] |field |List[field] |Schema) –

A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, you can pass a pyarrow field or schema to add new columns with NULLs.

Returns:

  • AddColumnsResult

    version: the new version number of the table after adding columns.
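A sketch of both forms, a SQL expression and a pyarrow field (column names are hypothetical):

import asyncio
import lancedb
import pyarrow as pa

async def add_columns_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # New column computed from existing columns via a SQL expression
    await table.add_columns({"double_x": "x * 2"})
    # New nullable column filled with NULLs
    await table.add_columns(pa.field("note", pa.string()))

asyncio.run(add_columns_demo())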

Source code inlancedb/table.py
asyncdefadd_columns(self,transforms:dict[str,str]|pa.field|List[pa.field]|pa.Schema)->AddColumnsResult:"""    Add new columns with defined values.    Parameters    ----------    transforms: Dict[str, str]        A map of column name to a SQL expression to use to calculate the        value of the new column. These expressions will be evaluated for        each row in the table, and can reference existing columns.        Alternatively, you can pass a pyarrow field or schema to add        new columns with NULLs.    Returns    -------    AddColumnsResult        version: the new version number of the table after adding columns.    """ifisinstance(transforms,pa.Field):transforms=[transforms]ifisinstance(transforms,list)andall({isinstance(f,pa.Field)forfintransforms}):transforms=pa.schema(transforms)ifisinstance(transforms,pa.Schema):returnawaitself._inner.add_columns_with_schema(transforms)else:returnawaitself._inner.add_columns(list(transforms.items()))

alter_columnsasync

alter_columns(*alterations:Iterable[dict[str,Any]])->AlterColumnsResult

Alter column names and nullability.

alterations : Iterable[Dict[str, Any]]

A sequence of dictionaries, each with the following keys:

  • "path": str – The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".

  • "rename": str, optional – The new name of the column. If not specified, the column name is not changed.

  • "data_type": pyarrow.DataType, optional – The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.

  • "nullable": bool, optional – Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Returns:

  • AlterColumnsResult

    version: the new version number of the table after the alteration.
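A sketch renaming one column and casting another (column names are hypothetical):

import asyncio
import lancedb
import pyarrow as pa

async def alter_columns_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    await table.alter_columns(
        {"path": "x", "rename": "x_renamed"},
        {"path": "double_x", "data_type": pa.float64()},
    )

asyncio.run(alter_columns_demo())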

Source code inlancedb/table.py
asyncdefalter_columns(self,*alterations:Iterable[dict[str,Any]])->AlterColumnsResult:"""    Alter column names and nullability.    alterations : Iterable[Dict[str, Any]]        A sequence of dictionaries, each with the following keys:        - "path": str            The column path to alter. For a top-level column, this is the name.            For a nested column, this is the dot-separated path, e.g. "a.b.c".        - "rename": str, optional            The new name of the column. If not specified, the column name is            not changed.        - "data_type": pyarrow.DataType, optional           The new data type of the column. Existing values will be casted           to this type. If not specified, the column data type is not changed.        - "nullable": bool, optional            Whether the column should be nullable. If not specified, the column            nullability is not changed. Only non-nullable columns can be changed            to nullable. Currently, you cannot change a nullable column to            non-nullable.    Returns    -------    AlterColumnsResult        version: the new version number of the table after the alteration.    """returnawaitself._inner.alter_columns(alterations)

drop_columnsasync

drop_columns(columns:Iterable[str])

Drop columns from the table.

Parameters:

  • columns (Iterable[str]) –

    The names of the columns to drop.

Source code inlancedb/table.py
asyncdefdrop_columns(self,columns:Iterable[str]):"""    Drop columns from the table.    Parameters    ----------    columns : Iterable[str]        The names of the columns to drop.    """returnawaitself._inner.drop_columns(columns)

versionasync

version()->int

Retrieve the version of the table

LanceDB supports versioning. Every operation that modifies the table increases the version. As long as a version hasn't been deleted, you can checkout that version to view the data at that point. In addition, you can restore the version to replace the current table with a previous version.

Source code inlancedb/table.py
asyncdefversion(self)->int:"""    Retrieve the version of the table    LanceDb supports versioning.  Every operation that modifies the table increases    version.  As long as a version hasn't been deleted you can `[Self::checkout]`    that version to view the data at that point.  In addition, you can    `[Self::restore]` the version to replace the current table with a previous    version.    """returnawaitself._inner.version()

list_versionsasync

list_versions()

List all versions of the table

Source code inlancedb/table.py
asyncdeflist_versions(self):"""    List all versions of the table    """versions=awaitself._inner.list_versions()forvinversions:ts_nanos=v["timestamp"]v["timestamp"]=datetime.fromtimestamp(ts_nanos//1e9)+timedelta(microseconds=(ts_nanos%1e9)//1e3)returnversions

checkoutasync

checkout(version:int|str)

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the restore method.

Any operation that modifies the table will fail while the table is in a checked out state.

Parameters:

  • version (int |str) –

The version to check out. A version number (int) or a tag (str) can be provided.

To return the table to a normal state, use checkout_latest.
Source code inlancedb/table.py
asyncdefcheckout(self,version:int|str):"""    Checks out a specific version of the Table    Any read operation on the table will now access the data at the checked out    version. As a consequence, calling this method will disable any read consistency    interval that was previously set.    This is a read-only operation that turns the table into a sort of "view"    or "detached head".  Other table instances will not be affected.  To make the    change permanent you can use the `[Self::restore]` method.    Any operation that modifies the table will fail while the table is in a checked    out state.    Parameters    ----------    version: int | str,        The version to check out. A version number (`int`) or a tag        (`str`) can be provided.    To return the table to a normal state use `[Self::checkout_latest]`    """try:awaitself._inner.checkout(version)exceptRuntimeErrorase:if"not found"instr(e):raiseValueError(f"Version{version} no longer exists. Was it cleaned up?")else:raise

checkout_latestasync

checkout_latest()

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a checkout operation.

Source code inlancedb/table.py
asyncdefcheckout_latest(self):"""    Ensures the table is pointing at the latest version    This can be used to manually update a table when the read_consistency_interval    is None    It can also be used to undo a `[Self::checkout]` operation    """awaitself._inner.checkout_latest()

restoreasync

restore(version:Optional[int|str]=None)

Restore the table to the currently checked out version

This operation will fail if checkout has not been called previously

This operation will overwrite the latest version of the table with a previous version. Any changes made since the checked out version will no longer be visible.

Once the operation concludes the table will no longer be in a checked out state and the read_consistency_interval, if any, will apply.
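For illustration, a sketch of the checkout/restore flow (the version arithmetic assumes an earlier version still exists):

import asyncio
import lancedb

async def restore_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    current = await table.version()
    # View the table as it was at an earlier version (read-only)
    await table.checkout(current - 1)
    # Make that earlier version the new latest version
    await table.restore()
    # Alternatively, abandon the checkout and return to the latest version:
    # await table.checkout_latest()

asyncio.run(restore_demo())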

Source code inlancedb/table.py
asyncdefrestore(self,version:Optional[int|str]=None):"""    Restore the table to the currently checked out version    This operation will fail if checkout has not been called previously    This operation will overwrite the latest version of the table with a    previous version.  Any changes made since the checked out version will    no longer be visible.    Once the operation concludes the table will no longer be in a checked    out state and the read_consistency_interval, if any, will apply.    """awaitself._inner.restore(version)

optimizeasync

optimize(*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain=False)->OptimizeStats

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

  • Compaction: Merges small files into larger ones
  • Prune: Removes old versions of the dataset
  • Index: Optimizes the indices, adding new data to existing indices

Parameters:

  • cleanup_older_than (Optional[timedelta], default:None) –

All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.

  • delete_unverified (bool, default:False) –

Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.

  • retrain (bool, default: False) –

If True, retrain the vector indices. This will refine the IVF clustering and quantization, which may improve the search accuracy. It's faster than re-creating the index from scratch, so it's recommended to try this first when the data distribution has changed significantly.

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.

How often an application should call optimize depends on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
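A sketch of a typical maintenance call (the retention window is illustrative):

import asyncio
from datetime import timedelta
import lancedb

async def optimize_demo():
    db = await lancedb.connect_async("./.lancedb")
    table = await db.open_table("my_table")
    # Compact small files, prune versions older than one day, update indices
    stats = await table.optimize(cleanup_older_than=timedelta(days=1))
    print(stats)

asyncio.run(optimize_demo())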

Source code inlancedb/table.py
asyncdefoptimize(self,*,cleanup_older_than:Optional[timedelta]=None,delete_unverified:bool=False,retrain=False,)->OptimizeStats:"""    Optimize the on-disk data and indices for better performance.    Modeled after ``VACUUM`` in PostgreSQL.    Optimization covers three operations:     * Compaction: Merges small files into larger ones     * Prune: Removes old versions of the dataset     * Index: Optimizes the indices, adding new data to existing indices    Parameters    ----------    cleanup_older_than: timedelta, optional default 7 days        All files belonging to versions older than this will be removed.  Set        to 0 days to remove all versions except the latest.  The latest version        is never removed.    delete_unverified: bool, default False        Files leftover from a failed transaction may appear to be part of an        in-progress operation (e.g. appending new data) and these files will not        be deleted unless they are at least 7 days old. If delete_unverified is True        then these files will be deleted regardless of their age.    retrain: bool, default False        If True, retrain the vector indices, this would refine the IVF clustering        and quantization, which may improve the search accuracy. It's faster than        re-creating the index from scratch, so it's recommended to try this first,        when the data distribution has changed significantly.    Experimental API    ----------------    The optimization process is undergoing active development and may change.    Our goal with these changes is to improve the performance of optimization and    reduce the complexity.    That being said, it is essential today to run optimize if you want the best    performance.  It should be stable and safe to use in production, but it our    hope that the API may be simplified (or not even need to be called) in the    future.    The frequency an application shoudl call optimize is based on the frequency of    data modifications.  If data is frequently added, deleted, or updated then    optimize should be run frequently.  A good rule of thumb is to run optimize if    you have added or modified 100,000 or more records or run more than 20 data    modification operations.    """cleanup_since_ms:Optional[int]=Noneifcleanup_older_thanisnotNone:cleanup_since_ms=round(cleanup_older_than.total_seconds()*1000)returnawaitself._inner.optimize(cleanup_since_ms=cleanup_since_ms,delete_unverified=delete_unverified,retrain=retrain,)

list_indicesasync

list_indices()->Iterable[IndexConfig]

List all indices that have been created with create_index.

Source code inlancedb/table.py
asyncdeflist_indices(self)->Iterable[IndexConfig]:"""    List all indices that have been created with Self::create_index    """returnawaitself._inner.list_indices()

index_statsasync

index_stats(index_name:str)->Optional[IndexStatistics]

Retrieve statistics about an index

Parameters:

  • index_name (str) –

    The name of the index to retrieve statistics for

Returns:

  • IndexStatistics or None

    The statistics about the index. Returns None if the index does not exist.

Source code inlancedb/table.py
async def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
    """
    Retrieve statistics about an index

    Parameters
    ----------
    index_name: str
        The name of the index to retrieve statistics for

    Returns
    -------
    IndexStatistics or None
        The statistics about the index. Returns None if the index does not exist.
    """
    stats = await self._inner.index_stats(index_name)
    if stats is None:
        return None
    else:
        return IndexStatistics(**stats)
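
For illustration, a minimal sketch that lists indices and then looks up statistics for one of them. The open AsyncTable named table and the index name "vector_idx" are assumptions for the example:

async def inspect_indices(table):
    # Print every index configured on the table.
    for index in await table.list_indices():
        print(index)
    # Look up statistics for a specific index by name.
    stats = await table.index_stats("vector_idx")
    if stats is None:
        print("index does not exist")
    else:
        print(stats)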

uses_v2_manifest_pathsasync

uses_v2_manifest_paths()->bool

Check if the table is using the new v2 manifest paths.

Returns:

  • bool

    True if the table is using the new v2 manifest paths, False otherwise.

Source code inlancedb/table.py
async def uses_v2_manifest_paths(self) -> bool:
    """
    Check if the table is using the new v2 manifest paths.

    Returns
    -------
    bool
        True if the table is using the new v2 manifest paths, False otherwise.
    """
    return await self._inner.uses_v2_manifest_paths()

migrate_manifest_paths_v2async

migrate_manifest_paths_v2()

Migrate the manifest paths to the new format.

This will update the manifest to use the new v2 format for paths.

This function is idempotent, and can be run multiple times without changing the state of the object store.

Danger

This should not be run while other concurrent operations are happening, and it should run to completion before other operations are resumed.

You can use AsyncTable.uses_v2_manifest_paths to check if the table is already using the new path style.

Source code inlancedb/table.py
async def migrate_manifest_paths_v2(self):
    """
    Migrate the manifest paths to the new format.

    This will update the manifest to use the new v2 format for paths.

    This function is idempotent, and can be run multiple times without
    changing the state of the object store.

    !!! danger

        This should not be run while other concurrent operations are happening.
        And it should also run until completion before resuming other operations.

    You can use
    [AsyncTable.uses_v2_manifest_paths][lancedb.table.AsyncTable.uses_v2_manifest_paths]
    to check if the table is already using the new path style.
    """
    await self._inner.migrate_manifest_paths_v2()
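
A minimal sketch of a guarded migration, assuming an open AsyncTable named table and that no other operations are running against it:

async def migrate_if_needed(table):
    # Only migrate if the table is still on the old path style.
    if not await table.uses_v2_manifest_paths():
        await table.migrate_manifest_paths_v2()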

replace_field_metadataasync

replace_field_metadata(field_name:str,new_metadata:dict[str,str])

Replace the metadata of a field in the schema

Parameters:

  • field_name (str) –

    The name of the field to replace the metadata for

  • new_metadata (dict[str,str]) –

    The new metadata to set

Source code inlancedb/table.py
async def replace_field_metadata(self, field_name: str, new_metadata: dict[str, str]):
    """
    Replace the metadata of a field in the schema

    Parameters
    ----------
    field_name: str
        The name of the field to replace the metadata for
    new_metadata: dict
        The new metadata to set
    """
    await self._inner.replace_field_metadata(field_name, new_metadata)

Indices (Asynchronous)

Indices can be created on a table to speed up queries. This section lists the indices that LanceDB supports.

lancedb.index.BTreedataclass

Describes a btree index configuration

A btree index is an index on scalar columns. The index stores a copy of the column in sorted order. A header entry is created for each block of rows (currently the block size is fixed at 4096). These header entries are stored in a separate cacheable structure (a btree). To search for data the header is used to determine which blocks need to be read from disk.

For example, a btree index in a table with 1Bi rows requires sizeof(Scalar) * 256Ki bytes of memory and will generally need to read sizeof(Scalar) * 4096 bytes to find the correct row ids.

This index is good for scalar columns with mostly distinct values and does best when the query is highly selective. It works with numeric, temporal, and string columns.

The btree index does not currently have any parameters though parameters such as the block size may be added in the future.

Source code inlancedb/index.py
@dataclassclassBTree:"""Describes a btree index configuration    A btree index is an index on scalar columns.  The index stores a copy of the    column in sorted order.  A header entry is created for each block of rows    (currently the block size is fixed at 4096).  These header entries are stored    in a separate cacheable structure (a btree).  To search for data the header is    used to determine which blocks need to be read from disk.    For example, a btree index in a table with 1Bi rows requires    sizeof(Scalar) * 256Ki bytes of memory and will generally need to read    sizeof(Scalar) * 4096 bytes to find the correct row ids.    This index is good for scalar columns with mostly distinct values and does best    when the query is highly selective. It works with numeric, temporal, and string    columns.    The btree index does not currently have any parameters though parameters such as    the block size may be added in the future.    """pass
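
A minimal sketch of creating a btree index with AsyncTable.create_index, assuming an open AsyncTable named table and a scalar column named "price" (both names are assumptions for the example):

from lancedb.index import BTree


async def create_btree_index(table):
    # Index a scalar column so highly selective filters like "price > 100" are fast.
    await table.create_index("price", config=BTree())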

lancedb.index.Bitmapdataclass

Describe a Bitmap index configuration.

A Bitmap index stores a bitmap for each distinct value in the column for every row.

This index works best for low-cardinality numeric or string columns, where the number of unique values is small (i.e., less than a few thousand). A Bitmap index can accelerate the following filters:

  • <,<=,=,>,>=
  • IN (value1, value2, ...)
  • between (value1, value2)
  • is null

For example, a bitmap index on a table with 1Bi rows and 128 distinct values requires 128 / 8 * 1Bi bytes on disk.

Source code inlancedb/index.py
@dataclassclassBitmap:"""Describe a Bitmap index configuration.    A `Bitmap` index stores a bitmap for each distinct value in the column for    every row.    This index works best for low-cardinality numeric or string columns,    where the number of unique values is small (i.e., less than a few thousands).    `Bitmap` index can accelerate the following filters:    - `<`, `<=`, `=`, `>`, `>=`    - `IN (value1, value2, ...)`    - `between (value1, value2)`    - `is null`    For example, a bitmap index with a table with 1Bi rows, and 128 distinct values,    requires 128 / 8 * 1Bi bytes on disk.    """pass
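
A minimal sketch of creating a bitmap index, assuming an open AsyncTable named table and a low-cardinality column named "category" (both names are assumptions for the example):

from lancedb.index import Bitmap


async def create_bitmap_index(table):
    # Suitable when the column has only a small number of distinct values.
    await table.create_index("category", config=Bitmap())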

lancedb.index.LabelListdataclass

Describe a LabelList index configuration.

LabelList is a scalar index that can be used on List<T> columns to support queries with array_contains_all and array_contains_any using an underlying bitmap index.

For example, it works with tags, categories, keywords, etc.

Source code inlancedb/index.py
@dataclass
class LabelList:
    """Describe a LabelList index configuration.

    `LabelList` is a scalar index that can be used on `List<T>` columns to
    support queries with `array_contains_all` and `array_contains_any`
    using an underlying bitmap index.

    For example, it works with `tags`, `categories`, `keywords`, etc.
    """

    pass
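
A minimal sketch of creating a LabelList index, assuming an open AsyncTable named table and a List<String> column named "tags" (both names are assumptions for the example):

from lancedb.index import LabelList


async def create_label_list_index(table):
    # Speeds up array_contains_all / array_contains_any filters on the list column.
    await table.create_index("tags", config=LabelList())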

lancedb.index.FTSdataclass

Describe a FTS index configuration.

FTS is a full-text search index that can be used on String columns.

For example, it works with title, description, content, etc.

Attributes:

  • with_position (bool, default False) –

    Whether to store the position of the token in the document. Setting this to False can reduce the size of the index and improve indexing speed, but it will disable support for phrase queries.

  • base_tokenizer (str, default "simple") –

    The base tokenizer to use for tokenization. Options are:

    - "simple": Splits text by whitespace and punctuation.
    - "whitespace": Split text by whitespace, but not punctuation.
    - "raw": No tokenization. The entire text is treated as a single token.

  • language (str, default "English") –

    The language to use for tokenization.

  • max_token_length (int, default 40) –

    The maximum token length to index. Tokens longer than this length will be ignored.

  • lower_case (bool, default True) –

    Whether to convert the token to lower case. This makes queries case-insensitive.

  • stem (bool, default True) –

    Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".

  • remove_stop_words (bool, default True) –

    Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".

  • ascii_folding (bool, default True) –

    Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".

Source code inlancedb/index.py
@dataclassclassFTS:"""Describe a FTS index configuration.    `FTS` is a full-text search index that can be used on `String` columns    For example, it works with `title`, `description`, `content`, etc.    Attributes    ----------    with_position : bool, default False        Whether to store the position of the token in the document. Setting this        to False can reduce the size of the index and improve indexing speed,        but it will disable support for phrase queries.    base_tokenizer : str, default "simple"        The base tokenizer to use for tokenization. Options are:        - "simple": Splits text by whitespace and punctuation.        - "whitespace": Split text by whitespace, but not punctuation.        - "raw": No tokenization. The entire text is treated as a single token.    language : str, default "English"        The language to use for tokenization.    max_token_length : int, default 40        The maximum token length to index. Tokens longer than this length will be        ignored.    lower_case : bool, default True        Whether to convert the token to lower case. This makes queries case-insensitive.    stem : bool, default True        Whether to stem the token. Stemming reduces words to their root form.        For example, in English "running" and "runs" would both be reduced to "run".    remove_stop_words : bool, default True        Whether to remove stop words. Stop words are common words that are often        removed from text before indexing. For example, in English "the" and "and".    ascii_folding : bool, default True        Whether to fold ASCII characters. This converts accented characters to        their ASCII equivalent. For example, "café" would be converted to "cafe".    """with_position:bool=Falsebase_tokenizer:Literal["simple","raw","whitespace"]="simple"language:str="English"max_token_length:Optional[int]=40lower_case:bool=Truestem:bool=Trueremove_stop_words:bool=Trueascii_folding:bool=Truengram_min_length:int=3ngram_max_length:int=3prefix_only:bool=False
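
A minimal sketch of creating a full-text search index, assuming an open AsyncTable named table and a string column named "content"; the parameter values are illustrative, not recommendations:

from lancedb.index import FTS


async def create_fts_index(table):
    await table.create_index(
        "content",
        config=FTS(
            with_position=False,     # smaller index, but no phrase queries
            remove_stop_words=True,  # drop common words like "the" and "and"
            ascii_folding=True,      # "café" matches "cafe"
        ),
    )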

lancedb.index.IvfPqdataclass

Describes an IVF PQ Index

This index stores a compressed (quantized) copy of every vector. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.

During a query the centroids are compared with the query vector to find the closest partitions. The compressed vectors in these partitions are then searched to find the closest vectors.

The compression scheme is called product quantization. Each vector is divided into subvectors and then each subvector is quantized into a small number of bits. The parameters num_bits and num_sub_vectors control this process, providing a tradeoff between index size (and thus search speed) and index accuracy.

The partitioning process is called IVF and the num_partitions parameter controls how many groups to create.

Note that training an IVF PQ index on a large dataset is a slow operation andcurrently is also a memory intensive operation.

Attributes:

  • distance_type (str, default "l2") –

    The distance metric used to train the index

    This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.

    The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    Note: the cosine distance is undefined when one (or both) of the vectors are all zeros (there is no direction). These vectors are invalid and may never be returned from a vector search.

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (int, default sqrt(num_rows)) –

    The number of IVF partitions to create.

    This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.

    If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.

  • num_sub_vectors (int, default is vector dimension / 16) –

    Number of sub-vectors of PQ.

    This value controls how much the vector is compressed during the quantization step. The more sub-vectors there are, the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.

    The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.

    If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.

  • num_bits (int, default 8) –

    Number of bits to encode each sub-vector.

    This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index, but the slower the search. The default is 8 bits. Only 4 and 8 are supported.

  • max_iterations (int, default 50) –

    Max iterations to train kmeans.

    When training an IVF PQ index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.

    The default value is 50.

  • sample_rate (int, default 256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF PQ index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

    The default value is 256.

Source code inlancedb/index.py
@dataclassclassIvfPq:"""Describes an IVF PQ Index    This index stores a compressed (quantized) copy of every vector.  These vectors    are grouped into partitions of similar vectors.  Each partition keeps track of    a centroid which is the average value of all vectors in the group.    During a query the centroids are compared with the query vector to find the    closest partitions.  The compressed vectors in these partitions are then    searched to find the closest vectors.    The compression scheme is called product quantization.  Each vector is divide    into subvectors and then each subvector is quantized into a small number of    bits.  the parameters `num_bits` and `num_subvectors` control this process,    providing a tradeoff between index size (and thus search speed) and index    accuracy.    The partitioning process is called IVF and the `num_partitions` parameter    controls how many groups to create.    Note that training an IVF PQ index on a large dataset is a slow operation and    currently is also a memory intensive operation.    Attributes    ----------    distance_type: str, default "l2"        The distance metric used to train the index        This is used when training the index to calculate the IVF partitions        (vectors are grouped in partitions with similar vectors according to this        distance type) and to calculate a subvector's code during quantization.        The distance type used to train an index MUST match the distance type used        to search the index.  Failure to do so will yield inaccurate results.        The following distance types are available:        "l2" - Euclidean distance. This is a very common distance metric that        accounts for both magnitude and direction when determining the distance        between vectors. l2 distance has a range of [0, ∞).        "cosine" - Cosine distance.  Cosine distance is a distance metric        calculated from the cosine similarity between two vectors. Cosine        similarity is a measure of similarity between two non-zero vectors of an        inner product space. It is defined to equal the cosine of the angle        between them.  Unlike l2, the cosine distance is not affected by the        magnitude of the vectors.  Cosine distance has a range of [0, 2].        Note: the cosine distance is undefined when one (or both) of the vectors        are all zeros (there is no direction).  These vectors are invalid and may        never be returned from a vector search.        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their        l2 norm is 1), then dot distance is equivalent to the cosine distance.    num_partitions: int, default sqrt(num_rows)        The number of IVF partitions to create.        This value should generally scale with the number of rows in the dataset.        By default the number of partitions is the square root of the number of        rows.        If this value is too large then the first part of the search (picking the        right partition) will be slow.  If this value is too small then the second        part of the search (searching within a partition) will be slow.    num_sub_vectors: int, default is vector dimension / 16        Number of sub-vectors of PQ.        This value controls how much the vector is compressed during the        quantization step.  The more sub vectors there are the less the vector is        compressed.  
The default is the dimension of the vector divided by 16.  If        the dimension is not evenly divisible by 16 we use the dimension divded by        8.        The above two cases are highly preferred.  Having 8 or 16 values per        subvector allows us to use efficient SIMD instructions.        If the dimension is not visible by 8 then we use 1 subvector.  This is not        ideal and will likely result in poor performance.    num_bits: int, default 8        Number of bits to encode each sub-vector.        This value controls how much the sub-vectors are compressed.  The more bits        the more accurate the index but the slower search.  The default is 8        bits.  Only 4 and 8 are supported.    max_iterations: int, default 50        Max iteration to train kmeans.        When training an IVF PQ index we use kmeans to calculate the partitions.        This parameter controls how many iterations of kmeans to run.        Increasing this might improve the quality of the index but in most cases        these extra iterations have diminishing returns.        The default value is 50.    sample_rate: int, default 256        The rate used to calculate the number of training vectors for kmeans.        When an IVF PQ index is trained, we need to calculate partitions.  These        are groups of vectors that are similar to each other.  To do this we use an        algorithm called kmeans.        Running kmeans on a large dataset can be slow.  To speed this up we run        kmeans on a random sample of the data.  This parameter controls the size of        the sample.  The total number of vectors used to train the index is        `sample_rate * num_partitions`.        Increasing this value might improve the quality of the index but in most        cases the default should be sufficient.        The default value is 256.    """distance_type:Literal["l2","cosine","dot"]="l2"num_partitions:Optional[int]=Nonenum_sub_vectors:Optional[int]=Nonenum_bits:int=8max_iterations:int=50sample_rate:int=256
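
A minimal sketch of creating an IVF PQ index, assuming an open AsyncTable named table with a 1536-dimensional vector column named "vector"; the parameter values are illustrative only:

from lancedb.index import IvfPq


async def create_ivf_pq_index(table):
    await table.create_index(
        "vector",
        config=IvfPq(
            distance_type="cosine",  # must match the distance type used at query time
            num_partitions=256,      # roughly sqrt(num_rows) for ~65k rows
            num_sub_vectors=96,      # 1536 / 16
        ),
    )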

lancedb.index.HnswPqdataclass

Describe a HNSW-PQ index configuration.

HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization. It is a variant of the HNSW algorithm that uses product quantization to compress the vectors. To create an HNSW-PQ index, you can specify the following parameters:

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot'], default:'l2') –

    The distance metric used to train the index.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (Optional[int], default:None) –

    The number of IVF partitions to create.

    For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.

  • num_sub_vectors (Optional[int], default:None) –

    Number of sub-vectors of PQ.

    This value controls how much the vector is compressed during the quantization step. The more sub-vectors there are, the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.

    The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.

    If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.

  • num_bits (int, default:8) –

    Number of bits to encode each sub-vector.

    This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index, but the slower the search. Only 4 and 8 are supported.

  • max_iterations (int, default:50) –

    Max iterations to train kmeans.

    When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyways.

  • sample_rate (int, default:256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

  • m (int, default:20) –

    The number of neighbors to select for each vector in the HNSW graph.

    This value controls the tradeoff between search speed and accuracy. The higher the value, the more accurate the search, but the slower it will be.

  • ef_construction (int, default:300) –

    The number of candidates to evaluate during the construction of the HNSW graph.

    This value controls the tradeoff between build speed and accuracy. The higher the value, the more accurate the build, but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.

Source code inlancedb/index.py
@dataclassclassHnswPq:"""Describe a HNSW-PQ index configuration.    HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization.    It is a variant of the HNSW algorithm that uses product quantization to compress    the vectors. To create an HNSW-PQ index, you can specify the following parameters:    Parameters    ----------    distance_type: str, default "l2"        The distance metric used to train the index.        The following distance types are available:        "l2" - Euclidean distance. This is a very common distance metric that        accounts for both magnitude and direction when determining the distance        between vectors. l2 distance has a range of [0, ∞).        "cosine" - Cosine distance.  Cosine distance is a distance metric        calculated from the cosine similarity between two vectors. Cosine        similarity is a measure of similarity between two non-zero vectors of an        inner product space. It is defined to equal the cosine of the angle        between them.  Unlike l2, the cosine distance is not affected by the        magnitude of the vectors.  Cosine distance has a range of [0, 2].        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their        l2 norm is 1), then dot distance is equivalent to the cosine distance.    num_partitions, default sqrt(num_rows)        The number of IVF partitions to create.        For HNSW, we recommend a small number of partitions. Setting this to 1 works        well for most tables. For very large tables, training just one HNSW graph        will require too much memory. Each partition becomes its own HNSW graph, so        setting this value higher reduces the peak memory use of training.    num_sub_vectors, default is vector dimension / 16        Number of sub-vectors of PQ.        This value controls how much the vector is compressed during the        quantization step. The more sub vectors there are the less the vector is        compressed.  The default is the dimension of the vector divided by 16.        If the dimension is not evenly divisible by 16 we use the dimension        divided by 8.        The above two cases are highly preferred.  Having 8 or 16 values per        subvector allows us to use efficient SIMD instructions.        If the dimension is not visible by 8 then we use 1 subvector.  This is not        ideal and will likely result in poor performance.     num_bits: int, default 8        Number of bits to encode each sub-vector.        This value controls how much the sub-vectors are compressed.  The more bits        the more accurate the index but the slower search. Only 4 and 8 are supported.    max_iterations, default 50        Max iterations to train kmeans.        When training an IVF index we use kmeans to calculate the partitions.  This        parameter controls how many iterations of kmeans to run.        Increasing this might improve the quality of the index but in most cases the        parameter is unused because kmeans will converge with fewer iterations.  The        parameter is only used in cases where kmeans does not appear to converge.  In        those cases it is unlikely that setting this larger will lead to the index        converging anyways.    sample_rate, default 256        The rate used to calculate the number of training vectors for kmeans.        When an IVF index is trained, we need to calculate partitions.  
These are        groups of vectors that are similar to each other.  To do this we use an        algorithm called kmeans.        Running kmeans on a large dataset can be slow.  To speed this up we        run kmeans on a random sample of the data.  This parameter controls the        size of the sample.  The total number of vectors used to train the index        is `sample_rate * num_partitions`.        Increasing this value might improve the quality of the index but in        most cases the default should be sufficient.    m, default 20        The number of neighbors to select for each vector in the HNSW graph.        This value controls the tradeoff between search speed and accuracy.        The higher the value the more accurate the search but the slower it will be.    ef_construction, default 300        The number of candidates to evaluate during the construction of the HNSW graph.        This value controls the tradeoff between build speed and accuracy.        The higher the value the more accurate the build but the slower it will be.        150 to 300 is the typical range. 100 is a minimum for good quality search        results. In most cases, there is no benefit to setting this higher than 500.        This value should be set to a value that is not less than `ef` in the        search phase.    """distance_type:Literal["l2","cosine","dot"]="l2"num_partitions:Optional[int]=Nonenum_sub_vectors:Optional[int]=Nonenum_bits:int=8max_iterations:int=50sample_rate:int=256m:int=20ef_construction:int=300
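
A minimal sketch of creating an HNSW-PQ index, assuming an open AsyncTable named table and a vector column named "vector"; the parameter values simply restate the documented defaults:

from lancedb.index import HnswPq


async def create_hnsw_pq_index(table):
    await table.create_index(
        "vector",
        config=HnswPq(
            distance_type="l2",
            m=20,                 # neighbors per node in the HNSW graph
            ef_construction=300,  # candidates evaluated while building the graph
        ),
    )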

lancedb.index.HnswSqdataclass

Describe a HNSW-SQ index configuration.

HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization. It is a variant of the HNSW algorithm that uses scalar quantization to compress the vectors.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot'], default:'l2') –

    The distance metric used to train the index.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (Optional[int], default:None) –

    The number of IVF partitions to create.

    For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.

  • max_iterations (int, default:50) –

    Max iterations to train kmeans.

    When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyways.

  • sample_rate (int, default:256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

  • m (int, default:20) –

    The number of neighbors to select for each vector in the HNSW graph.

    This value controls the tradeoff between search speed and accuracy. The higher the value, the more accurate the search, but the slower it will be.

  • ef_construction (int, default:300) –

    The number of candidates to evaluate during the construction of the HNSW graph.

    This value controls the tradeoff between build speed and accuracy. The higher the value, the more accurate the build, but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.

Source code inlancedb/index.py
@dataclassclassHnswSq:"""Describe a HNSW-SQ index configuration.    HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization.    It is a variant of the HNSW algorithm that uses scalar quantization to compress    the vectors.    Parameters    ----------    distance_type: str, default "l2"        The distance metric used to train the index.        The following distance types are available:        "l2" - Euclidean distance. This is a very common distance metric that        accounts for both magnitude and direction when determining the distance        between vectors. l2 distance has a range of [0, ∞).        "cosine" - Cosine distance.  Cosine distance is a distance metric        calculated from the cosine similarity between two vectors. Cosine        similarity is a measure of similarity between two non-zero vectors of an        inner product space. It is defined to equal the cosine of the angle        between them.  Unlike l2, the cosine distance is not affected by the        magnitude of the vectors.  Cosine distance has a range of [0, 2].        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their        l2 norm is 1), then dot distance is equivalent to the cosine distance.    num_partitions, default sqrt(num_rows)        The number of IVF partitions to create.        For HNSW, we recommend a small number of partitions. Setting this to 1 works        well for most tables. For very large tables, training just one HNSW graph        will require too much memory. Each partition becomes its own HNSW graph, so        setting this value higher reduces the peak memory use of training.    max_iterations, default 50        Max iterations to train kmeans.        When training an IVF index we use kmeans to calculate the partitions.        This parameter controls how many iterations of kmeans to run.        Increasing this might improve the quality of the index but in most cases        the parameter is unused because kmeans will converge with fewer iterations.        The parameter is only used in cases where kmeans does not appear to converge.        In those cases it is unlikely that setting this larger will lead to        the index converging anyways.    sample_rate, default 256        The rate used to calculate the number of training vectors for kmeans.        When an IVF index is trained, we need to calculate partitions.  These        are groups of vectors that are similar to each other.  To do this        we use an algorithm called kmeans.        Running kmeans on a large dataset can be slow.  To speed this up we        run kmeans on a random sample of the data.  This parameter controls the        size of the sample.  The total number of vectors used to train the index        is `sample_rate * num_partitions`.        Increasing this value might improve the quality of the index but in        most cases the default should be sufficient.    m, default 20        The number of neighbors to select for each vector in the HNSW graph.        This value controls the tradeoff between search speed and accuracy.        The higher the value the more accurate the search but the slower it will be.    ef_construction, default 300        The number of candidates to evaluate during the construction of the HNSW graph.        This value controls the tradeoff between build speed and accuracy.        The higher the value the more accurate the build but the slower it will be.        
150 to 300 is the typical range. 100 is a minimum for good quality search        results. In most cases, there is no benefit to setting this higher than 500.        This value should be set to a value that is not less than `ef` in the search        phase.    """distance_type:Literal["l2","cosine","dot"]="l2"num_partitions:Optional[int]=Nonemax_iterations:int=50sample_rate:int=256m:int=20ef_construction:int=300
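
A minimal sketch of creating an HNSW-SQ index, which uses scalar rather than product quantization; the table and column names are assumptions for the example:

from lancedb.index import HnswSq


async def create_hnsw_sq_index(table):
    await table.create_index("vector", config=HnswSq(distance_type="cosine"))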

lancedb.index.IvfFlatdataclass

Describes an IVF Flat Index

This index stores raw vectors. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.

Attributes:

  • distance_type (str, default "l2") –

    The distance metric used to train the index

    This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.

    The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    Note: the cosine distance is undefined when one (or both) of the vectors are all zeros (there is no direction). These vectors are invalid and may never be returned from a vector search.

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

    "hamming" - Hamming distance. Hamming distance is a distance metric calculated as the number of positions at which the corresponding bits are different. Hamming distance has a range of [0, vector dimension].

  • num_partitions (int, default sqrt(num_rows)) –

    The number of IVF partitions to create.

    This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.

    If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.

  • max_iterations (int, default 50) –

    Max iteration to train kmeans.

    When training an IVF PQ index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.

    The default value is 50.

  • sample_rate (int, default 256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF PQ index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

    The default value is 256.

Source code inlancedb/index.py
@dataclassclassIvfFlat:"""Describes an IVF Flat Index    This index stores raw vectors.    These vectors are grouped into partitions of similar vectors.    Each partition keeps track of a centroid which is    the average value of all vectors in the group.    Attributes    ----------    distance_type: str, default "l2"        The distance metric used to train the index        This is used when training the index to calculate the IVF partitions        (vectors are grouped in partitions with similar vectors according to this        distance type) and to calculate a subvector's code during quantization.        The distance type used to train an index MUST match the distance type used        to search the index.  Failure to do so will yield inaccurate results.        The following distance types are available:        "l2" - Euclidean distance. This is a very common distance metric that        accounts for both magnitude and direction when determining the distance        between vectors. l2 distance has a range of [0, ∞).        "cosine" - Cosine distance.  Cosine distance is a distance metric        calculated from the cosine similarity between two vectors. Cosine        similarity is a measure of similarity between two non-zero vectors of an        inner product space. It is defined to equal the cosine of the angle        between them.  Unlike l2, the cosine distance is not affected by the        magnitude of the vectors.  Cosine distance has a range of [0, 2].        Note: the cosine distance is undefined when one (or both) of the vectors        are all zeros (there is no direction).  These vectors are invalid and may        never be returned from a vector search.        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their        l2 norm is 1), then dot distance is equivalent to the cosine distance.        "hamming" - Hamming distance. Hamming distance is a distance metric        calculated as the number of positions at which the corresponding bits are        different. Hamming distance has a range of [0, vector dimension].    num_partitions: int, default sqrt(num_rows)        The number of IVF partitions to create.        This value should generally scale with the number of rows in the dataset.        By default the number of partitions is the square root of the number of        rows.        If this value is too large then the first part of the search (picking the        right partition) will be slow.  If this value is too small then the second        part of the search (searching within a partition) will be slow.    max_iterations: int, default 50        Max iteration to train kmeans.        When training an IVF PQ index we use kmeans to calculate the partitions.        This parameter controls how many iterations of kmeans to run.        Increasing this might improve the quality of the index but in most cases        these extra iterations have diminishing returns.        The default value is 50.    sample_rate: int, default 256        The rate used to calculate the number of training vectors for kmeans.        When an IVF PQ index is trained, we need to calculate partitions.  These        are groups of vectors that are similar to each other.  To do this we use an        algorithm called kmeans.        Running kmeans on a large dataset can be slow.  To speed this up we run        kmeans on a random sample of the data.  This parameter controls the size of        the sample.  
The total number of vectors used to train the index is        `sample_rate * num_partitions`.        Increasing this value might improve the quality of the index but in most        cases the default should be sufficient.        The default value is 256.    """distance_type:Literal["l2","cosine","dot","hamming"]="l2"num_partitions:Optional[int]=Nonemax_iterations:int=50sample_rate:int=256
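
A minimal sketch of creating an IVF_FLAT index, assuming an open AsyncTable named table and a binary vector column named "embedding" searched with hamming distance (both names are assumptions for the example):

from lancedb.index import IvfFlat


async def create_ivf_flat_index(table):
    # Stores the raw vectors; only the IVF partitioning step is approximate.
    await table.create_index("embedding", config=IvfFlat(distance_type="hamming"))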

Querying (Asynchronous)

Queries allow you to return data from your database. Basic queries can be created with the AsyncTable.query method to return the entire (typically filtered) table. Vector searches return the rows nearest to a query vector and can be created with the AsyncTable.vector_search method.

lancedb.query.AsyncQuery

Bases:AsyncQueryBase

Source code inlancedb/query.py
classAsyncQuery(AsyncQueryBase):def__init__(self,inner:LanceQuery):"""        Construct an AsyncQuery        This method is not intended to be called directly.  Instead, use the        [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.        """super().__init__(inner)self._inner=inner@classmethoddef_query_vec_to_array(self,vec:Union[VEC,Tuple]):ifisinstance(vec,list):returnpa.array(vec)ifisinstance(vec,np.ndarray):returnpa.array(vec)ifisinstance(vec,pa.Array):returnvecifisinstance(vec,pa.ChunkedArray):returnvec.combine_chunks()ifisinstance(vec,tuple):returnpa.array(vec)# We've checked everything we formally support in our typings# but, as a fallback, let pyarrow try and convert it anyway.# This can allow for some more exotic things like iterablesreturnpa.array(vec)defnearest_to(self,query_vector:Union[VEC,Tuple,List[VEC]],)->AsyncVectorQuery:"""        Find the nearest vectors to the given query vector.        This converts the query from a plain query to a vector query.        This method will attempt to convert the input to the query vector        expected by the embedding model.  If the input cannot be converted        then an error will be thrown.        By default, there is no embedding model, and the input should be        something that can be converted to a pyarrow array of floats.  This        includes lists, numpy arrays, and tuples.        If there is only one vector column (a column whose data type is a        fixed size list of floats) then the column does not need to be specified.        If there is more than one vector column you must use        [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify        which column you would like to compare with.        If no index has been created on the vector column then a vector query        will perform a distance comparison between the query vector and every        vector in the database and then sort the results.  This is sometimes        called a "flat search"        For small databases, with tens of thousands of vectors or less, this can        be reasonably fast.  In larger databases you should create a vector index        on the column.  If there is a vector index then an "approximate" nearest        neighbor search (frequently called an ANN search) will be performed.  This        search is much faster, but the results will be approximate.        The query can be further parameterized using the returned builder.  There        are various ANN search parameters that will let you fine tune your recall        accuracy vs search latency.        Vector searches always have a [limit][].  If `limit` has not been called then        a default `limit` of 10 will be used.        Typically, a single vector is passed in as the query. However, you can also        pass in multiple vectors. When multiple vectors are passed in, if the vector        column is with multivector type, then the vectors will be treated as a single        query. Or the vectors will be treated as multiple queries, this can be useful        if you want to find the nearest vectors to multiple query vectors.        This is not expected to be faster than making multiple queries concurrently;        it is just a convenience method. If multiple vectors are passed in then        an additional column `query_index` will be added to the results. This column        will contain the index of the query vector that the result is nearest to.        
"""ifquery_vectorisNone:raiseValueError("query_vector can not be None")if(isinstance(query_vector,(list,np.ndarray,pa.Array))andlen(query_vector)>0andisinstance(query_vector[0],(list,np.ndarray,pa.Array))):# multiple have been passedquery_vectors=[AsyncQuery._query_vec_to_array(v)forvinquery_vector]new_self=self._inner.nearest_to(query_vectors[0])forvinquery_vectors[1:]:new_self.add_query_vector(v)returnAsyncVectorQuery(new_self)else:returnAsyncVectorQuery(self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)))defnearest_to_text(self,query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncFTSQuery:"""        Find the documents that are most relevant to the given text query.        This method will perform a full text search on the table and return        the most relevant documents.  The relevance is determined by BM25.        The columns to search must be with native FTS index        (Tantivy-based can't work with this method).        By default, all indexed columns are searched,        now only one column can be searched at a time.        Parameters        ----------        query: str            The text query to search for.        columns: str or list of str, default None            The columns to search in. If None, all indexed columns are searched.            For now only one column can be searched at a time.        """ifisinstance(columns,str):columns=[columns]ifcolumnsisNone:columns=[]ifisinstance(query,str):returnAsyncFTSQuery(self._inner.nearest_to_text({"query":query,"columns":columns}))# FullTextQuery objectreturnAsyncFTSQuery(self._inner.nearest_to_text({"query":query}))

where

where(predicate:str)->Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code inlancedb/query.py
defwhere(self,predicate:str)->Self:"""    Only return rows matching the given predicate    The predicate should be supplied as an SQL query string.    Examples    --------    >>> predicate = "x > 10"    >>> predicate = "y > 0 AND y < 100"    >>> predicate = "x > 5 OR y = 'test'"    Filtering performance can often be improved by creating a scalar index    on the filter column(s).    """self._inner.where(predicate)returnself

select

select(columns:Union[List[str],dict[str,str]])->Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDB stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different from the order used when adding the data.
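As an illustration, a minimal sketch of both forms (plain projection and a dynamic column); the table name "select_demo" and its columns are hypothetical:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def select_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("select_demo", data=[{"a": 1, "b": 2}])
...     # plain projection: only column "a" is returned
...     rows = await table.query().select(["a"]).to_list()
...     # dynamic column: compute "a + b" and keep "b" as-is
...     rows = await table.query().select({"combined": "a + b", "b": "b"}).to_list()
>>> asyncio.run(select_example())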

Source code inlancedb/query.py
defselect(self,columns:Union[List[str],dict[str,str]])->Self:"""    Return only the specified columns.    By default a query will return all columns from the table.  However, this can    have a very significant impact on latency.  LanceDb stores data in a columnar    fashion.  This    means we can finely tune our I/O to select exactly the columns we need.    As a best practice you should always limit queries to the columns that you need.    If you pass in a list of column names then only those columns will be    returned.    You can also use this method to create new "dynamic" columns based on your    existing columns. For example, you may not care about "a" or "b" but instead    simply want "a + b".  This is often seen in the SELECT clause of an SQL query    (e.g. `SELECT a+b FROM my_table`).    To create dynamic columns you can pass in a dict[str, str].  A column will be    returned for each entry in the map.  The key provides the name of the column.    The value is an SQL string used to specify how the column is calculated.    For example, an SQL query might state `SELECT a + b AS combined, c`.  The    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.    Columns will always be returned in the order given, even if that order is    different than the order used when adding the data.    """ifisinstance(columns,list)andall(isinstance(c,str)forcincolumns):self._inner.select_columns(columns)elifisinstance(columns,dict)andall(isinstance(k,str)andisinstance(v,str)fork,vincolumns.items()):self._inner.select(list(columns.items()))else:raiseTypeError("columns must be a list of column names or a dict")returnself

limit

limit(limit:int)->Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code inlancedb/query.py
deflimit(self,limit:int)->Self:"""    Set the maximum number of results to return.    By default, a plain search has no limit.  If this method is not    called then every valid row from the table will be returned.    """self._inner.limit(limit)returnself

offset

offset(offset:int)->Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code inlancedb/query.py
defoffset(self,offset:int)->Self:"""    Set the offset for the results.    Parameters    ----------    offset: int        The offset to start fetching results from.    """self._inner.offset(offset)returnself

fast_search

fast_search()->Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code inlancedb/query.py
deffast_search(self)->Self:"""    Skip searching un-indexed data.    This can make queries faster, but will miss any data that has not been    indexed.    !!! tip        You can add new data into an existing index by calling        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].    """self._inner.fast_search()returnself

with_row_id

with_row_id()->Self

Include the _rowid column in the results.

Source code inlancedb/query.py
defwith_row_id(self)->Self:"""    Include the _rowid column in the results.    """self._inner.with_row_id()returnself

postfilter

postfilter()->Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
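A minimal sketch of post filtering on a vector query (the table, column names, and values below are hypothetical):

>>> import asyncio
>>> from lancedb import connect_async
>>> async def postfilter_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table(
...         "postfilter_demo",
...         data=[{"vector": [1.0, 2.0], "price": 10.0}, {"vector": [3.0, 4.0], "price": 50.0}])
...     # the filter is applied to the nearest results instead of before the search
...     rows = await (table.query()
...                   .nearest_to([1.0, 2.0])
...                   .where("price < 20")
...                   .postfilter()
...                   .to_list())
>>> asyncio.run(postfilter_example())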

Source code inlancedb/query.py
defpostfilter(self)->Self:"""    If this is called then filtering will happen after the search instead of    before.    By default filtering will be performed before the search.  This is how    filtering is typically understood to work.  This prefilter step does add some    additional latency.  Creating a scalar index on the filter column(s) can    often improve this latency.  However, sometimes a filter is too complex or    scalar indices cannot be applied to the column.  In these cases postfiltering    can be used instead of prefiltering to improve latency.    Post filtering applies the filter to the results of the search.  This    means we only run the filter on a much smaller set of data.  However, it can    cause the query to return fewer than `limit` results (or even no results) if    none of the nearest results match the filter.    Post filtering happens during the "refine stage" (described in more detail in    @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine    factor can often help restore some of the results lost by post filtering.    """self._inner.postfilter()returnself

to_batches async

to_batches(*,max_batch_length:Optional[int]=None,timeout:Optional[timedelta]=None)->AsyncRecordBatchReader

Execute the query and return the results as an Apache Arrow RecordBatchReader.
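For example, a small streaming sketch (hypothetical table) that consumes the results batch by batch rather than materializing them all at once:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def batches_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("batches_demo", data=[{"a": i} for i in range(100)])
...     total = 0
...     # each item is a pyarrow.RecordBatch with at most 32 rows
...     async for batch in await table.query().to_batches(max_batch_length=32):
...         total += batch.num_rows
>>> asyncio.run(batches_example())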

Parameters:

  • max_batch_length (Optional[int], default:None) –

    The maximum number of selected records in a single RecordBatch object. If not specified, a default batch length is used. It is possible for batches to be smaller than the provided length if the underlying data is stored in smaller chunks.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_batches(self,*,max_batch_length:Optional[int]=None,timeout:Optional[timedelta]=None,)->AsyncRecordBatchReader:"""    Execute the query and return the results as an Apache Arrow RecordBatchReader.    Parameters    ----------    max_batch_length: Optional[int]        The maximum number of selected records in a single RecordBatch object.        If not specified, a default batch length is used.        It is possible for batches to be smaller than the provided length if the        underlying data is stored in smaller chunks.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """returnAsyncRecordBatchReader(awaitself._inner.execute(max_batch_length,timeout))

to_arrow async

to_arrow(timeout:Optional[timedelta]=None)->Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_arrow(self,timeout:Optional[timedelta]=None)->pa.Table:"""    Execute the query and collect the results into an Apache Arrow Table.    This method will collect all results into memory before returning.  If    you expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches]    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """batch_iter=awaitself.to_batches(timeout=timeout)returnpa.Table.from_batches(awaitbatch_iter.read_all(),schema=batch_iter.schema)

to_list async

to_list(timeout:Optional[timedelta]=None)->List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_list(self,timeout:Optional[timedelta]=None)->List[dict]:"""    Execute the query and return the results as a list of dictionaries.    Each list entry is a dictionary with the selected column names as keys,    or all table columns if `select` is not called. The vector and the "_distance"    fields are returned whether or not they're explicitly selected.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(awaitself.to_arrow(timeout=timeout)).to_pylist()

to_pandas async

to_pandas(flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None)->'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int,bool]], default:None) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_pandas(self,flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None,)->"pd.DataFrame":"""    Execute the query and collect the results into a pandas DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    pandas separately.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = batch.to_pandas()    >>> asyncio.run(doctest_example())    Parameters    ----------    flatten: Optional[Union[int, bool]]        If flatten is True, flatten all nested columns.        If flatten is an integer, flatten the nested columns up to the        specified depth.        If unspecified, do not flatten the nested columns.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(flatten_columns(awaitself.to_arrow(timeout=timeout),flatten)).to_pandas()

to_polars async

to_polars(timeout:Optional[timedelta]=None)->'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
Source code inlancedb/query.py
asyncdefto_polars(self,timeout:Optional[timedelta]=None,)->"pl.DataFrame":"""    Execute the query and collect the results into a Polars DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    polars separately.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    Examples    --------    >>> import asyncio    >>> import polars as pl    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = pl.from_arrow(batch)    >>> asyncio.run(doctest_example())    """importpolarsasplreturnpl.from_arrow(awaitself.to_arrow(timeout=timeout))

explain_plan async

explain_plan(verbose:Optional[bool]=False)

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefexplain_plan(self,verbose:Optional[bool]=False):"""Return the execution plan for this query.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])    ...     query = [100, 100]    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)    ...     print(plan)    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]      GlobalLimitExec: skip=0, fetch=10        FilterExec: _distance@2 IS NOT NULL          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]            KNNVectorDistance: metric=l2              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false    Parameters    ----------    verbose : bool, default False        Use a verbose output format.    Returns    -------    plan : str    """# noqa: E501returnawaitself._inner.explain_plan(verbose)

analyze_plan async

analyze_plan()

Execute the query and display with runtime metrics.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefanalyze_plan(self):"""Execute the query and display with runtime metrics.    Returns    -------    plan : str    """returnawaitself._inner.analyze_plan()

__init__

__init__(inner:Query)

Construct an AsyncQuery

This method is not intended to be called directly. Instead, use the AsyncTable.query method to create a query.

Source code inlancedb/query.py
def__init__(self,inner:LanceQuery):"""    Construct an AsyncQuery    This method is not intended to be called directly.  Instead, use the    [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.    """super().__init__(inner)self._inner=inner

nearest_to

nearest_to(query_vector:Union[VEC,Tuple,List[VEC]])->AsyncVectorQuery

Find the nearest vectors to the given query vector.

This converts the query from a plain query to a vector query.

This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.

By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.

If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.

If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".

For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.

The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.

Vector searches always have a limit. If limit has not been called then a default limit of 10 will be used.

Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. When multiple vectors are passed in, if the vector column is of multivector type, then the vectors will be treated as a single query; otherwise the vectors will be treated as multiple queries, which can be useful if you want to find the nearest vectors to multiple query vectors. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in then an additional column query_index will be added to the results. This column will contain the index of the query vector that the result is nearest to.
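As a minimal sketch (the table name and the two-dimensional vectors are hypothetical), both the single-vector and multi-vector forms look like this:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def nearest_to_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table(
...         "vectors_demo",
...         data=[{"vector": [1.0, 2.0], "item": "a"}, {"vector": [5.0, 6.0], "item": "b"}])
...     # single query vector; the default limit of 10 applies
...     rows = await table.query().nearest_to([1.0, 2.1]).to_list()
...     # multiple query vectors; results gain a "query_index" column
...     rows = await table.query().nearest_to([[1.0, 2.1], [5.0, 6.1]]).to_list()
>>> asyncio.run(nearest_to_example())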

Source code inlancedb/query.py
defnearest_to(self,query_vector:Union[VEC,Tuple,List[VEC]],)->AsyncVectorQuery:"""    Find the nearest vectors to the given query vector.    This converts the query from a plain query to a vector query.    This method will attempt to convert the input to the query vector    expected by the embedding model.  If the input cannot be converted    then an error will be thrown.    By default, there is no embedding model, and the input should be    something that can be converted to a pyarrow array of floats.  This    includes lists, numpy arrays, and tuples.    If there is only one vector column (a column whose data type is a    fixed size list of floats) then the column does not need to be specified.    If there is more than one vector column you must use    [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify    which column you would like to compare with.    If no index has been created on the vector column then a vector query    will perform a distance comparison between the query vector and every    vector in the database and then sort the results.  This is sometimes    called a "flat search"    For small databases, with tens of thousands of vectors or less, this can    be reasonably fast.  In larger databases you should create a vector index    on the column.  If there is a vector index then an "approximate" nearest    neighbor search (frequently called an ANN search) will be performed.  This    search is much faster, but the results will be approximate.    The query can be further parameterized using the returned builder.  There    are various ANN search parameters that will let you fine tune your recall    accuracy vs search latency.    Vector searches always have a [limit][].  If `limit` has not been called then    a default `limit` of 10 will be used.    Typically, a single vector is passed in as the query. However, you can also    pass in multiple vectors. When multiple vectors are passed in, if the vector    column is with multivector type, then the vectors will be treated as a single    query. Or the vectors will be treated as multiple queries, this can be useful    if you want to find the nearest vectors to multiple query vectors.    This is not expected to be faster than making multiple queries concurrently;    it is just a convenience method. If multiple vectors are passed in then    an additional column `query_index` will be added to the results. This column    will contain the index of the query vector that the result is nearest to.    """ifquery_vectorisNone:raiseValueError("query_vector can not be None")if(isinstance(query_vector,(list,np.ndarray,pa.Array))andlen(query_vector)>0andisinstance(query_vector[0],(list,np.ndarray,pa.Array))):# multiple have been passedquery_vectors=[AsyncQuery._query_vec_to_array(v)forvinquery_vector]new_self=self._inner.nearest_to(query_vectors[0])forvinquery_vectors[1:]:new_self.add_query_vector(v)returnAsyncVectorQuery(new_self)else:returnAsyncVectorQuery(self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)))

nearest_to_text

nearest_to_text(query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncFTSQuery

Find the documents that are most relevant to the given text query.

This method will perform a full text search on the table and return the most relevant documents. The relevance is determined by BM25.

The columns to search must have a native FTS index (the Tantivy-based index does not work with this method).

By default, all indexed columns are searched; currently only one column can be searched at a time.
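A minimal sketch, assuming a hypothetical table named "documents" that already has a native FTS index on its "text" column (index creation is not shown):

>>> import asyncio
>>> from lancedb import connect_async
>>> async def fts_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("documents")  # hypothetical table with an FTS index on "text"
...     rows = await (table.query()
...                   .nearest_to_text("puppy", columns="text")
...                   .limit(5)
...                   .to_list())
>>> asyncio.run(fts_example())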

Parameters:

  • query (str |FullTextQuery) –

    The text query to search for.

  • columns (Union[str,List[str], None], default:None) –

    The columns to search in. If None, all indexed columns are searched. For now only one column can be searched at a time.

Source code inlancedb/query.py
defnearest_to_text(self,query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncFTSQuery:"""    Find the documents that are most relevant to the given text query.    This method will perform a full text search on the table and return    the most relevant documents.  The relevance is determined by BM25.    The columns to search must be with native FTS index    (Tantivy-based can't work with this method).    By default, all indexed columns are searched,    now only one column can be searched at a time.    Parameters    ----------    query: str        The text query to search for.    columns: str or list of str, default None        The columns to search in. If None, all indexed columns are searched.        For now only one column can be searched at a time.    """ifisinstance(columns,str):columns=[columns]ifcolumnsisNone:columns=[]ifisinstance(query,str):returnAsyncFTSQuery(self._inner.nearest_to_text({"query":query,"columns":columns}))# FullTextQuery objectreturnAsyncFTSQuery(self._inner.nearest_to_text({"query":query}))

lancedb.query.AsyncVectorQuery

Bases: AsyncQueryBase, AsyncVectorQueryBase

Source code inlancedb/query.py
classAsyncVectorQuery(AsyncQueryBase,AsyncVectorQueryBase):def__init__(self,inner:LanceVectorQuery):"""        Construct an AsyncVectorQuery        This method is not intended to be called directly.  Instead, create        a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then        use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to]] to convert to        a vector query.  Or you can use        [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]        """super().__init__(inner)self._inner=innerself._reranker=Noneself._query_string=Nonedefrerank(self,reranker:Reranker=RRFReranker(),query_string:Optional[str]=None)->AsyncHybridQuery:ifrerankerandnotisinstance(reranker,Reranker):raiseValueError("reranker must be an instance of Reranker class.")self._reranker=rerankerifnotself._query_stringandnotquery_string:raiseValueError("query_string must be provided to rerank the results.")self._query_string=query_stringreturnselfdefnearest_to_text(self,query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncHybridQuery:"""        Find the documents that are most relevant to the given text query,        in addition to vector search.        This converts the vector query into a hybrid query.        This search will perform a full text search on the table and return        the most relevant documents, combined with the vector query results.        The text relevance is determined by BM25.        The columns to search must be with native FTS index        (Tantivy-based can't work with this method).        By default, all indexed columns are searched,        now only one column can be searched at a time.        Parameters        ----------        query: str            The text query to search for.        columns: str or list of str, default None            The columns to search in. If None, all indexed columns are searched.            For now only one column can be searched at a time.        """ifisinstance(columns,str):columns=[columns]ifcolumnsisNone:columns=[]ifisinstance(query,str):returnAsyncHybridQuery(self._inner.nearest_to_text({"query":query,"columns":columns}))# FullTextQuery objectreturnAsyncHybridQuery(self._inner.nearest_to_text({"query":query}))asyncdefto_batches(self,*,max_batch_length:Optional[int]=None,timeout:Optional[timedelta]=None,)->AsyncRecordBatchReader:reader=awaitsuper().to_batches(timeout=timeout)results=pa.Table.from_batches(awaitreader.read_all(),reader.schema)ifself._reranker:results=self._reranker.rerank_vector(self._query_string,results)returnAsyncRecordBatchReader(results,max_batch_length=max_batch_length)

column

column(column:str)->Self

Set the vector column to query

This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.

This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.

Source code inlancedb/query.py
defcolumn(self,column:str)->Self:"""    Set the vector column to query    This controls which column is compared to the query vector supplied in    the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].    This parameter must be specified if the table has more than one column    whose data type is a fixed-size-list of floats.    """self._inner.column(column)returnself

nprobes

nprobes(nprobes:int)->Self

Set the number of partitions to search (probe)

This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.

The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.

The partition whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.
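For example (an illustrative sketch; assumes the hypothetical table "vectors_demo" has an IVF-based index on its vector column), raising nprobes trades latency for recall:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def nprobes_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("vectors_demo")
...     rows = await (table.query()
...                   .nearest_to([1.0, 2.0])
...                   .nprobes(50)  # search 50 partitions instead of the default 20
...                   .to_list())
>>> asyncio.run(nprobes_example())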

Source code inlancedb/query.py
defnprobes(self,nprobes:int)->Self:"""    Set the number of partitions to search (probe)    This argument is only used when the vector column has an IVF-based index.    If there is no index then this value is ignored.    The IVF stage of IVF PQ divides the input into partitions (clusters) of    related values.    The partition whose centroids are closest to the query vector will be    exhaustiely searched to find matches.  This parameter controls how many    partitions should be searched.    Increasing this value will increase the recall of your query but will    also increase the latency of your query.  The default value is 20.  This    default is good for many cases but the best value to use will depend on    your data and the recall that you need to achieve.    For best results we recommend tuning this parameter with a benchmark against    your actual data to find the smallest possible value that will still give    you the desired recall.    """self._inner.nprobes(nprobes)returnself

minimum_nprobes

minimum_nprobes(minimum_nprobes:int)->Self

Set the minimum number of probes to use.

See nprobes for more details.

These partitions will be searched on every indexed vector query and will increase recall at the expense of latency.

Source code inlancedb/query.py
defminimum_nprobes(self,minimum_nprobes:int)->Self:"""Set the minimum number of probes to use.    See `nprobes` for more details.    These partitions will be searched on every indexed vector query and will    increase recall at the expense of latency.    """self._inner.minimum_nprobes(minimum_nprobes)returnself

maximum_nprobes

maximum_nprobes(maximum_nprobes:int)->Self

Set the maximum number of probes to use.

See nprobes for more details.

If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.

This can be useful when there is a narrow filter to allow these queries to spend more time searching and avoid potential false negatives.

If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.
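A sketch of combining the two bounds (the table, filter, and values are illustrative): always probe at least 20 partitions, but allow up to 100 when a narrow filter leaves too few matches:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def nprobes_bounds_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("vectors_demo")
...     rows = await (table.query()
...                   .nearest_to([1.0, 2.0])
...                   .where("category = 'rare'")
...                   .minimum_nprobes(20)
...                   .maximum_nprobes(100)
...                   .to_list())
>>> asyncio.run(nprobes_bounds_example())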

Source code inlancedb/query.py
defmaximum_nprobes(self,maximum_nprobes:int)->Self:"""Set the maximum number of probes to use.    See `nprobes` for more details.    If this value is greater than `minimum_nprobes` then the excess partitions    will be searched only if we have not found enough results.    This can be useful when there is a narrow filter to allow these queries to    spend more time searching and avoid potential false negatives.    If this value is 0 then no limit will be applied and all partitions could be    searched if needed to satisfy the limit.    """self._inner.maximum_nprobes(maximum_nprobes)returnself

distance_range

distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->Self

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default:None) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default:None) –

    The upper bound of the distance range.

Returns:

Source code inlancedb/query.py
defdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->Self:"""Set the distance range to use.    Only rows with distances within range [lower_bound, upper_bound)    will be returned.    Parameters    ----------    lower_bound: Optional[float]        The lower bound of the distance range.    upper_bound: Optional[float]        The upper bound of the distance range.    Returns    -------    AsyncVectorQuery        The AsyncVectorQuery object.    """self._inner.distance_range(lower_bound,upper_bound)returnself

ef

ef(ef:int)->Self

Set the number of candidates to consider during search

This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

Source code inlancedb/query.py
defef(self,ef:int)->Self:"""    Set the number of candidates to consider during search    This argument is only used when the vector column has an HNSW index.    If there is no index then this value is ignored.    Increasing this value will increase the recall of your query but will also    increase the latency of your query.  The default value is 1.5 * limit.  This    default is good for many cases but the best value to use will depend on your    data and the recall that you need to achieve.    """self._inner.ef(ef)returnself

refine_factor

refine_factor(refine_factor:int)->Self

A multiplier to control how many additional rows are taken during the refine step.

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.

This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.

To refine results LanceDB will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDB then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.

Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDB still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.

Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different from the true distance between the query vector and the actual uncompressed vector.
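As a brief sketch (illustrative values; assumes the hypothetical table has an IVF PQ index): with the default limit of 10, a refine_factor of 3 re-ranks 30 candidates using their uncompressed vectors:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def refine_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("vectors_demo")
...     rows = await (table.query()
...                   .nearest_to([1.0, 2.0])
...                   .refine_factor(3)
...                   .to_list())
>>> asyncio.run(refine_example())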

Source code inlancedb/query.py
defrefine_factor(self,refine_factor:int)->Self:"""    A multiplier to control how many additional rows are taken during the refine    step    This argument is only used when the vector column has an IVF PQ index.    If there is no index then this value is ignored.    An IVF PQ index stores compressed (quantized) values.  They query vector is    compared against these values and, since they are compressed, the comparison is    inaccurate.    This parameter can be used to refine the results.  It can improve both improve    recall and correct the ordering of the nearest results.    To refine results LanceDb will first perform an ANN search to find the nearest    `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and    `limit` is the default (10) then the first 30 results will be selected.  LanceDb    then fetches the full, uncompressed, values for these 30 results.  The results    are then reordered by the true distance and only the nearest 10 are kept.    Note: there is a difference between calling this method with a value of 1 and    never calling this method at all.  Calling this method with any value will have    an impact on your search latency.  When you call this method with a    `refine_factor` of 1 then LanceDb still needs to fetch the full, uncompressed,    values so that it can potentially reorder the results.    Note: if this method is NOT called then the distances returned in the _distance    column will be approximate distances based on the comparison of the quantized    query vector and the quantized result vectors.  This can be considerably    different than the true distance between the query vector and the actual    uncompressed vector.    """self._inner.refine_factor(refine_factor)returnself

distance_type

distance_type(distance_type:str)->Self

Set the distance metric to use

When performing a vector search we try and find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See IvfPqOptions.distanceType for more details on the different distance metrics available.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

By default "l2" is used.
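For example (sketch only; if the column is indexed, the index must have been trained with the same metric):

>>> import asyncio
>>> from lancedb import connect_async
>>> async def distance_type_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("vectors_demo")
...     rows = await (table.query()
...                   .nearest_to([1.0, 2.0])
...                   .distance_type("cosine")
...                   .to_list())
>>> asyncio.run(distance_type_example())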

Source code inlancedb/query.py
defdistance_type(self,distance_type:str)->Self:"""    Set the distance metric to use    When performing a vector search we try and find the "nearest" vectors according    to some kind of distance metric.  This parameter controls which distance metric    to use.  See @see {@link IvfPqOptions.distanceType} for more details on the    different distance metrics available.    Note: if there is a vector index then the distance type used MUST match the    distance type used to train the vector index.  If this is not done then the    results will be invalid.    By default "l2" is used.    """self._inner.distance_type(distance_type)returnself

bypass_vector_index

bypass_vector_index()->Self

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
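A sketch of the recall measurement idea described above (hypothetical table and query; the helper logic is illustrative):

>>> import asyncio
>>> from lancedb import connect_async
>>> async def recall_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("vectors_demo")  # assumed to have a vector index
...     query = [1.0, 2.0]
...     ann = await table.query().nearest_to(query).with_row_id().to_list()
...     exact = await (table.query()
...                    .nearest_to(query)
...                    .bypass_vector_index()
...                    .with_row_id()
...                    .to_list())
...     ann_ids = {row["_rowid"] for row in ann}
...     exact_ids = {row["_rowid"] for row in exact}
...     recall = len(ann_ids & exact_ids) / len(exact_ids)
>>> asyncio.run(recall_example())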

Source code inlancedb/query.py
defbypass_vector_index(self)->Self:"""    If this is called then any vector index is skipped    An exhaustive (flat) search will be performed.  The query vector will    be compared to every vector in the table.  At high scales this can be    expensive.  However, this is often still useful.  For example, skipping    the vector index can give you ground truth results which you can use to    calculate your recall to select an appropriate value for nprobes.    """self._inner.bypass_vector_index()returnself

where

where(predicate:str)->Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code inlancedb/query.py
defwhere(self,predicate:str)->Self:"""    Only return rows matching the given predicate    The predicate should be supplied as an SQL query string.    Examples    --------    >>> predicate = "x > 10"    >>> predicate = "y > 0 AND y < 100"    >>> predicate = "x > 5 OR y = 'test'"    Filtering performance can often be improved by creating a scalar index    on the filter column(s).    """self._inner.where(predicate)returnself

select

select(columns:Union[List[str],dict[str,str]])->Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDB stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different from the order used when adding the data.

Source code inlancedb/query.py
defselect(self,columns:Union[List[str],dict[str,str]])->Self:"""    Return only the specified columns.    By default a query will return all columns from the table.  However, this can    have a very significant impact on latency.  LanceDb stores data in a columnar    fashion.  This    means we can finely tune our I/O to select exactly the columns we need.    As a best practice you should always limit queries to the columns that you need.    If you pass in a list of column names then only those columns will be    returned.    You can also use this method to create new "dynamic" columns based on your    existing columns. For example, you may not care about "a" or "b" but instead    simply want "a + b".  This is often seen in the SELECT clause of an SQL query    (e.g. `SELECT a+b FROM my_table`).    To create dynamic columns you can pass in a dict[str, str].  A column will be    returned for each entry in the map.  The key provides the name of the column.    The value is an SQL string used to specify how the column is calculated.    For example, an SQL query might state `SELECT a + b AS combined, c`.  The    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.    Columns will always be returned in the order given, even if that order is    different than the order used when adding the data.    """ifisinstance(columns,list)andall(isinstance(c,str)forcincolumns):self._inner.select_columns(columns)elifisinstance(columns,dict)andall(isinstance(k,str)andisinstance(v,str)fork,vincolumns.items()):self._inner.select(list(columns.items()))else:raiseTypeError("columns must be a list of column names or a dict")returnself

limit

limit(limit:int)->Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code inlancedb/query.py
deflimit(self,limit:int)->Self:"""    Set the maximum number of results to return.    By default, a plain search has no limit.  If this method is not    called then every valid row from the table will be returned.    """self._inner.limit(limit)returnself

offset

offset(offset:int)->Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code inlancedb/query.py
defoffset(self,offset:int)->Self:"""    Set the offset for the results.    Parameters    ----------    offset: int        The offset to start fetching results from.    """self._inner.offset(offset)returnself

fast_search

fast_search()->Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code inlancedb/query.py
deffast_search(self)->Self:"""    Skip searching un-indexed data.    This can make queries faster, but will miss any data that has not been    indexed.    !!! tip        You can add new data into an existing index by calling        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].    """self._inner.fast_search()returnself

with_row_id

with_row_id()->Self

Include the _rowid column in the results.

Source code inlancedb/query.py
defwith_row_id(self)->Self:"""    Include the _rowid column in the results.    """self._inner.with_row_id()returnself

postfilter

postfilter()->Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.

Source code inlancedb/query.py
defpostfilter(self)->Self:"""    If this is called then filtering will happen after the search instead of    before.    By default filtering will be performed before the search.  This is how    filtering is typically understood to work.  This prefilter step does add some    additional latency.  Creating a scalar index on the filter column(s) can    often improve this latency.  However, sometimes a filter is too complex or    scalar indices cannot be applied to the column.  In these cases postfiltering    can be used instead of prefiltering to improve latency.    Post filtering applies the filter to the results of the search.  This    means we only run the filter on a much smaller set of data.  However, it can    cause the query to return fewer than `limit` results (or even no results) if    none of the nearest results match the filter.    Post filtering happens during the "refine stage" (described in more detail in    @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine    factor can often help restore some of the results lost by post filtering.    """self._inner.postfilter()returnself

to_arrow async

to_arrow(timeout:Optional[timedelta]=None)->Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_arrow(self,timeout:Optional[timedelta]=None)->pa.Table:"""    Execute the query and collect the results into an Apache Arrow Table.    This method will collect all results into memory before returning.  If    you expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches]    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """batch_iter=awaitself.to_batches(timeout=timeout)returnpa.Table.from_batches(awaitbatch_iter.read_all(),schema=batch_iter.schema)

to_list async

to_list(timeout:Optional[timedelta]=None)->List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_list(self,timeout:Optional[timedelta]=None)->List[dict]:"""    Execute the query and return the results as a list of dictionaries.    Each list entry is a dictionary with the selected column names as keys,    or all table columns if `select` is not called. The vector and the "_distance"    fields are returned whether or not they're explicitly selected.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(awaitself.to_arrow(timeout=timeout)).to_pylist()

to_pandas async

to_pandas(flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None)->'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int,bool]], default:None) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_pandas(self,flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None,)->"pd.DataFrame":"""    Execute the query and collect the results into a pandas DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    pandas separately.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = batch.to_pandas()    >>> asyncio.run(doctest_example())    Parameters    ----------    flatten: Optional[Union[int, bool]]        If flatten is True, flatten all nested columns.        If flatten is an integer, flatten the nested columns up to the        specified depth.        If unspecified, do not flatten the nested columns.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(flatten_columns(awaitself.to_arrow(timeout=timeout),flatten)).to_pandas()

to_polars async

to_polars(timeout:Optional[timedelta]=None)->'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
Source code inlancedb/query.py
asyncdefto_polars(self,timeout:Optional[timedelta]=None,)->"pl.DataFrame":"""    Execute the query and collect the results into a Polars DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    polars separately.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    Examples    --------    >>> import asyncio    >>> import polars as pl    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = pl.from_arrow(batch)    >>> asyncio.run(doctest_example())    """importpolarsasplreturnpl.from_arrow(awaitself.to_arrow(timeout=timeout))

explain_plan async

explain_plan(verbose:Optional[bool]=False)

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefexplain_plan(self,verbose:Optional[bool]=False):"""Return the execution plan for this query.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])    ...     query = [100, 100]    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)    ...     print(plan)    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]      GlobalLimitExec: skip=0, fetch=10        FilterExec: _distance@2 IS NOT NULL          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]            KNNVectorDistance: metric=l2              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false    Parameters    ----------    verbose : bool, default False        Use a verbose output format.    Returns    -------    plan : str    """# noqa: E501returnawaitself._inner.explain_plan(verbose)

analyze_planasync

analyze_plan()

Execute the query and display with runtime metrics.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefanalyze_plan(self):"""Execute the query and display with runtime metrics.    Returns    -------    plan : str    """returnawaitself._inner.analyze_plan()

__init__

__init__(inner:VectorQuery)

Construct an AsyncVectorQuery

This method is not intended to be called directly. Instead, create a query first with AsyncTable.query and then use AsyncQuery.nearest_to to convert it to a vector query. Or you can use AsyncTable.vector_search.

Source code inlancedb/query.py
def__init__(self,inner:LanceVectorQuery):"""    Construct an AsyncVectorQuery    This method is not intended to be called directly.  Instead, create    a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then    use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to]] to convert to    a vector query.  Or you can use    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]    """super().__init__(inner)self._inner=innerself._reranker=Noneself._query_string=None

nearest_to_text

nearest_to_text(query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncHybridQuery

Find the documents that are most relevant to the given text query, in addition to the vector search.

This converts the vector query into a hybrid query.

This search will perform a full text search on the table and return the most relevant documents, combined with the vector query results. The text relevance is determined by BM25.

The columns to search must have a native FTS index (the Tantivy-based index does not work with this method).

By default, all indexed columns are searched; currently only one column can be searched at a time.

Parameters:

  • query (str |FullTextQuery) –

    The text query to search for.

  • columns (Union[str,List[str], None], default:None) –

The columns to search in. If None, all indexed columns are searched. For now, only one column can be searched at a time.

Source code inlancedb/query.py
defnearest_to_text(self,query:str|FullTextQuery,columns:Union[str,List[str],None]=None)->AsyncHybridQuery:"""    Find the documents that are most relevant to the given text query,    in addition to vector search.    This converts the vector query into a hybrid query.    This search will perform a full text search on the table and return    the most relevant documents, combined with the vector query results.    The text relevance is determined by BM25.    The columns to search must be with native FTS index    (Tantivy-based can't work with this method).    By default, all indexed columns are searched,    now only one column can be searched at a time.    Parameters    ----------    query: str        The text query to search for.    columns: str or list of str, default None        The columns to search in. If None, all indexed columns are searched.        For now only one column can be searched at a time.    """ifisinstance(columns,str):columns=[columns]ifcolumnsisNone:columns=[]ifisinstance(query,str):returnAsyncHybridQuery(self._inner.nearest_to_text({"query":query,"columns":columns}))# FullTextQuery objectreturnAsyncHybridQuery(self._inner.nearest_to_text({"query":query}))
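As a sketch of how this fits together (the table name "docs", the query vector, and the indexed "text" column are assumptions for illustration), a vector query can be upgraded to a hybrid query like this:

import asyncio
from lancedb import connect_async

async def hybrid_from_vector_query():
    db = await connect_async("./.lancedb")
    table = await db.open_table("docs")  # assumed existing table with an FTS index on "text"
    results = await (
        table.query()
        .nearest_to([0.1, 0.2])                         # the vector half of the hybrid query
        .nearest_to_text("tax policy", columns="text")  # adds the full text half
        .limit(5)
        .to_pandas()
    )
    print(results)

asyncio.run(hybrid_from_vector_query())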

lancedb.query.AsyncFTSQuery

Bases:AsyncQueryBase

A query for full text search for LanceDB.

Source code inlancedb/query.py
classAsyncFTSQuery(AsyncQueryBase):"""A query for full text search for LanceDB."""def__init__(self,inner:LanceFTSQuery):super().__init__(inner)self._inner=innerself._reranker=Nonedefget_query(self)->str:returnself._inner.get_query()defrerank(self,reranker:Reranker=RRFReranker(),)->AsyncFTSQuery:ifrerankerandnotisinstance(reranker,Reranker):raiseValueError("reranker must be an instance of Reranker class.")self._reranker=rerankerreturnselfdefnearest_to(self,query_vector:Union[VEC,Tuple,List[VEC]],)->AsyncHybridQuery:"""        In addition doing text search on the LanceDB Table, also        find the nearest vectors to the given query vector.        This converts the query from a FTS Query to a Hybrid query. Results        from the vector search will be combined with results from the FTS query.        This method will attempt to convert the input to the query vector        expected by the embedding model.  If the input cannot be converted        then an error will be thrown.        By default, there is no embedding model, and the input should be        something that can be converted to a pyarrow array of floats.  This        includes lists, numpy arrays, and tuples.        If there is only one vector column (a column whose data type is a        fixed size list of floats) then the column does not need to be specified.        If there is more than one vector column you must use        [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify        which column you would like to compare with.        If no index has been created on the vector column then a vector query        will perform a distance comparison between the query vector and every        vector in the database and then sort the results.  This is sometimes        called a "flat search"        For small databases, with tens of thousands of vectors or less, this can        be reasonably fast.  In larger databases you should create a vector index        on the column.  If there is a vector index then an "approximate" nearest        neighbor search (frequently called an ANN search) will be performed.  This        search is much faster, but the results will be approximate.        The query can be further parameterized using the returned builder.  There        are various ANN search parameters that will let you fine tune your recall        accuracy vs search latency.        Hybrid searches always have a [limit][].  If `limit` has not been called then        a default `limit` of 10 will be used.        Typically, a single vector is passed in as the query. However, you can also        pass in multiple vectors.  This can be useful if you want to find the nearest        vectors to multiple query vectors. This is not expected to be faster than        making multiple queries concurrently; it is just a convenience method.        If multiple vectors are passed in then an additional column `query_index`        will be added to the results.  This column will contain the index of the        query vector that the result is nearest to.        
"""ifquery_vectorisNone:raiseValueError("query_vector can not be None")if(isinstance(query_vector,list)andlen(query_vector)>0andnotisinstance(query_vector[0],(float,int))):# multiple have been passedquery_vectors=[AsyncQuery._query_vec_to_array(v)forvinquery_vector]new_self=self._inner.nearest_to(query_vectors[0])forvinquery_vectors[1:]:new_self.add_query_vector(v)returnAsyncHybridQuery(new_self)else:returnAsyncHybridQuery(self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)))asyncdefto_batches(self,*,max_batch_length:Optional[int]=None,timeout:Optional[timedelta]=None,)->AsyncRecordBatchReader:reader=awaitsuper().to_batches(timeout=timeout)results=pa.Table.from_batches(awaitreader.read_all(),reader.schema)ifself._reranker:results=self._reranker.rerank_fts(self.get_query(),results)returnAsyncRecordBatchReader(results,max_batch_length=max_batch_length)

where

where(predicate:str)->Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code inlancedb/query.py
defwhere(self,predicate:str)->Self:"""    Only return rows matching the given predicate    The predicate should be supplied as an SQL query string.    Examples    --------    >>> predicate = "x > 10"    >>> predicate = "y > 0 AND y < 100"    >>> predicate = "x > 5 OR y = 'test'"    Filtering performance can often be improved by creating a scalar index    on the filter column(s).    """self._inner.where(predicate)returnself

select

select(columns:Union[List[str],dict[str,str]])->Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.

Source code inlancedb/query.py
defselect(self,columns:Union[List[str],dict[str,str]])->Self:"""    Return only the specified columns.    By default a query will return all columns from the table.  However, this can    have a very significant impact on latency.  LanceDb stores data in a columnar    fashion.  This    means we can finely tune our I/O to select exactly the columns we need.    As a best practice you should always limit queries to the columns that you need.    If you pass in a list of column names then only those columns will be    returned.    You can also use this method to create new "dynamic" columns based on your    existing columns. For example, you may not care about "a" or "b" but instead    simply want "a + b".  This is often seen in the SELECT clause of an SQL query    (e.g. `SELECT a+b FROM my_table`).    To create dynamic columns you can pass in a dict[str, str].  A column will be    returned for each entry in the map.  The key provides the name of the column.    The value is an SQL string used to specify how the column is calculated.    For example, an SQL query might state `SELECT a + b AS combined, c`.  The    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.    Columns will always be returned in the order given, even if that order is    different than the order used when adding the data.    """ifisinstance(columns,list)andall(isinstance(c,str)forcincolumns):self._inner.select_columns(columns)elifisinstance(columns,dict)andall(isinstance(k,str)andisinstance(v,str)fork,vincolumns.items()):self._inner.select(list(columns.items()))else:raiseTypeError("columns must be a list of column names or a dict")returnself
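For instance, a minimal sketch of mixing a dynamic column with a plain column (the column names a, b and c and the open table are placeholders; this runs inside an async function):

# `table` is an assumed open AsyncTable with numeric columns "a", "b" and "c"
df = await (
    table.query()
    .select({"combined": "a + b", "c": "c"})  # equivalent to SELECT a + b AS combined, c
    .limit(100)
    .to_pandas()
)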

limit

limit(limit:int)->Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code inlancedb/query.py
deflimit(self,limit:int)->Self:"""    Set the maximum number of results to return.    By default, a plain search has no limit.  If this method is not    called then every valid row from the table will be returned.    """self._inner.limit(limit)returnself

offset

offset(offset:int)->Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code inlancedb/query.py
defoffset(self,offset:int)->Self:"""    Set the offset for the results.    Parameters    ----------    offset: int        The offset to start fetching results from.    """self._inner.offset(offset)returnself
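Together with limit, offset can be used for simple paging over a filtered scan; a rough sketch (the table, predicate and page size are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable
page_size = 50
page = 2  # zero-based page number
rows = await (
    table.query()
    .where("category = 'news'")   # placeholder predicate
    .limit(page_size)
    .offset(page * page_size)     # skip the first two pages
    .to_list()
)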

fast_search

fast_search()->Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code inlancedb/query.py
deffast_search(self)->Self:"""    Skip searching un-indexed data.    This can make queries faster, but will miss any data that has not been    indexed.    !!! tip        You can add new data into an existing index by calling        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].    """self._inner.fast_search()returnself

with_row_id

with_row_id()->Self

Include the _rowid column in the results.

Source code inlancedb/query.py
defwith_row_id(self)->Self:"""    Include the _rowid column in the results.    """self._inner.with_row_id()returnself

postfilter

postfilter()->Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (see refine_factor for more detail). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.

Source code inlancedb/query.py
defpostfilter(self)->Self:"""    If this is called then filtering will happen after the search instead of    before.    By default filtering will be performed before the search.  This is how    filtering is typically understood to work.  This prefilter step does add some    additional latency.  Creating a scalar index on the filter column(s) can    often improve this latency.  However, sometimes a filter is too complex or    scalar indices cannot be applied to the column.  In these cases postfiltering    can be used instead of prefiltering to improve latency.    Post filtering applies the filter to the results of the search.  This    means we only run the filter on a much smaller set of data.  However, it can    cause the query to return fewer than `limit` results (or even no results) if    none of the nearest results match the filter.    Post filtering happens during the "refine stage" (described in more detail in    @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine    factor can often help restore some of the results lost by post filtering.    """self._inner.postfilter()returnself
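A sketch of applying a filter after the search rather than before it (the table, search text and predicate are placeholders; runs inside an async function). Because rows are dropped after the search, you may get fewer than limit results:

# `table` is an assumed open AsyncTable with a native FTS index on "text"
results = await (
    table.query()
    .nearest_to_text("keyboard")
    .where("price < 100")   # applied to the search results because of postfilter()
    .postfilter()
    .limit(10)
    .to_arrow()
)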

to_arrowasync

to_arrow(timeout:Optional[timedelta]=None)->Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete.If not specified, no timeout is applied. If the query does notcomplete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_arrow(self,timeout:Optional[timedelta]=None)->pa.Table:"""    Execute the query and collect the results into an Apache Arrow Table.    This method will collect all results into memory before returning.  If    you expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches]    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """batch_iter=awaitself.to_batches(timeout=timeout)returnpa.Table.from_batches(awaitbatch_iter.read_all(),schema=batch_iter.schema)

to_listasync

to_list(timeout:Optional[timedelta]=None)->List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete.If not specified, no timeout is applied. If the query does notcomplete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_list(self,timeout:Optional[timedelta]=None)->List[dict]:"""    Execute the query and return the results as a list of dictionaries.    Each list entry is a dictionary with the selected column names as keys,    or all table columns if `select` is not called. The vector and the "_distance"    fields are returned whether or not they're explicitly selected.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(awaitself.to_arrow(timeout=timeout)).to_pylist()

to_pandasasync

to_pandas(flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None)->'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>>importasyncio>>>fromlancedbimportconnect_async>>>asyncdefdoctest_example():...conn=awaitconnect_async("./.lancedb")...table=awaitconn.create_table("my_table",data=[{"a":1,"b":2}])...asyncforbatchinawaittable.query().to_batches():...batch_df=batch.to_pandas()>>>asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int,bool]], default:None) –

If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete.If not specified, no timeout is applied. If the query does notcomplete within the specified time, an error will be raised.

Source code inlancedb/query.py
asyncdefto_pandas(self,flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None,)->"pd.DataFrame":"""    Execute the query and collect the results into a pandas DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    pandas separately.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = batch.to_pandas()    >>> asyncio.run(doctest_example())    Parameters    ----------    flatten: Optional[Union[int, bool]]        If flatten is True, flatten all nested columns.        If flatten is an integer, flatten the nested columns up to the        specified depth.        If unspecified, do not flatten the nested columns.    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    """return(flatten_columns(awaitself.to_arrow(timeout=timeout),flatten)).to_pandas()

to_polarsasync

to_polars(timeout:Optional[timedelta]=None)->'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete.If not specified, no timeout is applied. If the query does notcomplete within the specified time, an error will be raised.

Examples:

>>>importasyncio>>>importpolarsaspl>>>fromlancedbimportconnect_async>>>asyncdefdoctest_example():...conn=awaitconnect_async("./.lancedb")...table=awaitconn.create_table("my_table",data=[{"a":1,"b":2}])...asyncforbatchinawaittable.query().to_batches():...batch_df=pl.from_arrow(batch)>>>asyncio.run(doctest_example())
Source code inlancedb/query.py
asyncdefto_polars(self,timeout:Optional[timedelta]=None,)->"pl.DataFrame":"""    Execute the query and collect the results into a Polars DataFrame.    This method will collect all results into memory before returning.  If you    expect a large number of results, you may want to use    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to    polars separately.    Parameters    ----------    timeout: Optional[timedelta]        The maximum time to wait for the query to complete.        If not specified, no timeout is applied. If the query does not        complete within the specified time, an error will be raised.    Examples    --------    >>> import asyncio    >>> import polars as pl    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])    ...     async for batch in await table.query().to_batches():    ...         batch_df = pl.from_arrow(batch)    >>> asyncio.run(doctest_example())    """importpolarsasplreturnpl.from_arrow(awaitself.to_arrow(timeout=timeout))

explain_planasync

explain_plan(verbose:Optional[bool]=False)

Return the execution plan for this query.

Examples:

>>>importasyncio>>>fromlancedbimportconnect_async>>>asyncdefdoctest_example():...conn=awaitconnect_async("./.lancedb")...table=awaitconn.create_table("my_table",[{"vector":[99,99]}])...query=[100,100]...plan=awaittable.query().nearest_to([1,2]).explain_plan(True)...print(plan)>>>asyncio.run(doctest_example())ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]  GlobalLimitExec: skip=0, fetch=10    FilterExec: _distance@2 IS NOT NULL      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]        KNNVectorDistance: metric=l2          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefexplain_plan(self,verbose:Optional[bool]=False):"""Return the execution plan for this query.    Examples    --------    >>> import asyncio    >>> from lancedb import connect_async    >>> async def doctest_example():    ...     conn = await connect_async("./.lancedb")    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])    ...     query = [100, 100]    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)    ...     print(plan)    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]      GlobalLimitExec: skip=0, fetch=10        FilterExec: _distance@2 IS NOT NULL          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]            KNNVectorDistance: metric=l2              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false    Parameters    ----------    verbose : bool, default False        Use a verbose output format.    Returns    -------    plan : str    """# noqa: E501returnawaitself._inner.explain_plan(verbose)

analyze_planasync

analyze_plan()

Execute the query and display with runtime metrics.

Returns:

  • plan (str) –
Source code inlancedb/query.py
asyncdefanalyze_plan(self):"""Execute the query and display with runtime metrics.    Returns    -------    plan : str    """returnawaitself._inner.analyze_plan()

nearest_to

nearest_to(query_vector:Union[VEC,Tuple,List[VEC]])->AsyncHybridQuery

In addition to doing text search on the LanceDB Table, also find the nearest vectors to the given query vector.

This converts the query from an FTS query to a hybrid query. Results from the vector search will be combined with results from the FTS query.

This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.

By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.

If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.

If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".

For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.

The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.

Hybrid searches always have a limit. If limit has not been called then a default limit of 10 will be used.

Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. This can be useful if you want to find the nearest vectors to multiple query vectors. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in then an additional column query_index will be added to the results. This column will contain the index of the query vector that the result is nearest to.

Source code inlancedb/query.py
defnearest_to(self,query_vector:Union[VEC,Tuple,List[VEC]],)->AsyncHybridQuery:"""    In addition doing text search on the LanceDB Table, also    find the nearest vectors to the given query vector.    This converts the query from a FTS Query to a Hybrid query. Results    from the vector search will be combined with results from the FTS query.    This method will attempt to convert the input to the query vector    expected by the embedding model.  If the input cannot be converted    then an error will be thrown.    By default, there is no embedding model, and the input should be    something that can be converted to a pyarrow array of floats.  This    includes lists, numpy arrays, and tuples.    If there is only one vector column (a column whose data type is a    fixed size list of floats) then the column does not need to be specified.    If there is more than one vector column you must use    [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify    which column you would like to compare with.    If no index has been created on the vector column then a vector query    will perform a distance comparison between the query vector and every    vector in the database and then sort the results.  This is sometimes    called a "flat search"    For small databases, with tens of thousands of vectors or less, this can    be reasonably fast.  In larger databases you should create a vector index    on the column.  If there is a vector index then an "approximate" nearest    neighbor search (frequently called an ANN search) will be performed.  This    search is much faster, but the results will be approximate.    The query can be further parameterized using the returned builder.  There    are various ANN search parameters that will let you fine tune your recall    accuracy vs search latency.    Hybrid searches always have a [limit][].  If `limit` has not been called then    a default `limit` of 10 will be used.    Typically, a single vector is passed in as the query. However, you can also    pass in multiple vectors.  This can be useful if you want to find the nearest    vectors to multiple query vectors. This is not expected to be faster than    making multiple queries concurrently; it is just a convenience method.    If multiple vectors are passed in then an additional column `query_index`    will be added to the results.  This column will contain the index of the    query vector that the result is nearest to.    """ifquery_vectorisNone:raiseValueError("query_vector can not be None")if(isinstance(query_vector,list)andlen(query_vector)>0andnotisinstance(query_vector[0],(float,int))):# multiple have been passedquery_vectors=[AsyncQuery._query_vec_to_array(v)forvinquery_vector]new_self=self._inner.nearest_to(query_vectors[0])forvinquery_vectors[1:]:new_self.add_query_vector(v)returnAsyncHybridQuery(new_self)else:returnAsyncHybridQuery(self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)))
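A sketch going from text to hybrid: start with a full text query and then attach a vector component (the table, search text and vector are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable with an FTS index and a vector column
hybrid = (
    table.query()
    .nearest_to_text("independence day")  # AsyncFTSQuery
    .nearest_to([0.5, 0.5])               # now an AsyncHybridQuery
    .limit(10)
)
df = await hybrid.to_pandas()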

lancedb.query.AsyncHybridQuery

Bases:AsyncQueryBase,AsyncVectorQueryBase

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses the reciprocal rank fusion score for reranking.

To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.

Source code inlancedb/query.py
classAsyncHybridQuery(AsyncQueryBase,AsyncVectorQueryBase):"""    A query builder that performs hybrid vector and full text search.    Results are combined and reranked based on the specified reranker.    By default, the results are reranked using the RRFReranker, which    uses reciprocal rank fusion score for reranking.    To make the vector and fts results comparable, the scores are normalized.    Instead of normalizing scores, the `normalize` parameter can be set to "rank"    in the `rerank` method to convert the scores to ranks and then normalize them.    """def__init__(self,inner:LanceHybridQuery):super().__init__(inner)self._inner=innerself._norm="score"self._reranker=RRFReranker()defrerank(self,reranker:Reranker=RRFReranker(),normalize:str="score")->AsyncHybridQuery:"""        Rerank the hybrid search results using the specified reranker. The reranker        must be an instance of Reranker class.        Parameters        ----------        reranker: Reranker, default RRFReranker()            The reranker to use. Must be an instance of Reranker class.        normalize: str, default "score"            The method to normalize the scores. Can be "rank" or "score". If "rank",            the scores are converted to ranks and then normalized. If "score", the            scores are normalized directly.        Returns        -------        AsyncHybridQuery            The AsyncHybridQuery object.        """ifnormalizenotin["rank","score"]:raiseValueError("normalize must be 'rank' or 'score'.")ifrerankerandnotisinstance(reranker,Reranker):raiseValueError("reranker must be an instance of Reranker class.")self._norm=normalizeself._reranker=rerankerreturnselfasyncdefto_batches(self,*,max_batch_length:Optional[int]=None,timeout:Optional[timedelta]=None,)->AsyncRecordBatchReader:fts_query=AsyncFTSQuery(self._inner.to_fts_query())vec_query=AsyncVectorQuery(self._inner.to_vector_query())# save the row ID choice that was made on the query builder and force it# to actually fetch the row ids because we need this for rerankingwith_row_ids=self._inner.get_with_row_id()fts_query.with_row_id()vec_query.with_row_id()fts_results,vector_results=awaitasyncio.gather(fts_query.to_arrow(timeout=timeout),vec_query.to_arrow(timeout=timeout),)result=LanceHybridQueryBuilder._combine_hybrid_results(fts_results=fts_results,vector_results=vector_results,norm=self._norm,fts_query=fts_query.get_query(),reranker=self._reranker,limit=self._inner.get_limit(),with_row_ids=with_row_ids,)returnAsyncRecordBatchReader(result,max_batch_length=max_batch_length)asyncdefexplain_plan(self,verbose:Optional[bool]=False):"""Return the execution plan for this query.        The output includes both the vector and FTS search plans.        Examples        --------        >>> import asyncio        >>> from lancedb import connect_async        >>> from lancedb.index import FTS        >>> async def doctest_example():        ...     conn = await connect_async("./.lancedb")        ...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])        ...     await table.create_index("text", config=FTS(with_position=False))        ...     query = [100, 100]        ...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)        ...     
print(plan)        >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE        Vector Search Plan:        ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]          Take: columns="vector, _rowid, _distance, (text)"            CoalesceBatchesExec: target_batch_size=1024              GlobalLimitExec: skip=0, fetch=10                FilterExec: _distance@2 IS NOT NULL                  SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]                    KNNVectorDistance: metric=l2                      LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false        <BLANKLINE>        FTS Search Plan:        ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]          Take: columns="_rowid, _score, (vector), (text)"            CoalesceBatchesExec: target_batch_size=1024              GlobalLimitExec: skip=0, fetch=10                MatchQuery: query=hello        <BLANKLINE>        Parameters        ----------        verbose : bool, default False            Use a verbose output format.        Returns        -------        plan : str        """# noqa: E501results=["Vector Search Plan:"]results.append(awaitself._inner.to_vector_query().explain_plan(verbose))results.append("FTS Search Plan:")results.append(awaitself._inner.to_fts_query().explain_plan(verbose))return"\n".join(results)asyncdefanalyze_plan(self):"""        Execute the query and return the physical execution plan with runtime metrics.        This runs both the vector and FTS (full-text search) queries and returns        detailed metrics for each step of execution—such as rows processed,        elapsed time, I/O stats, and more. It’s useful for debugging and        performance analysis.        Returns        -------        plan : str        """results=["Vector Search Query:"]results.append(awaitself._inner.to_vector_query().analyze_plan())results.append("FTS Search Query:")results.append(awaitself._inner.to_fts_query().analyze_plan())return"\n".join(results)
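A sketch of an end-to-end hybrid search with the reranker stated explicitly (RRFReranker and normalize="score" are also the defaults; the database path, table and column contents are placeholders):

import asyncio
from lancedb import connect_async
from lancedb.rerankers import RRFReranker

async def hybrid_example():
    db = await connect_async("./.lancedb")
    table = await db.open_table("docs")  # assumed table with a vector column and an FTS index
    results = await (
        table.query()
        .nearest_to([0.1, 0.2])
        .nearest_to_text("hello world")
        .rerank(RRFReranker(), normalize="score")  # defaults shown explicitly
        .limit(10)
        .to_pandas()
    )
    print(results)

asyncio.run(hybrid_example())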

column

column(column:str)->Self

Set the vector column to query

This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.

This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.

Source code inlancedb/query.py
defcolumn(self,column:str)->Self:"""    Set the vector column to query    This controls which column is compared to the query vector supplied in    the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].    This parameter must be specified if the table has more than one column    whose data type is a fixed-size-list of floats.    """self._inner.column(column)returnself

nprobes

nprobes(nprobes:int)->Self

Set the number of partitions to search (probe)

This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.

The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.

The partitions whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.

Source code inlancedb/query.py
defnprobes(self,nprobes:int)->Self:"""    Set the number of partitions to search (probe)    This argument is only used when the vector column has an IVF-based index.    If there is no index then this value is ignored.    The IVF stage of IVF PQ divides the input into partitions (clusters) of    related values.    The partition whose centroids are closest to the query vector will be    exhaustiely searched to find matches.  This parameter controls how many    partitions should be searched.    Increasing this value will increase the recall of your query but will    also increase the latency of your query.  The default value is 20.  This    default is good for many cases but the best value to use will depend on    your data and the recall that you need to achieve.    For best results we recommend tuning this parameter with a benchmark against    your actual data to find the smallest possible value that will still give    you the desired recall.    """self._inner.nprobes(nprobes)returnself
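For example, a sketch of trading a little latency for recall by searching more partitions than the default 20 (the table and query_vector are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable with an IVF-based vector index
results = await (
    table.query()
    .nearest_to(query_vector)  # `query_vector` is a placeholder list of floats
    .nprobes(50)               # search 50 partitions instead of the default 20
    .limit(10)
    .to_arrow()
)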

minimum_nprobes

minimum_nprobes(minimum_nprobes:int)->Self

Set the minimum number of probes to use.

See nprobes for more details.

These partitions will be searched on every indexed vector query and will increase recall at the expense of latency.

Source code inlancedb/query.py
defminimum_nprobes(self,minimum_nprobes:int)->Self:"""Set the minimum number of probes to use.    See `nprobes` for more details.    These partitions will be searched on every indexed vector query and will    increase recall at the expense of latency.    """self._inner.minimum_nprobes(minimum_nprobes)returnself

maximum_nprobes

maximum_nprobes(maximum_nprobes:int)->Self

Set the maximum number of probes to use.

See nprobes for more details.

If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.

This can be useful when there is a narrow filter to allow these queries to spend more time searching and avoid potential false negatives.

If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.

Source code inlancedb/query.py
defmaximum_nprobes(self,maximum_nprobes:int)->Self:"""Set the maximum number of probes to use.    See `nprobes` for more details.    If this value is greater than `minimum_nprobes` then the excess partitions    will be searched only if we have not found enough results.    This can be useful when there is a narrow filter to allow these queries to    spend more time searching and avoid potential false negatives.    If this value is 0 then no limit will be applied and all partitions could be    searched if needed to satisfy the limit.    """self._inner.maximum_nprobes(maximum_nprobes)returnself
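A sketch of combining the two bounds so a narrow filter can fall back to searching more partitions (the values and predicate are illustrative; runs inside an async function):

# `table` is an assumed open AsyncTable with an IVF-based vector index
results = await (
    table.query()
    .nearest_to(query_vector)
    .where("category = 'rare'")   # placeholder narrow filter
    .minimum_nprobes(20)          # always probe at least 20 partitions
    .maximum_nprobes(100)         # probe up to 100 if too few rows survive the filter
    .limit(10)
    .to_list()
)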

distance_range

distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->Self

Set the distance range to use.

Only rows with distances within the range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default:None) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default:None) –

    The upper bound of the distance range.

Returns:

Source code inlancedb/query.py
defdistance_range(self,lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->Self:"""Set the distance range to use.    Only rows with distances within range [lower_bound, upper_bound)    will be returned.    Parameters    ----------    lower_bound: Optional[float]        The lower bound of the distance range.    upper_bound: Optional[float]        The upper bound of the distance range.    Returns    -------    AsyncVectorQuery        The AsyncVectorQuery object.    """self._inner.distance_range(lower_bound,upper_bound)returnself
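A sketch of keeping only results within a distance band; the bounds are arbitrary and depend on your data and distance metric (runs inside an async function):

# `table` is an assumed open AsyncTable
close_matches = await (
    table.query()
    .nearest_to(query_vector)
    .distance_range(lower_bound=0.0, upper_bound=0.5)  # keep distances in [0.0, 0.5)
    .limit(10)
    .to_list()
)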

ef

ef(ef:int)->Self

Set the number of candidates to consider during search

This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

Source code inlancedb/query.py
defef(self,ef:int)->Self:"""    Set the number of candidates to consider during search    This argument is only used when the vector column has an HNSW index.    If there is no index then this value is ignored.    Increasing this value will increase the recall of your query but will also    increase the latency of your query.  The default value is 1.5 * limit.  This    default is good for many cases but the best value to use will depend on your    data and the recall that you need to achieve.    """self._inner.ef(ef)returnself

refine_factor

refine_factor(refine_factor:int)->Self

A multiplier to control how many additional rows are taken during the refine step.

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.

This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.

To refine results LanceDb will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDb then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.

Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDb still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.

Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different than the true distance between the query vector and the actual uncompressed vector.

Source code inlancedb/query.py
defrefine_factor(self,refine_factor:int)->Self:"""    A multiplier to control how many additional rows are taken during the refine    step    This argument is only used when the vector column has an IVF PQ index.    If there is no index then this value is ignored.    An IVF PQ index stores compressed (quantized) values.  They query vector is    compared against these values and, since they are compressed, the comparison is    inaccurate.    This parameter can be used to refine the results.  It can improve both improve    recall and correct the ordering of the nearest results.    To refine results LanceDb will first perform an ANN search to find the nearest    `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and    `limit` is the default (10) then the first 30 results will be selected.  LanceDb    then fetches the full, uncompressed, values for these 30 results.  The results    are then reordered by the true distance and only the nearest 10 are kept.    Note: there is a difference between calling this method with a value of 1 and    never calling this method at all.  Calling this method with any value will have    an impact on your search latency.  When you call this method with a    `refine_factor` of 1 then LanceDb still needs to fetch the full, uncompressed,    values so that it can potentially reorder the results.    Note: if this method is NOT called then the distances returned in the _distance    column will be approximate distances based on the comparison of the quantized    query vector and the quantized result vectors.  This can be considerably    different than the true distance between the query vector and the actual    uncompressed vector.    """self._inner.refine_factor(refine_factor)returnself
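A sketch of adding a refine step on top of an IVF PQ index: with the default limit of 10 and a refine_factor of 3, roughly 30 candidates are fetched, re-scored against the uncompressed vectors, and the best 10 kept (the table and query_vector are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable with an IVF PQ vector index
results = await (
    table.query()
    .nearest_to(query_vector)
    .refine_factor(3)   # re-rank ~3 * limit candidates using uncompressed vectors
    .limit(10)
    .to_pandas()
)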

distance_type

distance_type(distance_type:str)->Self

Set the distance metric to use

When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See IvfPqOptions.distanceType for more details on the different distance metrics available.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

By default "l2" is used.

Source code inlancedb/query.py
defdistance_type(self,distance_type:str)->Self:"""    Set the distance metric to use    When performing a vector search we try and find the "nearest" vectors according    to some kind of distance metric.  This parameter controls which distance metric    to use.  See @see {@link IvfPqOptions.distanceType} for more details on the    different distance metrics available.    Note: if there is a vector index then the distance type used MUST match the    distance type used to train the vector index.  If this is not done then the    results will be invalid.    By default "l2" is used.    """self._inner.distance_type(distance_type)returnself
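For example, a sketch of a cosine-distance query; if the column is indexed, the index must also have been trained with the same distance type (the table and query_vector are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable
results = await (
    table.query()
    .nearest_to(query_vector)
    .distance_type("cosine")  # must match the distance type the index was trained with
    .limit(10)
    .to_arrow()
)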

bypass_vector_index

bypass_vector_index()->Self

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Source code inlancedb/query.py
defbypass_vector_index(self)->Self:"""    If this is called then any vector index is skipped    An exhaustive (flat) search will be performed.  The query vector will    be compared to every vector in the table.  At high scales this can be    expensive.  However, this is often still useful.  For example, skipping    the vector index can give you ground truth results which you can use to    calculate your recall to select an appropriate value for nprobes.    """self._inner.bypass_vector_index()returnself
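A rough sketch of using the flat search as ground truth to estimate recall for the indexed search; the recall arithmetic here is an illustration, not a library helper (the table and query_vector are placeholders; runs inside an async function):

# `table` is an assumed open AsyncTable with a vector index
ann_rows = await (
    table.query().nearest_to(query_vector).limit(10).with_row_id().to_list()
)
exact_rows = await (
    table.query()
    .nearest_to(query_vector)
    .bypass_vector_index()   # exhaustive search: exact nearest neighbors
    .limit(10)
    .with_row_id()
    .to_list()
)
ann_ids = {row["_rowid"] for row in ann_rows}
exact_ids = {row["_rowid"] for row in exact_rows}
recall = len(ann_ids & exact_ids) / len(exact_ids)  # fraction of true neighbors found
print(f"recall@10 = {recall:.2f}")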

where

where(predicate:str)->Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code inlancedb/query.py
defwhere(self,predicate:str)->Self:"""    Only return rows matching the given predicate    The predicate should be supplied as an SQL query string.    Examples    --------    >>> predicate = "x > 10"    >>> predicate = "y > 0 AND y < 100"    >>> predicate = "x > 5 OR y = 'test'"    Filtering performance can often be improved by creating a scalar index    on the filter column(s).    """self._inner.where(predicate)returnself

select

select(columns:Union[List[str],dict[str,str]])->Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.

Source code inlancedb/query.py
defselect(self,columns:Union[List[str],dict[str,str]])->Self:"""    Return only the specified columns.    By default a query will return all columns from the table.  However, this can    have a very significant impact on latency.  LanceDb stores data in a columnar    fashion.  This    means we can finely tune our I/O to select exactly the columns we need.    As a best practice you should always limit queries to the columns that you need.    If you pass in a list of column names then only those columns will be    returned.    You can also use this method to create new "dynamic" columns based on your    existing columns. For example, you may not care about "a" or "b" but instead    simply want "a + b".  This is often seen in the SELECT clause of an SQL query    (e.g. `SELECT a+b FROM my_table`).    To create dynamic columns you can pass in a dict[str, str].  A column will be    returned for each entry in the map.  The key provides the name of the column.    The value is an SQL string used to specify how the column is calculated.    For example, an SQL query might state `SELECT a + b AS combined, c`.  The    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.    Columns will always be returned in the order given, even if that order is    different than the order used when adding the data.    """ifisinstance(columns,list)andall(isinstance(c,str)forcincolumns):self._inner.select_columns(columns)elifisinstance(columns,dict)andall(isinstance(k,str)andisinstance(v,str)fork,vincolumns.items()):self._inner.select(list(columns.items()))else:raiseTypeError("columns must be a list of column names or a dict")returnself

limit

limit(limit:int)->Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code inlancedb/query.py
deflimit(self,limit:int)->Self:"""    Set the maximum number of results to return.    By default, a plain search has no limit.  If this method is not    called then every valid row from the table will be returned.    """self._inner.limit(limit)returnself

offset

offset(offset:int)->Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code inlancedb/query.py
defoffset(self,offset:int)->Self:"""    Set the offset for the results.    Parameters    ----------    offset: int        The offset to start fetching results from.    """self._inner.offset(offset)returnself

fast_search

fast_search()->Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code inlancedb/query.py
deffast_search(self)->Self:"""    Skip searching un-indexed data.    This can make queries faster, but will miss any data that has not been    indexed.    !!! tip        You can add new data into an existing index by calling        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].    """self._inner.fast_search()returnself

with_row_id

with_row_id()->Self

Include the _rowid column in the results.

Source code inlancedb/query.py
defwith_row_id(self)->Self:"""    Include the _rowid column in the results.    """self._inner.with_row_id()returnself

postfilter

postfilter()->Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (see refine_factor for more detail). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.

Source code inlancedb/query.py
defpostfilter(self)->Self:"""    If this is called then filtering will happen after the search instead of    before.    By default filtering will be performed before the search.  This is how    filtering is typically understood to work.  This prefilter step does add some    additional latency.  Creating a scalar index on the filter column(s) can    often improve this latency.  However, sometimes a filter is too complex or    scalar indices cannot be applied to the column.  In these cases postfiltering    can be used instead of prefiltering to improve latency.    Post filtering applies the filter to the results of the search.  This    means we only run the filter on a much smaller set of data.  However, it can    cause the query to return fewer than `limit` results (or even no results) if    none of the nearest results match the filter.    Post filtering happens during the "refine stage" (described in more detail in    @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine    factor can often help restore some of the results lost by post filtering.    """self._inner.postfilter()returnself

to_arrowasync

to_arrow(timeout:Optional[timedelta]=None)->Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
async def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches]

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    batch_iter = await self.to_batches(timeout=timeout)
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )
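
A minimal sketch (the table name and timeout value are illustrative) of collecting a query into an Arrow table with a timeout:

>>> import asyncio
>>> from datetime import timedelta
>>> from lancedb import connect_async
>>> async def to_arrow_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("my_table")
...     arrow_table = await table.query().limit(100).to_arrow(timeout=timedelta(seconds=30))
...     print(arrow_table.num_rows)
>>> asyncio.run(to_arrow_example())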

to_list async

to_list(timeout:Optional[timedelta]=None)->List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
async def to_list(self, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (await self.to_arrow(timeout=timeout)).to_pylist()
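
As a sketch, assuming a table with a hypothetical "id" column, select can be combined with to_list to keep only the columns you need:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def to_list_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("my_table")
...     rows = await table.query().select(["id"]).limit(3).to_list()
...     # e.g. [{"id": 1}, {"id": 2}, {"id": 3}]
>>> asyncio.run(to_list_example())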

to_pandas async

to_pandas(flatten:Optional[Union[int,bool]]=None,timeout:Optional[timedelta]=None)->'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int,bool]], default:None) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code inlancedb/query.py
async def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (
        flatten_columns(await self.to_arrow(timeout=timeout), flatten)
    ).to_pandas()
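
The flatten option can also be used when collecting directly into pandas. A small sketch, assuming a hypothetical table whose "metadata" column is a struct:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def to_pandas_flatten_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("my_table")
...     # flatten=1 expands struct columns one level, e.g. "metadata" -> "metadata.author", ...
...     df = await table.query().limit(10).to_pandas(flatten=1)
>>> asyncio.run(to_pandas_flatten_example())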

to_polars async

to_polars(timeout:Optional[timedelta]=None)->'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default:None) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
Source code inlancedb/query.py
async def to_polars(
    self,
    timeout: Optional[timedelta] = None,
) -> "pl.DataFrame":
    """
    Execute the query and collect the results into a Polars DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    polars separately.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.

    Examples
    --------
    >>> import asyncio
    >>> import polars as pl
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = pl.from_arrow(batch)
    >>> asyncio.run(doctest_example())
    """
    import polars as pl

    return pl.from_arrow(await self.to_arrow(timeout=timeout))

rerank

rerank(reranker:Reranker=RRFReranker(),normalize:str='score')->AsyncHybridQuery

Rerank the hybrid search results using the specified reranker. The reranker must be an instance of Reranker class.

Parameters:

  • reranker (Reranker, default:RRFReranker()) –

    The reranker to use. Must be an instance of Reranker class.

  • normalize (str, default:'score') –

    The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.

Returns:

  • AsyncHybridQuery –

    The AsyncHybridQuery object.

Source code inlancedb/query.py
def rerank(
    self, reranker: Reranker = RRFReranker(), normalize: str = "score"
) -> AsyncHybridQuery:
    """
    Rerank the hybrid search results using the specified reranker. The reranker
    must be an instance of Reranker class.

    Parameters
    ----------
    reranker: Reranker, default RRFReranker()
        The reranker to use. Must be an instance of Reranker class.
    normalize: str, default "score"
        The method to normalize the scores. Can be "rank" or "score". If "rank",
        the scores are converted to ranks and then normalized. If "score", the
        scores are normalized directly.

    Returns
    -------
    AsyncHybridQuery
        The AsyncHybridQuery object.
    """
    if normalize not in ["rank", "score"]:
        raise ValueError("normalize must be 'rank' or 'score'.")
    if reranker and not isinstance(reranker, Reranker):
        raise ValueError("reranker must be an instance of Reranker class.")

    self._norm = normalize
    self._reranker = reranker

    return self
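
A sketch of a hybrid query that applies the default RRF reranker; the table name, query vector, and text are placeholders, and the table is assumed to have both a vector column and an FTS index:

>>> import asyncio
>>> from lancedb import connect_async
>>> from lancedb.rerankers import RRFReranker
>>> async def rerank_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("my_table")
...     results = await (
...         table.query()
...         .nearest_to([0.1, 0.2])
...         .nearest_to_text("hello world")
...         .rerank(RRFReranker(), normalize="score")
...         .limit(10)
...         .to_list()
...     )
>>> asyncio.run(rerank_example())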

explain_plan async

explain_plan(verbose:Optional[bool]=False)

Return the execution plan for this query.

The output includes both the vector and FTS search plans.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> from lancedb.index import FTS
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
...     await table.create_index("text", config=FTS(with_position=False))
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
Vector Search Plan:
ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
  Take: columns="vector, _rowid, _distance, (text)"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=10
        FilterExec: _distance@2 IS NOT NULL
          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

FTS Search Plan:
ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]
  Take: columns="_rowid, _score, (vector), (text)"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=10
        MatchQuery: query=hello

Parameters:

  • verbose (bool, default:False) –

    Use a verbose output format.

Returns:

  • plan (str) –
Source code inlancedb/query.py
async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    The output includes both the vector and FTS search plans.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> from lancedb.index import FTS
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
    ...     await table.create_index("text", config=FTS(with_position=False))
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    Vector Search Plan:
    ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
      Take: columns="vector, _rowid, _distance, (text)"
        CoalesceBatchesExec: target_batch_size=1024
          GlobalLimitExec: skip=0, fetch=10
            FilterExec: _distance@2 IS NOT NULL
              SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
                KNNVectorDistance: metric=l2
                  LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
    <BLANKLINE>
    FTS Search Plan:
    ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]
      Take: columns="_rowid, _score, (vector), (text)"
        CoalesceBatchesExec: target_batch_size=1024
          GlobalLimitExec: skip=0, fetch=10
            MatchQuery: query=hello
    <BLANKLINE>

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    results = ["Vector Search Plan:"]
    results.append(await self._inner.to_vector_query().explain_plan(verbose))
    results.append("FTS Search Plan:")
    results.append(await self._inner.to_fts_query().explain_plan(verbose))
    return "\n".join(results)

analyze_plan async

analyze_plan()

Execute the query and return the physical execution plan with runtime metrics.

This runs both the vector and FTS (full-text search) queries and returns detailed metrics for each step of execution, such as rows processed, elapsed time, and I/O stats. It’s useful for debugging and performance analysis.

Returns:

  • plan (str) –
Source code inlancedb/query.py
async def analyze_plan(self):
    """
    Execute the query and return the physical execution plan with runtime metrics.

    This runs both the vector and FTS (full-text search) queries and returns
    detailed metrics for each step of execution, such as rows processed,
    elapsed time, I/O stats, and more. It’s useful for debugging and
    performance analysis.

    Returns
    -------
    plan : str
    """
    results = ["Vector Search Query:"]
    results.append(await self._inner.to_vector_query().analyze_plan())
    results.append("FTS Search Query:")
    results.append(await self._inner.to_fts_query().analyze_plan())
    return "\n".join(results)
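
A usage sketch mirroring the explain_plan example above (the table and query values are placeholders); unlike explain_plan, this executes both sub-queries so it can report runtime metrics:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def analyze_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.open_table("my_table")
...     metrics = await table.query().nearest_to([1, 2]).nearest_to_text("hello").analyze_plan()
...     print(metrics)
>>> asyncio.run(analyze_example())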
