- Installation
- Connections (Synchronous)
- Tables (Synchronous)
- Table
- name
- version
- schema
- tags
- embedding_functions
- __len__
- count_rows
- to_pandas
- to_arrow
- create_index
- drop_index
- wait_for_index
- stats
- create_scalar_index
- create_fts_index
- add
- merge_insert
- search
- delete
- update
- cleanup_old_versions
- compact_files
- optimize
- list_indices
- index_stats
- add_columns
- alter_columns
- drop_columns
- checkout
- checkout_latest
- restore
- list_versions
- uses_v2_manifest_paths
- migrate_v2_manifest_paths
- Table
- Querying (Synchronous)
- Embeddings
- Context
- Full text search
- Utilities
- Integrations
- Pydantic
- Reranking
- Connections (Asynchronous)
- Tables (Asynchronous)
- AsyncTable
- name
- tags
- __init__
- is_open
- close
- schema
- embedding_functions
- count_rows
- head
- query
- to_pandas
- to_arrow
- create_index
- drop_index
- prewarm_index
- wait_for_index
- stats
- add
- merge_insert
- search
- vector_search
- delete
- update
- add_columns
- alter_columns
- drop_columns
- version
- list_versions
- checkout
- checkout_latest
- restore
- optimize
- list_indices
- index_stats
- uses_v2_manifest_paths
- migrate_manifest_paths_v2
- replace_field_metadata
- AsyncTable
- Indices (Asynchronous)
- Querying (Asynchronous)
Python API Reference
This section contains the API reference for the Python API. There are both a synchronous and an asynchronous API client.
The general flow of using the API is:
- Use lancedb.connect or lancedb.connect_async to connect to a database.
- Use the returned lancedb.DBConnection or lancedb.AsyncConnection to create or open tables.
- Use the returned lancedb.table.Table or lancedb.AsyncTable to query or modify tables.
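As a minimal sketch of that flow (the table name and data here are illustrative):
import lancedb

db = lancedb.connect("./.lancedb")  # 1. connect to a local database
table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])  # 2. create or open a table
results = table.search([1.0, 1.0]).limit(5).to_pandas()  # 3. query it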
Installation
The following methods describe the synchronous API client. There is also an asynchronous API client.
Connections (Synchronous)
lancedb.connect
connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, client_config: Union[ClientConfig, Dict[str, Any], None] = None, storage_options: Optional[Dict[str, str]] = None, **kwargs: Any) -> DBConnection
Connect to a LanceDB database.
Parameters:
uri
(URI
) –The uri of the database.
api_key
(Optional[str]
, default:None
) –If present, connect to LanceDB Cloud. Otherwise, connect to a database on the file system or cloud storage. Can be set via the environment variable LANCEDB_API_KEY.
region
(str
, default:'us-east-1'
) –The region to use for LanceDB Cloud.
host_override
(Optional[str]
, default:None
) –The override url for LanceDB Cloud.
read_consistency_interval
(Optional[timedelta]
, default:None
) –(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.
client_config
(Union[ClientConfig,Dict[str,Any], None]
, default:None
) –Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/
Examples:
For a local directory, provide a path for the database:
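For example (the path is illustrative):
import lancedb

db = lancedb.connect("~/.lancedb")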
For object storage, use a URI prefix:
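For example (the bucket and prefix are illustrative):
import lancedb

db = lancedb.connect("s3://my-bucket/lancedb")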
Connect to LanceDB cloud:
>>> db = lancedb.connect("db://my_database", api_key="ldb_...",
...     client_config={"retry_config": {"retries": 5}})
Returns:
conn
(DBConnection
) –A connection to a LanceDB database.
Source code in lancedb/__init__.py
lancedb.db.DBConnection
Bases: EnforceOverrides
An active LanceDB connection interface.
Source code in lancedb/db.py
table_names abstractmethod
List all tables in this database, in sorted order
Parameters:
page_token
(Optional[str]
, default:None
) –The token to use for pagination. If not present, start from the beginning. Typically, this token is the last table name from the previous page. Only supported by LanceDB Cloud.
limit
(int
, default:10
) –The size of the page to return. Only supported by LanceDB Cloud.
Returns:
Iterable of str
–The table names, in sorted order.
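For example, a minimal sketch reusing a db connection from the examples above (the pagination arguments are only honored by LanceDB Cloud):
for name in db.table_names():
    print(name)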
Source code in lancedb/db.py
create_table abstractmethod
create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None, *, storage_options: Optional[Dict[str, str]] = None, data_storage_version: Optional[str] = None, enable_v2_manifest_paths: Optional[bool] = None) -> Table
Create a Table in the database.
Parameters:
name
(str
) –The name of the table.
data
(Optional[DATA]
, default:None
) –User must provide at least one of data or schema. Acceptable types are:
list-of-dict
pandas.DataFrame
pyarrow.Table or pyarrow.RecordBatch
schema
(Optional[Union[Schema,LanceModel]]
, default:None
) –Acceptable types are:
pyarrow.Schema
mode
(str
, default:'create'
) –The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".
exist_ok
(bool
, default:False
) –If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.
on_bad_vectors
(str
, default:'error'
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
fill_value
(float
, default:0.0
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
data_storage_version
(Optional[str]
, default:None
) –Deprecated. Set storage_options when connecting to the database and set new_table_data_storage_version in the options.
enable_v2_manifest_paths
(Optional[bool]
, default:None
) –Deprecated. Set storage_options when connecting to the database and set new_table_enable_v2_manifest_paths in the options.
Returns:
LanceTable
–A reference to the newly created table.
!!! note
–The vector index won't be created by default. To create the index, call the create_index method on the table.
Examples:
Can create with list of tuples or dictionaries:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...         {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
>>> db.create_table("my_table", data)
LanceTable(name='my_table', version=1, ...)
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
You can also pass a pandas DataFrame:
>>> import pandas as pd
>>> data = pd.DataFrame({
...     "vector": [[1.1, 1.2], [0.2, 1.8]],
...     "lat": [45.5, 40.1],
...     "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(name='table2', version=1, ...)
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("lat", pa.float32()),
...     pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema=custom_schema)
LanceTable(name='table3', version=1, ...)
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
It is also possible to create a table from [Iterable[pa.RecordBatch]]:
>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                          pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(name='table4', version=1, ...)
Source code in lancedb/db.py
open_table
open_table(name: str, *, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> Table
Open a Lance Table in the database.
Parameters:
name
(str
) –The name of the table.
index_cache_size
(Optional[int]
, default:None
) –Set the size of the index cache, specified as a number of entries
The exact meaning of an "entry" will depend on the type of index:
* IVF - there is one entry for each IVF partition
* BTREE - there is one entry for the entire index
This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
Returns:
A LanceTable object representing the table.
–
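For example, a minimal sketch (the table name is illustrative):
table = db.open_table("my_table")
print(table.schema)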
Source code in lancedb/db.py
drop_table
rename_table
Rename a table in the database.
Parameters:
cur_name
(str
) –The current name of the table.
new_name
(str
) –The new name of the table.
drop_database
Tables (Synchronous)
lancedb.table.Table
Bases: ABC
A Table is a collection of Records in a LanceDB Database.
Examples:
Create using DBConnection.create_table (more examples in that method's documentation).
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]
Can append new data with Table.add().
Can query the table with Table.search.
>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13
Search queries are much faster when an index is created. See Table.create_index.
Source code in lancedb/table.py
tags abstractmethod
property
Tag management for the table.
Similar to Git, tags are a way to add metadata to a specific version of the table.
.. warning::
Tagged versions are exempted from the :py:meth:`cleanup_old_versions()` process. To remove a version that has been tagged, you must first :py:meth:`~Tags.delete` the associated tag.
Examples:
.. code-block:: python
table = db.open_table("my_table")
table.tags.create("v2-prod-20250203", 10)
tags = table.tags.list()
embedding_functions abstractmethod
property
embedding_functions: Dict[str, EmbeddingFunctionConfig]
Get a mapping from vector column name to its configured embedding function.
__len__
count_rows abstractmethod
Count the number of rows in the table.
Parameters:
filter
(Optional[str]
, default:None
) –A SQL where clause to filter the rows to count.
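A minimal sketch, reusing the table from the examples above (the filter is illustrative):
total = table.count_rows()
matching = table.count_rows(filter="b > 1")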
to_pandas
to_arrow abstractmethod
to_arrow() -> Table
create_index
create_index(metric='l2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None, *, index_type: VectorIndexType = 'IVF_PQ', wait_timeout: Optional[timedelta] = None, num_bits: int = 8, max_iterations: int = 50, sample_rate: int = 256, m: int = 20, ef_construction: int = 300)
Create an index on the table.
Parameters:
metric
–The distance metric to use when creating the index. Valid values are "l2", "cosine", "dot", or "hamming". l2 is euclidean distance. Hamming is available only for binary vectors.
num_partitions
–The number of IVF partitions to use when creating the index. Default is 256.
num_sub_vectors
–The number of PQ sub-vectors to use when creating the index. Default is 96.
vector_column_name
(str
, default:VECTOR_COLUMN_NAME
) –The vector column name to create the index.
replace
(bool
, default:True
) –If True, replace the existing index if it exists.
If False, raise an error if a duplicate index exists.
accelerator
(Optional[str]
, default:None
) –If set, use the given accelerator to create the index. Only "cuda" is supported for now.
index_cache_size
(int
, default:None
) –The size of the index cache in number of entries. Default value is 256.
num_bits
(int
, default:8
) –The number of bits to encode sub-vectors. Only used with the IVF_PQ index. Only 4 and 8 are supported.
wait_timeout
(Optional[timedelta]
, default:None
) –The timeout to wait if indexing is asynchronous.
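A minimal sketch of typical usage (parameter values are illustrative, not recommendations):
table.create_index(
    metric="cosine",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name="vector",
)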
Source code in lancedb/table.py
drop_index
Drop an index from the table.
Parameters:
name
(str
) –The name of the index to drop.
Notes
This does not delete the index from disk, it just removes it from the table. To delete the index, run optimize after dropping the index.
Use list_indices to find the names of the indices.
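A minimal sketch (the index name is hypothetical; use list_indices to discover real names):
table.drop_index("my_col_idx")  # hypothetical index name
table.optimize()  # actually reclaims the index data on disk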
Source code in lancedb/table.py
wait_for_index
Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.
Parameters:
index_names
(Iterable[str]
) –The names of the indices to poll.
timeout
(timedelta
, default:timedelta(seconds=300)
) –Timeout to wait for asynchronous indexing. The default is 5 minutes.
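A minimal sketch (the index name is hypothetical):
from datetime import timedelta

table.wait_for_index(["my_col_idx"], timeout=timedelta(minutes=10))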
Source code in lancedb/table.py
stats abstractmethod
create_scalar_index abstractmethod
create_scalar_index(column: str, *, replace: bool = True, index_type: ScalarIndexType = 'BTREE', wait_timeout: Optional[timedelta] = None)
Create a scalar index on a column.
Parameters:
column
(str
) –The column to be indexed. Must be a boolean, integer, float, or string column.
replace
(bool
, default:True
) –Replace the existing index if it exists.
index_type
(ScalarIndexType
, default:'BTREE'
) –The type of index to create.
wait_timeout
(Optional[timedelta]
, default:None
) –The timeout to wait if indexing is asynchronous.
Examples:
Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:
>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> my_df = img_table.search().where("my_col = 7",
...     prefilter=True).to_pandas()
Scalar indices can also speed up scans containing a vector search and a prefilter:
>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> img_table.search([1, 2, 3, 4], vector_column_name="vector") \
...     .where("my_col != 7", prefilter=True) \
...     .to_pandas()
Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).
Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100).
Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.
Source code in lancedb/table.py
create_fts_index
create_fts_index(field_names: Union[str, List[str]], *, ordering_field_names: Optional[Union[str, List[str]]] = None, replace: bool = False, writer_heap_size: Optional[int] = 1024 * 1024 * 1024, use_tantivy: bool = False, tokenizer_name: Optional[str] = None, with_position: bool = False, base_tokenizer: BaseTokenizerType = 'simple', language: str = 'English', max_token_length: Optional[int] = 40, lower_case: bool = True, stem: bool = True, remove_stop_words: bool = True, ascii_folding: bool = True, ngram_min_length: int = 3, ngram_max_length: int = 3, prefix_only: bool = False, wait_timeout: Optional[timedelta] = None)
Create a full-text search index on the table.
Warning - this API is highly experimental and is highly likely to change in the future.
Parameters:
field_names
(Union[str,List[str]]
) –The name(s) of the field to index. Can only be a str if use_tantivy=True for now.
replace
(bool
, default:False
) –If True, replace the existing index if it exists. Note that this is not yet an atomic operation; the index will be temporarily unavailable while the new index is being created.
writer_heap_size
(Optional[int]
, default:1024 * 1024 * 1024
) –Only available with use_tantivy=True
ordering_field_names
(Optional[Union[str,List[str]]]
, default:None
) –A list of unsigned type fields to index to optionally order results on at search time. Only available with use_tantivy=True.
tokenizer_name
(Optional[str]
, default:None
) –The tokenizer to use for the index. Can be "raw", "default" or the 2-letter language code followed by "_stem". So for English it would be "en_stem". For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
use_tantivy
(bool
, default:False
) –If True, use the legacy full-text search implementation based on tantivy.If False, use the new full-text search implementation based on lance-index.
with_position
(bool
, default:False
) –Only available with use_tantivy=False. If False, do not store the positions of the terms in the text. This can reduce the size of the index and improve indexing speed. But it will raise an exception for phrase queries.
base_tokenizer
(str
, default:"simple"
) –The base tokenizer to use for tokenization. Options are:
- "simple": Splits text by whitespace and punctuation.
- "whitespace": Split text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
- "ngram": N-Gram tokenizer.
language
(str
, default:"English"
) –The language to use for tokenization.
max_token_length
(int
, default:40
) –The maximum token length to index. Tokens longer than this length will be ignored.
lower_case
(bool
, default:True
) –Whether to convert the token to lower case. This makes queries case-insensitive.
stem
(bool
, default:True
) –Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".
remove_stop_words
(bool
, default:True
) –Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".
ascii_folding
(bool
, default:True
) –Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".
ngram_min_length
(int
, default:3
) –The minimum length of an n-gram.
ngram_max_length
(int
, default:3
) –The maximum length of an n-gram.
prefix_only
(bool
, default:False
) –Whether to only index the prefix of the token for ngram tokenizer.
wait_timeout
(Optional[timedelta]
, default:None
) –The timeout to wait if indexing is asynchronous.
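A minimal sketch (the column name is illustrative; it matches the "caption" column used in the search example later in this section):
table.create_fts_index("caption", use_tantivy=False)
results = table.search("foo", query_type="fts").limit(5).to_pandas()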
Source code in lancedb/table.py
add abstractmethod
add(data: DATA, mode: AddMode = 'append', on_bad_vectors: OnBadVectorsType = 'error', fill_value: float = 0.0) -> AddResult
Add more data to the Table.
Parameters:
data
(DATA
) –The data to insert into the table. Acceptable types are:
list-of-dict
pandas.DataFrame
pyarrow.Table or pyarrow.RecordBatch
mode
(AddMode
, default:'append'
) –The mode to use when writing the data. Valid values are "append" and "overwrite".
on_bad_vectors
(OnBadVectorsType
, default:'error'
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
fill_value
(float
, default:0.0
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
Returns:
AddResult
–An object containing the new version number of the table after adding data.
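A minimal sketch, reusing the table from the examples above:
res = table.add([{"vector": [2.0, 2.0], "b": 8}])
print(res)  # AddResult with the new table version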
Source code in lancedb/table.py
merge_insert
merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder
Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation.
This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january").
The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.
"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data). "Not matched by source" records exist only in the target table (this is old data).
The builder returned by this method can be used to customize what should happen for each category of data.
Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.
Parameters:
on
(Union[str,Iterable[str]]
) –A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.
Examples:
>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> res = table.merge_insert("a") \
...     .when_matched_update_all() \
...     .when_not_matched_insert_all() \
...     .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code in lancedb/table.py
search abstractmethod
search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> LanceQueryBuilder
Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and [full-text search][experimental-full-text-search].
All query options are defined in LanceQueryBuilder.
Examples:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
...     {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
...     {"original_width": 2000, "caption": "foo", "vector": [0.5, 3.4, 1.3]},
...     {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
...     .where("original_width > 1000", prefilter=True)
...     .select(["caption", "original_width", "vector"])
...     .limit(2)
...     .to_pandas())
  caption  original_width           vector  _distance
0     foo            2000  [0.5, 3.4, 1.3]   5.220000
1    test            3000  [0.3, 6.2, 2.6]  23.089996
Parameters:
query
(Optional[Union[VEC,str, 'PIL.Image.Image',Tuple,FullTextQuery]]
, default:None
) –The targeted vector to search for.
default None. Acceptable types are: list, np.ndarray, PIL.Image.Image
If None then the select/where/limit clauses are applied to filter the table
vector_column_name
(Optional[str]
, default:None
) –The name of the vector column to search.
The vector column needs to be a pyarrow fixed size list type
If not specified then the vector column is inferred from the table schema
If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.
query_type
(QueryType
, default:'auto'
) –default "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto"
If "auto" then the query type is inferred from the query;
If query is a list/np.ndarray then the query type is "vector";
If query is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found.
If query is a string, then the query type is "vector" if the table has embedding functions, else the query type is "fts".
Returns:
LanceQueryBuilder
–A query builder object representing the query. Once executed, the query returns:
selected columns
the vector
and also the "_distance" column, which is the distance between the query vector and the returned vector.
Source code in lancedb/table.py
delete abstractmethod
Delete rows from the table.
This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).
Parameters:
where
(str
) –The SQL where clause to use when deleting rows.
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
The filter must not be empty, or it will error.
Returns:
DeleteResult
–An object containing the new version number of the table after deletion.
Examples:
>>> import lancedb
>>> data = [
...     {"x": 1, "vector": [1.0, 2]},
...     {"x": 2, "vector": [3.0, 4]},
...     {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
DeleteResult(version=2)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]
If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:
>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
DeleteResult(version=3)
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
Source code in lancedb/table.py
update abstractmethod
update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None) -> UpdateResult
This can be used to update zero to all rows depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.
Either values or values_sql must be provided. You cannot provide both.
Parameters:
where
(Optional[str]
, default:None
) –The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
values
(Optional[dict]
, default:None
) –The values to update. The keys are the column names and the values are the values to set.
values_sql
(Optional[Dict[str,str]]
, default:None
) –The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.
Returns:
UpdateResult
–- rows_updated: The number of rows that were updated
- version: The new version number of the table after the update
Examples:
>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10.0, 10]})
UpdateResult(rows_updated=1, version=2)
>>> table.to_pandas()
   x        vector
0  1    [1.0, 2.0]
1  3    [5.0, 6.0]
2  2  [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
UpdateResult(rows_updated=3, version=3)
>>> table.to_pandas()
   x        vector
0  2    [1.0, 2.0]
1  4    [5.0, 6.0]
2  3  [10.0, 10.0]
Source code in lancedb/table.py
cleanup_old_versions abstractmethod
cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> 'CleanupStats'
Clean up old versions of the table, freeing disk space.
Parameters:
older_than
(Optional[timedelta]
, default:None
) –The minimum age of the version to delete. If None, then this defaults to two weeks.
delete_unverified
(bool
, default:False
) –Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than older_than.
Returns:
CleanupStats
–The stats of the cleanup operation, including how many bytes were freed.
See Also
Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.
Notes
This function is not available in LanceDB Cloud (since LanceDB Cloud manages cleanup for you automatically)
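A minimal sketch (the age threshold is illustrative):
from datetime import timedelta

stats = table.cleanup_old_versions(older_than=timedelta(days=30))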
Source code in lancedb/table.py
compact_files abstractmethod
Run the compaction process on the table. This can be run after making several small appends to optimize the table for faster reads.
Arguments are passed on to Lance's [compact_files][lance.dataset.DatasetOptimizer.compact_files]. For most cases, the default should be fine.
See Also
Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.
Notes
This function is not available in LanceDB Cloud (since LanceDB Cloud manages compaction for you automatically)
Source code in lancedb/table.py
optimize abstractmethod
optimize(*, cleanup_older_than: Optional[timedelta] = None, delete_unverified: bool = False, retrain: bool = False)
Optimize the on-disk data and indices for better performance.
Modeled after VACUUM in PostgreSQL.
Optimization covers three operations:
- Compaction: Merges small files into larger ones
- Prune: Removes old versions of the dataset
- Index: Optimizes the indices, adding new data to existing indices
Parameters:
cleanup_older_than
(Optional[timedelta]
, default:None
) –All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.
delete_unverified
(bool
, default:False
) –Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.
retrain
(bool
, default:False
) –If True, retrain the vector indices. This would refine the IVF clustering and quantization, which may improve the search accuracy. It's faster than re-creating the index from scratch, so it's recommended to try this first when the data distribution has changed significantly.
Experimental API
The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.
That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.
The frequency with which an application should call optimize is based on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
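A minimal sketch (the retention window is illustrative):
from datetime import timedelta

table.optimize(cleanup_older_than=timedelta(days=7))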
Source code in lancedb/table.py
list_indices abstractmethod
index_stats abstractmethod
Retrieve statistics about an index
Parameters:
index_name
(str
) –The name of the index to retrieve statistics for
Returns:
IndexStatistics or None
–The statistics about the index. Returns None if the index does not exist.
Source code in lancedb/table.py
add_columns abstractmethod
Add new columns with defined values.
Parameters:
transforms
(Dict[str,str] |Field |List[Field] |Schema
) –A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, a pyarrow Field or Schema can be provided to add new columns with the specified data types. The new columns will be initialized with null values.
Returns:
AddColumnsResult
–version: the new version number of the table after adding columns.
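A minimal sketch (the column names and expressions are illustrative):
import pyarrow as pa

table.add_columns({"double_x": "x * 2"})  # computed per row from an SQL expression
table.add_columns(pa.field("note", pa.string()))  # new null-initialized typed column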
Source code in lancedb/table.py
alter_columns abstractmethod
Alter column names and nullability.
Parameters:
alterations
(Iterable[Dict[str,Any]]
, default:()
) –A sequence of dictionaries, each with the following keys:
- "path": str. The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".
- "rename": str, optional. The new name of the column. If not specified, the column name is not changed.
- "data_type": pyarrow.DataType, optional. The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.
- "nullable": bool, optional. Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.
Returns:
AlterColumnsResult
–version: the new version number of the table after the alteration.
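A minimal sketch (the column, new name, and target type are illustrative):
import pyarrow as pa

table.alter_columns({"path": "x", "rename": "x2", "data_type": pa.int32(), "nullable": True})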
Source code in lancedb/table.py
drop_columns abstractmethod
Drop columns from the table.
Parameters:
columns
(Iterable[str]
) –The names of the columns to drop.
Returns:
DropColumnsResult
–version: the new version number of the table dropping the columns.
Source code in lancedb/table.py
checkout abstractmethod
Checks out a specific version of the Table.
Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.
This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the [Self::restore] method.
Any operation that modifies the table will fail while the table is in a checked out state.
Parameters:
version
(Union[int,str]
) –The version to check out. A version number (int) or a tag (str) can be provided.
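A minimal sketch of the version workflow (the version numbers are illustrative):
table.checkout(1)        # read-only view of version 1
table.checkout_latest()  # return to the latest version
table.restore(1)         # new version whose data matches version 1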
Source code in lancedb/table.py
checkout_latest abstractmethod
Ensures the table is pointing at the latest version.
This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a [Self::checkout] operation.
Source code in lancedb/table.py
restore abstractmethod
Restore a version of the table. This is an in-place operation.
This creates a new version where the data is equivalent to the specified previous version. Data is not copied (as of python-v0.2.1).
Parameters:
version
(int orstr
, default:None
) –The version number or version tag to restore. If unspecified then restores the currently checked out version. If the currently checked out version is the latest version then this is a no-op.
Source code in lancedb/table.py
list_versions abstractmethod
uses_v2_manifest_paths abstractmethod
Check if the table is using the new v2 manifest paths.
Returns:
bool
–True if the table is using the new v2 manifest paths, False otherwise.
migrate_v2_manifest_paths abstractmethod
Migrate the manifest paths to the new format.
This will update the manifest to use the new v2 format for paths.
This function is idempotent, and can be run multiple times without changing the state of the object store.
Danger
This should not be run while other concurrent operations are happening. And it should also run until completion before resuming other operations.
You can use Table.uses_v2_manifest_paths to check if the table is already using the new path style.
Source code in lancedb/table.py
Querying (Synchronous)
lancedb.query.Query
Bases: BaseModel
A LanceDB Query
Queries are constructed by the Table.search method. This class is a python representation of the query. Normally you will not need to interact with this class directly. You can build up a query and execute it using collection methods such as to_batches(), to_arrow(), to_pandas(), etc.
However, you can use the to_query() method to get the underlying query object. This can be useful for serializing a query or using it in a different context.
Attributes:
filter
(Optional[str]
) –sql filter to refine the query with
limit
(Optional[int]
) –The limit on the number of results to return. If this is a vector or FTS query, then this is required. If this is a plain SQL query, then this is optional.
offset
(Optional[int]
) –The offset to start fetching results from
This is ignored for vector / FTS search (will be None).
columns
(Optional[Union[List[str],Dict[str,str]]]
) –which columns to return in the results
This can be a list of column names or a dictionary. If it is a dictionary, then the keys are the column names and the values are sql expressions to use to calculate the result.
If this is None then all columns are returned. This can be expensive.
with_row_id
(Optional[bool]
) –if True then include the row id in the results
vector
(Optional[Union[List[float],List[List[float]],Array,List[Array]]]
) –the vector to search for, if this is a vector search or hybrid search. It will be None for full text search and plain SQL filtering.
vector_column
(Optional[str]
) –the name of the vector column to use for vector search
If this is None then a default vector column will be used.
distance_type
(Optional[str]
) –the distance type to use for vector search
This can be l2 (default), cosine and dot. See metric definitions for more details.
If this is not a vector search this will be None.
postfilter
(bool
) –if True then apply the filter after vector / FTS search. This is ignored for plain SQL filtering.
nprobes
(Optional[int]
) –The number of IVF partitions to search. If this is None then a default number of partitions will be used.
A higher number makes search more accurate but also slower.
See discussion in Querying an ANN Index for tuning advice.
Will be None if this is not a vector search.
refine_factor
(Optional[int]
) –Refine the results by reading extra elements and re-ranking them in memory.
A higher number makes search more accurate but also slower.
See discussion in Querying an ANN Index for tuning advice.
Will be None if this is not a vector search.
lower_bound
(Optional[float]
) –The lower bound for distance search
Only results with a distance greater than or equal to this value will be returned.
This will only be set on vector search.
upper_bound
(Optional[float]
) –The upper bound for distance search
Only results with a distance less than or equal to this value will be returned.
This will only be set on vector search.
ef
(Optional[int]
) –The size of the nearest neighbor list maintained during HNSW search
This will only be set on vector search.
full_text_query
(Optional[Union[str,dict]]
) –The full text search query
This can be a string or a dictionary. A dictionary will be used to search multiple columns. The keys are the column names and the values are the search queries.
This will only be set on FTS or hybrid queries.
fast_search
(Optional[bool]
) –Skip a flat search of unindexed data. This will improve search performance but search results will not include unindexed data.
The default is False
Source code in lancedb/query.py
lancedb.query.LanceQueryBuilder
Bases: ABC
An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.
Source code in lancedb/query.py
create classmethod
create(table: 'Table', query: Optional[Union[ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None, fast_search: bool = None) -> Self
Create a query builder based on the given query and query type.
Parameters:
table
('Table'
) –The table to query.
query
(Optional[Union[ndarray,str, 'PIL.Image.Image',Tuple]]
) –The query to use. If None, an empty query builder is returned which performs simple SQL filtering.
query_type
(str
) –The type of query to perform. One of "vector", "fts", "hybrid", or "auto". If "auto", the query type is inferred based on the query.
vector_column_name
(str
) –The name of the vector column to use for vector search.
fast_search
(bool
, default:None
) –Skip flat search of unindexed data.
Source code in lancedb/query.py
to_df
Deprecated alias for to_pandas(). Please use to_pandas() instead.
Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.
Source code in lancedb/query.py
to_pandas
to_pandas(flatten: Optional[Union[int, bool]] = None, *, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'
Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.
Parameters:
flatten
(Optional[Union[int,bool]]
, default:None
) –If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Source code in lancedb/query.py
to_arrow abstractmethod
to_arrow(*, timeout: Optional[timedelta] = None) -> Table
Execute the query and return the results as an Apache Arrow Table.
In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.
Parameters:
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Source code in lancedb/query.py
to_batches abstractmethod
to_batches(batch_size: Optional[int] = None, *, timeout: Optional[timedelta] = None) -> RecordBatchReader
Execute the query and return the results as a pyarrow RecordBatchReader.
Parameters:
batch_size
(Optional[int]
, default:None
) –The maximum number of selected records in a RecordBatch object.
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Source code in lancedb/query.py
to_list
Execute the query and return the results as a list of dictionaries.
Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.
Parameters:
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Source code in lancedb/query.py
to_pydantic
to_pydantic(model: Type[LanceModel], *, timeout: Optional[timedelta] = None) -> List[LanceModel]
Return the table as a list of pydantic models.
Parameters:
model
(Type[LanceModel]
) –The pydantic model to use.
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Returns:
List[LanceModel]
–The query results, one model instance per row.
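A minimal sketch, assuming a table whose schema matches the model (the model fields are illustrative):
from lancedb.pydantic import LanceModel, Vector

class Item(LanceModel):
    vector: Vector(2)
    b: int

items = table.search([0.4, 0.4]).limit(2).to_pydantic(Item)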
Source code in lancedb/query.py
to_polars
Execute the query and return the results as a Polars DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.
Parameters:
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete. If None, wait indefinitely.
Source code in lancedb/query.py
limit
Set the maximum number of results to return.
Parameters:
limit
(Union[int, None]
) –The maximum number of results to return. The default query limit is 10 results. For ANN/KNN queries, you must specify a limit. For plain searches, all records are returned if limit is not set. WARNING: if you have a large dataset, setting the limit to a large number, e.g. the table size, can potentially result in reading a large amount of data into memory and cause out of memory issues.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
offset
Set the offset for the results.
Parameters:
offset
(int
) –The offset to start fetching results from.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
select
select(columns: Union[list[str], dict[str, str]]) -> Self
Set the columns to return.
Parameters:
columns
(Union[list[str],dict[str,str]]
) –List of column names to be fetched.Or a dictionary of column names to SQL expressions.All columns are fetched if None or unspecified.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
where
Set the where clause.
Parameters:
where
(str
) –The where clause which is a valid SQL where clause. See Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down> for valid SQL expressions.
prefilter
(bool
, default:True
) –If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
with_row_id
Set whether to return row ids.
Parameters:
with_row_id
(bool
) –If True, return _rowid column in the results.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
explain_plan
Return the execution plan for this query.
Examples:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan)
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters:
verbose
(bool
, default:False
) –Use a verbose output format.
Returns:
plan
(str
) –The query execution plan.
Source code in lancedb/query.py
analyze_plan
Run the query and return its execution plan with runtime metrics.
This returns detailed metrics for each step, such as elapsed time, rows processed, bytes read, and I/O stats. It is useful for debugging and performance tuning.
Examples:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).analyze_plan()
>>> print(plan)
AnalyzeExec verbose=true, metrics=[]
  ProjectionExec: expr=[...], metrics=[...]
    GlobalLimitExec: skip=0, fetch=10, metrics=[...]
      FilterExec: _distance@2 IS NOT NULL, metrics=[output_rows=..., elapsed_compute=...]
        SortExec: TopK(fetch=10), expr=[...], preserve_partitioning=[...], metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]
          KNNVectorDistance: metric=l2, metrics=[output_rows=..., elapsed_compute=..., output_batches=...]
            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=..., elapsed_compute=..., bytes_read=..., iops=..., requests=...]
Returns:
plan
(str
) –The physical query execution plan with runtime metrics.
Source code in lancedb/query.py
vector
vector(vector: Union[ndarray, list]) -> Self
Set the vector to search for.
Parameters:
vector
(Union[ndarray,list]
) –The vector to search for.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
text
Set the text to search for.
Parameters:
text
(str |FullTextQuery
) –If a string, it is treated as a MatchQuery. If a FullTextQuery object, it is used directly.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code in lancedb/query.py
rerank abstractmethod
Rerank the results using the specified reranker.
Parameters:
reranker
(Reranker
) –The reranker to use.
Returns:
The LanceQueryBuilder object.
–
Source code in lancedb/query.py
lancedb.query.LanceVectorQueryBuilder
Bases: LanceQueryBuilder
Examples:
>>> import lancedb
>>> data = [{"vector": [1.1, 1.2], "b": 2},
...         {"vector": [0.5, 1.3], "b": 4},
...         {"vector": [0.4, 0.4], "b": 6},
...         {"vector": [0.4, 0.4], "b": 10}]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=data)
>>> (table.search([0.4, 0.4])
...     .distance_type("cosine")
...     .where("b < 10")
...     .select(["b", "vector"])
...     .limit(2)
...     .to_pandas())
   b      vector  _distance
0  6  [0.4, 0.4]   0.000000
1  2  [1.1, 1.2]   0.000944
Source code in lancedb/query.py
metric
metric(metric:Literal['l2','cosine','dot'])->LanceVectorQueryBuilder
Set the distance metric to use.
This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.
Parameters:
metric
(Literal['l2', 'cosine', 'dot']
) –The distance metric to use. By default "l2" is used.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
distance_type
Set the distance metric to use.
When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use.
Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.
Parameters:
distance_type
(Literal['l2', 'cosine', 'dot']
) –The distance metric to use. By default "l2" is used.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
nprobes
nprobes(nprobes:int)->LanceVectorQueryBuilder
Set the number of probes to use.
Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.
See discussion in Querying an ANN Index for tuning advice.
This method sets both the minimum and maximum number of probes to the same value. See minimum_nprobes and maximum_nprobes for more fine-grained control.
Parameters:
nprobes
(int
) –The number of probes to use.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
minimum_nprobes
minimum_nprobes(minimum_nprobes:int)->LanceVectorQueryBuilder
Set the minimum number of probes to use.
See nprobes for more details.
These partitions will be searched on every vector query and will increase recall at the expense of latency.
Source code inlancedb/query.py
maximum_nprobes
maximum_nprobes(maximum_nprobes:int)->LanceVectorQueryBuilder
Set the maximum number of probes to use.
See nprobes for more details.
If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.
This can be useful when there is a narrow filter, to allow these queries to spend more time searching and avoid potential false negatives.
If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.
Source code inlancedb/query.py
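A sketch of the adaptive probing this enables (the table, filter, and values are illustrative):

# Always search at least 20 partitions; search up to 100 only when the
# filter leaves too few results after the first pass.
results = (
    table.search([0.4, 0.4])
    .minimum_nprobes(20)
    .maximum_nprobes(100)
    .where("category = 'rare'", prefilter=True)
    .limit(10)
    .to_pandas()
)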
distance_range
distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceVectorQueryBuilder
Set the distance range to use.
Only rows with distances within range [lower_bound, upper_bound) will be returned.
Parameters:
lower_bound
(Optional[float]
, default:None
) –The lower bound of the distance range.
upper_bound
(Optional[float]
, default:None
) –The upper bound of the distance range.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
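For example, a sketch that keeps only near matches (the bounds are illustrative and depend on the distance metric in use):

# Keep only rows with 0.0 <= _distance < 0.5
results = (
    table.search([0.4, 0.4])
    .distance_range(lower_bound=0.0, upper_bound=0.5)
    .to_pandas()
)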
ef
ef(ef:int)->LanceVectorQueryBuilder
Set the number of candidates to consider during search.
Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.
This only applies to HNSW-related indices. The default value is 1.5 * limit.
Parameters:
ef
(int
) –The number of candidates to consider during search.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
refine_factor
refine_factor(refine_factor:int)->LanceVectorQueryBuilder
Set the refine factor to use, increasing the number of vectors sampled.
As an example, a refine factor of 2 samples 2x as many vectors as requested, re-ranks them, and returns the top half most relevant results.
See discussion in Querying an ANN Index for tuning advice.
Parameters:
refine_factor
(int
) –The refine factor to use.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
to_arrow
to_arrow(*,timeout:Optional[timedelta]=None)->Table
Execute the query and return the results as an Apache Arrow Table.
In addition to the selected columns, LanceDB also returns the vector and the "_distance" column, which is the distance between the query vector and the returned vectors.
Parameters:
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete.If None, wait indefinitely.
Source code inlancedb/query.py
to_query_object
to_query_object()->Query
Build a Query object
This can be used to serialize a query
Source code inlancedb/query.py
to_batches
to_batches(batch_size:Optional[int]=None,*,timeout:Optional[timedelta]=None)->RecordBatchReader
Execute the query and return the result as a RecordBatchReader object.
Parameters:
batch_size
(Optional[int]
, default:None
) –The maximum number of selected records in a RecordBatch object.
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete.If None, wait indefinitely.
Returns:
RecordBatchReader
–A reader over the resulting record batches.
Source code inlancedb/query.py
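A sketch of streaming results instead of materializing them all at once (the batch size is illustrative):

# Stream the results in batches of up to 1000 rows each
for batch in table.search([0.4, 0.4]).limit(100_000).to_batches(1000):
    print(batch.num_rows)  # each batch is a pyarrow.RecordBatch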
where
where(where:str,prefilter:bool=None)->LanceVectorQueryBuilder
Set the where clause.
Parameters:
where
(str
) –The where clause, which is a valid SQL where clause. See Lance filter pushdown (https://lancedb.github.io/lance/read_and_write.html#filter-push-down) for valid SQL expressions.
prefilter
(bool
, default:None
) –If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search.
Returns:
LanceQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
rerank
rerank(reranker:Reranker,query_string:Optional[str]=None)->LanceVectorQueryBuilder
Rerank the results using the specified reranker.
Parameters:
reranker
(Reranker
) –The reranker to use.
query_string
(Optional[str]
, default:None
) –The query to use for reranking. This needs to be specified explicitly here as the query used for vector search may already be vectorized and the reranker requires a string query. This is only required if the query used for vector search is not a string. Note: This doesn't yet support the case where the query is multimodal or a list of vectors.
Returns:
LanceVectorQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
bypass_vector_index
bypass_vector_index()->LanceVectorQueryBuilder
If this is called then any vector index is skipped.
An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
Returns:
LanceVectorQueryBuilder
–The LanceVectorQueryBuilder object.
Source code inlancedb/query.py
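A sketch of the recall measurement described above (the query vector and nprobes value are illustrative):

q = [0.4, 0.4]
# Ground truth from an exhaustive (flat) search
truth = set(
    table.search(q).bypass_vector_index().with_row_id(True)
    .limit(10).to_pandas()["_rowid"]
)
# Approximate results from the vector index
approx = set(
    table.search(q).nprobes(20).with_row_id(True)
    .limit(10).to_pandas()["_rowid"]
)
recall = len(truth & approx) / len(truth)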
lancedb.query.LanceFtsQueryBuilder
Bases:LanceQueryBuilder
A builder for full text search for LanceDB.
Source code inlancedb/query.py
phrase_query
phrase_query(phrase_query:bool=True)->LanceFtsQueryBuilder
Set whether to use phrase query.
Parameters:
phrase_query
(bool
, default:True
) –If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.
Returns:
LanceFtsQueryBuilder
–The LanceFtsQueryBuilder object.
Source code inlancedb/query.py
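For example, a sketch of a phrase search (this assumes the table already has a full text search index on the relevant column):

# Match the exact phrase rather than the individual terms
results = (
    table.search("the quick brown fox", query_type="fts")
    .phrase_query(True)
    .limit(10)
    .to_pandas()
)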
rerank
rerank(reranker:Reranker)->LanceFtsQueryBuilder
Rerank the results using the specified reranker.
Parameters:
reranker
(Reranker
) –The reranker to use.
Returns:
LanceFtsQueryBuilder
–The LanceQueryBuilder object.
Source code inlancedb/query.py
lancedb.query.LanceHybridQueryBuilder
Bases:LanceQueryBuilder
A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses the reciprocal rank fusion score for reranking.
To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.
Source code inlancedb/query.py
phrase_query
phrase_query(phrase_query:bool=None)->LanceHybridQueryBuilder
Set whether to use phrase query.
Parameters:
phrase_query
(bool
, default:None
) –If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
rerank
rerank(reranker:Reranker=RRFReranker(),normalize:str='score')->LanceHybridQueryBuilder
Rerank the hybrid search results using the specified reranker. The reranker must be an instance of the Reranker class.
Parameters:
reranker
(Reranker
, default:RRFReranker()
) –The reranker to use. Must be an instance of Reranker class.
normalize
(str
, default:'score'
) –The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
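A sketch of reranking a hybrid query with rank-based normalization (this assumes the table was created with an embedding function and has an FTS index):

from lancedb.rerankers import RRFReranker

results = (
    table.search("cute puppy", query_type="hybrid")
    .rerank(reranker=RRFReranker(), normalize="rank")
    .limit(10)
    .to_pandas()
)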
nprobes
nprobes(nprobes:int)->LanceHybridQueryBuilder
Set the number of probes to use for vector search.
Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.
Parameters:
nprobes
(int
) –The number of probes to use.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
minimum_nprobes
minimum_nprobes(minimum_nprobes:int)->LanceHybridQueryBuilder
Set the minimum number of probes to use.
See nprobes for more details.
maximum_nprobes
maximum_nprobes(maximum_nprobes:int)->LanceHybridQueryBuilder
Set the maximum number of probes to use.
See nprobes for more details.
distance_range
distance_range(lower_bound:Optional[float]=None,upper_bound:Optional[float]=None)->LanceHybridQueryBuilder
Set the distance range to use.
Only rows with distances within range [lower_bound, upper_bound) will be returned.
Parameters:
lower_bound
(Optional[float]
, default:None
) –The lower bound of the distance range.
upper_bound
(Optional[float]
, default:None
) –The upper bound of the distance range.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
ef
ef(ef:int)->LanceHybridQueryBuilder
Set the number of candidates to consider during search.
Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.
This only applies to HNSW-related indices. The default value is 1.5 * limit.
Parameters:
ef
(int
) –The number of candidates to consider during search.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
metric
metric(metric:Literal['l2','cosine','dot'])->LanceHybridQueryBuilder
Set the distance metric to use.
This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.
Parameters:
metric
(Literal['l2', 'cosine', 'dot']
) –The distance metric to use. By default "l2" is used.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code in lancedb/query.py
distance_type
Set the distance metric to use.
When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use.
Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.
Parameters:
distance_type
(Literal['l2', 'cosine', 'dot']
) –The distance metric to use. By default "l2" is used.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
refine_factor
refine_factor(refine_factor:int)->LanceHybridQueryBuilder
Refine the vector search results by reading extra elements and re-ranking them in memory.
Parameters:
refine_factor
(int
) –The refine factor to use.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
bypass_vector_index
bypass_vector_index()->LanceHybridQueryBuilder
If this is called then any vector index is skipped.
An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
Returns:
LanceHybridQueryBuilder
–The LanceHybridQueryBuilder object.
Source code inlancedb/query.py
explain_plan
Return the execution plan for this query.
Examples:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan)
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters:
verbose
(bool
, default:False
) –Use a verbose output format.
Returns:
plan
(str
) –A string representation of the query execution plan.
Source code in lancedb/query.py
analyze_plan
Run the query and return its execution plan with runtime metrics.
Returns:
plan
(str
) –The physical query execution plan with runtime metrics.
Source code inlancedb/query.py
Embeddings
lancedb.embeddings.registry.EmbeddingFunctionRegistry
This is a singleton class used to register embedding functions and fetch them by name. It also handles serializing and deserializing. You can implement your own embedding function by subclassing EmbeddingFunction or TextEmbeddingFunction and registering it with the registry.
NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array, pa.ChunkedArray, np.ndarray]
Examples:
>>> registry = EmbeddingFunctionRegistry.get_instance()
>>> @registry.register("my-embedding-function")
... class MyEmbeddingFunction(EmbeddingFunction):
...     def ndims(self) -> int:
...         return 128
...
...     def compute_query_embeddings(self, query: str, *args, **kwargs):
...         return self.compute_source_embeddings(query, *args, **kwargs)
...
...     def compute_source_embeddings(self, texts, *args, **kwargs):
...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]
...
>>> registry.get("my-embedding-function")
<class 'lancedb.embeddings.registry.MyEmbeddingFunction'>
Source code inlancedb/embeddings/registry.py
register
This creates a decorator that can be used to register an EmbeddingFunction.
Parameters:
alias
(Optional[str]
, default:None
) –A human-friendly name for the embedding function. If not provided, the class name will be used.
Source code inlancedb/embeddings/registry.py
reset
get
Fetch an embedding function class by name
Parameters:
name
(str
) –The name of the embedding function to fetch. Either the alias or the class name if no alias was provided during registration.
Source code inlancedb/embeddings/registry.py
parse_functions
parse_functions(metadata:Optional[Dict[bytes,bytes]])->Dict[str,EmbeddingFunctionConfig]
Parse the metadata from an arrow table and return a mapping of the vector column to the embedding function and source column
Parameters:
metadata
(Optional[Dict[bytes,bytes]]
) –The metadata from an arrow table. Note that the keys and values are bytes (pyarrow api)
Returns:
functions
(dict
) –A mapping of vector column name to embedding function. An empty dict is returned if input is None or does not contain b"embedding_functions".
Source code inlancedb/embeddings/registry.py
function_to_metadata
function_to_metadata(conf:EmbeddingFunctionConfig)
Convert the given embedding function and source / vector column configs into a config dictionary that can be serialized into arrow metadata
Source code in lancedb/embeddings/registry.py
get_table_metadata
Convert a list of embedding functions and source / vector configs into a config dictionary that can be serialized into arrow metadata
Source code in lancedb/embeddings/registry.py
set_var
Set a variable. These can be accessed in embedding configuration using the syntax $var:variable_name. If they are not set, an error will be thrown letting you know which variable is missing. If you want to supply a default value, you can add an additional part in the configuration like so: $var:variable_name:default_value. Default values can be used for runtime configurations that are not sensitive, such as whether to use a GPU for inference.
The name must not contain a colon. Default values can contain colons.
Source code inlancedb/embeddings/registry.py
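A sketch of supplying a variable at runtime and consuming it from a config with a default fallback (the variable name, value, and embedding function arguments are illustrative):

registry = EmbeddingFunctionRegistry.get_instance()
registry.set_var("device", "cuda")

# "$var:device:cpu" resolves to the value set above, or to "cpu"
# if the variable was never set.
func = registry.get("sentence-transformers").create(device="$var:device:cpu")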
lancedb.embeddings.base.EmbeddingFunctionConfig
Bases:BaseModel
This model encapsulates the configuration for an embedding function in a LanceDB table. It holds the embedding function, the source column, and the vector column
Source code in lancedb/embeddings/base.py
lancedb.embeddings.base.EmbeddingFunction
Bases:BaseModel
,ABC
An ABC for embedding functions.
All concrete embedding functions must implement the following methods:
1. compute_query_embeddings(), which takes a query and returns a list of embeddings
2. compute_source_embeddings(), which returns a list of embeddings for the source column. For text data, the two will be the same. For multi-modal data, the source column might be images and the vector column might be text.
3. ndims(), which returns the number of dimensions of the vector column
Source code inlancedb/embeddings/base.py
create (classmethod)
Create an instance of the embedding function
__resolveVariables (classmethod)
Resolve variables in the args
Source code in lancedb/embeddings/base.py
sensitive_keys (staticmethod)
Return a list of keys that are sensitive and should not be allowed to be set to hardcoded values in the config. For example, API keys.
compute_query_embeddings (abstractmethod)
compute_query_embeddings(*args,**kwargs)->list[Union[array,None]]
Compute the embeddings for a given user query
Returns:
–A list of embeddings for each input. The embedding of an input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
compute_source_embeddings (abstractmethod)
compute_source_embeddings(*args,**kwargs)->list[Union[array,None]]
Compute the embeddings for the source column in the database
Returns:
–A list of embeddings for each input. The embedding of an input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
compute_query_embeddings_with_retry
compute_query_embeddings_with_retry(*args,**kwargs)->list[Union[array,None]]
Compute the embeddings for a given user query with retries
Returns:
–A list of embeddings for each input. The embedding of an input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
compute_source_embeddings_with_retry
compute_source_embeddings_with_retry(*args,**kwargs)->list[Union[array,None]]
Compute the embeddings for the source column in the database with retries.
Returns:
–A list of embeddings for each input. The embedding of an input can be None when the embedding is not valid.
Source code inlancedb/embeddings/base.py
sanitize_input
Sanitize the input to the embedding function.
Source code inlancedb/embeddings/base.py
ndims (abstractmethod)
Return the number of dimensions of the vector column.
SourceField
Creates a pydantic Field that can automatically annotate the source column for this embedding function
VectorField
Creates a pydantic Field that can automatically annotate the target vector column for this embedding function
lancedb.embeddings.base.TextEmbeddingFunction
Bases:EmbeddingFunction
A callable ABC for embedding functions that take text as input
Source code inlancedb/embeddings/base.py
generate_embeddings (abstractmethod)
generate_embeddings(texts:Union[List[str],ndarray],*args,**kwargs)->list[Union[array,None]]
lancedb.embeddings.sentence_transformers.SentenceTransformerEmbeddings
Bases:TextEmbeddingFunction
An embedding function that uses the sentence-transformers library
https://huggingface.co/sentence-transformers
Parameters:
name
–The name of the model to use.
device
–The device to use for the model
normalize
–Whether to normalize the embeddings
trust_remote_code
–Whether to trust the remote code
Source code inlancedb/embeddings/sentence_transformers.py
embedding_model (property)
Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.
generate_embeddings
Get the embeddings for the given texts
Parameters:
texts
(Union[List[str],ndarray]
) –The texts to embed
Source code inlancedb/embeddings/sentence_transformers.py
get_embedding_model
Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.
TODO: use lru_cache instead with a reasonable/configurable maxsize
Source code inlancedb/embeddings/sentence_transformers.py
lancedb.embeddings.openai.OpenAIEmbeddings
Bases:TextEmbeddingFunction
An embedding function that uses the OpenAI API
https://platform.openai.com/docs/guides/embeddings
This can also be used for open source models that are compatible with the OpenAI API.
Notes
If you're running an Ollama server locally, you can just override the base_url parameter and provide the Ollama embedding model you want to use (https://ollama.com/library):
Source code inlancedb/embeddings/openai.py
generate_embeddings
Get the embeddings for the given texts
Parameters:
texts
(Union[List[str],ndarray]
) –The texts to embed
Source code inlancedb/embeddings/openai.py
lancedb.embeddings.open_clip.OpenClipEmbeddings
Bases:EmbeddingFunction
An embedding function that uses the OpenClip API for multi-modal text-to-image search
https://github.com/mlfoundations/open_clip
Source code inlancedb/embeddings/open_clip.py
compute_query_embeddings
Compute the embeddings for a given user query
Parameters:
query
(Union[str,Image]
) –The query to embed. A query can be either text or an image.
Source code inlancedb/embeddings/open_clip.py
sanitize_input
Sanitize the input to the embedding function.
Source code inlancedb/embeddings/open_clip.py
compute_source_embeddings
Get the embeddings for the given images
Source code inlancedb/embeddings/open_clip.py
generate_image_embedding
Generate the embedding for a single image
Parameters:
image
(Union[str,bytes,Image]
) –The image to embed. If the image is a str, it is treated as a uri. If the image is bytes, it is treated as the raw image bytes.
Source code inlancedb/embeddings/open_clip.py
Context
lancedb.context.contextualize
contextualize(raw_df:'pd.DataFrame')->Contextualizer
Create a Contextualizer object for the given DataFrame.
Used to create context windows. Context windows are rolling subsets of text data.
The input text column should already be separated into rows that will be the unit of the window. So to create a context window over tokens, start with a DataFrame with one token per row. To create a context window over sentences, start with a DataFrame with one sentence per row.
Examples:
>>> from lancedb.context import contextualize
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
...               'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
...     'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
... })
window determines how many rows to include in each window. In our case this is how many tokens, but depending on the input data, it could be sentences, paragraphs, messages, etc.
>>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()
                token  document_id
0     The quick brown            1
1     quick brown fox            1
2    brown fox jumped            1
3     fox jumped over            1
4     jumped over the            1
5       over the lazy            1
6        the lazy dog            1
7          lazy dog I            1
8          dog I love            1
9   I love sandwiches            2
10    love sandwiches            2
>>> (contextualize(data).window(7).stride(1).min_window_size(7)
...     .text_col('token').to_pandas())
                                  token  document_id
0   The quick brown fox jumped over the            1
1  quick brown fox jumped over the lazy            1
2    brown fox jumped over the lazy dog            1
3        fox jumped over the lazy dog I            1
4       jumped over the lazy dog I love            1
5   over the lazy dog I love sandwiches            1
stride determines how many rows to skip between each window start. This can be used to reduce the total number of windows generated.
>>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()
                    token  document_id
0     The quick brown fox            1
2   brown fox jumped over            1
4    jumped over the lazy            1
6          the lazy dog I            1
8   dog I love sandwiches            1
10        love sandwiches            2
groupby determines how to group the rows. For example, we would like to have context windows that don't cross document boundaries. In this case, we can pass document_id as the group by.
>>> (contextualize(data)
...     .window(4).stride(2).text_col('token').groupby('document_id')
...     .to_pandas())
                   token  document_id
0    The quick brown fox            1
2  brown fox jumped over            1
4   jumped over the lazy            1
6           the lazy dog            1
9      I love sandwiches            2
min_window_size determines the minimum size of the context windows that are generated. This can be used to trim the last few context windows which have size less than min_window_size. By default, context windows of size 1 are skipped.
>>> (contextualize(data)
...     .window(6).stride(3).text_col('token').groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1
6                     the lazy dog            1
9                I love sandwiches            2
>>> (contextualize(data)
...     .window(6).stride(3).min_window_size(4).text_col('token')
...     .groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1
Source code inlancedb/context.py
lancedb.context.Contextualizer
Create context windows from a DataFrame. See lancedb.context.contextualize.
Source code inlancedb/context.py
window
window(window:int)->Contextualizer
Set the window size. i.e., how many rows to include in each window.
Parameters:
window
(int
) –The window size.
stride
stride(stride:int)->Contextualizer
Set the stride. i.e., how many rows to skip between each window.
Parameters:
stride
(int
) –The stride.
groupby
groupby(groupby:str)->Contextualizer
Set the groupby column, i.e., how to group the rows. Windows don't cross groups.
Parameters:
groupby
(str
) –The groupby column.
text_col
text_col(text_col:str)->Contextualizer
Set the text column used to make the context window.
Parameters:
text_col
(str
) –The text column.
min_window_size
min_window_size(min_window_size:int)->Contextualizer
Set the (optional) minimum window size for the context window.
Parameters:
min_window_size
(int
) –The min_window_size.
Source code inlancedb/context.py
to_pandas
Create the context windows and return a DataFrame.
Source code inlancedb/context.py
Full text search
lancedb.fts.create_index
create_index(index_path:str,text_fields:List[str],ordering_fields:Optional[List[str]]=None,tokenizer_name:str='default')->Index
Create a new Index (not populated)
Parameters:
index_path
(str
) –Path to the index directory
text_fields
(List[str]
) –List of text fields to index
ordering_fields
(Optional[List[str]]
, default:None
) –List of unsigned type fields to order by at search time
tokenizer_name
(str
, default:"default"
) –The tokenizer to use
Returns:
index
(Index
) –The index object (not yet populated)
Source code inlancedb/fts.py
lancedb.fts.populate_index
populate_index(index:Index,table:LanceTable,fields:List[str],writer_heap_size:Optional[int]=None,ordering_fields:Optional[List[str]]=None)->int
Populate an index with data from a LanceTable
Parameters:
index
(Index
) –The index object
table
(LanceTable
) –The table to index
fields
(List[str]
) –List of fields to index
writer_heap_size
(int
, default:None
) –The writer heap size in bytes, defaults to 1GB
Returns:
int
–The number of rows indexed
Source code inlancedb/fts.py
lancedb.fts.search_index
search_index(index:Index,query:str,limit:int=10,ordering_field=None)->Tuple[Tuple[int],Tuple[float]]
Search an index for a query
Parameters:
index
(Index
) –The index object
query
(str
) –The query string
limit
(int
, default:10
) –The maximum number of results to return
Returns:
ids_and_score
(list[tuple[int],tuple[float]]
) –A tuple of two tuples, the first containing the document ids and the second containing the scores
Source code inlancedb/fts.py
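A sketch tying the three helpers together (the index path and column name are illustrative; these legacy helpers also require the tantivy package):

import lancedb
from lancedb.fts import create_index, populate_index, search_index

db = lancedb.connect("./.lancedb")
table = db.open_table("my_table")  # assumed to have a "text" column

index = create_index("./fts_index", ["text"])
num_rows = populate_index(index, table, ["text"])
ids, scores = search_index(index, "puppy", limit=10)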
Utilities
lancedb.schema.vector
A helper function to create a vector type.
Parameters:
Returns:
–A PyArrow DataType for vectors.
Examples:
>>> import pyarrow as pa
>>> import lancedb
>>> schema = pa.schema([
...     pa.field("id", pa.int64()),
...     pa.field("vector", lancedb.vector(756)),
... ])
Source code inlancedb/schema.py
lancedb.merge.LanceMergeInsertBuilder
Bases:object
Builder for a LanceDB merge insert operation
See merge_insert for more context.
Source code inlancedb/merge.py
when_matched_update_all
when_matched_update_all(*,where:Optional[str]=None)->LanceMergeInsertBuilder
Rows that exist in both the source table (new data) and the target table (old data) will be updated, replacing the old row with the corresponding matching row.
If there are multiple matches then the behavior is undefined. Currently this causes multiple copies of the row to be created but that behavior is subject to change.
Source code in lancedb/merge.py
when_not_matched_insert_all
when_not_matched_insert_all()->LanceMergeInsertBuilder
Rows that exist only in the source table (new data) should be inserted into the target table.
when_not_matched_by_source_delete
when_not_matched_by_source_delete(condition:Optional[str]=None)->LanceMergeInsertBuilder
Rows that exist only in the target table (old data) will be deleted. An optional condition can be provided to limit what data is deleted.
Parameters:
condition
(Optional[str]
, default:None
) –If None then all such rows will be deleted. Otherwise the condition will be used as an SQL filter to limit what rows are deleted.
Source code inlancedb/merge.py
execute
execute(new_data:DATA,on_bad_vectors:str='error',fill_value:float=0.0,timeout:Optional[timedelta]=None)->MergeInsertResult
Executes the merge insert operation
Nothing is returned, but the Table is updated.
Parameters:
new_data
(DATA
) –New records which will be matched against the existing records to potentially insert or update into the table. This parameter can be anything you use for add.
on_bad_vectors
(str
, default:'error'
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
fill_value
(float
, default:0.0
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
timeout
(Optional[timedelta]
, default:None
) –Maximum time to run the operation before cancelling it.
By default, there is a 30-second timeout that is only enforced after the first attempt. This is to prevent spending too long retrying to resolve conflicts. For example, if a write attempt takes 20 seconds and fails, the second attempt will be cancelled after 10 seconds, hitting the 30-second timeout. However, a write that takes one hour and succeeds on the first attempt will not be cancelled.
When this is set, the timeout is enforced on all attempts, including the first.
Returns:
MergeInsertResult
–version: the new version number of the table after doing merge insert.
Source code inlancedb/merge.py
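A sketch of a typical upsert with this builder (the table and column names are illustrative):

import lancedb

db = lancedb.connect("./.lancedb")
table = db.open_table("users")  # assumed to have an "id" column

new_data = [{"id": 1, "name": "Alice"}, {"id": 3, "name": "Carol"}]
result = (
    table.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data)
)
print(result.version)  # the new table version after the merge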
Integrations
Pydantic
lancedb.pydantic.pydantic_to_schema
Convert a Pydantic Model to a PyArrow Schema.
Parameters:
model
(Type[BaseModel]
) –The Pydantic BaseModel to convert to Arrow Schema.
Returns:
Schema
–The Arrow Schema
Examples:
>>> from typing import List, Optional
>>> import pydantic
>>> from lancedb.pydantic import pydantic_to_schema, Vector
>>> class FooModel(pydantic.BaseModel):
...     id: int
...     s: str
...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
...     li: List[int]
...
>>> schema = pydantic_to_schema(FooModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("s", pa.utf8(), False),
...     pa.field("vec", pa.list_(pa.float32(), 1536)),
...     pa.field("li", pa.list_(pa.int64()), False),
... ])
Source code inlancedb/pydantic.py
lancedb.pydantic.vector
vector(dim:int,value_type:DataType=pa.float32())
Source code inlancedb/pydantic.py
lancedb.pydantic.LanceModel
Bases:BaseModel
A Pydantic Model base class that can be converted to a LanceDB Table.
Examples:
>>> import lancedb
>>> from lancedb.pydantic import LanceModel, Vector
>>>
>>> class TestModel(LanceModel):
...     name: str
...     vector: Vector(2)
...
>>> db = lancedb.connect("./example")
>>> table = db.create_table("test", schema=TestModel)
>>> table.add([
...     TestModel(name="test", vector=[1.0, 2.0])
... ])
AddResult(version=2)
>>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
[TestModel(name='test', vector=FixedSizeList(dim=2))]
Source code inlancedb/pydantic.py
to_arrow_schema (classmethod)
Get the Arrow Schema for this model.
Source code inlancedb/pydantic.py
field_names (classmethod)
parse_embedding_functions (classmethod)
Parse the embedding functions from this model.
Source code inlancedb/pydantic.py
Reranking
lancedb.rerankers.linear_combination.LinearCombinationReranker
Bases:Reranker
Reranks the results using a linear combination of the scores from the vector and FTS search. For missing scores, fill with the fill value.
Parameters:
weight
(float
, default:0.7
) –The weight to give to the vector score. Must be between 0 and 1.
fill
(float
, default:1.0
) –The score to give to results that are only in one of the two result sets. This is treated as a penalty, so a higher value means a lower score. TODO: We should just hardcode this -- it's pretty confusing, as we invert scores to calculate the final score
return_score
(str
, default:"relevance"
) –Options are "relevance" or "all". The type of score to return. If "relevance", will return only the relevance score. If "all", will return all scores from the vector and FTS search along with the relevance score.
Source code inlancedb/rerankers/linear_combination.py
lancedb.rerankers.cohere.CohereReranker
Bases:Reranker
Reranks the results using the Cohere Rerank API. https://docs.cohere.com/docs/rerank-guide
Parameters:
model_name
(str
, default:"rerank-english-v2.0"
) –The name of the cross encoder model to use. Available Cohere models are:
- rerank-english-v2.0
- rerank-multilingual-v2.0
column
(str
, default:"text"
) –The name of the column to use as input to the cross encoder model.
top_n
(str
, default:None
) –The number of results to return. If None, will return all results.
Source code inlancedb/rerankers/cohere.py
lancedb.rerankers.colbert.ColbertReranker
Bases:AnswerdotaiRerankers
Reranks the results using the ColBERT model.
Parameters:
model_name
(str
, default:"colbert" (colbert-ir/colbert-v2.0)
) –The name of the cross encoder model to use.
column
(str
, default:"text"
) –The name of the column to use as input to the cross encoder model.
return_score
(str
, default:"relevance"
) –options are "relevance" or "all". Only "relevance" is supported for now.
**kwargs
–Additional keyword arguments to pass to the model, for example, 'device'.See AnswerDotAI/rerankers for more information.
Source code inlancedb/rerankers/colbert.py
lancedb.rerankers.cross_encoder.CrossEncoderReranker
Bases:Reranker
Reranks the results using a cross encoder model. The cross encoder model is used to score the query and each result. The results are then sorted by the score.
Parameters:
model_name
(str
, default:"cross-encoder/ms-marco-TinyBERT-L-6"
) –The name of the cross encoder model to use. See the sentence transformers documentation for a list of available models.
column
(str
, default:"text"
) –The name of the column to use as input to the cross encoder model.
device
(str
, default:None
) –The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".
return_score
(str
, default:"relevance"
) –options are "relevance" or "all". Only "relevance" is supported for now.
trust_remote_code
(bool
, default:True
) –If True, will trust the remote code to be safe. If False, will not trust the remote code and will not run it.
Source code inlancedb/rerankers/cross_encoder.py
lancedb.rerankers.openai.OpenaiReranker
Bases:Reranker
Reranks the results using the OpenAI API. WARNING: This is a prompt-based reranker that uses a chat model, not a dedicated reranker API. This should be treated as experimental.
Parameters:
model_name
(str
, default:"gpt-4-turbo-preview"
) –The name of the cross encoder model to use.
column
(str
, default:"text"
) –The name of the column to use as input to the cross encoder model.
return_score
(str
, default:"relevance"
) –options are "relevance" or "all". Only "relevance" is supported for now.
api_key
(str
, default:None
) –The API key to use. If None, will use the OPENAI_API_KEY environment variable.
Source code inlancedb/rerankers/openai.py
Connections (Asynchronous)
Connections represent a connection to a LanceDB database and can be used to create, list, or open tables.
lancedb.connect_async (async)
connect_async(uri:URI,*,api_key:Optional[str]=None,region:str='us-east-1',host_override:Optional[str]=None,read_consistency_interval:Optional[timedelta]=None,client_config:Optional[Union[ClientConfig,Dict[str,Any]]]=None,storage_options:Optional[Dict[str,str]]=None)->AsyncConnection
Connect to a LanceDB database.
Parameters:
uri
(URI
) –The uri of the database.
api_key
(Optional[str]
, default:None
) –If present, connect to LanceDB Cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via the environment variable LANCEDB_API_KEY.
region
(str
, default:'us-east-1'
) –The region to use for LanceDB Cloud.
host_override
(Optional[str]
, default:None
) –The override url for LanceDB Cloud.
read_consistency_interval
(Optional[timedelta]
, default:None
) –(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.
client_config
(Optional[Union[ClientConfig,Dict[str,Any]]]
, default:None
) –Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/
Examples:
>>> import lancedb
>>> async def doctest_example():
...     # For a local directory, provide a path to the database
...     db = await lancedb.connect_async("~/.lancedb")
...     # For object storage, use a URI prefix
...     db = await lancedb.connect_async(
...         "s3://my-bucket/lancedb",
...         storage_options={"aws_access_key_id": "***"})
...     # Connect to LanceDB cloud
...     db = await lancedb.connect_async(
...         "db://my_database", api_key="ldb_...",
...         client_config={"retry_config": {"retries": 5}})
Returns:
conn
(AsyncConnection
) –A connection to a LanceDB database.
Source code inlancedb/__init__.py
lancedb.db.AsyncConnection
Bases:object
An active LanceDB connection
To obtain a connection you can use the connect_async function.
This could be a native connection (using lance) or a remote connection (e.g. for connecting to LanceDB Cloud).
Local connections do not currently hold any open resources, but they may do so in the future (for example, for shared cache or connections to catalog services). Remote connections represent an open connection to the remote server. The close method can be used to release any underlying resources eagerly. The connection can also be used as a context manager.
Connections can be shared on multiple threads and are expected to be long lived. Connections can also be used as a context manager; however, in many cases a single connection can be used for the lifetime of the application and so this is often not needed. Closing a connection is optional. If it is not closed then it will be automatically closed when the connection object is deleted.
Examples:
>>> import lancedb
>>> async def doctest_example():
...     with await lancedb.connect_async("/tmp/my_dataset") as conn:
...         # do something with the connection
...         pass
...     # conn is closed here
Source code inlancedb/db.py
is_open
close
Close the connection, releasing any underlying resources.
It is safe to call this method multiple times.
Any attempt to use the connection after it is closed will result in an error.
table_names (async)
List all tables in this database, in sorted order
Parameters:
start_after
(Optional[str]
, default:None
) –If present, only return names that come lexicographically after the supplied value.
This can be combined with limit to implement pagination by setting this to the last table name from the previous page.
limit
(Optional[int]
, default:None
) –The number of results to return.
Returns:
Iterable of str
–
Source code inlancedb/db.py
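A sketch of that pagination pattern (the page size is illustrative):

import asyncio
import lancedb

async def list_all_tables():
    db = await lancedb.connect_async("./.lancedb")
    names = list(await db.table_names(limit=10))
    while names:
        for name in names:
            print(name)
        # resume after the last name of the previous page
        names = list(await db.table_names(start_after=names[-1], limit=10))

asyncio.run(list_all_tables())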
create_table (async)
create_table(name:str,data:Optional[DATA]=None,schema:Optional[Union[Schema,LanceModel]]=None,mode:Optional[Literal['create','overwrite']]=None,exist_ok:Optional[bool]=None,on_bad_vectors:Optional[str]=None,fill_value:Optional[float]=None,storage_options:Optional[Dict[str,str]]=None,*,embedding_functions:Optional[List[EmbeddingFunctionConfig]]=None)->AsyncTable
Create anAsyncTable in the database.
Parameters:
name
(str
) –The name of the table.
data
(Optional[DATA]
, default:None
) –User must provide at least one of data or schema. Acceptable types are:
list-of-dict
pandas.DataFrame
pyarrow.Table or pyarrow.RecordBatch
schema
(Optional[Union[Schema,LanceModel]]
, default:None
) –Acceptable types are:
pyarrow.Schema
mode
(Optional[Literal['create', 'overwrite']]
, default:None
) –The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".
exist_ok
(Optional[bool]
, default:None
) –If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.
on_bad_vectors
(Optional[str]
, default:None
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
fill_value
(Optional[float]
, default:None
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
Returns:
AsyncTable
–A reference to the newly created table.
Note
–The vector index won't be created by default. To create the index, call the create_index method on the table.
Examples:
Can create with list of tuples or dictionaries:
>>> import lancedb
>>> async def doctest_example():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...             {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
...     my_table = await db.create_table("my_table", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(doctest_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
You can also pass a pandas DataFrame:
>>> import pandas as pd
>>> data = pd.DataFrame({
...     "vector": [[1.1, 1.2], [0.2, 1.8]],
...     "lat": [45.5, 40.1],
...     "long": [-122.7, -74.1]
... })
>>> async def pandas_example():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table2", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(pandas_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("lat", pa.float32()),
...     pa.field("long", pa.float32())
... ])
>>> async def with_schema():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table3", data, schema=custom_schema)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(with_schema())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
It is also possible to create a table from an Iterable[pa.RecordBatch]:
>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                          pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema = pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> async def iterable_example():
...     db = await lancedb.connect_async("./.lancedb")
...     await db.create_table("table4", make_batches(), schema=schema)
>>> asyncio.run(iterable_example())
Source code inlancedb/db.py
open_table (async)
open_table(name:str,storage_options:Optional[Dict[str,str]]=None,index_cache_size:Optional[int]=None)->AsyncTable
Open a Lance Table in the database.
Parameters:
name
(str
) –The name of the table.
storage_options
(Optional[Dict[str,str]]
, default:None
) –Additional options for the storage backend. Options already set on theconnection will be inherited by the table, but can be overridden here.See available options athttps://lancedb.github.io/lancedb/guides/storage/
index_cache_size
(Optional[int]
, default:None
) –Set the size of the index cache, specified as a number of entries
The exact meaning of an "entry" will depend on the type of index:
* IVF - there is one entry for each IVF partition
* BTREE - there is one entry for the entire index
This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.
Returns:
–A LanceTable object representing the table.
Source code inlancedb/db.py
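For example, a sketch of opening a table with a larger index cache (the size is illustrative):

async def open_with_cache():
    db = await lancedb.connect_async("./.lancedb")
    # allow up to 512 cached index entries (e.g. IVF partitions)
    return await db.open_table("my_table", index_cache_size=512)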
rename_table (async)
Rename a table in the database.
Parameters:
old_name
(str
) –The current name of the table.
new_name
(str
) –The new name of the table.
Source code inlancedb/db.py
drop_table (async)
Drop a table from the database.
Parameters:
name
(str
) –The name of the table.
ignore_missing
(bool
, default:False
) –If True, ignore if the table does not exist.
Source code inlancedb/db.py
drop_all_tables (async)
drop_database (async)
Drop database. This is the same thing as dropping all the tables.
Source code inlancedb/db.py
Tables (Asynchronous)
Tables hold your actual data as a collection of records / rows.
lancedb.table.AsyncTable
An AsyncTable is a collection of Records in a LanceDB Database.
An AsyncTable can be obtained from the AsyncConnection.create_table and AsyncConnection.open_table methods.
An AsyncTable object is expected to be long lived and reused for multiple operations. AsyncTable objects will cache a certain amount of index data in memory. This cache will be freed when the Table is garbage collected. To eagerly free the cache you can call the close method. Once the AsyncTable is closed, it cannot be used for any further operations.
An AsyncTable can also be used as a context manager, and will automatically close when the context is exited. Closing a table is optional. If you do not close the table, it will be closed when the AsyncTable object is garbage collected.
Examples:
Create using AsyncConnection.create_table (more examples in that method's documentation).
>>> import lancedb
>>> async def create_a_table():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "b": 2}]
...     table = await db.create_table("my_table", data=data)
...     print(await table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(create_a_table())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1, 1.2]]]
b: [[2]]
Can append new data with AsyncTable.add().
>>> async def add_to_table():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
>>> asyncio.run(add_to_table())
Can query the table with AsyncTable.vector_search.
>>> async def search_table_for_vector():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     results = (
...         await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
...     )
...     print(results)
>>> asyncio.run(search_table_for_vector())
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13
Search queries are much faster when an index is created. See AsyncTable.create_index.
tags property
Tag management for the dataset.
Similar to Git, tags are a way to add metadata to a specific version of the dataset.
Warning
Tagged versions are exempted from the optimize(cleanup_older_than) process. To remove a version that has been tagged, you must first delete the associated tag with Tags.delete.
__init__
Create a new AsyncTable object.
You should not create AsyncTable objects directly.
Use AsyncConnection.create_table and AsyncConnection.open_table to obtain Table objects.
is_open
close
Close the table and free any resources associated with it.
It is safe to call this method multiple times.
Any attempt to use the table after it has been closed will raise an error.
schema async
schema() -> Schema
embedding_functions async
embedding_functions() -> Dict[str, EmbeddingFunctionConfig]
Get the embedding functions for the table
Returns:
funcs
(Dict[str,EmbeddingFunctionConfig]
) – A mapping of the vector column to the embedding function, or an empty dict if not configured.
count_rows async
Count the number of rows in the table.
Parameters:
filter
(Optional[str]
, default:None
) –A SQL where clause to filter the rows to count.
head async
head(n=5) -> Table
Return the first n rows of the table.
Parameters:
n
–The number of rows to return.
query
query() -> AsyncQuery
Returns an AsyncQuery that can be used to search the table.
Use methods on the returned query to control query behavior. The query can be executed with methods like to_arrow, to_pandas and more.
to_pandas async
to_arrow async
to_arrow() -> Table
create_index async
create_index(column: str, *, replace: Optional[bool] = None, config: Optional[Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]] = None, wait_timeout: Optional[timedelta] = None)
Create an index to speed up queries
Indices can be created on vector columns or scalar columns. Indices on vector columns will speed up vector searches. Indices on scalar columns will speed up filtering (in both vector and non-vector searches).
Parameters:
column
(str
) –The column to index.
replace
(Optional[bool]
, default:None
) –Whether to replace the existing index
If this is false, and another index already exists on the same columns and the same name, then an error will be returned. This is true even if that index is out of date.
The default is True
config
(Optional[Union[IvfFlat,IvfPq,HnswPq,HnswSq,BTree,Bitmap,LabelList,FTS]]
, default:None
) – For advanced configuration you can specify the type of index you would like to create. You can also specify index-specific parameters when creating an index object.
wait_timeout
(Optional[timedelta]
, default:None
) –The timeout to wait if indexing is asynchronous.
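As an illustration, a minimal sketch that creates a vector index and a scalar index (assuming a table with a vector column named "vector" and a scalar column named "b"; the num_partitions value is illustrative):
>>> import asyncio
>>> import lancedb
>>> from lancedb.index import IvfPq, BTree
>>> async def index_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     # Vector index with an explicit configuration
...     await table.create_index("vector", config=IvfPq(num_partitions=128))
...     # Scalar index to speed up filters on "b"
...     await table.create_index("b", config=BTree())
>>> asyncio.run(index_example())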
drop_index async
Drop an index from the table.
Parameters:
name
(str
) –The name of the index to drop.
Notes
This does not delete the index from disk, it just removes it from the table. To delete the index, run optimize after dropping the index.
Use list_indices to find the names of the indices.
prewarm_index async
Prewarm an index in the table.
Parameters:
name
(str
) –The name of the index to prewarm
Notes
This will load the index into memory. This may reduce the cold-start time for future queries. If the index does not fit in the cache then this call may be wasteful.
wait_for_index async
Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.
Parameters:
index_names
(Iterable[str]
) –The name of the indices to poll
timeout
(timedelta
, default:timedelta(seconds=300)
) –Timeout to wait for asynchronous indexing. The default is 5 minutes.
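For example, a sketch of waiting for an asynchronously built index (the index name "vector_idx" is illustrative; use list_indices to discover the actual names):
>>> import asyncio
>>> import lancedb
>>> from datetime import timedelta
>>> async def wait_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.wait_for_index(["vector_idx"], timeout=timedelta(minutes=10))
>>> asyncio.run(wait_example())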
stats async
add async
add(data: DATA, *, mode: Optional[Literal['append', 'overwrite']] = 'append', on_bad_vectors: Optional[OnBadVectorsType] = None, fill_value: Optional[float] = None) -> AddResult
Add more data to the Table.
Parameters:
data
(DATA
) –The data to insert into the table. Acceptable types are:
list-of-dict
pandas.DataFrame
pyarrow.Table or pyarrow.RecordBatch
mode
(Optional[Literal['append', 'overwrite']]
, default:'append'
) – The mode to use when writing the data. Valid values are "append" and "overwrite".
on_bad_vectors
(Optional[OnBadVectorsType]
, default:None
) – What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill", "null".
fill_value
(Optional[float]
, default:None
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
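A small sketch of handling bad vectors on insert (the rows are illustrative and assume the "my_table" schema from the examples above):
>>> import asyncio
>>> import lancedb
>>> async def add_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     rows = [
...         {"vector": [1.0, 2.0], "b": 5},
...         {"vector": [float("nan"), 1.0], "b": 6},  # bad vector gets filled
...     ]
...     await table.add(rows, on_bad_vectors="fill", fill_value=0.0)
>>> asyncio.run(add_example())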
merge_insert
merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder
Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation.
This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january").
The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.
"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data). "Not matched by source" records exist only in the target table (this is old data).
The builder returned by this method can be used to customize what should happen for each category of data.
Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.
Parameters:
on
(Union[str,Iterable[str]]
) –A column (or columns) to join on. This is how records from thesource table and target table are matched. Typically this is somekind of key or id column.
Examples:
>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> res = table.merge_insert("a") \
...     .when_matched_update_all() \
...     .when_not_matched_insert_all() \
...     .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
search async
search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]
Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search.
All query options are defined in AsyncQuery.
Parameters:
query
(Optional[Union[VEC,str, 'PIL.Image.Image',Tuple,FullTextQuery]]
, default:None
) – The target vector to search for.
default None. Acceptable types are: list, np.ndarray, PIL.Image.Image
If None then the select/where/limit clauses are applied to filter the table.
vector_column_name
(Optional[str]
, default:None
) –The name of the vector column to search.
The vector column needs to be a pyarrow fixed size list type
If not specified then the vector column is inferred from the table schema.
If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.
query_type
(QueryType
, default:'auto'
) –default "auto".Acceptable types are: "vector", "fts", "hybrid", or "auto"
If "auto" then the query type is inferred from the query;
If
query
is a list/np.ndarray then the query type is"vector";If
query
is a PIL.Image.Image then either do vector search,or raise an error if no corresponding embedding function is found.
If
query
is a string, then the query type is "vector" if the table has embedding functions else the query type is "fts"
Returns:
Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]
– A query builder object representing the query.
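A minimal sketch of a vector search through this method (assuming the "my_table" examples above; note that search itself is awaited to obtain the query builder):
>>> import asyncio
>>> import lancedb
>>> async def search_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     query = await table.search([0.4, 0.4], vector_column_name="vector")
...     results = await query.limit(5).to_pandas()
>>> asyncio.run(search_example())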
vector_search
vector_search(query_vector: Union[VEC, Tuple]) -> AsyncVectorQuery
Search the table with a given query vector. This is a convenience method for preparing a vector query and is the same thing as calling nearest_to on the builder returned by query. See nearest_to for more details.
delete async
Delete rows from the table.
This can be used to delete a single row, many rows, all rows, orsometimes no rows (if your predicate matches nothing).
Parameters:
where
(str
) –The SQL where clause to use when deleting rows.
For example, 'x = 2' or 'x IN (1, 2, 3)'.
The filter must not be empty, or it will error.
Examples:
>>> import lancedb
>>> data = [
...     {"x": 1, "vector": [1.0, 2]},
...     {"x": 2, "vector": [3.0, 4]},
...     {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
DeleteResult(version=2)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]
If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:
>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
DeleteResult(version=3)
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
update async
update(updates: Optional[Dict[str, Any]] = None, *, where: Optional[str] = None, updates_sql: Optional[Dict[str, str]] = None) -> UpdateResult
This can be used to update zero to all rows in the table.
If a filter is provided with where then only rows matching the filter will be updated. Otherwise all rows will be updated.
Parameters:
updates
(Optional[Dict[str,Any]]
, default:None
) – The updates to apply. The keys should be the name of the column to update. The values should be the new values to assign. This is required unless updates_sql is supplied.
where
(Optional[str]
, default:None
) – An SQL filter that controls which rows are updated. For example, 'x = 2' or 'x IN (1, 2, 3)'. Only rows that satisfy this filter will be updated.
updates_sql
(Optional[Dict[str,str]]
, default:None
) – The updates to apply, expressed as SQL expression strings. The keys should be column names. The values should be SQL expressions. These can be SQL literals (e.g. "7" or "'foo'") or they can be expressions based on the previous value of the row (e.g. "x + 1" to increment the x column by 1).
Returns:
UpdateResult
– An object containing:
- rows_updated: The number of rows that were updated
- version: The new version number of the table after the update
Examples:
>>> import asyncio
>>> import lancedb
>>> import pandas as pd
>>> async def demo_update():
...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.create_table("my_table", data)
...     # x is [1, 2], vector is [[1, 2], [3, 4]]
...     await table.update({"vector": [10, 10]}, where="x = 2")
...     # x is [1, 2], vector is [[1, 2], [10, 10]]
...     await table.update(updates_sql={"x": "x + 1"})
...     # x is [2, 3], vector is [[1, 2], [10, 10]]
>>> asyncio.run(demo_update())
add_columns async
Add new columns with defined values.
Parameters:
transforms
(dict[str, str] | field | List[field] | Schema
) – A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, you can pass a pyarrow field or schema to add new columns with NULLs.
Returns:
AddColumnsResult
–version: the new version number of the table after adding columns.
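For instance, a sketch that derives a new column from an existing one via a SQL expression (the column names are illustrative):
>>> import asyncio
>>> import lancedb
>>> async def add_columns_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     # "b_doubled" is computed from the existing "b" column for every row
...     await table.add_columns({"b_doubled": "b * 2"})
>>> asyncio.run(add_columns_example())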
alter_columns async
Alter column names and nullability.
Parameters:
alterations
(Iterable[Dict[str, Any]]
) – A sequence of dictionaries, each with the following keys:
- "path": str. The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".
- "rename": str, optional. The new name of the column. If not specified, the column name is not changed.
- "data_type": pyarrow.DataType, optional. The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.
- "nullable": bool, optional. Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.
Returns:
AlterColumnsResult
–version: the new version number of the table after the alteration.
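A sketch of a rename-and-cast alteration (assuming an integer column "b" exists; pyarrow supplies the data_type value):
>>> import asyncio
>>> import lancedb
>>> import pyarrow as pa
>>> async def alter_columns_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.alter_columns(
...         {"path": "b", "rename": "count", "data_type": pa.int32()}
...     )
>>> asyncio.run(alter_columns_example())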
drop_columns async
Drop columns from the table.
Parameters:
columns
(Iterable[str]
) –The names of the columns to drop.
version async
Retrieve the version of the table.
LanceDB supports versioning. Every operation that modifies the table increases the version. As long as a version hasn't been deleted you can checkout that version to view the data at that point. In addition, you can restore the version to replace the current table with a previous version.
list_versions async
List all versions of the table
checkout async
Checks out a specific version of the Table
Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.
This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the restore method.
Any operation that modifies the table will fail while the table is in a checked out state.
Parameters:
version
(int | str
) – The version to check out. A version number (int) or a tag (str) can be provided.
checkout_latest async
Ensures the table is pointing at the latest version.
This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a checkout operation.
restore async
Restore the table to the currently checked out version.
This operation will fail if checkout has not been called previously.
This operation will overwrite the latest version of the table with a previous version. Any changes made since the checked out version will no longer be visible.
Once the operation concludes the table will no longer be in a checked out state and the read_consistency_interval, if any, will apply.
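Putting the versioning methods together, a small sketch of time travel (the data and table name are illustrative):
>>> import asyncio
>>> import lancedb
>>> async def time_travel_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     v = await table.version()
...     await table.add([{"vector": [2.0, 2.0], "b": 9}])
...     await table.checkout(v)   # read-only view of the older version
...     await table.restore()     # make that version the latest again
>>> asyncio.run(time_travel_example())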
optimize async
optimize(*, cleanup_older_than: Optional[timedelta] = None, delete_unverified: bool = False, retrain=False) -> OptimizeStats
Optimize the on-disk data and indices for better performance.
Modeled after VACUUM in PostgreSQL.
Optimization covers three operations:
- Compaction: Merges small files into larger ones
- Prune: Removes old versions of the dataset
- Index: Optimizes the indices, adding new data to existing indices
Parameters:
cleanup_older_than
(Optional[timedelta]
, default:None
) – All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.
delete_unverified
(bool
, default:False
) – Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.
retrain
– If True, retrain the vector indices. This will refine the IVF clustering and quantization, which may improve the search accuracy. It's faster than re-creating the index from scratch, so it's recommended to try this first when the data distribution has changed significantly.
Experimental API
The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.
That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.
The frequency an application should call optimize is based on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
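For example, a sketch that compacts small files, prunes versions older than a week, and optimizes indices in one call:
>>> import asyncio
>>> import lancedb
>>> from datetime import timedelta
>>> async def optimize_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     stats = await table.optimize(cleanup_older_than=timedelta(days=7))
>>> asyncio.run(optimize_example())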
list_indices async
index_stats async
Retrieve statistics about an index.
Parameters:
index_name
(str
) –The name of the index to retrieve statistics for
Returns:
IndexStatistics or None
–The statistics about the index. Returns None if the index does not exist.
uses_v2_manifest_paths async
Check if the table is using the new v2 manifest paths.
Returns:
bool
–True if the table is using the new v2 manifest paths, False otherwise.
migrate_manifest_paths_v2 async
Migrate the manifest paths to the new format.
This will update the manifest to use the new v2 format for paths.
This function is idempotent, and can be run multiple times without changing the state of the object store.
Danger
This should not be run while other concurrent operations are happening, and it should run to completion before other operations are resumed.
You can use AsyncTable.uses_v2_manifest_paths to check if the table is already using the new path style.
replace_field_metadata async
Replace the metadata of a field in the schema
Parameters:
field_name
(str
) –The name of the field to replace the metadata for
new_metadata
(dict[str,str]
) –The new metadata to set
Indices (Asynchronous)
Indices can be created on a table to speed up queries. This section lists the indices that LanceDB supports.
lancedb.index.BTree dataclass
Describes a btree index configuration
A btree index is an index on scalar columns. The index stores a copy of the column in sorted order. A header entry is created for each block of rows (currently the block size is fixed at 4096). These header entries are stored in a separate cacheable structure (a btree). To search for data the header is used to determine which blocks need to be read from disk.
For example, a btree index in a table with 1Bi rows requires sizeof(Scalar) * 256Ki bytes of memory and will generally need to read sizeof(Scalar) * 4096 bytes to find the correct row ids.
This index is good for scalar columns with mostly distinct values and does best when the query is highly selective. It works with numeric, temporal, and string columns.
The btree index does not currently have any parameters though parameters such as the block size may be added in the future.
lancedb.index.Bitmap dataclass
Describe a Bitmap index configuration.
A Bitmap index stores a bitmap for each distinct value in the column for every row.
This index works best for low-cardinality numeric or string columns, where the number of unique values is small (i.e., less than a few thousand). A Bitmap index can accelerate the following filters:
- <, <=, =, >, >=
- IN (value1, value2, ...)
- between (value1, value2)
- is null
For example, a bitmap index on a table with 1Bi rows and 128 distinct values requires 128 / 8 * 1Bi bytes on disk.
lancedb.index.LabelList dataclass
Describe a LabelList index configuration.
LabelList is a scalar index that can be used on List<T> columns to support queries with array_contains_all and array_contains_any using an underlying bitmap index.
For example, it works with tags, categories, keywords, etc.
lancedb.index.FTS dataclass
Describe a FTS index configuration.
FTS is a full-text search index that can be used on String columns.
For example, it works with title, description, content, etc.
Attributes:
with_position
(bool, default False
) – Whether to store the position of the token in the document. Setting this to False can reduce the size of the index and improve indexing speed, but it will disable support for phrase queries.
base_tokenizer
(str, default "simple"
) – The base tokenizer to use for tokenization. Options are:
- "simple": Splits text by whitespace and punctuation.
- "whitespace": Split text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
language
(str, default "English"
) –The language to use for tokenization.
max_token_length
(int, default 40
) – The maximum token length to index. Tokens longer than this length will be ignored.
lower_case
(bool, default True
) –Whether to convert the token to lower case. This makes queries case-insensitive.
stem
(bool, default True
) – Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".
remove_stop_words
(bool, default True
) – Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".
ascii_folding
(bool, default True
) – Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".
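A minimal sketch of creating an FTS index with a customized configuration (assuming a string column named "text"; the options shown mirror the attributes above):
>>> import asyncio
>>> import lancedb
>>> from lancedb.index import FTS
>>> async def fts_index_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     # Disabling positions shrinks the index but disables phrase queries
...     config = FTS(with_position=False, stem=True, remove_stop_words=True)
...     await table.create_index("text", config=config)
>>> asyncio.run(fts_index_example())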
lancedb.index.IvfPq dataclass
Describes an IVF PQ Index
This index stores a compressed (quantized) copy of every vector. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.
During a query the centroids are compared with the query vector to find the closest partitions. The compressed vectors in these partitions are then searched to find the closest vectors.
The compression scheme is called product quantization. Each vector is divided into subvectors and then each subvector is quantized into a small number of bits. The parameters num_bits and num_sub_vectors control this process, providing a tradeoff between index size (and thus search speed) and index accuracy.
The partitioning process is called IVF and the num_partitions parameter controls how many groups to create.
Note that training an IVF PQ index on a large dataset is a slow operation and currently is also a memory intensive operation.
Attributes:
distance_type
(str, default "l2"
) –The distance metric used to train the index
This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.
The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].
Note: the cosine distance is undefined when one (or both) of the vectorsare all zeros (there is no direction). These vectors are invalid and maynever be returned from a vector search.
"dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.
num_partitions
(int, default sqrt(num_rows)
) –The number of IVF partitions to create.
This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.
If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.
num_sub_vectors
(int, default is vector dimension / 16
) –Number of sub-vectors of PQ.
This value controls how much the vector is compressed during the quantization step. The more sub vectors there are the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.
The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.
If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.
num_bits
(int, default 8
) –Number of bits to encode each sub-vector.
This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index, but the slower the search. The default is 8 bits. Only 4 and 8 are supported.
max_iterations
(int, default 50
) – Max iterations to train kmeans.
When training an IVF PQ index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.
The default value is 50.
sample_rate
(int, default 256
) –The rate used to calculate the number of training vectors for kmeans.
When an IVF PQ index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions. Increasing this value might improve the quality of the index but in most cases the default should be sufficient.
The default value is 256.
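As a sketch, an IvfPq configuration tuned by hand (the values are illustrative, not recommendations); pass the object as the config argument of create_index, as shown earlier:
>>> from lancedb.index import IvfPq
>>> config = IvfPq(
...     distance_type="cosine",
...     num_partitions=256,   # ~sqrt(num_rows) is the usual default
...     num_sub_vectors=16,   # dimension / 16 is the usual default
...     num_bits=8,
... )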
lancedb.index.HnswPq dataclass
Describe a HNSW-PQ index configuration.
HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization. It is a variant of the HNSW algorithm that uses product quantization to compress the vectors. To create an HNSW-PQ index, you can specify the following parameters:
Parameters:
distance_type
(Literal['l2', 'cosine', 'dot']
, default:'l2'
) –The distance metric used to train the index.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].
"dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.
num_partitions
(Optional[int]
, default:None
) –The number of IVF partitions to create.
For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.
num_sub_vectors
(Optional[int]
, default:None
) –Number of sub-vectors of PQ.
This value controls how much the vector is compressed during the quantization step. The more sub vectors there are the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.
The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.
If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.
num_bits
(int, default: 8) – Number of bits to encode each sub-vector.
This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index, but the slower the search. Only 4 and 8 are supported.
max_iterations
(int
, default:50
) –Max iterations to train kmeans.
When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyway.
sample_rate
(int
, default:256
) –The rate used to calculate the number of training vectors for kmeans.
When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions. Increasing this value might improve the quality of the index but in most cases the default should be sufficient.
m
(int
, default:20
) –The number of neighbors to select for each vector in the HNSW graph.
This value controls the tradeoff between search speed and accuracy. The higher the value, the more accurate the search, but the slower it will be.
ef_construction
(int
, default:300
) –The number of candidates to evaluate during the construction of the HNSW graph.
This value controls the tradeoff between build speed and accuracy. The higher the value, the more accurate the build, but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.
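A sketch of an HnswPq configuration built from the parameters above (the values are illustrative):
>>> from lancedb.index import HnswPq
>>> config = HnswPq(
...     num_partitions=1,    # one HNSW graph works well for most tables
...     m=20,                # neighbors per node
...     ef_construction=300, # build-time candidate list size
... )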
lancedb.index.HnswSq dataclass
Describe a HNSW-SQ index configuration.
HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization. It is a variant of the HNSW algorithm that uses scalar quantization to compress the vectors.
Parameters:
distance_type
(Literal['l2', 'cosine', 'dot']
, default:'l2'
) –The distance metric used to train the index.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].
"dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.
num_partitions
(Optional[int]
, default:None
) –The number of IVF partitions to create.
For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.
max_iterations
(int
, default:50
) –Max iterations to train kmeans.
When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyway.
sample_rate
(int
, default:256
) –The rate used to calculate the number of training vectors for kmeans.
When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions. Increasing this value might improve the quality of the index but in most cases the default should be sufficient.
m
(int
, default:20
) –The number of neighbors to select for each vector in the HNSW graph.
This value controls the tradeoff between search speed and accuracy. The higher the value, the more accurate the search, but the slower it will be.
ef_construction
(int
, default:300
) –The number of candidates to evaluate during the construction of the HNSW graph.
This value controls the tradeoff between build speed and accuracy. The higher the value, the more accurate the build, but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.
lancedb.index.IvfFlat dataclass
Describes an IVF Flat Index
This index stores raw vectors. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.
Attributes:
distance_type
(str, default "l2"
) –The distance metric used to train the index
This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.
The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric thataccounts for both magnitude and direction when determining the distancebetween vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metriccalculated from the cosine similarity between two vectors. Cosinesimilarity is a measure of similarity between two non-zero vectors of aninner product space. It is defined to equal the cosine of the anglebetween them. Unlike l2, the cosine distance is not affected by themagnitude of the vectors. Cosine distance has a range of [0, 2].
Note: the cosine distance is undefined when one (or both) of the vectorsare all zeros (there is no direction). These vectors are invalid and maynever be returned from a vector search.
"dot" - Dot product. Dot distance is the dot product of two vectors. Dotdistance has a range of (-∞, ∞). If the vectors are normalized (i.e. theirl2 norm is 1), then dot distance is equivalent to the cosine distance.
"hamming" - Hamming distance. Hamming distance is a distance metriccalculated as the number of positions at which the corresponding bits aredifferent. Hamming distance has a range of [0, vector dimension].
num_partitions
(int, default sqrt(num_rows)
) –The number of IVF partitions to create.
This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.
If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.
max_iterations
(int, default 50
) – Max iterations to train kmeans.
When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.
The default value is 50.
sample_rate
(int, default 256
) –The rate used to calculate the number of training vectors for kmeans.
When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions. Increasing this value might improve the quality of the index but in most cases the default should be sufficient.
The default value is 256.
Querying (Asynchronous)
Queries allow you to return data from your database. Basic queries can be created with the AsyncTable.query method to return the entire (typically filtered) table. Vector searches return the rows nearest to a query vector and can be created with the AsyncTable.vector_search method.
lancedb.query.AsyncQuery
Bases: AsyncQueryBase
where
Only return rows matching the given predicate
The predicate should be supplied as an SQL query string.
Examples:
Filtering performance can often be improved by creating a scalar index on the filter column(s).
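A minimal sketch of a filtered query (assuming a table with an integer column "b"):
>>> import asyncio
>>> import lancedb
>>> async def where_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     rows = await table.query().where("b > 2").limit(10).to_list()
>>> asyncio.run(where_example())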
select
Return only the specified columns.
By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDB stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.
As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.
You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).
To create dynamic columns you can pass in a dict[str, str]. A column will bereturned for each entry in the map. The key provides the name of the column.The value is an SQL string used to specify how the column is calculated.
For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.
Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
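A sketch of selecting a dynamic column alongside a plain one (assuming columns "a", "b", and "c" exist):
>>> import asyncio
>>> import lancedb
>>> async def select_example():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     rows = await (
...         table.query().select({"combined": "a + b", "c": "c"}).to_list()
...     )
>>> asyncio.run(select_example())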
limit
Set the maximum number of results to return.
By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.
offset
Set the offset for the results.
Parameters:
offset
(int
) –The offset to start fetching results from.
fast_search
Skip searching un-indexed data.
This can make queries faster, but will miss any data that has not been indexed.
Tip
You can add new data into an existing index by calling AsyncTable.optimize.
with_row_id
postfilter
If this is called then filtering will happen after the search instead of before.
By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.
Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.
Post filtering happens during the "refine stage" (described in more detail in VectorQuery.refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
to_batches async
to_batches(*, max_batch_length: Optional[int] = None, timeout: Optional[timedelta] = None) -> AsyncRecordBatchReader
Execute the query and return the results as an Apache Arrow RecordBatchReader.
Parameters:
max_batch_length
(Optional[int]
, default:None
) – The maximum number of selected records in a single RecordBatch object. If not specified, a default batch length is used. It is possible for batches to be smaller than the provided length if the underlying data is stored in smaller chunks.
timeout
(Optional[timedelta]
, default:None
) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_arrow async
to_arrow(timeout: Optional[timedelta] = None) -> Table
Execute the query and collect the results into an Apache Arrow Table.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.
Parameters:
timeout
(Optional[timedelta]
, default:None
) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_list async
Execute the query and return the results as a list of dictionaries.
Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.
Parameters:
timeout
(Optional[timedelta]
, default:None
) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_pandas async
Execute the query and collect the results into a pandas DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())
Parameters:
flatten
(Optional[Union[int,bool]]
, default:None
) –If flatten is True, flatten all nested columns.If flatten is an integer, flatten the nested columns up to thespecified depth.If unspecified, do not flatten the nested columns.
timeout
(Optional[timedelta]
, default:None
) –The maximum time to wait for the query to complete.If not specified, no timeout is applied. If the query does notcomplete within the specified time, an error will be raised.
Source code inlancedb/query.py
to_polars async
Execute the query and collect the results into a Polars DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
Examples:
>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
explain_plan async
Return the execution plan for this query.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters:
verbose (bool, default: False) – Use a verbose output format.
Returns:
plan (str) – The execution plan formatted as a string.
analyze_plan async
Execute the query and return the physical execution plan with runtime metrics.
__init__
Construct an AsyncQuery
This method is not intended to be called directly. Instead, use the AsyncTable.query method to create a query.
nearest_to
nearest_to(query_vector: Union[VEC, Tuple, List[VEC]]) -> AsyncVectorQuery
Find the nearest vectors to the given query vector.
This converts the query from a plain query to a vector query.
This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.
By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.
If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.
If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".
For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.
The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.
Vector searches always have a limit. If limit has not been called then a default limit of 10 will be used.
Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. When multiple vectors are passed in, if the vector column has a multivector type, the vectors are treated as a single query. Otherwise, they are treated as multiple queries, which can be useful if you want to find the nearest vectors to multiple query vectors. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in then an additional column query_index will be added to the results. This column will contain the index of the query vector that the result is nearest to.
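A runnable sketch of a basic vector search; the table name, vectors, and parameter values are illustrative only.

import asyncio

from lancedb import connect_async

async def vector_search():
    conn = await connect_async("./.lancedb")
    table = await conn.create_table(
        "vectors",
        data=[{"vector": [1.0, 2.0]}, {"vector": [3.0, 4.0]}],
        mode="overwrite",
    )
    # nearest_to converts the plain query into a vector query; the builder
    # methods below only take effect when a matching index exists.
    results = await (
        table.query()
        .nearest_to([1.0, 2.0])
        .nprobes(20)  # only used with an IVF-based index
        .limit(5)
        .to_list()
    )
    print(results[0]["_distance"])

asyncio.run(vector_search())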
nearest_to_text
nearest_to_text(query: str | FullTextQuery, columns: Union[str, List[str], None] = None) -> AsyncFTSQuery
Find the documents that are most relevant to the given text query.
This method will perform a full text search on the table and return the most relevant documents. The relevance is determined by BM25.
The columns to search must have a native FTS index (Tantivy-based indices do not work with this method).
By default, all indexed columns are searched; for now, only one column can be searched at a time.
Parameters:
query (str | FullTextQuery) – The text query to search for.
columns (Union[str, List[str], None], default: None) – The columns to search in. If None, all indexed columns are searched. For now only one column can be searched at a time.
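A sketch of a full text search, mirroring the FTS index setup shown elsewhere in these docs; names and data are placeholders.

import asyncio

from lancedb import connect_async
from lancedb.index import FTS

async def text_search():
    conn = await connect_async("./.lancedb")
    table = await conn.create_table(
        "docs",
        data=[{"text": "hello world"}, {"text": "goodbye world"}],
        mode="overwrite",
    )
    # A native FTS index is required before nearest_to_text can be used.
    await table.create_index("text", config=FTS(with_position=False))
    results = await table.query().nearest_to_text("hello", columns="text").to_list()
    print(results)

asyncio.run(text_search())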
lancedb.query.AsyncVectorQuery
Bases: AsyncQueryBase, AsyncVectorQueryBase
column
Set the vector column to query
This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.
This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.
nprobes
Set the number of partitions to search (probe)
This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.
The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.
The partitions whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.
Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.
For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.
minimum_nprobes
Set the minimum number of probes to use.
See nprobes for more details.
These partitions will be searched on every indexed vector query and will increase recall at the expense of latency.
maximum_nprobes
Set the maximum number of probes to use.
See nprobes for more details.
If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.
This can be useful when there is a narrow filter, to allow these queries to spend more time searching and avoid potential false negatives.
If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.
distance_range
Set the distance range to use.
Only rows with distances within the range [lower_bound, upper_bound) will be returned.
Parameters:
lower_bound (Optional[float], default: None) – The lower bound of the distance range.
upper_bound (Optional[float], default: None) – The upper bound of the distance range.
Returns:
AsyncVectorQuery – The AsyncVectorQuery object.
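For instance, inside an async function where table is an open AsyncTable (as in the earlier sketches), this fragment keeps only matches whose distance falls in [0.0, 0.5); the bounds are illustrative.

# Half-open range: 0.0 <= _distance < 0.5
results = await (
    table.query()
    .nearest_to([1.0, 2.0])
    .distance_range(lower_bound=0.0, upper_bound=0.5)
    .to_list()
)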
ef
Set the number of candidates to consider during search
This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.
Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.
refine_factor
A multiplier to control how many additional rows are taken during the refine step
This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.
An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.
This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.
To refine results LanceDb will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDb then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.
Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDb still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.
Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different than the true distance between the query vector and the actual uncompressed vector.
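An illustrative fragment (inside an async function, assuming table has an IVF PQ index on its vector column):

# Select limit * refine_factor = 30 candidates from the index, fetch their
# uncompressed vectors, re-order by true distance, and keep the nearest 10.
results = await (
    table.query()
    .nearest_to([1.0, 2.0])
    .refine_factor(3)
    .limit(10)
    .to_list()
)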
distance_type
Set the distance metric to use
When performing a vector search we try and find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See the vector index documentation (for example, the distance_type option of the IVF PQ index) for more details on the different distance metrics available.
Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.
By default "l2" is used.
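For example, a fragment (inside an async function with table an open AsyncTable); remember the metric must match the one used to train any existing index:

# Use cosine distance instead of the default "l2".
results = await (
    table.query()
    .nearest_to([1.0, 2.0])
    .distance_type("cosine")
    .to_list()
)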
bypass_vector_index
If this is called then any vector index is skipped
An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
where
Only return rows matching the given predicate
The predicate should be supplied as an SQL query string.
Examples:
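A hedged fragment (inside an async function; the price and label columns are hypothetical and table is an open AsyncTable):

# The predicate is plain SQL over the table's columns.
results = await (
    table.query()
    .where("price > 10.0 AND label = 'cat'")
    .to_list()
)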
Filtering performance can often be improved by creating a scalar index on the filter column(s).
select
Return only the specified columns.
By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.
As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.
You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).
To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.
For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.
Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
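For instance, a fragment showing both forms (inside an async function; the column names a, b, and c are placeholders):

# A list selects existing columns as-is.
rows = await table.query().select(["a", "b"]).to_list()

# A dict creates dynamic columns from SQL expressions, like
# SELECT a + b AS combined, c in SQL.
rows = await table.query().select({"combined": "a + b", "c": "c"}).to_list()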
limit
Set the maximum number of results to return.
By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.
offset
Set the offset for the results.
Parameters:
offset (int) – The offset to start fetching results from.
fast_search
Skip searching un-indexed data.
This can make queries faster, but will miss any data that has not been indexed.
Tip
You can add new data into an existing index by calling AsyncTable.optimize.
with_row_id
postfilter
If this is called then filtering will happen after the search instead of before.
By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.
Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.
Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
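A sketch combining postfiltering with a higher refine factor (inside an async function; the category column is hypothetical):

# Run the ANN search first, then filter the (small) result set.
results = await (
    table.query()
    .nearest_to([1.0, 2.0])
    .where("category = 'news'")
    .postfilter()
    .refine_factor(2)  # extra candidates help offset rows lost to the filter
    .to_list()
)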
to_arrow async
to_arrow(timeout: Optional[timedelta] = None) -> Table
Execute the query and collect the results into an Apache Arrow Table.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_list async
Execute the query and return the results as a list of dictionaries.
Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_pandas async
Execute the query and collect the results into a pandas DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())
Parameters:
flatten (Optional[Union[int, bool]], default: None) – If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_polars async
Execute the query and collect the results into a Polars DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
Examples:
>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
explain_plan async
Return the execution plan for this query.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters:
verbose (bool, default: False) – Use a verbose output format.
Returns:
plan (str) – The execution plan formatted as a string.
analyze_plan async
Execute the query and return the physical execution plan with runtime metrics.
__init__
Construct an AsyncVectorQuery
This method is not intended to be called directly. Instead, create a query first with AsyncTable.query and then use AsyncQuery.nearest_to to convert it to a vector query. Or you can use AsyncTable.vector_search.
nearest_to_text
nearest_to_text(query: str | FullTextQuery, columns: Union[str, List[str], None] = None) -> AsyncHybridQuery
Find the documents that are most relevant to the given text query, in addition to the vector search.
This converts the vector query into a hybrid query.
This search will perform a full text search on the table and return the most relevant documents, combined with the vector query results. The text relevance is determined by BM25.
The columns to search must have a native FTS index (Tantivy-based indices do not work with this method).
By default, all indexed columns are searched; for now, only one column can be searched at a time.
Parameters:
query (str | FullTextQuery) – The text query to search for.
columns (Union[str, List[str], None], default: None) – The columns to search in. If None, all indexed columns are searched. For now only one column can be searched at a time.
lancedb.query.AsyncFTSQuery
Bases: AsyncQueryBase
A query for full text search for LanceDB.
where
Only return rows matching the given predicate
The predicate should be supplied as an SQL query string.
Filtering performance can often be improved by creating a scalar index on the filter column(s).
select
Return only the specified columns.
By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.
As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.
You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).
To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.
For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.
Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
limit
Set the maximum number of results to return.
By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.
offset
Set the offset for the results.
Parameters:
offset (int) – The offset to start fetching results from.
fast_search
Skip searching un-indexed data.
This can make queries faster, but will miss any data that has not been indexed.
Tip
You can add new data into an existing index by calling AsyncTable.optimize.
with_row_id
postfilter
If this is called then filtering will happen after the search instead of before.
By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.
Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.
Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
to_arrow async
to_arrow(timeout: Optional[timedelta] = None) -> Table
Execute the query and collect the results into an Apache Arrow Table.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_list async
Execute the query and return the results as a list of dictionaries.
Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_pandas async
Execute the query and collect the results into a pandas DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())
Parameters:
flatten (Optional[Union[int, bool]], default: None) – If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_polars async
Execute the query and collect the results into a Polars DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
Examples:
>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
explain_plan async
Return the execution plan for this query.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters:
verbose (bool, default: False) – Use a verbose output format.
Returns:
plan (str) – The execution plan formatted as a string.
analyze_plan async
Execute the query and return the physical execution plan with runtime metrics.
nearest_to
nearest_to(query_vector: Union[VEC, Tuple, List[VEC]]) -> AsyncHybridQuery
In addition to doing a text search on the LanceDB Table, also find the nearest vectors to the given query vector.
This converts the query from an FTS query to a hybrid query. Results from the vector search will be combined with results from the FTS query.
This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.
By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.
If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.
If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".
For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.
The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.
Hybrid searches always have a limit. If limit has not been called then a default limit of 10 will be used.
Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. This can be useful if you want to find the nearest vectors to multiple query vectors. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in then an additional column query_index will be added to the results. This column will contain the index of the query vector that the result is nearest to.
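A runnable sketch of building a hybrid query from an FTS query; the table contents and names are illustrative.

import asyncio

from lancedb import connect_async
from lancedb.index import FTS

async def hybrid_search():
    conn = await connect_async("./.lancedb")
    table = await conn.create_table(
        "docs",
        data=[{"vector": [1.0, 2.0], "text": "hello world"}],
        mode="overwrite",
    )
    await table.create_index("text", config=FTS(with_position=False))
    # Start with a text query, then add a vector query to make it hybrid.
    results = await (
        table.query()
        .nearest_to_text("hello")
        .nearest_to([1.0, 2.0])
        .to_list()
    )
    print(results)

asyncio.run(hybrid_search())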
lancedb.query.AsyncHybridQuery
Bases: AsyncQueryBase, AsyncVectorQueryBase
A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses the reciprocal rank fusion score for reranking.
To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.
column
Set the vector column to query
This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.
This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.
nprobes
Set the number of partitions to search (probe)
This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.
The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.
The partitions whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.
Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.
For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.
minimum_nprobes
Set the minimum number of probes to use.
See nprobes for more details.
These partitions will be searched on every indexed vector query and will increase recall at the expense of latency.
maximum_nprobes
Set the maximum number of probes to use.
See nprobes for more details.
If this value is greater than minimum_nprobes then the excess partitions will be searched only if we have not found enough results.
This can be useful when there is a narrow filter, to allow these queries to spend more time searching and avoid potential false negatives.
If this value is 0 then no limit will be applied and all partitions could be searched if needed to satisfy the limit.
distance_range
Set the distance range to use.
Only rows with distances within the range [lower_bound, upper_bound) will be returned.
Parameters:
lower_bound (Optional[float], default: None) – The lower bound of the distance range.
upper_bound (Optional[float], default: None) – The upper bound of the distance range.
Returns:
AsyncVectorQuery – The AsyncVectorQuery object.
ef
Set the number of candidates to consider during search
This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.
Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.
refine_factor
A multiplier to control how many additional rows are taken during the refine step
This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.
An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.
This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.
To refine results LanceDb will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDb then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.
Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDb still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.
Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different than the true distance between the query vector and the actual uncompressed vector.
distance_type
Set the distance metric to use
When performing a vector search we try and find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See the vector index documentation (for example, the distance_type option of the IVF PQ index) for more details on the different distance metrics available.
Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.
By default "l2" is used.
bypass_vector_index
If this is called then any vector index is skipped
An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
where
Only return rows matching the given predicate
The predicate should be supplied as an SQL query string.
Filtering performance can often be improved by creating a scalar index on the filter column(s).
select
Return only the specified columns.
By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.
As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.
You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).
To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.
For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.
Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
limit
Set the maximum number of results to return.
By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.
offset
Set the offset for the results.
Parameters:
offset (int) – The offset to start fetching results from.
fast_search
Skip searching un-indexed data.
This can make queries faster, but will miss any data that has not been indexed.
Tip
You can add new data into an existing index by calling AsyncTable.optimize.
with_row_id
postfilter
If this is called then filtering will happen after the search instead of before.
By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.
Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.
Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
to_arrow async
to_arrow(timeout: Optional[timedelta] = None) -> Table
Execute the query and collect the results into an Apache Arrow Table.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_list async
Execute the query and return the results as a list of dictionaries.
Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_pandas async
Execute the query and collect the results into a pandas DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())
Parameters:
flatten (Optional[Union[int, bool]], default: None) – If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
to_polars async
Execute the query and collect the results into a Polars DataFrame.
This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.
Parameters:
timeout (Optional[timedelta], default: None) – The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
Examples:
>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())
rerank
rerank(reranker: Reranker = RRFReranker(), normalize: str = 'score') -> AsyncHybridQuery
Rerank the hybrid search results using the specified reranker. The reranker must be an instance of the Reranker class.
Parameters:
reranker (Reranker, default: RRFReranker()) – The reranker to use. Must be an instance of the Reranker class.
normalize (str, default: 'score') – The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.
Returns:
AsyncHybridQuery – The AsyncHybridQuery object.
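For example, a fragment that swaps score normalization for rank normalization (inside an async function; it assumes the default reranker is importable as lancedb.rerankers.RRFReranker and that table has both a vector column and an FTS index):

from lancedb.rerankers import RRFReranker

# Combine vector and FTS results with reciprocal rank fusion,
# normalizing by rank instead of raw score.
results = await (
    table.query()
    .nearest_to_text("hello")
    .nearest_to([1.0, 2.0])
    .rerank(RRFReranker(), normalize="rank")
    .to_list()
)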
explain_plan async
Return the execution plan for this query.
The output includes both the vector and FTS search plans.
Examples:
>>> import asyncio
>>> from lancedb import connect_async
>>> from lancedb.index import FTS
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
...     await table.create_index("text", config=FTS(with_position=False))
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
Vector Search Plan:
ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
  Take: columns="vector, _rowid, _distance, (text)"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=10
        FilterExec: _distance@2 IS NOT NULL
          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
FTS Search Plan:
ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]
  Take: columns="_rowid, _score, (vector), (text)"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=10
        MatchQuery: query=hello
Parameters:
verbose (bool, default: False) – Use a verbose output format.
Returns:
plan (str) – The execution plan formatted as a string.
analyze_plan async
Execute the query and return the physical execution plan with runtime metrics.
This runs both the vector and FTS (full-text search) queries and returns detailed metrics for each step of execution, such as rows processed, elapsed time, and I/O stats. It is useful for debugging and performance analysis.
Returns:
plan (str) – The annotated execution plan with runtime metrics.