Full-text search (Native FTS)
LanceDB provides support for full-text search via Lance, allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
Note
The Python SDK uses tantivy-based FTS by default; pass use_tantivy=False to use native FTS.
Example
Consider a LanceDB table named my_table, whose string column text we want to index and query via keyword search. The FTS index must be created before you can search via keywords.
# Python (sync API)
import lancedb
from lancedb.index import FTS

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
    "my_table_fts",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)

# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
# Python (async API)
import lancedb
from lancedb.index import FTS

uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)

async_tbl = await async_db.create_table(
    "my_table_fts_async",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)

# async API uses our native FTS algorithm
await async_tbl.create_index("text", config=FTS())
await (await async_tbl.search("puppy")).select(["text"]).limit(10).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
// TypeScript
import * as lancedb from "@lancedb/lancedb";

const uri = "data/sample-lancedb";
const db = await lancedb.connect(uri);

const data = [
  { vector: [3.1, 4.1], text: "Frodo was a happy puppy" },
  { vector: [5.9, 26.5], text: "There are several kittens playing" },
];
const tbl = await db.createTable("my_table", data, { mode: "overwrite" });

// Build the FTS index on the text column, then run a keyword query.
await tbl.createIndex("text", {
  config: lancedb.Index.fts(),
});

await tbl
  .search("puppy", "fts")
  .select(["text"])
  .limit(10)
  .toArray();
// Rust
let uri = "data/sample-lancedb";
let db = connect(uri).execute().await?;
let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
let tbl = db
    .create_table("my_table", initial_data)
    .execute()
    .await?;

// Build the FTS index on the text column, then run a keyword query.
tbl.create_index(&["text"], Index::FTS(FtsIndexBuilder::default()))
    .execute()
    .await?;

tbl.query()
    .full_text_search(FullTextSearchQuery::new("puppy".to_owned()))
    .select(lancedb::query::Select::Columns(vec!["text".to_owned()]))
    .limit(10)
    .execute()
    .await?;
By default, the search runs on all indexed columns, which is useful when there are multiple indexed columns.
Pass fts_columns="text" if you want to restrict the search to specific columns.
Note
LanceDB automatically searches on the existing FTS index if the input to the search is of type str. If you provide a vector as input, LanceDB will search the ANN index instead.
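The _score value in the example output above can be reproduced by hand, which helps when reading BM25 scores. The following is a minimal sketch, assuming the common BM25 form with k1 = 1.2 and b = 0.75 and a simple lowercase word tokenizer. These parameters and the helper names (tokenize, bm25_score) are assumptions made for illustration, not a description of Lance's internals.

```python
import math
import re

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    # k1 and b are conventional defaults, assumed here for illustration.
    docs = [tokenize(d) for d in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_tokens = tokenize(doc)
    score = 0.0
    for term in tokenize(query):
        n = sum(1 for d in docs if term in d)  # documents containing the term
        if n == 0:
            continue
        idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
        tf = doc_tokens.count(term)
        norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

corpus = ["Frodo was a happy puppy", "There are several kittens playing"]
score = bm25_score("puppy", corpus[0], corpus)
print(round(score, 6))  # 0.693147
```

With two documents of five tokens each and the term present in only one of them, the length normalization cancels out and the score reduces to ln 2, which agrees with the example output up to single-precision rounding.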
Tokenization
By default, the text is tokenized by splitting on punctuation and whitespace, words longer than 40 characters are filtered out, and all words are lowercased.
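The default behaviour described above can be approximated in a few lines. This is only a sketch of the three steps (split, length filter, lowercase); the function name simple_tokenize and the regular expression are assumptions, and the real tokenizer may treat edge cases such as underscores or apostrophes differently.

```python
import re

MAX_TOKEN_LENGTH = 40  # tokens longer than this are dropped

def simple_tokenize(text, max_len=MAX_TOKEN_LENGTH, lower_case=True):
    # Split on whitespace and punctuation by keeping runs of word characters.
    tokens = re.findall(r"\w+", text)
    # Drop overly long tokens (for example hashes or encoded blobs).
    tokens = [t for t in tokens if len(t) <= max_len]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    return tokens

tokens = simple_tokenize("Frodo was a happy puppy! " + "x" * 41)
print(tokens)  # ['frodo', 'was', 'a', 'happy', 'puppy']
```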
Stemming improves search results by reducing words to their root form, e.g. "running" to "run". LanceDB supports stemming for multiple languages. To enable it, specify the tokenizer name using the pattern tokenizer_name="{language_code}_stem" when creating the FTS index, e.g. tokenizer_name="en_stem" for English.
Stemming is limited to the set of currently supported languages.
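To see why stemming changes which documents match, consider the toy example below. The function toy_stem is a deliberately naive, hypothetical suffix stripper written for this illustration; it is not the stemmer LanceDB uses, and real stemmers apply many more rules.

```python
def toy_stem(word):
    # Deliberately naive suffix stripping; real stemmers use many more rules.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def matches(query, text, stem=False):
    # Normalize both the query and the document terms the same way.
    norm = toy_stem if stem else (lambda w: w)
    doc_terms = {norm(w) for w in text.lower().split()}
    return norm(query.lower()) in doc_terms

text = "Frodo was running in the park"
print(matches("run", text))             # False
print(matches("run", text, stem=True))  # True
```

Without stemming, the query term must appear verbatim in the text. With stemming, the query and the indexed words are reduced to the same root before they are compared.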
The tokenizer is customizable: you can specify how it splits the text, how it filters out words, and so on.
For example, for languages with accents, you can configure the tokenizer to use ascii_folding to remove accents, e.g. 'é' to 'e'.
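The effect of accent removal can be sketched with Python's standard unicodedata module. This illustrates the idea only; the helper name ascii_fold is hypothetical, and an actual ASCII-folding filter typically uses mapping tables that also cover characters without a Unicode decomposition (such as 'ø' or 'ß'), which this sketch leaves unchanged.

```python
import unicodedata

def ascii_fold(text):
    # Decompose characters (e.g. "é" -> "e" + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("café crème"))  # cafe creme
```

If the same folding is applied at index time and at query time, a search for "cafe" can match text that contains "café".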
Filtering
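LanceDB full-text search supports filtering the search results by a condition. Both pre-filtering and post-filtering are supported.
This can be invoked via the familiar where syntax; in the Python SDK, the prefilter argument of where selects between the two modes.
With pre-filtering, the condition is applied before the keyword search runs, so only rows that satisfy it are searched.
With post-filtering, the keyword search runs first and the condition is then applied to its results.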
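The practical difference between the two modes shows up when a limit is involved. The sketch below uses plain Python lists and made-up scores rather than LanceDB itself; every name in it (rows, keyword_search, search_prefilter, search_postfilter) is hypothetical.

```python
# Made-up rows with pre-assigned relevance scores, for illustration only.
rows = [
    {"text": "happy puppy", "category": "dog", "score": 0.9},
    {"text": "puppy toy", "category": "shop", "score": 0.8},
    {"text": "sleepy puppy", "category": "dog", "score": 0.7},
    {"text": "puppy food", "category": "shop", "score": 0.6},
]

def keyword_search(rows, term, limit):
    # Stand-in for a ranked FTS query: match on the term, sort by score.
    hits = [r for r in rows if term in r["text"].split()]
    return sorted(hits, key=lambda r: r["score"], reverse=True)[:limit]

def search_prefilter(rows, term, predicate, limit):
    # Filter first, then search only the rows that satisfy the condition.
    return keyword_search([r for r in rows if predicate(r)], term, limit)

def search_postfilter(rows, term, predicate, limit):
    # Search first, then drop results that fail the condition.
    return [r for r in keyword_search(rows, term, limit) if predicate(r)]

def is_shop(row):
    return row["category"] == "shop"

pre = search_prefilter(rows, "puppy", is_shop, limit=2)
post = search_postfilter(rows, "puppy", is_shop, limit=2)
print([r["text"] for r in pre])   # ['puppy toy', 'puppy food']
print([r["text"] for r in post])  # ['puppy toy']
```

Post-filtering can return fewer rows than the limit, because the limit is applied before the condition. Pre-filtering fills the limit from the rows that satisfy the condition.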
Phrase queries vs. terms queries
Warning
Lance-based FTS doesn't support queries using the boolean operators OR and AND.
For full-text search you can specify either a phrase query like "the old man and the sea", or a terms search query like old man sea. For more details on the terms query syntax, see Tantivy's query parser rules.
To search for a phrase, the index must be created with with_position=True.
This will allow you to search for phrases, but it will also significantly increase the index size and indexing time.
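The reason phrase search needs positions, and why storing them makes the index larger, can be seen in a small inverted-index sketch. This is a conceptual illustration with hypothetical helper names, not Lance's index format. A terms query is assumed here to match documents containing any of the terms.

```python
from collections import defaultdict

def build_index(docs, with_position=True):
    # term -> {doc_id: [positions]}; positions stay empty when not requested.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            positions = index[term].setdefault(doc_id, [])
            if with_position:
                positions.append(pos)
    return index

def terms_search(index, query):
    # Any document containing at least one query term matches.
    hits = set()
    for term in query.lower().split():
        hits |= set(index.get(term, {}))
    return sorted(hits)

def phrase_search(index, phrase):
    # All terms must occur at consecutive positions, so positions are required.
    terms = phrase.lower().split()
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        if any(
            all(s + i in index[t][doc_id] for i, t in enumerate(terms))
            for s in starts
        ):
            hits.append(doc_id)
    return sorted(hits)

docs = ["the old man and the sea", "the sea was old", "a young man"]
index = build_index(docs)
print(terms_search(index, "old man sea"))  # [0, 1, 2]
print(phrase_search(index, "old man"))     # [0]
```

An index built without positions still answers terms queries, but it has no way to check that the words are adjacent. The extra list of positions per term and document is what increases index size and build time.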
Incremental indexing
LanceDB supports incremental indexing, which means you can add new records to the table without reindexing the entire table.
This can make queries more efficient, especially when the table is large and the number of new records is relatively small.
Note
New data added after creating the FTS index will appear in search results while incremental indexing is still in progress, but with increased latency due to a flat search on the unindexed portion. LanceDB Cloud automates this merging process, minimizing the impact on search speed.
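The behaviour described in the note can be modelled with a toy class: an index that covers a prefix of the rows, plus a linear scan over whatever was added afterwards. The class and its method names (IncrementalFtsSketch, add, optimize, search) are hypothetical and exist only to illustrate the idea; this is not how LanceDB is implemented.

```python
class IncrementalFtsSketch:
    def __init__(self):
        self.rows = []         # all rows, in insertion order
        self.index = {}        # term -> set of row ids, covers rows[:indexed_upto]
        self.indexed_upto = 0  # rows at or beyond this offset are unindexed

    def add(self, text):
        # New rows are searchable right away, but not yet in the index.
        self.rows.append(text)

    def optimize(self):
        # Fold only the unindexed tail into the index; no full rebuild.
        for row_id in range(self.indexed_upto, len(self.rows)):
            for term in self.rows[row_id].lower().split():
                self.index.setdefault(term, set()).add(row_id)
        self.indexed_upto = len(self.rows)

    def search(self, term):
        term = term.lower()
        indexed_hits = set(self.index.get(term, set()))
        # Flat (linear) scan over the unindexed tail: correct but slower.
        flat_hits = {
            row_id
            for row_id in range(self.indexed_upto, len(self.rows))
            if term in self.rows[row_id].lower().split()
        }
        return sorted(indexed_hits | flat_hits)

tbl = IncrementalFtsSketch()
tbl.add("Frodo was a happy puppy")
tbl.optimize()
tbl.add("Another puppy arrived")
before = (tbl.search("puppy"), tbl.indexed_upto)
print(before)  # ([0, 1], 1)
tbl.optimize()
after = (tbl.search("puppy"), tbl.indexed_upto)
print(after)   # ([0, 1], 2)
```

The search results are the same before and after merging. What changes is how much of the table has to be scanned row by row, which is why latency grows with the size of the unindexed portion.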