Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Filtering

Pre and post-filtering

LanceDB supports filtering of query results based on metadata fields. By default, post-filtering isperformed on the top-k results returned by the vector search. However, pre-filtering is also anoption that performs the filter prior to vector search. This can be useful to narrow downthe search space of a very large dataset to reduce query latency.

Note that both pre-filtering and post-filtering can yield false positives. For pre-filtering, if the filter is too selective, it might eliminate relevant items that the vector search would have otherwise identified as a good match. In this case, increasingnprobes parameter will help reduce such false positives. It is recommended to callbypass_vector_index() if you know that the filter is highly selective.

Similarly, a highly selective post-filter can lead to false positives. Increasing bothnprobes andrefine_factor can mitigate this issue. When deciding between pre-filtering and post-filtering, pre-filtering is generally the safer choice if you're uncertain.

# Synchronous clientresult=tbl.search([0.5,0.2]).where("id = 10",prefilter=True).limit(1).to_arrow()# Asynchronous clientresult=awaitasync_tbl.query().where("id = 10").nearest_to([0.5,0.2]).limit(1).to_arrow()
const_result=awaittbl.search(Array(1536).fill(0.5)).limit(1).where("id = 10").toArray();
letresult=awaittbl.search(Array(1536).fill(0.5)).limit(1).filter("id = 10").prefilter(true).execute();

Note

Creating ascalar index accelerates filtering.

SQL filters

Because it's built on top ofDataFusion, LanceDBembraces the utilization of standard SQL expressions as predicates for filtering operations.SQL can be used during vector search, update, and deletion operations.

LanceDB supports a growing list of SQL expressions:

  • >,>=,<,<=,=
  • AND,OR,NOT
  • IS NULL,IS NOT NULL
  • IS TRUE,IS NOT TRUE,IS FALSE,IS NOT FALSE
  • IN
  • LIKE,NOT LIKE
  • CAST
  • regexp_match(column, pattern)
  • DataFusion Functions

For example, the following filter string is acceptable:

# Synchronous clienttbl.search([100,102]).where("(item IN ('item 0', 'item 2')) AND (id > 10)").to_arrow()# Asynchronous clientawait(async_tbl.query().where("(item IN ('item 0', 'item 2')) AND (id > 10)").nearest_to([100,102]).to_arrow())
constresult=await(tbl.search(Array(1536).fill(0))aslancedb.VectorQuery).where("(item IN ('item 0', 'item 2')) AND (id > 10)").postfilter().toArray();
awaittbl.search(Array(1536).fill(0)).where("(item IN ('item 0', 'item 2')) AND (id > 10)").execute();

If your column name contains special characters, upper-case characters, or is aSQL Keyword,you can use backtick (`) to escape it. For nested fields, each segment of thepath must be wrapped in backticks.

`CUBE`=10AND`UpperCaseName`='3'AND`columnnamewithspace`ISNOTNULLAND`nestedwithspace`.`innerwithspace`<2

Field names containing periods (.) are not supported.

Literals for dates, timestamps, and decimals can be written by writing the stringvalue after the type name. For example:

date_col=date'2021-01-01'andtimestamp_col=timestamp'2021-01-01 00:00:00'anddecimal_col=decimal(8,3)'1.000'

For timestamp columns, the precision can be specified as a number in the typeparameter. Microsecond precision (6) is the default.

SQLTime unit
timestamp(0)Seconds
timestamp(3)Milliseconds
timestamp(6)Microseconds
timestamp(9)Nanoseconds

LanceDB internally stores data inApache Arrow format.The mapping from SQL types to Arrow types is:

SQL typeArrow type
booleanBoolean
tinyint /tinyint unsignedInt8 /UInt8
smallint /smallint unsignedInt16 /UInt16
int orinteger /int unsigned orinteger unsignedInt32 /UInt32
bigint /bigint unsignedInt64 /UInt64
floatFloat32
doubleFloat64
decimal(precision, scale)Decimal128
dateDate32
timestampTimestamp1
stringUtf8
binaryBinary

Filtering without Vector Search

You can also filter your data without search:

# Synchronous clienttbl.search().where("id = 10").limit(10).to_arrow()# Asynchronous clientawaitasync_tbl.query().where("id = 10").limit(10).to_arrow()
awaittbl.query().where("id = 10").limit(10).toArray();
awaittbl.filter("id = 10").limit(10).execute();

If your table is large, this could potentially return a very large amount of data. Please be sure to use alimit clause unless you're sure you want to return the whole result set.


  1. See precision mapping in previous table. 


[8]ページ先頭

©2009-2025 Movatter.jp