Looking for the bottleneck in a 25M dataset #2608

valkum started this conversation in General

Hey, we are evaluating LanceDB to replace our internal in-memory solution for better scalability.
Our test dataset contains 25M rows with 6 columns (String, String, Bool, Bool, String, Vector).
We are currently trying to find the best index settings, as the default values resulted in poor query performance (>60s per query).

Our current index is constructed as follows:

lancedb::index::vector::IvfHnswPqIndexBuilder::default()
    // Our embeddings are normalized. Thus cosine similarity is the dot product.
    .distance_type(lancedb::DistanceType::Dot)
    // HNSW settings we currently use
    .ef_construction(100)
    .num_edges(32)
    // We tried the default but analyze reported a huge scan
    .num_partitions(16)
    // Optimal for 384 dims
    .num_sub_vectors(24),
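
For completeness, this builder gets applied via create_index; a minimal sketch, assuming an open lancedb Connection `db`, a table named "domains", and a vector column named "vector" (all placeholder names), with the create_index / Index::IvfHnswPq API quoted from memory rather than from the code actually used:

    use lancedb::index::Index;
    use lancedb::index::vector::IvfHnswPqIndexBuilder;

    // Sketch: build the IVF_HNSW_PQ index with the settings above.
    // "domains" / "vector" are placeholder table and column names.
    let table = db.open_table("domains").execute().await?;
    table
        .create_index(
            &["vector"],
            Index::IvfHnswPq(
                IvfHnswPqIndexBuilder::default()
                    .distance_type(lancedb::DistanceType::Dot)
                    .ef_construction(100)
                    .num_edges(32)
                    .num_partitions(16)
                    .num_sub_vectors(24),
            ),
        )
        .execute()
        .await?;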

The test was done on an M1 Max MacBook Pro with 64 GB RAM.
analyze_plan reported this:

AnalyzeExec verbose=true, metrics=[]
  TracedExec, metrics=[]
    ProjectionExec: expr=[registrable@2 as registrable, etld@3 as etld, is_market@4 as is_market, is_expiring@5 as is_expiring, domain@6 as domain, vector@7 as vector, _distance@0 as _distance], metrics=[output_rows=10, elapsed_compute=3.791µs]
      Take: columns="_distance, _rowid, (registrable), (etld), (is_market), (is_expiring), (domain), (vector)", metrics=[output_rows=10, elapsed_compute=3.163863667s, batches_processed=1, bytes_read=23222, iops=92, requests=77]
        CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=10, elapsed_compute=3.376µs]
          GlobalLimitExec: skip=0, fetch=10, metrics=[output_rows=10, elapsed_compute=2.042µs]
            SortExec: TopK(fetch=10), expr=[_distance@0 ASC NULLS LAST], preserve_partitioning=[false], metrics=[output_rows=10, elapsed_compute=129.75µs, row_replacements=29]
              ANNSubIndex: name=vector_idx, k=10, deltas=1, metrics=[output_rows=160, elapsed_compute=111.847289373s, index_comparisons=0, indices_loaded=0, partitions_searched=16, parts_loaded=9]
                ANNIvfPartition: uuid=e76bf63c-8e7b-421c-a5f8-fe2f13fd9f12, minimum_nprobes=20, maximum_nprobes=Some(20), deltas=1, metrics=[output_rows=1, elapsed_compute=18.042µs, deltas_searched=1, index_comparisons=0, indices_loaded=0, partitions_ranked=16, parts_loaded=0]

Note the ANNSubIndex step, which takes 111.847289373s. We got similar results with twice as many partitions or with the default of 7.5 (computed by suggested_num_partitions_for_hnsw).
The real (wall-clock) time is more like 50s (I guess whatever measures time in ANNSubIndex uses CPU time instead of real time; a rough timing sketch is below).
A Python-based usearch index over the same dataset, backed by SQLite for the payload, resolves these queries in <2s.
Something seems off with our LanceDB test, but we can't figure out what it is.
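
For reference, a rough sketch of timing one query wall-clock, independent of the plan's elapsed_compute metrics. `table` (an open lancedb Table) and `query_vec` (a 384-dim Vec<f32>) are placeholders, and the query-builder traits and methods are quoted from the Rust SDK as remembered, so treat the exact signatures as assumptions:

    use std::time::Instant;
    use futures::TryStreamExt;                        // for try_collect on the result stream
    use lancedb::query::{ExecutableQuery, QueryBase}; // traits providing limit()/execute()

    // Sketch: wall-clock timing of a single ANN query.
    let start = Instant::now();
    let batches = table
        .query()
        .nearest_to(query_vec.as_slice())?  // top-k vector search on the "vector" column
        .limit(10)
        .execute()
        .await?
        .try_collect::<Vec<_>>()
        .await?;
    let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    println!("rows={rows}, wall-clock={:?}", start.elapsed());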

Do you have any idea?


Replies: 3 comments 4 replies

@valkum

I now suspect the individual partitions were too big to stay in cache and large enough to take a substantial amount of time to load.
With num_partitions(1) we get usearch-like speeds at the cost of having a single HNSW index in memory (which is partially OK for our use case, especially with SQ and PQ).
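
For reference, that variant corresponds to the builder from the original post with only the partition count changed (a sketch, not necessarily the exact code that was run):

    // Single IVF partition, i.e. one HNSW graph over the whole dataset
    // (sketch of the configuration described above).
    lancedb::index::vector::IvfHnswPqIndexBuilder::default()
        .distance_type(lancedb::DistanceType::Dot)
        .ef_construction(100)
        .num_edges(32)
        .num_partitions(1)   // one partition: the entire sub-index must fit in cache/memory
        .num_sub_vectors(24)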

0 replies
@michael-lancedb

@valkum the biggest red flag to me here is your IVF config ...

With only 16 IVF partitions and nprobes=20, you’re effectively probing all lists on every query. That defeats IVF’s selectivity and forces many sub-index loads (9 in this run), which is the real cost (I/O & cold cache).

However, nprobes = 1 will likely not be accurate enough, because a single partition doesn't give comprehensive coverage.
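
For context, nprobes is set on the query rather than on the index; a rough sketch (the VectorQuery::nprobes method name is assumed from the Rust SDK, and `table` / `query_vec` are placeholders):

    use lancedb::query::{ExecutableQuery, QueryBase};

    // Sketch: with num_partitions = 16, nprobes = 20 means every IVF list is
    // scanned; with a larger partition count the same nprobes touches only a
    // small fraction of the lists.
    let results = table
        .query()
        .nearest_to(query_vec.as_slice())?
        .nprobes(20)   // number of IVF partitions to search per query
        .limit(10)
        .execute()
        .await?;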

Does any of this ring true for you?

2 replies
@valkum

I mean, the number of IVF partitions is supposed to be in that range when using HNSW. The default for an HNSW index is 7 for 25M rows with a dimension of 384. The nprobes value is also the default.

Which values would you suggest for 25M or more rows with a dimension of 384?

@michael-lancedb

I enjoyed getting into more details on this with you yesterday. Thanks for the continuing updates!

@valkum

I ran some tests against our dataset with the following configurations today.

This time on an Apple M1 Max with 64 GB RAM.

1000 Partitions

With an increased number of partitions:

IvfHnswPqIndexBuilder::default()
    .distance_type(lancedb::DistanceType::Cosine)
    .ef_construction(128)
    .num_edges(32)
    .num_partitions(1000)

a single analyze looks like:

AnalyzeExec verbose=true, metrics=[], cumulative_cpu=467.131375ms
  TracedExec, metrics=[], cumulative_cpu=467.131375ms
    ProjectionExec: expr=[registrable@2 as registrable, etld@3 as etld, is_market@4 as is_market, is_expiring@5 as is_expiring, domain@6 as domain, vector@7 as vector, _distance@0 as _distance], metrics=[output_rows=10, elapsed_compute=1.708µs], cumulative_cpu=467.131375ms
      Take: columns="_distance, _rowid, (registrable), (etld), (is_market), (is_expiring), (domain), (vector)", metrics=[output_rows=10, elapsed_compute=454.167µs, batches_processed=1, bytes_read=0, iops=0, requests=0], cumulative_cpu=467.129667ms
        CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=10, elapsed_compute=4.21µs], cumulative_cpu=466.6755ms
          SortExec: TopK(fetch=10), expr=[_distance@0 ASC NULLS LAST, _rowid@1 ASC NULLS LAST], preserve_partitioning=[false], filter=[_distance@0 < 0.83736986 OR _distance@0 = 0.83736986 AND _rowid@1 < 38655517994], metrics=[output_rows=10, elapsed_compute=171.831µs, row_replacements=20], cumulative_cpu=466.67129ms
            ANNSubIndex: name=vector_idx, k=10, deltas=1, metrics=[output_rows=200, elapsed_compute=466.379709ms, index_comparisons=0, indices_loaded=0, partitions_searched=20, parts_loaded=19], cumulative_cpu=466.499459ms
              ANNIvfPartition: uuid=718b0094-eaef-4960-a9a5-ad0f60ba83f4, minimum_nprobes=20, maximum_nprobes=Some(20), deltas=1, metrics=[output_rows=1, elapsed_compute=119.75µs, deltas_searched=1, index_comparisons=0, indices_loaded=0, partitions_ranked=1000, parts_loaded=0], cumulative_cpu=119.75µs

For 111 test queries with 10 iterations each, we got a median latency of 5ms and a p99 of 578ms.

1 Probe

With the number of nprobes decreased to 1:

IvfHnswPqIndexBuilder::default()
    .distance_type(lancedb::DistanceType::Cosine)
    .ef_construction(128)
    .num_edges(32),

a single analyze looks like:

AnalyzeExec verbose=true, metrics=[], cumulative_cpu=5.735049958s
  TracedExec, metrics=[], cumulative_cpu=5.735049958s
    ProjectionExec: expr=[registrable@2 as registrable, etld@3 as etld, is_market@4 as is_market, is_expiring@5 as is_expiring, domain@6 as domain, vector@7 as vector, _distance@0 as _distance], metrics=[output_rows=5, elapsed_compute=1.75µs], cumulative_cpu=5.735049958s
      Take: columns="_distance, _rowid, (registrable), (etld), (is_market), (is_expiring), (domain), (vector)", metrics=[output_rows=5, elapsed_compute=478.667µs, batches_processed=1, bytes_read=0, iops=0, requests=0], cumulative_cpu=5.735048208s
        CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=5, elapsed_compute=4.249µs], cumulative_cpu=5.734569541s
          SortExec: TopK(fetch=5), expr=[_distance@0 ASC NULLS LAST, _rowid@1 ASC NULLS LAST], preserve_partitioning=[false], filter=[_distance@0 < 0.85125685 OR _distance@0 = 0.85125685 AND _rowid@1 < 30065359933], metrics=[output_rows=5, elapsed_compute=40.501µs, row_replacements=5], cumulative_cpu=5.734565292s
            ANNSubIndex: name=vector_idx, k=5, deltas=1, metrics=[output_rows=5, elapsed_compute=5.734424s, index_comparisons=0, indices_loaded=0, partitions_searched=1, parts_loaded=1], cumulative_cpu=5.734524791s
              ANNIvfPartition: uuid=a3d7a4a7-28a7-4682-8462-1882713b7ca0, minimum_nprobes=1, maximum_nprobes=Some(1), deltas=1, metrics=[output_rows=1, elapsed_compute=100.791µs, deltas_searched=1, index_comparisons=0, indices_loaded=0, partitions_ranked=7, parts_loaded=0], cumulative_cpu=100.791µs

For 5 test queries (they were slow, so we reduced the amount) with 10 iterations each, we got a median latency of 4ms and a p99 of 8s 960ms.

I will run some more tests by the end of Monday, including:

  • increasing the index_cache_size to 20GB (see the sketch after this list)
  • using the complete default values (I noted that ef_construction is 300 by default, which suggests the default index should have better quality).
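
A sketch of where that cache setting would go, assuming the OpenTableBuilder::index_cache_size option in the Rust SDK; as far as I know it takes a number of cache entries rather than a byte size, so a "20GB" budget has to be translated into an entry count:

    // Sketch: open the table with a larger index cache. "domains" is a placeholder
    // table name and 4096 an arbitrary entry count chosen for illustration.
    let table = db
        .open_table("domains")
        .index_cache_size(4096)   // unit is cache entries, not bytes
        .execute()
        .await?;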
2 replies
@michael-lancedb

Thanks for these updates, interesting results so far. I suspect the biggest difference in p99 between those tests is that with only 1 nprobe and 5x10 tests over 1000 partitions you weren't able to get much cache advantage, but even so, 8s over that dataset seems excessive.

If you have the appetite for it, it would also be interesting to see how the same tests perform when using an IVF-PQ index rather than the IVF-HNSW.
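
For reference, an IVF-PQ variant of the same index would look roughly like this (a sketch; the 1000-partition count simply mirrors the earlier IVF-HNSW-PQ test and is not a recommendation, and `table` is a placeholder):

    use lancedb::index::Index;
    use lancedb::index::vector::IvfPqIndexBuilder;

    // Sketch: IVF_PQ comparison index with settings mirroring the HNSW test above.
    table
        .create_index(
            &["vector"],
            Index::IvfPq(
                IvfPqIndexBuilder::default()
                    .distance_type(lancedb::DistanceType::Cosine)
                    .num_partitions(1000)
                    .num_sub_vectors(24),
            ),
        )
        .execute()
        .await?;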

@valkum

I will get back to you with those numbers. So far I was only able to run the test with the complete default values, which showed timings similar to those in the initial post.
I hope to find time to run these soon.

2 participants: @valkum, @michael-lancedb
