Strange performance with index vs no index on 50K vectors data set #2382

Unanswered
s-h-a-d-o-w asked this question in Q&A

I recently compared LanceDB to some other solutions using VectorDBBench. I'll just share the results rendered using my own UI. I'm sure you'll see what I mean: there are unexpected differences basically across the board:
VectorDBBench UI.pdf

While I was working on the LanceDB integration for the benchmark, the naming and data for some things changed slightly, which is why the labels are not always consistent. You can still infer from the results that anything labeled no index... was run without an index, and anything labeled autoindex used an index created with no parameters provided during index creation.

You can find the implementation used in the benchmark here: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/clients/lancedb/lancedb.py


Replies: 1 comment 6 replies


Thanks for doing this. A few questions to clarify the setup:

  • Is the data stored on local SSD or in an object store, e.g., S3?
  • What is the dataset size in your benchmark? Small (1M) / Large (100M)?
  • When running with filters, have you created a scalar index on the id column yet?
  • Is Table.optimize() called before the benchmark?

The QPS / latency seems a bit off if this is a 1M/10M 768D dataset on disk.

@s-h-a-d-o-w

I just realized that I forgot to select only "id" in my implementation: zilliztech/VectorDBBench#525

Even so, with the index the latency remains high. (It dropped from ~540 ms to ~430 ms.)

@eddyxu

430 ms latency over 50K vectors on local SSD is abnormally high, though.

@davidmyriel @AyushExel @BubbleCal could you take a look?

For reference, a brute-force scan over that data on my MacBook Pro might be even faster.
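
To make the brute-force baseline concrete, here is a minimal sketch of an exact top-k scan. It is not the benchmark's or LanceDB's code: the function name `brute_force_top_k` is hypothetical, the distance metric (squared L2) is an assumption, and pure Python is used only to keep the example self-contained. A vectorized implementation would be needed for realistic timings on 50K vectors.

```python
import heapq


def brute_force_top_k(vectors, query, k):
    """Return the indices of the k vectors closest to `query` (squared L2).

    Hypothetical helper for illustration; every vector is scored, so the
    result is exact and needs no index.
    """
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Score every row, then keep the k smallest (distance, index) pairs.
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]
```

Because the scan is exact, its results can also serve as ground truth when checking the recall of an indexed search.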

@BubbleCal
Comment options

It seems that num_partitions and num_sub_vectors are not set properly, and the index on the id column is missing.

@s-h-a-d-o-w
Comment options

> the index on id column is missed

As mentioned above, that's not relevant for this particular test. There is no filtering by id.

> it seems the num_partitions and num_sub_vectors are not set properly

Could you please elaborate on that? I would expect queries per second to be higher with an index than without one, regardless of whether the config is ideal. Also, given that the default num_partitions is excessive for the small data set in this test, I would think that recall shouldn't be this low.
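
To show what these two parameters control, here is a rough sketch that assumes an IVF_PQ-style index with 8-bit PQ codes and float32 vectors. The thread does not confirm the index type. The helper `ivf_pq_footprint` is hypothetical and is not part of LanceDB.

```python
def ivf_pq_footprint(num_rows, dim, num_partitions, num_sub_vectors, bits_per_code=8):
    """Describe what a (num_partitions, num_sub_vectors) choice implies.

    Hypothetical helper; assumes IVF_PQ with `bits_per_code` bits per
    sub-vector code and float32 source vectors.
    """
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    return {
        # Average number of rows that land in each IVF partition.
        "rows_per_partition": num_rows / num_partitions,
        # How many original dimensions each PQ code has to summarize.
        "dims_per_sub_vector": dim // num_sub_vectors,
        # Size of one compressed vector versus the raw float32 vector.
        "pq_bytes_per_vector": num_sub_vectors * bits_per_code // 8,
        "raw_bytes_per_vector": dim * 4,
    }
```

With illustrative values of 50,000 rows, 768 dimensions, 256 partitions and 96 sub-vectors (not confirmed as the defaults in this thread), this gives about 195 rows per partition and 96 bytes per compressed vector against 3,072 bytes raw. Partitions that small, combined with lossy codes, are one plausible reason both speed and recall could suffer on a data set of this size.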

@wjones127
Comment options

I agree we could make defaults much better, especially if we had a way to ask the user their desired in-sample recall. I wrote this up here: lance-format/lance#4094
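
For readers unfamiliar with the metric, here is a minimal sketch of how recall at k could be measured. It assumes that "in-sample" means the queries are drawn from the indexed rows themselves, and that exact neighbours are available from a brute-force scan. The function name `recall_at_k` is hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours returned by the approximate search.

    Hypothetical helper: `approx_ids` comes from the indexed query and
    `exact_ids` from an exact (brute-force) scan, both ordered by distance.
    """
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)
```

Averaging this value over a sample of queries gives a single recall figure that can be compared across index configurations.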
