lancedb/lancedb (Public)

Strange performance with index vs no index on 50K vectors data set#2382

Unanswered

s-h-a-d-o-w asked this question in Q&A

s-h-a-d-o-w

May 9, 2025

1 comment · 6 replies

I recently compared LanceDB to some other solutions using VectorDBBench. I'll just share the results, rendered using my own UI. I'm sure you'll see what I mean: there are unexpected differences basically across the board:
VectorDBBench UI.pdf

While I was working on the LanceDB integration for the benchmark, the naming and data for some things changed slightly, which is why the labels are not always consistent. But you can infer from the results that anything labeled "no index..." was run without an index, and everything labeled "autoindex" used an index created with no parameters provided during index creation.

You can find the implementation used in the benchmark here: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/clients/lancedb/lancedb.py

Replies: 1 comment · 6 replies

eddyxu
May 15, 2025
Maintainer

Thanks for doing this. A few questions to clarify the setup:

Is the data stored on local SSD or on an object store, e.g., S3?
What is the dataset size in your benchmark? Small (1M) / Large (100M)?
When running with filters, have you created a scalar index on the id column yet?
Is Table.optimize() called before the benchmark?
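For readers following along, the last two questions can be sketched as a small setup step. This is only an illustration, not the benchmark's actual code: the helper name `prepare_table_for_benchmark` is hypothetical, and `table` is assumed to be an already opened LanceDB Python table whose id column is named `id`.

```python
def prepare_table_for_benchmark(table, id_column="id"):
    """Apply the two setup steps asked about above to a LanceDB table handle.

    `table` is assumed to be a table object returned by the LanceDB Python
    client; the helper name and the default column name are illustrative.
    """
    # Build a scalar index so that filters on the id column can use it.
    table.create_scalar_index(id_column)
    # Optimize on-disk data and indices before taking measurements.
    table.optimize()
    return table
```

Both steps only need to run once after loading, before any queries are timed.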

The QPS / latency seems a bit off if this is a 1M/10M 768D dataset on disk.

s-h-a-d-o-w May 15, 2025
Author

I just realized that I forgot to select only "id" in my implementation: zilliztech/VectorDBBench#525

But even so, with the index, latency remains high. (It dropped from ~540ms to ~430ms.)
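A minimal sketch of what "select only id" looks like with the LanceDB Python query builder, so the vector column is not materialized for every hit. The helper name `search_ids` and the default `k` are assumptions for illustration; the actual change is in the linked PR.

```python
def search_ids(table, query_vector, k=100, id_column="id"):
    """Return only the ids of the k nearest rows.

    `table` is assumed to be a LanceDB Python table; the helper name and
    default values are illustrative, not taken from the benchmark client.
    """
    rows = (
        table.search(query_vector)
        .select([id_column])  # project just the id column
        .limit(k)
        .to_list()
    )
    return [row[id_column] for row in rows]
```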

eddyxu May 15, 2025
Maintainer

430ms latency over 50K vectors on local SSD is abnormally high tho.

@davidmyriel @AyushExel @BubbleCal could you take a look?

For reference, a brute-force scan over that data on my MacBook Pro might be even faster.
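To make "brute-force scan" concrete, here is a small exact nearest-neighbor baseline in plain Python. It is a sketch only: it assumes squared L2 distance, the function name is made up for this example, and a real baseline would normally use a vectorized library rather than Python loops.

```python
import heapq


def brute_force_top_k(vectors, query, k):
    """Return the indices of the k vectors closest to `query` (squared L2).

    Every vector is compared against the query, so the result is exact and
    can serve as the ground truth when measuring the recall of an index.
    """
    def squared_l2(vector):
        return sum((a - b) ** 2 for a, b in zip(vector, query))

    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: squared_l2(vectors[i]))
```

Timing a scan like this over the same 50K vectors gives a floor that any index configuration should be compared against.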

BubbleCal May 29, 2025
Collaborator

It seems that num_partitions and num_sub_vectors are not set properly, and the index on the id column is missing.
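For context, a sketch of passing both parameters explicitly when creating the vector index through the LanceDB Python API. The helper name, the `dim` argument, and the default metric are assumptions made for this example; no specific values are recommended here, and the metric should match whatever the benchmark case uses.

```python
def build_vector_index(table, dim, num_partitions, num_sub_vectors, metric="cosine"):
    """Create a vector index with explicit partition and sub-vector counts.

    `table` is assumed to be a LanceDB Python table holding `dim`-dimensional
    vectors. The helper name and the default metric are illustrative.
    """
    # Product quantization splits each vector into equally sized sub-vectors,
    # so the dimension has to be divisible by num_sub_vectors.
    if dim % num_sub_vectors != 0:
        raise ValueError(
            f"dim={dim} is not divisible by num_sub_vectors={num_sub_vectors}"
        )
    table.create_index(
        metric=metric,
        num_partitions=num_partitions,
        num_sub_vectors=num_sub_vectors,
    )
```

A call such as `build_vector_index(table, dim=768, num_partitions=16, num_sub_vectors=48)` uses placeholder numbers only; they are not tuned for this data set.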

s-h-a-d-o-w May 29, 2025
Author

> the index on id column is missed

As mentioned above, that's not relevant for this particular test. There is no filtering by id.

> it seems the num_partitions and num_sub_vectors are not set properly

Could you please elaborate on that? Because I would expect that the queries per second should still be higher than without index, regardless of whether the config is ideal. Also, given that the num_partitions default is excessive for the small data set in this test, I would think that recall shouldn't be this low.
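The arithmetic behind this expectation can be sketched as follows. It assumes an IVF-style index with equally sized partitions, which real data will not match exactly, and the function name and the numbers in the usage note are illustrative rather than confirmed defaults.

```python
def ivf_scan_estimate(num_rows, num_partitions, nprobes):
    """Estimate how much of the data an IVF query touches.

    Assumes rows are spread evenly across partitions (a simplification).
    Returns (rows per partition, rows scanned per query, fraction scanned).
    """
    rows_per_partition = num_rows / num_partitions
    probed = min(nprobes, num_partitions)  # cannot probe more partitions than exist
    rows_scanned = probed * rows_per_partition
    return rows_per_partition, rows_scanned, probed / num_partitions
```

For example, `ivf_scan_estimate(50_000, 256, 20)` gives roughly 195 rows per partition and about 3,900 rows (under 8% of the data) scanned per query. That is why a query through the index would be expected to do less distance work than a full scan, while recall depends on how many partitions are probed and on the quantization.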

wjones127 Jun 26, 2025
Maintainer

I agree we could make defaults much better, especially if we had a way to ask the user their desired in-sample recall. I wrote this up here: lance-format/lance#4094
