
Multivector Search: Efficient Document Retrieval with ColPali and LanceDB

Modern documents—PDFs, scans, forms, invoices, or scientific diagrams—rely heavily on visual elements like tables, figures, and spatial layouts to convey meaning. Retrieving context from these documents poses unique challenges:

  • 🖼️ Loss of Context: Plain-text extraction destroys critical visual relationships (e.g., a table's structure or a diagram's annotations).
  • 🧩 Multi-Modal Complexity: Layouts combine text, images, and structured elements that require joint understanding.
  • 📏 Scale vs. Precision: Balancing pixel-perfect accuracy with efficient search across millions of documents.

Why Traditional Methods Fail

The traditional method is a brittle, multi-stage pipeline in which visual context is eroded at every step, so retrieval becomes a "best guess" based on partial text. It usually involves the following steps:

  1. OCR Text Extraction - extract raw text from scanned PDFs/images.
  2. Layout Detection - use models like LayoutLM or rule-based tools to segment pages into regions (titles, tables, figures).
  3. Structure Reconstruction - use heuristic rules or ML models to infer reading order and hierarchy.
  4. Optional: Image/Table Captioning - apply vision-language models (e.g., GPT-4V) to describe figures/tables in natural language.
  5. Text Chunking - split text into fixed-size chunks or "semantic" passages (e.g., by paragraphs).
  6. Embedding & Indexing - use text-based embeddings (e.g., BERT) and store them in a vector DB (e.g., LanceDB).

Our Approach: ColPali with XTR for performant retrieval

ColPali (Contextualized Late Interaction Over PaliGemma) enhances document retrieval by combining a vision-language model (VLM) with a multi-vector late interaction framework inspired by ColBERT. In this framework, documents and queries are each encoded as a collection of contextualized vectors: document vectors are precomputed and indexed ahead of time, while query vectors are computed at search time. Unlike traditional methods, late interaction defers the similarity computation between query and document vectors until the final retrieval stage, enabling nuanced semantic matching while maintaining efficiency.
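To make the late interaction idea concrete, here is a minimal NumPy sketch of ColBERT-style MaxSim scoring: each query vector is matched to its most similar document vector, and those maxima are summed. The function name maxsim_score and the toy 2-dimensional vectors are illustrative assumptions, not part of ColPali or LanceDB.

```python
import numpy as np


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    query_vecs has shape (n_query_tokens, dim) and doc_vecs has shape
    (n_doc_tokens, dim). Hypothetical helper for illustration only.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # (n_query_tokens, n_doc_tokens)
    # For each query vector keep its best-matching document vector, then sum.
    return float(sim.max(axis=1).sum())


# Toy data: 2 query vectors, documents with 3 and 1 vectors.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
doc_b = np.array([[1.0, 1.0]])

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

doc_a scores higher because it contains a separate, well-aligned vector for each query vector, whereas doc_b has to match both query vectors with a single compromise vector.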

To further accelerate retrieval, we integrate XTR (ConteXtualized Token Retriever), which prioritizes critical document tokens during the initial retrieval stage and removes the gathering stage, significantly improving performance. By focusing on the most semantically salient tokens early in the process, XTR reduces computational complexity while improving recall, so candidate documents are identified quickly.
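The sketch below illustrates the general idea behind XTR-style scoring as we understand it: documents are scored using only the tokens returned by the initial token retrieval, and a query token with no retrieved match in a document falls back to an imputed score (here, the lowest score among that query token's retrieved tokens). This is a simplified brute-force illustration under those assumptions, not LanceDB's implementation; the name xtr_style_scores and the toy data are hypothetical.

```python
import numpy as np


def xtr_style_scores(query_vecs, doc_token_vecs, doc_ids, k):
    """Score documents from the top-k retrieved tokens only (no gathering).

    query_vecs:     (n_query_tokens, dim)
    doc_token_vecs: (n_total_doc_tokens, dim), tokens of all documents stacked
    doc_ids:        (n_total_doc_tokens,), owning document of each token
    """
    sim = query_vecs @ doc_token_vecs.T  # (n_query_tokens, n_total_doc_tokens)
    n_q = sim.shape[0]
    # Token retrieval: brute-force top-k per query token.
    # A real system would use an approximate nearest-neighbor index here.
    top_idx = np.argsort(-sim, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)
    # Imputed score per query token: the lowest score among its retrieved tokens.
    impute = top_sim[:, -1]

    scores = {}
    for doc in np.unique(doc_ids[top_idx]):
        total = 0.0
        for i in range(n_q):
            hits = top_sim[i][doc_ids[top_idx[i]] == doc]
            # Use the best retrieved token of this document, else the imputed score.
            total += hits.max() if hits.size else impute[i]
        scores[int(doc)] = float(total / n_q)
    return scores


# Toy data: 3 documents, 5 document tokens in total, 2 query tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.1], [0.1, 0.8]])
owners = np.array([0, 0, 1, 1, 2])

scores = xtr_style_scores(query, tokens, owners, k=2)
best_doc = max(scores, key=scores.get)
print(scores, best_doc)
```

Unlike the full MaxSim computation, this never loads the remaining token vectors of a candidate document, which is the step the gathering stage would otherwise perform.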

We use the UFO dataset, which is rich in tables, images, and text, to demonstrate how to efficiently retrieve documents with ColPali and LanceDB.

Step 1: Install Required Libraries

In [ ]:
!pip install lancedb colpali-engine datasets tqdm

Step 2: Load the UFO dataset

The UFO dataset has 2243 rows in total, and each embedding vector has 128 dimensions. The example document below shows how text and images are blended on a single page.

In [ ]:
from math import sqrt

import pyarrow as pa
from tqdm import tqdm
import lancedb
from datasets import load_dataset
from colpali_engine.models import ColPali, ColPaliProcessor
import torch

dataset = load_dataset("davanstrien/ufo-ColPali", split="train")
dataset
dataset[333]["image"]
Out[ ]:
[Image: the sample document page dataset[333]]

Step 3: Load the ColPali model

Note: select "cuda" if you are using an NVIDIA GPU or "cpu" if there is no GPU available. Mac users, please use "mps". This step can take a few minutes.
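The device choice described in the note can be written as a small helper. This is a minimal sketch; pick_device is a hypothetical name, and it takes plain booleans so that it can be run without a GPU or PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string to use for device_map (hypothetical helper)."""
    if cuda_available:
        return "cuda"  # NVIDIA GPU
    if mps_available:
        return "mps"  # Mac
    return "cpu"  # no GPU available


# With PyTorch installed, the two flags would typically come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(pick_device(cuda_available=False, mps_available=True))
```

The returned string can then be passed as device_map when loading the model.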

In [ ]:
# load the model
colpali_model = ColPali.from_pretrained(
    "davanstrien/finetune_colpali_v1_2-ufo-4bit",
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # use "mps" on a Mac, or "cpu" if you don't have any GPU
)
colpali_processor = ColPaliProcessor.from_pretrained(
    "vidore/colpaligemma-3b-pt-448-base"
)

Step 4: Connect to LanceDB

In [ ]:
# create the lancedb table
# each row stores a list of 128-dimensional float32 vectors
multivector_type = pa.list_(pa.list_(pa.float32(), 128))
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field("vector", multivector_type),
    ]
)
db = lancedb.connect("my_database")
table = db.create_table("ufo", schema=schema, exist_ok=True)

⚠️ WARNING: LONG EMBEDDING & INGESTION STEP
❗ Skip this cell unless you want to re-run the full embedding process.


Why?
Embedding the UFO dataset and ingesting it into LanceDB takes ~2 hours on a T4 GPU. To save time:

  • Use the pre-prepared table with the index already created (provided below) to proceed directly to Step 7: search.
  • Step 5a contains the full ingestion code for reference (run it only if necessary).
  • Step 6 contains the details on creating the index on the multivector column.

Step 5: Open the LanceDB table

In [ ]:
!wget http://vectordb-recipes.s3.us-west-2.amazonaws.com/multivector_example.zip
!unzip multivector_example.zip
!rm multivector_example.zip
In [ ]:
db = lancedb.connect("multivector_example")
table_name = "ufo"
table = db.open_table(table_name)

Step 7: Retrieve documents from a query

In [ ]:
# search the table
queries = ["alien", "crop circles", "unidentified"]
image_results = []
for query_text in queries:
    # encode the query into a (num_tokens, 128) multivector
    batch_query = colpali_processor.process_queries([query_text]).to(colpali_model.device)
    with torch.no_grad():  # inference only, no gradients needed
        query = colpali_model(**batch_query)[0].cpu().float().numpy()
    print(f"query shape: {query.shape}")

    # search the table and keep the top-ranked document
    results = table.search(query).select(["id"]).limit(5).to_arrow()
    doc_id = results["id"][0].as_py()  # renamed from id to avoid shadowing the built-in
    image = dataset[doc_id]["image"]
    image_results.append(image)
query shape: (15, 128)
query shape: (16, 128)
query shape: (15, 128)

Let's see the retrieved documents!

In [ ]:
import matplotlib.pyplot as plt

for image in image_results:
    plt.figure()
    plt.imshow(image)
[Images: the top-ranked document page for each of the three queries]

Step 5a: Embed the UFO dataset and ingest data into LanceDB

Note: This step takes up to 2 hours on a T4 GPU with batch_size=4. You can increase batch_size to speed up the process if more memory is available; for example, batch_size=32 requires 60GB of memory.

In [ ]:
# this would take 40 mins for first run on Apple M3 Max, may be longer if you are using CPU
batch_size = 4  # lower it if your GPU has little memory
with tqdm(total=len(dataset), desc="ingesting") as pbar:
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i : i + batch_size]
        images = batch["image"]

        # encode the images
        with torch.no_grad():
            batch_images = colpali_processor.process_images(images).to(
                colpali_model.device
            )
            image_embeddings = colpali_model(**batch_images)

        real_size = len(images)  # the last batch may be smaller than batch_size
        multivector = image_embeddings.cpu().float().numpy()
        multivector = pa.array(multivector.tolist(), type=multivector_type)
        data = pa.Table.from_pydict(
            {
                "id": list(range(i, i + real_size)),
                "vector": multivector,
            }
        )
        table.add(data)
        pbar.update(real_size)
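As a quick sanity check on the loop above, the batch count and the size of the final batch follow directly from the dataset size. This sketch assumes the 2243 rows stated earlier and batch_size=4; the variable names are illustrative.

```python
from math import ceil

num_docs = 2243  # rows in the UFO dataset, as stated earlier
batch_size = 4

num_batches = ceil(num_docs / batch_size)
last_batch_size = num_docs - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 561 3
```

Because 2243 is not a multiple of 4, the final batch holds only 3 images, which is why the loop derives the ids from the actual number of images in the batch rather than from batch_size.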

Step 6: Create an index on the multivector column

Note: LanceDB Cloud automatically infers the multivector column directly from the schema. If your dataset contains only one column with a list of vectors, no manual specification is required when building the vector index—the system handles this implicitly.

In [ ]:
num_rows = table.count_rows()
table.create_index(
    metric="cosine",  # for now only cosine is supported for multivector
    # it's recommended to set the number of partitions to the square root of the
    # total number of embeddings; 1030 is the number of embeddings per document
    num_partitions=int(sqrt(num_rows * 1030)),
    num_sub_vectors=32,  # higher for accuracy, lower for speed
    index_type="IVF_PQ",
)
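For a sense of scale, the partition count can be worked out by hand. This sketch assumes the table holds all 2243 documents with 1030 embeddings each, as stated in this notebook; the raw size estimate covers only the float32 vector values and ignores any storage overhead.

```python
from math import sqrt

num_rows = 2243            # documents in the UFO dataset
embeddings_per_doc = 1030  # embeddings per document, from the comment above
dim = 128                  # dimensions per embedding

total_embeddings = num_rows * embeddings_per_doc
num_partitions = int(sqrt(total_embeddings))
raw_bytes = total_embeddings * dim * 4  # float32 takes 4 bytes per value

print(total_embeddings)  # 2310290
print(num_partitions)    # 1519
print(raw_bytes)         # 1182868480, roughly 1.18 GB
```

So although the table has only a few thousand rows, the index is built over roughly 2.3 million vectors, which is why the partition count is derived from the number of embeddings rather than the number of rows.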
