[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Please cite with the following BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
- Create the Python environment.
```shell
conda create -n ret -y --no-default-packages python==3.10.16
conda activate ret
```
- Install PyTorch.
```shell
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
```
- Install faiss-gpu.
```shell
conda install -n ret -y -c conda-forge faiss-gpu==1.7.4
```
- Clone the repo and install other dependencies.
```shell
git clone https://github.com/aimagelab/ReT.git
cd ReT
pip install -r requirements.txt
```
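To verify the installation, the following quick sanity check (a minimal sketch; adjust to your setup) confirms that PyTorch can see the GPU and that the faiss-gpu build imports correctly:

```python
# Sanity check: verify PyTorch, CUDA availability, and the faiss-gpu build.
import torch
import faiss

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__, "| GPUs visible to faiss:", faiss.get_num_gpus())
```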
ReT model checkpoints are available on Hugging Face. You can use these checkpoints directly for retrieval tasks or fine-tune them to suit your specific retrieval needs.
Model | WIT Recall@10 | IGLUE Recall@1 | KVQA Recall@5 | OVEN Recall@5 | LLaVA Recall@1 | InfoSeek Recall@5 | InfoSeek Pseudo Recall@5 | EVQA Recall@5 | EVQA Pseudo Recall@5 | OKVQA Recall@5 | OKVQA Pseudo Recall@5 |
---|---|---|---|---|---|---|---|---|---|---|---|
ReT-CLIP-ViT-L-14🤗 | 0.734 | 0.818 | 0.635 | 0.820 | 0.799 | 0.470 | 0.605 | 0.445 | 0.579 | 0.202 | 0.662 |
ReT-OpenCLIP-ViT-H-14🤗 | 0.714 | 0.800 | 0.593 | 0.830 | 0.798 | 0.473 | 0.607 | 0.448 | 0.578 | 0.182 | 0.634 |
ReT-OpenCLIP-ViT-G-14🤗 | 0.751 | 0.822 | 0.606 | 0.840 | 0.792 | 0.520 | 0.625 | 0.486 | 0.602 | 0.190 | 0.638 |
You can download the ReT-M2KR benchmark by following the instructions provided here. This dataset is used for training and evaluating ReT in multimodal information retrieval and includes images (coming soon) and JSONL files.
The ReT-M2KR benchmark is an extended version of the M2KR dataset, with the following modifications:
- MSMARCO data is excluded, as it does not contain query images
- Passage images have been added to the OVEN, InfoSeek, E-VQA, and OKVQA datasets
For further details, please refer to the associated research paper.
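As a minimal sketch of how the benchmark annotations can be inspected once downloaded (the path below is hypothetical, and field names vary across datasets), each line of a JSONL file is a standalone JSON object:

```python
# Minimal sketch: inspect a ReT-M2KR annotation file (path is hypothetical).
import json

jsonl_path = "path/to/ret-m2kr/some_split.jsonl"  # replace with one of the downloaded JSONL files

with open(jsonl_path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records")
print("fields of the first record:", sorted(records[0].keys()))
```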
The following example shows how to load a pretrained ReT checkpoint and extract query and passage features:

```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])

# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()
p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
```
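The extracted query and passage features are sets of 32 token-level vectors of dimension 128. Below is a sketch of how such multi-vector features can be compared with ColBERT-style late interaction (max similarity over passage tokens, summed over query tokens); this is illustrative only, and the released evaluation scripts may compute relevance differently.

```python
# Sketch of ColBERT-style late-interaction (MaxSim) scoring between multi-vector
# query and passage features of shape [batch, num_tokens, dim], e.g. [1, 32, 128].
# Illustrative only; the released evaluation code may score relevance differently.
import torch
import torch.nn.functional as F

def late_interaction_score(q_feats: torch.Tensor, p_feats: torch.Tensor) -> torch.Tensor:
    # q_feats: [Bq, Nq, D], p_feats: [Bp, Np, D]
    sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)  # token-level similarities
    # Best-matching passage token per query token, summed over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)  # [Bq, Bp]

# Stand-ins for the outputs of get_ret_features (use the real features in practice).
q = F.normalize(torch.randn(1, 32, 128), dim=-1)
p = F.normalize(torch.randn(4, 32, 128), dim=-1)
print(late_interaction_score(q, p).shape)  # torch.Size([1, 4])
```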
To evaluate ReT on the M2KR benchmark, we provide SLURM script examples here. These scripts handle both the indexing and searching stages.
Make sure to set `JSONL_ROOT_PATH` and `IMAGE_ROOT_PATH` to the directories where the JSONL files and images have been downloaded.
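As a rough sketch of what the indexing and searching stages do with FAISS (simplified to one vector per item and a flat inner-product index; the actual scripts may use different index types and a late-interaction re-ranking step):

```python
# Simplified sketch of the index/search stages with FAISS. Illustrative only;
# the provided SLURM scripts may use different index types and scoring.
import faiss
import numpy as np

dim = 128
passage_vecs = np.random.randn(1000, dim).astype('float32')  # stand-in for passage features
query_vecs = np.random.randn(5, dim).astype('float32')       # stand-in for query features
faiss.normalize_L2(passage_vecs)  # in-place L2 normalization, so inner product = cosine
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)              # exact inner-product index
index.add(passage_vecs)
scores, ids = index.search(query_vecs, 10)  # top-10 passages per query
print(ids.shape)  # (5, 10)
```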
If the inference script gets stuck while indexing, try clearing the PyTorch extensions cache and re-running:

```shell
rm -rf ~/.cache/torch_extensions
```
We thank the teams behind ColBERT, PreFLMR, and UniIR for open-sourcing their models, datasets, and code.