[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Please cite with the following BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
- Create the Python environment.
```shell
conda create -n ret -y --no-default-packages python==3.10.16
conda activate ret
```
- Install PyTorch.
```shell
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
```
- Install faiss-gpu.
```shell
conda install -n ret -y -c conda-forge faiss-gpu==1.7.4
```
- Clone the repo and install other dependencies.
```shell
git clone https://github.com/aimagelab/ReT.git
cd ReT
pip install -r requirements.txt
```
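To verify the installation, the following quick sanity check (a minimal sketch; adjust to your setup) confirms that PyTorch can see the GPU and that the faiss-gpu build imports correctly:

```python
# Sanity check: verify PyTorch, CUDA availability, and the faiss-gpu build.
import torch
import faiss

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__, "| GPUs visible to faiss:", faiss.get_num_gpus())
```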
ReT model checkpoints are available on Hugging Face. You can use these checkpoints directly for retrieval tasks or fine-tune them to suit your specific retrieval needs.
Model | WIT Recall@10 | IGLUE Recall@1 | KVQA Recall@5 | OVEN Recall@5 | LLaVA Recall@1 | InfoSeek Recall@5 | InfoSeek Pseudo Recall@5 | EVQA Recall@5 | EVQA Pseudo Recall@5 | OKVQA Recall@5 | OKVQA Pseudo Recall@5 |
---|---|---|---|---|---|---|---|---|---|---|---|
ReT-CLIP-ViT-L-14🤗 | 0.734 | 0.818 | 0.635 | 0.820 | 0.799 | 0.470 | 0.605 | 0.445 | 0.579 | 0.202 | 0.662 |
ReT-OpenCLIP-ViT-H-14🤗 | 0.714 | 0.800 | 0.593 | 0.830 | 0.798 | 0.473 | 0.607 | 0.448 | 0.578 | 0.182 | 0.634 |
ReT-OpenCLIP-ViT-G-14🤗 | 0.751 | 0.822 | 0.606 | 0.840 | 0.792 | 0.520 | 0.625 | 0.486 | 0.602 | 0.190 | 0.638 |
You can download the ReT-M2KR benchmark by following the instructions provided here. This dataset is used for training and evaluating ReT in multimodal information retrieval and includes images (coming soon) and JSONL files.
The ReT-M2KR benchmark is an extended version of the M2KR dataset, with the following modifications:
- MSMARCO data is excluded, as it does not contain query images
- Passage images have been added to the OVEN, InfoSeek, E-VQA, and OKVQA datasets
For further details, please refer to the associated research paper.
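As a minimal sketch of how the benchmark annotations can be inspected once downloaded (the path below is hypothetical, and field names vary across datasets), each line of a JSONL file is a standalone JSON object:

```python
# Minimal sketch: inspect a ReT-M2KR annotation file (path is hypothetical).
import json

jsonl_path = "path/to/ret-m2kr/some_split.jsonl"  # replace with one of the downloaded JSONL files

with open(jsonl_path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records")
print("fields of the first record:", sorted(records[0].keys()))
```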
The following example shows how to load a pretrained ReT checkpoint and extract query and passage features:

```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])

# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()
p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
```
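The extracted query and passage features are sets of 32 token-level vectors of dimension 128. Below is a sketch of how such multi-vector features can be compared with ColBERT-style late interaction (max similarity over passage tokens, summed over query tokens); this is illustrative only, and the released evaluation scripts may compute relevance differently.

```python
# Sketch of ColBERT-style late-interaction (MaxSim) scoring between multi-vector
# query and passage features of shape [batch, num_tokens, dim], e.g. [1, 32, 128].
# Illustrative only; the released evaluation code may score relevance differently.
import torch
import torch.nn.functional as F

def late_interaction_score(q_feats: torch.Tensor, p_feats: torch.Tensor) -> torch.Tensor:
    # q_feats: [Bq, Nq, D], p_feats: [Bp, Np, D]
    sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)  # token-level similarities
    # Best-matching passage token per query token, summed over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)  # [Bq, Bp]

# Stand-ins for the outputs of get_ret_features (use the real features in practice).
q = F.normalize(torch.randn(1, 32, 128), dim=-1)
p = F.normalize(torch.randn(4, 32, 128), dim=-1)
print(late_interaction_score(q, p).shape)  # torch.Size([1, 4])
```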
To evaluate ReT on the M2KR benchmark, we provide SLURM script examples here. These scripts handle both the indexing and searching stages.
Make sure to set `JSONL_ROOT_PATH` and `IMAGE_ROOT_PATH` to the directories where the JSONL files and images have been downloaded.
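As a rough sketch of what the indexing and searching stages do with FAISS (simplified to one vector per item and a flat inner-product index; the actual scripts may use different index types and a late-interaction re-ranking step):

```python
# Simplified sketch of the index/search stages with FAISS. Illustrative only;
# the provided SLURM scripts may use different index types and scoring.
import faiss
import numpy as np

dim = 128
passage_vecs = np.random.randn(1000, dim).astype('float32')  # stand-in for passage features
query_vecs = np.random.randn(5, dim).astype('float32')       # stand-in for query features
faiss.normalize_L2(passage_vecs)  # in-place L2 normalization, so inner product = cosine
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)              # exact inner-product index
index.add(passage_vecs)
scores, ids = index.search(query_vecs, 10)  # top-10 passages per query
print(ids.shape)  # (5, 10)
```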
If the inference script gets stuck while indexing, try clearing the PyTorch extensions cache and re-running:

```shell
rm -rf ~/.cache/torch_extensions
```
We thank the teams behind ColBERT, PreFLMR, and UniIR for open-sourcing their models, datasets, and code.