LAVIS - A One-stop Library for Language-Vision Intelligence

License

BSD-3-Clause license

Folders and files

.github/workflows
app
assets
dataset_card
docs
examples
lavis
projects
run_scripts
tests/models
.gitignore
.pre-commit-config.yaml
CODEOWNERS
CODE_OF_CONDUCT.md
LICENSE.txt
MANIFEST.in
README.md
SECURITY.md
evaluate.py
pyproject.toml
requirements.txt
setup.py
train.py


Benchmark, Technical Report, Documentation, Jupyter Notebook Examples, Blog

LAVIS - A Library for Language-Vision Intelligence

What's New: 🎉

[Model Release] November 2023, released implementation of X-InstructBLIP
Paper, Project Page, Website

A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.

[Model Release] July 2023, released implementation of BLIP-Diffusion
Paper, Project Page, Website

A text-to-image generation model that trains 20x faster than DreamBooth. It also facilitates zero-shot subject-driven generation and editing.

[Model Release] May 2023, released implementation of InstructBLIP
Paper, Project Page

A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.

[Model Release] Jan 2023, released implementation of BLIP-2
Paper, Project Page

A generic and efficient pre-training strategy that leverages off-the-shelf pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new state of the art on zero-shot captioning (121.6 CIDEr on NoCaps vs the previous best of 113.2). In addition, equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 unlocks new zero-shot instructed vision-to-language generation capabilities for various applications!

Jan 2023, LAVIS is now available on PyPI for installation!
[Model Release] Dec 2022, released implementation of Img2LLM-VQA (CVPR 2023, "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", by Jiaxian Guo et al)
Paper, Project Page

A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training!

[Model Release] Oct 2022, released implementation of PNP-VQA (EMNLP Findings 2022, "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", by Anthony T.M.H. et al)
Paper, Project Page

A modular zero-shot VQA framework that requires no PLM training, achieving SoTA zero-shot VQA performance.

Technical Report and Citing LAVIS

You can find more details in our technical report.

If you're using LAVIS in your research or applications, please cite it using this BibTeX:

    @inproceedings{li-etal-2023-lavis,
        title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
        author = "Li, Dongxu and
          Li, Junnan and
          Le, Hung and
          Wang, Guangsen and
          Savarese, Silvio and
          Hoi, Steven C.H.",
        booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.acl-demo.3",
        pages = "31--41",
        abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
    }

Introduction

LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets. It features a unified interface design to access:

10+ tasks (retrieval, captioning, visual question answering, multimodal classification, etc.);
20+ datasets (COCO, Flickr, NoCaps, ConceptualCaptions, SBU, etc.);
30+ pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including ALBEF, BLIP, ALPRO, CLIP.

Key features of LAVIS include:

Unified and Modular Interface: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors), and to add new ones.
Easy Off-the-shelf Inference and Feature Extraction: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
Reproducible Model Zoo and Training Recipes: easily replicate and extend state-of-the-art models on existing and new tasks.
Dataset Zoo and Automatic Downloading Tools: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.

The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.

Tasks	Supported Models	Supported Datasets
Image-text Pre-training	ALBEF, BLIP	COCO, VisualGenome, SBU, ConceptualCaptions
Image-text Retrieval	ALBEF, BLIP, CLIP	COCO, Flickr30k
Text-image Retrieval	ALBEF, BLIP, CLIP	COCO, Flickr30k
Visual Question Answering	ALBEF, BLIP	VQAv2, OKVQA, A-OKVQA
Image Captioning	BLIP	COCO, NoCaps
Image Classification	CLIP	ImageNet
Natural Language Visual Reasoning (NLVR)	ALBEF, BLIP	NLVR2
Visual Entailment (VE)	ALBEF	SNLI-VE
Visual Dialogue	BLIP	VisDial
Video-text Retrieval	BLIP, ALPRO	MSRVTT, DiDeMo
Text-video Retrieval	BLIP, ALPRO	MSRVTT, DiDeMo
Video Question Answering (VideoQA)	BLIP, ALPRO	MSRVTT, MSVD
Video Dialogue	VGD-GPT	AVSD
Multimodal Feature Extraction	ALBEF, CLIP, BLIP, ALPRO	customized
Text-to-image Generation	[COMING SOON]

Installation

(Optional) Creating conda environment

    conda create -n lavis python=3.8
    conda activate lavis

Install from PyPI

    pip install salesforce-lavis

Or, for development, you may build from source

    git clone https://github.com/salesforce/LAVIS.git
    cd LAVIS
    pip install -e .
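
After either install route, it can be useful to confirm that the package is visible to the interpreter in the active environment. The following is a minimal sketch using only the Python standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_version` is hypothetical and not part of LAVIS; the distribution name `salesforce-lavis` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "salesforce-lavis") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import lavis itself.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("salesforce-lavis is not installed in this environment")
    else:
        print(f"salesforce-lavis {found} is installed")
```

Querying metadata rather than importing the package keeps the check fast, since importing a deep learning library can take several seconds.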

Getting Started

Model Zoo

The model zoo summarizes the models supported in LAVIS. To view it:

    from lavis.models import model_zoo
    print(model_zoo)
    # ==================================================
    # Architectures                  Types
    # ==================================================
    # albef_classification           ve
    # albef_feature_extractor        base
    # albef_nlvr                     nlvr
    # albef_pretrain                 base
    # albef_retrieval                coco, flickr
    # albef_vqa                      vqav2
    # alpro_qa                       msrvtt, msvd
    # alpro_retrieval                msrvtt, didemo
    # blip_caption                   base_coco, large_coco
    # blip_classification            base
    # blip_feature_extractor         base
    # blip_nlvr                      nlvr
    # blip_pretrain                  base
    # blip_retrieval                 coco, flickr
    # blip_vqa                       vqav2, okvqa, aokvqa
    # clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
    # clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
    # gpt_dialogue                   base
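
Each row of the table pairs an architecture name with the model types available for it, and these two values are what the loading calls in the following sections take as `name` and `model_type`. As a rough sketch, a script could check a chosen pair against the table before starting a slow checkpoint download. The dictionary `KNOWN_MODELS` and the function `check_model_choice` below are hypothetical, hand-copied from a few rows of the printed table; they are not objects provided by LAVIS.

```python
# A hand-copied subset of the table printed by print(model_zoo).
# This dict is illustrative only; it is not an object exposed by LAVIS.
KNOWN_MODELS = {
    "albef_retrieval": ["coco", "flickr"],
    "blip_caption": ["base_coco", "large_coco"],
    "blip_feature_extractor": ["base"],
    "blip_vqa": ["vqav2", "okvqa", "aokvqa"],
}


def check_model_choice(name: str, model_type: str) -> None:
    """Raise ValueError with a readable message if the pair is not listed."""
    if name not in KNOWN_MODELS:
        raise ValueError(
            f"unknown architecture {name!r}; known: {sorted(KNOWN_MODELS)}"
        )
    if model_type not in KNOWN_MODELS[name]:
        raise ValueError(
            f"{name!r} has no type {model_type!r}; choose from {KNOWN_MODELS[name]}"
        )


# Passes silently: this pair appears in the table.
check_model_choice("blip_caption", "base_coco")
```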

Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from a local file.

    import torch
    from PIL import Image

    # setup device to use
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # load sample image
    raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

This example image shows Merlion park (source), a landmark in Singapore.

Image Captioning

In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each pre-trained model with its preprocessors (transforms), accessed via load_model_and_preprocess().

    import torch
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
    # this also loads the associated image processors
    model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    # preprocess the image
    # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # generate caption
    model.generate({"image": image})
    # ['a large fountain spewing water into the air']

Visual question answering (VQA)

The BLIP model is able to answer free-form questions about images in natural language. To access the VQA model, simply replace the name and model_type arguments passed to load_model_and_preprocess().

    from lavis.models import load_model_and_preprocess

    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

    # ask a random question.
    question = "Which city is this photo taken?"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    question = txt_processors["eval"](question)

    model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
    # ['singapore']
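
When several questions concern the same image, the image only needs to be preprocessed once. The sketch below wraps the VQA steps in a small helper. The function name `answer_questions` is hypothetical and not part of LAVIS; it uses only the calls shown in the VQA example (`predict_answers` with `samples` and `inference_method="generate"`) and assumes, as that example suggests, that the call returns a list whose first entry is the answer for a single image.

```python
from typing import Any, Callable, Dict, List


def answer_questions(
    model: Any,
    vis_processor: Callable[[Any], Any],
    txt_processor: Callable[[str], str],
    raw_image: Any,
    questions: List[str],
    device: Any,
) -> Dict[str, str]:
    """Ask several questions about one image, reusing the preprocessed image."""
    # Preprocess once: the image tensor does not depend on the question.
    image = vis_processor(raw_image).unsqueeze(0).to(device)

    answers: Dict[str, str] = {}
    for question in questions:
        text = txt_processor(question)
        # Assumption: the result is a list and its first entry is the answer
        # for the single image passed in, as in the VQA example.
        result = model.predict_answers(
            samples={"image": image, "text_input": text},
            inference_method="generate",
        )
        answers[question] = result[0]
    return answers
```

With the objects from the VQA example, a call might look like `answer_questions(model, vis_processors["eval"], txt_processors["eval"], raw_image, ["Which city is this photo taken?"], device)`.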

Unified Feature Extraction Interface

LAVIS provides a unified interface to extract features from each architecture. To extract features, we load the feature extractor variants of each model. The multimodal feature can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.

    from lavis.models import load_model_and_preprocess

    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)

    caption = "a large fountain spewing water into the air"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text_input = txt_processors["eval"](caption)
    sample = {"image": image, "text_input": [text_input]}

    features_multimodal = model.extract_features(sample)
    print(features_multimodal.multimodal_embeds.shape)
    # torch.Size([1, 12, 768]), use features_multimodal.multimodal_embeds[:,0,:] for multimodal classification tasks

    features_image = model.extract_features(sample, mode="image")
    features_text = model.extract_features(sample, mode="text")
    print(features_image.image_embeds.shape)
    # torch.Size([1, 197, 768])
    print(features_text.text_embeds.shape)
    # torch.Size([1, 12, 768])

    # low-dimensional projected features
    print(features_image.image_embeds_proj.shape)
    # torch.Size([1, 197, 256])
    print(features_text.text_embeds_proj.shape)
    # torch.Size([1, 12, 256])
    similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
    print(similarity)
    # tensor([[0.2622]])
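
The similarity in the last step is a dot product between one image vector and one text vector. To illustrate how such scores could be used to rank several candidate captions for an image, here is a minimal pure-Python sketch on toy vectors. The function names `l2_normalize` and `rank_captions` are hypothetical, and the sketch normalizes the vectors explicitly so that the score is a cosine similarity; whether the projected features returned by the model are already normalized is not stated here, so treat that as an assumption to verify.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_captions(
    image_vec: Sequence[float],
    caption_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (caption, cosine similarity) pairs, best match first."""
    img = l2_normalize(image_vec)
    scores = []
    for caption, vec in caption_vecs.items():
        txt = l2_normalize(vec)
        scores.append((caption, sum(a * b for a, b in zip(img, txt))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for the 256-dimensional projections.
ranking = rank_captions(
    [1.0, 0.0],
    {"a fountain": [2.0, 0.0], "a dog": [0.0, 3.0], "a statue": [1.0, 1.0]},
)
print([caption for caption, _ in ranking])
```

In practice the vectors would come from `image_embeds_proj[:,0,:]` and `text_embeds_proj[:,0,:]`, converted to lists or kept as tensors and multiplied with `@` as in the example.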

Load Datasets

LAVIS inherently supports a wide variety of common language-vision datasets by providing automatic download tools to help download and organize these datasets. After downloading, to load the datasets, use the following code:

    from lavis.datasets.builders import dataset_zoo

    dataset_names = dataset_zoo.get_names()
    print(dataset_names)
    # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
    #  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
    #  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
    #  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']

After downloading the images, we can use load_dataset() to obtain the dataset.

    from lavis.datasets.builders import load_dataset

    coco_dataset = load_dataset("coco_caption")
    print(coco_dataset.keys())
    # dict_keys(['train', 'val', 'test'])
    print(len(coco_dataset["train"]))
    # 566747
    print(coco_dataset["train"][0])
    # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
    #  'text_input': 'A woman wearing a net on her head cutting a cake. ',
    #  'image_id': 0}
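
Each training sample printed above is a dictionary with `image`, `text_input`, and `image_id` keys, and COCO provides several captions per image. As a small sketch of working with samples of this shape, the hypothetical helper `captions_by_image` below groups captions by `image_id`. It runs on plain dictionaries and is shown on toy data whose second and third captions are made up; it assumes only the sample structure printed in the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def captions_by_image(samples: Iterable[Dict[str, Any]]) -> Dict[Any, List[str]]:
    """Group caption strings by image_id, stripping stray whitespace."""
    grouped: Dict[Any, List[str]] = defaultdict(list)
    for sample in samples:
        grouped[sample["image_id"]].append(sample["text_input"].strip())
    return dict(grouped)


# Toy samples mimicking the printed structure (the image field is omitted).
toy_samples = [
    {"image_id": 0, "text_input": "A woman wearing a net on her head cutting a cake. "},
    {"image_id": 0, "text_input": "A woman cutting a large white sheet cake."},
    {"image_id": 1, "text_input": "A child holding a flowered umbrella."},
]
print(captions_by_image(toy_samples))
```

Passing a real split such as `coco_dataset["train"]` should work the same way if the split is iterable, but it would load every image along the way, so a slice of indices is a gentler first test.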

If you already host a local copy of the dataset, you can pass in the vis_path argument to change the default location to load images.

    coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)

Jupyter Notebook Examples

See examples for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, zero-shot classification.

Resources and Tools

Benchmarks: see Benchmark for instructions to evaluate and train supported models.
Dataset Download and Browsing: see Dataset Download for instructions and automatic tools for downloading common language-vision datasets.
GUI Demo: to run the demo locally, run bash run_scripts/run_demo.sh and then follow the instructions in the prompts to view it in a browser. A web demo is coming soon.

Documentation

For more details and advanced usage, please refer to the documentation.

Ethical and Responsible Use

We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and inappropriate behaviors in the future.