# ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning (COLM 2024)
Official implementation of our paper: ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
In this work, we devise a "plug-and-play" method, ExoViP, to correct the errors at both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current vision-language programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs.
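The verification loop described above can be sketched as follows. This is an illustrative outline only, not the repository's actual API: the function names, the step/candidate representation, and the simple score-averaging rule are all assumptions made for clarity.

```python
# Illustrative sketch of ExoViP-style step-wise verification.
# All names here are hypothetical; see the notebooks for the real implementation.

def verify(step, candidates, sub_verifiers):
    """Score each candidate prediction with a mixture of sub-verifiers
    and return the candidates re-ranked by their averaged score."""
    scored = []
    for cand in candidates:
        scores = [v(step, cand) for v in sub_verifiers]
        scored.append((sum(scores) / len(scores), cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

def run_program(steps, sub_verifiers):
    """Execute each program step, verify its candidate outputs,
    and keep the calibrated (top-ranked) prediction in the trace."""
    trace = []
    for step in steps:
        candidates = step["module"](*step["args"])  # visual module proposes candidates
        ranked = verify(step, candidates, sub_verifiers)
        best_score, best = ranked[0]
        trace.append((step["name"], best, best_score))
    return trace
```

In the full method, a low verification score can also trigger exploration, i.e. re-planning of the remaining reasoning trace by the LLM rather than just re-ranking the current step's candidates.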
Paste your OpenAI API key and API base into `engine/.env` and `tasks/*.ipynb`.
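A `.env` file along these lines should work; the exact variable names are an assumption following common OpenAI-client conventions, so match whatever the notebooks read.

```shell
# engine/.env -- values are placeholders, variable names may differ in the repo
OPENAI_API_KEY=sk-...
OPENAI_API_BASE=https://api.openai.com/v1
```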
```shell
conda env create -f environment.yaml
conda activate exovip
```
If Hugging Face is not reachable from your network, you can download all checkpoints into the `prev_trained_models` directory.
Errors in existing methods can be grouped into two categories:
- Module Error: the visual modules fail to correctly execute the program
- Planning Error: the LLM cannot parse the language query into a correct, solvable program
We conducted a comparative analysis of the statistics derived from a random sample of 100 failure incidents before (left) and after (right) the implementation of our method.
Our method has been validated on six tasks:
- Compositional Image Question Answering: GQA
- Referring Expression Understanding: RefCOCO/RefCOCO+/RefCOCOg
- Natural Language for Visual Reasoning: NLVR
- Visual Abstract Reasoning: KILOGRAM
- Language-guided Image Editing: MagicBrush
- Spatial-Temporal Video Reasoning: AGQA
NOTE: All experiments are run on subsets of these datasets; please refer to `datasets`.
code demos
```shell
cd tasks
# GQA:          gqa.ipynb
# NLVR:         nlvr.ipynb
# RefCOCO(+/g): refcoco.ipynb
# KILOGRAM:     kilogram.ipynb
# MagicBrush:   magicbrush.ipynb
# AGQA:         agqa.ipynb
```
Our implementation builds on visprog, a neuro-symbolic system that solves complex and compositional visual tasks given natural language instructions.
If you find our work helpful, please cite it.
```bibtex
@inproceedings{wang2024exovip,
  title={ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning},
  author={Wang, Yuxuan and Yuille, Alan and Li, Zhuowan and Zheng, Zilong},
  booktitle={The first Conference on Language Modeling (COLM)},
  year={2024}
}
```