zeyofu/ReFocus_Code


Code for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [ICML 2025]

License

Apache-2.0 license


Folders and files

assets
helper
src
.gitignore
LICENSE
README.md
finetune_hf_trainer_chartqa_vcot.py


ReFocus

This repo contains the code for the paper "ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding", accepted at ICML 2025.

🌐 Homepage | 📑 Paper | 🤗 Training Data | 🔗 Trained Model

🔔 News

🎉 [2025-05-01]: ReFocus is accepted to ICML 2025! See you in Canada.

🔥 [2025-01-12]: Releasing the code for ReFocus, the collected training data, and the finetuned model.

Introduction

Download Training Data

The 14k collected training examples are uploaded on Huggingface. The complete raw data can be found in the Huggingface Dataset Files, where the training data is under chartqa_vcot.zip and train_chartQA_*.zip, with the other files being testing data.

ReFocus Prompting

We inherit most of the prompting code from Visual SketchPad.

Installation

conda create -n refocus python=3.11
conda activate refocus
pip install pyautogen==0.3.0
pip install 'pyautogen[jupyter-executor]'
pip install Pillow joblib matplotlib opencv-python numpy networkx scipy datasets
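After installation, a quick import check can confirm that the environment is complete. The following is a minimal sketch and not part of the repo: the helper name `missing_modules` is hypothetical, and the mapping assumes the usual import names for these packages (`autogen` for pyautogen, `PIL` for Pillow, `cv2` for opencv-python).

```python
import importlib.util

# pip package name -> import name (assumed to be the usual module names).
REQUIRED_MODULES = {
    "pyautogen": "autogen",
    "Pillow": "PIL",
    "joblib": "joblib",
    "matplotlib": "matplotlib",
    "opencv-python": "cv2",
    "numpy": "numpy",
    "networkx": "networkx",
    "scipy": "scipy",
    "datasets": "datasets",
}


def missing_modules(required=REQUIRED_MODULES):
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Running this inside the activated `refocus` environment lists any package that still needs to be installed.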

Quick Start

Task Data

We preprocessed each task and put the results into tasks. Download them from the Huggingface Dataset Files and put everything under data.

Note that the finetuning data is under chartqa_vcot.zip and train_chartQA_*.zip; the rest is testing data.
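If the downloaded archives need to be unpacked by hand, the split between finetuning and testing archives described above can be expressed with a filename pattern. This is a sketch under assumptions: it supposes the zip files sit directly under `data` and should be extracted in place, which the README does not state explicitly, and the helper names are hypothetical.

```python
import fnmatch
import zipfile
from pathlib import Path

# Archive names the README identifies as finetuning data.
TRAIN_PATTERNS = ("chartqa_vcot.zip", "train_chartQA_*.zip")


def is_training_archive(name):
    """True if the archive name matches one of the finetuning patterns."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in TRAIN_PATTERNS)


def extract_archives(data_dir="data", include_training=False):
    """Unzip archives found in data_dir in place; return the names extracted.

    Assumption: archives live directly under data_dir. Finetuning archives
    are skipped unless include_training is True.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        if is_training_archive(archive.name) and not include_training:
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_archives("data")` unpacks only the testing archives, which is all the prompting scripts should need; pass `include_training=True` when preparing for finetuning.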

Run a Task

Set your OpenAI API key, which is required to run ReFocus with GPT-4 models.

export OPENAI_API_KEY=<your_key>
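A missing key typically only surfaces once the first API call fails, so it can help to check for it up front. The sketch below is illustrative and not taken from the repo; `require_openai_key` is a hypothetical helper that only reads the `OPENAI_API_KEY` environment variable named above.

```python
import os


def require_openai_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise if unset."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the task scripts."
        )
    return key
```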

Run the script for each task to prompt with ReFocus.

python src/main_chartQA.py
python src/main_tablevqa.py
python src/main_charxiv.py
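To run all three tasks in one go, the scripts can be launched in sequence from Python. This is a minimal sketch, assuming it is run from the repository root and that the scripts take no required arguments (they are invoked without any above); `build_commands` and `run_all` are hypothetical names.

```python
import subprocess
import sys

# Task entry points listed in the README.
TASK_SCRIPTS = [
    "src/main_chartQA.py",
    "src/main_tablevqa.py",
    "src/main_charxiv.py",
]


def build_commands(scripts=TASK_SCRIPTS, python=sys.executable):
    """Return one argument list per task script."""
    return [[python, script] for script in scripts]


def run_all(scripts=TASK_SCRIPTS):
    """Run each task script in order, stopping at the first failure."""
    for cmd in build_commands(scripts):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all()
```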

ReFocus Finetuning

We follow the Phi-3 Cookbook for the supervised finetuning experiments.

Download the Finetuned Model

We release our best finetuned ReFocus model, trained with full chain-of-thought data, at this HuggingFace Link.

This model is finetuned from Phi-3.5-vision, and we used the following prompt during evaluation:

<|image|>\n{question}\nThought:

To force the model to generate bounding box coordinates for refocusing, you could try this prompt:

<|image_1|>\n{question}\nThought: The areas to focus on in the image have bounding box coordinates:
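The two prompt templates above can be filled in programmatically. The sketch below copies them verbatim, including the different image placeholders (`<|image|>` and `<|image_1|>`), and assumes the `\n` sequences stand for newline characters; `build_prompt` is a hypothetical helper, not a function from the repo.

```python
# Templates copied from the README; "\n" is assumed to mean a newline.
EVAL_PROMPT = "<|image|>\n{question}\nThought:"
BBOX_PROMPT = (
    "<|image_1|>\n{question}\n"
    "Thought: The areas to focus on in the image have bounding box coordinates:"
)


def build_prompt(question, force_bbox=False):
    """Fill the evaluation prompt, optionally the bounding-box variant."""
    template = BBOX_PROMPT if force_bbox else EVAL_PROMPT
    return template.format(question=question)


if __name__ == "__main__":
    print(build_prompt("Which bar is the tallest?", force_bbox=True))
```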

Finetune Quickstart

Clone the Phi3CookBook and follow its instructions to set up a new finetuning environment.

git clone https://github.com/microsoft/Phi-3CookBook.git

# create a new conda environment
conda create -n phi3v python=3.10
conda activate phi3v

# install pytorch
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

# other libraries needed to run the example code
pip install -r requirements.txt

# (optional) flash attention -- Ampere+ GPUs (e.g., A100, H100)
pip install ninja
MAX_JOBS=32 pip install flash-attn==2.4.2 --no-build-isolation

# (optional) QLoRA -- Turing+ GPUs (e.g., RTX 8000)
pip install bitsandbytes==0.43.1

Move the finetuning script into the cookbook:

mv finetune_hf_trainer_chartqa_vcot.py Phi-3CookBook/code/04.Finetuning/vision_finetuning/

Then train the model:

cd Phi-3CookBook/code/04.Finetuning/vision_finetuning
python -m torch.distributed.run --nproc_per_node=8 finetune_hf_trainer_chartqa_vcot.py --full_train --data_dir data/chartqa_vcot --bf16 --use_flash_attention --batch_size 48 --output_dir outputs/chartqa_vcot_loop --learning_rate 1e-6 --num_train_epochs 2 --output_bbox 1
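The training command is long, so it may be convenient to assemble it from parameters, for example when fewer than 8 GPUs are available. The sketch below is a hypothetical wrapper that only reproduces the flags shown above; its defaults are the README's values, and whether other combinations (such as a different `--batch_size` for a given GPU count) are accepted depends on the script itself.

```python
import shlex


def build_train_command(
    nproc_per_node=8,
    data_dir="data/chartqa_vcot",
    output_dir="outputs/chartqa_vcot_loop",
    batch_size=48,
    learning_rate="1e-6",
    num_train_epochs=2,
    output_bbox=1,
):
    """Return the finetuning command from the README as an argument list."""
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "finetune_hf_trainer_chartqa_vcot.py",
        "--full_train",
        "--data_dir", data_dir,
        "--bf16",
        "--use_flash_attention",
        "--batch_size", str(batch_size),
        "--output_dir", output_dir,
        "--learning_rate", str(learning_rate),
        "--num_train_epochs", str(num_train_epochs),
        "--output_bbox", str(output_bbox),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line; run it from the vision_finetuning directory.
    print(shlex.join(build_train_command(nproc_per_node=4)))
```

With the default arguments, the printed command matches the one given above.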

Coordinate Acquisition

In case you are interested, we share the code we used to acquire the table and chart coordinates in ReFocus.

python helper/get_coordinates_for_chart.py
python helper/get_coordinates_for_table.py

Project homepage: zeyofu.github.io/ReFocus/
