🎯 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Muzhi Zhu1,2, Yuzhuo Tian1, Hao Chen1*, Chunluan Zhou2, Qingpei Guo2*, Yang Liu1, Ming Yang2, Chunhua Shen1*
1Zhejiang University, 2Ant Group
CVPR2025
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities in understanding images but still struggle with pixel-level tasks like segmentation. SegAgent addresses this by introducing a novel Human-Like Mask Annotation Task (HLMAT), enabling MLLMs to mimic the annotation trajectories of human experts using interactive segmentation tools.
SegAgent effectively leverages these annotation trajectories without requiring architectural modifications or additional implicit tokens. Our approach significantly enhances MLLMs' segmentation and mask refinement abilities, establishing a new paradigm for assessing fine-grained visual understanding and multi-step reasoning.
- ✅ Release the weights.
- ✅ Release the inference code.
- ✅ Release the trajectory data for training and evaluation.
Install the dependencies:

```bash
pip install -r env.txt
```
You can run inference on the validation or test set using the trained model and the provided script:
```bash
bash run_eval.sh /path/to/your/trained_model
```
This will run inference with SimpleClick as the segmentation model and SegAgent as the language grounding model. The script processes images and saves the predictions to the output directory.
To evaluate the results, run:
```bash
python eval_result_iou.py --input_json ./results/refcoco+_val_predictions.json
```
📄 For more details, refer to ./evaltools/eval.md.
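The evaluation above reports mask IoU. As a rough sanity check, per-mask IoU over binary masks can be computed as in this minimal NumPy sketch; the mask shapes and the empty-mask convention here are illustrative assumptions, not taken from `eval_result_iou.py`:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: two 2x2 squares on a 4x4 grid, overlapping in one pixel.
a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:3] = True
print(mask_iou(a, b))  # 1 intersecting pixel / 7 union pixels ≈ 0.1429
```

Averaging this value over all samples in the predictions JSON gives the mean IoU the script reports.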
SegAgent is trained using Human-Like Mask Annotation Trajectories (HLMAT). Follow the steps below to launch the training process:
Ensure that the annotation trajectory data is preprocessed and saved in the appropriate format (e.g., COCO-style JSON files + click sequences).
We have uploaded the preprocessed trajectory data here:
📁 SegAgent-Data
Example structure:
```
./data/segagent-data
├── refcoco_train.json
├── refcoco_val.json
├── refcoco+_train.json
├── ...
```
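A trajectory file pairs a referring expression with an ordered sequence of annotator clicks. The sketch below shows one way to iterate over such records; the field names (`expression`, `clicks`, `x`, `y`, `positive`) are hypothetical placeholders, so check the released JSON files for the actual schema:

```python
import json

def iter_clicks(samples):
    """Yield (expression, (x, y), is_positive) for every click.

    NOTE: the field names used here are illustrative assumptions,
    not the released files' guaranteed schema.
    """
    for sample in samples:
        expr = sample.get("expression", "")
        for click in sample.get("clicks", []):
            yield expr, (click["x"], click["y"]), bool(click["positive"])

# Toy record in the assumed layout:
samples = json.loads("""[
  {"expression": "the red mug",
   "clicks": [{"x": 120, "y": 64, "positive": true},
              {"x": 98,  "y": 70, "positive": false}]}
]""")
print(list(iter_clicks(samples)))
```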
Additional image data sources:
- RefCOCO image datasets: LISA GitHub Repository
- HQ segmentation (SAM-HQ): Hugging Face SAM-HQ Data
We recommend converting the trajectory data into a format supported by LLaMA-Factory and training directly with their framework.
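One possible conversion target is LLaMA-Factory's sharegpt-style JSON. The sketch below maps a single trajectory record into that layout; the source-side field names (`image`, `expression`, `clicks`) and the way clicks are serialized into the assistant turn are assumptions for illustration, not the project's actual preprocessing:

```python
def to_sharegpt(sample: dict) -> dict:
    """Convert one (assumed-schema) trajectory sample into a
    sharegpt-style record with a <image> tag in the human turn."""
    clicks = "; ".join(
        f"{'+' if c['positive'] else '-'}({c['x']}, {c['y']})"
        for c in sample["clicks"]
    )
    return {
        "conversations": [
            {"from": "human", "value": f"<image>Segment: {sample['expression']}"},
            {"from": "gpt", "value": clicks},
        ],
        "images": [sample["image"]],
    }

record = to_sharegpt({
    "image": "coco/train2014/img.jpg",
    "expression": "the red mug",
    "clicks": [{"x": 120, "y": 64, "positive": True}],
})
print(record["conversations"][1]["value"])  # +(120, 64)
```

After converting every sample, register the resulting JSON file in LLaMA-Factory's dataset configuration and launch training as its documentation describes.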
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
If you find this work helpful for your research, please cite:
```bibtex
@article{zhu2025segagent,
  title={SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories},
  author={Zhu, Muzhi and Tian, Yuzhuo and Chen, Hao and Zhou, Chunluan and Guo, Qingpei and Liu, Yang and Yang, Ming and Shen, Chunhua},
  journal={arXiv preprint arXiv:2503.08625},
  year={2025},
  url={https://arxiv.org/abs/2503.08625}
}
```