🎯 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Muzhi Zhu1,2, Yuzhuo Tian1, Hao Chen1*, Chunluan Zhou2, Qingpei Guo2*, Yang Liu1, Ming Yang2, Chunhua Shen1*
1Zhejiang University, 2Ant Group
CVPR2025
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities in understanding images but still struggle with pixel-level tasks like segmentation. SegAgent addresses this by introducing a novel Human-Like Mask Annotation Task (HLMAT), enabling MLLMs to mimic the annotation trajectories of human experts using interactive segmentation tools.
SegAgent effectively leverages these annotation trajectories without requiring architectural modifications or additional implicit tokens. Our approach significantly enhances MLLMs' segmentation and mask refinement abilities, establishing a new paradigm for assessing fine-grained visual understanding and multi-step reasoning.
- ✅ Release the weights.
- ✅ Release the inference code.
- ✅ Release the trajectory data for training and evaluation.
Install the dependencies:

```bash
pip install -r env.txt
```
You can run inference on the validation or test set using the trained model and the provided script:
```bash
bash run_eval.sh /path/to/your/trained_model
```
This will run inference with SimpleClick as the segmentation model and SegAgent as the language grounding model. The script processes images and saves the predictions to the output directory.
To evaluate the results, run:
```bash
python eval_result_iou.py --input_json ./results/refcoco+_val_predictions.json
```
📄 For more details, refer to ./evaltools/eval.md.
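The evaluation above reports mask IoU. As a rough sanity check, per-mask IoU over binary masks can be computed as in this minimal NumPy sketch; the mask shapes and the empty-mask convention here are illustrative assumptions, not taken from `eval_result_iou.py`:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: two 2x2 squares on a 4x4 grid, overlapping in one pixel.
a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:3] = True
print(mask_iou(a, b))  # 1 intersecting pixel / 7 union pixels ≈ 0.1429
```

Averaging this value over all samples in the predictions JSON gives the mean IoU the script reports.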
SegAgent is trained using Human-Like Mask Annotation Trajectories (HLMAT). Follow the steps below to launch the training process:
Ensure that the annotation trajectory data is preprocessed and saved in the appropriate format (e.g., COCO-style JSON files + click sequences).
We have uploaded the preprocessed trajectory data here:
📁 SegAgent-Data
Example structure:
```
./data/segagent-data
├── refcoco_train.json
├── refcoco_val.json
├── refcoco+_train.json
├── ...
```
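A trajectory file pairs a referring expression with an ordered sequence of annotator clicks. The sketch below shows one way to iterate over such records; the field names (`expression`, `clicks`, `x`, `y`, `positive`) are hypothetical placeholders, so check the released JSON files for the actual schema:

```python
import json

def iter_clicks(samples):
    """Yield (expression, (x, y), is_positive) for every click.

    NOTE: the field names used here are illustrative assumptions,
    not the released files' guaranteed schema.
    """
    for sample in samples:
        expr = sample.get("expression", "")
        for click in sample.get("clicks", []):
            yield expr, (click["x"], click["y"]), bool(click["positive"])

# Toy record in the assumed layout:
samples = json.loads("""[
  {"expression": "the red mug",
   "clicks": [{"x": 120, "y": 64, "positive": true},
              {"x": 98,  "y": 70, "positive": false}]}
]""")
print(list(iter_clicks(samples)))
```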
Additional image data sources:
- RefCOCO image datasets: LISA GitHub Repository
- HQ segmentation (SAM-HQ): Hugging Face SAM-HQ Data
We recommend converting the trajectory data into a format supported by LLaMA-Factory and training directly with their framework.
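One possible conversion target is LLaMA-Factory's sharegpt-style JSON. The sketch below maps a single trajectory record into that layout; the source-side field names (`image`, `expression`, `clicks`) and the way clicks are serialized into the assistant turn are assumptions for illustration, not the project's actual preprocessing:

```python
def to_sharegpt(sample: dict) -> dict:
    """Convert one (assumed-schema) trajectory sample into a
    sharegpt-style record with a <image> tag in the human turn."""
    clicks = "; ".join(
        f"{'+' if c['positive'] else '-'}({c['x']}, {c['y']})"
        for c in sample["clicks"]
    )
    return {
        "conversations": [
            {"from": "human", "value": f"<image>Segment: {sample['expression']}"},
            {"from": "gpt", "value": clicks},
        ],
        "images": [sample["image"]],
    }

record = to_sharegpt({
    "image": "coco/train2014/img.jpg",
    "expression": "the red mug",
    "clicks": [{"x": 120, "y": 64, "positive": True}],
})
print(record["conversations"][1]["value"])  # +(120, 64)
```

After converting every sample, register the resulting JSON file in LLaMA-Factory's dataset configuration and launch training as its documentation describes.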
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
If you find this work helpful for your research, please cite:
```bibtex
@article{zhu2025segagent,
  title={SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories},
  author={Zhu, Muzhi and Tian, Yuzhuo and Chen, Hao and Zhou, Chunluan and Guo, Qingpei and Liu, Yang and Yang, Ming and Shen, Chunhua},
  journal={arXiv preprint arXiv:2503.08625},
  year={2025},
  url={https://arxiv.org/abs/2503.08625}
}
```