[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
📑 Paper | 🌐 Project Page | 💾 AGUVIS Data Collection
AGUVIS is a unified pure vision-based framework for autonomous GUI agents that can operate across various platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS leverages unified purely vision-based observations and a consistent action space to ensure better generalization across different platforms.
- 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
- 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
- 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
- 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
- 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
Demo videos: overview.mp4 · androidworld.mp4 · mind2web-live.mp4 · osworld.mp4
- Clone the repository:
```bash
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
```
- Create and activate a conda environment:
```bash
conda create -n aguvis python=3.10
conda activate aguvis
```
- Install PyTorch and dependencies:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
```
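Optionally, before moving on to data preparation, you can run a quick sanity check that PyTorch was installed with CUDA support; this is just a convenience check, not an official setup step:

```bash
# Optional sanity check: prints the PyTorch version and True if CUDA is usable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```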
Stage 1: Grounding
- Download the dataset from aguvis-stage1 (example download commands are sketched after the Stage 2 steps below)
- Place the data according to the structure defined in data/stage1.yaml
Stage 2: Planning and Reasoning
- Download the dataset from aguvis-stage2
- Place the data according to the structure defined in data/stage2.yaml
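If the two datasets are hosted on the Hugging Face Hub (as the links above suggest), one convenient way to fetch them is huggingface-cli. The repo ids and local directories below are illustrative assumptions; substitute the ids from the dataset links:

```bash
# Illustrative repo ids -- replace with the ids from the aguvis-stage1 / aguvis-stage2 links above
huggingface-cli download xlang-ai/aguvis-stage1 --repo-type dataset --local-dir data/aguvis-stage1
huggingface-cli download xlang-ai/aguvis-stage2 --repo-type dataset --local-dir data/aguvis-stage2
```

After downloading, arrange the files so that the paths referenced in data/stage1.yaml and data/stage2.yaml resolve correctly.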
Configure your training settings:
- Open scripts/train.sh
- Set the SFT_TASK variable to specify your training stage (see the example below)

Start training:
```bash
bash scripts/train.sh
```
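For illustration, the configuration step above amounts to editing a single variable near the top of scripts/train.sh. The task names shown here are assumptions; check the script for the values it actually accepts:

```bash
# Inside scripts/train.sh (illustrative values; check the script for the supported task names)
SFT_TASK="stage1"    # grounding stage
# SFT_TASK="stage2"  # planning and reasoning stage
```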
- Aguvis-7B-720P: Hugging Face (a download example is sketched below)
- Cooking... 🧑‍🍳
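To use the released checkpoint for inference, you can download it locally and point MODEL_PATH at the resulting directory. The repo id below is an assumption based on the model name; use the id from the Hugging Face link above:

```bash
# Illustrative repo id -- substitute the one from the Hugging Face link above
huggingface-cli download xlang-ai/Aguvis-7B-720P --local-dir checkpoints/aguvis-7b-720p
```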
Configure your inference settings:
- Open scripts/inference.sh
- Set the MODEL_PATH variable to specify your model path
- Set the IMAGE_PATH variable to specify your image path
- Set the INSTRUCTION variable to specify your instruction
- Set the PREVIOUS_ACTIONS variable to specify your previous actions, or leave it empty
- Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction, or leave it empty

Start inference:
```bash
bash scripts/inference.sh
```
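Putting the configuration steps together, the variables at the top of scripts/inference.sh might be filled in as follows; the paths and strings are placeholders, and PREVIOUS_ACTIONS / LOW_LEVEL_INSTRUCTION can simply be left empty:

```bash
# Inside scripts/inference.sh (placeholder values shown)
MODEL_PATH="checkpoints/aguvis-7b-720p"        # local path or Hugging Face model id
IMAGE_PATH="examples/screenshot.png"           # screenshot of the current GUI state
INSTRUCTION="Search for the weather in Tokyo"  # high-level task instruction
PREVIOUS_ACTIONS=""                            # optional: history of executed actions
LOW_LEVEL_INSTRUCTION=""                       # optional: step-level instruction
```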
- Data
  - ✅ Stage 1: Grounding Dataset
  - ✅ Stage 2: Planning and Reasoning Trajectories
- Code
  - ✅ Training Pipeline
  - 🚧 Model Weights and Configurations
  - 🚧 Inference Scripts
  - 🚧 Evaluation Toolkit
If this work is helpful, please kindly cite as:
```bibtex
@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}
```