[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction


📑 Paper    |    🌐 Project Page    |    💾 AGUVIS Data Collection

Introduction

AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operates across platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS works from purely visual observations and a consistent action space, which improves generalization across platforms.

Key Features & Contributions

  • 🔍 Pure Vision Framework: First fully autonomous pure-vision GUI agent capable of performing tasks independently without relying on closed-source models
  • 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
  • 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
  • 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
  • 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.

overview.mp4

Mobile Tasks (Android World)

androidworld.mp4

Web Browsing Tasks (Mind2Web-Live)

mind2web-live.mp4

Computer-use Tasks (OSWorld)

osworld.mp4

Getting Started

Installation

  1. Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
  2. Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
  3. Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
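
Optionally, you can run a quick sanity check before moving on; the one-liner below simply verifies that PyTorch imports and reports whether a CUDA GPU is visible (it assumes the aguvis conda environment is active).

# Verify the PyTorch install and GPU visibility
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"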

Data Preparation

  1. Stage 1: Grounding

  2. Stage 2: Planning and Reasoning

Training

  1. Configure your training settings:

    • Open scripts/train.sh
    • Set the SFT_TASK variable to specify your training stage (a sample setting is shown below)
  2. Start training:

bash scripts/train.sh
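
For reference, the relevant line in scripts/train.sh might look like the sketch below; the stage identifier here is a placeholder, so use whichever value the script actually defines for the grounding or planning-and-reasoning stage.

# Example in scripts/train.sh (placeholder stage name)
SFT_TASK="stage1-grounding"   # switch to the stage-2 planning/reasoning task name for the second stage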

Model Checkpoints

Inference

  1. Configure your inference settings (sample values are shown after step 2):

    • Open scripts/inference.sh
    • Set the MODEL_PATH variable to specify your model path
    • Set the IMAGE_PATH variable to specify your image path
    • Set the INSTRUCTION variable to specify your instruction
    • Set the PREVIOUS_ACTIONS variable to specify your previous actions, or leave it empty
    • Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction, or leave it empty
  2. Start inference:

bash scripts/inference.sh
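
The snippet below is only an illustrative sketch of how these variables might be filled in inside scripts/inference.sh; all paths and the instruction text are placeholders, not values shipped with the repository.

# Example settings in scripts/inference.sh (placeholder values)
MODEL_PATH="/path/to/aguvis/checkpoint"        # path to the trained model
IMAGE_PATH="/path/to/screenshot.png"           # screenshot of the current GUI state
INSTRUCTION="Open the settings page"           # high-level task instruction
PREVIOUS_ACTIONS=""                            # leave empty for the first step
LOW_LEVEL_INSTRUCTION=""                       # optional low-level hint; leave empty if unused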

Checklist

  • Data
    • ✅ Stage 1: Grounding Dataset
    • ✅ Stage 2: Planning and Reasoning Trajectories
  • Code
    • ✅ Training Pipeline
    • 🚧 Model Weights and Configurations
    • 🚧 Inference Scripts
    • 🚧 Evaluation Toolkit

Citation

If you find this work helpful, please cite it as:

@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}
