[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction


📑 Paper    |    🌐 Project Page    |    💾 AGUVIS Data Collection

Introduction

AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operates across platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS works from purely visual observations and a consistent action space, which improves generalization across platforms.

Key Features & Contributions

  • 🔍 Pure Vision Framework: First fully autonomous pure-vision GUI agent capable of performing tasks independently without relying on closed-source models
  • 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
  • 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
  • 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
  • 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.

overview.mp4

Mobile Tasks (Android World)

androidworld.mp4

Web Browsing Tasks (Mind2Web-Live)

mind2web-live.mp4

Computer-use Tasks (OSWorld)

osworld.mp4

Getting Started

Installation

  1. Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
  2. Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
  3. Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
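
Optionally, you can run a quick sanity check before moving on; the one-liner below simply verifies that PyTorch imports and reports whether a CUDA GPU is visible (it assumes the aguvis conda environment is active).

# Verify the PyTorch install and GPU visibility
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"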

Data Preparation

  1. Stage 1: Grounding

  2. Stage 2: Planning and Reasoning

Training

  1. Configure your training settings:

    • Open scripts/train.sh
    • Set the SFT_TASK variable to specify your training stage (a sample setting is shown below)
  2. Start training:

bash scripts/train.sh
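
For reference, the relevant line in scripts/train.sh might look like the sketch below; the stage identifier here is a placeholder, so use whichever value the script actually defines for the grounding or planning-and-reasoning stage.

# Example in scripts/train.sh (placeholder stage name)
SFT_TASK="stage1-grounding"   # switch to the stage-2 planning/reasoning task name for the second stage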

Model Checkpoints

Inference

  1. Configure your inference settings (sample values are shown after step 2):

    • Open scripts/inference.sh
    • Set the MODEL_PATH variable to specify your model path
    • Set the IMAGE_PATH variable to specify your image path
    • Set the INSTRUCTION variable to specify your instruction
    • Set the PREVIOUS_ACTIONS variable to specify your previous actions, or leave it empty
    • Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction, or leave it empty
  2. Start inference:

bash scripts/inference.sh
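
The snippet below is only an illustrative sketch of how these variables might be filled in inside scripts/inference.sh; all paths and the instruction text are placeholders, not values shipped with the repository.

# Example settings in scripts/inference.sh (placeholder values)
MODEL_PATH="/path/to/aguvis/checkpoint"        # path to the trained model
IMAGE_PATH="/path/to/screenshot.png"           # screenshot of the current GUI state
INSTRUCTION="Open the settings page"           # high-level task instruction
PREVIOUS_ACTIONS=""                            # leave empty for the first step
LOW_LEVEL_INSTRUCTION=""                       # optional low-level hint; leave empty if unused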

Checklist

  • Data
    • ✅ Stage 1: Grounding Dataset
    • ✅ Stage 2: Planning and Reasoning Trajectories
  • Code
    • ✅ Training Pipeline
    • 🚧 Model Weights and Configurations
    • 🚧 Inference Scripts
    • 🚧 Evaluation Toolkit

Citation

If you find this work helpful, please cite it as:

@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}
