ritzz-ai/GUI-R1


The official repo for "GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents".

🤗 GUI-R1-3K | 🤗 GUI-R1 | 📑 Paper

News

  • [2025/05/04] We released an 800K high-quality reinforcement learning dataset, filtered from the OS-Atlas pretraining data using QwenVL2.5-7B and covering varying levels of difficulty. From it, we further filtered a diverse subset of 10K samples and applied the DAPO algorithm, which gives you the potential to outperform InfiGUI-R1. We warmly welcome everyone to use it!
  • [2025/04/18] We released the weights, code, and scripts.
  • [2025/04/17] We released the Dataset!
  • [2025/04/14] Our GUI-R1 paper (GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents) is now available on arXiv!
  • [2025/03/10] We started our project.

Our Exploration

By leveraging a small amount of carefully curated, high-quality data across multiple platforms (including Windows, Linux, macOS, Android, and web) and employing policy optimization algorithms such as group relative policy optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods such as OS-Atlas, across eight benchmarks spanning three platform types (mobile, desktop, and web). These results demonstrate the strong potential of reinforcement learning based on unified action-space rule modeling for improving the execution capabilities of LVLMs on real-world GUI agent tasks.
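For intuition about the GRPO update, here is a minimal sketch of the group-relative advantage computation it relies on; the function and the example reward values are illustrative only and are not code from this repository.

    import numpy as np

    def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
        # Each sampled response is scored against the mean/std of the rewards
        # in its own group, so no learned critic (value model) is needed.
        return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

    # Example: rule-based rewards for four responses sampled for one
    # (instruction, screenshot) pair.
    rewards = np.array([1.0, 0.0, 0.5, 1.0])
    print(grpo_advantages(rewards))  # above-average responses get positive advantage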

Framework

Given the high-level instruction, the action history, and the screenshot input, the policy model generates multiple responses containing reasoning steps. Verifiable rewards, such as the action type reward, click point reward, and input text reward, are then used with a policy gradient optimization algorithm to update the policy model.
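To make the rule-based nature of these rewards concrete, the sketch below scores a predicted action against ground truth. The function signatures, action schema, and all-or-nothing scoring are assumptions for illustration; the exact reward definitions used in GUI-R1 are given in the paper and the training code.

    def action_type_reward(pred_type: str, gt_type: str) -> float:
        # 1 if the predicted action type (e.g. click, type, scroll) matches, else 0.
        return 1.0 if pred_type == gt_type else 0.0

    def click_point_reward(pred_xy: tuple, gt_bbox: tuple) -> float:
        # 1 if the predicted click point lies inside the ground-truth element box.
        x, y = pred_xy
        x1, y1, x2, y2 = gt_bbox
        return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

    def input_text_reward(pred_text: str, gt_text: str) -> float:
        # 1 for an exact match of the typed text, ignoring case and outer whitespace.
        return 1.0 if pred_text.strip().lower() == gt_text.strip().lower() else 0.0

A scalar reward for a rollout can then be formed by summing whichever of these terms apply to the ground-truth action type.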

Result

[Result figures: benchmark comparison tables; see the repository for the images.]

Requirements

We recommend using the pre-built docker image in EasyR1.

    # stable
    docker pull hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.4-hotfix
    # nightly
    docker pull hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2

Data preparation

Download the training and evaluation dataset GUI-R1-3K.

The structure of the directory should be:

    │── Dataset
    │   ├── train.parquet
    │   ├── test.parquet
    │   ├── androidcontrol_high_test.parquet
    │   ├── androidcontrol_low_test.parquet
    │   ├── guiact_web_test.parquet
    │   ├── guiodyssey_test.parquet
    │   ├── omniact_web_test.parquet
    │   ├── omniact_desktop_test.parquet
    │   ├── screenspot_pro_test.parquet
    │   ├── screenspot_test.parquet
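Assuming standard parquet tooling, you can sanity-check a downloaded split with pandas; this snippet only prints whatever the file contains, since the schema is defined by the dataset itself (the path below assumes the layout shown above).

    import pandas as pd

    # Inspect one split of GUI-R1-3K; adjust the path to where Dataset/ was placed.
    df = pd.read_parquet("Dataset/train.parquet")
    print(df.shape)
    print(df.columns.tolist())
    print(df.head(1))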

RL Training

    bash examples/qwen2_5_vl_3b_gui_grpo.sh
    bash examples/qwen2_5_vl_7b_gui_grpo.sh

Inference and Evaluation

    cd guir1
    bash inference.sh
    bash eval.sh
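Evaluation ultimately compares an action parsed out of the model's free-form response against the ground truth. Below is a minimal parsing sketch; the <answer> tag and the JSON fields are hypothetical stand-ins for whatever output format the repo's prompts actually define.

    import json
    import re

    def parse_action(response: str):
        # Pull a JSON action out of a response such as
        # '<think>...</think><answer>{"action": "click", "point": [512, 384]}</answer>'.
        # The <answer> tag and JSON schema are assumptions for illustration.
        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        if match is None:
            return None
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None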

Star History

Star History Chart

Acknowledgements

We would like to express our sincere gratitude to DeepSeek, VLM-R1, QwenVL, EasyR1, and OS-ATLAS for providing open-source resources that contributed to the development of this project.

Citation

If you find this repo useful for your research, please consider citing the paper:

    @article{luo2025gui,
      title={GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents},
      author={Luo, Run and Wang, Lu and He, Wanwei and Xia, Xiaobo},
      journal={arXiv preprint arXiv:2504.10458},
      year={2025}
    }

