# GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
The official repo for "GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents".
## News
- [2025/05/04] We released an 800K high-quality reinforcement learning dataset, filtered from the OS-Atlas pretraining data using Qwen2.5-VL-7B, with varying levels of difficulty. From it, we further filtered a diverse subset of 10K samples and applied the DAPO algorithm, giving you the potential to outperform InfiGUI-R1. We warmly welcome everyone to use it!
- [2025/04/18] We released the weights, code, and scripts.
- [2025/04/17] We released the dataset!
- [2025/04/14] Our GUI-R1 paper (GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents) is available on arXiv!
- [2025/03/10] We started our project.
## Overview
By leveraging a small amount of carefully curated, high-quality data across multiple platforms (including Windows, Linux, macOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods such as OS-Atlas, across eight benchmarks spanning three platforms (mobile, desktop, and web). These results demonstrate the strong potential of reinforcement learning based on unified action-space rule modeling for improving the execution capabilities of LVLMs on real-world GUI agent tasks.
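For intuition, the core idea of GRPO is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. Below is a minimal sketch of that group-relative advantage computation; it is illustrative only, not the repo's actual training code (which lives in the EasyR1-based scripts):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own
    group of sampled responses (subtract the group mean, divide by
    the group std).

    rewards: array of shape (num_prompts, group_size).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 8 sampled responses for one prompt with binary rule-based rewards.
print(group_relative_advantages([[1, 0, 0, 1, 1, 0, 0, 0]]))
```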
Given the high-level instruction, action history, and visual image inputs, the policy model generates multiple responses containing reasoning steps. Then, verifiable rewards such as the action type reward, click point reward, and input text reward are combined with a policy-gradient optimization algorithm to update the policy model.
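The exact reward definitions are in the training code; as a rough sketch of what such verifiable, rule-based rewards can look like for GUI actions (the field names, reward values, and text-matching rule here are illustrative assumptions, not the paper's exact specification):

```python
def action_reward(pred, gold):
    """Illustrative rule-based reward for one predicted GUI action.

    `pred` and `gold` are dicts with hypothetical fields:
      type:  action name, e.g. "click" or "input_text"
      point: (x, y) predicted click coordinate
      bbox:  (x1, y1, x2, y2) ground-truth clickable region
      text:  string to type for input actions
    """
    reward = 0.0
    # Action type reward: exact match on the action name.
    if pred.get("type") == gold.get("type"):
        reward += 1.0
        # Click point reward: the predicted point must fall inside the
        # ground-truth bounding box.
        if gold["type"] == "click":
            x, y = pred["point"]
            x1, y1, x2, y2 = gold["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0
        # Input text reward: a simple normalized exact match.
        elif gold["type"] == "input_text":
            if pred["text"].strip().lower() == gold["text"].strip().lower():
                reward += 1.0
    return reward
```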
## Set Up
We recommend using the pre-built docker image from EasyR1.
```
# stable
docker pull hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.4-hotfix

# nightly
docker pull hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
```

Download the training and evaluation dataset GUI-R1-3K.
The structure of the directory should be:
```
│──Dataset
│   ├──train.parquet
│   ├──test.parquet
│   ├──androidcontrol_high_test.parquet
│   ├──androidcontrol_low_test.parquet
│   ├──guiact_web_test.parquet
│   ├──guiodyssey_test.parquet
│   ├──omniact_web_test.parquet
│   ├──omniact_desktop_test.parquet
│   ├──screenspot_pro_test.parquet
│   ├──screenspot_test.parquet
```

## Training
Launch GRPO training with the provided example scripts:

```
bash examples/qwen2_5_vl_3b_gui_grpo.sh
bash examples/qwen2_5_vl_7b_gui_grpo.sh
```
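If the scripts fail to locate the data, a quick way to sanity-check the downloaded parquet files is to inspect their schema (this assumes pandas with a parquet engine such as pyarrow installed; column names depend on the dataset release, so we just print them):

```python
import pandas as pd

# Peek at the training split to confirm the download and see its schema.
df = pd.read_parquet("Dataset/train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0])
```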
## Inference and Evaluation

```
cd guir1
bash inference.sh
bash eval.sh
```

## Acknowledgement
We would like to express our sincere gratitude to DeepSeek, VLM-R1, QwenVL, EasyR1, and OS-ATLAS for providing open-source resources that contributed to the development of this project.
## Citation
If you find this repo useful for your research, please consider citing the paper:
```bibtex
@article{luo2025gui,
  title={GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents},
  author={Luo, Run and Wang, Lu and He, Wanwei and Xia, Xiaobo},
  journal={arXiv preprint arXiv:2504.10458},
  year={2025}
}
```