Installation | Quick Start | Implementation Details | Add Dataset/Environment
Debug & Known Issues | License | Acknowledgement | Citation
This is the official implementation of "ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning".
If you like our work, we would be delighted if you gave us a star ⭐!
- [2025/11/28] 🔥 ReinFlow now supports fine-tuning GR00T VLA models from NVIDIA. Check it out at RLinf-GR00T-N1.5.
- [2025/11/7] Updated the limitations section.
- [2025/11/5] Updated tips on hyperparameter tuning.
- [2025/11/2] 🔥 We scaled up ReinFlow to fine-tune VLA models such as $\pi_0$ and $\pi_{0.5}$. The code and checkpoint for the LIBERO environment are available at RLinf-pi0. A technical report including results on LIBERO, MetaWorld, and ManiSkill/Simpler is available at "$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models" (arXiv:2510.25889).
- [2025/09/18] The ReinFlow paper is accepted at NeurIPS 2025.
- [2025/08/18] All training metrics (losses, rewards, etc.) released in WandB to help you reproduce our results.
- [2025/07/30] Fixed the rendering bug in Robomimic. Now supports rendering at 1080p resolution.
- [2025/07/29] Added a tutorial on how to record videos during evaluation in the docs.
- [2025/06/14] Updated the webpage with a detailed explanation of the algorithm design.
- [2025/05/28] Paper is posted on arXiv!
ReinFlow is a flexible policy gradient framework for fine-tuning flow matching policies at any denoising step.
How does it work?
👉 First, train flow policies using imitation learning (behavior cloning).
👉 Then, fine-tune them with online reinforcement learning using ReinFlow!
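For intuition, here is a minimal sketch of the first stage: behavior cloning a 1-rectified-flow policy with the standard conditional flow-matching regression loss. The function and the `velocity_net` interface are illustrative assumptions, not the repository's actual API.

```python
import torch

def flow_matching_bc_loss(velocity_net, obs, expert_action):
    """Behavior-cloning loss for a 1-rectified-flow policy (schematic).

    velocity_net(noisy_action, t, obs) returns a predicted velocity with the
    same shape as the action.
    """
    a1 = expert_action                                     # target action from demonstrations
    a0 = torch.randn_like(a1)                              # Gaussian noise (flow source)
    t = torch.rand(a1.shape[0], 1, device=a1.device)       # random interpolation time in [0, 1]
    xt = (1.0 - t) * a0 + t * a1                           # straight-line interpolant
    target_velocity = a1 - a0                              # constant velocity of the straight path
    pred_velocity = velocity_net(xt, t, obs)
    return ((pred_velocity - target_velocity) ** 2).mean()
```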
🧩 Supports:
- ✅ 1-Rectified Flow
- ✅ Shortcut Models
- ✅ Any other policy defined by ODEs (in principle)
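At inference time, a trained flow policy turns noise into an action by integrating the learned ODE with only a few Euler steps. The sketch below is schematic, with a hypothetical `velocity_net` and tensor shapes; the repository's actual samplers (including the shortcut-model variant) may differ.

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, obs, action_dim, num_steps=4):
    """Euler integration of da/dt = v_theta(a, t, obs) from t=0 to t=1 (schematic)."""
    a = torch.randn(obs.shape[0], action_dim, device=obs.device)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((obs.shape[0], 1), k * dt, device=obs.device)
        a = a + velocity_net(a, t, obs) * dt                      # one Euler step along the learned flow
    return a
```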
📈 Empirical Results: ReinFlow achieves strong performance across a variety of robotic tasks:
- 🦵 Legged Locomotion (OpenAI Gym)
- ✋ State-based manipulation (Franka Kitchen)
- 👀 Visual manipulation (Robomimic)
🧠 Key Innovation: ReinFlow trains a noise injection network end-to-end:
- ✅ Makes policy probabilities tractable, even with very few denoising steps (e.g., 4, 2, or 1)
- ✅ Robust to discretization and Monte Carlo approximation errors
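Conceptually, the noise injection network converts each deterministic Euler step into a Gaussian transition, so the log-probability of the whole denoising chain is a sum of Gaussian log-densities that can be fed into a policy-gradient (e.g., PPO-style) objective. The sketch below illustrates this idea under assumed names (`velocity_net`, `noise_net`) and a simplified noise parameterization; it is not the paper's exact formulation.

```python
import torch

def sample_with_logprob(velocity_net, noise_net, obs, action_dim, num_steps=4):
    """Stochastic denoising with learned noise injection (schematic).

    Each transition x_{k+1} ~ N(x_k + v_theta(x_k, t_k, obs) * dt, sigma_phi(x_k, t_k, obs)^2)
    is Gaussian, so the chain log-probability is exactly computable.
    """
    batch = obs.shape[0]
    x = torch.randn(batch, action_dim, device=obs.device)
    dt = 1.0 / num_steps
    log_prob = torch.zeros(batch, device=obs.device)
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt, device=obs.device)
        mean = x + velocity_net(x, t, obs) * dt         # deterministic Euler drift
        std = noise_net(x, t, obs)                      # learned, state-dependent noise scale (> 0)
        dist = torch.distributions.Normal(mean, std)
        x = dist.sample()                               # inject noise: the policy becomes stochastic
        log_prob = log_prob + dist.log_prob(x).sum(-1)  # accumulate the exact chain log-density
    return x, log_prob                                  # log_prob enters the policy-gradient update
```

Setting the injected noise to zero recovers the deterministic Euler sampler sketched earlier, which is why the same framework applies at any number of denoising steps.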
Learn more on our 🔗 project website or check out the arXiv paper.
Please follow the steps in installation/reinflow-setup.md.
To fully reproduce our experiments, please refer to ReproduceExps.md.
To download our training data and reproduce the plots in the paper, please refer to ReproduceFigs.md.
Please refer to Implement.md for descriptions of the key hyperparameters of FQL, DPPO, and ReinFlow.
Please refer to Custom.md.
Please refer to KnownIssues.md to see how to resolve errors you encounter.
After training flow policies with RL on multiple benchmarks (OpenAI Gym, Franka Kitchen, Robomimic, LIBERO, ManiSkill, MetaWorld) and scaling model size from 3M to 3B parameters, we have found the following hyperparameters to be critical to RL's success, especially in visual manipulation with sparse rewards:
- SFT success rate. RL cannot easily train visual manipulation policies from scratch, so optimize your SFT success rate before starting RL. The stronger your SFT checkpoint is, the easier RL will be.
- Noise level. When the SFT success rate is low, tune the noise down to [0.04, 0.10] or [0.05, 0.12] to avoid too many erroneous behaviors during early-stage exploration. When the SFT success rate is high, relaxing the noise level to [0.08, 0.16] is usually good practice.
- Entropy coefficient. Turn it off first. If the policy struggles to improve, adding a small coefficient of 0.005 may help. When the policy is small and the problem is simple (dense reward, low-dimensional input), use a larger entropy coefficient; otherwise, be cautious about increasing this constant.
- Critic warmup. The stronger your SFT checkpoint is, the more you need critic warmup. Pick an appropriate critic network architecture and run some rounds of warmup before policy gradient ascent. Make sure the critic loss decreases smoothly after the warmup phase, and keep a keen eye on the explained variance: it should quickly rise to a high level. Even without warmup, ReinFlow should still increase the success rate eventually, but convergence is usually slower.
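To make these tips concrete, here is an illustrative set of starting values. The key names and the warmup count are placeholders, not the repository's actual config schema; the numeric ranges simply restate the guidance above.

```python
# Illustrative ReinFlow fine-tuning hyperparameters distilled from the tips above.
# Key names are hypothetical placeholders, not the repository's config schema.
finetune_config = {
    "noise_level_range": [0.08, 0.16],  # drop to [0.04, 0.10] or [0.05, 0.12] if SFT success rate is low
    "entropy_coef": 0.0,                # start at 0; try ~0.005 only if the policy plateaus
    "critic_warmup_rounds": 5,          # assumed value: warm up the critic before policy-gradient updates
    "denoising_steps": 4,               # ReinFlow also supports 2 or even 1 step
}
```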
Based on community feedback, we have added a limitations section to highlight the shortcomings of our algorithm and note important caveats. We hope this discussion will inspire future research.
- ReinFlow may not be an optimal method to train RL agents from scratch. Our method is designed for fine-tuning purposes, not pre-training.
- Release pi0, pi0.5 fine-tuning results.
- Release WandB metrics
- Release docs
- Release checkpoints
- Release codebase
This repository is released under the MIT license. See LICENSE. If you use our code, we would appreciate it if you pasted the license at the beginning of your scripts.
This repository was developed from multiple open-source projects. Major references include:
- TorchCFM, Tong et al.: Conditional flow-matching repository.
- Shortcut Models, Frans et al.: One-Step Diffusion via Shortcut Models.
- DPPO, Ren et al.: DPPO official implementation.
We also thank our collaborators from the open-source RL infrastructure project RLinf for their generous support, which enabled scaling ReinFlow to models of up to 3 billion parameters across 320 highly randomized visual manipulation environments with thousands of object-scene-task-pose combinations.
For more references, please refer to Acknowledgement.md.
@misc{zhang2025reinflowfinetuningflowmatching,
  title={ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning},
  author={Tonghe Zhang and Chao Yu and Sichang Su and Yu Wang},
  year={2025},
  eprint={2505.22094},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.22094},
}