arXiv:2409.14411 (cs)
[Submitted on 22 Sep 2024 (v1), last revised 14 Nov 2024 (this version, v2)]
Title: Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
Authors: Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang
View a PDF of the paper titled Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation, by Minjie Zhu and 10 other authors
Abstract: Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. Diffusion Policy is expected to be scalable, a key attribute of deep neural networks, meaning that increasing model size should improve performance. However, we observe that Diffusion Policy in a transformer architecture (DP-T) struggles to scale effectively: even adding a few layers can degrade training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy (ScaleDP) for visuomotor learning. ScaleDP introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradients, which make the optimization of Diffusion Policy unstable. To resolve this, we factorize the feature embedding of the observation into multiple affine layers and integrate it into the transformer blocks. Second, we use non-causal attention, which allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our method successfully scales Diffusion Policy from 10 million to 1 billion parameters, and that ScaleDP gains in performance and generalization as model size grows. We benchmark ScaleDP on 50 tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, ScaleDP achieves an average improvement over DP-T of 36.25% on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at this http URL.
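The abstract's first module factorizes the observation embedding into several affine layers that condition each transformer block. The sketch below is a minimal, dependency-free illustration assuming the adaptive layer norm (adaLN-Zero) pattern used in Diffusion Transformers (DiT): six affine layers map the observation embedding to a shift, scale, and gate for each of the block's two sublayers. The names `Affine` and `ConditionedBlock`, the six-map layout, and the zero initialization are illustrative assumptions rather than the authors' code, and the sublayers are stand-in callables instead of real attention and MLP layers.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class Affine:
    """Affine map y = W c + b, with W of shape (out_dim, in_dim)."""

    def __init__(self, in_dim, out_dim, rng, zero_init=False):
        self.W = [[0.0 if zero_init else rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, c):
        return [sum(w * ci for w, ci in zip(row, c)) + b
                for row, b in zip(self.W, self.b)]


class ConditionedBlock:
    """Hypothetical block whose observation conditioning is factorized into six
    affine maps: a shift, scale and gate for each of the two sublayers."""

    NAMES = ("shift1", "scale1", "gate1", "shift2", "scale2", "gate2")

    def __init__(self, dim, cond_dim, sublayer1, sublayer2, zero_init=True, seed=0):
        rng = random.Random(seed)
        self.affines = {n: Affine(cond_dim, dim, rng, zero_init) for n in self.NAMES}
        self.sublayer1 = sublayer1  # stand-in for self-attention
        self.sublayer2 = sublayer2  # stand-in for the MLP

    @staticmethod
    def _modulate(x, shift, scale):
        # Feature-wise affine modulation of the normalized input.
        return [xi * (1.0 + s) + t for xi, s, t in zip(x, scale, shift)]

    def __call__(self, x, obs_emb):
        # Every modulation vector is an affine function of the observation embedding.
        p = {n: f(obs_emb) for n, f in self.affines.items()}
        h = self.sublayer1(self._modulate(layer_norm(x), p["shift1"], p["scale1"]))
        x = [xi + g * hi for xi, g, hi in zip(x, p["gate1"], h)]  # gated residual
        h = self.sublayer2(self._modulate(layer_norm(x), p["shift2"], p["scale2"]))
        return [xi + g * hi for xi, g, hi in zip(x, p["gate2"], h)]


block = ConditionedBlock(dim=4, cond_dim=3,
                         sublayer1=lambda v: [2.0 * a for a in v],
                         sublayer2=lambda v: [a + 1.0 for a in v])
x = [0.5, -1.0, 2.0, 0.0]
obs = [1.0, 0.0, -1.0]
print(block(x, obs))  # [0.5, -1.0, 2.0, 0.0]: all gates are 0 at initialization
```

With zero-initialized affine layers every gate starts at 0, so each block begins as the identity map. This is one common way to keep early updates small when stacking many blocks; the abstract itself does not state how ScaleDP initializes these layers.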
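The second module replaces causal attention over the predicted action tokens with non-causal attention. The minimal sketch below assumes a single attention head whose queries, keys, and values are the action tokens themselves (no learned projections), and the function names are hypothetical. It shows the difference: under a causal mask, changing a later action cannot affect the output at an earlier position, whereas under non-causal attention every position attends to, and so can "see", the later actions.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, causal):
    """Single-head scaled dot-product attention over a sequence of vectors.

    If causal is True, token i attends only to tokens 0..i; if False,
    every token attends to the whole sequence, including later tokens."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        visible = range(i + 1) if causal else range(len(k))
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in visible]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, visible))
                    for c in range(len(v[0]))])
    return out


actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]  # only the last action differs

# Causal: token 0 attends only to itself, so the change to the future action is invisible.
print(attention(actions, actions, actions, causal=True)[0])        # [1.0, 0.0]
print(attention(perturbed, perturbed, perturbed, causal=True)[0])  # [1.0, 0.0]

# Non-causal: token 0 also attends to token 2, so its output changes.
print(attention(actions, actions, actions, causal=False)[0])
print(attention(perturbed, perturbed, perturbed, causal=False)[0])
```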
Subjects: Robotics (cs.RO)
Cite as: arXiv:2409.14411 [cs.RO]
(or arXiv:2409.14411v2 [cs.RO] for this version)
DOI: https://doi.org/10.48550/arXiv.2409.14411 (arXiv-issued DOI via DataCite)
Submission history
From: Yichen Zhu
[v1] Sun, 22 Sep 2024 12:14:16 UTC (5,701 KB)
[v2] Thu, 14 Nov 2024 11:59:09 UTC (5,701 KB)