[WIP] Fix ppo example accelerator initialization error #3732


Draft
ccs96307 wants to merge 4 commits into huggingface:main from ccs96307:fix-ppo-example-accelerator-error

ccs96307 (Contributor)

What does this PR do?

This PR uses the parent class's create_accelerator_and_postprocess method to initialize the accelerator correctly, without overhauling the PPOTrainer initialization flow.

  • Uses create_accelerator_and_postprocess instead of manual accelerator setup.
  • Initialization now succeeds under specific configurations: ZeRO Stage 1 supports any --gradient_accumulation_steps value, whereas ZeRO Stage 2 and 3 require --gradient_accumulation_steps to be 1.
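
The constraint in the second bullet could be checked up front, before training starts. The following is a minimal sketch only: `check_zero_grad_accum` is a hypothetical helper that is not part of TRL, Accelerate, or DeepSpeed, and it simply encodes the combinations described above.

```python
def check_zero_grad_accum(zero_stage: int, gradient_accumulation_steps: int) -> None:
    """Hypothetical pre-flight check (not a TRL API).

    Encodes the behavior reported in this PR: ZeRO Stage 1 works with any
    accumulation step count, while Stage 2 and 3 only work with a value of 1.
    """
    if gradient_accumulation_steps < 1:
        raise ValueError("gradient_accumulation_steps must be >= 1")
    if zero_stage >= 2 and gradient_accumulation_steps > 1:
        raise ValueError(
            f"ZeRO stage {zero_stage} with gradient_accumulation_steps="
            f"{gradient_accumulation_steps} is expected to fail in no_sync; "
            "use ZeRO stage 1 or set gradient_accumulation_steps to 1"
        )


if __name__ == "__main__":
    # Try a few (stage, steps) combinations and report which ones pass.
    for stage, steps in [(1, 4), (2, 1), (3, 1), (3, 4)]:
        try:
            check_zero_grad_accum(stage, steps)
            print(f"stage={stage} steps={steps}: ok")
        except ValueError as err:
            print(f"stage={stage} steps={steps}: rejected ({err})")
```

Failing early with a readable message would replace the long traceback with a single actionable line, at the cost of hard-coding a limitation that may be lifted upstream.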

Note: With --gradient_accumulation_steps > 1, running with ZeRO Stage 2 or 3 still triggers the well-known error:

[rank0]:   File "/workspace/Projects/trl/examples/scripts/ppo/ppo.py", line 163, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/workspace/Projects/trl/trl/trainer/ppo_trainer.py", line 668, in train
[rank0]:     with accelerator.accumulate(model):
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1166, in accumulate
[rank0]:     cm_stack.enter_context(contextlib.nullcontext() if allow_gradient_sync else self.no_sync(m))
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 526, in enter_context
[rank0]:     result = _enter(cm)
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1047, in no_sync
[rank0]:     with context():
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2243, in no_sync
[rank0]:     assert not self.zero_optimization_partition_gradients(), \
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/Projects/trl/examples/scripts/ppo/ppo.py", line 163, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/workspace/Projects/trl/trl/trainer/ppo_trainer.py", line 668, in train
[rank1]:     with accelerator.accumulate(model):
[rank1]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1166, in accumulate
[rank1]:     cm_stack.enter_context(contextlib.nullcontext() if allow_gradient_sync else self.no_sync(m))
[rank1]:   File "/usr/lib/python3.12/contextlib.py", line 526, in enter_context
[rank1]:     result = _enter(cm)
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1047, in no_sync
[rank1]:     with context():
[rank1]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Projects/trl/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2243, in no_sync
[rank1]:     assert not self.zero_optimization_partition_gradients(), \
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

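The call chain in the traceback can be imitated without any GPU or distributed setup. The sketch below uses stand-ins only: `FakeEngine` and this `accumulate` function are hypothetical and merely mirror the two lines visible in the traceback (the conditional `no_sync` in `accumulate` and the assertion in the engine's `no_sync`). It also assumes that gradient partitioning starts at ZeRO stage 2.

```python
import contextlib


class FakeEngine:
    """Stand-in for a DeepSpeed engine; only the ZeRO stage matters here."""

    def __init__(self, zero_stage: int):
        self.zero_stage = zero_stage

    def partitions_gradients(self) -> bool:
        # Assumption: ZeRO stage 2 and 3 partition gradients, stage 0 and 1 do not.
        return self.zero_stage >= 2

    @contextlib.contextmanager
    def no_sync(self):
        # Mirrors the assertion shown at the bottom of the traceback.
        assert not self.partitions_gradients(), (
            "no_sync context manager is incompatible with gradient "
            f"partitioning logic of ZeRO stage {self.zero_stage}"
        )
        yield


@contextlib.contextmanager
def accumulate(engine: FakeEngine, sync_gradients: bool):
    """Mirrors the shape of the accumulate frame: no_sync is entered only
    on steps where gradients should not be synchronized."""
    with contextlib.ExitStack() as stack:
        stack.enter_context(
            contextlib.nullcontext() if sync_gradients else engine.no_sync()
        )
        yield


if __name__ == "__main__":
    for stage in (1, 2, 3):
        try:
            # sync_gradients=False models an intermediate accumulation step,
            # which only occurs when gradient_accumulation_steps > 1.
            with accumulate(FakeEngine(stage), sync_gradients=False):
                pass
            print(f"ZeRO stage {stage}: ok")
        except AssertionError as err:
            print(f"ZeRO stage {stage}: {err}")
```

With an accumulation step count of 1, every step synchronizes gradients, so the `no_sync` branch is never taken. That is consistent with Stage 2 and 3 working only in that setting.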
The related issues are as follows:


Request for feedback

  1. Combine fixes?
    Should this PR also address the no_sync incompatibility with ZeRO 2/3 (i.e. implement a workaround or guard), or…

  2. Separate the issue?
    Open a new issue for the ZeRO compatibility problem and keep this PR focused solely on "accelerator initialization"?
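
For option 1, one possible shape of a guard is sketched below. It is an illustration, not a proposed patch: `grad_sync_context` and `DummyModel` are hypothetical names, and the sketch does not verify whether skipping `no_sync` under ZeRO 2/3 yields correct gradient accumulation, which depends on how the DeepSpeed engine handles accumulation boundaries.

```python
import contextlib


def grad_sync_context(model, sync_gradients: bool, zero_stage: int):
    """Hypothetical guard: pick a context manager for one training step.

    Skips model.no_sync() whenever gradients are synced on this step or the
    ZeRO stage partitions gradients (assumed here to be stage >= 2).
    """
    if sync_gradients or zero_stage >= 2:
        return contextlib.nullcontext()
    no_sync = getattr(model, "no_sync", None)
    return no_sync() if callable(no_sync) else contextlib.nullcontext()


class DummyModel:
    """Stand-in model that counts how often no_sync is entered."""

    def __init__(self):
        self.no_sync_calls = 0

    @contextlib.contextmanager
    def no_sync(self):
        self.no_sync_calls += 1
        yield


if __name__ == "__main__":
    model = DummyModel()
    with grad_sync_context(model, sync_gradients=False, zero_stage=1):
        pass  # no_sync is entered: stage 1 does not partition gradients
    with grad_sync_context(model, sync_gradients=False, zero_stage=3):
        pass  # no_sync is skipped, so the assertion is never reached
    print("no_sync entered", model.no_sync_calls, "time(s)")
```

Whether such a guard belongs in the trainer or upstream in the library that calls `no_sync` is part of the question raised here.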

Any guidance or opinions are greatly appreciated -- thank you! 🙏


Fixes #2377

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
