pytorch/rl (Public)

v0 param server (using collectives not object store) #2865


Draft
mikaylagawarecki wants to merge 6 commits into gh/mikaylagawarecki/2/base from gh/mikaylagawarecki/2/head

Conversation

@mikaylagawarecki (Contributor) commented Mar 21, 2025 (edited)

@pytorch-bot (bot) commented Mar 21, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2865

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 32c10d7 with merge base 04d70c1:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mikaylagawarecki added a commit that referenced this pull requestMar 21, 2025
@facebook-github-bot added the "CLA Signed" label Mar 21, 2025. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
Comment on lines 293 to 297
handle = self.collector._remote_collectors[worker_id].call_policy_method.remote(
    "collective_rpc",
    ("update_weight",),
    {"args": (k, v.dtype, v.shape)},
)
@mikaylagawarecki (Contributor, Author) commented Mar 21, 2025 (edited)

@vmoens This is one part where I'm trying to call a method on the LLM object to init a process group with the vLLM workers; the second part is below on L293.

In this case, SyncDataCollector is remote and has an attribute .policy.

.policy is the ModuleDict object returned by from_vllm, and the actual LLM instance is in the "generate" key (the LLM instance is local to the SyncDataCollector).

How can I get a handle to the LLM instance within the SyncDataCollector to call remote methods on it, without the hacky call_policy_method implementation below?
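For context, a minimal sketch of the kind of forwarding call_policy_method implies (the ModuleDict layout with a "generate" key comes from the comment above; FakeLLM and all method bodies are illustrative assumptions, not the PR's actual code):

```python
class SyncDataCollectorSketch:
    """Illustrative stand-in for the remote SyncDataCollector (a Ray actor in the PR)."""

    def __init__(self, policy):
        # policy mimics the ModuleDict returned by from_vllm: the actual
        # LLM instance lives under the "generate" key.
        self.policy = policy

    def call_policy_method(self, method_name, args, kwargs):
        # Forward a method call to the LLM instance nested inside the
        # policy; this is the indirection the comment calls hacky.
        llm = self.policy["generate"]
        return getattr(llm, method_name)(*args, **kwargs)


class FakeLLM:
    """Hypothetical LLM exposing a collective_rpc entry point, as in the snippet above."""

    def collective_rpc(self, method, args=None):
        return (method, args)


collector = SyncDataCollectorSketch({"generate": FakeLLM()})
out = collector.call_policy_method(
    "collective_rpc", ("update_weight",), {"args": ("w", "float32", (2, 2))}
)
print(out)
```

With Ray, `call_policy_method.remote(...)` would dispatch the same forwarding on the actor, as in the snippet quoted above.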

mikaylagawarecki added a commit that referenced this pull requestMar 22, 2025
mikaylagawarecki added a commit that referenced this pull requestMar 22, 2025
Comment on lines +158 to +159
# here again, I want to grab the tp size from the vLLM worker... :(
# llm.llm_engine.parallel_config.tensor_parallel_size
@mikaylagawarecki (Contributor, Author) commented Mar 22, 2025 (edited)

@vmoens I keep finding that I want to get info off vLLM directly :/ What would you do here?

Should this vLLMRemoteWeightUpdaterBase be aware of all the vLLM engines, and the tp size of each, owned by its parent RayCollector in its __init__? I already needed to pass separate master_address and master_port, I guess.

A collaborator commented:

I think you're right, there's no way around it.

We need the main worker to know that information about the remote ones.
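One way to realize this, sketched under the assumptions in the exchange above (the class name, the engine_tp_sizes mapping, and the world_size helper are hypothetical; only master_address/master_port and the tensor_parallel_size attribute path come from the comments):

```python
class VLLMRemoteWeightUpdaterSketch:
    """Hypothetical updater whose parent collector passes down everything
    it knows about each remote vLLM engine, so the updater never has to
    query vLLM directly."""

    def __init__(self, master_address, master_port, engine_tp_sizes):
        # engine_tp_sizes maps a worker id to that engine's
        # tensor_parallel_size (on the vLLM side this would come from
        # llm.llm_engine.parallel_config.tensor_parallel_size).
        self.master_address = master_address
        self.master_port = master_port
        self.engine_tp_sizes = dict(engine_tp_sizes)

    def world_size(self, worker_id):
        # One trainer rank plus the engine's tensor-parallel workers.
        return 1 + self.engine_tp_sizes[worker_id]


updater = VLLMRemoteWeightUpdaterSketch("127.0.0.1", 29500, {0: 2, 1: 4})
print(updater.world_size(1))
```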

mikaylagawarecki added a commit that referenced this pull requestMar 22, 2025
mikaylagawarecki added a commit that referenced this pull requestMar 22, 2025


VLLM_ERR = None
try:
    import vllm

    _has_vllm = True
except ImportError as err:
    _has_vllm = False
    VLLM_ERR = err
A repository member commented:

Importing via find_spec might be preferred: https://docs.python.org/3/library/importlib.html#importlib.util.find_spec, as it doesn't do the actual import until needed.

@vmoens (Collaborator) commented Mar 28, 2025 (edited)

I agree, that's the proper way to do it. All third-party imports should be done locally even if not optional; otherwise they slow down multiproc / distributed start time and can cause bugs that are hard to debug.

mikaylagawarecki added a commit that referenced this pull requestMar 28, 2025

Reviewers

@vmoens left review comments

+1 more reviewer

@joecummings left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@mikaylagawarecki @joecummings @vmoens @facebook-github-bot
