v0 param server (using collectives not object store) #2865

Draft

mikaylagawarecki wants to merge 6 commits into gh/mikaylagawarecki/2/base from gh/mikaylagawarecki/2/head

Conversation

mikaylagawarecki (Contributor) commented Mar 21, 2025 (edited)

Stack from ghstack (oldest at bottom):

-> v0 param server (using collectives not object store) #2865

v0 param server (using collectives not object store)

c285ef8

[ghstack-poisoned]

pytorch-bot (bot) commented Mar 21, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2865

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 32c10d7 with merge base 04d70c1:

NEW FAILURES - The following jobs have failed:

Continuous Benchmark (PR) / CPU Pytest benchmark (gh)
Process completed with exit code 1.
Continuous Benchmark (PR) / GPU Pytest benchmark (gh)
Process completed with exit code 1.
Habitat Tests on Linux / tests (3.9, 12.8) / linux-job (gh)
RuntimeError: Command docker exec -t 4dbc541ccd54ef94d84a3cc0540d7a13ee841f3b02373a45eb358c29eb1fcaee /exec failed with exit code 1
Lint / python-source-and-configs / linux-job (gh)
torchrl/collectors/vllm_weight_update.py:191:9: F401 'transformers.AutoModel' imported but unused
Unit-tests on Linux / tests-cpu (3.12) / linux-job (gh)
test/test_env.py::TestNonTensorEnv::test_parallel[False-True]
Unit-tests on Linux / tests-olddeps (3.8, 11.6) / linux-job (gh)
RuntimeError: Command docker exec -t 7a0121bec6daf52cfb9989f88871ca4d4764cf10fe4824304174f3bea5425632 /exec failed with exit code 1
Unit-tests on Linux / tests-optdeps (3.11, 12.8) / linux-job (gh)
test/test_cost.py::TestPPO4LLMs::test_hf[False]

CANCELLED JOB - The following job was cancelled. Please retry:

LLM Tests on Linux / unittests (3.9, 12.8) / linux-job (gh)
##[error]The operation was canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Unit-tests on Windows / unittests-cpu (3.10, windows.4xlarge, cpu) / windows-job (gh) (trunk failure)
test/test_transforms.py::TestTimer::test_transform_env

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mikaylagawarecki added a commit that referenced this pull request

Mar 21, 2025

v0 param server (using collectives not object store)

da01dde

ghstack-source-id: 1761493
Pull Request resolved: #2865

facebook-github-bot added the CLA Signed label

Mar 21, 2025

mikaylagawarecki commented

Mar 21, 2025

param_server_weight_updater.py Outdated

Comment on lines 293 to 297

		handle = self.collector._remote_collectors[worker_id].call_policy_method.remote(
			"collective_rpc",
			("update_weight",),
			{"args": (k, v.dtype, v.shape)},
		)

mikaylagawarecki (Contributor, Author), Mar 21, 2025 (edited)

@vmoens This is one part where I'm trying to call a method on the LLM object to init a process group with the vllm workers; the second part is below on L293.

In this case, SyncDataCollector is remote and has an attribute .policy.

.policy is the ModuleDict object returned by from_vllm, and the actual LLM instance is under the generate key (the LLM instance is local to the SyncDataCollector).

How can I get a handle to the LLM instance within the SyncDataCollector to call remote methods on it without the hacky call_policy_method implementation below?
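
For readers following along, here is a minimal sketch of the forwarding pattern the question refers to. It is not the PR's implementation: `CollectorSketch` and `FakeLLM` are hypothetical stand-ins for the remote SyncDataCollector and the LLM instance, and the `collective_rpc` signature here only mirrors the call shape in the diff above. Under Ray, the forwarding method would be invoked on the actor handle with `.remote(...)`, as in the quoted diff.

```python
from typing import Any, Dict, Optional, Tuple


class FakeLLM:
    """Hypothetical stand-in for the LLM instance held by the collector."""

    def collective_rpc(self, method: str, args: Tuple[Any, ...] = ()) -> str:
        # A real engine would dispatch `method` to its workers; this stub
        # only echoes the request so the forwarding path can be inspected.
        return f"{method}{args}"


class CollectorSketch:
    """Hypothetical stand-in for the remote SyncDataCollector."""

    def __init__(self, policy: Dict[str, Any]) -> None:
        # Per the comment above, the LLM instance sits under the "generate" key.
        self.policy = policy

    def call_policy_method(
        self,
        method_name: str,
        args: Tuple[Any, ...] = (),
        kwargs: Optional[Dict[str, Any]] = None,
        key: str = "generate",
    ) -> Any:
        # Look up the object local to this collector and forward the call,
        # so the caller never needs a direct handle to the LLM instance.
        target = self.policy[key]
        return getattr(target, method_name)(*args, **(kwargs or {}))


collector = CollectorSketch({"generate": FakeLLM()})
result = collector.call_policy_method(
    "collective_rpc",
    ("update_weight",),
    {"args": ("layer.weight", "float32", (4, 4))},
)
```

The string-based dispatch is what makes this feel hacky: the method name is not checked until the call reaches the collector, and every new operation goes through the same untyped entry point.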

Update on "v0 param server (using collectives not object store)"

839cf0a

[ghstack-poisoned]

mikaylagawarecki added a commit that referenced this pull request

Mar 22, 2025

v0 param server (using collectives not object store)

3939a1e

ghstack-source-id: f9dabc3
Pull Request resolved: #2865

Update on "v0 param server (using collectives not object store)"

5c6b015

[ghstack-poisoned]

mikaylagawarecki added a commit that referenced this pull request

Mar 22, 2025

v0 param server (using collectives not object store)

6125fca

ghstack-source-id: b30dce2
Pull Request resolved: #2865

mikaylagawarecki commented

Mar 22, 2025

torchrl/collectors/vllm_weight_update.py

Comment on lines +158 to +159

		# here again, I want to grab the tp size from the vLLM worker... :(
		# llm.llm_engine.parallel_config.tensor_parallel_size

mikaylagawarecki (Contributor, Author), Mar 22, 2025 (edited)

@vmoens I keep finding that I want to get info off vllm directly :/ What would you do here?

Should this vLLMRemoteWeightUpdaterBase be told, in its __init__, about all the vllm engines owned by its parent RayCollector and the tp size of each? I guess I already needed to pass a separate master_address and master_port.

vmoens (Collaborator), Mar 24, 2025

I think you're right, there's no way around it.

We need the main worker to know that stuff about the remote ones.
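
To make the exchange above concrete, here is a rough sketch of why the main worker needs each engine's tp size up front: with that list it can compute the world size and each engine's starting rank before any process group is created. The layout is an assumption (a single group with rank 0 reserved for the weight sender, followed by each engine's tensor-parallel workers); the PR may arrange ranks differently. `EngineRanks` and `plan_ranks` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EngineRanks:
    engine_index: int
    tp_size: int
    rank_offset: int  # global rank of this engine's first TP worker


def plan_ranks(tp_sizes: Sequence[int]) -> Tuple[int, List[EngineRanks]]:
    """Return (world_size, per-engine rank plan) for the assumed layout."""
    plans: List[EngineRanks] = []
    next_rank = 1  # assumption: rank 0 is the process that sends weights
    for index, tp_size in enumerate(tp_sizes):
        if tp_size < 1:
            raise ValueError(f"tp_size must be >= 1, got {tp_size}")
        plans.append(EngineRanks(index, tp_size, next_rank))
        next_rank += tp_size
    return next_rank, plans


# Three hypothetical engines with different tensor-parallel sizes.
world_size, plans = plan_ranks([2, 4, 1])
for plan in plans:
    print(plan)
```

Under this layout, each engine would receive its `rank_offset` and the shared `world_size` along with the master_address and master_port mentioned above, so none of these values has to be read back from the remote vLLM workers.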

Update on "v0 param server (using collectives not object store)"

7bd553d

[ghstack-poisoned]

mikaylagawarecki added a commit that referenced this pull request

Mar 22, 2025

v0 param server (using collectives not object store)

d37f0bb

ghstack-source-id: 74de8e0
Pull Request resolved: #2865

Update on "v0 param server (using collectives not object store)"

0a265a9

[ghstack-poisoned]

mikaylagawarecki added a commit that referenced this pull request

Mar 22, 2025

v0 param server (using collectives not object store)

3b0e253

ghstack-source-id: 2352b0b
Pull Request resolved: #2865

joecummings reviewed

Mar 24, 2025

torchrl/collectors/vllm_weight_update.py



		VLLM_ERR = None
		try:

joecummings (Member), Mar 24, 2025

importlib.util.find_spec might be preferred: https://docs.python.org/3/library/importlib.html#importlib.util.find_spec as it doesn't do the actual import until needed.

vmoens (Collaborator), Mar 28, 2025 (edited)

I agree, that's the proper way to do it.
All third-party imports should be done locally, even if they are not optional (otherwise they slow down multiproc / distributed start time and can cause bugs that are hard to debug).

Update on "v0 param server (using collectives not object store)"

32c10d7

[ghstack-poisoned]

mikaylagawarecki added a commit that referenced this pull request

Mar 28, 2025

v0 param server (using collectives not object store)

d796e1d

ghstack-source-id: 70da726
Pull Request resolved: #2865

Labels

CLA Signed

This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

5 participants
