[train] add model (pipeline) parallelism example #22894
base: master
Conversation
Nice, this is a super cool example!
I know you already mentioned this as a TODO, but one thing that I think would be great is creating a directory containing the Python script, the YAML file you used, and a README explaining what the example does, how Ray Train helps, and how to deploy it (just copy from your PR description).
# Initialize process group and wrap model in DDP.
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model)
use train.torch.prepare_model() here?
Yeah, let me add that with move_to_device=False.
Actually this doesn't work:
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
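For reference, a minimal sketch of the workaround this thread converges on (wrap_pipeline_model is a hypothetical helper, assuming the process group has already been initialized by the Torch backend and the model is already a pipeline module whose stages live on several of this worker's GPUs): construct DistributedDataParallel without device_ids/output_device so it accepts a multi-device module.

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def wrap_pipeline_model(model):
    # Process group is assumed to be initialized already (e.g. by the training backend).
    assert dist.is_initialized()
    # No device_ids/output_device: the module's parameters span multiple CUDA
    # devices (one per pipeline stage), which DDP only supports in this form.
    return DistributedDataParallel(model)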
val_data = data_process(val_iter)
test_data = data_process(test_iter)
device = torch.device(num_gpus * local_rank)
can we use train.torch.get_device() here?
Ah, this actually doesn't work because the current implementation assumes device == local_rank. We need to expose a new API to get the list of devices for the multi-GPU case. I can create an issue to track this!
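A small sketch of the device layout this script assumes (worker_devices is a hypothetical helper, not part of the PR): each worker owns a contiguous block of num_gpus GPUs on its node, so its first pipeline stage lives on cuda:(num_gpus * local_rank) and the remaining stages on the GPUs that follow.

import torch

def worker_devices(local_rank, num_gpus):
    # Index of the first GPU assigned to this worker within the node.
    first = num_gpus * local_rank
    # One device per pipeline stage owned by this worker.
    return [torch.device("cuda", first + i) for i in range(num_gpus)]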
@@ -0,0 +1,352 @@
import argparse
Should we also add this here: https://docs.ray.io/en/latest/train/examples.html?
Definitely, I'm talking to @maxpumperla about converting this to a notebook before merging. With that, I will address the suggestions made in your main comment about including the README & YAML content.
@@ -0,0 +1,352 @@
import argparse
Can we also run a smoke test of this example in GPU CI?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
is this still relevant? should we close this?
Why are these changes needed?
This PR converts the following PyTorch example to run with Ray Train:
Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
The script is modified to support multi-node, a configurable number of workers, and a configurable number of GPUs per worker.
This is also added to the docs as an end-to-end example here.
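As a rough sketch of how the example is driven (the exact call signature depends on the Ray version; train_func, the config key, and the worker/GPU counts below are placeholders), a Ray Train Trainer is configured with a number of workers and a number of GPUs per worker:

from ray.train import Trainer

trainer = Trainer(
    backend="torch",
    num_workers=2,                     # configurable number of workers
    use_gpu=True,
    resources_per_worker={"GPU": 2},   # configurable GPUs per worker
)
trainer.start()
trainer.run(train_func, config={"num_gpus_per_worker": 2})
trainer.shutdown()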
Usage
1. Cluster config: pipeline_parallelism.yaml
2. Deploy cluster & set up SSH forwarding
3. Run script with Ray Client (a minimal connection sketch follows below)
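A minimal sketch of the Ray Client step, assuming the cluster's default client port 10001 has been forwarded to localhost via SSH:

import ray

# Connect to the remote cluster through the SSH-forwarded Ray Client port.
ray.init("ray://localhost:10001")
# The training driver (e.g. the script's main entry point) then runs against
# the remote cluster from the local machine.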
Related issue number
Closes #21508
Checks
I've run scripts/format.sh to lint the changes in this PR.