
[train] add model (pipeline) parallelism example #22894

Open

matthewdeng wants to merge 3 commits into ray-project:master from matthewdeng:pipeline

Conversation

matthewdeng (Contributor) commented Mar 8, 2022 (edited)

Why are these changes needed?

This PR converts the following PyTorch example to run with Ray Train:
Training Transformer models using Distributed Data Parallel and Pipeline Parallelism

The script is modified to support multi-node, a configurable number of workers, and a configurable number of GPUs per worker.
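As a rough illustration of how the worker count and per-worker GPU count fit together (this is a hedged sketch based on the Ray 1.x ray.train.Trainer API, not the exact code in this PR):

import ray
from ray.train import Trainer

def train_func():
    # Each worker builds its pipeline-parallel model across its local GPUs
    # and wraps it with DistributedDataParallel for data parallelism.
    ...

ray.init(address="ray://localhost:10001")  # connect via Ray Client
trainer = Trainer(
    backend="torch",
    num_workers=2,                    # number of data-parallel workers
    use_gpu=True,
    resources_per_worker={"GPU": 2},  # GPUs per worker (pipeline stages)
)
trainer.start()
trainer.run(train_func)
trainer.shutdown()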

This is also added to the docs as an end-to-end example here.

Usage

pipeline_parallelism.yaml:

cluster_name: pipeline_parallelism
max_workers: 1
provider:
    type: aws
    region: us-west-2
auth:
    ssh_user: ubuntu
available_node_types:
    4_gpu_node:
        min_workers: 0
        max_workers: 1
        node_config:
            InstanceType: g4dn.12xlarge
            ImageId: latest_dlami
        resources: {}
head_node_type: 4_gpu_node
setup_commands:
    - pip install -U ray torch torchtext torchdata

Deploy cluster & set up SSH forwarding:

ray up -y pipeline_parallelism.yaml
ray attach pipeline_parallelism.yaml -p 10001

Run script with Ray Client:

# 2 workers, 2 GPU each (single node)
python python/ray/train/examples/pipeline_parallelism_example.py --address "ray://localhost:10001" -n 2 -g 2

# 2 workers, 4 GPU each (multi node)
python python/ray/train/examples/pipeline_parallelism_example.py --address "ray://localhost:10001" -n 2 -g 4

# 4 workers, 2 GPU each (multi node)
python python/ray/train/examples/pipeline_parallelism_example.py --address "ray://localhost:10001" -n 4 -g 2

Related issue number

Closes #21508

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) left a comment
Nice, this is a super cool example!

I know you already mentioned this as a TODO, but one thing that I think would be great is creating a directory containing the python script, the yaml file you used, and a README explaining what the example does, how Ray Train helps, and how to deploy (just copy from your PR description).

# Initialize process group and wrap model in DDP.
from torch.nn.parallel import DistributedDataParallel

model = DistributedDataParallel(model)
amogkam (Contributor):

use torch.prepare_model() here?

matthewdeng (Contributor, Author):

Yeah, let me add that with move_to_device=False.
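For context, a minimal sketch of that change, assuming Ray Train's train.torch.prepare_model utility (see the follow-up below for why this turns out not to work here):

from ray import train

# Sketch: wrap the pipeline-parallel model in DDP via Ray Train's helper.
# move_to_device=False keeps the model where the Pipe module placed it,
# since its stages already span multiple GPUs on this worker.
model = train.torch.prepare_model(model, move_to_device=False)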

matthewdeng (Contributor, Author):
Actually this doesn't work:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
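The manual wrap in the current diff avoids this: per PyTorch's DDP semantics, a module whose parameters span multiple devices must be wrapped without device_ids, roughly:

from torch.nn.parallel import DistributedDataParallel

# For a multi-device module (e.g. one containing a torch.distributed.pipeline.sync.Pipe),
# DDP must be constructed without device_ids/output_device.
model = DistributedDataParallel(model)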

val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device(num_gpus * local_rank)
amogkam (Contributor):
can we use train.torch.get_device() here?

matthewdeng (Contributor, Author):

Ah this actually doesn't work because the current implementation assumes device == local_rank. We need to expose a new API to get the list of devices for the multi-GPU case. I can create an issue to track this!
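For illustration, such an API could look something like the sketch below (the helper name is hypothetical, not an existing Ray Train API; it mirrors the num_gpus * local_rank assignment the script currently uses and assumes ray.train.local_rank() for the worker's rank on its node):

import torch
from ray import train

# Hypothetical helper: list the CUDA devices owned by this worker when each
# worker is assigned `num_gpus` consecutive GPUs on its node.
def get_worker_devices(num_gpus: int):
    local_rank = train.local_rank()
    return [torch.device(f"cuda:{num_gpus * local_rank + i}") for i in range(num_gpus)]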

@@ -0,0 +1,352 @@
import argparse
matthewdeng (Contributor, Author):
Definitely, I'm talking to @maxpumperla about converting this to a notebook before merging. With that, I will address the suggestions made in your main comment about including the README & yaml content.

@@ -0,0 +1,352 @@
import argparse
amogkam (Contributor):

Can we also run a smoke test of this example in GPU CI?

stalebot commented Jun 19, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stalebot added the stale label (The issue is stale. It will be closed within 7 days unless there is further conversation) on Jun 19, 2022
stalebot commented Jul 10, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!

stalebot closed this on Jul 10, 2022
amogkam reopened this on Aug 30, 2022
stalebot removed the stale label on Aug 30, 2022
aslonnie (Collaborator) left a comment

is this still relevant? should we close this?

Reviewers

@amogkam - left review comments
@aslonnie - left review comments
@richardliaw - awaiting requested review
@krfricke - awaiting requested review
@xwjiang2010 - awaiting requested review
@Yard1 - awaiting requested review
@maxpumperla - awaiting requested review
@justinvyu - awaiting requested review (code owner)
@woshiyyya - awaiting requested review (code owner)
@hongpeng-guo - awaiting requested review (code owner)
@raulchen - awaiting requested review (code owner)

At least 1 approving review is required to merge this pull request.

Assignees

@amogkam

Labels
None yet

Projects
None yet

Development

Successfully merging this pull request may close these issues.

[train] add example for pytorch model parallelism

3 participants: @matthewdeng, @amogkam, @aslonnie
