[train] add model (pipeline) parallelism example #22894
base: master
Conversation
Nice, this is a super cool example!
I know you already mentioned this as a TODO, but one thing that I think would be great is creating a directory containing the Python script, the YAML file you used, and a README explaining what the example does, how Ray Train helps, and how to deploy it (just copy from your PR description).
# Initialize process group and wrap model in DDP.
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model)
use train.torch.prepare_model() here?
Yeah, let me add that with move_to_device=False.
Actually this doesn't work:
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
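For reference, a minimal sketch of the workaround this thread converges on (wrap_pipeline_model is a hypothetical helper, assuming the process group has already been initialized by the Torch backend and the model is already a pipeline module whose stages live on several of this worker's GPUs): construct DistributedDataParallel without device_ids/output_device so it accepts a multi-device module.

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def wrap_pipeline_model(model):
    # Process group is assumed to be initialized already (e.g. by the training backend).
    assert dist.is_initialized()
    # No device_ids/output_device: the module's parameters span multiple CUDA
    # devices (one per pipeline stage), which DDP only supports in this form.
    return DistributedDataParallel(model)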
val_data = data_process(val_iter)
test_data = data_process(test_iter)
device = torch.device(num_gpus * local_rank)
can we use train.torch.get_device() here?
Ah, this actually doesn't work because the current implementation assumes device == local_rank. We need to expose a new API to get the list of devices for the multi-GPU case. I can create an issue to track this!
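A small sketch of the device layout this script assumes (worker_devices is a hypothetical helper, not part of the PR): each worker owns a contiguous block of num_gpus GPUs on its node, so its first pipeline stage lives on cuda:(num_gpus * local_rank) and the remaining stages on the GPUs that follow.

import torch

def worker_devices(local_rank, num_gpus):
    # Index of the first GPU assigned to this worker within the node.
    first = num_gpus * local_rank
    # One device per pipeline stage owned by this worker.
    return [torch.device("cuda", first + i) for i in range(num_gpus)]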
@@ -0,0 +1,352 @@
import argparse
Should we also add this here: https://docs.ray.io/en/latest/train/examples.html?
Definitely, I'm talking to @maxpumperla about converting this to a notebook before merging. With that, I will address the suggestions made in your main comment about including the README & YAML content.
@@ -0,0 +1,352 @@
import argparse
Can we also run a smoke test of this example in GPU CI?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
is this still relevant? should we close this?
Why are these changes needed?
This PR converts the following PyTorch example to run with Ray Train:
Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
The script is modified to support multi-node, a configurable number of workers, and a configurable number of GPUs per worker.
This is also added to the docs as an end-to-end example here.
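As a rough sketch of how the example is driven (the exact call signature depends on the Ray version; train_func, the config key, and the worker/GPU counts below are placeholders), a Ray Train Trainer is configured with a number of workers and a number of GPUs per worker:

from ray.train import Trainer

trainer = Trainer(
    backend="torch",
    num_workers=2,                     # configurable number of workers
    use_gpu=True,
    resources_per_worker={"GPU": 2},   # configurable GPUs per worker
)
trainer.start()
trainer.run(train_func, config={"num_gpus_per_worker": 2})
trainer.shutdown()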
Usage
1. Cluster config: pipeline_parallelism.yaml
2. Deploy cluster & set up SSH forwarding
3. Run script with Ray Client (a minimal connection sketch follows below)
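A minimal sketch of the Ray Client step, assuming the cluster's default client port 10001 has been forwarded to localhost via SSH:

import ray

# Connect to the remote cluster through the SSH-forwarded Ray Client port.
ray.init("ray://localhost:10001")
# The training driver (e.g. the script's main entry point) then runs against
# the remote cluster from the local machine.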
Related issue number
Closes #21508
Checks
I've run scripts/format.sh to lint the changes in this PR.