
Distributed Data Parallel in PyTorch - Video Tutorials

Created On: Sep 27, 2022 | Last Updated: Nov 15, 2024 | Last Verified: Nov 05, 2024

Authors: Suraj Subramanian

Follow along with the video below or on YouTube.

This series of video tutorials walks you through distributed training in PyTorch via DDP.

The series starts with a simple non-distributed training job, and ends with deploying a training job across several machines in a cluster. Along the way, you will also learn about torchrun for fault-tolerant distributed training.
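As a rough preview of where the series ends up, the sketch below shows the bare DDP pattern that later sections flesh out: each process initializes a process group, wraps its model in DistributedDataParallel, and is launched with torchrun. This is a simplified illustration with a toy model and random data, not code from the tutorial repository.

```python
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun launches one process per GPU and sets LOCAL_RANK, RANK,
    # and WORLD_SIZE in the environment of each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model and random data stand in for a real training job.
    model = torch.nn.Linear(10, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(32, 10, device=local_rank)
    targets = torch.randn(32, 1, device=local_rank)

    for _ in range(5):
        optimizer.zero_grad()
        loss = F.mse_loss(model(inputs), targets)
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 ddp_preview.py` (the file name here is arbitrary), torchrun starts one process per GPU; the later sections of the series explain each of these pieces in detail.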

The tutorial assumes a basic familiarity with model training in PyTorch.

Running the code

You will need multiple CUDA GPUs to run the tutorial code. Typically, this can be done on a cloud instance with multiple GPUs (the tutorials use an Amazon EC2 P3 instance with 4 GPUs).
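If you are unsure whether your environment qualifies, a minimal sanity check like the one below (not part of the tutorial repository) reports how many GPUs PyTorch can see:

```python
import torch

# Verify that multiple CUDA GPUs are visible to PyTorch before running
# the distributed examples in this series.
num_gpus = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Visible GPUs: {num_gpus}")
for i in range(num_gpus):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

if num_gpus < 2:
    print("This tutorial series expects at least 2 GPUs (the examples use 4).")
```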

The tutorial code is hosted in this GitHub repo. Clone the repository and follow along!

Tutorial sections

  1. Introduction (this page)

  2. What is DDP? Gently introduces what DDP is doing under the hood

  3. Single-Node Multi-GPU Training: Training models using multiple GPUs on a single machine

  4. Fault-tolerant distributed training: Making your distributed training job robust with torchrun

  5. Multi-Node training: Training models using multiple GPUs on multiple machines

  6. Training a GPT model with DDP: “Real-world” example of training a minGPT model with DDP