Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems

By Guanhua Wang
eBook | Apr 2022 | 284 pages | 1st Edition
Chapter 1: Splitting Input Data

In recent years, data has grown drastically in size. Taking the computer vision domain as an example, datasets such as MNIST and CIFAR-10/100 consist of only around 50-60k training images each, whereas more recent datasets such as ImageNet-1K contain over 1 million training images. However, a larger input data size leads to a much longer model training time on a single GPU/node. In the example mentioned previously, training a usable state-of-the-art model on the CIFAR-10/100 datasets with a single GPU takes only a couple of hours. When it comes to the ImageNet-1K dataset, however, single-GPU training takes days or even weeks.

The standard practice for speeding up the model training process is parallel execution, which is the main focus of this book. The most popular form of parallel model training is called data parallelism. In data parallel training, each GPU/node holds a full copy of the model, and the input data is partitioned into disjoint subsets, where each GPU/node is only responsible for training the model on one of the input partitions. Since each GPU only trains its local model on a subset (not the whole set) of the input data, we need to conduct a procedure called model synchronization periodically. Model synchronization is done to ensure that, after each training iteration, all the GPUs involved in the training job are on the same page. This guarantees that the model copies held on different GPUs have the same parameter values.
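To make the idea concrete, here is a minimal, single-process sketch of data parallelism using NumPy. The toy linear model, the number of simulated workers, and the hyperparameters are illustrative assumptions, not code from this book; the point is only to show the data partitioning and the periodic gradient aggregation that keeps all the model copies identical:

import numpy as np

# Toy dataset and model: linear regression with a mean-squared-error loss.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1024, 10)), rng.normal(size=(1024,))
num_workers = 4                      # simulated GPUs/nodes
w = np.zeros(10)                     # every worker holds the same model copy
lr = 0.1

# Partition the input data into disjoint subsets, one per worker.
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

for step in range(100):
    local_grads = []
    for X_i, y_i in shards:          # each worker computes gradients on its own shard
        err = X_i @ w - y_i
        local_grads.append(X_i.T @ err / len(X_i))
    g = np.mean(local_grads, axis=0) # model synchronization: aggregate the local gradients
    w -= lr * g                      # identical update applied to every model copy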

Data parallelism can also be applied at the model serving stage. Given that the fully-trained model may need to serve a large number of inference tasks, splitting the inference input data can reduce the end-to-end model serving time as well. One major difference compared to data parallel training is that in data parallel inference, all the GPUs/nodes involved in a single job do not need to communicate anymore, which means that the model synchronization phase during data parallel training is completely removed.
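As a rough illustration, the following sketch splits a batch of inference inputs across the available GPUs, with one full model replica per device and no communication between them. The function name and the use of a single process for all replicas are illustrative assumptions; a production serving system would typically run each replica in its own process:

import copy
import torch

def data_parallel_inference(model, inputs):
    if torch.cuda.is_available():
        devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    else:
        devices = [torch.device("cpu")]
    replicas = [copy.deepcopy(model).to(d).eval() for d in devices]  # full model copy per device
    chunks = torch.chunk(inputs, len(devices))                       # split the inference inputs
    outputs = []
    with torch.no_grad():
        for replica, chunk, device in zip(replicas, chunks, devices):
            outputs.append(replica(chunk.to(device)).cpu())          # no synchronization needed
    return torch.cat(outputs)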

This chapter will discuss the bottleneck of model training with large datasets and how data parallelism mitigates this.

The following topics will be covered in this chapter:

  • Single-node training is too slow
  • Data parallelism – the high-level bits
  • Hyperparameter tuning

Single-node training is too slow

The vanilla model training process is to load both the training data and the ML model into the same accelerator (for example, a GPU), which is called single-node training. Single-node training mainly consists of three steps:

  1. Input pre-processing
  2. Training
  3. Validation

The following diagram shows what this looks like in a typical model training workflow:

Figure 1.1 – Model training workflow on a single node

As you can see, after input pre-processing, the augmented input data is loaded into the memory of the accelerator (such as a GPU). Following that, the model is trained on the loaded input data batches and validated iteratively. The goal of this section is to discuss why single-node training is way too slow. First, we will show the real bottleneck in single-node training and then describe how data parallelism mitigates this bottleneck.
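The following is a minimal PyTorch sketch of this single-node workflow: pre-process/augment the input, train on mini-batches, and then validate. The dataset, the tiny linear model, and the hyperparameters are illustrative placeholders rather than the exact setup used in this chapter's experiments:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. Input pre-processing: augmentation plus conversion to tensors.
transform = transforms.Compose([transforms.RandomHorizontalFlip(), transforms.ToTensor()])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# 2. Training: the model and each data batch live on the same single accelerator.
for epoch in range(2):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    # 3. Validation: iterate over a held-out loader under torch.no_grad() here.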

The mismatch between data loading bandwidth and model training bandwidth

Now, let's focus on the two kinds of bandwidth (BW) in this data pipeline, namely data loading bandwidth and model training bandwidth, as shown in the preceding diagram. Nowadays, we have more and more input data. Hence, we would ideally want the data loading bandwidth to be as large as possible (the wide gray arrow in the preceding diagram). However, due to the limited on-device memory of the GPUs or other accelerators, the real model training bandwidth is also limited (the narrow gray arrow in the preceding diagram).

Although it is generally believed that the larger input data size leads to a longer training time in single-node training, this is not true from the data flow perspective. From a system perspective, the mismatch between data loading bandwidth and model training bandwidth is the real issue. If we can match data loading bandwidth and model training bandwidth in single-node training, it is unnecessary to conduct in-parallel model training since distributed data processing will always introduce control overheads.

Real Bottleneck

A large input data size is not the fundamental cause of long training times on a single node. The mismatch between data loading bandwidth and model training bandwidth is the key issue.

Now that we know the reason behind the delay in single-node training when faced with large input data, let's move on to the next subtopic. Next, we will quantitatively show the training times of some classic deep learning models using standard datasets. This should help you understand why data parallel training is a must-have to deal with the mismatch between data loading bandwidth and model training bandwidth.

Single-node training time on popular datasets

Let's directly jump into training time analysis using a single GPU. We will use an NVIDIA Tesla M60 GPU as the accelerator. First, we will train both VGG-19 and ResNet-164 on the CIFAR-10 and CIFAR-100 datasets. The following diagram shows the corresponding total training time for reaching a model test accuracy over 91%:

Figure 1.2 – Model training time of a single node on the CIFAR-10/100 datasets

As we can see, the total training time of VGG-19 is around 2 hours for both the CIFAR-10 and CIFAR-100 datasets, while for ResNet-164, the total training time for both the CIFAR-10 and CIFAR-100 datasets is around 10 hours.

The standard model training time when using a single GPU on the CIFAR-10/100 datasets is neither particularly short nor long, which is acceptable. This is mainly because of the low image resolution: for the CIFAR-10/100 datasets, each image is only 32x32 pixels. Thus, the intermediate results generated during the model training stage are relatively small, since the activation matrices in the intermediate results are always smaller than 32x32. Since we generate smaller activations during training, for a given fixed hardware memory size we can train more input images at once. Consequently, we can achieve a higher model training bandwidth, which mitigates the mismatch between data loading bandwidth and model training bandwidth.

Now, let's look at a modern ML model training dataset, such as ImageNet-1K. We have maintained a similar training environment setup to what we had for our CIFAR-10/100 training jobs. The difference is that we are training the VGG-19 and ResNet-50 models. The following diagram shows the corresponding total training time with a single GPU setting:

Figure 1.3 – Model training time for a single node on the ImageNet-1K dataset

As we can see, the training time on a single GPU is unacceptable. It takes around 2 weeks to train a single model, such as VGG-19 or ResNet-50. The main reason for this much slower training speed on the ImageNet-1K dataset is the higher image resolution, which is now around 256x256. A higher image resolution means that each training image has a bigger memory footprint for storing its activations, so we can only train a smaller number of images at once. Thus, the gap between model training bandwidth and data loading bandwidth is larger. Furthermore, the training time can be even longer for wider and deeper models.
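A back-of-the-envelope calculation makes the effect of resolution clear. Assuming float32 activations and a single convolutional layer with 64 output channels (both purely illustrative numbers), the activation footprint of one layer grows with the square of the image side, which is what squeezes the per-GPU batch size:

bytes_per_float = 4      # float32
channels = 64            # assumed output channels of one conv layer

def activation_mb(resolution, batch_size):
    # memory needed for one layer's activations, in MB
    return batch_size * channels * resolution * resolution * bytes_per_float / 2**20

print(activation_mb(32, 128))    # CIFAR-10/100-sized inputs (32x32):     32.0 MB
print(activation_mb(256, 128))   # ImageNet-1K-sized inputs (256x256):  2048.0 MB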

For machine learning practitioners, the whole model updating cycle is way too long if we limit ourselves to using a single GPU. This long training time is amplified by the fact that we need to try multiple sets of hyperparameters to find the best training recipe.

Therefore, we need to adopt the data parallel training paradigm to mitigate this mismatch between data loading bandwidth and model training bandwidth.

Accelerating the training process with data parallelism

So far, we have discussed why data parallel training is a must-have due to the mismatch between data loading bandwidth and model training bandwidth. Before we dive into the details of how data parallel training works, let's look at the speed-ups that data parallelism can achieve over single-node training.

Let's take ResNet-50 training on the ImageNet-1K dataset as an example. By using a proper hyperparameter setup, the following diagram shows the normalized speedups over different GPU training baselines:

Figure 1.4 – Normalized speedups over a single GPU baseline

As we can see, we have tested the system throughput of the data parallel training process against a single GPU training baseline. By incorporating multiple GPUs into the same training job, we expand our model training bandwidth significantly with parallelism. Ideally, the extended model training bandwidth should scale linearly with the number of GPUs involved. However, due to the system control overheads and network communication introduced by data parallel training, we cannot achieve perfectly linear scaling.

However, even with the system overhead involved in data parallel training, the speed-up numbers are still significant compared to a single GPU training baseline. As depicted in the preceding diagram, by incorporating 8 GPUs into data parallel training, we can increase training throughput by more than 6x. With 16 GPUs involved in the same data parallel training job, the speed-up is even better, achieving nearly 12x higher throughput than the single GPU baseline. Let's convert these throughput speed-ups into training time: with data parallel training on 16 GPUs, we can reduce ResNet-50 training on the ImageNet-1K dataset from 14 days to around just 1-2 days.

In addition, this speed-up number can continue growing when we have more GPUs involved in the same data parallel training job. With state-of-the-art hardware such as NVIDIA's DGX-1 and DGX-2 machines, the training time of ResNet-50 on the ImageNet-1K dataset can be significantly reduced to less than 1 hour if we incorporate hundreds of GPUs into this data parallel model training job.

To conclude this section, single-node model training takes up a lot of time, which is mainly due to the mismatch problem between the data loading bandwidth and the model training bandwidth. By incorporating data parallelism, we can increase the model training bandwidth proportionally to the number of accelerators involved in the same training job.

Data parallelism – the high-level bits

So far, we have discussed the benefits of using data parallelism in machine learning model training, which can tremendously reduce the overall model training time. Now, we need to dive into some fundamental theories about how data parallel training works, such as stochastic gradient descent (SGD) and model synchronization. But before that, let's take a look at the system architecture for data parallel training and how it is different from single-node training.

The simplified workflow for data parallel training is depicted in the following diagram. We have omitted some technical details during the training phase as we are mainly concerned with the two bandwidths (that is, the data loading bandwidth and the model training bandwidth):

Figure 1.5 – Simplified workflow of data parallel training

As we can see, the main difference between single-node training and data parallel training is that we split the data loading bandwidth among multiple workers/GPUs (shown as blue arrows in the preceding diagram). Therefore, for each GPU involved in the data parallel training job, the difference between its local data loading bandwidth and model training bandwidth is much smaller compared to the single-node case.

At a high level, even though we cannot increase the model training bandwidth on each accelerator due to hardware limitations, we can split and balance the whole data loading bandwidth across multiple accelerators. This data loading bandwidth split is not only applicable to data parallel training; it can be directly adopted in the data parallel model serving stage.
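In PyTorch, this split of the data loading bandwidth is typically expressed with a DistributedSampler, which hands each rank a disjoint shard of the dataset. The sketch below assumes the process group has already been initialized and uses a placeholder dataset; the batch size and worker counts are illustrative:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())    # placeholder dataset

sampler = DistributedSampler(train_set,
                             num_replicas=dist.get_world_size(), # number of GPUs/processes
                             rank=dist.get_rank(),               # this worker's shard
                             shuffle=True)
loader = DataLoader(train_set, batch_size=128, sampler=sampler, num_workers=4)

for epoch in range(2):
    sampler.set_epoch(epoch)      # reshuffle the shards every epoch
    for x, y in loader:           # this rank only loads ~1/world_size of the data
        ...                       # local training step goes here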

Note

By decreasing the per-GPU data loading bandwidth, data parallel training mitigates the gap between data loading bandwidth and model training bandwidth on each GPU.

At this point, we should understand how data parallel training increases end-to-end throughput by splitting the data loading bandwidth across multiple accelerators. After each GPU receives its local batch of augmented input data, it conducts local model training and validation. Here, model validation in data parallel training is the same as in the single-node case (there are some small variations, which we will discuss later), so we mainly focus on the difference at the training stage (excluding validation).

As shown in the following diagram, in the case of a single node, we divide the model training stage into three steps: data loading, training, and model updating. As we mentioned in the Single-node training is too slow section, data loading is for loading new mini-batches of training data. Training is done to conduct forward and backward propagation through the model. Once we've generated gradients during backward propagation, we perform the third step; that is, updating the model parameters:

Figure 1.6 – The three steps in the model training stage
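Mapped onto PyTorch calls, one pass over the data looks roughly like the following sketch; the model, data loader, optimizer, and loss function are assumed to exist already (for example, the ones from the earlier single-node sketch):

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()
    for x, y in loader:                  # 1. data loading: fetch the next mini-batch
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)      # 2. training: forward pass...
        loss.backward()                  #    ...and backward pass to generate the gradients
        optimizer.step()                 # 3. model updating: apply the gradients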

Compared to the data parallel training stage, as shown in the following diagram, there are several major differences:

  • First, in data parallel training, different accelerators train on different batches of input data (for example, Partition 1 and Partition 2 in the following diagram). Consequently, none of the GPUs can see the full training data, so traditional gradient descent optimization cannot be applied here. Instead, we need a stochastic approximation of gradient descent, which can also be used in the single-node case. One popular stochastic approximation method is SGD. We will look at this in more detail in the next section.
  • Second, in data parallel training, besides the three steps included in single-node training, as shown in the preceding diagram, we have an additional step here called model synchronization, which is shown in the following diagram. Model synchronization is about collecting and aggregating the local gradients that have been generated by different nodes. We will learn more about model synchronization later in this book:
Figure 1.7 – Data parallelism procedures within the model training stage

In the next two sections, we will discuss the theoretical details of SGD and model synchronization.

Stochastic gradient descent

In this section, we will discuss why SGD is a must-have for data parallel training and how it works.

In theory, we can use traditional gradient descent (GD) for single-node training. It works as follows:

for i in dataset:
    g_all += g_i
w = w - a * g_all

First, we need to calculate the gradients from each data point of our training dataset, where g_i denotes the gradients calculated on the i-th training data point. The formal definition of g_i is as follows:

g_i = ∂L(w, x_i, y_i) / ∂w

Here, L is the training loss, (x_i, y_i) is the i-th training data point, and w denotes the model weights. Then, we sum up the gradients calculated on all the training data points (g_all += g_i) and do a single-step model update with w = w - a*g_all.

However, in data parallel training, each GPU can only see part of (not the full) training dataset, which makes it impossible to use traditional GD optimization since we cannot calculate g_all in this case. Thus, SGD is a must-have. In addition, SGD is also applicable to single-node training. SGD works as follows:

for i in dataset:
    w = w - a * g_i

Basically, instead of updating the model weights (w) after generating the gradients from all the training data, SGD allows for model weight updates using a single or a few training samples (for example, a mini-batch). With this relaxation of the model updating restriction, the workers in data parallel training can update their model weights using their local (not global) training samples.
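The following NumPy sketch shows this relaxation on the same toy linear model used earlier; the mini-batch size and learning rate are illustrative assumptions. Note that the weights are updated after every mini-batch, so g_all is never needed:

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1024, 10)), rng.normal(size=(1024,))
w, lr, batch = np.zeros(10), 0.1, 32

for start in range(0, len(X), batch):
    X_b, y_b = X[start:start + batch], y[start:start + batch]
    g_b = X_b.T @ (X_b @ w - y_b) / len(X_b)   # gradient from this mini-batch only
    w -= lr * g_b                              # immediate update, no g_all required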

GD versus SGD

In GD, we need to compute the gradients over all the training data and then update the model weights.

In SGD, we compute the gradients over a subset of all the training data and update the model weights.

However, since each worker updates its model weights based on its local training data, the model parameters of different workers can diverge after each training iteration. Therefore, we need to conduct model synchronization periodically to guarantee that all the workers are on the same page, meaning that they maintain the same model parameters after each training iteration.

Model synchronization

As we saw previously, in data parallel training, different workers train their local models using disjoint subsets of the total training data, so the trained model weights may differ. To force all the workers to have the same view of the model parameters, we need to conduct model synchronization.

Let's study this in a simple four-GPU setting, as shown in the following diagram:

Figure 1.8 – Model synchronization in a four-GPU setting

As we can see, we have four GPUs in a data parallel training job. Here, each GPU maintains a copy of the full ML model locally inside its on-device memory.

Let's assume that all the GPUs are initialized with the same model parameters, which is standard practice, by seeding their random initialization functions with the same fixed seed.

After the first training iteration, each GPU will generate its local gradients, g_i, where i refers to the i-th GPU. Given that they are training on different local training inputs, the gradients from different GPUs may all be different. To guarantee that all four GPUs have the same model updates, we need to conduct model synchronization before the model parameter updates:

Model synchronization does two things:

  1. Collects and sums up all the gradients from all the GPUs in use, as shown here: g_sum = g_1 + g_2 + g_3 + g_4 (in our four-GPU example).
  2. Broadcasts the aggregated gradients, g_sum, to all the GPUs.

Once the model synchronization steps have been completed, each GPU holds the aggregated gradients, g_sum, locally. Then, each GPU uses these aggregated gradients for its model update, which guarantees that the updated model parameters remain the same after this first data parallel training iteration.

Similarly, in the following training iterations, we conduct model synchronization after each GPU generates its local gradients. So, model synchronization guarantees that the model parameters remain the same after every training iteration in a particular data parallel training job.
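As a concrete sketch of what one synchronization step can look like, the snippet below uses torch.distributed to all-reduce every parameter's gradient after the backward pass. It assumes the process group is already initialized. Note that a single all-reduce both collects/sums the gradients and makes the result available on every rank, and that dividing by the world size (a common convention, not something required by the description above) turns the sum into an average:

import torch.distributed as dist

def synchronize_gradients(model):
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum g_i across all GPUs
            p.grad /= world_size                           # optional: average instead of sum

# Usage per iteration: loss.backward(); synchronize_gradients(model); optimizer.step()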

For real system implementations, this model synchronization mainly has two different variations: the parameter server architecture and the All-Reduce architecture, which we will discuss in detail in the next chapter.

So far, we have come across some of the key concepts in data parallel training jobs, such as SGD and model synchronization. Next, we will discuss some important hyperparameters related to data parallel training.

Hyperparameter tuning

In this section, we will focus on the hyperparameters that are closely related to data parallel training: the global batch size, learning rate adjustment, and the model synchronization scheme.

Let's discuss them one by one.

Notes on Hyperparameters

While some of these hyperparameters exist in the standard single-node training process, in data parallel training they may have new search dimensions and new correlations.

Global batch size

The global batch size refers to how many training samples are loaded into all the GPUs for training simultaneously. The counterpart of this concept in single-node training is the batch size or mini-batch size.

Selecting a proper global batch size is different from selecting a single node's batch size. In single-node training, we always set the batch size to be the maximum number of samples that can fit into the accelerator's memory without causing out-of-memory (OOM) issues. In data parallel training, given N GPUs, we cannot simply set the global batch size to N*Max(single_node), where Max(single_node) refers to the maximum batch size on a single GPU.

In data parallel training, this global batch size is the first hyperparameter we need to search or fine-tune. If the global batch size is too large, the training model may not converge. If the global batch size is too small, it is just a waste of distributed computational resources.
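The arithmetic behind this search is simple; the sketch below (with an illustrative GPU count, memory limit, and candidate values) shows how each candidate global batch size maps to a per-GPU batch size that must still fit into a single accelerator's memory:

world_size = 16          # number of GPUs in the data parallel job (assumed)
max_per_gpu = 256        # largest per-GPU batch that avoids OOM (assumed)

for global_batch in (512, 1024, 2048, 4096, 8192):
    per_gpu_batch = global_batch // world_size     # what each GPU loads per iteration
    fits = per_gpu_batch <= max_per_gpu
    print(f"global={global_batch:5d}  per-GPU={per_gpu_batch:4d}  fits in memory: {fits}")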

Learning rate adjustment

Since we have used a very large global batch size compared to single node training, we also need to adjust the learning rate accordingly.

Rule of Thumb Regarding Learning Rate Adjustment

The rule of thumb for determining the learning rate in data parallel training is to multiply the learning rate used in the single-node case by N if we use N GPUs to do the data parallel training together.

Recent research literature suggests that, for large-batch data parallel training, we should have a warmup stage at the very beginning of training. The warmup policy suggests that we start data parallel training with a relatively small learning rate and then gradually increase it over the first several epochs of training until it reaches a predefined peak learning rate, after which we stop increasing it.
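A minimal PyTorch sketch of the linear scaling rule combined with a warmup schedule is shown below; the placeholder model, base learning rate, GPU count, and warmup length are illustrative assumptions rather than a recommended recipe:

import torch

model = torch.nn.Linear(3 * 32 * 32, 10)      # placeholder model
base_lr, num_gpus = 0.1, 16
peak_lr = base_lr * num_gpus                  # rule of thumb: scale the single-node LR by N
warmup_epochs, total_epochs = 5, 90

optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr, momentum=0.9)

def warmup_factor(epoch):
    # ramp linearly toward the peak learning rate, then hold it
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for epoch in range(total_epochs):
    # ... one epoch of data parallel training goes here ...
    scheduler.step()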

Model synchronization schemes

Now that we have chosen our global batch size and adjusted the learning rate accordingly, the next thing we need to do is select an appropriate model synchronization scheme. We need this because we have to initialize a group of processes to run our data parallel training job in a distributed manner, where each process is responsible for handling model synchronization on one machine or one GPU.

Let's take PyTorch as an example. Here, you need to initialize your process group as follows:

torch.distributed.init_process_group(backend='nccl',
                                     init_method='...',
                                     world_size=N,
                                     timeout=M)

Here, the first parameter, backend='nccl', specifies the model synchronization backend. Right now, deep learning platforms such as PyTorch mainly support three different communication backends: NCCL, Gloo, and MPI.

The main differences among these three communication backends are as follows:

  • NCCL:
    • GPU only
    • No support for one-to-all communication primitives such as Scatter
    • No support for all-to-one communication primitives such as Gather
  • Gloo:
    • Mainly supports CPU; partial support for GPU.
    • For CPU, it supports most communication primitives.
    • For GPU, it only supports the most commonly used communication primitives, such as Broadcast and All-Reduce.
    • No support for all-to-all communication.
  • MPI:
    • CPU only
    • Supports special hardware communication, such as IP over InfiniBand

Among these three, the following are some high-level suggestions on selecting communication schemes:

  • For GPU clusters, use NCCL.
  • For CPU clusters, use Gloo first. If that doesn't work, try MPI.

With that, we have discussed the three main communication schemes we can use in data parallel training jobs. Since the nodes we use for model training are GPUs, we usually set NCCL as our default communication backend.
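Putting the suggestions together, a small initialization sketch might pick the backend based on whether GPUs are available. The env:// init method assumes the job is launched by a tool such as torchrun that sets the WORLD_SIZE and RANK environment variables; these details are assumptions, not requirements of the setup described above:

import os
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"   # NCCL for GPU clusters, Gloo for CPU
dist.init_process_group(backend=backend,
                        init_method="env://",
                        world_size=int(os.environ["WORLD_SIZE"]),
                        rank=int(os.environ["RANK"]))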

Summary

After reading this chapter, you should be able to explore and find the real bottleneck in single-node training. You should also know how data parallelism mitigates this bottleneck in single-node training, thus increasing the overall throughput. Finally, you should know about the several main hyperparameters related to data parallel training.

In the next chapter, we will focus on two major system architectures for data parallel training, namely the parameter server (PS) and All-Reduce paradigms.


Key benefits

  • Accelerate model training and inference with order-of-magnitude time reductions
  • Learn state-of-the-art parallel schemes for both model training and serving
  • A detailed study of bottlenecks at distributed model training and serving stages

Description

Reducing time cost in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time. You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments. By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.

Who is this book for?

This book is for data scientists, machine learning engineers, and ML practitioners in both academia and industry. A fundamental understanding of machine learning concepts and working knowledge of Python programming is assumed. Prior experience implementing ML/DL models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.

What you will learn

  • Deploy distributed model training and serving pipelines
  • Get to grips with the advanced features in TensorFlow and PyTorch
  • Mitigate system bottlenecks during in-parallel model training and serving
  • Discover the latest techniques on top of classical parallelism paradigm
  • Explore advanced features in Megatron-LM and Mesh-TensorFlow
  • Use state-of-the-art hardware such as NVLink, NVSwitch, and GPUs

Product Details

Publication date: Apr 29, 2022
Length: 284 pages
Edition: 1st
Language: English
ISBN-13: 9781801817219



Table of Contents

16 Chapters
Section 1 – Data Parallelism
Chapter 1: Splitting Input Data
  Single-node training is too slow
  Data parallelism – the high-level bits
  Hyperparameter tuning
  Summary
Chapter 2: Parameter Server and All-Reduce
  Technical requirements
  Parameter server architecture
  Implementing the parameter server
  Issues with the parameter server
  All-Reduce architecture
  Collective communication
  Summary
Chapter 3: Building a Data Parallel Training and Serving Pipeline
  Technical requirements
  The data parallel training pipeline in a nutshell
  Single-machine multi-GPUs and multi-machine multi-GPUs
  Checkpointing and fault tolerance
  Model evaluation and hyperparameter tuning
  Model serving in data parallelism
  Summary
Chapter 4: Bottlenecks and Solutions
  Communication bottlenecks in data parallel training
  Leveraging idle links and host resources
  On-device memory bottlenecks
  Recomputation and quantization
  Summary
Section 2 – Model Parallelism
Chapter 5: Splitting the Model
  Technical requirements
  Single-node training error – out of memory
  ELMo, BERT, and GPT
  Pre-training and fine-tuning
  State-of-the-art hardware
  Summary
Chapter 6: Pipeline Input and Layer Split
  Vanilla model parallelism is inefficient
  Pipeline input
  Pros and cons of pipeline parallelism
  Layer split
  Notes on intra-layer model parallelism
  Summary
Chapter 7: Implementing Model Parallel Training and Serving Workflows
  Technical requirements
  Wrapping up the whole model parallelism pipeline
  Fine-tuning transformers
  Hyperparameter tuning in model parallelism
  NLP model serving
  Summary
Chapter 8: Achieving Higher Throughput and Lower Latency
  Technical requirements
  Freezing layers
  Exploring memory and storage resources
  Understanding model decomposition and distillation
  Reducing bits in hardware
  Summary
Section 3 – Advanced Parallelism Paradigms
Chapter 9: A Hybrid of Data and Model Parallelism
  Technical requirements
  Case study of Megatron-LM
  Implementation of Megatron-LM
  Case study of Mesh-TensorFlow
  Implementation of Mesh-TensorFlow
  Pros and cons of Megatron-LM and Mesh-TensorFlow
  Summary
Chapter 10: Federated Learning and Edge Devices
  Technical requirements
  Sharing knowledge without sharing data
  Case study: TensorFlow Federated
  Running edge devices with TinyML
  Case study: TensorFlow Lite
  Summary
Chapter 11: Elastic Model Training and Serving
  Technical requirements
  Introducing adaptive model training
  Implementing adaptive model training in the cloud
  Elasticity in model inference
  Summary
Chapter 12: Advanced Techniques for Further Speed-Ups
  Technical requirements
  Debugging and performance analytics
  Job migration and multiplexing
  Model training in a heterogeneous environment
  Summary
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Share Your Thoughts


Customer reviews

Rating distribution: 4.3 out of 5 (14 ratings)
5 star: 78.6% · 4 star: 0% · 3 star: 7.1% · 2 star: 0% · 1 star: 14.3%
Baron C. – Jun 03, 2022
5 stars
This book covers an area that isn't taught much, and especially not in academia. Distributed ML is going to be how you get the performance you need. Python is naturally synchronous and this book teaches how to scale up ML to be asynchronous (a necessary addition to anyone's toolset). It also does a great job in covering the pros and cons of each approach. Understanding why you do something is paramount in tech as explaining tradeoffs is a critical part of the job. At a high level, this book covers data parallelism, model synchronization, parallel training, bottlenecks and solutions, pipeline parallelism, parallel serving, elastic model training, and various other ways to speed up the process. You get the picture from the 30k foot view and in great detail.
Amazon Verified review
Haoran YU – May 26, 2022
5 stars
An awesome book for ML engineers in cloud computing for practices of the machine learning algorithms on modern distributed computing platforms. I find this book a great source of information regarding for algorithm and system design! Highly Recommend!!
Amazon Verified review
@maxgoff – May 24, 2022
5 stars
Although distributed computing has become de rigueur in most modern web applications, the fact remains that most training and reference materials for ML/AI programming still focus on single node architectures. One undeniable trend is the growing girth of data required to train some of the most interesting models emerging today. In order to rapidly innovate and compete, distributed ML will become table stakes in the near future as we move forward. If you write ML/AI code, implement smart data pipelines, architect systems in order to scale or simply want to learn techniques beyond the common core ML/AI training available, this book is a must-have for your shelf. Wang covers a lot of territory and does so clearly with excellent examples. He also provides the technical foundation for the WHY. As more data and processing capabilities accumulate at the edge, the exponentially expanding universe of data processing demands distributed computing. Machine Learning must follow a distributed pattern if it is to continue to provide value. Wang's text provides a solid foundation and reference point. Distributed computing is awesome. We use distributed computing applications every day. Wang's text provides the lessons you will need to ensure that modern ML innovations will utilize resources with much greater productivity. Time is our most precious resource. Distributed Machine Learning with Python will save you LOTS of it!
Amazon Verified review
Kenan – May 14, 2022
5 stars
As someone who works with large NLP models everyday, I found this book extremely helpful in industry settings. Not only it provides detailed explanation on different parallel training techniques with clear and simple design-flow pictures, the book also contains code snippets and error messages. One thing I love most about this book is that it takes a very practical perspective. The discussion on outputs and errors with screenshots just makes the process of re-implementing those techniques so much easier for me! I would recommend this book to all researchers and young ML engineers 100%!
Amazon Verified review
Hitesh Hinduja – Aug 11, 2022
5 stars
Interesting book with a need of hour in today's age of data. Must read for all the distributed systems enthusiasts.
Amazon Verified review


About the author

Guanhua Wang
Guanhua Wang is a final-year Computer Science PhD student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the Machine Learning Systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has gained lots of attention from both academia and industry. He has been invited to give talks at top-tier universities (MIT, Stanford, CMU, Princeton) and big tech companies (Facebook/Meta, Microsoft). He received his master's degree from HKUST and his bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run half-marathons multiple times in the Bay Area of California.
