How to perform distributed training on Amazon SageMaker using SageMaker's Distributed Data Parallel library and debug using Amazon SageMaker Debugger.
aws-samples/amazon-sagemaker-dist-data-parallel-with-debugger
Distributed training using Amazon SageMaker Distributed Data Parallel library and debugging using Amazon SageMaker Debugger
This repository contains an example for performing distributed training on Amazon SageMaker using SageMaker's Distributed Data Parallel library and debugging using Amazon SageMaker Debugger. The training scripts cover both zero-script-change and with-script-change scenarios for the Debugger.
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. With SageMaker, you can use the built-in algorithms or bring your own algorithms and frameworks. One such framework is TensorFlow 2.x. When performing distributed training with this framework, you can use SageMaker's Distributed Data Parallel or Distributed Model Parallel libraries. Amazon SageMaker Debugger debugs, monitors, and profiles training jobs in real time. This helps you detect non-converging conditions, optimize resource utilization by eliminating bottlenecks, shorten training time, and reduce the cost of training your machine learning models.
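To make the idea of data-parallel training concrete, here is a minimal, purely conceptual sketch in plain Python. It does not use or reproduce the SageMaker Distributed Data Parallel library. It only simulates the general pattern that such libraries are built around: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers, and every worker applies the same update. All function names, the toy dataset, and the hyperparameters are assumptions made for illustration.

```python
# Conceptual illustration only: simulates synchronous data-parallel SGD in
# plain Python. All names here are hypothetical and not part of any AWS API.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    return sum(values) / len(values)


def data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients, averaged, then a shared update."""
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)


def train(data, num_workers, lr=0.05, steps=200):
    # Strided, equal-sized shards; assumes len(data) is divisible by num_workers.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        w = data_parallel_step(w, shards, lr)
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # toy dataset: y = 3x
w_parallel = train(data, num_workers=4)  # four simulated workers
w_single = train(data, num_workers=1)    # single-worker baseline
print(w_parallel, w_single)
```

With equal-sized shards, the average of the per-worker gradients equals the full-batch gradient, so the simulated four-worker run follows the same trajectory as the single-worker run up to floating-point rounding.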
This example contains a Jupyter Notebook that demonstrates how to use a SageMaker-optimized TensorFlow 2.x container to perform distributed training on the Fashion MNIST dataset using the SageMaker Distributed Data Parallel library and to debug the training job using SageMaker Debugger. It also implements a custom training loop, i.e., it customizes what goes on in the fit() loop. Finally, the Debugger's output is analyzed. The notebook takes your training script and runs it on SageMaker in script mode.
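As a rough sketch of what launching such a job in script mode can look like, the snippet below assembles keyword arguments for the SageMaker Python SDK's `TensorFlow` estimator with the data parallel library enabled through the `distribution` argument. This is not the notebook's actual configuration: the helper name, the script name `train.py`, the role ARN, the instance type and count, and the framework and Python versions are all assumptions chosen for illustration. The estimator itself is left in comments so the snippet runs without AWS credentials.

```python
# Hedged sketch: builds estimator settings only. Every concrete value below is
# an illustrative assumption, not taken from this repository's notebook.

def build_estimator_kwargs(role, entry_point="train.py", instance_count=2):
    """Return keyword arguments for a script-mode TensorFlow estimator."""
    return {
        "entry_point": entry_point,         # hypothetical training script name
        "role": role,                       # IAM role that SageMaker assumes
        "instance_count": instance_count,
        "instance_type": "ml.p3.16xlarge",  # assumed multi-GPU instance type
        "framework_version": "2.4.1",       # assumed TensorFlow 2.x version
        "py_version": "py37",
        # Turns on the SageMaker Distributed Data Parallel library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = build_estimator_kwargs(
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole"  # placeholder ARN
)
print(kwargs["distribution"])

# With the SageMaker Python SDK installed and AWS credentials configured,
# the settings could be used along these lines:
#
#   from sagemaker.tensorflow import TensorFlow
#   estimator = TensorFlow(**kwargs)
#   estimator.fit()
```

Keeping the settings in a plain dictionary makes it easy to inspect or adjust them before an estimator is created and a billable training job is started.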
This repository contains:
- A Jupyter Notebook to get started
- A training script in Python for the zero-script-change scenario that is passed to the training job
- A training script in Python for the with-script-change scenario that is passed to the training job
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.