
License

MIT license



ElasticDL: A Kubernetes-native Deep Learning Framework

ElasticDL is a Kubernetes-native deep learning framework that supports fault-tolerance and elastic scheduling.

Main Features

Elastic Scheduling and Fault-Tolerance

Through Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.

Support TensorFlow and PyTorch

TensorFlow Estimator
TensorFlow Keras
PyTorch

Minimalist Interface

Given a model defined with the Keras API, train the model in a distributed fashion with a single command.

elasticdl train \
  --image_name=elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --job_name=test-mnist \
  --volume="host_path=/data,mount_path=/data"
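If you launch jobs from a script rather than a shell, the same command can be assembled programmatically. The following is a minimal sketch: the flag names and values are copied from the command above, while the helper `build_train_command` is a hypothetical name introduced here for illustration and is not part of ElasticDL.

```python
import shlex

# Flags copied verbatim from the command above; no new options are introduced.
flags = {
    "image_name": "elasticdl:mnist",
    "model_zoo": "model_zoo",
    "model_def": "mnist.mnist_functional_api.custom_model",
    "training_data": "/data/mnist/train",
    "job_name": "test-mnist",
    "volume": "host_path=/data,mount_path=/data",
}


def build_train_command(flags):
    """Hypothetical helper: return the argv list for `elasticdl train`."""
    return ["elasticdl", "train"] + [f"--{k}={v}" for k, v in flags.items()]


command = build_train_command(flags)
print(shlex.join(command))  # shell-safe rendering of the argv list

# To actually submit the job (needs the elasticdl client and a cluster):
# import subprocess
# subprocess.run(command, check=True)
```

Passing an argv list to `subprocess.run` avoids shell quoting issues with the comma-separated `--volume` value.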

Quick Start

Please check out our step-by-step tutorial for running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such as Google Kubernetes Engine.

TensorFlow Estimator on MiniKube

TensorFlow Keras on MiniKube

PyTorch on MiniKube

Background

TensorFlow and PyTorch each have a native distributed computing feature that is fault-recoverable. If some processes fail, the distributed computing job fails; however, we can restart the job and recover its state from the most recent checkpoint files.

ElasticDL supports fault-tolerance during distributed training. If some processes fail, the job keeps running. Therefore, ElasticDL doesn't need to save checkpoints or recover from them.

Fault-tolerance lets ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.
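To make "continues with fewer resources" concrete, here is a toy simulation. It is an illustrative sketch, not ElasticDL's actual implementation: it assumes the job's work is split into small tasks held in a shared queue, and that a task held by a preempted worker is returned to the queue. The function `run_job` and the worker names are hypothetical.

```python
from collections import deque


def run_job(num_tasks, workers, preempt_at):
    """Simulate workers pulling tasks from a shared queue.

    preempt_at maps a step number to the worker killed during that step.
    Returns a dict of task id -> worker that completed the task.
    """
    todo = deque(range(num_tasks))
    alive = list(workers)
    done = {}
    step = 0
    while todo:
        if not alive:
            raise RuntimeError("no workers left to finish the job")
        # Hand at most one task to each live worker for this step.
        in_flight = {}
        for w in alive:
            if todo:
                in_flight[w] = todo.popleft()
        victim = preempt_at.get(step)
        if victim in in_flight:
            # The killed worker's unfinished task goes back to the queue.
            todo.append(in_flight.pop(victim))
        if victim in alive:
            alive.remove(victim)
        for w, task in in_flight.items():
            done[task] = w
        step += 1
    return done


# Worker w1 is preempted during step 1; the job still completes all 6 tasks.
done = run_job(6, ["w0", "w1", "w2"], preempt_at={1: "w1"})
print(sorted(done))
```

In this sketch the job fails only if every worker is gone while tasks remain; losing some workers just slows it down.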

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs, and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks. During this very long time, the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
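The arithmetic in this example can be checked with a few lines of Python. This is a sketch of the scenario described above only; the cluster size of 8 and the function name are assumptions chosen for illustration.

```python
from fractions import Fraction


def utilization_while_first_job_runs(total_gpus, elastic):
    """Fraction of GPUs busy while the one-GPU job is still running."""
    first_job_gpus = 1
    # Elastic: the second job starts on the remaining GPUs right away.
    # Non-elastic: it waits until all GPUs are free, so it uses none yet.
    second_job_gpus = total_gpus - first_job_gpus if elastic else 0
    return Fraction(first_job_gpus + second_job_gpus, total_gpus)


n = 8  # hypothetical cluster size
without_elastic = utilization_while_first_job_runs(n, elastic=False)
with_elastic = utilization_while_first_job_runs(n, elastic=True)
print(without_elastic, with_elastic)  # prints: 1/8 1
```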

ElasticDL's elastic scheduling comes from its Kubernetes-native design: it doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow/PyTorch programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers; it also watches events such as process/pod killing and reacts to them to realize fault-tolerance.
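The master's role described above can be pictured as a small event-driven loop. The sketch below is an assumption-laden illustration, not ElasticDL's code: the `Master` class is hypothetical, the `start_pod` callable stands in for a real Kubernetes API call, and events are plain tuples. The event type names `ADDED` and `DELETED` mirror those used by Kubernetes watch streams.

```python
class Master:
    """Toy master that tracks worker pods from a stream of events."""

    def __init__(self, start_pod):
        # start_pod is a stand-in for a call to the Kubernetes API.
        self._start_pod = start_pod
        self.workers = set()

    def launch(self, count):
        """Ask the cluster to start `count` worker pods."""
        for i in range(count):
            name = f"worker-{i}"
            self._start_pod(name)
            self.workers.add(name)

    def handle(self, event):
        """Update the worker set; return True while the job can go on."""
        kind, pod = event
        if kind == "DELETED":
            self.workers.discard(pod)
        elif kind == "ADDED":
            self.workers.add(pod)
        return len(self.workers) > 0


started = []
master = Master(started.append)  # record pod names instead of calling a cluster
master.launch(3)

# Two workers are killed (for example by preemption), then one is added.
events = [("DELETED", "worker-1"), ("DELETED", "worker-2"), ("ADDED", "worker-3")]
states = [master.handle(e) for e in events]
print(sorted(master.workers), states)
```

Because the master owns the worker set, a pod deletion only shrinks the set instead of ending the job.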

In short, ElasticDL enhances TensorFlow/PyTorch with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require a specific computing platform like Kubernetes and allows TensorFlow to run on any platform.

Development Guide

Please refer to this document for the development guide.

About

Kubernetes-native Deep Learning Framework

elasticdl.org
