distributed-training

Here are 209 public repositories matching this topic...

Repositories by language: All 209, Python 133, Jupyter Notebook 41, C++ 10, Shell 6, Go 5, HTML 2, Cuda 1, HCL 1, JavaScript 1, Scala 1

Sorted by: Most stars

GokuMohandas/Made-With-ML

Stars: 43.6k

Learn how to design, develop, deploy and iterate on production-grade ML applications.

python data-science machine-learning natural-language-processing deep-learning pytorch data-engineering ray data-quality distributed-training mlops distributed-ml llms

Updated Aug 18, 2024
Jupyter Notebook

huggingface/pytorch-image-models

Stars: 35.5k

The largest collection of PyTorch image encoders / backbones, including train, eval, inference, and export scripts and pretrained weights: ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more.

optimizer pytorch imagenet image-classification resnet pretrained-models mixnet pretrained-weights distributed-training mobilenet-v2 mobile-deep-learning mobilenetv3 efficientnet augmix randaugment nfnets normalization-free-training vision-transformer-models convnext maxvit

Updated Oct 12, 2025
Python

PaddlePaddle/Paddle

Stars: 23.3k

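PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle, "飞桨": high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)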

python machine-learning deep-learning neural-network scalability efficiency paddlepaddle distributed-training

Updated Oct 13, 2025
C++

PaddlePaddle/PaddleNLP

Stars: 12.8k

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

nlp search-engine compression sentiment-analysis transformers information-extraction question-answering llama pretrained-models embedding bert semantic-analysis distributed-training ernie neural-search uie document-intelligence paddlenlp llm

Updated Oct 13, 2025
Python

Netflix/metaflow

Stars: 9.6k

Build, Manage and Deploy AI/ML Systems

python kubernetes aws machine-learning ai azure gcp ml datascience high-performance-computing agents model-management cost-optimization distributed-training ml-infrastructure mlops ml-platform llm generative-ai llmops

Updated Oct 10, 2025
Python

skypilot-org/skypilot

Stars: 8.8k

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem).

data-science machine-learning deep-learning gpu job-scheduler cloud-management spot-instances cloud-computing job-queue hyperparameter-tuning cost-optimization distributed-training multicloud ml-infrastructure tpu cost-management finops ml-platform llm-serving llm-training

Updated Oct 13, 2025
Python

IDEA-CCNL/Fengshenbang-LM

Stars: 4.1k

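Fengshenbang-LM (封神榜大模型) is an open-source system of large models led by the Cognitive Computing and Natural Language Research Center of the IDEA research institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.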

transformers pytorch chinese-nlp pretrained-models distributed-training multimodal aigc

Updated Aug 13, 2024
Python

FedML-AI/FedML

Stars: 3.9k

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

machine-learning deep-learning inference-engine model-deployment model-serving distributed-training federated-learning mlops edge-ai ai-agent on-device-training

Updated Aug 11, 2025
Python

bytedance/byteps

Stars: 3.7k

A high-performance and generic framework for distributed DNN training

machine-learning deep-learning mxnet tensorflow keras pytorch distributed-training

Updated Oct 3, 2023
Python

tensorflow/adanet

Stars: 3.5k

Fast and flexible AutoML with learning guarantees.

python machine-learning deep-learning tensorflow gpu ensemble automl learning-theory neural-architecture-search distributed-training tpu

Updated Nov 30, 2023
Jupyter Notebook

determined-ai/determined

Stars: 3.2k

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

kubernetes data-science machine-learning deep-learning tensorflow keras pytorch hyperparameter-optimization hyperparameter-tuning hyperparameter-search distributed-training ml-infrastructure mlops ml-platform