The focus for Megatron Core MoE in Q3-Q4 2025 is comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
🎉 This roadmap is based on the dev branch; please see the details in its README.
Model Support
- ✅ DeepSeek
- ✅ DeepSeek-V2
- ✅ DeepSeek-V3, including MTP
- 🚧 DeepSeek-V3.2, WIP
- ✅ Qwen
- ✅ Qwen2-57B-A14B
- ✅ Qwen3-235B-A22B
- ✅ (🚀 New!) Qwen3-Next
- ✅ Mixtral
- ✅ Mixtral-8x7B
- ✅ Mixtral-8x22B
Core MoE Functionality
- ✅ Token-dropless MoE - advanced routing without dropping tokens
- ✅ Top-K router with flexible K selection
- ✅ Auxiliary load-balancing losses to keep expert utilization balanced
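To make the routing items above concrete, here is a minimal NumPy sketch of top-K routing with a Switch-Transformer-style auxiliary load-balancing loss. This is an illustration only, not Megatron Core's actual PyTorch implementation; the function name `topk_route` and the shapes are hypothetical.

```python
import numpy as np

def topk_route(logits, k):
    """Select top-k experts per token and compute an auxiliary
    load-balancing loss (Switch-Transformer style sketch)."""
    num_tokens, num_experts = logits.shape
    # Softmax over experts to get router probabilities
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Top-k expert ids per token (k distinct experts each)
    topk_idx = np.argsort(-probs, axis=1)[:, :k]
    # f_i: fraction of token-expert assignments dispatched to expert i
    dispatch = np.zeros(num_experts)
    for row in topk_idx:
        dispatch[row] += 1
    f = dispatch / (num_tokens * k)
    # P_i: mean router probability mass on expert i
    p = probs.mean(axis=0)
    # Minimized when both f and P are uniform across experts
    aux_loss = num_experts * np.sum(f * p)
    return topk_idx, aux_loss

rng = np.random.default_rng(0)
idx, loss = topk_route(rng.normal(size=(16, 8)), k=2)
```

With random logits the loss sits near 1.0; a router that collapses onto a few experts drives it higher, which is what the training loss penalizes.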
Advanced Parallelism
- ✅ Expert Parallelism (EP) with 3D parallelism integration
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallelism (CP) for long-sequence MoE training
- ✅ Parallel Folding: heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (🚀 New!) Megatron FSDP with full expert parallel support
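As a toy illustration of how these dimensions compose, the sketch below factors a fixed GPU count across TP/PP/CP and treats expert parallelism as a subdivision of the data-parallel dimension for expert layers. The helper is a simplified assumption for illustration; the authoritative sizing logic lives in Megatron Core's parallel-state initialization and differs in detail across versions.

```python
# Hypothetical helper: data-parallel size is what remains of the world
# size after TP, PP, and CP are carved out.
def data_parallel_size(world_size, tp, pp, cp):
    assert world_size % (tp * pp * cp) == 0, "dims must divide world size"
    return world_size // (tp * pp * cp)

# Example: 1024 GPUs with TP=2, PP=8, CP=1 leaves 64 DP replicas.
dp = data_parallel_size(world_size=1024, tp=2, pp=8, cp=1)

# For expert layers, EP groups are folded into the DP dimension
# (simplified view): with EP=8, the 64 DP ranks form 8-way EP groups.
ep = 8
assert dp % ep == 0
```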
Optimizations
- ✅ Memory-efficient token permutation
- ✅ Fine-grained recomputation (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and gradient accumulation fusion
- ✅ DP/PP/TP/EP communication overlapping
- ✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused attention and FlashAttention integration
- ✅ (🚀 New!) 1F1B EP A2A Overlap - hiding expert-parallel communication with the 1F1B pipeline schedule
- ✅ (🚀 New!) Muon and layer-wise distributed optimizer
- ✅ (🚀 New!) Pipeline-aware fine-grained activation offloading
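The token-permutation idea behind several of these optimizations can be sketched in a few lines: sort tokens by their assigned expert so each expert's GEMM reads a contiguous slice, and keep the inverse permutation to restore the original order afterwards. This is a NumPy toy, not the fused CUDA kernels Megatron Core actually uses; `permute_by_expert` is a hypothetical name.

```python
import numpy as np

def permute_by_expert(tokens, expert_ids):
    """Group tokens contiguously by assigned expert and return the
    per-expert token counts plus the inverse permutation."""
    order = np.argsort(expert_ids, kind="stable")  # stable keeps token order per expert
    permuted = tokens[order]
    counts = np.bincount(expert_ids, minlength=expert_ids.max() + 1)
    inverse = np.argsort(order)                    # maps permuted rows back
    return permuted, counts, inverse

tokens = np.arange(6, dtype=np.float32).reshape(6, 1)
expert_ids = np.array([2, 0, 1, 0, 2, 1])
permuted, counts, inverse = permute_by_expert(tokens, expert_ids)
restored = permuted[inverse]   # identical to the original token order
```

The `counts` array is exactly what a grouped GEMM needs to know where each expert's slice begins and ends.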
Precision Support
- ✅ GroupedGEMM, including FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states
- ✅ Full FP8 training support
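For intuition, FP8 training hinges on scaled casting: pick a per-tensor scale so the largest magnitude lands inside the FP8 representable range (E4M3 tops out around 448), cast, and dequantize with the same scale. The sketch below only illustrates that scale-then-cast idea with a coarse rounding stand-in; real FP8 training in Megatron Core goes through Transformer Engine kernels, and `fp8_scale_cast` is a hypothetical name.

```python
import numpy as np

E4M3_MAX = 448.0  # approximate max finite value of the FP8 E4M3 format

def fp8_scale_cast(x):
    """Compute a per-tensor scale mapping the largest value to the FP8
    range, then quantize. np.round stands in for the real FP8 cast
    (sign, 4 exponent bits, 3 mantissa bits)."""
    amax = np.abs(x).max()
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return np.round(x * scale), scale

def fp8_dequant(x_fp8, scale):
    return x_fp8 / scale

x = np.array([0.01, -0.5, 2.0])
q, s = fp8_scale_cast(x)       # s = 448 / 2.0 = 224.0
x_hat = fp8_dequant(q, s)      # close to x; small values lose precision
```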
Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200
- ✅ (🚀 New!) HybridEP for GB200
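At its core, expert-parallel dispatch is an all-to-all exchange: each rank buckets its routed tokens by destination rank and the buffers are swapped. The pure-Python transpose below shows just the data-movement pattern that libraries like DeepEP and HybridEP implement with fused, hardware-aware kernels; the ranks, expert placement, and token labels are invented for illustration.

```python
def all_to_all(send_buffers):
    """Simulated all-to-all: send_buffers[i][j] holds what rank i sends
    to rank j; the result recv[j][i] is what rank j receives from rank i."""
    n = len(send_buffers)
    return [[send_buffers[i][j] for i in range(n)] for j in range(n)]

# 2 EP ranks, 4 experts sharded 2 per rank (e0,e1 on rank 0; e2,e3 on rank 1).
send = [
    [["t0->e0"], ["t1->e3"]],  # rank 0: t0 stays local, t1 goes to rank 1
    [["t2->e1"], ["t3->e2"]],  # rank 1: t2 goes to rank 0, t3 stays local
]
recv = all_to_all(send)
# recv[0] holds rank 0's tokens for its local experts e0/e1, and likewise for rank 1.
```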
Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore-to-HF converter for ecosystem compatibility via megatron-bridge
- ✅ Distributed checkpointing support
- ✅ Runtime upcycling support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
Next Release Roadmap (MCore v0.16)
Performance & Memory Enhancements
- 🚀 Support placing MTP layers in standalone pipeline stages
- 🚀 Fused linear and cross-entropy operations
- CUDA Graph support for FP8 primary weights
Advanced Functionality
- 🚀 Enhanced cuda_graph_scope for MoE and Mamba
  1. More fine-grained graph scopes, such as the MoE router and dispatch preprocessing
  2. A minimally intrusive implementation
- MuonClip support (non-split version)
- Context parallel support for the eager attention implementation
- CUDA Graph support with 1F1B EP A2A overlapping
- Exclude padding tokens from the MoE routing loss calculation
- Revive FP16 training
- Router replay support for RL training
- NVFP4 MoE support with proper padding
Communication Optimization
- HybridEP Kernel Optimizations
- HybridEP for NVL8+IB
Bug Fix
- Tokenizer compatibility fix for the DeepSeek and Qwen HF tokenizers
Ongoing Long-term Features
- E2E Performance optimization for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Sync-free and full-iteration CUDA Graph MoE training
  - Targeting dropless MoE
- Device initiated HybridEP and GroupedGEMM
- MoE ECHO Dispatcher
- CPU Overhead Optimizations for Blackwell Performance
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- Dynamic Context Parallel for Imbalanced Long-Sequence Training
- Megatron FSDP Performance Optimization for MoE Training
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani