The focus for Megatron Core MoE in Q3-Q4 2025 is comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
🎉 This roadmap is based on the dev branch; please see the details in its README.
Model Support
- ✅ DeepSeek
- ✅ DeepSeek-V2
- ✅ DeepSeek-V3, including MTP
- 🚧 DeepSeek-V3.2, WIP
- ✅ Qwen
- ✅ Qwen2-57B-A14B
- ✅ Qwen3-235B-A22B
- ✅ (🚀 New!) Qwen3-Next
- ✅ Mixtral
- ✅ Mixtral-8x7B
- ✅ Mixtral-8x22B
Core MoE Functionality
- ✅ Token-dropless MoE - advanced routing without dropping tokens
- ✅ Top-K router with flexible K selection
- ✅ Auxiliary load-balancing losses to keep expert utilization balanced
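To make the routing items above concrete, here is a minimal NumPy sketch of top-K routing with a Switch-Transformer-style auxiliary load-balancing loss. This is an illustration only, not Megatron Core's actual PyTorch implementation; the function name `topk_route` and the shapes are hypothetical.

```python
import numpy as np

def topk_route(logits, k):
    """Select top-k experts per token and compute an auxiliary
    load-balancing loss (Switch-Transformer style sketch)."""
    num_tokens, num_experts = logits.shape
    # Softmax over experts to get router probabilities
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Top-k expert ids per token (k distinct experts each)
    topk_idx = np.argsort(-probs, axis=1)[:, :k]
    # f_i: fraction of token-expert assignments dispatched to expert i
    dispatch = np.zeros(num_experts)
    for row in topk_idx:
        dispatch[row] += 1
    f = dispatch / (num_tokens * k)
    # P_i: mean router probability mass on expert i
    p = probs.mean(axis=0)
    # Minimized when both f and P are uniform across experts
    aux_loss = num_experts * np.sum(f * p)
    return topk_idx, aux_loss

rng = np.random.default_rng(0)
idx, loss = topk_route(rng.normal(size=(16, 8)), k=2)
```

With random logits the loss sits near 1.0; a router that collapses onto a few experts drives it higher, which is what the training loss penalizes.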
Advanced Parallelism
- ✅ Expert Parallelism (EP) with 3D parallelism integration
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallelism (CP) for long-sequence MoE training
- ✅ Parallel Folding: heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (🚀 New!) Megatron FSDP with full expert parallel support
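As a toy illustration of how these dimensions compose, the sketch below factors a fixed GPU count across TP/PP/CP and treats expert parallelism as a subdivision of the data-parallel dimension for expert layers. The helper is a simplified assumption for illustration; the authoritative sizing logic lives in Megatron Core's parallel-state initialization and differs in detail across versions.

```python
# Hypothetical helper: data-parallel size is what remains of the world
# size after TP, PP, and CP are carved out.
def data_parallel_size(world_size, tp, pp, cp):
    assert world_size % (tp * pp * cp) == 0, "dims must divide world size"
    return world_size // (tp * pp * cp)

# Example: 1024 GPUs with TP=2, PP=8, CP=1 leaves 64 DP replicas.
dp = data_parallel_size(world_size=1024, tp=2, pp=8, cp=1)

# For expert layers, EP groups are folded into the DP dimension
# (simplified view): with EP=8, the 64 DP ranks form 8-way EP groups.
ep = 8
assert dp % ep == 0
```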
Optimizations
- ✅ Memory-efficient token permutation
- ✅ Fine-grained recomputation (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and gradient accumulation fusion
- ✅ DP/PP/TP/EP communication overlapping
- ✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused attention and FlashAttention integration
- ✅ (🚀 New!) 1F1B EP A2A Overlap - hiding expert-parallel communication with the 1F1B pipeline schedule
- ✅ (🚀 New!) Muon and layer-wise distributed optimizer
- ✅ (🚀 New!) Pipeline-aware fine-grained activation offloading
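The token-permutation idea behind several of these optimizations can be sketched in a few lines: sort tokens by their assigned expert so each expert's GEMM reads a contiguous slice, and keep the inverse permutation to restore the original order afterwards. This is a NumPy toy, not the fused CUDA kernels Megatron Core actually uses; `permute_by_expert` is a hypothetical name.

```python
import numpy as np

def permute_by_expert(tokens, expert_ids):
    """Group tokens contiguously by assigned expert and return the
    per-expert token counts plus the inverse permutation."""
    order = np.argsort(expert_ids, kind="stable")  # stable keeps token order per expert
    permuted = tokens[order]
    counts = np.bincount(expert_ids, minlength=expert_ids.max() + 1)
    inverse = np.argsort(order)                    # maps permuted rows back
    return permuted, counts, inverse

tokens = np.arange(6, dtype=np.float32).reshape(6, 1)
expert_ids = np.array([2, 0, 1, 0, 2, 1])
permuted, counts, inverse = permute_by_expert(tokens, expert_ids)
restored = permuted[inverse]   # identical to the original token order
```

The `counts` array is exactly what a grouped GEMM needs to know where each expert's slice begins and ends.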
Precision Support
- ✅ GroupedGEMM, including FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states
- ✅ Full FP8 training support
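For intuition, FP8 training hinges on scaled casting: pick a per-tensor scale so the largest magnitude lands inside the FP8 representable range (E4M3 tops out around 448), cast, and dequantize with the same scale. The sketch below only illustrates that scale-then-cast idea with a coarse rounding stand-in; real FP8 training in Megatron Core goes through Transformer Engine kernels, and `fp8_scale_cast` is a hypothetical name.

```python
import numpy as np

E4M3_MAX = 448.0  # approximate max finite value of the FP8 E4M3 format

def fp8_scale_cast(x):
    """Compute a per-tensor scale mapping the largest value to the FP8
    range, then quantize. np.round stands in for the real FP8 cast
    (sign, 4 exponent bits, 3 mantissa bits)."""
    amax = np.abs(x).max()
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return np.round(x * scale), scale

def fp8_dequant(x_fp8, scale):
    return x_fp8 / scale

x = np.array([0.01, -0.5, 2.0])
q, s = fp8_scale_cast(x)       # s = 448 / 2.0 = 224.0
x_hat = fp8_dequant(q, s)      # close to x; small values lose precision
```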
Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200
- ✅ (🚀 New!) HybridEP for GB200
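At its core, expert-parallel dispatch is an all-to-all exchange: each rank buckets its routed tokens by destination rank and the buffers are swapped. The pure-Python transpose below shows just the data-movement pattern that libraries like DeepEP and HybridEP implement with fused, hardware-aware kernels; the ranks, expert placement, and token labels are invented for illustration.

```python
def all_to_all(send_buffers):
    """Simulated all-to-all: send_buffers[i][j] holds what rank i sends
    to rank j; the result recv[j][i] is what rank j receives from rank i."""
    n = len(send_buffers)
    return [[send_buffers[i][j] for i in range(n)] for j in range(n)]

# 2 EP ranks, 4 experts sharded 2 per rank (e0,e1 on rank 0; e2,e3 on rank 1).
send = [
    [["t0->e0"], ["t1->e3"]],  # rank 0: t0 stays local, t1 goes to rank 1
    [["t2->e1"], ["t3->e2"]],  # rank 1: t2 goes to rank 0, t3 stays local
]
recv = all_to_all(send)
# recv[0] holds rank 0's tokens for its local experts e0/e1, and likewise for rank 1.
```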
Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore-to-HF converter for ecosystem compatibility via megatron-bridge
- ✅ Distributed checkpointing support
- ✅ Runtime upcycling support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
Next Release Roadmap (MCore v0.16)
Performance & Memory Enhancements
- 🚀 Support placing MTP layers in standalone pipeline stages
- 🚀 Fused linear and cross-entropy operations
- CUDA Graph support for FP8 primary weights
Advanced Functionality
- 🚀 Enhanced cuda_graph_scope for MoE and Mamba
  1. More fine-grained graph scopes, such as the MoE router and dispatch preprocessing
  2. A minimally intrusive implementation
- MuonClip support (non-split version)
- Context parallel support for the eager attention implementation
- CUDA Graph support with 1F1B EP A2A overlapping
- Exclude padding tokens from the MoE routing loss calculation
- Revive FP16 training
- Router replay support for RL training
- NVFP4 MoE support with proper padding
Communication Optimization
- HybridEP Kernel Optimizations
- HybridEP for NVL8+IB
Bug Fix
- Tokenizer compatibility fix for the DeepSeek and Qwen HF tokenizers
Ongoing Long-term Features
- E2E Performance optimization for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Sync-free and full-iteration CUDA Graph MoE training
  - Targeting dropless MoE
- Device initiated HybridEP and GroupedGEMM
- MoE ECHO Dispatcher
- CPU Overhead Optimizations for Blackwell Performance
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- Dynamic Context Parallel for Imbalanced Long-Sequence Training
- Megatron FSDP Performance Optimization for MoE Training
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani