[ROADMAP][Updated on November 4] Megatron Core MoE Q3-Q4 2025 Roadmap #1729

Open
Assignees: @yanring

Description

The focus for Megatron Core MoE in Q3-Q4 2025 is comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This roadmap is tentative and subject to change.

🎉 This roadmap is based on the dev branch; please see the details in its README.

Model Support

  • DeepSeek
    • ✅ DeepSeek-V2
    • ✅ DeepSeek-V3, including MTP
    • 🚧 DeepSeek-V3.2, WIP
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • (🚀New!) Qwen3-Next
  • Mixtral
    • ✅ Mixtral-8x7B
    • ✅ Mixtral-8x22B

Core MoE Functionality

  • Token dropless MoE - Advanced routing without token dropping
  • Top-K Router with flexible K selection
  • Load-balancing auxiliary losses for even expert utilization (see the sketch after this list)
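
The routing and auxiliary-loss idea can be pictured with a small PyTorch sketch. This is an illustration only, not Megatron Core's implementation; the function name `topk_route` and the Switch-style loss formulation are assumptions made for the example.

```python
# Illustrative sketch only -- not Megatron Core's implementation.
# Shows the idea of a top-k router with a Switch-style auxiliary
# load-balancing loss; all names here are hypothetical.
import torch
import torch.nn.functional as F


def topk_route(hidden, router_weight, top_k, aux_loss_coeff=1e-2):
    """Route tokens of shape [num_tokens, hidden] to top_k experts."""
    logits = hidden @ router_weight                      # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)

    num_experts = probs.size(-1)
    # Fraction of tokens dispatched to each expert (dropless: all tokens kept).
    dispatch_mask = F.one_hot(topk_idx, num_experts).sum(dim=1).float()
    tokens_per_expert = dispatch_mask.mean(dim=0)        # f_i
    mean_probs = probs.mean(dim=0)                       # p_i
    # Auxiliary loss pushes f_i and p_i toward a uniform distribution over experts.
    aux_loss = aux_loss_coeff * num_experts * torch.sum(tokens_per_expert * mean_probs)

    return topk_probs, topk_idx, aux_loss


# Example: 8 tokens, hidden size 16, 4 experts, top-2 routing.
hidden = torch.randn(8, 16)
router_weight = torch.randn(16, 4)
probs, idx, aux = topk_route(hidden, router_weight, top_k=2)
```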

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support (see the sizing sketch after this list)
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • (🚀New!) Megatron FSDP with full expert parallel support
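
A minimal sketch of the kind of sizing check such a combination implies, under the simplifying assumption that expert parallelism is folded into the data-parallel dimension. The helper `check_parallel_sizes` is hypothetical and does not mirror Megatron Core's actual process-group construction (which also covers features such as expert tensor parallelism).

```python
# Minimal sizing sketch under simplified assumptions; Megatron Core's real
# process-group construction is more involved.
def check_parallel_sizes(world_size, tp, pp, cp, ep):
    """Sanity-check a TP/PP/CP/EP combination for a given world size."""
    assert world_size % (tp * pp * cp) == 0, "TP*PP*CP must divide world size"
    dp = world_size // (tp * pp * cp)          # data-parallel size
    # Assumption: expert parallelism is folded into the data-parallel
    # dimension, so EP must divide DP.
    assert dp % ep == 0, "EP must divide the data-parallel size"
    return {"dp": dp, "dp_for_experts": dp // ep, "ep": ep}


# Example: 512 GPUs with TP=2, PP=8, CP=1, EP=32.
print(check_parallel_sizes(512, tp=2, pp=8, cp=1, ep=32))
```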

Optimizations

  • Memory-efficient token permutation (see the permutation sketch after this list)
  • Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
  • GroupedGEMM and Gradient Accumulation Fusion
  • DP/PP/TP/EP Communication Overlapping
  • Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
  • cuDNN fused attention and FlashAttention integration
  • ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
  • (🚀New!) Muon and Layer-wise distributed optimizer
  • (🚀New!) Pipeline-aware fine-grained activation offloading
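
The idea behind token permutation is to sort tokens by their assigned expert so that each expert's GEMM reads a contiguous slice. The sketch below shows the unfused, top-1 version of that idea with hypothetical helper names; Megatron Core's memory-efficient and fused variants go beyond this.

```python
# Sketch of token permutation for grouped expert computation; this is the
# unfused idea only, not Megatron Core's fused/memory-efficient kernels.
import torch


def permute_tokens(tokens, expert_idx, num_experts):
    """Sort tokens by assigned expert so each expert sees a contiguous block.

    tokens:     [num_tokens, hidden]
    expert_idx: [num_tokens] expert id per token (top-1 shown for simplicity)
    """
    sort_order = torch.argsort(expert_idx)                 # permutation
    permuted = tokens.index_select(0, sort_order)
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)
    return permuted, sort_order, tokens_per_expert


def unpermute_tokens(expert_output, sort_order):
    """Restore the original token order after the expert MLPs."""
    restored = torch.empty_like(expert_output)
    restored[sort_order] = expert_output
    return restored


tokens = torch.randn(8, 16)
expert_idx = torch.tensor([2, 0, 1, 2, 0, 3, 1, 2])
permuted, order, counts = permute_tokens(tokens, expert_idx, num_experts=4)
restored = unpermute_tokens(permuted, order)
assert torch.equal(restored, tokens)
```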

Precision Support

  • GroupedGEMM including FP8/MXFP8 support
  • FP8 weights with BF16 optimizer states (see the sketch after this list)
  • Full FP8 training support
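
A minimal sketch of the master-weight pattern behind "FP8 weights with BF16 optimizer states": the optimizer updates a BF16 main copy, while the model-facing weight is stored in FP8. Per-tensor scaling (handled by Transformer Engine in practice) is omitted, and `torch.float8_e4m3fn` requires a recent PyTorch; treat this as an assumption-laden illustration, not the actual recipe.

```python
# Simplified master-weight sketch; real FP8 training (e.g. via Transformer
# Engine) also tracks per-tensor scaling factors, which are omitted here.
import torch

bf16_main = torch.randn(1024, 1024, dtype=torch.bfloat16)   # optimizer-facing copy
fp8_weight = bf16_main.to(torch.float8_e4m3fn)              # low-precision model copy

# Optimizer state and the weight update stay in BF16 (SGD-style for brevity).
grad = torch.randn_like(bf16_main)
lr = 1e-3
bf16_main -= lr * grad

# Re-quantize the updated main weights back to FP8 for the next step.
fp8_weight = bf16_main.to(torch.float8_e4m3fn)
```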

Optimized Expert Parallel Communication Support

  • DeepEP support for H100 and B200 (see the dispatch sketch after this list)
  • (🚀New!) HybridEP for GB200
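
Libraries like DeepEP and HybridEP optimize the all-to-all token exchange at the heart of expert parallelism. The sketch below shows that exchange with plain `torch.distributed` collectives as an illustration; it assumes an already-initialized expert-parallel process group and is not the DeepEP or HybridEP API.

```python
# Illustration with plain torch.distributed collectives; DeepEP / HybridEP
# provide optimized kernels for this exchange. Assumes an initialized
# expert-parallel process group `ep_group` (e.g. NCCL) and tensors on the
# correct device.
import torch
import torch.distributed as dist


def dispatch_tokens(local_tokens, send_counts, ep_group):
    """Exchange tokens across the expert-parallel group.

    local_tokens: [num_local_tokens, hidden], already permuted so the tokens
                  destined for EP rank r form the r-th contiguous chunk.
    send_counts:  list[int], one entry per EP rank, summing to num_local_tokens.
    """
    # Exchange the per-rank counts so every rank knows how much it will receive.
    send = torch.tensor(send_counts, dtype=torch.long, device=local_tokens.device)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=ep_group)
    recv_counts = recv.tolist()

    # Exchange the tokens themselves with unequal splits along dim 0.
    output = local_tokens.new_empty(sum(recv_counts), local_tokens.size(1))
    dist.all_to_all_single(
        output,
        local_tokens,
        output_split_sizes=recv_counts,
        input_split_sizes=send_counts,
        group=ep_group,
    )
    return output, recv_counts
```

The combine step after the expert MLPs is the same collective with the send and receive split sizes swapped.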

Developer Experience

  • MoE Model Zoo with pre-training best practices
  • MCore2HF Converter for ecosystem compatibility in megatron-bridge
  • Distributed Checkpointing Support
  • Runtime Upcycling Support for efficient model scaling
  • Layer-wise logging for detailed monitoring (a minimal hook-based sketch follows)
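
One way to picture layer-wise logging is with standard PyTorch forward hooks. The sketch below is a hypothetical, minimal version and not Megatron Core's logging utility.

```python
# Minimal hook-based sketch of layer-wise logging; Megatron Core's actual
# layer-wise logging utilities are richer than this.
import torch
import torch.nn as nn


def attach_layerwise_logging(model: nn.Module, log_fn=print):
    """Log per-layer output norms on every forward pass."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:   # only hook leaf modules
            continue

        def make_hook(layer_name):
            def hook(mod, inputs, output):
                if isinstance(output, torch.Tensor):
                    log_fn(f"{layer_name}: output_norm={output.norm().item():.4f}")
            return hook

        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call .remove() on each handle to detach


model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
handles = attach_layerwise_logging(model)
_ = model(torch.randn(4, 16))
```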

Next Release Roadmap (MCore v0.16)

Performance & Memory Enhancements

  • 🚀Support placing MTP layers into standalone pipeline stages
  • 🚀Fused Linear and Cross Entropy operations (see the chunked sketch after this list)
  • CUDA Graph support for FP8 primary weights
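
The memory motivation for fusing the output linear layer with cross-entropy is that the full [tokens, vocab] logits tensor never has to be materialized at once. The chunked sketch below illustrates that motivation only; the roadmap item refers to an actual fused implementation, and the function name here is hypothetical.

```python
# Chunked sketch of the memory idea behind fusing the LM-head linear with
# cross-entropy. Note: under autograd each chunk's softmax statistics are
# still saved for backward; a real fused kernel avoids that as well.
import torch
import torch.nn.functional as F


def chunked_linear_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """hidden: [num_tokens, hidden], lm_head_weight: [vocab, hidden], labels: [num_tokens]."""
    losses = []
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        logits = F.linear(h, lm_head_weight)   # [chunk, vocab], never the full matrix
        losses.append(F.cross_entropy(logits, labels[start:start + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / hidden.size(0)


hidden = torch.randn(4096, 512)
lm_head = torch.randn(32000, 512)
labels = torch.randint(0, 32000, (4096,))
loss = chunked_linear_cross_entropy(hidden, lm_head, labels)
```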

Advanced Functionality

  • 🚀Enhanced cuda_graph_scope for MoE and Mamba
    1. More fine-grained graph scope like MoE router and dispatch preprocessing
    2. A minimally intrusive implementation
  • MuonClip support (non-split version)
  • Adding context parallel support to eager attention implementation
  • CUDA Graph support with 1F1B EP A2A overlapping
  • Remove padding tokens from the MoE routing loss calculation
  • Revive FP16 Training
  • Router replay support for RL training
  • Support NVFP4 MoE with proper padding

Communication Optimization

  • HybridEP Kernel Optimizations
  • HybridEP for NVL8+IB

Bug Fix

  • Tokenizer compatibility fix for DeepSeek and Qwen HF tokenizer

Ongoing Long-term Features

  • E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
  • Sync-Free and Full-Iter cudaGraph MoE Training
    • Targeting for dropless MoE
    • Device initiated HybridEP and GroupedGEMM
    • MoE ECHO Dispatcher
  • CPU Overhead Optimizations for Blackwell Performance
  • MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
  • Dynamic Context Parallel for Imbalanced Long-Sequence Training
  • Megatron FSDP Performance Optimization for MoE Training

Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
