Megatron-MoE-ModelZoo
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
This guide provides detailed instructions, best practices, and optimized configurations for testing Mixtral, DeepSeek, and Qwen series models using the Megatron-Core framework to achieve optimal performance and reliability.
- DeepSeek-V3 best practices in a single command
- Currently includes H100, B200, and long-context configurations; a GB200 config is coming soon.
- Dockerfile: `dockers/Dockerfile` (see the container build sketch below)
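The repository does not spell out a build workflow here, but as a rough sketch, the provided Dockerfile can be built and converted into the `.sqsh` image the launch scripts expect. The image tag and the use of enroot are assumptions, not something this repo prescribes:

```bash
# Build the container image from the provided Dockerfile (tag name is hypothetical).
docker build -f dockers/Dockerfile -t megatron-moe:latest .

# Convert it to a .sqsh squashfs image for Slurm/pyxis clusters (assumes enroot is available).
enroot import -o /path/to/container/image.sqsh dockerd://megatron-moe:latest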
Please refer to `design_docs`.
Before entering the container, you need to install `yq` to process the `.yaml` configuration files.
Click here to view installation steps.
1. Create a local bin directory:
   ```bash
   mkdir -p ~/.local/bin
   ```
2. Download the `yq` executable:
   ```bash
   wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq
   ```
3. Make it executable:
   ```bash
   chmod +x ~/.local/bin/yq
   ```
4. Add the local bin directory to your `PATH` in `~/.bashrc`:
   ```bash
   export PATH="$HOME/.local/bin:$PATH"
   ```
5. Apply the changes:
   ```bash
   source ~/.bashrc
   ```
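To confirm the installation and get a feel for how `yq` reads the `.yaml` configuration files, a quick sanity check follows. The config path and key below are purely illustrative, not actual files or keys from this repository:

```bash
# Verify yq is on PATH and runs.
yq --version

# Example: read a single key from a YAML file (file and key are hypothetical).
yq '.training.lr' path/to/some_config.yaml
```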
Before running any scripts, you need to set up the following environment variables:
```bash
export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="/path/to/your/megatron/directory"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="/path/to/container/image.sqsh"
export CLUSTER="your_cluster_name"
```
- `WANDB_API_KEY`: Your Weights & Biases API key for experiment tracking. Get your key from wandb.ai/authorize.
- `MEGATRON_PATH`: Absolute path to your Megatron-MoE installation directory. Example: `path/to/Megatron-LM`
- `MCORE_RELEASE_VERSION`: Version of Megatron-Core to use. Currently recommended: `0.13`
- `CONTAINER_IMAGE`: Path to the container image file (`.sqsh`). Example: `path/to/container/image.sqsh`
- `CLUSTER`: Name of your cluster environment (e.g., `EOS`, `CW`).
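As a convenience, a small check like the following (not part of the repo) can confirm all of these variables are set before launching anything:

```bash
# Fail fast if any required variable is missing, e.g. at the top of your own launch wrapper.
for var in WANDB_API_KEY MEGATRON_PATH MCORE_RELEASE_VERSION CONTAINER_IMAGE CLUSTER; do
    if [ -z "${!var}" ]; then
        echo "Error: $var is not set" >&2
        exit 1
    fi
done
```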
For performance benchmarking, you can launch scripts either with `sbatch` via `sbatch_benchmarking.sh` or on an interactive node via `interactive_benchmarking.sh`.
- `MODEL` - This is a required environment variable that must be set in your script or command.
  - Predefined models include: `Mixtral-8x2B`, `Mixtral-8x7B`, `Mixtral-8x22B`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `DeepSeek-V3`, `DeepSeek-V3-Lite`, and `Qwen2-57B-A14B`.
- `CLUSTER`, `MCORE_RELEASE_VERSION`, and `MEGATRON_PATH` - These required variables must be defined in your script or command for proper execution.
- `CONTAINER_IMAGE` - This required variable must also be defined in your script or command.
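Putting these together, a hedged example of supplying the required variables inline when launching on an interactive node (all values below are placeholders):

```bash
# Placeholder values; adjust for your environment.
MODEL=Mixtral-8x7B \
CLUSTER=your_cluster_name \
MCORE_RELEASE_VERSION=0.13 \
MEGATRON_PATH=/path/to/Megatron-LM \
CONTAINER_IMAGE=/path/to/container/image.sqsh \
bash ./interactive_benchmarking.sh
```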
Using WandB for Experiment Tracking
Click here to view WandB setup instructions.
- To use WandB for experiment tracking, set `WANDB_API_KEY` with your key from wandb.ai/authorize. It is highly recommended to add `export WANDB_API_KEY="your_own_wandb_api_key"` to your `~/.bashrc`.
- If you do not wish to use WandB, comment out the following lines in your model's `.yaml` configuration file:
  ```yaml
  # --wandb-project: wandb_project_name
  # --wandb-exp-name: wandb_experiment_name
  ```
All model-specific runner configurations can be adjusted through `runtime_configs/benchmarking/runtime.conf` or via the benchmarking command.
Available Model-Specific Runner Configurations
Click here to view available model-specific benchmarking configurations.
- Parallel Mappings: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, and `LAYERS_PER_VP`
- Batch Sizes: `MBS` and `GBS`
- Model Architecture: `NUM_LAYERS`
- MoE Configurations: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`, and `--moe-extended-ep`
- Training Configurations: `NNODES`, `RUN_TIME`, and `PRETRAIN`. Note that specifying a shorter run time may improve your job's priority in the Slurm queue.
- Data Configurations: `SEQ_LEN` and `DATASET`
All available optimal configurations are listed in `runtime_configs/benchmarking/runtime.conf`.
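For illustration, a few of these knobs can be overridden directly on the benchmarking command. The specific values below are arbitrary examples, not tuned settings:

```bash
# Override batch sizes, sequence length, node count, and run time for a short run (values are illustrative only).
MODEL=Mixtral-8x7B MBS=1 GBS=256 SEQ_LEN=4096 NNODES=8 RUN_TIME=00:30:00 bash ./sbatch_benchmarking.sh
```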
All cluster configurations can be customized either through `cluster_configs/benchmarking/your_own_cluster.conf` or via the benchmarking command. For guidance on creating your own cluster configurations, refer to the template provided in `cluster_configs/benchmarking/template.conf`.
- Required Cluster-Specific Slurm Settings: `ACCOUNT`, `PARTITION`, `RUN_NAME`, and `CONTAINER_MOUNTS`
- Required Cluster-Specific Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, and `LOAD_PATH`
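As a rough sketch of what `cluster_configs/benchmarking/your_own_cluster.conf` might contain: the exact format should follow `template.conf`; the shell-style assignments and values below are assumptions and placeholders only:

```bash
# Cluster-specific Slurm settings (placeholders).
ACCOUNT=your_slurm_account
PARTITION=your_partition
RUN_NAME=moe-benchmark
CONTAINER_MOUNTS="/lustre:/lustre"

# Cluster-specific paths (placeholders).
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/dataset
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/checkpoints
```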
- To benchmark a model from scratch with preconfigured parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
  ```
- To train a model with custom parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 PP_FIRST=8 PP_LAST=5 RUN_TIME=00:60:00 NNODES=64 bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
  ```
- To monitor your jobs, use `squeue -u $USER` for a one-time status check or `watch -n 1 squeue -u $USER` for continuous monitoring. For detailed logging, refer to the WandB dashboard.
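Beyond `squeue`, standard Slurm accounting commands can also help when a job has already finished; for example (a generic Slurm command, not specific to this repo):

```bash
# Show recent jobs for the current user with their state and elapsed time.
sacct -u $USER --format=JobID,JobName%30,State,Elapsed,NNodes
```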
Please try MBridge and Megatron-Bridge for better HF<->MCore conversion support.
Download the DeepSeek-V3 checkpoint from HuggingFace:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
```
The downloaded checkpoint is in FP8 format. Run the following command to convert it to BF16 format, using this script:
```bash
python inference/fp8_cast_bf16.py --input-fp8-hf-path /your/input/fp8/hf/path --output-bf16-hf-path /your/output/bf16/hf/path
```
To convert the BF16 HuggingFace checkpoint to a Megatron legacy checkpoint, execute the following command:
```bash
# Example for DeepSeek-V3
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```
Finally, run this command to convert the legacy checkpoint into a distributed checkpoint:
```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/checkpoint bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/save/distributed/checkpoint --ckpt-convert-format torch_dist --no-save-optim
```
For reference, after conversion, the legacy checkpoint is approximately 3.4T, and the distributed checkpoint is about 1.4T.
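Once converted, the distributed checkpoint can be fed back to the benchmarking script through `LOAD_PATH`. A hedged sketch follows; whether this is the intended resume path, and the exact parallelism settings, are assumptions (the values mirror the conversion command above and may need adjusting):

```bash
# Resume/continue training from the converted distributed checkpoint (paths and settings are placeholders).
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 \
LOAD_PATH=/path/to/save/distributed/checkpoint \
bash ./sbatch_benchmarking.sh
```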