Megatron-MoE-ModelZoo
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
This guide provides detailed instructions, best practices, and optimized configurations for testing Mixtral, DeepSeek, and Qwen series models using the Megatron-Core framework to achieve optimal performance and reliability.
- DeepSeek-V3 best practices in a single command
- Currently includes H100, B200, and long-context configurations; a GB200 config is coming soon.
- Dockerfile: `dockers/Dockerfile` (see the container build sketch below)
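The repository does not spell out a build workflow here, but as a rough sketch, the provided Dockerfile can be built and converted into the `.sqsh` image the launch scripts expect. The image tag and the use of enroot are assumptions, not something this repo prescribes:

```bash
# Build the container image from the provided Dockerfile (tag name is hypothetical).
docker build -f dockers/Dockerfile -t megatron-moe:latest .

# Convert it to a .sqsh squashfs image for Slurm/pyxis clusters (assumes enroot is available).
enroot import -o /path/to/container/image.sqsh dockerd://megatron-moe:latest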
Please refer to `design_docs`.
Before entering the container, you need to install `yq` to process the `.yaml` configuration files.
Click here to view installation steps.
1. Create a local bin directory:
   ```bash
   mkdir -p ~/.local/bin
   ```
2. Download the `yq` executable:
   ```bash
   wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq
   ```
3. Make it executable:
   ```bash
   chmod +x ~/.local/bin/yq
   ```
4. Add the local bin directory to your `PATH` in `~/.bashrc`:
   ```bash
   export PATH="$HOME/.local/bin:$PATH"
   ```
5. Apply the changes:
   ```bash
   source ~/.bashrc
   ```
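To confirm the installation and get a feel for how `yq` reads the `.yaml` configuration files, a quick sanity check follows. The config path and key below are purely illustrative, not actual files or keys from this repository:

```bash
# Verify yq is on PATH and runs.
yq --version

# Example: read a single key from a YAML file (file and key are hypothetical).
yq '.training.lr' path/to/some_config.yaml
```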
Before running any scripts, you need to set up the following environment variables:
```bash
export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="/path/to/your/megatron/directory"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="/path/to/container/image.sqsh"
export CLUSTER="your_cluster_name"
```
- `WANDB_API_KEY`: Your Weights & Biases API key for experiment tracking. Get your key from wandb.ai/authorize.
- `MEGATRON_PATH`: Absolute path to your Megatron-MoE installation directory. Example: `path/to/Megatron-LM`
- `MCORE_RELEASE_VERSION`: Version of Megatron-Core to use. Currently recommended: `0.13`
- `CONTAINER_IMAGE`: Path to the container image file (`.sqsh`). Example: `path/to/container/image.sqsh`
- `CLUSTER`: Name of your cluster environment (e.g., `EOS`, `CW`).
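As a convenience, a small check like the following (not part of the repo) can confirm all of these variables are set before launching anything:

```bash
# Fail fast if any required variable is missing, e.g. at the top of your own launch wrapper.
for var in WANDB_API_KEY MEGATRON_PATH MCORE_RELEASE_VERSION CONTAINER_IMAGE CLUSTER; do
    if [ -z "${!var}" ]; then
        echo "Error: $var is not set" >&2
        exit 1
    fi
done
```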
For performance benchmarking, you can launch scripts either with `sbatch` via `sbatch_benchmarking.sh` or on an interactive node via `interactive_benchmarking.sh`.
- `MODEL` - This is a required environment variable that must be set in your script or command.
  - Predefined models include: `Mixtral-8x2B`, `Mixtral-8x7B`, `Mixtral-8x22B`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `DeepSeek-V3`, `DeepSeek-V3-Lite`, and `Qwen2-57B-A14B`.
- `CLUSTER`, `MCORE_RELEASE_VERSION`, and `MEGATRON_PATH` - These required variables must be defined in your script or command for proper execution.
- `CONTAINER_IMAGE` - This required variable must also be defined in your script or command.
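Putting these together, a hedged example of supplying the required variables inline when launching on an interactive node (all values below are placeholders):

```bash
# Placeholder values; adjust for your environment.
MODEL=Mixtral-8x7B \
CLUSTER=your_cluster_name \
MCORE_RELEASE_VERSION=0.13 \
MEGATRON_PATH=/path/to/Megatron-LM \
CONTAINER_IMAGE=/path/to/container/image.sqsh \
bash ./interactive_benchmarking.sh
```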
Using WandB for Experiment Tracking
Click here to view WandB setup instructions.
- To use WandB for experiment tracking, set `WANDB_API_KEY` with your key from wandb.ai/authorize. It is highly recommended to add `export WANDB_API_KEY="your_own_wandb_api_key"` to your `~/.bashrc`.
- If you do not wish to use WandB, comment out the following lines in your model's `.yaml` configuration file:
  ```yaml
  # --wandb-project: wandb_project_name
  # --wandb-exp-name: wandb_experiment_name
  ```
All model-specific runner configurations can be adjusted through `runtime_configs/benchmarking/runtime.conf` or via the benchmarking command.
Available Model-Specific Runner Configurations
Click here to view available model-specific benchmarking configurations.
- Parallel Mappings: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, and `LAYERS_PER_VP`
- Batch Sizes: `MBS` and `GBS`
- Model Architecture: `NUM_LAYERS`
- MoE Configurations: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`, and `--moe-extended-ep`
- Training Configurations: `NNODES`, `RUN_TIME`, and `PRETRAIN`. Note that specifying a shorter run time may improve your job's priority in the Slurm queue.
- Data Configurations: `SEQ_LEN` and `DATASET`
All available optimal configurations are listed in `runtime_configs/benchmarking/runtime.conf`.
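For illustration, a few of these knobs can be overridden directly on the benchmarking command. The specific values below are arbitrary examples, not tuned settings:

```bash
# Override batch sizes, sequence length, node count, and run time for a short run (values are illustrative only).
MODEL=Mixtral-8x7B MBS=1 GBS=256 SEQ_LEN=4096 NNODES=8 RUN_TIME=00:30:00 bash ./sbatch_benchmarking.sh
```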
All cluster configurations can be customized either through `cluster_configs/benchmarking/your_own_cluster.conf` or via the benchmarking command. For guidance on creating your own cluster configurations, refer to the template provided in `cluster_configs/benchmarking/template.conf`.
- Required Cluster-Specific Slurm Settings: `ACCOUNT`, `PARTITION`, `RUN_NAME`, and `CONTAINER_MOUNTS`
- Required Cluster-Specific Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, and `LOAD_PATH`
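As a rough sketch of what `cluster_configs/benchmarking/your_own_cluster.conf` might contain: the exact format should follow `template.conf`; the shell-style assignments and values below are assumptions and placeholders only:

```bash
# Cluster-specific Slurm settings (placeholders).
ACCOUNT=your_slurm_account
PARTITION=your_partition
RUN_NAME=moe-benchmark
CONTAINER_MOUNTS="/lustre:/lustre"

# Cluster-specific paths (placeholders).
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/dataset
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/checkpoints
```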
- To benchmark a model from scratch with preconfigured parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
  ```
- To train a model with custom parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 PP_FIRST=8 PP_LAST=5 RUN_TIME=00:60:00 NNODES=64 bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
  ```
- To monitor your jobs, use `squeue -u $USER` for a one-time status check or `watch -n 1 squeue -u $USER` for continuous monitoring. For detailed logging, refer to the WandB dashboard.
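Beyond `squeue`, standard Slurm accounting commands can also help when a job has already finished; for example (a generic Slurm command, not specific to this repo):

```bash
# Show recent jobs for the current user with their state and elapsed time.
sacct -u $USER --format=JobID,JobName%30,State,Elapsed,NNodes
```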
Please try MBridge and Megatron-Bridge for better HF<->MCore conversion support.
Download the DeepSeek-V3 checkpoint from HuggingFace:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
```
The downloaded checkpoint is in FP8 format. Run the following command to convert it to BF16 format, using this script:
```bash
python inference/fp8_cast_bf16.py --input-fp8-hf-path /your/input/fp8/hf/path --output-bf16-hf-path /your/output/bf16/hf/path
```
To convert the BF16 HuggingFace checkpoint to a Megatron legacy checkpoint, execute the following command:
```bash
# Example for DeepSeek-V3
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```
Finally, run this command to convert the legacy checkpoint into a distributed checkpoint:
```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/checkpoint bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/save/distributed/checkpoint --ckpt-convert-format torch_dist --no-save-optim
```
For reference, after conversion, the legacy checkpoint is approximately 3.4T, and the distributed checkpoint is about 1.4T.
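Once converted, the distributed checkpoint can be fed back to the benchmarking script through `LOAD_PATH`. A hedged sketch follows; whether this is the intended resume path, and the exact parallelism settings, are assumptions (the values mirror the conversion command above and may need adjusting):

```bash
# Resume/continue training from the converted distributed checkpoint (paths and settings are placeholders).
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 \
LOAD_PATH=/path/to/save/distributed/checkpoint \
bash ./sbatch_benchmarking.sh
```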