yanring/Megatron-MoE-ModelZoo

Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.

This guide provides detailed instructions, best practices, and optimized configurations for testing Mixtral, DeepSeek, and Qwen series models using the Megatron-Core framework to achieve optimal performance and reliability.

Key Updates

Container Setup

Design Docs

Please refer to design_docs.

Before Running

Login Node Setup

Before entering the container, you need to install yq to process .yaml configuration files.

Installation steps:
  1. Create a local bin directory:

    mkdir -p ~/.local/bin
  2. Download the yq executable:

    wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq
  3. Make it executable:

    chmod +x ~/.local/bin/yq
  4. Add the local bin directory to your PATH in ~/.bashrc:

    export PATH="$HOME/.local/bin:$PATH"
  5. Apply the changes:

    source ~/.bashrc

Environment Setup

Before running any scripts, you need to set up the following environment variables:

export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="/path/to/your/megatron/directory"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="/path/to/container/image.sqsh"
export CLUSTER="your_cluster_name"
  • WANDB_API_KEY: Your Weights & Biases API key for experiment tracking.
  • MEGATRON_PATH: Absolute path to your Megatron-MoE installation directory.
    • Example: path/to/Megatron-LM
  • MCORE_RELEASE_VERSION: Version of Megatron-Core to use.
    • Currently recommended: 0.13
  • CONTAINER_IMAGE: Path to the container image file (.sqsh).
    • Example: path/to/container/image.sqsh
  • CLUSTER: Name of your cluster environment (e.g., EOS, CW).
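A quick pre-flight check can catch a missing variable before a job is submitted. The check_env helper below is illustrative and not part of the repository; it simply verifies that each variable listed above is set:

```shell
# Illustrative pre-flight check (not part of the repo): verify that every
# required environment variable from the list above is set before launching.
check_env() {
  local var missing=0
  for var in WANDB_API_KEY MEGATRON_PATH MCORE_RELEASE_VERSION CONTAINER_IMAGE CLUSTER; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var:-}" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "All required variables are set."
  return "$missing"
}
```

Calling check_env at the top of a launch script fails fast with a named error instead of a cryptic failure deep inside the container.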

Performance Benchmarking

Benchmarking Script Usage

For performance benchmarking, you can launch scripts either with sbatch via sbatch_benchmarking.sh or on an interactive node via interactive_benchmarking.sh.

  • MODEL

    • This is a required environment variable that must be set in your script or command.
    • Predefined models include: Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, DeepSeek-V3-Lite, and Qwen2-57B-A14B.
  • CLUSTER, MCORE_RELEASE_VERSION, and MEGATRON_PATH

    • These required variables must be defined in your script or command for proper execution.
  • CONTAINER_IMAGE

    • Path to the container image file (.sqsh), as described in Environment Setup.
  • Using WandB for Experiment Tracking

    • To use WandB for experiment tracking, set WANDB_API_KEY with your key from wandb.ai/authorize. It is highly recommended to add export WANDB_API_KEY="your_own_wandb_api_key" to your ~/.bashrc.
    • If you do not wish to use WandB, comment out the following lines in your model's .yaml configuration file:
      # --wandb-project: wandb_project_name
      # --wandb-exp-name: wandb_experiment_name

Runner Configuration Setup

All model-specific runner configurations can be adjusted through runtime_configs/benchmarking/runtime.conf or via the benchmarking command.

  • Available Model-Specific Runner Configurations

    • Parallel Mappings: TP, PP, EP, CP, VPP, PP_FIRST, PP_LAST, and LAYERS_PER_VP
    • Batch Sizes: MBS and GBS
    • Model Architecture: NUM_LAYERS
    • MoE Configurations: MOE_TOKEN_DISPATCHER, MOE_GROUPED_GEMM, and --moe-extended-ep
    • Training Configurations: NNODES, RUN_TIME, and PRETRAIN. Note that specifying a shorter run time may improve your job's priority in the Slurm queue.
    • Data Configurations: SEQ_LEN and DATASET
  • All available optimal configurations are listed in runtime_configs/benchmarking/runtime.conf.
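As an illustration, the runner variables above are expressed as shell-style assignments. The values below are placeholders for the sake of example, not tuned settings from runtime_configs/benchmarking/runtime.conf:

```shell
# Illustrative runtime.conf-style overrides (placeholder values, not tuned):
MODEL=Mixtral-8x7B          # one of the predefined models
TP=1 PP=4 EP=8 VPP=1        # parallel mapping
MBS=1 GBS=256               # micro / global batch sizes
SEQ_LEN=4096                # sequence length
NNODES=8                    # number of nodes
RUN_TIME=04:00:00           # shorter run times may queue faster
MOE_GROUPED_GEMM=true       # MoE kernel option
```

Any of these can also be passed inline on the benchmarking command, as shown in the launch examples below.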

Cluster-Related Configuration Setup

All cluster configurations can be customized either through cluster_configs/benchmarking/your_own_cluster.conf or via the benchmarking command. For guidance on creating your own cluster configurations, refer to the template provided in cluster_configs/benchmarking/template.conf.

  • Required Cluster-Specific Slurm Settings: ACCOUNT, PARTITION, RUN_NAME, and CONTAINER_MOUNTS
  • Required Cluster-Specific Paths: OUTPUT_PATH, DATA_PATH, TOKENIZER_MODEL, and LOAD_PATH
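For orientation, a cluster config in the style of cluster_configs/benchmarking/template.conf might look like the sketch below; every value is a placeholder for your own cluster:

```shell
# Illustrative cluster config (placeholder values only):
ACCOUNT=my_slurm_account            # Slurm billing account
PARTITION=batch                     # Slurm partition to submit to
RUN_NAME=moe-benchmark              # job name prefix
CONTAINER_MOUNTS="/lustre:/lustre"  # host:container bind mounts
OUTPUT_PATH=/lustre/outputs                       # logs and checkpoints
DATA_PATH=/lustre/data/my_dataset                 # pretraining dataset
TOKENIZER_MODEL=/lustre/tokenizers/tokenizer.model
LOAD_PATH=/lustre/checkpoints                     # checkpoint to resume from
```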

Benchmarking Script Launch

  • To benchmark a model from scratch with preconfigured parameters:
    # Example for DeepSeek-V3
    MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
  • To train a model with custom parameters:
    # Example for DeepSeek-V3
    MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 PP_FIRST=8 PP_LAST=5 RUN_TIME=00:60:00 NNODES=64 bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
  • To monitor your jobs, use squeue -u $USER for a one-time status check or watch -n 1 squeue -u $USER for continuous monitoring. For detailed logging, refer to the WandB dashboard.

DeepSeek Checkpoint Conversion

Note: Please try MBridge and Megatron-Bridge for better HF<->MCore conversion support.

1. Download DeepSeek-V3 Checkpoint

Download the DeepSeek-V3 checkpoint from HuggingFace:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3

The downloaded checkpoint is in FP8 format. Run the following command to convert it to BF16 format, using this script:

python inference/fp8_cast_bf16.py --input-fp8-hf-path /your/input/fp8/hf/path --output-bf16-hf-path /your/output/bf16/hf/path

2. Convert to Megatron Legacy Checkpoint

To convert the BF16 HuggingFace checkpoint to a Megatron legacy checkpoint, execute the following command:

# Example for DeepSeek-V3
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh

3. Convert to Distributed Checkpoint

Finally, run this command to convert the legacy checkpoint into a distributed checkpoint:

MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/checkpoint bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/save/distributed/checkpoint --ckpt-convert-format torch_dist --no-save-optim

For reference, after conversion, the legacy checkpoint is approximately 3.4 TB, and the distributed checkpoint is about 1.4 TB.
