Pushing the Boundaries of Foundation Model Training with AMD

AMD is committed to open-source AI by releasing everything behind our GenAI models—from model weights and training configs to datasets and code. Whether you're benchmarking, building, or contributing, you’ll find everything you need to replicate, innovate, and scale with confidence.

Explore Models

Explore Publications

  1. AI Agent
  2. Model Compression
  3. Efficient Architecture
  4. Speculative Decoding

AI Agent

Model Compression

Quantization | Sparsity 

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism (COLING 2025)

SDS prunes pre-trained language models in three steps (an initial one-shot prune, a dense phase under sparse regularization, and a second prune), yielding a more prunable weight distribution and outperforming SparseGPT and Wanda.
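The three-step mechanism can be sketched on a toy weight matrix. This is a simplified illustration, not the paper's implementation: `magnitude_prune` and the shrink-based dense phase are stand-ins for the actual calibrated pruning and sparse-regularized retraining.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights (one-shot)."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def dense_phase(w_pruned, w_dense, shrink=0.5):
    """Dense phase (toy): re-activate pruned weights at reduced
    magnitude, nudging the distribution toward a prunable shape."""
    return np.where(w_pruned == 0.0, shrink * w_dense, w_pruned)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))

w_sparse1 = magnitude_prune(w, 0.5)          # step 1: one-shot prune
w_regular = dense_phase(w_sparse1, w)        # step 2: regularized dense phase
w_sparse2 = magnitude_prune(w_regular, 0.5)  # step 3: prune again

print((w_sparse2 == 0).mean())  # final sparsity ≈ 0.5
```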

TernaryLLM: Ternarized Large Language Model

Dual Learnable Ternarization (DLT) and Outlier-Friendly Feature Knowledge Distillation (OFF) handle outliers in weights and activations, enabling TernaryLLM to outperform prior low-bit methods in text generation and zero-shot tasks.
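The dual-scale idea can be illustrated with a toy ternarizer. This is an assumption-laden simplification: the per-side scales here are fit in closed form, whereas DLT learns them end-to-end, and `delta_frac` is a made-up threshold.

```python
import numpy as np

def ternarize_dual(w, delta_frac=0.7):
    """Map weights to {-a, 0, +b} with separate scales for the
    negative and positive sides, absorbing asymmetric outliers."""
    delta = delta_frac * np.mean(np.abs(w))          # zero threshold
    pos, neg = w > delta, w < -delta
    a = np.abs(w[neg]).mean() if neg.any() else 0.0  # negative-side scale
    b = w[pos].mean() if pos.any() else 0.0          # positive-side scale
    q = np.zeros_like(w)
    q[pos], q[neg] = b, -a
    return q

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 16))
q = ternarize_dual(w)
print(len(set(np.round(q.ravel(), 6))))  # at most 3 distinct levels
```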

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models (EMNLP 2024 Industry Track)

DL-QAT is a novel approach for quantization-aware training in large language models that combines weight decomposition and low-rank matrices to optimize quantized weights with minimal parameter changes.
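A minimal sketch of the decomposition, assuming a frozen base weight, a trainable per-row magnitude, a LoRA-style low-rank update, and a uniform fake quantizer; all of these are simplified stand-ins for the method's actual components.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Uniform symmetric fake quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(2)
d, r = 16, 2
W0 = rng.normal(size=(d, d))        # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # init so the update BA starts at 0
m = np.ones((d, 1))                 # trainable per-row magnitude

W = m * (W0 + B @ A)                # decomposed weight
Wq = fake_quant(W, bits=4)
print(len(np.unique(Wq)))           # at most 15 levels at 4 bits
```

Only `m`, `A`, and `B` would be updated during training, which is how the method keeps parameter changes minimal.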

Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization

Týr-the-Pruner is an end-to-end search-based global structural pruning framework for LLMs. It constructs a supernet via local pruning across sparsity ratios and uses an iterative prune-and-search strategy. It retains 97% of the dense model's performance while pruning 50% of Llama-3.1-70B's parameters.
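The supernet search can be caricatured with a tiny toy: each layer offers a few local sparsity ratios, and we enumerate assignments that hit a 50% global budget while minimizing a made-up per-layer sensitivity cost. The real method searches iteratively over a pruned supernet rather than brute-forcing a toy objective.

```python
import itertools

ratios = [0.25, 0.5, 0.75]          # local sparsity candidates per layer
sensitivity = [3.0, 1.0, 2.0, 0.5]  # made-up per-layer pruning cost

def cost(assign):
    return sum(s * c for s, c in zip(assign, sensitivity))

best = None
for assign in itertools.product(ratios, repeat=len(sensitivity)):
    if abs(sum(assign) / len(assign) - 0.5) < 1e-9:  # 50% global budget
        if best is None or cost(assign) < cost(best):
            best = assign

print(best)  # → (0.25, 0.75, 0.25, 0.75): sensitive layers pruned less
```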

Efficient Architecture

Transformer | Diffusion | Hybrid 

MSWA: Refining Local Attention with Multi-Scale Window Attention

MSWA improves sliding-window attention (SWA) by assigning diverse window sizes across Transformer heads and layers: smaller windows in shallow layers and larger ones in deep layers, improving both performance and efficiency.
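One possible window schedule in this spirit is sketched below; the function name and the exact depth/head scaling are assumptions for illustration, not the paper's configuration.

```python
def mswa_windows(num_layers, num_heads, base=256):
    """Per-layer, per-head attention window sizes: the base window
    grows with depth, and heads within a layer fan out over scales."""
    schedule = []
    for layer in range(num_layers):
        layer_base = base * 2 ** (layer * 3 // num_layers)  # deeper -> larger
        schedule.append([layer_base // 2 ** (h % 3) for h in range(num_heads)])
    return schedule

sched = mswa_windows(num_layers=6, num_heads=4)
print(sched[0], sched[-1])  # → [256, 128, 64, 256] [1024, 512, 256, 1024]
```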

Speculative Decoding

Accelerating Generative LLMs Inference with Parallel Draft Models (PARD)

Parallel Draft (PARD) is a speculative decoding technique that dramatically accelerates large-model inference. By generating and verifying multiple “draft” tokens in parallel, PARD delivers up to 3.3× speedup on the Llama 3 series, 2.3× on DeepSeek-R1, and 4.87× on the Qwen series.2,3,4
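The draft-and-verify loop underlying speculative decoding can be sketched with stand-in models. These are toy functions, not real LLMs, and PARD's actual contribution (drafting the whole block in one parallel pass) is only approximated here.

```python
def draft_block(prefix, k):
    """Cheap draft model (toy): propose k tokens at once; a deliberate
    mismatch at position 2 demonstrates partial acceptance."""
    return [prefix[-1] + i + 1 if i != 2 else -1 for i in range(k)]

def target_next(prefix):
    """Expensive target model (toy): the next token it would emit."""
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Draft k tokens, verify them against the target, accept the
    longest matching prefix, then take one guaranteed target token."""
    accepted = []
    for tok in draft_block(prefix, k):
        if tok != target_next(prefix + accepted):
            break
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # always gain >= 1 token
    return accepted

out = speculative_step([1, 2, 3])
print(out)  # → [4, 5, 6]: two drafts accepted plus one target token
```

Each call to `speculative_step` costs roughly one target-model pass but can emit several tokens, which is where the reported speedups come from.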

Footnotes
  1. MI200-94: Testing conducted internally by the AMD Research team as of December 2024 on an AMD Instinct MI250 accelerator, measuring the latency of AMD Hummingbird-0.9B, VideoLCM, AnimateLCM, Turbo-v1, Turbo-v2, and VideoCrafter2, all in FP16; results are an average of 5 test rounds.
    Test environment:
    OS:  Ubuntu 22.04 LTS
    CPU: AMD EPYC 73F3 CPU x1
    GPU: Instinct MI250 GPU x1
    GPU Driver: ROCm 6.1
    Python 3.8, PyTorch 2.2.0, and FlashAttention 2.2.0.
    Inference latency:
    VideoLCM = 2.35s
    AnimateLCM = 6.38s
    Turbo-v1 = 2.49s
    Turbo-v2 = 2.57s
    VideoCrafter2 = 44.16s
    Hummingbird-0.9B = 1.87s
    Performance may vary based on different hardware configurations, software versions and optimization.
  2. MI200-095:
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Llama 3 series models achieve up to 3.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  3. MI200-096:
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the DeepSeek series models achieve up to 2.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  4. MI200-097:
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Qwen model series achieves up to a 4.87× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2