Pushing the Boundaries of Foundation Model Training with AMD
AMD is committed to open-source AI by releasing everything behind our GenAI models—from model weights and training configs to datasets and code. Whether you're benchmarking, building, or contributing, you’ll find everything you need to replicate, innovate, and scale with confidence.
Explore Models

Explore Publications
- AI Agent
- Model Compression
- Efficient Architecture
- Speculative Decoding
AI Agent
Model Compression
Quantization | Sparsity
DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models (EMNLP 2024 Industry Track)
DL-QAT is a novel approach for quantization-aware training in large language models that combines weight decomposition and low-rank matrices to optimize quantized weights with minimal parameter changes.
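To make the idea concrete, here is a minimal sketch of a weight-decomposed, low-rank QAT layer in the spirit of DL-QAT. It is not the paper's implementation: the layer name, rank, bit width, per-row magnitude, and symmetric fake-quantization with a straight-through estimator are all illustrative assumptions. The key property it demonstrates is that the frozen base weight is only adjusted through a small low-rank update plus a magnitude term, so very few parameters are trained.

```python
import torch
import torch.nn as nn

class DLQATLinear(nn.Module):
    """Illustrative sketch (not the paper's code): freeze the base weight,
    train a low-rank update A @ B plus a per-row magnitude, and fake-quantize
    the normalized weight with a straight-through estimator (STE)."""

    def __init__(self, base: nn.Linear, rank: int = 8, n_bits: int = 4):
        super().__init__()
        out_f, in_f = base.weight.shape
        # Frozen full-precision base weight; only A, B, magnitude are trained.
        self.weight = nn.Parameter(base.weight.detach(), requires_grad=False)
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.magnitude = nn.Parameter(self.weight.abs().mean(dim=1, keepdim=True))
        self.qmax = 2 ** (n_bits - 1) - 1

    def fake_quant(self, w: torch.Tensor) -> torch.Tensor:
        # Symmetric per-row uniform quantization, gradients via STE.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / self.qmax
        q = torch.clamp(torch.round(w / scale), -self.qmax - 1, self.qmax) * scale
        return w + (q - w).detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direction = self.weight + self.A @ self.B          # low-rank adjustment
        unit = direction / direction.abs().mean(dim=1, keepdim=True)
        return x @ (self.magnitude * self.fake_quant(unit)).t()
```

In training, only `A`, `B`, and `magnitude` receive gradients, which is what keeps the parameter change minimal relative to full QAT.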
Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization
Týr-the-Pruner is an end-to-end search-based global structural pruning framework for LLMs. It constructs a supernet via local pruning across sparsity ratios and uses an iterative prune-and-search strategy. It retains 97% of the dense model's performance while pruning 50% of Llama-3.1-70B's parameters.
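The prune-and-search idea above can be sketched with a toy example. The following is a simplified stand-in, not the paper's method: it uses row-magnitude pruning as the "local pruning", a handful of sparsity ratios as the per-layer candidates (the supernet), calibration-output error as the search objective, and a greedy allocation instead of the paper's iterative search. All function names and ratios are illustrative.

```python
import numpy as np

def local_prune(weight: np.ndarray, ratio: float) -> np.ndarray:
    """Zero out the lowest-magnitude rows (a crude stand-in for structural pruning)."""
    if ratio == 0.0:
        return weight.copy()
    norms = np.linalg.norm(weight, axis=1)
    k = int(round(ratio * weight.shape[0]))
    pruned = weight.copy()
    pruned[np.argsort(norms)[:k]] = 0.0
    return pruned

def search_sparsity(weights, x, target=0.5, ratios=(0.0, 0.25, 0.5, 0.75)):
    """Build per-layer candidates at each ratio (the 'supernet'), score them by
    output error on calibration input x, then greedily pick the per-layer
    sparsity allocation that reaches the global target at minimum added error."""
    errs = []  # errs[l][r] = error of layer l pruned at ratios[r]
    for w in weights:
        ref = x @ w.T
        errs.append([np.linalg.norm(x @ local_prune(w, r).T - ref) for r in ratios])

    choice = [0] * len(weights)  # start fully dense

    def global_sparsity():
        total = sum(w.size for w in weights)
        removed = sum(ratios[c] * w.size for c, w in zip(choice, weights))
        return removed / total

    while global_sparsity() < target:
        # Bump the layer whose next sparsity step adds the least error.
        best = min((l for l in range(len(weights)) if choice[l] + 1 < len(ratios)),
                   key=lambda l: errs[l][choice[l] + 1] - errs[l][choice[l]])
        choice[best] += 1
    return [ratios[c] for c in choice]
```

The point the sketch captures is that the sparsity budget is distributed globally: layers whose outputs tolerate pruning absorb more sparsity, rather than pruning every layer uniformly at 50%.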
Efficient Architecture
Transformer | Diffusion | Hybrid
Speculative Decoding
Accelerating Generative LLMs Inference with Parallel Draft Models (PARD)
Parallel Draft (PARD) is a speculative decoding technique that dramatically accelerates large-model inference. By generating and verifying multiple "draft" tokens in parallel, PARD delivers up to 3.3× speedup on the Llama 3 series, 2.3× on DeepSeek-R1, and 4.87× on the Qwen series.
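The draft-and-verify loop behind speculative decoding can be sketched as follows. This is a simplified, greedy-decoding illustration, not PARD itself: `target_next` and `draft_propose` are hypothetical callables standing in for the target and draft models, and the per-prefix checks are written as a Python loop where a real system would use one batched forward pass.

```python
def speculative_decode(target_next, draft_propose, prompt, k=4, max_new=32):
    """Greedy draft-and-verify loop. draft_propose returns k candidate tokens;
    target_next gives the target model's next token for a prefix. The target
    scores every drafted prefix position (batched in practice), and the longest
    matching draft prefix is accepted, plus one 'free' token from the target."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        draft = draft_propose(tokens, k)             # k tokens from the draft model
        # Target's prediction after each prefix tokens + draft[:i].
        checks = [target_next(tokens + draft[:i]) for i in range(len(draft) + 1)]
        n = 0
        while n < len(draft) and checks[n] == draft[n]:
            n += 1                                    # accept matching prefix
        tokens += draft[:n] + [checks[n]]             # bonus token from the target
    return tokens
```

Because the accepted prefix always matches what the target model would have produced token by token, the output is identical to plain greedy decoding; the speedup comes from amortizing one target pass over several accepted draft tokens.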
Footnotes
- MI200-94: Testing conducted internally by the AMD Research team as of December 2024 on an AMD Instinct MI250 accelerator, measuring the latency of AMD Hummingbird-0.9B, VideoLCM, AnimateLCM, Turbo-v1, Turbo-v2, and VideoCrafter2, all in FP16; results are an average of 5 test rounds.
Test environment:
OS: Ubuntu 22.04 LTS
CPU: AMD EPYC 73F3 CPU x1
GPU: Instinct MI250 GPU x1
GPU Driver: ROCm 6.1
Python 3.8, PyTorch 2.2.0, and FlashAttention 2.2.0.
Inference latency:
VideoLCM = 2.35s
AnimateLCM = 6.38s
Turbo-v1 = 2.49s
Turbo-v2 = 2.57s
VideoCrafter2 = 44.16s
Hummingbird-0.9B = 1.87s
Performance may vary based on different hardware configurations, software versions, and optimizations.
- MI200-095:
On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Llama 3 series models achieve up to 3.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.
SYSTEM CONFIGURATION
System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
NUMA Config: 2 NUMA nodes per socket
Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
Disk: Root drive + Data drive combined:
2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
4 x 7T SAMSUNG MZQL27T6HBLA-00A07
GPU: 4x AMD MI250X 128GB HBM2e 500W
Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
System BIOS: 2.5
System BIOS Vendor: American Megatrends International, LLC.
Host GPU Driver: ROCm™ 6.3.2
- MI200-096:
On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the DeepSeek series models achieve up to 2.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.
SYSTEM CONFIGURATION
System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
NUMA Config: 2 NUMA nodes per socket
Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
Disk: Root drive + Data drive combined:
2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
4 x 7T SAMSUNG MZQL27T6HBLA-00A07
GPU: 4x AMD MI250X 128GB HBM2e 500W
Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
System BIOS: 2.5
System BIOS Vendor: American Megatrends International, LLC.
Host GPU Driver: ROCm™ 6.3.2
- MI200-097:
On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Qwen model series benefits from a 4.87× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.
SYSTEM CONFIGURATION
System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
NUMA Config: 2 NUMA nodes per socket
Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
Disk: Root drive + Data drive combined:
2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
4 x 7T SAMSUNG MZQL27T6HBLA-00A07
GPU: 4x AMD MI250X 128GB HBM2e 500W
Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
System BIOS: 2.5
System BIOS Vendor: American Megatrends International, LLC.
Host GPU Driver: ROCm™ 6.3.2