Enabling FlashInfer on ROCm for Accelerated LLM Serving#
FlashInfer is an innovative framework designed to accelerate inference of large language models (LLMs). Given the explosive growth and adoption of models like DeepSeek R1, Llama 3, and Qwen 3, efficient inference is critical to meet the demands of real-world deployment. However, challenges such as GPU memory bottlenecks, throughput limitations, and latency remain significant hurdles for deploying these models at scale.
Originally developed for NVIDIA GPUs using CUDA, FlashInfer leverages advanced techniques like efficient key-value (KV) cache management and optimized attention mechanisms to minimize latency and memory usage. We are now excited to announce an early release of FlashInfer on ROCm, enabling users with AMD GPUs to achieve improved inference performance. This release extends FlashInfer’s capabilities to AMD hardware, laying the foundation for reduced inference latency, optimized memory usage, and lower operational costs.
This blog provides an overview of FlashInfer, its core concepts, and a simple example of how to use FlashInfer on ROCm for decoding. Note that this is an early release, with some features still in development.
What is FlashInfer?#
FlashInfer is a library that accelerates LLM inference by optimizing critical components of the transformer architecture, including:
Efficient KV-Cache Management: Reduces memory overhead by compressing and managing key-value caches dynamically.
Optimized Attention Mechanisms: Implements high-performance attention kernels to minimize computation latency.
Memory-Efficient Decoding: Streamlines decoding processes to maximize throughput and reduce memory footprint.
AMD GPU users can now begin leveraging these optimizations.
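To make the decode-time workload concrete before diving into the setup: at each generation step, a single new query token attends over every key/value pair accumulated in the cache, and the cache grows by one entry per generated token. The sketch below is our own plain-NumPy illustration of that pattern (the function name `step` is ours, not part of FlashInfer's API); FlashInfer implements fused, GPU-optimized kernels for this same computation.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim = 8

def step(q, k_new, v_new, K_cache, V_cache):
    """One decode step: append the new token's key/value to the cache,
    then attend the query over every cached position."""
    K_cache = np.vstack([K_cache, k_new[None]])
    V_cache = np.vstack([V_cache, v_new[None]])
    scores = K_cache @ q / np.sqrt(head_dim)      # (cache_len,)
    w = np.exp(scores - scores.max())             # stable softmax
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache          # output: (head_dim,)

K = np.empty((0, head_dim))
V = np.empty((0, head_dim))
for t in range(4):  # four decode steps
    q, k_new, v_new = rng.standard_normal((3, head_dim))
    o, K, V = step(q, k_new, v_new, K, V)
print(K.shape)  # cache holds one key per generated token: (4, 8)
```

Because the cache is re-read in full at every step, decode attention is memory-bandwidth bound, which is exactly where the KV-cache and kernel optimizations above pay off.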
ROCm FlashInfer#
Prerequisites#
To run FlashInfer on ROCm, ensure the following requirements are met:
Linux: see the supported Linux distributions.
ROCm 6.4+: see the installation instructions.
An AMD Instinct MI300X or MI325X GPU.
Once your system is ready, follow the steps below to set up the environment.
Getting Started#
To simplify the setup process, we’ll clone the FlashInfer repository from ROCm’s GitHub and build a Docker container tailored for ROCm.
git clone https://github.com/ROCm/flashinfer.git
cd flashinfer
Using the provided Dockerfile, build an image with all necessary dependencies for FlashInfer on ROCm:
docker build -f docker/Dockerfile.rocm_ci --target flashinfer_base -t flashinfer-rocm . 2>&1 | tee docker_build.log
Run the container with necessary privileges and GPU access:
docker run -it --network=host --group-add=video \
  --privileged --ipc=host --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined --device /dev/kfd \
  --device /dev/dri flashinfer-rocm
The following section explains how to run FlashInfer on ROCm using a basic example.
Running FlashInfer on ROCm#
To illustrate FlashInfer’s capabilities, here’s a simple example showcasing a single-request decode attention kernel:
import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # decode attention without RoPE on-the-fly
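For intuition, the kernel call above is mathematically equivalent to running softmax attention independently per head, with one query vector attending over the full KV cache. The following is our own NumPy reference with the same shapes as the example, not FlashInfer's implementation, and it assumes the query and KV head counts match (no grouped-query sharing):

```python
import numpy as np

def reference_decode(q, k, v):
    """Per-head single-query attention.
    q: (num_qo_heads, head_dim); k, v: (kv_len, num_kv_heads, head_dim).
    Assumes num_qo_heads == num_kv_heads."""
    num_heads, head_dim = q.shape
    out = np.empty_like(q)
    for h in range(num_heads):
        scores = k[:, h, :] @ q[h] / np.sqrt(head_dim)  # (kv_len,)
        w = np.exp(scores - scores.max())               # stable softmax
        w /= w.sum()
        out[h] = w @ v[:, h, :]
    return out

rng = np.random.default_rng(0)
kv_len, num_heads, head_dim = 2048, 32, 128
q = rng.standard_normal((num_heads, head_dim))
k = rng.standard_normal((kv_len, num_heads, head_dim))
v = rng.standard_normal((kv_len, num_heads, head_dim))
o = reference_decode(q, k, v)
print(o.shape)  # (32, 128)
```

The fused kernel computes the same result in a single launch without materializing the full score matrix, which is where the latency and memory savings come from.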
Summary#
The early release of FlashInfer for ROCm marks an important milestone in making high-performance LLM inference accessible on AMD GPUs. While this is just the beginning, ongoing development will expand features and improve support, helping the community run large models more efficiently and cost-effectively on AMD hardware.
Stay tuned for updates, and try FlashInfer on ROCm today to start experiencing accelerated LLM serving on AMD GPUs!
Acknowledgements#
The authors wish to acknowledge the AMD teams that supported this work, whose contributions were instrumental in enabling FlashInfer: Aditya Bhattacharji, Pankaj Gupta, Radha Srimanthula, Anisha Sankar, Amit Kumar, Ram Seenivasan, Eliot Li, Ian Dass, Kiran Thumma, Aakash Sudhanwa, Ehud Sharlin, Saad Rahim.
Additional Resources#
Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., & Ceze, L. (2025). FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005.
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.