Coding Agents on AMD GPUs: Fast LLM Pipelines for Developers
The rapid rise of AI-assisted development is transforming how software is built, with coding agents emerging as powerful tools for modern developers. In this blog, we will show you how to deploy coding agents on AMD GPUs using frameworks such as SGLang, vLLM, and llama.cpp, and walk through a practical workflow example: creating a Minesweeper game using Aider.
A coding agent is a type of AI agent designed specifically to assist or automate software development tasks. Unlike simple autocomplete tools, a coding agent can interpret natural language instructions, plan a sequence of coding steps, interact with development tools (like Git, editors, and test frameworks), and iteratively refine the code until it works. By combining large language model reasoning with external tool integrations, a coding agent functions like an AI-powered developer that can handle project-wide context, create and modify files, and collaborate with developers in a feedback loop.
Coding agents provide a wide range of features that make them powerful development assistants. They can generate code from natural language, perform multi-file edits, understand repo-wide dependencies, and even run tests to validate correctness. They are capable of error-driven self-correction, refactoring, and environment setup, while also managing Git operations such as commits, diffs, and pull requests. More advanced agents extend their capabilities to code review, offering insights on bugs, performance, security, and best practices. With memory and contextual awareness, coding agents not only automate repetitive tasks but also maintain consistency across complex projects.
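The generate-test-refine loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration: `run_tests` and `propose_fix` are hypothetical placeholders for a real test runner and a real LLM call.

```python
# Toy sketch of a coding agent's feedback loop: run tests, ask the model
# for a fix, and repeat until the tests pass or a retry budget is spent.

def run_tests(code: str) -> list[str]:
    """Return a list of failure messages (empty means all tests pass)."""
    failures = []
    if "def add" not in code:
        failures.append("missing function: add")
    return failures

def propose_fix(code: str, failures: list[str]) -> str:
    """Stand-in for an LLM call that patches the code based on failures."""
    if "missing function: add" in failures:
        code += "\ndef add(a, b):\n    return a + b\n"
    return code

def agent_loop(code: str, max_iters: int = 3) -> str:
    """Iteratively run tests and request fixes until the tests pass."""
    for _ in range(max_iters):
        failures = run_tests(code)
        if not failures:
            break
        code = propose_fix(code, failures)
    return code
```

Real agents such as Aider replace `propose_fix` with an actual model call and `run_tests` with your project's test suite, but the control flow is the same.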
Developers increasingly rely on coding agents because they bridge the gap between high-level human intent and low-level code implementation, allowing them to focus on product design and architecture instead of boilerplate or debugging. Coding agents improve productivity, reduce human error, and scale development work that would normally require larger teams. Importantly, coding agent workflows can now run efficiently on AMD GPUs, which provide the high memory bandwidth and compute performance needed for large model inference. Frameworks such as SGLang, vLLM, and llama.cpp offer highly efficient inference, all fully supported on AMD datacenter GPUs, and benchmarking results have demonstrated strong performance. This makes it easier for individuals and organizations to deploy coding agents locally or in the cloud and accelerate AI-assisted software development at scale.
Some of the widely used AI coding agents include GitHub Copilot, Cursor, Codex, Cline, and Aider. While Copilot, Cursor, and Cline excel at code completion and suggestions within popular IDEs, and Codex powers natural-language-to-code generation, Aider stands out for its workflow-driven flexibility. It enables integration with custom LLMs, supports multi-step coding tasks, and can run models locally or via APIs. Aider is designed for iterative development, debugging, and project management, all through the command line, making it especially powerful for complex coding environments.
Requirements#
AMD GPU: See the ROCm documentation page for supported hardware and operating systems.
ROCm 6.4: See the ROCm installation for Linux guide for installation instructions.
Docker: See Install Docker Engine on Ubuntu for installation instructions.
8 x MI300X GPUs: At least one node with 8 x MI300X GPUs is needed to deploy the DeepSeek-V3.1 model used in this example.
Deploy LLM#
AMD supports a wide range of inference frameworks. In this guide, we walk through the steps to deploy DeepSeek-V3.1 for inference using SGLang, vLLM, and llama.cpp.
SGLang#
SGLang is a high-performance framework designed for the efficient serving of large language models and vision language models. It offers a range of features that make interaction with models faster and more controllable. The core features include efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization, and multi-LoRA batching. You can find both ROCm-supported official Docker images and staged Docker images on Docker Hub.
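To build intuition for prefix caching, consider that requests sharing a prompt prefix (a common system prompt, for example) can reuse the KV cache already computed for that prefix. The sketch below is a toy illustration of the idea, not SGLang's actual RadixAttention implementation:

```python
# Toy prefix cache: a trie over token ids. The length of the longest
# cached prefix tells us how many tokens' KV entries can be reused
# instead of recomputed.

class PrefixCacheNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixCacheNode

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record a served token sequence so later requests can match it."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())

    def longest_cached_prefix(self, tokens) -> int:
        """Number of leading tokens whose KV cache can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched
```

RadixAttention generalizes this with a radix tree plus eviction policies, but the payoff is the same: shared prefixes are computed once.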
Deploy DeepSeek-V3.1 on MI300X
```shell
docker pull lmsysorg/sglang:v0.5.3rc0-rocm630-mi30x

docker run --cap-add=SYS_PTRACE --ipc=host --privileged=true \
  --shm-size=128GB --network=host --device=/dev/kfd \
  --device=/dev/dri --group-add video -it \
  lmsysorg/sglang:v0.5.3rc0-rocm630-mi30x

RCCL_MSCCL_ENABLE=0 CK_MOE=1 HSA_NO_SCRATCH_RECLAIM=1 \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.1 \
  --host 0.0.0.0 --port 30000 --tp 8 --trust-remote-code
```
This should launch DeepSeek-V3.1 on port 30000, providing an interface for OpenAI-compatible APIs.
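Once the server is up, you can sanity-check the endpoint with a minimal client. This sketch assumes the host, port, and model name used above; `build_chat_request` and `query` are illustrative helpers written here, not part of any SDK:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def query(base_url: str, payload: dict) -> dict:
    """POST the payload to the OpenAI-compatible endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires the server launched above to be running):
#   payload = build_chat_request("deepseek-ai/DeepSeek-V3.1",
#                                "Write hello world in Python.")
#   reply = query("http://<Your-Server-IP>:30000", payload)
#   print(reply["choices"][0]["message"]["content"])
```

The same client works unchanged against the vLLM and llama.cpp servers in the following sections, since all three expose OpenAI-compatible APIs on port 30000.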
The plot below shows a performance comparison between AMD MI300X and NVIDIA H200 on SGLang running the DeepSeek-R1 model.
For more details, check out this blog.

vLLM#
vLLM accelerates large language model (LLM) inference and serving primarily through two core features: PagedAttention, an optimized memory management technique that reduces KV cache memory waste, and continuous batching, a dynamic scheduling algorithm that maximizes GPU utilization by processing requests as they arrive. These innovations allow vLLM to achieve significantly higher throughput and lower latency, while also supporting additional optimizations like quantization and speculative decoding. You can find both ROCm-supported official Docker images and staged Docker images on Docker Hub.
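To build intuition for PagedAttention, KV-cache memory can be managed as fixed-size blocks handed out on demand and returned to the pool when a sequence finishes, rather than reserving memory up front for the maximum sequence length. This is a toy sketch of the idea, not vLLM's implementation:

```python
# Toy paged KV-cache allocator: each sequence holds a table of physical
# block ids and acquires a new block only when its last block fills up.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Store one more token's KV entry, allocating a block if needed."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or last block just filled
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a sequence must be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks so new requests can start."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Freeing blocks the moment a sequence completes is what lets continuous batching admit new requests immediately instead of waiting for a whole batch to drain.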
Deploy DeepSeek-V3.1 on MI300X
```shell
docker pull rocm/vllm:latest

docker run --cap-add=SYS_PTRACE --ipc=host --privileged=true \
  --shm-size=128GB --network=host --device=/dev/kfd \
  --device=/dev/dri --group-add video -it \
  rocm/vllm:latest

VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 \
  VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
  vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8 \
  --disable-log-requests --trust-remote-code --host 0.0.0.0 --port 30000
```
This should launch DeepSeek-V3.1 on port 30000, providing an interface for OpenAI-compatible APIs.
llama.cpp#
llama.cpp is a lightweight, high-performance C++ framework designed for running large language models (LLMs) by efficiently using computational resources. Its core features include flexible model quantization using the GGUF format, which compresses models to enable inference on devices with limited memory. You can find ROCm-supported Docker images on Docker Hub.
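As a back-of-the-envelope illustration of why Q4 quantization matters here, weight memory scales with bits per weight. The figures below are rough assumptions rather than measured numbers: DeepSeek-V3.1 has about 671B total parameters, and Q4_K_M averages roughly 4.5 bits per weight once quantization scales and metadata are included.

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8 bytes.
# Ignores KV cache, activations, and runtime overhead.

def est_weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 671e9  # approximate DeepSeek-V3.1 total parameter count

fp16_gb = est_weight_gb(n, 16)   # ~1342 GB: too large for a single node's HBM
q4_gb = est_weight_gb(n, 4.5)    # ~377 GB: fits comfortably on 8 x MI300X (1.5 TB)
```

This is why the example below downloads the Q4_K_M GGUF checkpoint rather than the full-precision weights.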
```shell
docker pull rocm/llama.cpp:llama.cpp-b5997_rocm6.4.0_ubuntu24.04_full

docker run --cap-add=SYS_PTRACE --ipc=host --privileged=true \
  --shm-size=128GB --network=host --device=/dev/kfd \
  --device=/dev/dri --group-add video -it \
  rocm/llama.cpp:llama.cpp-b5997_rocm6.4.0_ubuntu24.04_full
```
Download Q4 GGUF checkpoints for DeepSeek-V3.1.
```python
from huggingface_hub import snapshot_download

# Define the model repository and destination directory
model_id = "unsloth/DeepSeek-V3.1-GGUF"
local_dir = "<your huggingface cache directory>/hub/models--unsloth--DeepSeek-V3.1-GGUF"

# Download only files matching the pattern "DeepSeek-V3.1-Q4_K_M*"
snapshot_download(
    repo_id=model_id,
    local_dir=local_dir,
    local_dir_use_symlinks=False,
    allow_patterns=["Q4_K_M/DeepSeek-V3.1-Q4_K_M*"],
)

print(f"Downloaded GGUF file(s) matching pattern to: {local_dir}")
```
Launch the DeepSeek-V3.1 Q4 model and start the inference server that accepts OpenAI-compatible APIs.
The plot below shows a performance comparison between AMD MI300X and NVIDIA H100 on llama.cpp running the DeepSeek-V3 Q4 model.
For more details, check out this blog.

```shell
cd /app/build/bin

./llama-server \
  -m <your huggingface cache directory inside the container>/hub/models--unsloth--DeepSeek-V3.1-GGUF/Q4_K_M/DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf \
  -ngl 999 -np 4 --alias unsloth/DeepSeek-V3.1-Q4_K_M \
  --host 0.0.0.0 --port 30000
```
This should launch DeepSeek-V3.1 Q4 on port 30000, providing an interface for OpenAI-compatible APIs.
Set up Aider#
Install Aider on your Linux edge machine. (If you only need to do a quick test, you can also run it directly on the server that hosts the inference service, allowing Aider to call the localhost API.)
```shell
python -m pip install aider-install
aider-install
```
Set the OpenAI-compatible API base URL and API key.
```shell
export OPENAI_API_BASE=http://<Your-Server-IP>:30000/v1
# export OPENAI_API_BASE=http://0.0.0.0:30000/v1 if you are testing Aider on the server machine

export OPENAI_API_KEY=<Your-OpenAI-API-Key>
# set to a random value if you didn't deploy an auth service
```
You’re now ready to start the coding agent workflow.
Example: Building a Minesweeper game using Aider#
Start Aider#
Create a minesweeper directory and start Aider:
```shell
mkdir minesweeper && cd minesweeper

aider --architect --model openai/deepseek-ai/DeepSeek-V3.1 \
  --no-show-model-warnings --cache-prompts
```
Now you should have Aider running in your minesweeper code repo.
Here's the list of commands provided by Aider:
Add context#

Build game through prompts#

Here’s the first version the agent generated
Replace emoji on reset buttons with text
Here’s the updated app
Add animation when bombs are revealed
Here’s the updated app
Set color theme to green, and change to red when bombs are revealed
Here’s the updated app
Summary#
AMD GPUs offer significant advantages for hosting agentic workflows, delivering high-performance, efficient inferencing for AI-driven tasks. In this guide, we showcased a practical use case: running a coding agent workflow using the coding assistant tool, Aider. However, the workflow is not limited to a single tool—developers can leverage a variety of coding agents, and different inferencing frameworks powered by AMD GPUs, to efficiently run large language models and accelerate AI-assisted coding tasks. This approach highlights the flexibility and performance benefits of AMD hardware for modern AI workflows.
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.