AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1

As organizations scale their AI inference workloads, they face the challenge of efficiently deploying and managing large language models across GPU infrastructure. This three-part blog series provides a production-ready foundation for orchestrating AI inference workloads on the AMD Instinct platform with Kubernetes.
In this post, we’ll establish the essential infrastructure by setting up a Kubernetes cluster using MicroK8s, configuring Helm for streamlined deployments, implementing persistent storage for model caching, and installing the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes.
Part 2 will focus on deploying and scaling the vLLM inference engine, implementing MetalLB for load balancing, and optimizing multi-GPU deployments to maximize the performance of AMD Instinct accelerators.
The series concludes in Part 3 with implementing Prometheus for metrics collection, Grafana for performance visualization, and Open WebUI for interacting with models deployed with vLLM.
Let’s begin with Part 1, where we’ll build the foundational components needed for a production-ready AI inference platform.
MI300X Test System Specifications
| Component | Specification |
| --- | --- |
| Node Type | Supermicro AS-8125GS-TNMR2 |
| CPU | 2x AMD EPYC 9654 96-Core Processor |
| Memory | 24x 96GB DDR5-4800 dual rank |
| GPU | 8x MI300X |
| OS | Ubuntu 22.04.4 LTS |
| Kernel | 6.5.0-45-generic |
| ROCm | 6.2.0 |
| AMD GPU driver | 6.7.0-1787201.22.04 |
Install Kubernetes (microk8s)
MicroK8s is a lightweight but powerful Kubernetes distribution that can run on as little as a single node and scale up to mid-sized clusters. Here we use snap to install MicroK8s on the host node.
```shell
sudo apt update
sudo apt install snapd
sudo snap install microk8s --classic
```
Add the current user to the `microk8s` group and create the `.kube` directory in your home directory to store the MicroK8s config file.
```shell
sudo usermod -a -G microk8s $USER
mkdir -p ~/.kube
chmod 0700 ~/.kube
```
Register the `microk8s` group you just added in your current shell:

```shell
newgrp microk8s
```
Note

For convenience, you can add the following alias to your `.bashrc` to use the native `kubectl` command with MicroK8s:

```shell
echo "alias kubectl='microk8s kubectl'" >> ~/.bashrc; source ~/.bashrc
```
Let’s confirm our cluster is up and running.
```shell
kubectl get nodes
```

We should see the STATUS as Ready:

```
NAME              STATUS   ROLES    AGE   VERSION
mi300x-server01   Ready    <none>   5m    v1.31.3
```
Since this is a single-node instance of Kubernetes, we will need to label our node as a control-plane node in order for the AMD GPU Operator to successfully find the node.
Note
For vanilla Kubernetes installations this is not required, but if you're running Kubernetes on a single master node you will need to remove the taint to be able to schedule jobs on this node.
We can do this in one line:

```shell
kubectl label node $(kubectl get nodes --no-headers | grep "<none>" | awk '{print $1}') node-role.kubernetes.io/control-plane=''
```
Let’s confirm the node has been labeled correctly
```shell
kubectl get nodes
```

We should see the ROLES as control-plane:

```
NAME              STATUS   ROLES           AGE   VERSION
mi300x-server01   Ready    control-plane   8m    v1.31.3
```
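For the single-node vanilla Kubernetes case mentioned in the note above, the equivalent step is removing the control-plane taint rather than adding a label. A minimal sketch, using the standard kubeadm-style taint name (the trailing `-` removes the taint; the `|| true` guard makes this a no-op on clusters where the taint is already absent):

```shell
# Single-node vanilla Kubernetes only (not needed for MicroK8s):
# remove the control-plane taint so regular pods can be scheduled on
# the master node. The trailing '-' deletes the taint.
kubectl taint nodes --all node-role.kubernetes.io/control-plane- 2>/dev/null || true
```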
Install Helm
The AMD GPU Operator, Grafana, and Prometheus installations are facilitated by Helm charts. To install the latest version of Helm, we download the install script from the Helm repository and run it:
```shell
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```

If you are using MicroK8s as your Kubernetes instance, you also need to point the `KUBECONFIG` environment variable at the MicroK8s config file so Helm can find the cluster. This is accomplished by exporting the environment variable in the `.bashrc` of the current user:

```shell
echo "export KUBECONFIG=/var/snap/microk8s/current/credentials/client.config" >> ~/.bashrc; source ~/.bashrc
```
Note
For vanilla Kubernetes installations this is not required unless you also wish to change the default location, `$HOME/.kube/config`.
Persistent Storage
Next, we enable persistent storage so that we don’t have to keep downloading the model when starting or scaling up an instance of the vLLM inference server. Here are methods for both microk8s and vanilla Kubernetes.
MicroK8s
Enable storage on microk8s with:
```shell
microk8s enable storage
```

Create a persistent volume claim (PVC) using `vllm-pvc.yaml`:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim   # Defines a PersistentVolumeClaim (PVC) resource
metadata:
  name: llama-3.2-1b          # Name of the PVC, used for reference in deployments
  namespace: default          # Kubernetes namespace where this PVC resides
spec:
  accessModes:
    - ReadWriteOnce           # Volume can be mounted as read-write by a single node
  resources:
    requests:
      storage: 50Gi           # Amount of storage requested for this volume
  storageClassName: microk8s-hostpath  # Storage class to use (MicroK8s hostPath)
  volumeMode: Filesystem      # The volume will be formatted as a filesystem
```
Apply the persistent volume claim in the console with:
```shell
kubectl apply -f vllm-pvc.yaml
```
Note
For MicroK8s, the hostpath provisioner will store all volume data in `/var/snap/microk8s/common/default-storage` on the host machine.
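Before creating the PVC, it can be worth checking that this location has enough free space for the 50Gi claim. A quick sketch, which falls back to the root filesystem on machines where the MicroK8s path does not exist:

```shell
# Show free space at the MicroK8s hostpath storage location; if the path
# does not exist on this machine, fall back to the root filesystem.
STORAGE_DIR=/var/snap/microk8s/common/default-storage
df -h "$STORAGE_DIR" 2>/dev/null || df -h /
```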
Vanilla Kubernetes
In environments without dynamic storage provisioning, define a PersistentVolume (PV) that the PVC can bind to with `pv.yaml`:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llama-3.2-1b-pv    # Name of the PersistentVolume
  namespace: default
spec:
  capacity:
    storage: 50Gi          # Amount of storage available in this PV
  accessModes:
    - ReadWriteOnce        # Access mode for the PV
  hostPath:
    path: /mnt/data/llama  # Path on the host where the data is stored
  volumeMode: Filesystem   # The volume will be formatted as a filesystem
```
Define a PersistentVolumeClaim (PVC) with `vllm-pvc.yaml`:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-3.2-1b
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce        # Volume can be mounted as read-write by a single node
  resources:
    requests:
      storage: 50Gi        # Amount of storage requested
  volumeMode: Filesystem   # The volume will be formatted as a filesystem
```
Note
The PVC will bind to the specific PV based on matching `storage` and `accessModes` attributes.
Apply the two manifests:
```shell
kubectl apply -f pv.yaml
kubectl apply -f vllm-pvc.yaml
```
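To sanity-check the claim, a throwaway pod can mount it. The manifest below is a hypothetical example (the pod name, volume name, and mount path are all illustrative); only `claimName` matches the PVC defined above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-smoke-test          # illustrative name
  namespace: default
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sh", "-c", "ls -la /models && sleep 3600"]
      volumeMounts:
        - name: model-store
          mountPath: /models    # files written here land on the persistent volume
  volumes:
    - name: model-store         # illustrative volume name
      persistentVolumeClaim:
        claimName: llama-3.2-1b # the PVC created above
```

After applying it, `kubectl exec` into the pod and write a file under `/models`; it should survive deleting and re-creating the pod.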
Install the AMD GPU Operator
We proceed with installing the AMD GPU Operator. First, we need to install cert-manager as a prerequisite by adding the Jetstack repo to Helm.
```shell
helm repo add jetstack https://charts.jetstack.io
```
Then run the install for cert-manager via Helm:
```shell
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
```
Finally, install the AMD GPU Operator.
```shell
# Add the Helm repository
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the GPU Operator
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace
```
Configure AMD GPU Operator
For our workload we will use a default `device-config.yaml` file to configure the AMD GPU Operator. This sets up the device plugin, node labeller, and metrics exporter to properly assign AMD GPUs to workloads, label the nodes that have AMD GPUs, and export metrics for Grafana and Prometheus. For a full list of configurable options, refer to the Full Reference Config page of the AMD GPU Operator documentation.
```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  # use the namespace where the AMD GPU Operator is running
  namespace: kube-amd-gpu
spec:
  driver:
    # disable the installation of the out-of-tree amdgpu kernel module
    enable: false
  devicePlugin:
    # Specify the device plugin image
    # default value is rocm/k8s-device-plugin:latest
    devicePluginImage: rocm/k8s-device-plugin:latest
    # Specify the node labeller image
    # default value is rocm/k8s-device-plugin:labeller-latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    # Specify to enable/disable the node labeller
    # node labeller is required for adding/removing the blacklist config of the amdgpu kernel module
    # please set to true if you want to blacklist the inbox driver and use the out-of-tree driver
    enableNodeLabeller: true
  metricsExporter:
    # To enable/disable the metrics exporter, disabled by default
    enable: true
    # kubernetes service type for the metrics exporter, ClusterIP (default) or NodePort
    serviceType: "NodePort"
    # internal service port used for in-cluster and node access to pull metrics from the metrics-exporter (default 5000)
    port: 5000
    # node port for the metrics exporter service, metrics endpoint $node-ip:$nodePort
    nodePort: 32500
    # exporter image
    image: "docker.io/rocm/device-metrics-exporter:v1.0.0"
  # Specify the nodes to be managed by this DeviceConfig Custom Resource
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```
We apply the device-config file as:
```shell
kubectl apply -f device-config.yaml
```
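With `serviceType: "NodePort"` and `nodePort: 32500` from the config above, the exporter's metrics become reachable at `$node-ip:32500/metrics`. A small sketch of building that URL (the IP below is a placeholder from the TEST-NET-1 example range; on a live cluster, substitute a real node's InternalIP, e.g. from `kubectl get nodes -o wide`):

```shell
# Build the metrics endpoint URL exposed by the metrics exporter NodePort.
NODE_IP="192.0.2.10"   # placeholder; replace with a real node InternalIP
NODE_PORT=32500        # must match metricsExporter.nodePort in device-config.yaml
echo "Metrics endpoint: http://${NODE_IP}:${NODE_PORT}/metrics"
```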
To confirm the node labeller is working, we run:
```shell
kubectl get nodes -L feature.node.kubernetes.io/amd-gpu
```
and we should see AMD-GPU set as true:
```
NAME              STATUS   ROLES           AGE   VERSION   AMD-GPU
mi300x-server01   Ready    control-plane   15m   v1.31.3   true
```

To show available GPUs for workloads we can use:
```shell
kubectl get nodes -o custom-columns=NAME:.metadata.name,"Total GPUs:.status.capacity.amd\.com/gpu","Allocatable GPUs:.status.allocatable.amd\.com/gpu"
```
Now, we can see the total available GPUs
```
NAME              Total GPUs   Allocatable GPUs
mi300x-server01   8            8
```
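The per-node counts can also be pulled out of `kubectl get nodes -o json` programmatically, e.g. to sum allocatable GPUs across a multi-node cluster. The sketch below runs against a hypothetical saved sample of that JSON so it works offline; on a live cluster you would redirect `kubectl get nodes -o json` into the file instead:

```shell
# Hypothetical sample of `kubectl get nodes -o json`, trimmed to the fields
# we need; on a real cluster: kubectl get nodes -o json > nodes.json
cat > nodes.json <<'EOF'
{"items":[{"metadata":{"name":"mi300x-server01"},
           "status":{"allocatable":{"amd.com/gpu":"8"}}}]}
EOF

# Sum allocatable amd.com/gpu across all nodes in the JSON
python3 - <<'PY'
import json
nodes = json.load(open("nodes.json"))
total = sum(int(n["status"]["allocatable"].get("amd.com/gpu", "0"))
            for n in nodes["items"])
print(f"Total allocatable AMD GPUs: {total}")  # 8 for the sample above
PY
```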
Summary
In this post, we’ve established a solid foundation for AI inference workloads by:
- Setting up a Kubernetes cluster with MicroK8s
- Configuring essential components like Helm
- Implementing persistent storage for model management
- Installing and validating the AMD GPU Operator
The next installment in this series will walk through deploying and scaling vLLM for inference, implementing MetalLB for load balancing, and optimizing multi-GPU deployments on AMD Instinct hardware. Part 3 will round out the series by deploying Open WebUI as a front end and configuring monitoring and management with Prometheus and Grafana. Stay tuned!
Disclaimers
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.