New component in NVIDIA Dynamo enables efficient scaling of distributed inference

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components: prefill, decode, vision encoders, key-value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval, or multimodal tasks.
This shift has changed the scaling and orchestration problem from “run N replicas of a pod” to “coordinate a group of components as one logical system.” Managing such a system requires scaling and scheduling the right pods together, understanding that each component has distinct configuration and resource needs, starting them in a deliberate order, and placing them in the cluster with network topology in mind. Ultimately, the goal is to orchestrate a system and scale components with awareness of their dependencies as a whole, rather than one pod at a time.
To address these challenges, today we are announcing that NVIDIA Grove, a Kubernetes API for running modern ML inference workloads, is now available within NVIDIA Dynamo as a modular component. Grove is fully open source and available on the ai-dynamo/grove GitHub repo.
Grove enables you to scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. With Grove, you can describe your whole inference serving system in Kubernetes (for example, prefill, decode, routing, or any other component) as a single Custom Resource (CR).
From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multilevel autoscaling, and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.
Originally motivated by the challenges of orchestrating multinode, disaggregated inference systems, Grove is flexible enough to map naturally to any real-world inference architecture—from traditional single-node aggregated inference to agentic pipelines with multiple models. Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.
The requirements of multinode disaggregated serving are detailed below.
Modern inference systems need autoscaling at multiple levels: individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and entire service replicas for overall capacity. These levels affect one another: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. Traditional pod-level autoscaling can’t handle these interdependencies.
Recovery and updates must operate on complete service instances, not individual Kubernetes pods. A failed prefill worker needs to properly reconnect to its leader after a restart, and rolling updates must preserve network topology to maintain low latency. The platform must treat multicomponent systems as single operational units, optimized for both performance and availability.
The AI workload scheduler should support flexible gang scheduling that goes beyond traditional all-or-nothing placement. Disaggregated serving creates a new challenge: the system must guarantee essential component combinations (at least one prefill and one decode worker, for example) while allowing each component type to scale independently, because prefill and decode scale at different ratios depending on workload patterns.
Traditional gang scheduling prevents this independent scaling by forcing everything into groups that must scale together. The system needs policies that enforce minimum viable component combinations while enabling flexible scaling.
Component placement affects performance. On systems like the NVIDIA GB200 NVL72, scheduling related prefill and decode pods on the same NVIDIA NVLink domain optimizes KV-cache transfer latency. The scheduler must understand physical network topology, placing related components near each other while spreading replicas for availability.
Components have different jobs, configurations, and startup requirements. For example, prefill and decode leaders execute specialized startup logic that workers do not, and workers can't start before their leaders are ready. The platform needs role-specific configuration and dependency enforcement for reliable initialization.
Taken together, the bigger picture is this: inference teams need an easy, declarative way to describe their system as it is actually operated (multiple roles, multiple nodes, clear multilevel dependencies) and have the platform schedule, scale, heal, and update to that description.
High-performance inference frameworks use Grove hierarchical APIs to express role-specific logic and multilevel scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multicomponent AI workloads using three hierarchical custom resources in its Workload API: PodCliqueSet, PodClique, and PodCliqueScalingGroup.
For the example shown in Figure 1, PodClique A represents a frontend component, B and C represent prefill-leader and prefill-worker, and D and E represent decode-leader and decode-worker.
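To make the hierarchy concrete, the sketch below shows roughly how such a system could be expressed as a single PodCliqueSet. The resource kinds and the grove.io API group come from this post, but the field layout (for example, cliques, scalingGroups, and startsAfter) is an illustrative assumption rather than the exact Grove schema; refer to the ai-dynamo/grove GitHub repo for authoritative examples.

# Illustrative sketch only: field names below are assumptions, not the exact Grove schema
apiVersion: grove.io/v1alpha1            # group from the grove.io CRDs; version is assumed
kind: PodCliqueSet
metadata:
  name: inference-system
spec:
  replicas: 1                            # replicas of the whole system, cliques A-E together
  template:
    cliques:
      - name: frontend                   # PodClique A
        replicas: 1
      - name: prefill-leader             # PodClique B
        replicas: 1
      - name: prefill-worker             # PodClique C
        replicas: 2
        startsAfter: [prefill-leader]    # assumed syntax for explicit startup ordering
      - name: decode-leader              # PodClique D
        replicas: 1
      - name: decode-worker              # PodClique E
        replicas: 2
        startsAfter: [decode-leader]
    scalingGroups:                       # assumed: PodCliqueScalingGroups scale leader + workers as a unit
      - name: prefill
        cliques: [prefill-leader, prefill-worker]
      - name: decode
        cliques: [decode-leader, decode-worker]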


A Grove-enabled Kubernetes cluster brings two key components together: the Grove operator and a scheduler capable of understanding PodGang resources, such as the KAI Scheduler, an open source subcomponent of the NVIDIA Run:ai platform.
When a PodCliqueSet resource is created, the Grove operator validates the specification and automatically generates the underlying Kubernetes objects required to realize it. This includes the constituent PodCliques, PodCliqueScalingGroups, and the associated pods, services, secrets, and autoscaling policies. As part of this process, Grove also creates PodGang resources, part of the Grove Scheduler API, which translate workload definitions into concrete scheduling constraints for the cluster's scheduler.
Each PodGang encapsulates detailed requirements for its workload, including minimum replica guarantees, network topology preferences to optimize inter-component bandwidth, and spread constraints to maintain availability. Together, these ensure topology-aware placement and efficient resource utilization across the cluster.
The scheduler continuously watches for PodGang resources and applies gang-scheduling logic, ensuring that all required components are scheduled together, or not at all, until sufficient resources are available. Placement decisions are made with GPU topology awareness and cluster locality in mind.
The result is a coordinated deployment of a multicomponent AI system, where prefill services, decode workers, and routing components start in the correct order, are placed close together in the network for performance, and recover cohesively as a group. This prevents resource fragmentation, avoids partial deployments, and enables stable, efficient operation of complex model-serving pipelines at scale.
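On a cluster where Grove is installed, you can watch this machinery at work by querying the generated PodGang objects directly; the resource name below matches the podgangs.scheduler.grove.io CRD verified later in this walkthrough.

# List the PodGangs Grove generated for the scheduler
kubectl get podgangs.scheduler.grove.io -A

# Inspect the scheduling constraints of one PodGang (substitute your own name and namespace)
kubectl describe podgangs.scheduler.grove.io <podgang-name> -n <namespace>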
This section walks you through deploying a disaggregated serving architecture with KV routing using Dynamo and Grove. The setup uses the Qwen3 0.6B model and demonstrates Grove's ability to manage distributed inference workloads with separate prefill and decode workers.
Note: This is a foundational example designed to help you understand the core concepts. For more complex deployments, refer to the ai-dynamo/grove GitHub repo.
First, ensure that you have the following components ready in your Kubernetes cluster:
- kubectl configured to access your cluster
- A Hugging Face token secret (hf-token-secret), which can be created with the following command:

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<insert_huggingface_token>
Note: In the code, replace <insert_huggingface_token> with your actual Hugging Face token. Keep this token secure and never commit it to source control.
Next, create the namespace for the deployment:

kubectl create namespace vllm-v1-disagg-router
# 1. Set environment
export NAMESPACE=vllm-v1-disagg-router
export RELEASE_VERSION=0.5.1

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Dynamo Operator + Grove
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace --set "grove.enabled=true"

# 4. Verify that the Grove CRDs are installed
kubectl get crd | grep grove
Expected output:
podcliques.grove.io
podcliquescalinggroups.grove.io
podcliquesets.grove.io
podgangs.scheduler.grove.io
podgangsets.grove.io
Create a DynamoGraphDeployment manifest, saved as dynamo-grove.yaml, that defines a disaggregated serving architecture with one frontend, two decode workers, and one prefill worker:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-grove
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
Apply the manifest:

kubectl apply -f dynamo-grove.yaml -n ${NAMESPACE}

Verify that the operator and Grove pods were created:
kubectl get pods -n ${NAMESPACE}

Expected output:
NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-grove-0-frontend-w2xxl                                     1/1     Running   0          10m
dynamo-grove-0-vllmdecodeworker-57ghl                             1/1     Running   0          10m
dynamo-grove-0-vllmdecodeworker-drgv4                             1/1     Running   0          10m
dynamo-grove-0-vllmprefillworker-27hhn                            1/1     Running   0          10m
dynamo-platform-dynamo-operator-controller-manager-7774744kckrr   2/2     Running   0          10m
dynamo-platform-etcd-0                                            1/1     Running   0          10m
dynamo-platform-nats-0                                            2/2     Running   0          10m
To test the deployment, first port-forward the frontend service:
kubectl port-forward svc/dynamo-grove-frontend 8000:8000 -n ${NAMESPACE}

Then test the endpoint:
curl http://localhost:8000/v1/models
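Because the Dynamo frontend serves an OpenAI-compatible API, you can also send a test completion request through the KV router. The request below is a sketch that assumes the standard /v1/chat/completions route is enabled:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is disaggregated serving?"}],
        "max_tokens": 64
      }'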
Optionally, you can inspect the PodClique resource to see how Grove groups pods together, including replica counts:
kubectl get podclique dynamo-grove-0-vllmdecodeworker -n vllm-v1-disagg-router -o yaml
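You can also list every Grove-managed resource behind the deployment in one command, using the full CRD names verified earlier; this is a convenient way to see the PodCliqueSet, its PodCliques, scaling groups, and the generated PodGangs side by side:

kubectl get podcliquesets.grove.io,podcliques.grove.io,podcliquescalinggroups.grove.io,podgangs.scheduler.grove.io -n ${NAMESPACE}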
NVIDIA Grove is fully open source and available on the ai-dynamo/grove GitHub repo. We invite you to try Grove in your own Kubernetes environments: with Dynamo, as a standalone component, or alongside other high-performance AI inference engines.
Explore the Grove Deployment Guide and ask questions on GitHub or Discord. To see Grove in action, visit NVIDIA Booth #753 at KubeCon 2025 in Atlanta. We welcome contributions, pull requests, and feedback from the community.
The NVIDIA Grove project acknowledges the valuable contributions of all open source developers, testers, and community members who have participated in its evolution, with special thanks to SAP (Madhav Bhargava, Saketh Kalaga, Frank Heine) for their exceptional contributions and support. Open source thrives on collaboration—thank you for being part of Grove.









