New component in NVIDIA Dynamo enables efficient scaling of distributed inference

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components: prefill, decode, vision encoders, key-value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval, or multimodal tasks.
This shift has changed the scaling and orchestration problem from “run N replicas of a pod” to “coordinate a group of components as one logical system.” Managing such a system requires scaling and scheduling the right pods together, understanding that each component has distinct configuration and resource needs, starting them in a deliberate order, and placing them in the cluster with network topology in mind. Ultimately, the goal is to orchestrate a system and scale components with awareness of their dependencies as a whole, rather than one pod at a time.
To address these challenges, today we are announcing that NVIDIA Grove, a Kubernetes API for running modern ML inference workloads, is now available within NVIDIA Dynamo as a modular component. Grove is fully open source and available on the ai-dynamo/grove GitHub repo.
Grove enables you to scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. With Grove, you can describe your whole inference serving system in Kubernetes (for example, prefill, decode, routing, or any other component) as a single Custom Resource (CR).
From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multilevel autoscaling, and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.
Originally motivated by the challenges of orchestrating multinode, disaggregated inference systems, Grove is flexible enough to map naturally to any real-world inference architecture—from traditional single-node aggregated inference to agentic pipelines with multiple models. Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.
The requirements of multinode disaggregated serving are detailed below.
Modern inference systems need autoscaling at multiple levels: individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and entire service replicas for overall capacity. These levels affect one another: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. Traditional pod-level autoscaling can’t handle these interdependencies.
Recovery and updates must operate on complete service instances, not individual Kubernetes pods. A failed prefill worker needs to properly reconnect to its leader after a restart, and rolling updates must preserve network topology to maintain low latency. The platform must treat multicomponent systems as single operational units, optimized for both performance and availability.
The AI workload scheduler should support flexible gang scheduling that goes beyond traditional all-or-nothing placement. Disaggregated serving creates a new challenge: the system must guarantee essential component combinations (at least one prefill and one decode worker, for example) while allowing each component type to scale independently, because prefill and decode scale at different ratios depending on workload patterns.
Traditional gang scheduling prevents this independent scaling by forcing everything into groups that must scale together. The system needs policies that enforce minimum viable component combinations while enabling flexible scaling.
Component placement affects performance. On systems like the NVIDIA GB200 NVL72, scheduling related prefill and decode pods on the same NVIDIA NVLink domain optimizes KV-cache transfer latency. The scheduler must understand physical network topology, placing related components near each other while spreading replicas for availability.
Components have different jobs, configurations, and startup requirements. For example, prefill and decode leaders execute specialized startup logic that workers do not, and workers can't start before their leaders are ready. The platform needs role-specific configuration and dependency enforcement for reliable initialization.
Taken together, the bigger picture is this: inference teams need an easy, declarative way to describe their system as it is actually operated (multiple roles, multiple nodes, clear multilevel dependencies) and have the platform schedule, scale, heal, and update to that description.
High-performance inference frameworks use Grove hierarchical APIs to express role-specific logic and multilevel scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multicomponent AI workloads using three hierarchical custom resources in its Workload API: PodCliqueSet, PodClique, and PodCliqueScalingGroup.
For the example shown in Figure 1, PodClique A represents a frontend component, B and C represent prefill-leader and prefill-worker, and D and E represent decode-leader and decode-worker.
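To make the hierarchy concrete, the sketch below shows roughly how such a system could be expressed as a single PodCliqueSet. The resource kinds and the grove.io API group come from this post, but the field layout (for example, cliques, scalingGroups, and startsAfter) is an illustrative assumption rather than the exact Grove schema; refer to the ai-dynamo/grove GitHub repo for authoritative examples.

# Illustrative sketch only: field names below are assumptions, not the exact Grove schema
apiVersion: grove.io/v1alpha1            # group from the grove.io CRDs; version is assumed
kind: PodCliqueSet
metadata:
  name: inference-system
spec:
  replicas: 1                            # replicas of the whole system, cliques A-E together
  template:
    cliques:
      - name: frontend                   # PodClique A
        replicas: 1
      - name: prefill-leader             # PodClique B
        replicas: 1
      - name: prefill-worker             # PodClique C
        replicas: 2
        startsAfter: [prefill-leader]    # assumed syntax for explicit startup ordering
      - name: decode-leader              # PodClique D
        replicas: 1
      - name: decode-worker              # PodClique E
        replicas: 2
        startsAfter: [decode-leader]
    scalingGroups:                       # assumed: PodCliqueScalingGroups scale leader + workers as a unit
      - name: prefill
        cliques: [prefill-leader, prefill-worker]
      - name: decode
        cliques: [decode-leader, decode-worker]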


A Grove-enabled Kubernetes cluster brings two key components together: the Grove operator and a scheduler capable of understanding PodGang resources, such as the KAI Scheduler, an open source subcomponent of the NVIDIA Run:ai platform.
When a PodCliqueSet resource is created, the Grove operator validates the specification and automatically generates the underlying Kubernetes objects required to realize it. This includes the constituent PodCliques, PodCliqueScalingGroups, and the associated pods, services, secrets, and autoscaling policies. As part of this process, Grove also creates PodGang resources, part of the Grove Scheduler API, which translate workload definitions into concrete scheduling constraints for the cluster's scheduler.
Each PodGang encapsulates detailed requirements for its workload, including minimum replica guarantees, network topology preferences to optimize inter-component bandwidth, and spread constraints to maintain availability. Together, these ensure topology-aware placement and efficient resource utilization across the cluster.
The scheduler continuously watches for PodGang resources and applies gang-scheduling logic, ensuring that all required components are scheduled together, or not at all, until sufficient resources are available. Placement decisions are made with GPU topology awareness and cluster locality in mind.
The result is a coordinated deployment of a multicomponent AI system, where prefill services, decode workers, and routing components start in the correct order, are placed close together in the network for performance, and recover cohesively as a group. This prevents resource fragmentation, avoids partial deployments, and enables stable, efficient operation of complex model-serving pipelines at scale.
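On a cluster where Grove is installed, you can watch this machinery at work by querying the generated PodGang objects directly; the resource name below matches the podgangs.scheduler.grove.io CRD verified later in this walkthrough.

# List the PodGangs Grove generated for the scheduler
kubectl get podgangs.scheduler.grove.io -A

# Inspect the scheduling constraints of one PodGang (substitute your own name and namespace)
kubectl describe podgangs.scheduler.grove.io <podgang-name> -n <namespace>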
This section walks you through deploying a disaggregated serving architecture with KV routing using Dynamo and Grove. The setup uses the Qwen3 0.6B model and demonstrates Grove's ability to manage distributed inference workloads with separate prefill and decode workers.
Note: This is a foundational example designed to help you understand the core concepts. For more complex deployments, refer to the ai-dynamo/grove GitHub repo.
First, ensure that you have the following components ready in your Kubernetes cluster:
- kubectl configured to access your cluster
- A Hugging Face token secret (hf-token-secret), which can be created with the following command:

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<insert_huggingface_token>
Note: In the code, replace <insert_huggingface_token> with your actual Hugging Face token. Keep this token secure and never commit it to source control.
Next, create the namespace for the deployment:

kubectl create namespace vllm-v1-disagg-router
# 1. Set environment
export NAMESPACE=vllm-v1-disagg-router
export RELEASE_VERSION=0.5.1

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Dynamo Operator + Grove
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace --set "grove.enabled=true"

# 4. Verify that the Grove CRDs are installed
kubectl get crd | grep grove
Expected output:
podcliques.grove.io
podcliquescalinggroups.grove.io
podcliquesets.grove.io
podgangs.scheduler.grove.io
podgangsets.grove.io
Create a DynamoGraphDeployment manifest, saved as dynamo-grove.yaml, that defines a disaggregated serving architecture with one frontend, two decode workers, and one prefill worker:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-grove
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
Apply the manifest:

kubectl apply -f dynamo-grove.yaml -n ${NAMESPACE}

Verify that the operator and Grove pods were created:
kubectl get pods -n ${NAMESPACE}

Expected output:
NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-grove-0-frontend-w2xxl                                     1/1     Running   0          10m
dynamo-grove-0-vllmdecodeworker-57ghl                             1/1     Running   0          10m
dynamo-grove-0-vllmdecodeworker-drgv4                             1/1     Running   0          10m
dynamo-grove-0-vllmprefillworker-27hhn                            1/1     Running   0          10m
dynamo-platform-dynamo-operator-controller-manager-7774744kckrr   2/2     Running   0          10m
dynamo-platform-etcd-0                                            1/1     Running   0          10m
dynamo-platform-nats-0                                            2/2     Running   0          10m
To test the deployment, first port-forward the frontend service:
kubectl port-forward svc/dynamo-grove-frontend 8000:8000 -n ${NAMESPACE}

Then test the endpoint:
curl http://localhost:8000/v1/models
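Because the Dynamo frontend serves an OpenAI-compatible API, you can also send a test completion request through the KV router. The request below is a sketch that assumes the standard /v1/chat/completions route is enabled:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is disaggregated serving?"}],
        "max_tokens": 64
      }'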
Optionally, you can inspect the PodClique resource to see how Grove groups pods together, including replica counts:
kubectl get podclique dynamo-grove-0-vllmdecodeworker -n vllm-v1-disagg-router -o yaml
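You can also list every Grove-managed resource behind the deployment in one command, using the full CRD names verified earlier; this is a convenient way to see the PodCliqueSet, its PodCliques, scaling groups, and the generated PodGangs side by side:

kubectl get podcliquesets.grove.io,podcliques.grove.io,podcliquescalinggroups.grove.io,podgangs.scheduler.grove.io -n ${NAMESPACE}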
NVIDIA Grove is fully open source and available on the ai-dynamo/grove GitHub repo. We invite you to try Grove in your own Kubernetes environments: with Dynamo, as a standalone component, or alongside other high-performance AI inference engines.
Explore the Grove Deployment Guide and ask questions on GitHub or Discord. To see Grove in action, visit NVIDIA Booth #753 at KubeCon 2025 in Atlanta. We welcome contributions, pull requests, and feedback from the community.
The NVIDIA Grove project acknowledges the valuable contributions of all open source developers, testers, and community members who have participated in its evolution, with special thanks to SAP (Madhav Bhargava, Saketh Kalaga, Frank Heine) for their exceptional contributions and support. Open source thrives on collaboration—thank you for being part of Grove.









