Use vLLM on GKE to run inference with Qwen3
This tutorial shows you how to deploy and serve a Qwen3 large language model (LLM) with the vLLM serving framework. You deploy the model on a single A4 virtual machine (VM) instance on Google Kubernetes Engine (GKE).
Important: To complete this tutorial, you must have reserved the capacity to create an A4 VM. To learn more about your options for reserving capacity in AI Hypercomputer for a future date and time, see Choose a consumption option.
This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to handle inference workloads.
Objectives
Access Qwen3 by using Hugging Face.
Prepare your environment.
Create a GKE cluster in Autopilot mode.
Create a Kubernetes secret for Hugging Face credentials.
Deploy a vLLM container to your GKE cluster.
Interact with Qwen3 by using curl.
Clean up.
Costs
This tutorial uses billable components of Google Cloud, including:
To generate a cost estimate based on your projected usage, use the Pricing Calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
    gcloud init
Create or select a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Create a Google Cloud project:
    gcloud projects create PROJECT_ID
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
Replace PROJECT_ID with your Google Cloud project name.
Verify that billing is enabled for your Google Cloud project.
Enable the required API:
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
    gcloud services enable container.googleapis.com
Grant roles to your user account. Run the following command once for each of the following IAM roles:
- roles/container.admin
    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
Replace the following:
- PROJECT_ID: Your project ID.
- USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
- ROLE: The IAM role that you grant to your user account.
- Sign in to or create a Hugging Face account.
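The role-granting step above asks you to run the same command once per role. As a minimal sketch, the loop below wraps that command in a shell function. The function name grant_roles is hypothetical and not part of the gcloud CLI; the gcloud command itself is the one shown in this section.

```shell
# Hypothetical helper: run the add-iam-policy-binding command from this
# section once for every role passed after the project and user arguments.
grant_roles() {
  project="$1"
  member="$2"
  shift 2
  for role in "$@"; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="user:$member" \
      --role="$role" || return 1
  done
}

# Example call (not executed here):
#   grant_roles PROJECT_ID myemail@example.com roles/container.admin
```

The function stops at the first failed binding so that a permission problem is visible immediately rather than buried in later output.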
Access Qwen3 by using Hugging Face
To use Hugging Face to access Qwen3, follow these steps:
- Sign in to Hugging Face
- Create a Hugging Face read access token. Click Your Profile > Settings > Access Tokens > +Create new token.
- Specify a name of your choice for the token and then select a role. The minimum role permission level that you can select for this tutorial is Read.
- Select Create token.
- Copy and save the generated token to your clipboard. You use it later in this tutorial.
Prepare your environment
To prepare your environment, set the default environment variables:
    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export RESERVATION_URL=RESERVATION_URL
    export REGION=REGION
    export CLUSTER_NAME=CLUSTER_NAME
    export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
    export NETWORK=NETWORK_NAME
    export SUBNETWORK=SUBNETWORK_NAME
Replace the following:
- PROJECT_ID: the ID of the Google Cloud project where you want to create the GKE cluster.
- RESERVATION_URL: the URL of the reservation that you want to use to create your GKE cluster. Based on the project in which the reservation exists, specify one of the following values:
  - The reservation exists in your project: RESERVATION_NAME
  - The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME
- REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.
- CLUSTER_NAME: the name of the GKE cluster to create.
- HUGGING_FACE_TOKEN: the Hugging Face access token that you created in the previous section.
- NETWORK_NAME: the network that the GKE cluster uses. Specify one of the following values:
  - If you created a custom network, then specify the name of your network.
  - Otherwise, specify default.
- SUBNETWORK_NAME: the subnetwork that the GKE cluster uses. Specify one of the following values:
  - If you created a custom subnetwork, then specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.
  - Otherwise, specify default.
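Because every later command in this tutorial reads these variables, it can help to confirm that none of them is empty before continuing. The following is a minimal sketch of such a check. The function name require_vars is hypothetical; only the variable names come from this section.

```shell
# Hypothetical helper: report every named environment variable that is
# unset or empty, and return non-zero if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (not executed here):
#   require_vars PROJECT_ID RESERVATION_URL REGION CLUSTER_NAME \
#     HUGGING_FACE_TOKEN NETWORK SUBNETWORK
```

The check only detects empty values. It does not detect a placeholder such as REGION that was exported without being replaced.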
Create a GKE cluster in Autopilot mode
To create a GKE cluster in Autopilot mode, run the following command:
    gcloud container clusters create-auto $CLUSTER_NAME \
        --project=$PROJECT_ID \
        --region=$REGION \
        --release-channel=rapid \
        --network=$NETWORK \
        --subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to Kubernetes clusters in the Google Cloud console.
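As an alternative to checking the console, the cluster status can be polled from the command line. The sketch below assumes the environment variables set earlier in this tutorial and relies on gcloud container clusters describe reporting a status of RUNNING once creation has finished. The function name wait_for_cluster and the POLL_SECONDS variable are hypothetical.

```shell
# Hypothetical helper: poll the cluster status until it is RUNNING, or
# give up after the given number of attempts (default 60).
wait_for_cluster() {
  attempts="${1:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(gcloud container clusters describe "$CLUSTER_NAME" \
      --project="$PROJECT_ID" \
      --region="$REGION" \
      --format='value(status)')
    echo "cluster status: $status"
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
    i=$((i + 1))
    # Pause between polls; POLL_SECONDS is an assumed tuning knob.
    sleep "${POLL_SECONDS:-30}"
  done
  return 1
}

# Example call (not executed here):
#   wait_for_cluster 40
```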
Create a Kubernetes secret for Hugging Face credentials
To create a Kubernetes secret for Hugging Face credentials, follow these steps:
Configure kubectl to communicate with your GKE cluster:
    gcloud container clusters get-credentials $CLUSTER_NAME \
        --location=$REGION
Create a Kubernetes secret to store your Hugging Face token:
    kubectl create secret generic hf-secret \
        --from-literal=hf_token=${HUGGING_FACE_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
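Kubernetes stores secret values base64-encoded, so one way to confirm that the secret holds the expected token is to decode it and compare it with the HUGGING_FACE_TOKEN variable without printing either value. The following is a minimal sketch. The function name hf_secret_matches is hypothetical, and the base64 -d flag is assumed to be available, as it is in GNU coreutils.

```shell
# Hypothetical helper: succeed only if the hf_token key stored in the
# hf-secret secret decodes to the value of $HUGGING_FACE_TOKEN.
hf_secret_matches() {
  stored=$(kubectl get secret hf-secret \
    -o jsonpath='{.data.hf_token}' | base64 -d)
  [ -n "$stored" ] && [ "$stored" = "$HUGGING_FACE_TOKEN" ]
}

# Example call (not executed here):
#   hf_secret_matches && echo "secret matches" || echo "secret differs"
```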
Deploy a vLLM container to your GKE cluster
To deploy the vLLM container to serve the Qwen3 model by using Kubernetes Deployments, do the following:
Create a qwen3-235b-deploy.yaml file with your chosen vLLM deployment:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-qwen3-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: qwen3-server
      template:
        metadata:
          labels:
            app: qwen3-server
            ai.gke.io/model: Qwen3-235B-A22B-Instruct-2507
            ai.gke.io/inference-server: vllm
        spec:
          containers:
          - name: qwen-inference-server
            image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
            resources:
              requests:
                cpu: "10"
                memory: "1000Gi"
                ephemeral-storage: "500Gi"
                nvidia.com/gpu: "8"
              limits:
                cpu: "10"
                memory: "1000Gi"
                ephemeral-storage: "500Gi"
                nvidia.com/gpu: "8"
            command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
            args:
            - --model=$(MODEL_ID)
            - --tensor-parallel-size=8
            - --host=0.0.0.0
            - --port=8000
            - --max-model-len=8192
            - --max-num-seqs=4
            - --dtype=bfloat16
            env:
            - name: MODEL_ID
              value: "Qwen/Qwen3-235B-A22B-Instruct-2507"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_token
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 1320
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 1320
              periodSeconds: 5
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-b200
            cloud.google.com/reservation-name: RESERVATION_URL
            cloud.google.com/reservation-affinity: "specific"
            cloud.google.com/gke-gpu-driver-version: latest
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: qwen3-service
    spec:
      selector:
        app: qwen3-server
      type: ClusterIP
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
    ---
    apiVersion: monitoring.googleapis.com/v1
    kind: PodMonitoring
    metadata:
      name: vllm-qwen3-monitoring
    spec:
      selector:
        matchLabels:
          app: qwen3-server
      endpoints:
      - port: 8000
        path: /metrics
        interval: 30s
Apply the qwen3-235b-deploy.yaml file to your GKE cluster:
    kubectl apply -f qwen3-235b-deploy.yaml
During the deployment process, the container must download the Qwen3-235B-A22B-Instruct-2507 model from Hugging Face. For this reason, deployment of the container might take up to 30 minutes to complete.
To see the completion status, run the following command:
    kubectl wait \
        --for=condition=Available \
        --timeout=1500s deployment/vllm-qwen3-deployment
The --timeout=1500s flag allows the command to monitor the deployment for up to 25 minutes.
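The timing values in the manifest and in the wait command are given in seconds, which makes their relationship easy to miss. The short sketch below converts the probe delay (initialDelaySeconds: 1320) and the wait timeout (1500s) to minutes and prints the margin between them. The helper name to_minutes and the two variable names are hypothetical; the numbers are taken from this section.

```shell
# Hypothetical helper: convert whole seconds to whole minutes.
to_minutes() {
  echo $(( $1 / 60 ))
}

INITIAL_DELAY_SECONDS=1320   # initialDelaySeconds of both probes
WAIT_TIMEOUT_SECONDS=1500    # --timeout value passed to kubectl wait

echo "probes start after $(to_minutes "$INITIAL_DELAY_SECONDS") minutes"
echo "kubectl wait gives up after $(to_minutes "$WAIT_TIMEOUT_SECONDS") minutes"
echo "margin: $(( WAIT_TIMEOUT_SECONDS - INITIAL_DELAY_SECONDS )) seconds"
```

With these values the readiness probe first runs at 22 minutes and the wait command stops at 25 minutes, which leaves a margin of 180 seconds. If the command times out before the deployment becomes available, you can run it again.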
Interact with Qwen3 by using curl
To verify the Qwen3 model that you deployed, do the following:
Set up port forwarding to Qwen3:
    kubectl port-forward service/qwen3-service 8000:8000
Open a new terminal window. You can then chat with your model by using curl:
    curl http://127.0.0.1:8000/v1/chat/completions \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{
          "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
          "messages": [
            {
              "role": "user",
              "content": "Describe a GPU in one short sentence?"
            }
          ]
        }'
The output is similar to the following:
{"id":"chatcmpl-a926ddf7ef2745ca832bda096e867764","object":"chat.completion","created":1755023619,"model":"Qwen/Qwen3-235B-A22B-Instruct-2507","choices":[{"index":0,"message":{"role":"assistant","content":"A GPU is a specialized electronic circuit designed to rapidly process and render graphics and perform parallel computations.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":36,"completion_tokens":20,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
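To send different prompts without editing the JSON body by hand, the request can be wrapped in two small shell functions. This is a minimal sketch that assumes the port forwarding above is still running. The function names chat_payload and ask_qwen are hypothetical, and the escaping covers only backslashes and double quotes.

```shell
# Hypothetical helper: build a chat-completions request body for a single
# user message. Only backslashes and double quotes are escaped; prompts
# with newlines or control characters need a real JSON encoder.
chat_payload() {
  payload_model="$1"
  payload_prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$payload_model" "$payload_prompt"
}

# Hypothetical helper: send one prompt to the forwarded service endpoint.
ask_qwen() {
  curl -s http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "Qwen/Qwen3-235B-A22B-Instruct-2507" "$1")"
}

# Example call (not executed here):
#   ask_qwen 'Describe a GPU in one short sentence?'
```

The reply text is in the choices[0].message.content field of the response, as the sample output above shows.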
Observe model performance
If you want to observe your model's performance, then you can use the vLLM dashboard integration in Cloud Monitoring. This dashboard helps you view critical performance metrics for your model like token throughput, network latency, and error rates. For more information, see vLLM in the Monitoring documentation.
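For a quick look at the raw metrics before opening the dashboard, the sketch below reads the /metrics endpoint on port 8000, which is the path that the PodMonitoring resource in the deployment file scrapes. It assumes that the port forwarding from the previous section is still running, and it assumes that vLLM metric names start with the prefix vllm:, which might differ between vLLM versions. The function name vllm_metrics is hypothetical.

```shell
# Hypothetical helper: print only the vLLM metric samples from the
# Prometheus-format /metrics endpoint on the forwarded port.
# Assumption: vLLM metric names begin with "vllm:".
vllm_metrics() {
  base_url="${1:-http://127.0.0.1:8000}"
  curl -s "$base_url/metrics" | grep '^vllm:'
}

# Example call (not executed here):
#   vllm_metrics | head
```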
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete your project
Caution: Deleting a project has the following effects:
- Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
- Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.
Delete a Google Cloud project:
    gcloud projects delete PROJECT_ID
Delete your GKE cluster
To delete your GKE cluster, run the following command:
    gcloud container clusters delete $CLUSTER_NAME \
        --region=$REGION
Delete the resources
To delete the resources defined in the qwen3-235b-deploy.yaml file and the Kubernetes secret from the GKE cluster, run the following commands:
    kubectl delete -f qwen3-235b-deploy.yaml
    kubectl delete secret hf-secret
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.