Serve Qwen3-8B-Base with vLLM on TPUs
This tutorial shows you how to serve the Qwen/Qwen3-8B-Base model using the vLLM TPU serving framework on a v6e TPU VM.
Objectives
- Set up your environment.
- Run vLLM with Qwen3-8B-Base.
- Send an inference request.
- Run a benchmark workload.
- Clean up.
Costs
This tutorial uses billable components of Google Cloud, including Cloud TPU.
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
Before going through this tutorial, follow the instructions on the Set up the Cloud TPU environment page. The instructions guide you through the steps needed to create a Google Cloud project and configure it to use Cloud TPU. You can also use an existing Google Cloud project. If you choose to do so, skip the step to create a Google Cloud project and start with Set up your environment to use Cloud TPU.
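If you're working from a fresh shell, you can point the gcloud CLI at your project before running the commands below. This is an optional sketch, not a step from the setup page; YOUR_PROJECT_ID is a placeholder:

# Authenticate the gcloud CLI and set the default project.
gcloud auth login
gcloud config set project YOUR_PROJECT_ID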
You need a Hugging Face access token to use this tutorial. You can sign up for a free account at Hugging Face. Once you have an account, generate an access token:
- On the Welcome to Hugging Face page, click your account avatar and select Access tokens.
- On the Access Tokens page, click Create new token.
- Select the Read token type and enter a name for your token.
- Your access token is displayed. Save the token in a safe place.
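Optionally, you can confirm the token works by calling the Hugging Face whoami endpoint. This quick check is an addition to the tutorial, not part of it; replace YOUR_HF_TOKEN with the token you just created:

# Optional: verify the access token against the Hugging Face API.
# A valid token returns your account details as JSON.
curl -s https://huggingface.co/api/whoami-v2 \
  -H "Authorization: Bearer YOUR_HF_TOKEN"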
Set up your environment
Create a Cloud TPU v6e VM using the queued resources API. For Qwen3-8B-Base, we recommend using a v6e-1 TPU.
export PROJECT_ID=YOUR_PROJECT_ID
export TPU_NAME=Qwen3-8B-Base-tutorial
export ZONE=us-east5-a
export QR_ID=Qwen3-8B-Base-qr

gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT_ID \
  --zone $ZONE \
  --accelerator-type v6e-1 \
  --runtime-version v2-alpha-tpuv6e

Check to make sure your TPU VM is ready.
gcloud compute tpus queued-resources describe $QR_ID \
  --project $PROJECT_ID \
  --zone $ZONE

When your TPU VM has been created, the status of the queued resource request is set to ACTIVE. For example:

name: projects/your-project-id/locations/your-zone/queuedResources/your-queued-resource-id
state:
  state: ACTIVE
tpu:
  nodeSpec:
  - node:
      acceleratorType: v6e-1
      bootDisk: {}
      networkConfig:
        enableExternalIps: true
      queuedResource: projects/your-project-number/locations/your-zone/queuedResources/your-queued-resource-id
      runtimeVersion: v2-alpha-tpuv6e
      schedulingConfig: {}
      serviceAccount: {}
      shieldedInstanceConfig: {}
      useTpuVm: true
    nodeId: your-node-id
    parent: projects/your-project-number/locations/your-zone

Connect to the TPU VM.
gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --project $PROJECT_ID \
  --zone $ZONE
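If you automate this setup, you can poll the queued resource until it becomes ACTIVE instead of re-running the describe command by hand. A minimal sketch, assuming the state can be extracted with gcloud's --format flag using the value(state.state) field path visible in the example output above:

# Optional: wait until the queued resource reaches the ACTIVE state.
while [[ "$(gcloud compute tpus queued-resources describe $QR_ID \
    --project $PROJECT_ID --zone $ZONE \
    --format='value(state.state)')" != "ACTIVE" ]]; do
  echo "Waiting for the TPU VM to be ready..."
  sleep 30
done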
Run vLLM with Qwen3-8B-Base
Inside the TPU VM, run the vLLM Docker container. This command uses a shared memory size of 10 GB.
export DOCKER_URI=vllm/vllm-tpu:latest

sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm \
  --shm-size 10gb \
  -p 8000:8000 \
  --entrypoint /bin/bash ${DOCKER_URI}

Inside the container, set your Hugging Face token. Replace YOUR_HF_TOKEN with your Hugging Face token.
export HF_HOME=/dev/shm
export HF_TOKEN=YOUR_HF_TOKEN

Start the vLLM server using the vllm serve command.
export MAX_MODEL_LEN=4096
export TP=1 # number of chips

vllm serve Qwen/Qwen3-8B-Base \
  --seed 42 \
  --disable-log-requests \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN

When the vLLM server is running, you will see output like the following:
(APIServer pid=7) INFO:     Started server process [7]
(APIServer pid=7) INFO:     Waiting for application startup.
(APIServer pid=7) INFO:     Application startup complete.
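Before sending real traffic, you can optionally confirm the server is ready. vLLM's OpenAI-compatible server exposes a /health endpoint and lists the served model under /v1/models; this check is an optional addition to the tutorial:

# Optional readiness check: /health returns HTTP 200 once the server is up.
curl http://localhost:8000/health
# Optional: list the models the server is currently serving.
curl http://localhost:8000/v1/models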
Send an inference request
Once the vLLM server is running, you can send requests to it from a new shell.
Open a new shell and connect to your TPU VM.
export PROJECT_ID=YOUR_PROJECT_ID
export TPU_NAME=Qwen3-8B-Base-tutorial
export ZONE=us-east5-a

gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --project $PROJECT_ID \
  --zone=$ZONE

Open a shell into the running Docker container.
sudo docker exec -it $USER-vllm /bin/bash

Send a test request to the server using curl.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B-Base",
    "prompt": "The future of AI is",
    "max_tokens": 200,
    "temperature": 0
  }'
The response is returned in JSON format.
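The /v1/completions endpoint also accepts the OpenAI-style stream parameter, which returns tokens incrementally as server-sent events instead of a single JSON body. A variant of the request above, using the same model and server as the tutorial:

# Stream tokens as they are generated instead of waiting for the full response.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B-Base",
    "prompt": "The future of AI is",
    "max_tokens": 200,
    "temperature": 0,
    "stream": true
  }'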
Run a benchmark workload
You can run benchmarks against the running server from your second terminal.
Inside the container, install the datasets library.

pip install datasets

Run the vllm bench serve command.

export HF_HOME=/dev/shm
cd /workspace/vllm

vllm bench serve \
  --backend vllm \
  --model "Qwen/Qwen3-8B-Base" \
  --dataset-name random \
  --num-prompts 1000 \
  --seed 100
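The random dataset generates synthetic prompts with fixed default lengths; the sample results below correspond to 1024 input and 128 output tokens per prompt. To benchmark a different traffic shape, you can set these lengths explicitly. A sketch, assuming the --random-input-len and --random-output-len flags available in recent vLLM releases:

# Benchmark with explicit synthetic input/output lengths (flags assumed above).
vllm bench serve \
  --backend vllm \
  --model "Qwen/Qwen3-8B-Base" \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 128 \
  --num-prompts 1000 \
  --seed 100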
The benchmark results appear as follows:
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  73.97
Total input tokens:                      1024000
Total generated tokens:                  128000
Request throughput (req/s):              13.52
Output token throughput (tok/s):         1730.38
Peak output token throughput (tok/s):    2522.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          15573.42
---------------Time to First Token----------------
Mean TTFT (ms):                          34834.97
Median TTFT (ms):                        34486.19
P99 TTFT (ms):                           70234.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.30
Median TPOT (ms):                        48.57
P99 TPOT (ms):                           48.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.31
Median ITL (ms):                         53.49
P99 ITL (ms):                            54.58
==================================================

Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- In the second shell, type exit to exit the vLLM container.
- In the second shell, type exit to close the terminal.
- In the first shell, press Ctrl+C to stop the vLLM server.
- In the first shell, type exit to exit the vLLM container.
- In the first shell, type exit to disconnect from the TPU VM.
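Before deleting anything, you can optionally list the queued resources that still exist in your zone to confirm what will be removed:

# Optional: list remaining queued resource requests in the zone.
gcloud compute tpus queued-resources list \
  --project=$PROJECT_ID \
  --zone=$ZONE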
Delete your resources
You can delete the project, which deletes all of its resources, or you can keep the project and delete the individual resources.
Delete your project
To delete your Google Cloud project and all associated resources, run:

gcloud projects delete $PROJECT_ID

Delete TPU resources
Delete your Cloud TPU resources. The following command deletes both the queued resource request and the TPU VM using the --force parameter.
gcloud alpha compute tpus queued-resources delete $QR_ID \
  --project=$PROJECT_ID \
  --zone=$ZONE \
  --force

What's next
- Learn more about vLLM on Cloud TPU.
- Learn more about Cloud TPU.