Serve Qwen3-8B-Base with vLLM on TPUs
This tutorial shows you how to serve the Qwen/Qwen3-8B-Base model using the vLLM TPU serving framework on a v6e TPU VM.
Objectives
- Set up your environment.
- Run vLLM with Qwen3-8B-Base.
- Send an inference request.
- Run a benchmark workload.
- Clean up.
Costs
This tutorial uses billable components of Google Cloud, including Cloud TPU.
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
Before going through this tutorial, follow the instructions on the Set up the Cloud TPU environment page. The instructions guide you through the steps needed to create a Google Cloud project and configure it to use Cloud TPU. You can also use an existing Google Cloud project. If you choose to do so, skip the step to create a Google Cloud project and start with Set up your environment to use Cloud TPU.
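If you're working from a fresh shell, you can point the gcloud CLI at your project before running the commands below. This is an optional sketch, not a step from the setup page; YOUR_PROJECT_ID is a placeholder:

# Authenticate the gcloud CLI and set the default project.
gcloud auth login
gcloud config set project YOUR_PROJECT_ID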
You need a Hugging Face access token to use this tutorial. You can sign up for a free account at Hugging Face. Once you have an account, generate an access token:
- On the Welcome to Hugging Face page, click your account avatar and select Access tokens.
- On the Access Tokens page, click Create new token.
- Select the Read token type and enter a name for your token.
- Your access token is displayed. Save the token in a safe place.
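Optionally, you can confirm the token works by calling the Hugging Face whoami endpoint. This quick check is an addition to the tutorial, not part of it; replace YOUR_HF_TOKEN with the token you just created:

# Optional: verify the access token against the Hugging Face API.
# A valid token returns your account details as JSON.
curl -s https://huggingface.co/api/whoami-v2 \
  -H "Authorization: Bearer YOUR_HF_TOKEN"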
Set up your environment
Create a Cloud TPU v6e VM using the queued resources API. For Qwen3-8B-Base, we recommend using a v6e-1 TPU.
export PROJECT_ID=YOUR_PROJECT_ID
export TPU_NAME=Qwen3-8B-Base-tutorial
export ZONE=us-east5-a
export QR_ID=Qwen3-8B-Base-qr

gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT_ID \
  --zone $ZONE \
  --accelerator-type v6e-1 \
  --runtime-version v2-alpha-tpuv6e

Check to make sure your TPU VM is ready.
gcloud compute tpus queued-resources describe $QR_ID \
  --project $PROJECT_ID \
  --zone $ZONE

When your TPU VM has been created, the status of the queued resource request is set to ACTIVE. For example:

name: projects/your-project-id/locations/your-zone/queuedResources/your-queued-resource-id
state:
  state: ACTIVE
tpu:
  nodeSpec:
  - node:
      acceleratorType: v6e-1
      bootDisk: {}
      networkConfig:
        enableExternalIps: true
      queuedResource: projects/your-project-number/locations/your-zone/queuedResources/your-queued-resource-id
      runtimeVersion: v2-alpha-tpuv6e
      schedulingConfig: {}
      serviceAccount: {}
      shieldedInstanceConfig: {}
      useTpuVm: true
    nodeId: your-node-id
    parent: projects/your-project-number/locations/your-zone

Connect to the TPU VM.
gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --project $PROJECT_ID \
  --zone $ZONE
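If you automate this setup, you can poll the queued resource until it becomes ACTIVE instead of re-running the describe command by hand. A minimal sketch, assuming the state can be extracted with gcloud's --format flag using the value(state.state) field path visible in the example output above:

# Optional: wait until the queued resource reaches the ACTIVE state.
while [[ "$(gcloud compute tpus queued-resources describe $QR_ID \
    --project $PROJECT_ID --zone $ZONE \
    --format='value(state.state)')" != "ACTIVE" ]]; do
  echo "Waiting for the TPU VM to be ready..."
  sleep 30
done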
Run vLLM with Qwen3-8B-Base
Inside the TPU VM, run the vLLM Docker container. This command uses a shared memory size of 10 GB.
export DOCKER_URI=vllm/vllm-tpu:latest

sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm \
  --shm-size 10gb \
  -p 8000:8000 \
  --entrypoint /bin/bash ${DOCKER_URI}

Inside the container, set your Hugging Face token. Replace YOUR_HF_TOKEN with your Hugging Face token.
export HF_HOME=/dev/shm
export HF_TOKEN=YOUR_HF_TOKEN

Start the vLLM server using the vllm serve command.
export MAX_MODEL_LEN=4096
export TP=1 # number of chips

vllm serve Qwen/Qwen3-8B-Base \
  --seed 42 \
  --disable-log-requests \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN

When the vLLM server is running, you will see output like the following:
(APIServer pid=7) INFO:     Started server process [7]
(APIServer pid=7) INFO:     Waiting for application startup.
(APIServer pid=7) INFO:     Application startup complete.
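Before sending real traffic, you can optionally confirm the server is ready. vLLM's OpenAI-compatible server exposes a /health endpoint and lists the served model under /v1/models; this check is an optional addition to the tutorial:

# Optional readiness check: /health returns HTTP 200 once the server is up.
curl http://localhost:8000/health
# Optional: list the models the server is currently serving.
curl http://localhost:8000/v1/models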
Send an inference request
Once the vLLM server is running, you can send requests to it from a new shell.
Open a new shell and connect to your TPU VM.
export PROJECT_ID=YOUR_PROJECT_ID
export TPU_NAME=Qwen3-8B-Base-tutorial
export ZONE=us-east5-a

gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --project $PROJECT_ID \
  --zone=$ZONE

Open a shell into the running Docker container.
sudo docker exec -it $USER-vllm /bin/bash

Send a test request to the server using curl.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B-Base",
    "prompt": "The future of AI is",
    "max_tokens": 200,
    "temperature": 0
  }'
The response is returned in JSON format.
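The /v1/completions endpoint also accepts the OpenAI-style stream parameter, which returns tokens incrementally as server-sent events instead of a single JSON body. A variant of the request above, using the same model and server as the tutorial:

# Stream tokens as they are generated instead of waiting for the full response.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B-Base",
    "prompt": "The future of AI is",
    "max_tokens": 200,
    "temperature": 0,
    "stream": true
  }'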
Run a benchmark workload
You can run benchmarks against the running server from your second terminal.
Inside the container, install the datasets library.

pip install datasets

Run the vllm bench serve command.

export HF_HOME=/dev/shm
cd /workspace/vllm

vllm bench serve \
  --backend vllm \
  --model "Qwen/Qwen3-8B-Base" \
  --dataset-name random \
  --num-prompts 1000 \
  --seed 100
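The random dataset generates synthetic prompts with fixed default lengths; the sample results below correspond to 1024 input and 128 output tokens per prompt. To benchmark a different traffic shape, you can set these lengths explicitly. A sketch, assuming the --random-input-len and --random-output-len flags available in recent vLLM releases:

# Benchmark with explicit synthetic input/output lengths (flags assumed above).
vllm bench serve \
  --backend vllm \
  --model "Qwen/Qwen3-8B-Base" \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 128 \
  --num-prompts 1000 \
  --seed 100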
The benchmark results appear as follows:
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  73.97
Total input tokens:                      1024000
Total generated tokens:                  128000
Request throughput (req/s):              13.52
Output token throughput (tok/s):         1730.38
Peak output token throughput (tok/s):    2522.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          15573.42
---------------Time to First Token----------------
Mean TTFT (ms):                          34834.97
Median TTFT (ms):                        34486.19
P99 TTFT (ms):                           70234.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.30
Median TPOT (ms):                        48.57
P99 TPOT (ms):                           48.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.31
Median ITL (ms):                         53.49
P99 ITL (ms):                            54.58
==================================================

Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- In the second shell, type exit to exit the vLLM container.
- In the second shell, type exit to close the terminal.
- In the first shell, press Ctrl+C to stop the vLLM server.
- In the first shell, type exit to exit the vLLM container.
- In the first shell, type exit to disconnect from the TPU VM.
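Before deleting anything, you can optionally list the queued resources that still exist in your zone to confirm what will be removed:

# Optional: list remaining queued resource requests in the zone.
gcloud compute tpus queued-resources list \
  --project=$PROJECT_ID \
  --zone=$ZONE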
Delete your resources
You can delete the project, which deletes all of its resources, or you can keep the project and delete the individual resources.
Delete your project
To delete your Google Cloud project and all associated resources, run:

gcloud projects delete $PROJECT_ID

Delete TPU resources
Delete your Cloud TPU resources. The following command deletes both the queued resource request and the TPU VM using the --force parameter.
gcloud alpha compute tpus queued-resources delete $QR_ID \
  --project=$PROJECT_ID \
  --zone=$ZONE \
  --force

What's next
- Learn more about vLLM on Cloud TPU.
- Learn more about Cloud TPU.