Deploy open models with a custom vLLM container
To see examples of deploying Llama 3.2 3B with custom vLLM containers, run the following notebooks in the environment of your choice:
"Deploy Llama 3.2 3B on CPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on GPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on TPU using vLLM with GCS weights":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on TPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Although the various Vertex AI model serving options are sufficient for many use cases, you might need to use your own container images to serve models on Vertex AI. This document describes how to use a vLLM custom container image to serve models on Vertex AI on CPUs, GPUs, or TPUs. For more information about vLLM supported models, see the vLLM documentation.
The vLLM API server implements the OpenAI API protocol, but it does not support the Vertex AI request and response requirements. Therefore, you must use a Vertex AI Raw Inference Request to get inferences from models deployed to Vertex AI using a PredictionEndpoint. For more information about the Raw Prediction method in the Vertex AI Python SDK, see the Python SDK documentation.
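Because the vLLM server speaks the OpenAI completions protocol, the body you send through the raw inference call is plain OpenAI-style JSON. The following sketch shows one way to assemble such a body; the helper name `build_completion_body` is hypothetical and not part of any SDK:

```python
import json


def build_completion_body(prompt: str, temperature: float = 0.0, max_tokens: int = 64) -> str:
    """Build an OpenAI-style /v1/completions request body as a JSON string."""
    return json.dumps(
        {
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    )


body = build_completion_body("Distance of moon from earth is")
```

You would pass this string as the body of the raw prediction request, with a `Content-Type: application/json` header, as shown later in this document.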
You can source models from both Hugging Face and Cloud Storage. This approach offers flexibility, which lets you take advantage of the community-driven model hub (Hugging Face) and the optimized data transfer and security capabilities of Cloud Storage for internal model management or fine-tuned versions.
vLLM downloads the models from Hugging Face if a Hugging Face access token is provided. Otherwise, vLLM assumes the model is available on local disk. The custom container image lets Vertex AI download the model from Google Cloud in addition to Hugging Face.
Before you begin
In your Google Cloud project, enable the Vertex AI and Artifact RegistryAPIs.
gcloudservicesenableaiplatform.googleapis.com\artifactregistry.googleapis.comConfigure Google Cloud CLI with your project ID and initializeVertex AI SDK.
PROJECT_ID="PROJECT_ID"LOCATION="LOCATION"importvertexaivertexai.init(project=PROJECT_ID,location=LOCATION)gcloudconfigsetproject{PROJECT_ID}Create a Docker repository in Artifact Registry.
gcloudartifactsrepositoriescreateDOCKER_REPOSITORY\--repository-format=docker\--location=LOCATION\--description="Vertex AI Docker repository"Optional: If downloading models from Hugging Face, obtain a Hugging Facetoken.
- Create a Hugging Face account if you don't have one.
- For gated models like Llama 3.2, request and receive access on Hugging Face before proceeding.
- Generate an access token: Go to Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name and a role of at least Read.
- Select Generate a token.
- Save this token for the deployment steps.
Prepare container build files
The following Dockerfile builds the vLLM custom container image for GPUs, TPUs, and CPUs. This custom container downloads models from Hugging Face or Cloud Storage.
```
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Install gcloud SDK
RUN apt-get update && \
    apt-get install -y apt-utils git apt-transport-https gnupg ca-certificates curl \
    && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
    && apt-get update -y && apt-get install google-cloud-cli -y \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/vllm

# Copy entrypoint.sh to the container
COPY ./entrypoint.sh /workspace/vllm/vertexai/entrypoint.sh
RUN chmod +x /workspace/vllm/vertexai/entrypoint.sh

ENTRYPOINT ["/workspace/vllm/vertexai/entrypoint.sh"]
```

Build the custom container image using Cloud Build. The following cloudbuild.yaml configuration file shows how to build the image for multiple platforms using the same Dockerfile.
```
steps:
- name: 'gcr.io/cloud-builders/docker'
  automapSubstitutions: true
  script: |
    #!/usr/bin/env bash
    set -euo pipefail
    device_type_param=${_DEVICE_TYPE}
    device_type=${device_type_param,,}
    base_image=${_BASE_IMAGE}
    image_name="vllm-${_DEVICE_TYPE}"
    if [[ $device_type == "cpu" ]]; then
      echo "Quietly building open source vLLM CPU container image"
      git clone https://github.com/vllm-project/vllm.git
      cd vllm && DOCKER_BUILDKIT=1 docker build -t $base_image -f docker/Dockerfile.cpu . -q
      cd ..
    fi
    echo "Quietly building container image for: $device_type"
    docker build -t $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name --build-arg BASE_IMAGE=$base_image . -q
    docker push $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name
substitutions:
  _DEVICE_TYPE: gpu
  _BASE_IMAGE: vllm/vllm-openai
  _REPOSITORY: my-docker-repo
```

The files are available in the googlecloudplatform/vertex-ai-samples GitHub repository. Clone the repository to use them:
```
git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git
```

Build and push the container image
Build the custom container image using Cloud Build by submitting the cloudbuild.yaml file. Use substitutions to specify the target device type, which can be GPU, TPU, or CPU, and the corresponding base image.
GPU
```
DEVICE_TYPE="gpu"
BASE_IMAGE="vllm/vllm-openai"

cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
```

TPU
```
DEVICE_TYPE="tpu"
BASE_IMAGE="vllm/vllm-tpu:nightly"

cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
```

CPU
```
DEVICE_TYPE="cpu"
BASE_IMAGE="vllm-cpu-base"

cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
```

After the build finishes, configure Docker to authenticate with Artifact Registry:
```
gcloud auth configure-docker LOCATION-docker.pkg.dev --quiet
```

Upload model to Model Registry and deploy
Upload your model to Vertex AI Model Registry, create an endpoint, and deploy the model by completing these steps. This example uses Llama 3.2 3B, but you can adapt it for other models.
Define model and deployment variables. Set the DOCKER_URI variable to the image you built in the previous step (for example, for GPU):

```
DOCKER_URI = f"LOCATION-docker.pkg.dev/PROJECT_ID/DOCKER_REPOSITORY/vllm-gpu"
```

Define variables for the Hugging Face token and model properties. For example, for GPU deployment:

```
hf_token = "your-hugging-face-auth-token"
model_name = "gpu-llama3_2_3B-serve-vllm"
model_id = "meta-llama/Llama-3.2-3B"
machine_type = "g2-standard-8"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1
```

Upload the model to Model Registry. The upload_model function varies slightly depending on the device type because of different vLLM arguments and environment variables.

```
from google.cloud import aiplatform


def upload_model_gpu(model_name, model_id, hf_token, accelerator_count, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
        "--gpu-memory-utilization=0.9",
        "--enable-prefix-caching",
        f"--tensor-parallel-size={accelerator_count}",
    ]
    env_vars = {
        "HF_TOKEN": hf_token,
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",
    }
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model


def upload_model_tpu(model_name, model_id, hf_token, tpu_count, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
        "--enable-prefix-caching",
        f"--tensor-parallel-size={tpu_count}",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model


def upload_model_cpu(model_name, model_id, hf_token, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model


# Example for GPU:
vertexai_model = upload_model_gpu(
    model_name, model_id, hf_token, accelerator_count, DOCKER_URI
)
```

Create an endpoint.
```
endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
```

Deploy the model to the endpoint. Model deployment might take 20 to 30 minutes.
```
# Example for GPU:
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_accelerator_duty_cycle=60,
)
```

For TPUs, omit the accelerator_type and accelerator_count parameters, and use autoscaling_target_request_count_per_minute=60. For CPUs, omit the accelerator_type and accelerator_count parameters, and use autoscaling_target_cpu_utilization=60.
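The per-device autoscaling differences described above can be collected in a small helper that returns the device-specific keyword argument for the deploy call. The helper name is hypothetical and not part of the Vertex AI SDK; it only restates the values given in this section:

```python
def autoscaling_kwargs(device_type: str) -> dict:
    """Return the device-specific autoscaling keyword argument for Model.deploy."""
    if device_type == "gpu":
        return {"autoscaling_target_accelerator_duty_cycle": 60}
    if device_type == "tpu":
        return {"autoscaling_target_request_count_per_minute": 60}
    if device_type == "cpu":
        return {"autoscaling_target_cpu_utilization": 60}
    raise ValueError(f"unknown device type: {device_type}")
```

You could then splat the result into the deploy call, for example `vertexai_model.deploy(..., **autoscaling_kwargs("tpu"))`.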
Load models from Cloud Storage
The custom container downloads the model from a Cloud Storage location instead of downloading it from Hugging Face. When you use Cloud Storage:
- Set the model_id parameter in the upload_model function to a Cloud Storage URI, for example, gs://my-bucket/my-models/llama_3_2_3B.
- Omit the HF_TOKEN variable from env_vars when you call upload_model.
- When you call model.deploy, specify a service_account that has permissions to read from the Cloud Storage bucket.
Create an IAM Service Account for Cloud Storage access
If your model is in Cloud Storage, create a service account that Vertex AI Prediction endpoints can use to access the model artifacts.
```
SERVICE_ACCOUNT_NAME="vertexai-endpoint-sa"
SERVICE_ACCOUNT_EMAIL="SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com"

gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
    --display-name="Vertex AI Endpoint Service Account"

# Grant storage read permission
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectViewer"
```

When you deploy, pass the service account email to the deploy method: service_account=SERVICE_ACCOUNT_EMAIL.
Get predictions using endpoint
After you successfully deploy the model to the endpoint, verify the model response using raw_predict.
```
import json

PROMPT = "Distance of moon from earth is"

request_body = json.dumps(
    {
        "prompt": PROMPT,
        "temperature": 0.0,
    },
)

raw_response = endpoint.raw_predict(
    body=request_body, headers={"Content-Type": "application/json"}
)
assert raw_response.status_code == 200

result = json.loads(raw_response.text)
for choice in result["choices"]:
    print(choice)
```

Example output:
```
{"index": 0, "text": "384,400 km. The moon is 1/4 of the earth's", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null}
```

What's next
- Choose an open model serving option
- Use open models using Model as a Service (MaaS)
- Deploy open models from Model Garden
- Deploy open models with prebuilt containers
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-25 UTC.