vLLM serving for text-only and multimodal language models on Cloud GPUs

Summary

This tutorial walks you through the process of deploying and serving Llama 3.1 and 3.2 models using vLLM in Vertex AI. It is designed to be used in conjunction with two separate notebooks: Serve Llama 3.1 with vLLM for deploying text-only Llama 3.1 models, and Serve Multimodal Llama 3.2 with vLLM for deploying multimodal Llama 3.2 models that handle both text and image inputs. The steps outlined on this page show you how to efficiently handle model inference on GPUs and customize models for diverse applications, equipping you with the tools to integrate advanced language models into your projects.

By the end of this guide, you will understand how to:

  • Download prebuilt Llama models from Hugging Face with the vLLM container.
  • Use vLLM to deploy these models on GPU instances within Google Cloud Vertex AI Model Garden.
  • Serve models efficiently to handle inference requests at scale.
  • Run inference on text-only requests and text + image requests.
  • Clean up deployed resources.
  • Debug common deployment issues.

vLLM Key Features

Feature | Description
PagedAttention | An optimized attention mechanism that efficiently manages memory during inference. Supports high-throughput text generation by dynamically allocating memory resources, enabling scalability for multiple concurrent requests.
Continuous batching | Consolidates multiple input requests into a single batch for parallel processing, maximizing GPU utilization and throughput.
Token streaming | Enables real-time token-by-token output during text generation. Ideal for applications that require low latency, such as chatbots or interactive AI systems.
Model compatibility | Supports a wide range of pre-trained models across popular frameworks like Hugging Face Transformers. Makes it easier to integrate and experiment with different LLMs.
Multi-GPU & multi-host | Enables efficient model serving by distributing the workload across multiple GPUs within a single machine and across multiple machines in a cluster, significantly increasing throughput and scalability.
Efficient deployment | Offers seamless integration with APIs, such as OpenAI chat completions, making deployment straightforward for production use cases.
Seamless integration with Hugging Face models | vLLM is compatible with the Hugging Face model artifact format and supports loading from Hugging Face, making it straightforward to deploy Llama models alongside other popular models like Gemma, Phi, and Qwen in an optimized setting.
Community-driven open-source project | vLLM is open source and encourages community contributions, promoting continuous improvement in LLM serving efficiency.

Table 1: Summary of vLLM features
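
To make these features concrete, the following minimal sketch uses the open-source vLLM Python API directly, outside of the Vertex AI deployment flow covered below. It assumes vLLM is installed locally (pip install vllm) and that you can access the model checkpoint; the model ID shown is only an example.

from vllm import LLM, SamplingParams

# Load a supported model; PagedAttention manages the KV cache behind the scenes.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64)

# Multiple prompts are scheduled together via continuous batching.
outputs = llm.generate(
    ["What is a car?", "Summarize why GPU memory matters for LLM serving."],
    sampling,
)
for output in outputs:
    print(output.outputs[0].text)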

Google Vertex AI vLLM Customizations: Enhance performance and integration

The vLLM implementation within Google Vertex AI Model Garden is not a direct integration of the open-source library. Vertex AI maintains a customized and optimized version of vLLM that is specifically tailored to enhance performance, reliability, and seamless integration within Google Cloud.

  • Performance optimizations:
    • Parallel downloading from Cloud Storage: Significantly accelerates model loading and deployment times by enabling parallel data retrieval from Cloud Storage, reducing latency and improving startup speed.
  • Feature enhancements:
    • Dynamic LoRA with enhanced caching and Cloud Storage support: Extends dynamic LoRA capabilities with local disk caching mechanisms and robust error handling, alongside support for loading LoRA weights directly from Cloud Storage paths and signed URLs. This simplifies management and deployment of customized models.
    • Llama 3.1/3.2 function calling parsing: Implements specialized parsing for Llama 3.1/3.2 function calling, improving parsing robustness.
    • Host memory prefix caching: Adds prefix caching in host (CPU) memory; the external open-source vLLM only supports GPU memory prefix caching.
    • Speculative decoding: This is an existing vLLM feature, but Vertex AI ran experiments to find high-performing model setups.
  • Vertex AI ecosystem integration:
    • Vertex AI prediction input/output format support: Ensures seamless compatibility with Vertex AI prediction input and output formats, simplifying data handling and integration with other Vertex AI services.
    • Vertex environment variable awareness: Respects and leverages Vertex AI environment variables (AIP_*) for configuration and resource management, streamlining deployment and ensuring consistent behavior within the Vertex AI environment.
    • Enhanced error handling and robustness: Implements comprehensive error handling, input/output validation, and server termination mechanisms to ensure stability, reliability, and seamless operation within the managed Vertex AI environment.
    • Nginx server for scalability: Integrates an Nginx server on top of the vLLM server, facilitating the deployment of multiple replicas and enhancing the scalability and high availability of the serving infrastructure.

These Vertex AI-specific customizations, while often transparent to the end user, enable you to maximize the performance and efficiency of your Llama 3.1 deployments on Vertex AI Model Garden.

Additional benefits of vLLM

  • Benchmark performance: vLLM offers competitive throughput and latency compared to other serving systems such as Hugging Face text-generation-inference and NVIDIA's FasterTransformer.
  • Ease of use: The library provides a straightforward API for integration with existing workflows, allowing you to deploy both Llama 3.1 and 3.2 models with minimal setup.
  • Advanced features: vLLM supports streaming outputs (generating responses token by token) and efficiently handles variable-length prompts, enhancing interactivity and responsiveness in applications.

For an overview of the vLLM system, see the vLLM paper.

Supported Models

vLLM provides support for a broad selection of state-of-the-art models, allowing you to choose a model that best fits your needs. The following table offers a selection of these models. However, to access a comprehensive list of supported models, including those for both text-only and multimodal inference, consult the official vLLM website.

Category | Models
Meta AI | Llama 3.3, Llama 3.2, Llama 3.1, Llama 3, Llama 2, Code Llama
Mistral AI | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and their variants (Instruct, Chat), Mistral-tiny, Mistral-small, Mistral-medium
DeepSeek AI | DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Llama-70B, Deepseek-vl2-tiny, Deepseek-vl2-small, Deepseek-vl2
MosaicML | MPT (7B, 30B) and variants (Instruct, Chat), MPT-7B-StoryWriter-65k
OpenAI | GPT-2, GPT-3, GPT-4, GPT-NeoX
Together AI | RedPajama, Pythia
Stability AI | StableLM (3B, 7B), StableLM-Alpha-3B, StableLM-Base-Alpha-7B, StableLM-Instruct-Alpha-7B
TII (Technology Innovation Institute) | Falcon 7B, Falcon 40B and variants (Instruct, Chat), Falcon-RW-1B, Falcon-RW-7B
BigScience | BLOOM, BLOOMZ
Google | FLAN-T5, UL2, Gemma (2B, 7B), PaLM 2
Salesforce | CodeT5, CodeT5+
LightOn | Persimmon-8B-base, Persimmon-8B-chat
EleutherAI | GPT-Neo, Pythia
AI21 Labs | Jamba
Cerebras | Cerebras-GPT
Intel | Intel-NeuralChat-7B
Other prominent models | StarCoder, OPT, Baichuan, Aquila, Qwen, InternLM, XGen, OpenLLaMA, Phi-2, Yi, OpenCodeInterpreter, Nous-Hermes, Gemma-it, Mistral-Instruct-v0.2-7B-Zeus

Table 2: Some models supported by vLLM

Get started in Model Garden

The vLLM serving container for Cloud GPUs is integrated into Model Garden through the playground, one-click deployment, and Colab Enterprise notebook examples. This tutorial focuses on the Llama model family from Meta AI as an example.

Use the Colab Enterprise notebook

Playground and one-click deployments are also available but are not outlined in this tutorial.

  1. Navigate to the model card page and click Open notebook.
  2. Select the Vertex Serving notebook. The notebook is opened in Colab Enterprise.
  3. Run through the notebook to deploy a model by using vLLM and send prediction requests to the endpoint.

Setup and requirements

This section outlines the necessary steps for setting up your Google Cloud project and ensuring you have the required resources for deploying and serving vLLM models.

1. Billing

Make sure that billing is enabled for your Google Cloud project.

2. GPU availability and quotas

Make sure that you have sufficient GPU quota in your target region. Recommended configurations:

Machine Type | Accelerator Type | Recommended Regions
a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1
a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, us-west1, europe-west4, asia-southeast1

3. Set up a Google Cloud Project

Run the following code sample to make sure that your Google Cloud environment is correctly set up. This step installs the necessary Python libraries and sets up access to Google Cloud resources. Actions include:

  • Installation: Upgrade the google-cloud-aiplatform library and clone the repository containing utility functions.
  • Environment setup: Define variables for the Google Cloud project ID, region, and a unique Cloud Storage bucket for storing model artifacts.
  • API activation: Enable the Vertex AI and Compute Engine APIs, which are essential for deploying and managing AI models.
  • Bucket configuration: Create a new Cloud Storage bucket or check an existing bucket to ensure it's in the correct region.
  • Vertex AI initialization: Initialize the Vertex AI client library with the project, location, and staging bucket settings.
  • Service account setup: Identify the default service account for running Vertex AI jobs and grant it the necessary permissions.
BUCKET_URI="gs://"REGION=""!pip3install--upgrade--quiet'google-cloud-aiplatform>=1.64.0'!gitclonehttps://github.com/GoogleCloudPlatform/vertex-ai-samples.gitimportdatetimeimportimportlibimportosimportuuidfromtypingimportTupleimportrequestsfromgoogle.cloudimportaiplatformcommon_util=importlib.import_module("vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util")models,endpoints={},{}PROJECT_ID=os.environ["GOOGLE_CLOUD_PROJECT"]ifnotREGION:REGION=os.environ["GOOGLE_CLOUD_REGION"]print("Enabling Vertex AI API and Compute Engine API.")!gcloudservicesenableaiplatform.googleapis.comcompute.googleapis.comnow=datetime.datetime.now().strftime("%Y%m%d%H%M%S")BUCKET_NAME="/".join(BUCKET_URI.split("/")[:3])ifBUCKET_URIisNoneorBUCKET_URI.strip()==""orBUCKET_URI=="gs://":BUCKET_URI=f"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}"BUCKET_NAME="/".join(BUCKET_URI.split("/")[:3])!gsutilmb-l{REGION}{BUCKET_URI}else:assertBUCKET_URI.startswith("gs://"),"BUCKET_URI must start with `gs://`."shell_output=!gsutills-Lb{BUCKET_NAME}|grep"Location constraint:"|sed"s/Location constraint://"bucket_region=shell_output[0].strip().lower()ifbucket_region!=REGION:raiseValueError("Bucket region%s is different from notebook region%s"%(bucket_region,REGION))print(f"Using this Bucket:{BUCKET_URI}")STAGING_BUCKET=os.path.join(BUCKET_URI,"temporal")MODEL_BUCKET=os.path.join(BUCKET_URI,"llama3_1")print("Initializing Vertex AI API.")aiplatform.init(project=PROJECT_ID,location=REGION,staging_bucket=STAGING_BUCKET)shell_output=!gcloudprojectsdescribe$PROJECT_IDproject_number=shell_output[-1].split(":")[1].strip().replace("'","")SERVICE_ACCOUNT="your service account email"print("Using this default Service Account:",SERVICE_ACCOUNT)!gsutiliamchserviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin$BUCKET_NAME!gcloudconfigsetproject$PROJECT_ID!gcloudprojectsadd-iam-policy-binding--no-user-output-enabled{PROJECT_ID}--member=serviceAccount:{SERVICE_ACCOUNT}--role="roles/storage.admin"!gcloudprojectsadd-iam-policy-binding--no-user-output-enabled{PROJECT_ID}--member=serviceAccount:{SERVICE_ACCOUNT}--role="roles/aiplatform.user"

Using Hugging Face with Meta Llama 3.1, 3.2, and vLLM

Note: Access to these models requires sharing your contact information and accepting the terms of use as outlined in the Meta Privacy Policy. Your request will then be reviewed by the repository's authors.

Meta's Llama 3.1 and 3.2 collections provide a range of multilingual large language models (LLMs) designed for high-quality text generation across various use cases. These models are pre-trained and instruction-tuned, excelling in tasks like multilingual dialogue, summarization, and agentic retrieval. Before using Llama 3.1 and 3.2 models, you must agree to their terms of use, as shown in the screenshot. The vLLM library offers an open-source, streamlined serving environment with optimizations for latency, memory efficiency, and scalability.

Figure 1: Meta Llama 3 Community License Agreement

Overview of Meta Llama 3.1 and 3.2 Collections

The Llama 3.1 and 3.2 collections each cater to different deployment scales and model sizes, providing you with flexible options for multilingual dialogue tasks and beyond. Refer to the Llama overview page for more information.

  • Text-only: The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in, text out).
  • Vision and Vision Instruct: The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in, text out).
  • Optimization: Like Llama 3.1, the 3.2 models are tailored for multilingual dialogue and perform well in retrieval and summarization tasks, achieving top results on standard benchmarks.
  • Model architecture: Llama 3.2 also uses an auto-regressive transformer framework, with SFT and RLHF applied to align the models for helpfulness and safety.

Hugging Face user access tokens

This tutorial requires a read access token from the Hugging Face Hub to access the necessary resources. Follow these steps to set up your authentication:

Figure 2: Hugging Face Access Token Settings
  1. Generate a read access token:

    • In your Hugging Face account settings, open the Access Tokens page, create a new token, and select the Read role.

  2. Use the token:

    • Use the generated token to authenticate and access public or private repositories as needed for the tutorial.
Figure 3: Manage Hugging Face Access Token

This setup ensures you have the appropriate level of access without unnecessary permissions. These practices enhance security and prevent accidental token exposure. For more information on setting up access tokens, visit the Hugging Face Access Tokens page.

Avoid sharing or exposing your token publicly or online. When you set your token as an environment variable during deployment, it remains private to your project. Vertex AI ensures its security by preventing other users from accessing your models and endpoints.

For more information on protecting your access token, refer to Hugging Face Access Tokens - Best Practices.
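
As a minimal sketch, assuming you are working in the same notebook environment: read the token from an environment variable and log in with the huggingface_hub library so that gated Llama repositories can be resolved. The variable name HF_TOKEN matches the deployment code later in this guide.

import os

from huggingface_hub import login  # assumes huggingface_hub is installed in the notebook

# Read the token from the environment rather than hard-coding it in the notebook.
HF_TOKEN = os.environ.get("HF_TOKEN", "")

if HF_TOKEN:
    # Verifies the token and caches the credential locally for later downloads.
    login(token=HF_TOKEN)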

Deploying text-only Llama 3.1 Models with vLLM

For production-level deployment of large language models, vLLM provides an efficient serving solution that optimizes memory usage, lowers latency, and increases throughput. This makes it particularly well suited for handling the larger Llama 3.1 models as well as the multimodal Llama 3.2 models.

Note: Recommended serving configurations: This example recommends using A100 80GB or H100 GPUs for optimal serving efficiency and performance. These GPUs are now readily available and are the preferred options for deploying these models.

Step 1: Choose a model to deploy

Choose the Llama 3.1 model variant to deploy. Available options include various sizes and instruction-tuned versions:

base_model_name="Meta-Llama-3.1-8B"# @param ["Meta-Llama-3.1-8B", "Meta-Llama-3.1-8B-Instruct", "Meta-Llama-3.1-70B", "Meta-Llama-3.1-70B-Instruct", "Meta-Llama-3.1-405B-FP8", "Meta-Llama-3.1-405B-Instruct-FP8"]hf_model_id="meta-Llama/"+base_model_name

Step 2: Check deployment hardware and quota

The deployment sets the appropriate GPU and machine type based on the model size and checks the quota in that region for a particular project:

if"8b"inbase_model_name.lower():accelerator_type="NVIDIA_L4"machine_type="g2-standard-12"accelerator_count=1elif"70b"inbase_model_name.lower():accelerator_type="NVIDIA_L4"machine_type="g2-standard-96"accelerator_count=8elif"405b"inbase_model_name.lower():accelerator_type="NVIDIA_H100_80GB"machine_type="a3-highgpu-8g"accelerator_count=8else:raiseValueError(f"Recommended GPU setting not found for:{accelerator_type} and{base_model_name}.")

Verify GPU quota availability in your specified region:

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

Step 3: Deploy the model using vLLM

The following function uploads the model to Vertex AI, configures deployment settings, and deploys it to an endpoint using vLLM.

  1. Docker image: The deployment uses a prebuilt vLLM Docker image for efficient serving.
  2. Configuration: Configure memory utilization, model length, and other vLLM settings. For more information on the arguments supported by the server, visit the official vLLM documentation page.
  3. Environment variables: Set environment variables for authentication and deployment source.
def deploy_model_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 256,
    model_type: str = None,  # Added: referenced below but missing from the original signature.
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if "8b" in base_model_name.lower():
        accelerator_type = "NVIDIA_L4"
        machine_type = "g2-standard-12"
        accelerator_count = 1
    elif "70b" in base_model_name.lower():
        accelerator_type = "NVIDIA_L4"
        machine_type = "g2-standard-96"
        accelerator_count = 8
    elif "405b" in base_model_name.lower():
        accelerator_type = "NVIDIA_H100_80GB"
        machine_type = "a3-highgpu-8g"
        accelerator_count = 8
    else:
        raise ValueError(f"Recommended GPU setting not found for: {accelerator_type} and {base_model_name}.")

    common_util.check_quota(
        project_id=PROJECT_ID,
        region=REGION,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        is_for_training=False,
    )

    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]
    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")
    if enforce_eager:
        vllm_args.append("--enforce-eager")
    if enable_lora:
        vllm_args.append("--enable-lora")
    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    env_vars = {
        "MODEL_ID": model_id,
        "DEPLOY_SOURCE": "notebook",
        "HF_TOKEN": HF_TOKEN,
    }

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),
        serving_container_deployment_timeout=7200,
    )
    print(f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s).")

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint

Step 4: Execute deployment

Run the deployment function with the selected model and configuration. This step deploys the model and returns the model and endpoint instances:

HF_TOKEN=""VLLM_DOCKER_URI="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241001_0916_RC00"model_name=common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve-vllm")gpu_memory_utilization=0.9max_model_len=4096max_loras=1models["vllm_gpu"],endpoints["vllm_gpu"]=deploy_model_vllm(model_name=common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve"),model_id=hf_model_id,service_account=SERVICE_ACCOUNT,machine_type=machine_type,accelerator_type=accelerator_type,accelerator_count=accelerator_count,gpu_memory_utilization=gpu_memory_utilization,max_model_len=max_model_len,max_loras=max_loras,enforce_eager=True,enable_lora=True,use_dedicated_endpoint=use_dedicated_endpoint,)

After running this code sample, your Llama 3.1 model is deployed on Vertex AI and accessible through the specified endpoint. You can interact with it for inference tasks such as text generation, summarization, and dialogue. Depending on the model size, a new model deployment can take up to an hour. You can check the progress on the online prediction page.

Figure 4: Llama 3.1 Deployment Endpoint in Vertex Dashboard
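
As a small optional check, you can also list your endpoints from the notebook with the Vertex AI SDK to confirm the new endpoint appears; this sketch is not part of the original notebook.

# Optional: confirm the new endpoint is visible from the Vertex AI SDK.
for ep in aiplatform.Endpoint.list():
    print(ep.display_name, ep.resource_name)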

Making predictions with Llama 3.1 on Vertex AI

After successfully deploying the Llama 3.1 model to Vertex AI, you can start making predictions by sending text prompts to the endpoint. This section provides an example of generating responses with various customizable parameters for controlling the output.

Step 1: Define your prompt and parameters

Start by setting up your text prompt and sampling parameters to guide the model's response. Here are the key parameters:

  • prompt: The input text for which you want the model to generate a response. For example, prompt = "What is a car?".
  • max_tokens: The maximum number of tokens in the generated output. Reducing this value can help prevent timeout issues.
  • temperature: Controls the randomness of predictions. Higher values (for example, 1.0) increase diversity, while lower values (for example, 0.5) make the output more focused.
  • top_p: Limits the sampling pool to the top cumulative probability. For example, setting top_p = 0.9 will only consider tokens within the top 90% probability mass.
  • top_k: Limits sampling to the top k most likely tokens. For example, setting top_k = 50 will only sample from the top 50 tokens.
  • raw_response: If True, returns the raw model output. If False, applies additional formatting with the structure "Prompt:\n{prompt}\nOutput:\n{output}".
  • lora_id (optional): Path to LoRA weight files to apply Low-Rank Adaptation (LoRA) weights. This can be a Cloud Storage bucket or a Hugging Face repository URL. Note that this only works if --enable-lora is set in the deployment arguments. Dynamic LoRA is not supported for multimodal models.
prompt="What is a car?"max_tokens=50temperature=1.0top_p=1.0top_k=1raw_response=Falselora_id=""

Step 2: Send the prediction request

Now that the instance is configured, you can send the prediction request to the deployed Vertex AI endpoint. This example shows how to make a prediction and print the result:

response = endpoints["vllm_gpu"].predict(
    instances=instances, use_dedicated_endpoint=use_dedicated_endpoint
)

for prediction in response.predictions:
    print(prediction)

Example output

Here's an example of how the model might respond to the prompt "What is a car?":

Human: What is a car?
Assistant: A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another.

Additional notes

  • Moderation: To ensure safe content, you can moderate the generated text with Vertex AI's text moderation capabilities.
  • Handling timeouts: If you encounter issues like ServiceUnavailable: 503, try reducing the max_tokens parameter.

This approach provides a flexible way to interact with the Llama 3.1 model using different sampling techniques and LoRA adapters, making it suitable for a variety of use cases, from general-purpose text generation to task-specific responses.

Deploying multimodal Llama 3.2 models with vLLM

This section walks you through the process of uploading prebuilt Llama 3.2 models to the Model Registry and deploying them to a Vertex AI endpoint. The deployment time can take up to an hour, depending on the size of the model. Llama 3.2 models are available in multimodal versions that support both text and image inputs. vLLM supports:

  • Text-only format
  • Single image + text format

These formats make Llama 3.2 suitable for applications requiring both visual and text processing.

Step 1: Choose a model to deploy

Specify the Llama 3.2 model variant you want to deploy. The following example uses Llama-3.2-11B-Vision as the selected model, but you can choose from other available options based on your requirements.

base_model_name="Llama-3.2-11B-Vision"# @param ["Llama-3.2-1B", "Llama-3.2-1B-Instruct", "Llama-3.2-3B", "Llama-3.2-3B-Instruct", "Llama-3.2-11B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision", "Llama-3.2-90B-Vision-Instruct"]hf_model_id="meta-Llama/"+base_model_name

Step 2: Configure hardware and resources

Select appropriate hardware for the model size. vLLM can use different GPUs depending on the computational needs of the model:

  • 1B and 3B models: Use NVIDIA L4 GPUs.
  • 11B models: Use NVIDIA A100 GPUs.
  • 90B models: Use NVIDIA H100 GPUs.

This example configures the deployment based on the model selection:

if"3.2-1B"inbase_model_nameor"3.2-3B"inbase_model_name:accelerator_type="NVIDIA_L4"machine_type="g2-standard-8"accelerator_count=1elif"3.2-11B"inbase_model_name:accelerator_type="NVIDIA_TESLA_A100"machine_type="a2-highgpu-1g"accelerator_count=1elif"3.2-90B"inbase_model_name:accelerator_type="NVIDIA_H100_80GB"machine_type="a3-highgpu-8g"accelerator_count=8else:raiseValueError(f"Recommended GPU setting not found for:{base_model_name}.")

Ensure that you have the required GPU quota:

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

Step 3: Deploy the model using vLLM

The following function handles the deployment of the Llama 3.2 model on Vertex AI. It configures the model's environment, memory utilization, and vLLM settings for efficient serving.

def deploy_model_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 12,
    model_type: str = None,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]
    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")
    if enforce_eager:
        vllm_args.append("--enforce-eager")
    if enable_lora:
        vllm_args.append("--enable-lora")
    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),
        serving_container_deployment_timeout=7200,
    )
    print(f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s).")

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint

Step 4: Execute deployment

Run the deployment function with the configured model and settings. The function will return both the model and endpoint instances, which you can use for inference.

model_name = common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve-vllm")

# These serving parameters are not defined earlier in this snippet; the values below
# match the multimodal configuration used elsewhere in this guide.
gpu_memory_utilization = 0.9
max_model_len = 4096
max_num_seqs = 12

models["vllm_gpu"], endpoints["vllm_gpu"] = deploy_model_vllm(
    model_name=model_name,
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    enforce_eager=True,
    use_dedicated_endpoint=use_dedicated_endpoint,
    max_num_seqs=max_num_seqs,
)
Figure 5: Llama 3.2 Deployment Endpoint in Vertex Dashboard

Depending on the model size, a new model deployment can take up to an hour to complete. You can check its progress on the online prediction page.

Inference with vLLM on Vertex AI using default prediction route

This section guides you through setting up inference for the Llama 3.2 Vision model on Vertex AI using the default prediction route. You'll use the vLLM library for efficient serving and interact with the model by sending a visual prompt in combination with text.

To get started, ensure your model endpoint is deployed and ready for predictions.

Step 1: Define your prompt and parameters

This example provides an image URL and a text prompt, which the model will process to generate a response.

Figure 6: Sample Image Input for prompting Llama 3.2
image_url="https://images.pexels.com/photos/1254140/pexels-photo-1254140.jpeg"raw_prompt="This is a picture of"# Reference prompt formatting guidelines here: https://www.Llama.com/docs/model-cards-and-prompt-formats/Llama3_2/#-base-model-promptprompt=f"<|begin_of_text|><|image|>{raw_prompt}"

Step 2: Configure prediction parameters

Adjust the following parameters to control the model's response:

max_tokens = 64
temperature = 0.5
top_p = 0.95

Step 3: Prepare the prediction request

Set up the prediction request with the image URL, prompt, and other parameters.

instances=[{"prompt":prompt,"multi_modal_data":{"image":image_url},"max_tokens":max_tokens,"temperature":temperature,"top_p":top_p,},]

Step 4: Make the prediction

Send the request to your Vertex AI endpoint and process the response:

response=endpoints["vllm_gpu"].predict(instances=instances)forraw_predictioninresponse.predictions:prediction=raw_prediction.split("Output:")print(prediction[1])

If you encounter a timeout issue (for example, ServiceUnavailable: 503 Took too long to respond when processing), try reducing the max_tokens value to a lower number, such as 20, to mitigate the response time.

Inference with vLLM on Vertex AI using OpenAI Chat Completion

This section covers how to perform inference on Llama 3.2 Vision models using the OpenAI Chat Completions API on Vertex AI. This approach lets you use multimodal capabilities by sending both images and text prompts to the model for more interactive responses.

Step 1: Execute deployment of Llama 3.2 Vision Instruct model

Run the deployment function with the configured model and settings. The function will return both the model and endpoint instances, which you can use for inference.

base_model_name="Llama-3.2-11B-Vision-Instruct"hf_model_id=f"meta-llama/{base_model_name}"model_name=common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve-vllm")model,endpoint=deploy_model_vllm(model_name=model_namemodel_id=hf_model_id,base_model_id=hf_model_id,service_account=SERVICE_ACCOUNT,machine_type="a2-highgpu-1g",accelerator_type="NVIDIA_TESLA_A100",accelerator_count=1,gpu_memory_utilization=0.9,max_model_len=4096,enforce_eager=True,max_num_seqs=12,)

Step 2: Configure endpoint resource

Begin by setting up the endpoint resource name for your Vertex AI deployment.

ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoint.name
)

Step 3: Install OpenAI SDK and authentication libraries

To send requests using OpenAI's SDK, ensure the necessary libraries are installed:

! pip install -qU openai google-auth requests

Step 4: Define input parameters for chat completion

Set up your image URL and text prompt that will be sent to the model. Adjust max_tokens and temperature to control the response length and randomness, respectively.

user_image="https://images.freeimages.com/images/large-previews/ab3/puppy-2-1404644.jpg"user_message="Describe this image?"max_tokens=50temperature=1.0

Step 5: Set up authentication and base URL

Retrieve your credentials and set the base URL for API requests.

import google.auth
import google.auth.transport.requests  # Added: needed for the Request object used below.
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

BASE_URL = f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"

try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

Step 6: Send Chat Completion request

Using OpenAI's Chat Completions API, send the image and text prompt to your Vertex AI endpoint:

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": user_image}},
                {"type": "text", "text": user_message},
            ],
        }
    ],
    temperature=temperature,
    max_tokens=max_tokens,
)

print(model_response)
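
The response follows the standard OpenAI Chat Completions schema, so you can pull out just the generated text, for example:

# Print only the generated text from the Chat Completions response.
print(model_response.choices[0].message.content)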

(Optional) Step 7: Reconnect to an existing endpoint

To reconnect to a previously created endpoint, use the endpoint ID. This step is useful if you want to reuse an endpoint instead of creating a new one.

endpoint_name=""aip_endpoint_name=(f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}")endpoint=aiplatform.Endpoint(aip_endpoint_name)

This setup provides flexibility to switch between newly created and existing endpoints as needed, allowing for streamlined testing and deployment.

Cleanup

To avoid ongoing charges and free up resources, make sure to delete the deployed models, endpoints, and optionally the storage bucket used for this experiment.

Step 1: Delete Endpoints and Models

The following code will undeploy each model and delete the associated endpoints:

# Undeploy models and delete endpoints.
for endpoint in endpoints.values():
    endpoint.delete(force=True)

# Delete models.
for model in models.values():
    model.delete()

Step 2: (Optional) Delete Cloud Storage Bucket

If you created a Cloud Storage bucket specifically for this experiment, you can delete it by setting delete_bucket to True. This step is optional but recommended if the bucket is no longer needed.

delete_bucket = False

if delete_bucket:
    ! gsutil -m rm -r $BUCKET_NAME

By following these steps, you ensure that all resources used in this tutorial are cleaned up, reducing any unnecessary costs associated with the experiment.

Debugging common issues

This section provides guidance on identifying and resolving common issues encountered during vLLM model deployment and inference on Vertex AI.

Check the logs

Check the logs to identify the root cause of deployment failures or unexpected behavior:

  1. Navigate to the Vertex AI Prediction Console: Go to the Vertex AI Prediction Console in the Google Cloud console.
  2. Select the endpoint: Click the endpoint experiencing issues. The status should indicate whether the deployment has failed.
  3. View logs: Click the endpoint and then navigate to the Logs tab or click View logs. This directs you to Cloud Logging, filtered to show logs specific to that endpoint and model deployment. You can also access logs through the Cloud Logging service directly (see the command-line sketch after this list).
  4. Analyze the logs: Review the log entries for error messages, warnings, and other relevant information. View timestamps to correlate log entries with specific actions. Look for issues around resource constraints (memory and CPU), authentication problems, or configuration errors.
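
If you prefer the command line, the following is a sketch of reading recent endpoint logs with the gcloud CLI from the notebook; the resource type filter is an assumption and may need adjusting for your project.

# Hedged sketch: read recent Vertex AI endpoint logs from Cloud Logging.
# The resource.type filter is an assumption; refine it to match your endpoint's log entries.
! gcloud logging read 'resource.type="aiplatform.googleapis.com/Endpoint"' --project=$PROJECT_ID --limit=50 --freshness=1d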

Common Issue 1: CUDA Out of Memory (OOM) during deployment

CUDA Out of Memory (OOM) errors occur when the model's memory usage exceeds the available GPU capacity.

In the case of the text-only model, we used the following engine arguments:

base_model_name="Meta-Llama-3.1-8B"hf_model_id=f"meta-llama/{base_model_name}"accelerator_type="NVIDIA_L4"accelerator_count=1machine_type="g2-standard-12"accelerator_count:int=1gpu_memory_utilization=0.9max_model_len=4096dtype="auto"max_num_seqs=256

In the case of the multimodal model, we used the following engine arguments:

base_model_name="Llama-3.2-11B-Vision-Instruct"hf_model_id=f"meta-llama/{base_model_name}"accelerator_type="NVIDIA_L4"accelerator_count=1machine_type="g2-standard-12"accelerator_count:int=1gpu_memory_utilization=0.9max_model_len=4096dtype="auto"max_num_seqs=12

Deploying the multimodal model with max_num_seqs = 256, as we did for the text-only model, could cause the following error:

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 39.38 GiB of which 3.76 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 34.94 GiB is allocated by PyTorch, and 175.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Figure 7: Out of Memory (OOM) GPU Error Log

Understand max_num_seqs and GPU memory:

  • The max_num_seqs parameter defines the maximum number of concurrent requests the model can handle.
  • Each sequence processed by the model consumes GPU memory. The total memory usage is proportional to max_num_seqs times the memory per sequence.
  • Text-only models (like Meta-Llama-3.1-8B) generally consume less memory per sequence than multimodal models (like Llama-3.2-11B-Vision-Instruct), which process both text and images.

Review the error log (Figure 7):

  • The log shows a torch.OutOfMemoryError when trying to allocate memory on the GPU.
  • The error occurs because the model's memory usage exceeds the available GPU capacity. The NVIDIA L4 GPU has 24 GB, and setting the max_num_seqs parameter too high for the multimodal model causes an overflow.
  • The log suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to improve memory management, though the primary issue here is high memory usage.
Figure 8: Failed Llama 3.2 Deployment
Figure 9: Model Version Details Panel

To resolve this issue, navigate to the Vertex AI Prediction Console and click the endpoint. The status should indicate that the deployment has failed. Click to view the logs and verify that max-num-seqs = 256. This value is too high for Llama-3.2-11B-Vision-Instruct; a more adequate value is 12.
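
A hedged sketch of the fix, reusing the deploy_model_vllm helper and the multimodal hardware settings defined earlier in this guide:

# Redeploy the multimodal model with a lower max_num_seqs so it fits in GPU memory.
model, endpoint = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve-vllm"),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    enforce_eager=True,
    max_num_seqs=12,  # 256 overflows GPU memory for Llama-3.2-11B-Vision-Instruct
)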

Common Issue 2: Hugging Face token needed

Hugging Face token errors occur when the model is gated and requires proper authentication credentials to be accessed.

The following screenshot displays a log entry in Google Cloud's Logs Explorer showing an error message related to accessing the Meta LLaMA-3.2-11B-Vision model hosted on Hugging Face. The error indicates that access to the model is restricted, requiring authentication to proceed. The message specifically states, "Cannot access gated repository for URL," highlighting that the model is gated and requires proper authentication credentials to be accessed. This log entry can help troubleshoot authentication issues when working with restricted resources in external repositories.

Figure 10: Hugging Face Token Error

To resolve this issue, verify the permissions of your Hugging Face access token. Copy the latest token and deploy a new endpoint.

Common Issue 3: Chat template needed

Chat template errors occur when the default chat template is no longer allowed, and a custom chat template must be provided if the tokenizer does not define one.

This screenshot shows a log entry in Google Cloud's Logs Explorer, where a ValueError occurs due to a missing chat template in the transformers library version 4.44. The error message indicates that the default chat template is no longer allowed, and a custom chat template must be provided if the tokenizer does not define one. This error highlights a recent change in the library requiring explicit definition of a chat template, which is useful to know when debugging chat-based deployments.

Figure 11: Chat Template Needed

To resolve this, make sure to provide a chat template during deployment using the --chat-template input argument. Sample templates can be found in the vLLM examples repository.
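
One way to do this, sketched under the assumption that you extend the vLLM server arguments built in deploy_model_vllm, is to append the flag with a template path that the serving container can read; the path below is a placeholder.

# Hedged sketch: pass a chat template to the vLLM server at deployment time.
# The template path is hypothetical; it must be readable from inside the serving container.
chat_template_path = "/path/to/chat_template.jinja"  # placeholder location
vllm_args.append(f"--chat-template={chat_template_path}")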

Common Issue 4: Model Max Seq Len

Model max sequence length errors occur when the model's max seq len (4096) is larger than the maximum number of tokens that can be stored in the KV cache (2256).

Figure 12: Max Seq Length Too Large

ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (2256). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

To resolve this problem, set max_model_len to 2048, which is less than 2256. Another resolution for this issue is to use more or larger GPUs; tensor-parallel-size must be set appropriately if you opt to use more GPUs.
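
A hedged sketch of the first option, reusing the deploy_model_vllm helper from this guide with a smaller context window:

# Redeploy with a max_model_len that fits in the available KV cache (< 2256 tokens).
model, endpoint = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix=f"{base_model_name}-serve-vllm"),
    model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=0.9,
    max_model_len=2048,  # was 4096; 2048 stays under the reported 2256-token KV cache limit
    enforce_eager=True,
)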

Model Garden vLLM container release notes

Main releases

Standard vLLM


Release date | Architecture | vLLM version | Container URI
Jul 17, 2025 | ARM | v0.9.2 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250717_0916_arm_RC01
Jul 10, 2025 | x86 | v0.9.2 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250710_0916_RC01
Jun 20, 2025 | x86 | Past v0.9.1, commit | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250620_0916_RC01
Jun 11, 2025 | x86 | v0.9.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250611_0916_RC01
Jun 2, 2025 | x86 | v0.9.0 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250601_0916_RC01
May 6, 2025 | x86 | v0.8.5.post1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250506_0916_RC01
Apr 29, 2025 | x86 | v0.8.4 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250429_0916_RC01, 20250430_0916_RC00_maas
Apr 17, 2025 | x86 | v0.8.4 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250417_0916_RC01
Apr 10, 2025 | x86 | Past v0.8.3, commit | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250410_0917_RC01
Apr 7, 2025 | x86 | v0.8.3 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250407_0917_RC01, 20250429_0916_RC00_maas
Apr 7, 2025 | x86 | v0.8.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250404_0916_RC01
Apr 5, 2025 | x86 | Past v0.8.2, commit | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250405_1205_RC01
Mar 31, 2025 | x86 | v0.8.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250401_0916_RC01
Mar 26, 2025 | x86 | v0.8.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250327_0916_RC01
Mar 23, 2025 | x86 | v0.8.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250324_0916_RC01
Mar 21, 2025 | x86 | v0.8.1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250321_0916_RC01
Mar 11, 2025 | x86 | Past v0.7.3, commit | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01
Mar 3, 2025 | x86 | v0.7.2 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250304_0916_RC01
Jan 14, 2025 | x86 | v0.6.4.post1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250114_0916_RC00_maas
Dec 2, 2024 | x86 | v0.6.4.post1 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241202_0916_RC00_maas
Nov 12, 2024 | x86 | v0.6.2 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241112_0916_RC00_maas
Oct 16, 2024 | x86 | v0.6.2 | us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241016_0916_RC00_maas

Optimized vLLM

Deprecated: Development on the Optimized vLLM container is discontinued.

Release date | Architecture | Container URI
Jan 21, 2025 | x86 | us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/pytorch-vllm-optimized-serve:20250121_0835_RC00
Oct 29, 2024 | x86 | us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/pytorch-vllm-optimized-serve:20241029_0835_RC00

Additional releases

The full list of Model Garden standard vLLM container releases can be found on the Artifact Registry page.

Releases for vLLM-TPU in experimental status are tagged with <yyyymmdd_hhmm_tpu_experimental_RC00>.


Last updated 2025-12-15 UTC.