Run a computation-based evaluation pipeline
You can evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The models are evaluated using a set of metrics against an evaluation dataset that you provide. This page explains how computation-based model evaluation through the evaluation pipeline service works, how to create and format the evaluation dataset, and how to perform the evaluation using the Google Cloud console, Vertex AI API, or the Vertex AI SDK for Python.
How computation-based model evaluation works
To evaluate the performance of a model, you first create an evaluation dataset that contains prompt and ground truth pairs. For each pair, the prompt is the input that you want to evaluate, and the ground truth is the ideal response for that prompt. During evaluation, the prompt in each pair of the evaluation dataset is passed to the model to produce an output. The output generated by the model and the ground truth from the evaluation dataset are used to compute the evaluation metrics.
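The following sketch shows this flow conceptually. The names `model_generate` and `compute_metrics` are hypothetical stand-ins for pipeline internals, not Vertex AI API calls:

```python
# Conceptual sketch of the evaluation loop; not part of any Vertex AI API.
def evaluate(model_generate, dataset, compute_metrics):
    outputs, references = [], []
    for pair in dataset:  # each pair has "input_text" and "output_text"
        outputs.append(model_generate(pair["input_text"]))
        references.append(pair["output_text"])
    # Compare model outputs against the ground truths to produce the metrics.
    return compute_metrics(outputs, references)
```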
The type of metrics used for evaluation depends on the task that you are evaluating. The following table shows the supported tasks and the metrics used to evaluate each task:
| Task | Metric |
|---|---|
| Classification | Micro-F1, Macro-F1, Per class F1 |
| Summarization | ROUGE-L |
| Question answering | Exact Match |
| Text generation | BLEU, ROUGE-L |
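The pipeline computes these metrics for you, but the following sketch illustrates what two of them measure. It assumes scikit-learn is installed and uses made-up labels and answers:

```python
from sklearn.metrics import f1_score

# Hypothetical classification outputs and ground truths.
predictions = ["sports", "news", "sports", "health"]
ground_truths = ["sports", "news", "health", "health"]

# Micro-F1 aggregates counts across all classes; Macro-F1 averages per-class F1.
print("Micro-F1:", f1_score(ground_truths, predictions, average="micro"))
print("Macro-F1:", f1_score(ground_truths, predictions, average="macro"))

# Exact Match (question answering): the fraction of outputs that match the
# ground truth exactly.
answers = ["Paris", "1969"]
references = ["Paris", "1968"]
print("Exact Match:", sum(a == r for a, r in zip(answers, references)) / len(answers))
```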
Supported models
Model evaluation is supported for the following models:
- text-bison: Base and tuned versions.
- Gemini: All tasks except classification.
Prepare evaluation dataset
The evaluation dataset that's used for model evaluation includes prompt and ground truth pairs that align with the task that you want to evaluate. Your dataset must include at least one prompt and ground truth pair; for meaningful metrics, include at least 10 pairs. The more examples you give, the more meaningful the results.
Dataset format
Your evaluation dataset must be in JSON Lines (JSONL) format, where each line contains a single prompt and ground truth pair specified in the `input_text` and `output_text` fields, respectively. The `input_text` field contains the prompt that you want to evaluate, and the `output_text` field contains the ideal response for the prompt.

The maximum token length for `input_text` is 8,192, and the maximum token length for `output_text` is 1,024.
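For example, two lines of a question answering dataset might look like the following (the prompts and responses are made up for illustration):

```json
{"input_text": "Question: What is the capital of France?", "output_text": "Paris"}
{"input_text": "Question: In what year did the first moon landing occur?", "output_text": "1969"}
```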
Upload evaluation dataset to Cloud Storage
You can either create a new Cloud Storage bucket or use an existing one to store your dataset file. The bucket must be in the same region as the model.

After your bucket is ready, upload your dataset file to the bucket.
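As a sketch, you can upload the file programmatically with the google-cloud-storage Python client; the project, bucket, and file names below are hypothetical:

```python
from google.cloud import storage

# Hypothetical project, bucket, and object names; replace with your own.
client = storage.Client(project="my-project")
bucket = client.bucket("my-evaluation-bucket")
blob = bucket.blob("reference-datasets/eval_data.jsonl")
blob.upload_from_filename("eval_data.jsonl")  # local file to upload
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```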
Perform model evaluation
You can evaluate models by using the REST API, the Vertex AI SDK for Python, or the Google Cloud console.
Permissions required for this task
To perform this task, you must grant Identity and Access Management (IAM) roles to each of the following service accounts:
| Service account | Default principal | Description | Roles |
|---|---|---|---|
| Vertex AI Service Agent | service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com | The Vertex AI Service Agent is automatically provisioned for your project and granted a predefined role. However, if an org policy modifies the default permissions of the Vertex AI Service Agent, you must manually grant the role to the service agent. | Vertex AI Service Agent (roles/aiplatform.serviceAgent) |
| Vertex AI Pipelines Service Account | PROJECT_NUMBER-compute@developer.gserviceaccount.com | The service account that runs the pipeline. The default is the Compute Engine default service account. Optionally, you can use a custom service account instead. | |
Depending on your input and output data sources, you may also need to grant the Vertex AI Pipelines Service Account additional roles:
| Data source | Role | Where to grant the role |
|---|---|---|
| Standard BigQuery table | BigQuery Data Editor | Project that runs the pipeline |
| BigQuery Data Viewer | Project that the table belongs to | |
| BigQuery view of a standard BigQuery table | BigQuery Data Editor | Project that runs the pipeline |
| BigQuery Data Viewer | Project that the view belongs to | |
| BigQuery Data Viewer | Project that the table belongs to | |
| BigQuery external table that has a source Cloud Storage file | BigQuery Data Editor | Project that runs the pipeline |
| BigQuery Data Viewer | Project that the external table belongs to | |
| Storage Object Viewer | Project that the source file belongs to | |
| BigQuery view of a BigQuery external table that has a source Cloud Storage file | BigQuery Data Editor | Project that runs the pipeline |
| BigQuery Data Viewer | Project that the view belongs to | |
| BigQuery Data Viewer | Project that the external table belongs to | |
| Storage Object Viewer | Project that the source file belongs to | |
| Cloud Storage file | BigQuery Data Viewer | Project that runs the pipeline |
REST
To create a model evaluation job, send a POST request by using the pipelineJobs method.
Before using any of the request data, make the following replacements:
- PROJECT_ID: The Google Cloud project that runs the pipeline components.
- PIPELINEJOB_DISPLAYNAME: A display name for the pipelineJob.
- LOCATION: The region to run the pipeline components. Currently, only `us-central1` is supported.
- DATASET_URI: The Cloud Storage URI of your reference dataset. You can specify one or multiple URIs. This parameter supports wildcards. To learn more about this parameter, see InputConfig.
- OUTPUT_DIR: The Cloud Storage URI to store evaluation output.
- MODEL_NAME: Specify a publisher model or a tuned model resource as follows:
  - Publisher model: `publishers/google/models/MODEL@MODEL_VERSION`. Example: `publishers/google/models/text-bison@002`
  - Tuned model: `projects/PROJECT_NUMBER/locations/LOCATION/models/ENDPOINT_ID`. Example: `projects/123456789012/locations/us-central1/models/1234567890123456789`

  The evaluation job doesn't impact any existing deployments of the model or their resources.
- EVALUATION_TASK: The task that you want to evaluate the model on. The evaluation job computes a set of metrics relevant to that specific task. Acceptable values include the following:
  - `summarization`
  - `question-answering`
  - `text-generation`
  - `classification`
- INSTANCES_FORMAT: The format of your dataset. Currently, only `jsonl` is supported. To learn more about this parameter, see InputConfig.
- PREDICTIONS_FORMAT: The format of the evaluation output. Currently, only `jsonl` is supported. To learn more about this parameter, see InputConfig.
- MACHINE_TYPE: (Optional) The machine type for running the evaluation job. The default value is `e2-highmem-16`. For a list of supported machine types, see Machine types.
- SERVICE_ACCOUNT: (Optional) The service account to use for running the evaluation job. To learn how to create a custom service account, see Configure a service account with granular permissions. If unspecified, the Vertex AI Custom Code Service Agent is used.
- NETWORK: (Optional) The fully qualified name of the Compute Engine network to peer the evaluation job to. The network name format is `projects/PROJECT_NUMBER/global/networks/NETWORK_NAME`. If you specify this field, you need a VPC Network Peering for Vertex AI. If left unspecified, the evaluation job is not peered with any network.
- KEY_NAME: (Optional) The name of the customer-managed encryption key (CMEK). If configured, resources created by the evaluation job are encrypted using the provided encryption key. The key name format is `projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY`. The key must be in the same region as the evaluation job.
HTTP method and URL:
```
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs
```
Request JSON body:
{ "displayName": "PIPELINEJOB_DISPLAYNAME", "runtimeConfig": { "gcsOutputDirectory": "gs://OUTPUT_DIR", "parameterValues": { "project": "PROJECT_ID", "location": "LOCATION", "batch_predict_gcs_source_uris": ["gs://DATASET_URI"], "batch_predict_gcs_destination_output_uri": "gs://OUTPUT_DIR", "model_name": "MODEL_NAME", "evaluation_task": "EVALUATION_TASK", "batch_predict_instances_format": "INSTANCES_FORMAT", "batch_predict_predictions_format: "PREDICTIONS_FORMAT", "machine_type": "MACHINE_TYPE", "service_account": "SERVICE_ACCOUNT", "network": "NETWORK", "encryption_spec_key_name": "KEY_NAME" } }, "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"}To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
```bash
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"
```
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
```powershell
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs" | Select-Object -Expand Content
```
You should receive a JSON response similar to the following. Note that `pipelineSpec` has been truncated to save space.
Response
........... "state": "PIPELINE_STATE_PENDING", "labels": { "vertex-ai-pipelines-run-billing-id": "1234567890123456789" }, "runtimeConfig": { "gcsOutputDirectory": "gs://my-evaluation-bucket/output", "parameterValues": { "project": "my-project", "location": "us-central1", "batch_predict_gcs_source_uris": [ "gs://my-evaluation-bucket/reference-datasets/eval_data.jsonl" ], "batch_predict_gcs_destination_output_uri": "gs://my-evaluation-bucket/output", "model_name": "publishers/google/models/text-bison@002" } }, "serviceAccount": "123456789012-compute@developer.gserviceaccount.com", "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1", "templateMetadata": { "version": "sha256:d4c0d665533f6b360eb474111aa5e00f000fb8eac298d367e831f3520b21cb1a" }}Example curl command
```bash
PROJECT_ID=myproject
REGION=us-central1
MODEL_NAME=publishers/google/models/text-bison@002
TEST_DATASET_URI=gs://my-gcs-bucket-uri/dataset.jsonl
OUTPUT_DIR=gs://my-gcs-bucket-uri/output

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/pipelineJobs" \
  -d $'{
    "displayName": "evaluation-llm-text-generation-pipeline",
    "runtimeConfig": {
      "gcsOutputDirectory": "'${OUTPUT_DIR}'",
      "parameterValues": {
        "project": "'${PROJECT_ID}'",
        "location": "'${REGION}'",
        "batch_predict_gcs_source_uris": ["'${TEST_DATASET_URI}'"],
        "batch_predict_gcs_destination_output_uri": "'${OUTPUT_DIR}'",
        "model_name": "'${MODEL_NAME}'"
      }
    },
    "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
  }'
```

Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
```python
import os

from google.auth import default
import vertexai
from vertexai.preview.language_models import (
    EvaluationTextClassificationSpec,
    TextGenerationModel,
)

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")


def evaluate_model() -> object:
    """Evaluate the performance of a generative AI model."""

    # Set credentials for the pipeline components used in the evaluation task
    credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

    vertexai.init(project=PROJECT_ID, location="us-central1", credentials=credentials)

    # Create a reference to a generative AI model
    model = TextGenerationModel.from_pretrained("text-bison@002")

    # Define the evaluation specification for a text classification task
    task_spec = EvaluationTextClassificationSpec(
        ground_truth_data=[
            "gs://cloud-samples-data/ai-platform/generative_ai/llm_classification_bp_input_prompts_with_ground_truth.jsonl"
        ],
        class_names=["nature", "news", "sports", "health", "startups"],
        target_column_name="ground_truth",
    )

    # Evaluate the model
    eval_metrics = model.evaluate(task_spec=task_spec)
    print(eval_metrics)
    # Example response:
    # ...
    # PipelineJob run completed.
    # Resource name: projects/123456789/locations/us-central1/pipelineJobs/evaluation-llm-classification-...
    # EvaluationClassificationMetric(label_name=None, auPrc=0.53833705, auRoc=0.8...

    return eval_metrics
```

Console
To create a model evaluation job by using the Google Cloud console, perform the following steps:
- In the Google Cloud console, go to theVertex AI Model Registry page.
- Click the name of the model that you want to evaluate.
- In the Evaluate tab, click Create evaluation and configure as follows:
  - Objective: Select the task that you want to evaluate.
  - Target column or field: (Classification only) Enter the target column for prediction. Example: `ground_truth`.
  - Source path: Enter or select the URI of your evaluation dataset.
  - Output format: Enter the format of the evaluation output. Currently, only `jsonl` is supported.
  - Cloud Storage path: Enter or select the URI to store evaluation output.
  - Class names: (Classification only) Enter the list of possible class names.
  - Number of compute nodes: Enter the number of compute nodes to run the evaluation job.
  - Machine type: Select a machine type to use for running the evaluation job.
- Click Start evaluation.
View evaluation results
You can find the evaluation results in the Cloud Storage output directory that you specified when creating the evaluation job. The file is named `evaluation_metrics.json`.
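As a sketch, you can fetch the file programmatically; the project, bucket, and prefix below are hypothetical, and the pipeline writes the file under a run-specific subdirectory of your output directory:

```python
import json

from google.cloud import storage

# Hypothetical project, bucket, and output prefix; replace with your own.
client = storage.Client(project="my-project")
bucket = client.bucket("my-evaluation-bucket")
for blob in bucket.list_blobs(prefix="output/"):
    if blob.name.endswith("evaluation_metrics.json"):
        metrics = json.loads(blob.download_as_text())
        print(blob.name, metrics)
```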
For tuned models, you can also view evaluation results in the Google Cloud console:
In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
Click the name of the model to view its evaluation metrics.
In the Evaluate tab, click the name of the evaluation run that you want to view.
What's next
- Learn about generative AI evaluation.
- Learn about online evaluation with Gen AI Evaluation Service.
- Learn how to tune a foundation model.