Run an evaluation
You can use the Gen AI Evaluation module of the Vertex AI SDK for Python to programmatically evaluate your generative language models and applications with the Gen AI evaluation service API. This page shows you how to run evaluations with the Vertex AI SDK. Note that evaluations at scale are available only through the REST API.
Before you begin
Install the Vertex AI SDK
To install the Gen AI Evaluation module from the Vertex AI SDK for Python, run the following command:

```
!pip install -q google-cloud-aiplatform[evaluation]
```

For more information, see Install the Vertex AI SDK for Python.
Authenticate the Vertex AI SDK
After you install the Vertex AI SDK for Python, you need to authenticate. The following topics explain how to authenticate with the Vertex AI SDK if you're working locally and if you're working in Colaboratory:
If you're developing locally, set upApplication Default Credentials (ADC)in your local environment:
Install the Google Cloud CLI, then initialize it by running the following command:

```
gcloud init
```

Create local authentication credentials for your Google Account:

```
gcloud auth application-default login
```

A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information, see Set up ADC for a local development environment.
If you're working in Colaboratory, run the following command in a Colab cell to authenticate:

```python
from google.colab import auth

auth.authenticate_user()
```

This command opens a window where you can complete the authentication.
Understanding service accounts
The service account is used by the Gen AI evaluation service to get predictions from the Gemini API in Vertex AI for model-based evaluation metrics. This service account is automatically provisioned on the first request to the Gen AI evaluation service.
| Name | Description | Email address | Role |
|---|---|---|---|
| Vertex AI Rapid Eval Service Agent | The service account used to get predictions for model based evaluation. | service-PROJECT_NUMBER@gcp-sa-vertex-eval.iam.gserviceaccount.com | roles/aiplatform.rapidevalServiceAgent |
The permissions associated with the rapid evaluation service agent are:
| Role | Permissions |
|---|---|
| Vertex AI Rapid Eval Service Agent (roles/aiplatform.rapidevalServiceAgent) | aiplatform.endpoints.predict |
Run your evaluation
Use the EvalTask class to run evaluations for the following use cases:

- Run evaluations at scale (preview)
EvalTask class
The EvalTask class helps you evaluate models and applications based on specific tasks. To make fair comparisons between generative models, you typically need to repeatedly evaluate various models and prompt templates against a fixed evaluation dataset using specific metrics. It's also important to evaluate multiple metrics simultaneously within a single evaluation run.
EvalTask also integrates with Vertex AI Experiments to help you track configurations and results for each evaluation run. Vertex AI Experiments aids in managing and interpreting evaluation results, empowering you to make informed decisions.
The following example demonstrates how to instantiate the EvalTask class and run an evaluation:

```python
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    MetricPromptTemplateExamples,
)

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
)

eval_result = eval_task.evaluate(
    model=MODEL,
    prompt_template=PROMPT_TEMPLATE,
    experiment_run=EXPERIMENT_RUN,
)
```

Run evaluation with model-based metrics
For model-based metrics, use the PointwiseMetric and PairwiseMetric classes to define metrics tailored to your specific criteria. Run evaluations using the following options:
Use model-based metric examples
You can directly use the built-in constant Metric Prompt Template Examples within the Vertex AI SDK. Alternatively, modify and incorporate them in the free-form metric definition interface.
For the full list of Metric Prompt Template Examples covering most key use cases, see Metric prompt templates.
Console
When you're running evaluations in a Colab Enterprise notebook, you can access metric prompt templates directly within the Google Cloud console.

1. Click the link for your preferred Gen AI evaluation service notebook.
2. The notebook opens in GitHub. Click Open in Colab Enterprise. If a dialog asks you to enable APIs, click Enable.
3. Click the Gen AI Evaluation icon in the sidebar. A Pre-built metric templates panel opens.
4. Select Pointwise or Pairwise metrics.
5. Click the metric you want to use, such as Fluency. The code sample for the metric appears.
6. Click Copy to copy the code sample. Optionally, click Customize to change pre-set fields for the metric.
7. Paste the code sample into your notebook.
Vertex AI SDK
The following Vertex AI SDK example shows how to use the MetricPromptTemplateExamples class to define your metrics:

```python
# View all the available examples of model-based metrics
MetricPromptTemplateExamples.list_example_metric_names()

# Display the metric prompt template of a specific example metric
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))

# Use the pre-defined model-based metrics directly
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)
```

Use a model-based metric templated interface
Customize your metrics by populating fields like Criteria and Rating Rubrics using the PointwiseMetricPromptTemplate and PairwiseMetricPromptTemplate classes within the Vertex AI SDK. Certain fields, such as Instruction, are assigned a default value if you don't provide input.
Optionally, you can specify input_variables, which is a list of input fields used by the metric prompt template to generate model-based evaluation results. By default, the model's response column is included for pointwise metrics, and both the candidate model's response and baseline_model_response columns are included for pairwise metrics.
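The column contract described above can be illustrated with plain Python. This is an illustrative sketch only, not Vertex AI SDK code; the helper names are hypothetical:

```python
# Illustrative only -- not Vertex AI SDK code. Each evaluation dataset row must
# supply the metric's input variables, plus "response" for pointwise metrics
# and additionally "baseline_model_response" for pairwise metrics.

def required_columns(input_variables, pairwise=False):
    """Return the full set of dataset columns a metric prompt template needs."""
    columns = set(input_variables) | {"response"}
    if pairwise:
        columns.add("baseline_model_response")
    return columns

def missing_columns(dataset_rows, input_variables, pairwise=False):
    """Report which required columns are absent from any dataset row."""
    needed = required_columns(input_variables, pairwise)
    missing = set()
    for row in dataset_rows:
        missing |= needed - row.keys()
    return sorted(missing)

# A pointwise metric with input_variables=["prompt"] needs "prompt" and "response".
rows = [{"prompt": "Summarize this article.", "response": "A short summary."}]
print(missing_columns(rows, ["prompt"]))                 # []
print(missing_columns(rows, ["prompt"], pairwise=True))  # ['baseline_model_response']
```

The same rows that satisfy a pointwise metric fail the pairwise check until a baseline_model_response column is added.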
For additional information, refer to the "Structure a metric prompt template" section in Metric prompt templates.
```python
# Define a pointwise metric with two custom criteria
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria.",
            "-1": "The response falls short on both criteria.",
        },
        input_variables=["prompt"],
    ),
)

# Display the serialized metric prompt template
print(custom_text_quality.metric_prompt_template)

# Run evaluation using the custom_text_quality metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[custom_text_quality],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)
```

Use the model-based metric free-form SDK interface
For more flexibility in customizing the metric prompt template, you can define a metric directly using the free-form interface, which accepts a direct string input.
```python
# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness. Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

freeform_multi_turn_chat_quality_metric = PointwiseMetric(
    metric="multi_turn_chat_quality_metric",
    metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

# Run evaluation using the freeform_multi_turn_chat_quality_metric metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[freeform_multi_turn_chat_quality_metric],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)
```

Evaluate a translation model
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
To evaluate your translation model, you can specify BLEU, MetricX, or COMET as evaluation metrics when using the Vertex AI SDK.
```python
import pandas as pd

# Prepare the dataset for evaluation.
sources = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

responses = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

references = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]

eval_dataset = pd.DataFrame(
    {
        "source": sources,
        "response": responses,
        "reference": references,
    }
)

# Set the metrics.
metrics = [
    "bleu",
    pointwise_metric.Comet(),
    pointwise_metric.MetricX(),
]

eval_task = evaluation.EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
)

eval_result = eval_task.evaluate()
```

Run evaluation with computation-based metrics
You can use computation-based metrics standalone, or together with model-based metrics.
```python
# Combine computation-based metrics "ROUGE" and "BLEU" with model-based metrics
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=["rouge_l_sum", "bleu", custom_text_quality],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)
```

Run evaluations at scale
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
If you have large evaluation datasets or periodically run evaluations in a production environment, you can use the EvaluateDataset API in the Gen AI evaluation service to run evaluations at scale.
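The dataset this API reads is a JSONL file in Cloud Storage, with one evaluation instance per line. A minimal sketch of producing such a file, assuming a pointwise metric whose template only needs a response field (the file name and instance contents are examples):

```python
import json

# Illustrative sketch: build the JSONL evaluation dataset that the
# EvaluateDataset API reads from Cloud Storage. Each line is one instance;
# keys must match the input fields of your metric_prompt_template.
instances = [
    {"response": "The fire could be stopped."},
    {"response": "Schools and kindergartens opened."},
]

with open("eval_instances.jsonl", "w", encoding="utf-8") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")

# Verify the file parses back, one JSON object per line.
with open("eval_instances.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

print(len(parsed))  # 2
```

After writing the file, upload it to a Cloud Storage bucket (for example with gsutil or the Cloud Storage client library) and reference its gs:// path in the request.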
Before using any of the request data, make the following replacements:
- PROJECT_NUMBER: Your project number.
- DATASET_URI: The Cloud Storage path to a JSONL file that contains evaluation instances. Each line in the file should represent a single instance, with keys corresponding to user-defined input fields in the metric_prompt_template (for model-based metrics) or required input parameters (for computation-based metrics). You can only specify one JSONL file. The following example is a line for a pointwise evaluation instance:

  {"response": "The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."}

- METRIC_SPEC: One or more metric specs you are using for evaluation. You can use the following metric specs when running evaluations at scale: "pointwise_metric_spec", "pairwise_metric_spec", "exact_match_spec", "bleu_spec", and "rouge_spec".
- METRIC_SPEC_FIELD_NAME: The required fields for your chosen metric spec. For example, "metric_prompt_template".
- METRIC_SPEC_FIELD_CONTENT: The field content for your chosen metric spec. For example, you can use the following field content for a pointwise evaluation: "Evaluate the fluency of this sentence: {response}. Give score from 0 to 1. 0 - not fluent at all. 1 - very fluent."
- OUTPUT_BUCKET: The name of the Cloud Storage bucket where you want to store evaluation results.
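Putting these replacements together, a request body for a pointwise fluency evaluation might be assembled like this. This is a sketch with example bucket paths, not required values:

```python
import json

# Illustrative sketch: fill in the evaluateDataset request body with a
# pointwise metric spec. The bucket paths are examples, not required names.
dataset_uri = "gs://my-bucket/eval_instances.jsonl"    # DATASET_URI (example)
output_bucket = "gs://my-bucket/eval_results"          # OUTPUT_BUCKET (example)

request_body = {
    "dataset": {"gcs_source": {"uris": dataset_uri}},
    "metrics": [
        {
            "pointwise_metric_spec": {                 # METRIC_SPEC
                "metric_prompt_template": (            # METRIC_SPEC_FIELD_NAME
                    "Evaluate the fluency of this sentence: {response}. "
                    "Give score from 0 to 1. 0 - not fluent at all. "
                    "1 - very fluent."                 # METRIC_SPEC_FIELD_CONTENT
                )
            }
        }
    ],
    "output_config": {"gcs_destination": {"output_uri_prefix": output_bucket}},
}

# Serialize for use as the request.json payload in the curl examples.
with open("request.json", "w", encoding="utf-8") as f:
    json.dump(request_body, f, indent=2)
```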
HTTP method and URL:
POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset
Request JSON body:
```
{
  "dataset": {
    "gcs_source": {
      "uris": "DATASET_URI"
    }
  },
  "metrics": [
    {
      METRIC_SPEC: {
        METRIC_SPEC_FIELD_NAME: METRIC_SPEC_FIELD_CONTENT
      }
    }
  ],
  "output_config": {
    "gcs_destination": {
      "output_uri_prefix": "OUTPUT_BUCKET"
    }
  }
}
```

To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset"
```
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
```
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset" | Select-Object -Expand Content
```
You should receive a JSON response similar to the following.
Response
```
{
  "name": "projects/PROJECT_NUMBER/locations/us-central1/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.EvaluateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": CREATE_TIME,
      "updateTime": UPDATE_TIME
    }
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.EvaluateDatasetResponse",
    "outputInfo": {
      "gcsOutputDirectory": "gs://OUTPUT_BUCKET/evaluation_GENERATION_TIME"
    }
  }
}
```

You can use the OPERATION_ID you receive in the response to request the status of the evaluation:
```
curl -X GET \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/operations/OPERATION_ID"
```
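If you poll the operation from code instead of curl, the operation name in the response maps directly onto the status URL. A stdlib-only sketch with made-up project and operation IDs:

```python
import json

# Illustrative sketch: extract the operation name from an evaluateDataset
# response and build the polling URL used in the status request above.
response_text = json.dumps({
    "name": "projects/123456/locations/us-central1/operations/987654",
    "done": False,
})

operation_name = json.loads(response_text)["name"]
poll_url = (
    "https://us-central1-aiplatform.googleapis.com/v1beta1/" + operation_name
)

print(poll_url)
```

Issue an authenticated GET against this URL (with a Bearer token, as in the curl example) until the returned operation has "done": true.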
Additional metric customization
If you need to further customize your metrics, such as choosing a different judge model for model-based metrics or defining a new computation-based metric, you can use the CustomMetric class in the Vertex AI SDK. For more details, see the following notebooks:
To see an example of bringing your own judge model with CustomMetric, run the "Bring your own judge model using Custom Metric" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
To see an example of bringing your own computation-based metric with CustomMetric, run the "Bring your own computation-based Custom Metric" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Run model-based evaluation with increased rate limits and quota
A single evaluation request for a model-based metric results in multiple underlying requests to the Gemini API in Vertex AI and consumes quota for the judge model. You should set a higher evaluation service rate limit in the following use cases:
- Increased data volume: If you're processing significantly more data using the model-based metrics, you might hit the default requests per minute (RPM) quota. Increasing the quota lets you handle the larger volume without performance degradation or interruptions.
- Faster evaluation: If your application requires quicker turnaround time for evaluations, you might need a higher RPM quota. This is especially important for time-sensitive applications or those with real-time interactions where delays in evaluation can impact the user experience.
- Complex evaluation tasks: A higher RPM quota ensures you have enough capacity to handle resource-intensive evaluations for complex tasks or large amounts of text.
- High user concurrency: If you anticipate a large number of users simultaneously requesting model-based evaluations and model inference within your project, a higher model RPM limit is crucial to prevent bottlenecks and maintain responsiveness.
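The quota math behind these scenarios can be sketched in plain Python. The conversion mirrors the formula used with the evaluation_service_qps parameter on this page; the 4-sample divisor is the default number of samples per request stated there, and the helper name is illustrative:

```python
# Illustrative sketch of the RPM-to-QPS conversion: the evaluation service QPS
# is derived from the judge model's RPM quota, divided by 60 seconds and the
# default number of samples per evaluation request (4).
DEFAULT_SAMPLE_COUNT = 4

def evaluation_service_qps(judge_model_rpm, samples=DEFAULT_SAMPLE_COUNT):
    """Convert a judge-model requests-per-minute quota to an eval-service QPS."""
    return judge_model_rpm / 60 / samples

# With a 1000 RPM quota, the evaluation service can run at roughly 4.17 QPS.
print(round(evaluation_service_qps(1000), 2))  # 4.17
```

Raising the RPM quota raises this ceiling proportionally; for example, a 240 RPM quota supports 1 QPS.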
If you're using the default judge model of gemini-2.0-flash or newer models, we recommend that you use Provisioned Throughput to manage your quota.
For models older than gemini-2.0-flash, use the following instructions to increase the judge model RPM quota:
1. In the Google Cloud console, go to the IAM & Admin Quotas page.
2. In the Filter field, specify the Dimension (model identifier) and the Metric (quota identifier for Gemini models): base_model:gemini-2.0-flash and Metric: aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model.
3. For the quota that you want to increase, click the more actions menu button.
4. In the drop-down menu, click Edit quota. The Quota changes panel opens.
5. Under Edit quota, enter a new quota value.
6. Click Submit request.
A Quota Increase Request (QIR) is confirmed by email and typically takes two business days to process.
To run an evaluation using a new quota, set the evaluation_service_qps parameter as follows:

```python
from vertexai.evaluation import EvalTask

# GEMINI_RPM is the requests per minute (RPM) quota for gemini-2.0-flash-001 in your region
# Evaluation Service QPS limit is equal to (gemini-2.0-flash-001 RPM / 60 sec / default number of samples)
CUSTOM_EVAL_SERVICE_QPS_LIMIT = GEMINI_RPM / 60 / 4

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
)

eval_result = eval_task.evaluate(
    evaluation_service_qps=CUSTOM_EVAL_SERVICE_QPS_LIMIT,
    # Specify a retry_timeout limit for a more responsive evaluation run;
    # the default value is 600 (in seconds, or 10 minutes)
    retry_timeout=RETRY_TIMEOUT,
)
```

For more information about quotas and limits, see Gen AI evaluation service quotas and the Gen AI evaluation service API.
What's next
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.