Define your evaluation metrics
The first step to evaluate your generative models or applications is to identify your evaluation goal and define your evaluation metrics. This page provides an overview of concepts related to defining evaluation metrics for your use case.
Overview
Generative AI models can be used to create applications for a wide range of tasks, such as summarizing news articles, responding to customer inquiries, or assisting with code writing. The Gen AI evaluation service in Vertex AI lets you evaluate any model with explainable metrics.
For example, you might be developing an application to summarize articles. To evaluate your application's performance on that specific task, consider the criteria you would like to measure and the metrics that you would use to score them:
- Criteria: One or more dimensions you want to evaluate on, such as conciseness, relevance, correctness, or appropriate choice of words.
- Metrics: A single score that measures the model output against the criteria.
The Gen AI evaluation service provides two major types of metrics:
Model-based metrics: Our model-based metrics assess your candidate model against a judge model. The judge model for most use cases is Gemini, but you can also use models such as MetricX or COMET for translation use cases.
You can measure model-based metrics pairwise or pointwise:
Pointwise metrics: Let the judge model assess the candidate model's output based on the evaluation criteria. For example, the score could range from 0 to 5, where 0 means the response does not fit the criteria, while 5 means the response fits the criteria well.
Pairwise metrics: Let the judge model compare the responses of two models and pick the better one. This is often used when comparing a candidate model with the baseline model. Pairwise metrics are only supported with Gemini as a judge model.
Computation-based metrics: These metrics are computed using mathematical formulas to compare the model's output against a ground truth or reference. Commonly used computation-based metrics include ROUGE and BLEU.
You can use computation-based metrics standalone, or together with model-based metrics. Use the following table to decide when to use model-based or computation-based metrics:
| Evaluation approach | Description | Data | Cost and speed |
|---|---|---|---|
| Model-based metrics | Use a judge model to assess performance based on descriptive evaluation criteria | Ground truth is optional | Slightly more expensive and slower |
| Computation-based metrics | Use mathematical formulas to assess performance | Ground truth is usually required | Low cost and fast |
To get started, see Prepare your dataset and Run evaluation.
Define your model-based metrics
Model-based evaluation involves using a machine learning model as a judge model to evaluate the outputs of the candidate model.
Proprietary Google judge models, such as Gemini, are calibrated with human raters to ensure their quality. They are managed and available out of the box. The process of model-based evaluation varies based on the evaluation metrics you provide.
Model-based evaluation follows this process:
Data preparation: You provide evaluation data in the form of input prompts. The candidate models receive the prompts and generate corresponding responses.
Evaluation: The evaluation metrics and generated responses are sent to the judge model. The judge model evaluates each response individually, providing a row-based assessment.
Aggregation and explanation: Gen AI evaluation service aggregates these individual assessments into an overall score. The output also includes chain-of-thought explanations for each judgment, outlining the rationale behind the selection.
Gen AI evaluation service offers the following options to set up your model-based metrics with the Vertex AI SDK:
| Option | Description | Best for |
|---|---|---|
| Use an existing example | Use a prebuilt metric prompt template to get started. | Common use cases, time-saving |
| Define metrics with our templated interface | Get guided assistance in defining your metrics. Our templated interface provides structure and suggestions. | Customization with support |
| Define metrics from scratch | Have complete control over your metric definitions. | Ideal for highly specific use cases. Requires more technical expertise and time investment. |
As an example, you might want to develop a generative AI application that returns fluent and entertaining responses. For this application, you can define two criteria for evaluation using the templated interface:
Fluency: Sentences flow smoothly, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.
Entertainment: Short, amusing text that incorporates emoji, exclamations, and questions to convey quick and spontaneous communication and diversion.
To turn those two criteria into a metric, you want an overall score ranging from -1 to 1 called `custom_text_quality`. You can define a metric like this:
```python
# Define a pointwise metric with two criteria: Fluency and Entertaining.
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria.",
            "-1": "The response falls short on both criteria.",
        },
    ),
)
```

For a complete list of metric prompt templates, see Metric prompt templates for evaluation.
Evaluate translation models
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The Gen AI evaluation service offers the following translation task evaluation metrics:
MetricX and COMET are pointwise model-based metrics that have been trained for translation tasks. You can evaluate the quality and accuracy of translation model results for your content, whether they are outputs of NMT, TranslationLLM, or Gemini models.
You can also use Gemini as a judge model to evaluate your model for fluency, coherence, verbosity, and text quality in combination with MetricX, COMET, or BLEU.
MetricX is an error-based metric developed by Google that predicts a floating-point score between 0 and 25 representing the quality of a translation. MetricX is available as both a reference-based and a reference-free (QE) method. When you use this metric, a lower score is better, because it means there are fewer errors.
COMET employs a reference-based regression approach that provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
BLEU (Bilingual Evaluation Understudy) is a computation-based metric. The BLEU score indicates how similar the candidate text is to the reference text. A BLEU score closer to one indicates that a translation is closer to the reference text.
Note that BLEU scores are not recommended for comparing across different corpora and languages. For example, an English to German BLEU score of 50 is not comparable to a Japanese to English BLEU score of 50. Many translation experts have shifted to model-based metric approaches, which have higher correlation with human ratings and are more granular in identifying error scenarios.
To learn how to run evaluations for translation models, see Evaluate a translation model.
Choose between pointwise or pairwise evaluation
Use the following table to decide when you want to use pointwise or pairwise evaluation:
| Evaluation approach | Definition | When to use |
|---|---|---|
| Pointwise evaluation | Evaluate one model and generate scores based on the criteria | You need an absolute quality score for a single model's responses |
| Pairwise evaluation | Compare two models against each other, generating a preference based on the criteria | You are comparing a candidate model against a baseline model |
Computation-based metrics
Computation-based metrics compare whether the LLM-generated results are consistent with a ground-truth dataset of input and output pairs. The commonly used metrics can be categorized into the following groups:
- Lexicon-based metrics: Use math to calculate string similarities between LLM-generated results and ground truth, such as `Exact Match` and `ROUGE`.
- Count-based metrics: Aggregate the number of rows that hit or miss certain ground-truth labels, such as `F1-score`, `Accuracy`, and `Tool Name Match`.
- Embedding-based metrics: Calculate the distance between the LLM-generated results and ground truth in the embedding space, reflecting their level of similarity.
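As an illustration of the embedding-based category, cosine similarity over embedding vectors is a common choice. The sketch below assumes the two vectors were already produced by an embedding model; the `response_emb` and `reference_emb` values are placeholders, not real model output:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for real model embeddings.
response_emb = [0.12, 0.48, 0.87]
reference_emb = [0.10, 0.52, 0.85]
similarity = cosine_similarity(response_emb, reference_emb)
```

A score near 1 indicates the generated text and the ground truth are close in embedding space; a score near 0 indicates little semantic overlap.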
General text generation
The following metrics help you to evaluate the model's ability to ensure the responses are useful, safe, and effective for your users.
Exact match
The `exact_match` metric computes whether a model response matches a reference exactly.
- Token limit: None
Evaluation criteria
Not applicable.
Metric input parameters
| Input parameter | Description |
|---|---|
| `response` | The LLM response. |
| `reference` | The golden LLM response for reference. |
Output scores
| Value | Description |
|---|---|
| 0 | Not matched |
| 1 | Matched |
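The scoring rule above is simple enough to sketch directly. This is a minimal illustration of the documented behavior, not the service's implementation:

```python
def exact_match(response: str, reference: str) -> int:
    """Return 1 if the response matches the reference exactly, else 0."""
    return 1 if response == reference else 0
```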
BLEU
The `bleu` (BiLingual Evaluation Understudy) metric evaluates the quality of a response that has been translated from one natural language to another. The quality of the response is measured as the correspondence between the `response` parameter and the `reference` parameter.
- Token limit: None
Evaluation criteria
Not applicable.
Metric input parameters
| Input parameter | Description |
|---|---|
| `response` | The LLM response. |
| `reference` | The golden LLM response for the reference. |
Output scores
| Value | Description |
|---|---|
| A float in the range of [0,1] | Higher scores indicate better translations. A score of 1 represents a perfect match to the `reference`. |
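To make the mechanics concrete, here is a simplified single-reference BLEU sketch: the geometric mean of modified n-gram precisions with a brevity penalty. It is illustrative only; production implementations (such as sacrebleu) add smoothing and standardized tokenization, so scores will differ:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(response: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with a brevity penalty, in [0, 1]."""
    resp, ref = response.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        resp_ngrams, ref_ngrams = ngrams(resp, n), ngrams(ref, n)
        total = sum(resp_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(count, ref_ngrams[g]) for g, count in resp_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Penalize responses shorter than the reference.
    brevity = math.exp(min(0.0, 1.0 - len(ref) / len(resp)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```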
ROUGE
The `ROUGE` metric compares the provided `response` parameter against a `reference` parameter. All `rouge` metrics return the F1 score. `rouge-l-sum` is calculated by default, but you can specify the `rouge` variant that you want to use.
- Token limit: None
Evaluation criteria
Not applicable
Metric input parameters
| Input parameter | Description |
|---|---|
| `response` | The LLM response. |
| `reference` | The golden LLM response for the reference. |
Output scores
| Value | Description |
|---|---|
| A float in the range of [0,1] | A score closer to 0 means poor similarity between `response` and `reference`. A score closer to 1 means strong similarity between `response` and `reference`. |
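The core idea behind the `rouge-l` family can be sketched as an F1 score over the longest common subsequence (LCS) of tokens. This is a minimal illustration, not the service's implementation; the various `rouge` variants differ in detail:

```python
def rouge_l_f1(response: str, reference: str) -> float:
    """ROUGE-L F1 sketch: harmonic mean of LCS precision and recall."""
    resp, ref = response.split(), reference.split()
    if not resp or not ref:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(resp) + 1)]
    for i, r_tok in enumerate(resp, 1):
        for j, g_tok in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r_tok == g_tok else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(resp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```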
Tool use and function calling
The following metrics help you to evaluate the model's ability to predict a valid tool (function) call.
Call valid
The `tool_call_valid` metric describes the model's ability to predict a valid tool call. Only the first tool call is inspected.
- Token limit: None
Evaluation criteria
| Evaluation criterion | Description |
|---|---|
| Validity | The model's output contains a valid tool call. |
| Formatting | A JSON dictionary contains the `name` and `arguments` fields. |
Metric input parameters
| Input parameter | Description |
|---|---|
| `prediction` | The candidate model output, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
| `reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |
Output scores
| Value | Description |
|---|---|
| 0 | Invalid tool call |
| 1 | Valid tool call |
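The documented validity check can be sketched as follows. This is an illustrative reimplementation of the criteria above (parse the prediction, take the first tool call, require `name` and `arguments`), not the service's actual code:

```python
import json

def tool_call_valid(prediction: str) -> int:
    """Return 1 if the first tool call is a dict with 'name' and
    'arguments' fields, else 0."""
    try:
        calls = json.loads(prediction)["tool_calls"]
        # The tool_calls value may itself be a JSON-serialized string.
        if isinstance(calls, str):
            calls = json.loads(calls)
        first = calls[0]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return 0
    return 1 if isinstance(first, dict) and "name" in first and "arguments" in first else 0
```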
Name match
The `tool_name_match` metric describes the model's ability to predict a tool call with the correct tool name. Only the first tool call is inspected.
- Token limit: None
Evaluation criteria
| Evaluation criterion | Description |
|---|---|
| Name matching | The model-predicted tool call matches the reference tool call's name. |
Metric input parameters
| Input parameter | Description |
|---|---|
| `prediction` | The candidate model output, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
| `reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |
Output scores
| Value | Description |
|---|---|
| 0 | Tool call name doesn't match the reference. |
| 1 | Tool call name matches the reference. |
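This check can be sketched by extracting the first tool call's name from each record and comparing them (an illustration of the documented behavior, not the service's code):

```python
import json

def _first_call(record: str) -> dict:
    """Extract the first tool call from a prediction or reference record."""
    calls = json.loads(record).get("tool_calls") or []
    if isinstance(calls, str):
        calls = json.loads(calls)
    return calls[0] if calls else {}

def tool_name_match(prediction: str, reference: str) -> int:
    """Return 1 if the first predicted tool call's name matches the
    reference tool call's name, else 0."""
    pred_name = _first_call(prediction).get("name")
    ref_name = _first_call(reference).get("name")
    return 1 if pred_name is not None and pred_name == ref_name else 0
```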
Parameter key match
The `tool_parameter_key_match` metric describes the model's ability to predict a tool call with the correct parameter names.
- Token limit: None
Evaluation criteria
| Evaluation criterion | Description |
|---|---|
| Parameter matching ratio | The ratio between the number of predicted parameters that match the parameter names of the reference tool call and the total number of parameters. |
Metric input parameters
| Input parameter | Description |
|---|---|
| `prediction` | The candidate model output, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
| `reference` | The ground-truth reference model prediction, which follows the same format as `prediction`. |
Output scores
| Value | Description |
|---|---|
| A float in the range of [0,1] | A score closer to 1 means that more of the predicted parameters match the reference parameters' names. |
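One plausible reading of the ratio above, assuming the denominator is the number of predicted parameters (the service's exact denominator isn't specified on this page), can be sketched as:

```python
import json

def _first_call_args(record: str) -> dict:
    """Extract the first tool call's arguments from a record."""
    calls = json.loads(record).get("tool_calls") or []
    if isinstance(calls, str):
        calls = json.loads(calls)
    return calls[0].get("arguments", {}) if calls else {}

def tool_parameter_key_match(prediction: str, reference: str) -> float:
    """Fraction of predicted parameter names that also appear in the
    reference tool call's parameters."""
    pred_args = _first_call_args(prediction)
    ref_args = _first_call_args(reference)
    if not pred_args:
        return 0.0
    matched = sum(1 for key in pred_args if key in ref_args)
    return matched / len(pred_args)
```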
Parameter KV match
The `tool_parameter_kv_match` metric describes the model's ability to predict a tool call with the correct parameter names and key values.
- Token limit: None
Evaluation criteria
| Evaluation criterion | Description |
|---|---|
| Parameter matching ratio | The ratio between the number of the predicted parameters that match both the parameter names and values of the reference tool call and the total number of parameters. |
Metric input parameters
| Input parameter | Description |
|---|---|
| `prediction` | The candidate model output, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
| `reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |
Output scores
| Value | Description |
|---|---|
| A float in the range of [0,1] | A score closer to 1 means that more of the predicted parameters match the reference parameters' names and values. |
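The key-value variant tightens the check: a predicted parameter counts only if both its name and its value match the reference. Again assuming the denominator is the number of predicted parameters, a hedged sketch:

```python
import json

def _first_call_args(record: str) -> dict:
    """Extract the first tool call's arguments from a record."""
    calls = json.loads(record).get("tool_calls") or []
    if isinstance(calls, str):
        calls = json.loads(calls)
    return calls[0].get("arguments", {}) if calls else {}

def tool_parameter_kv_match(prediction: str, reference: str) -> float:
    """Fraction of predicted parameters whose name and value both match
    the reference tool call's parameters."""
    pred_args = _first_call_args(prediction)
    ref_args = _first_call_args(reference)
    if not pred_args:
        return 0.0
    matched = sum(1 for k, v in pred_args.items() if k in ref_args and ref_args[k] == v)
    return matched / len(pred_args)
```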
In the Gen AI evaluation service, you can use computation-based metrics through the Vertex AI SDK for Python.
Baseline evaluation quality for generative tasks
When evaluating the output of generative AI models, note that the evaluation process is inherently subjective, and the quality of evaluation can vary depending on the specific task and evaluation criteria. This subjectivity also applies to human evaluators. For more information about the challenges of achieving consistent evaluation for generative AI models, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena and Learning to summarize from human feedback.
What's next
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.