Package evaluation (1.55.0)

API documentation for the evaluation package.

Classes

CustomMetric

The custom evaluation metric.

The evaluation function. It must accept a dataset row (instance) as the metric_function input and return the per-instance metric result as a dictionary, with the metric score mapped to CustomMetric.name as the key.
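As a minimal sketch of such a function (assuming a metric named "response_length" and that the instance dictionary exposes the model response under a "response" key, which are illustrative choices, not part of the API), the per-instance result is keyed by the metric's name; the function is typically wrapped with `make_metric`, documented below:

```
def response_length_fn(instance: dict) -> dict:
    # `instance` is one dataset row; "response" is an assumed column name.
    response = instance.get("response", "")
    # The score must be keyed by the CustomMetric's name ("response_length" here).
    return {"response_length": len(response.split())}
```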

EvalResult

Evaluation result.

EvalTask

A class representing an EvalTask.

An evaluation task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. Evaluation tasks must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:

* `content_column_name`: "content"
* `reference_column_name`: "reference"
* `response_column_name`: "response"

Requirements for different use cases:

* Bring your own prediction: A `response` column is required. The response column name can be customized by providing the `response_column_name` parameter.
* Without prompt template: A column representing the input prompt to the model is required. If `content_column_name` is not specified, the eval dataset requires a `content` column by default. If a response column is present, it is not used; new responses are generated by the model from the content column and used for evaluation.
* With prompt template: The dataset must contain column names corresponding to the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain `instruction` and `context` columns.

Metrics Details:

The supported metrics, metric bundle descriptions, grading rubrics, and the required input fields can be found on the Vertex AI public documentation page [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage:

1. To perform bring-your-own-prediction (BYOP) evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default, or specify the `response_column_name` parameter to customize it.

```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "response": [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```

2. To perform evaluation with built-in Gemini model inference, specify the `model` parameter with a GenerativeModel instance. The default query column name to the model is `content`.

```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "content": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run"
)
```

3. If a `prompt_template` is specified, the `content` column is not required. Prompts can be assembled from the evaluation dataset, and all placeholder names must be present in the dataset columns.

```
eval_dataset = pd.DataFrame({
    "context": [...],
    "instruction": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    prompt_template="{instruction}. Article: {context}. Summary:",
)
```

4. To perform evaluation with custom model inference, specify the `model` parameter with a custom prediction function. The `content` column in the dataset is used to generate predictions with the custom model function for evaluation.

```
def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": input}
        ]
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run"
)
```

PairwiseMetric

The Side-by-side (SxS) Pairwise Metric.

A model-based evaluation metric that compares two generative models side-by-side, and allows users to A/B test their generative models to determine which model is performing better on the given evaluation task.

For more details on when to use pairwise metrics, see [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Result Details:

* In `EvalResult.summary_metrics`, win rates for both the baseline and candidate model are computed, showing the rate at which each model performs better on the given task. The win rate is computed as the number of times the candidate model performs better than the baseline model divided by the total number of examples. The win rate is a number between 0 and 1.
* In `EvalResult.metrics_table`, a pairwise metric produces three evaluation results for each row in the dataset:
    * `pairwise_choice`: an enumeration that indicates whether the candidate or baseline model performed better.
    * `explanation`: the AutoRater model's rationale behind each verdict, using chain-of-thought reasoning. These explanations help users scrutinize the AutoRater's judgment and build appropriate trust in its decisions.
    * `confidence`: a score between 0 and 1 that signifies how confident the AutoRater was in its verdict. A score closer to 1 means higher confidence.

See the [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results) for more details on understanding the metric results.

Usage:

```
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_summarization_quality = PairwiseMetric(
    metric="summarization_quality",
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "instruction": [...],
        "context": [...],
    }),
    metrics=[pairwise_summarization_quality],
)
pairwise_results = eval_task.evaluate(
    prompt_template="instruction: {instruction}. context: {context}",
    model=candidate_model,
)
```
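The pairwise results described in Result Details can then be inspected directly. The following is a minimal sketch, assuming `pairwise_results` from the snippet above is an `EvalResult` whose `summary_metrics` is a dictionary and whose `metrics_table` is a pandas DataFrame; the exact key and column names may vary by SDK version.

```
# Aggregate results, including the baseline/candidate win rates
# described in "Result Details" (exact key names may vary).
for metric_name, value in pairwise_results.summary_metrics.items():
    print(f"{metric_name}: {value}")

# Per-row results: each row carries the pairwise_choice, explanation,
# and confidence produced by the AutoRater.
metrics_df = pairwise_results.metrics_table
print(metrics_df.columns.tolist())
print(metrics_df.head())
```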

PromptTemplate

A prompt template for creating prompts with placeholders.

The PromptTemplate class allows users to define a template string with placeholders represented in curly braces `{placeholder}`. The placeholder names cannot contain spaces. These placeholders can be replaced with specific values using the `assemble` method, providing flexibility in generating dynamic prompts.

Usage:

```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
```

Packages Functions

make_metric

```
make_metric(
    name: str,
    metric_function: typing.Callable[[typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric
```

Makes a custom metric.

Parameters
| Name | Description |
| --- | --- |
| `name` | The name of the metric. |
| `metric_function` | The evaluation function. It must accept a dataset row (instance) as the metric_function input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key. |
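A minimal usage sketch, reusing the hypothetical `response_length_fn` from the CustomMetric section above; passing the resulting `CustomMetric` in an `EvalTask` metrics list alongside built-in metric names is an assumption based on the EvalTask examples earlier on this page.

```
response_length_metric = make_metric(
    name="response_length",
    metric_function=response_length_fn,
)

eval_task = EvalTask(
    dataset=eval_dataset,  # assumed to contain a "response" column (BYOP)
    metrics=["fluency", response_length_metric],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate()
```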
