View and interpret evaluation results

This page describes how to view and interpret your evaluation results after you run a model evaluation.

View evaluation results

After you define your evaluation task, run the task to get evaluation results, as follows:

from vertexai.evaluation import EvalTask

eval_result = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
).evaluate(
    model=MODEL,
    experiment_run=EXPERIMENT_RUN_NAME,
)

The EvalResult class represents the result of an evaluation run with the following attributes:

  • summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
  • metrics_table: A pandas.DataFrame table containing evaluation dataset inputs, responses, explanations, and metric results per row.
  • metadata: The experiment name and experiment run name for the evaluation run.

The EvalResult class is defined as follows:

@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
      summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
      metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,
        responses, explanations, and metric results per row.
      metadata: The experiment name and experiment run name for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None

You can use helper functions to display the evaluation results in the Colab notebook as follows:

Tables for summary metrics and row-based metrics
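For example, a minimal display helper might look like the following sketch. The function name display_eval_result is an assumption for illustration and is not part of the SDK; it only renders summary_metrics and metrics_table as notebook tables.

import pandas as pd
from IPython.display import display

def display_eval_result(eval_result, title=None):
    """Display summary metrics and row-based metrics as tables in a notebook."""
    if title:
        print(title)
    # Summary metrics: a single row of aggregated scores per metric.
    display(pd.DataFrame([eval_result.summary_metrics]))
    # Row-based metrics: inputs, responses, explanations, and scores per instance.
    display(eval_result.metrics_table)

display_eval_result(eval_result, title="Evaluation results")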

Visualize evaluation results

You can plot summary metrics in a radar or bar chart to visualize and compare results from different evaluation runs. This visualization can be helpful for evaluating different models and different prompt templates.

In the following example, we visualize four metrics (coherence, fluency, instruction following, and overall text quality) for responses generated using four different prompt templates. From the radar and bar plots, we can infer that prompt template #2 consistently outperforms the other templates across all four metrics, which is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice among the four options.

Radar chart showing the coherence, instruction_following, text_quality, and fluency scores for all prompt templates

Bar chart showing the mean for coherence, instruction_following, text_quality, and fluency for all prompt templates
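As a rough sketch of such a comparison, the following code plots the mean scores from several evaluation runs with Plotly. The eval_results dictionary, the metric list, and the "<metric>/mean" key pattern for summary_metrics are assumptions for illustration; check the keys of your summary_metrics for the exact names.

import plotly.graph_objects as go

# Assumed: eval_results maps a run label (for example, a prompt template name)
# to its EvalResult, and summary_metrics uses keys such as "coherence/mean".
metrics = ["coherence", "fluency", "instruction_following", "text_quality"]

radar = go.Figure()
bar = go.Figure()
for run_name, eval_result in eval_results.items():
    mean_scores = [eval_result.summary_metrics[f"{metric}/mean"] for metric in metrics]
    radar.add_trace(go.Scatterpolar(r=mean_scores, theta=metrics, fill="toself", name=run_name))
    bar.add_trace(go.Bar(x=metrics, y=mean_scores, name=run_name))

radar.update_layout(title="Mean metric scores per evaluation run")
radar.show()
bar.update_layout(barmode="group", title="Mean metric scores per evaluation run")
bar.show()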

Understand metric results

The following tables list the components of the instance-level and aggregate results included in metrics_table and summary_metrics, respectively, for PointwiseMetric, PairwiseMetric, and computation-based metrics:

PointwiseMetric

Instance-level results

  • response: The response generated for the prompt by the model.
  • score: The rating given to the response as per the criteria and rating rubric. The score can be binary (0 and 1), Likert scale (1 to 5, or -2 to 2), or float (0.0 to 1.0).
  • explanation: The judge model's reason for the score. We use chain-of-thought reasoning to guide the judge model to explain its rationale behind each verdict. Forcing the judge model to reason is shown to improve evaluation accuracy.

Note: Results for translation metrics only include score.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.

PairwiseMetric

Instance-level results

  • response: The response generated for the prompt by the candidate model.
  • baseline_model_response: The response generated for the prompt by the baseline model.
  • pairwise_choice: The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE.
  • explanation: The judge model's reason for the choice.

Aggregate results

  • candidate_model_win_rate: The fraction of instances for which the judge model decided that the candidate model had the better response. Ranges from 0 to 1.
  • baseline_model_win_rate: The fraction of instances for which the judge model decided that the baseline model had the better response. Ranges from 0 to 1.

Computation-based metrics

Instance-level results

  • response: The model's response being evaluated.
  • reference: The reference response.
  • score: The score calculated for each response and reference pair.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
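As a rough sketch of how these results surface in code, the following snippet reads instance-level and aggregate values from an EvalResult. The "<metric>/score", "<metric>/mean", and win-rate key names are assumptions that can vary by metric and SDK version; inspect metrics_table.columns and summary_metrics to confirm the exact names.

# Instance-level results for a pointwise metric (assumed column names).
first_row = eval_result.metrics_table.iloc[0]
print(first_row["response"])
print(first_row["text_quality/score"])
print(first_row["text_quality/explanation"])

# Aggregate results for a pointwise metric (assumed key names).
print(eval_result.summary_metrics["text_quality/mean"])
print(eval_result.summary_metrics["text_quality/std"])

# Aggregate win rates for a pairwise metric (assumed key names).
print(eval_result.summary_metrics.get("pairwise_question_answering_quality/candidate_model_win_rate"))
print(eval_result.summary_metrics.get("pairwise_question_answering_quality/baseline_model_win_rate"))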

Examples

The examples in this section demonstrate how to read and understand the evaluation results.

Example 1: Pointwise evaluation

In the first example, consider evaluating a pointwise evaluation instance for TEXT_QUALITY. The score from the pointwise evaluation of the TEXT_QUALITY metric is 4 (on a scale of 1 to 5), which means the response is good. Furthermore, the explanation in the evaluation result shows why the judge model thinks the prediction deserves a score of 4, rather than a higher or lower one.
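A minimal sketch of such a pointwise TEXT_QUALITY evaluation follows. The prompt and response values are illustrative assumptions, and the result column names follow the assumed "<metric>/score" pattern; providing a response column lets EvalTask evaluate a pre-generated response without calling a model.

import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Illustrative dataset with a pre-generated response (values are assumptions).
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the following article: ..."],
        "response": ["The article argues that ..."],
    }
)

pointwise_result = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.TEXT_QUALITY],
    experiment=EXPERIMENT_NAME,
).evaluate()

# Assumed column names; check metrics_table.columns for the exact names.
print(pointwise_result.metrics_table[["text_quality/score", "text_quality/explanation"]])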

Dataset

Result

Example 2: Pairwise evaluation

The second example is a pairwise comparison evaluation on PAIRWISE_QUESTION_ANSWERING_QUALITY. The pairwise_choice result shows that the judge model preferred the candidate response "France is a country located in Western Europe." over the baseline response "France is a country." as an answer to the question in the prompt. As with pointwise results, an explanation is also provided that describes why the candidate response is better than the baseline response (in this case, the candidate response is more helpful).
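A minimal sketch of such a pairwise evaluation with pre-generated candidate and baseline responses is shown below. The prompt text and the pairwise_choice column name are assumptions for illustration; the candidate and baseline responses are the ones from this example.

import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Candidate and baseline responses are compared directly; the prompt is an assumption.
pairwise_dataset = pd.DataFrame(
    {
        "prompt": ["Tell me about France."],
        "response": ["France is a country located in Western Europe."],
        "baseline_model_response": ["France is a country."],
    }
)

pairwise_result = EvalTask(
    dataset=pairwise_dataset,
    metrics=[MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT_NAME,
).evaluate()

# Assumed column name; check metrics_table.columns for the exact name.
print(pairwise_result.metrics_table["pairwise_question_answering_quality/pairwise_choice"].iloc[0])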

Dataset

Result

What's next
