View and interpret evaluation results

This page describes how to view and interpret your evaluation results after you run a model evaluation.

View evaluation results

After you define your evaluation task, run the task to get evaluation results, as follows:

from vertexai.evaluation import EvalTask

eval_result = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
).evaluate(
    model=MODEL,
    experiment_run=EXPERIMENT_RUN_NAME,
)

The EvalResult class represents the result of an evaluation run with the following attributes:

  • summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
  • metrics_table: A pandas.DataFrame table containing evaluation dataset inputs, responses, explanations, and metric results per row.
  • metadata: The experiment name and experiment run name for the evaluation run.

The EvalResult class is defined as follows:

@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
      summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
      metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,
        responses, explanations, and metric results per row.
      metadata: The experiment name and experiment run name for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None

You can use helper functions to display the evaluation results in the Colab notebook as follows:

Tables for summary metrics and row-based metrics
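For example, a minimal display helper might look like the following sketch. The function name display_eval_result is an assumption for illustration and is not part of the SDK; it only renders summary_metrics and metrics_table as notebook tables.

import pandas as pd
from IPython.display import display

def display_eval_result(eval_result, title=None):
    """Display summary metrics and row-based metrics as tables in a notebook."""
    if title:
        print(title)
    # Summary metrics: a single row of aggregated scores per metric.
    display(pd.DataFrame([eval_result.summary_metrics]))
    # Row-based metrics: inputs, responses, explanations, and scores per instance.
    display(eval_result.metrics_table)

display_eval_result(eval_result, title="Evaluation results")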

Visualize evaluation results

You can plot summary metrics in a radar or bar chart to visualize and compare results from different evaluation runs. This visualization can be helpful for evaluating different models and different prompt templates.

In the following example, we visualize four metrics (coherence, fluency, instruction following, and overall text quality) for responses generated using four different prompt templates. From the radar and bar plots, we can infer that prompt template #2 consistently outperforms the other templates across all four metrics, which is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice among the four options.

Radar chart showing the coherence, instruction_following, text_quality, and fluency scores for all prompt templates

Bar chart showing the mean for coherence, instruction_following, text_quality, and fluency for all prompt templates
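As a rough sketch of such a comparison, the following code plots the mean scores from several evaluation runs with Plotly. The eval_results dictionary, the metric list, and the "<metric>/mean" key pattern for summary_metrics are assumptions for illustration; check the keys of your summary_metrics for the exact names.

import plotly.graph_objects as go

# Assumed: eval_results maps a run label (for example, a prompt template name)
# to its EvalResult, and summary_metrics uses keys such as "coherence/mean".
metrics = ["coherence", "fluency", "instruction_following", "text_quality"]

radar = go.Figure()
bar = go.Figure()
for run_name, eval_result in eval_results.items():
    mean_scores = [eval_result.summary_metrics[f"{metric}/mean"] for metric in metrics]
    radar.add_trace(go.Scatterpolar(r=mean_scores, theta=metrics, fill="toself", name=run_name))
    bar.add_trace(go.Bar(x=metrics, y=mean_scores, name=run_name))

radar.update_layout(title="Mean metric scores per evaluation run")
radar.show()
bar.update_layout(barmode="group", title="Mean metric scores per evaluation run")
bar.show()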

Understand metric results

The following tables list the components of the instance-level and aggregate results included in metrics_table and summary_metrics, respectively, for PointwiseMetric, PairwiseMetric, and computation-based metrics:

PointwiseMetric

Instance-level results

  • response: The response generated for the prompt by the model.
  • score: The rating given to the response as per the criteria and rating rubric. The score can be binary (0 and 1), Likert scale (1 to 5, or -2 to 2), or float (0.0 to 1.0).
  • explanation: The judge model's reason for the score. We use chain-of-thought reasoning to guide the judge model to explain its rationale behind each verdict. Forcing the judge model to reason is shown to improve evaluation accuracy.

Note: Results for translation metrics only include score.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.

PairwiseMetric

Instance-level results

  • response: The response generated for the prompt by the candidate model.
  • baseline_model_response: The response generated for the prompt by the baseline model.
  • pairwise_choice: The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE.
  • explanation: The judge model's reason for the choice.

Aggregate results

  • candidate_model_win_rate: The fraction of instances for which the judge model decided that the candidate model had the better response. Ranges from 0 to 1.
  • baseline_model_win_rate: The fraction of instances for which the judge model decided that the baseline model had the better response. Ranges from 0 to 1.

Computation-based metrics

Instance-level results

  • response: The model's response being evaluated.
  • reference: The reference response.
  • score: The score calculated for each response and reference pair.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
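As a rough sketch of how these results surface in code, the following snippet reads instance-level and aggregate values from an EvalResult. The "<metric>/score", "<metric>/mean", and win-rate key names are assumptions that can vary by metric and SDK version; inspect metrics_table.columns and summary_metrics to confirm the exact names.

# Instance-level results for a pointwise metric (assumed column names).
first_row = eval_result.metrics_table.iloc[0]
print(first_row["response"])
print(first_row["text_quality/score"])
print(first_row["text_quality/explanation"])

# Aggregate results for a pointwise metric (assumed key names).
print(eval_result.summary_metrics["text_quality/mean"])
print(eval_result.summary_metrics["text_quality/std"])

# Aggregate win rates for a pairwise metric (assumed key names).
print(eval_result.summary_metrics.get("pairwise_question_answering_quality/candidate_model_win_rate"))
print(eval_result.summary_metrics.get("pairwise_question_answering_quality/baseline_model_win_rate"))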

Examples

The examples in this section demonstrate how to read and understand the evaluation results.

Example 1: Pointwise evaluation

In the first example, consider evaluating a pointwise evaluation instance for TEXT_QUALITY. The score from the pointwise evaluation of the TEXT_QUALITY metric is 4 (on a scale of 1 to 5), which means the response is good. Furthermore, the explanation in the evaluation result shows why the judge model thinks the prediction deserves a score of 4, rather than a higher or lower one.
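A minimal sketch of such a pointwise TEXT_QUALITY evaluation follows. The prompt and response values are illustrative assumptions, and the result column names follow the assumed "<metric>/score" pattern; providing a response column lets EvalTask evaluate a pre-generated response without calling a model.

import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Illustrative dataset with a pre-generated response (values are assumptions).
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the following article: ..."],
        "response": ["The article argues that ..."],
    }
)

pointwise_result = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.TEXT_QUALITY],
    experiment=EXPERIMENT_NAME,
).evaluate()

# Assumed column names; check metrics_table.columns for the exact names.
print(pointwise_result.metrics_table[["text_quality/score", "text_quality/explanation"]])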

Dataset

Result

Example 2: Pairwise evaluation

The second example is a pairwise comparison evaluation on PAIRWISE_QUESTION_ANSWERING_QUALITY. The pairwise_choice result shows that the judge model preferred the candidate response "France is a country located in Western Europe." over the baseline response "France is a country." as an answer to the question in the prompt. As with pointwise results, an explanation is also provided that describes why the candidate response is better than the baseline response (in this case, the candidate response is more helpful).
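A minimal sketch of such a pairwise evaluation with pre-generated candidate and baseline responses is shown below. The prompt text and the pairwise_choice column name are assumptions for illustration; the candidate and baseline responses are the ones from this example.

import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Candidate and baseline responses are compared directly; the prompt is an assumption.
pairwise_dataset = pd.DataFrame(
    {
        "prompt": ["Tell me about France."],
        "response": ["France is a country located in Western Europe."],
        "baseline_model_response": ["France is a country."],
    }
)

pairwise_result = EvalTask(
    dataset=pairwise_dataset,
    metrics=[MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY],
    experiment=EXPERIMENT_NAME,
).evaluate()

# Assumed column name; check metrics_table.columns for the exact name.
print(pairwise_result.metrics_table["pairwise_question_answering_quality/pairwise_choice"].iloc[0])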

Dataset

Result

What's next
