Evaluate a judge model
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
For model-based metrics, the Gen AI evaluation service evaluates your models with a foundational model, such as Gemini, that has been configured and prompted as a judge model. If you want to learn more about the judge model, the Advanced judge model customization series describes additional tools that you can use to evaluate and configure the judge model.
For the basic evaluation workflow, see the Gen AI evaluation service quickstart. The Advanced judge model customization series includes the following pages:
- Evaluate a judge model (current page)
- Prompting for judge model customization
- Configure a judge model
Overview
Using human judges to evaluate large language models (LLMs) can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. The Gen AI evaluation service uses a configured Gemini 2.0 Flash model by default as the judge model, with customizable prompts to evaluate your model for various use cases.
The following sections show how to evaluate a customized judge model for your ideal use case.
Prepare the dataset
To evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. The goal is to compare the scores from model-based metrics with the human ratings and determine whether the model-based metrics have sufficient quality for your use case. An example dataset follows the schema table below.
- For `PointwiseMetric`, prepare the `{metric_name}/human_rating` column in the dataset as the ground truth for the `{metric_name}/score` result generated by model-based metrics.
- For `PairwiseMetric`, prepare the `{metric_name}/human_pairwise_choice` column in the dataset as the ground truth for the `{metric_name}/pairwise_choice` result generated by model-based metrics.
Use the following dataset schema:
| Model-based metric | Human rating column |
|---|---|
| `PointwiseMetric` | `{metric_name}/human_rating` |
| `PairwiseMetric` | `{metric_name}/human_pairwise_choice` |
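For example, a pointwise dataset with human ratings might look like the following minimal sketch. The prompt text, responses, and the `fluency` metric name are placeholders; only the `fluency/human_rating` column name follows the required schema.

```python
import pandas as pd

# Minimal sketch of a pointwise evaluation dataset. The "fluency" metric name
# and all values are placeholders; the ground-truth column must be named
# "{metric_name}/human_rating" to match the "{metric_name}/score" column that
# the model-based metric produces.
human_rated_dataset = pd.DataFrame({
    "prompt": [
        "Summarize the attached article in two sentences.",
        "Explain how a hash table works.",
    ],
    "response": [
        "The article describes ...",
        "A hash table maps keys to ...",
    ],
    # Human ratings, on the same scale that the model-based metric returns.
    "fluency/human_rating": [4, 2],
})
```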
Available metrics
For a `PointwiseMetric` that returns only 2 scores (such as 0 and 1), and a `PairwiseMetric` that only has 2 preference types (Model A or Model B), the following metrics are available:
| Metric | Calculation |
|---|---|
| 2-class balanced accuracy | \( \frac{1}{2} (\text{True Positive Rate} + \text{True Negative Rate}) \) |
| 2-class balanced f1 score | \( \sum_{i=0,1} \frac{cnt_i}{sum} \cdot f1(class_i) \) |
| Confusion matrix | Use the `confusion_matrix` and `confusion_matrix_labels` fields to calculate metrics such as the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR). For example, the result `confusion_matrix = [[20, 31, 15], [10, 11, 3], [3, 2, 2]]` with `confusion_matrix_labels = ['BASELINE', 'CANDIDATE', 'TIE']` represents a 3x3 matrix whose rows and columns are both ordered as `BASELINE`, `CANDIDATE`, `TIE`. |
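As a sketch (not an official sample) of how these rates can be derived from the `confusion_matrix` and `confusion_matrix_labels` fields for a 2-class result, assuming that each row holds the counts for one ground-truth label and that the first label is treated as the positive class:

```python
# Minimal sketch: derive TPR, TNR, FPR, and FNR from a 2-class confusion
# matrix. The values are placeholders; treating rows as ground-truth labels
# and the first label as the positive class is an assumption for illustration.
confusion_matrix = [[40, 10],
                    [5, 45]]
confusion_matrix_labels = ["0", "1"]

tp, fn = confusion_matrix[0][0], confusion_matrix[0][1]
fp, tn = confusion_matrix[1][0], confusion_matrix[1][1]

tpr = tp / (tp + fn)  # recall of the positive class
tnr = tn / (tn + fp)
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)

balanced_accuracy = (tpr + tnr) / 2
print(f"TPR={tpr:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} "
      f"balanced accuracy={balanced_accuracy:.2f}")
```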
For a `PointwiseMetric` that returns more than 2 scores (such as 1 through 5), and a `PairwiseMetric` that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:
| Metric | Calculation |
|---|---|
| Multiple-class balanced accuracy | \( \frac{1}{n} \sum_{i=1}^{n} recall(class_i) \) |
| Multiple-class balanced f1 score | \( \sum_{i=1}^{n} \frac{cnt_i}{sum} \cdot f1(class_i) \) |
Where:
- \( f1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)
- \( precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \)
- \( recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \)
- \( n \): the number of classes
- \( cnt_i \): the number of \( class_i \) elements in the ground truth data
- \( sum \): the number of elements in the ground truth data
To calculate other metrics, you can use open-source libraries.
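For example, the following minimal sketch uses scikit-learn (an assumed third-party dependency, not part of the Gen AI evaluation SDK) to compute the multiple-class balanced accuracy and balanced f1 score from human ratings and model-based metric scores:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Placeholder data: human ratings as ground truth and the scores returned by
# the model-based metric, both on a 1-5 scale.
human_ratings = [5, 4, 4, 2, 1, 3, 5, 2]
metric_scores = [5, 4, 3, 2, 1, 3, 4, 2]

# Multiple-class balanced accuracy: the mean recall over all classes.
balanced_accuracy = balanced_accuracy_score(human_ratings, metric_scores)

# Multiple-class balanced f1: per-class f1 weighted by each class's count in
# the ground truth data (scikit-learn's "weighted" average).
balanced_f1 = f1_score(human_ratings, metric_scores, average="weighted")

print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"balanced f1 score: {balanced_f1:.3f}")
```

The `weighted` average in `f1_score` weights each per-class f1 by that class's count in the ground truth data, which matches the balanced f1 formula above.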
Evaluate the model-based metric
The following example defines a pairwise model-based metric with a custom definition of fluency, then evaluates the quality of the metric against human ratings.
```python
import pandas as pd

from vertexai.preview.evaluation import (
    AutoraterConfig,
    EvalTask,
    PairwiseMetric,
)
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

# Step 1: Prepare the evaluation dataset with the human rating data column.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency...",
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Calibrate the model-based metric results against the human preferences.
# eval_result contains the human rating column from human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[pairwise_fluency],
)
```
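A pointwise metric follows the same pattern. The following is a minimal sketch rather than an official sample; it assumes a pointwise `human_rated_dataset` such as the one in the Prepare the dataset section, and the `fluency` metric name and prompt template text are illustrative only.

```python
from vertexai.preview.evaluation import EvalTask, PointwiseMetric
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

# Assumes human_rated_dataset is a pointwise DataFrame like the earlier
# example, with a "fluency/human_rating" ground-truth column.
fluency = PointwiseMetric(
    metric="fluency",
    metric_prompt_template="please evaluate fluency...",  # illustrative only
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[fluency],
).evaluate()

# Compare the "fluency/score" results against "fluency/human_rating".
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[fluency],
)
```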
What's next
- Prompting for judge model customization
- Configure a judge model