Model evaluation in Vertex AI

Note: Vertex AI provides model evaluation metrics for both predictive AI and generative AI models. This page provides an overview of the evaluation service for predictive AI models. To evaluate a generative AI model, see the Generative AI evaluation service overview.

The predictive AI model evaluation service lets you evaluate model performance across specific use cases. You can think of evaluation as a form of observability into a model's performance. Model evaluation in Vertex AI can fit into the typical machine learning workflow in several ways:

  • After you train your model, review model evaluation metrics before you deploy your model. Compare evaluation metrics across multiple models to help you decide which model to deploy.

  • After your model is deployed to production, periodically evaluate your model with new incoming data. If the evaluation metrics show that your model performance is degrading, consider re-training your model. This process is called continuous evaluation.

How you interpret and use those metrics depends on your business need and the problem your model is trained to solve. For example, you might have a lower tolerance for false positives than for false negatives, or the other way around. These kinds of questions affect which metrics you would focus on as you iterate on your model.

The key metrics provided by the predictive AI model evaluation service depend on the model type and objective, and are described in the model type sections later on this page.

Note: The model evaluation service described on this page is separate from the evaluation metrics that are automatically generated during the AutoML training process.

Features

To evaluate a model with Vertex AI, you need a trained model, a batch inference output, and a ground truth dataset. The following is a typical model evaluation workflow using Vertex AI:

  1. Train a model. You can do this in Vertex AI using AutoML or custom training.

  2. Run a batch inference job on the model to generate inference results.

  3. Prepare the ground truth data, which is the "correctly labeled" data as determined by humans. The ground truth is usually the test dataset you used during the model training process.

  4. Run an evaluation job on the model, which evaluates the accuracy of the batch inference results compared to the ground truth data.

  5. Analyze the metrics that result from the evaluation job.

  6. Iterate on your model to see if you can improve your model's accuracy. You can run multiple evaluation jobs, and compare the results of multiple jobs across models or model versions. A minimal SDK sketch of this workflow follows this list.
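The following sketch shows what steps 1 through 5 might look like with the Vertex AI SDK for Python. It is a minimal illustration, not a complete recipe: the project, region, model ID, and Cloud Storage paths are placeholders, and how you trigger the evaluation job itself (console, pipeline, or API) depends on your setup.

    # Minimal sketch of the evaluation workflow with the Vertex AI SDK for Python.
    # The project, region, model ID, and Cloud Storage paths are placeholders.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    # Steps 1-2: load a trained model from the Model Registry and run batch inference.
    model = aiplatform.Model("projects/my-project/locations/us-central1/models/MODEL_ID")
    batch_job = model.batch_predict(
        job_display_name="eval-batch-inference",
        gcs_source="gs://my-bucket/test_instances.jsonl",
        gcs_destination_prefix="gs://my-bucket/batch_output/",
    )
    batch_job.wait()

    # Steps 4-5: after an evaluation job has run (for example, from the console or
    # a pipeline), read its metrics back from the Model Registry.
    for evaluation in model.list_model_evaluations():
        print(evaluation.resource_name)
        print(evaluation.metrics)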

You can run model evaluation in Vertex AI in several ways:

  • Create evaluations through the Vertex AI Model Registry in the Google Cloud console.

  • Use model evaluations from Vertex AI as a pipeline component with Vertex AI Pipelines. You can create pipeline runs and templates that include model evaluations as a part of your automated MLOps workflow.

    You can run the model evaluation component by itself, or with other pipeline components such as the batch inference component. A sketch of submitting such a pipeline run follows this list.
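As a loose illustration of the pipeline route, the sketch below submits an already compiled pipeline that is assumed to include a model evaluation step. The template path and the parameter names are placeholders for your own pipeline definition, not values defined by the service.

    # Sketch: submit a compiled pipeline (assumed to include a model evaluation
    # component) as a pipeline run. Paths and parameter names are placeholders.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    job = aiplatform.PipelineJob(
        display_name="evaluation-pipeline-run",
        template_path="gs://my-bucket/pipelines/evaluation_pipeline.json",
        parameter_values={
            "model_name": "projects/my-project/locations/us-central1/models/MODEL_ID",
            "ground_truth_gcs_uri": "gs://my-bucket/test_instances.jsonl",
        },
    )
    job.submit()  # use job.run() instead to block until the run completes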

Vertex AI supports evaluation of the following model types:

Image

Classification

You can view and download schema files from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/modelevaluation/
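If you want to browse the schema files programmatically, the following is a small sketch that lists them with the Cloud Storage client library; it assumes an environment where Google Cloud credentials are already configured.

    # Sketch: list the model evaluation schema files in the public bucket.
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs("google-cloud-aiplatform", prefix="schema/modelevaluation/"):
        print(blob.name)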

  • AuPRC: The area under the precision-recall (PR) curve, also referred to as average precision. This value ranges from zero to one, where a higher value indicates a higher-quality model.
  • Log loss: The cross-entropy between the model inferences and the target values. This ranges from zero to infinity, where a lower value indicates a higher-quality model.
  • Confidence threshold: A confidence score that determines which inferences to return. A model returns inferences that are at this value or higher. A higher confidence threshold increases precision but lowers recall. Vertex AI returns confidence metrics at different threshold values to show how the threshold affects precision and recall.
  • Recall: The fraction of inferences with this class that the model correctly predicted. Also called true positive rate.
  • Precision: The fraction of classification inferences produced by the model that were correct.
  • Confusion matrix: A confusion matrix shows how often a model correctly predicted a result. For incorrectly predicted results, the matrix shows what the model predicted instead. The confusion matrix helps you understand where your model is "confusing" two results.
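To make these definitions concrete, here is a small, hypothetical computation of the same classification metrics with scikit-learn. The labels and scores are made up; Vertex AI computes these metrics for you in an evaluation job, so this is only an illustration of what each number measures.

    # Illustrative computation of the classification metrics described above,
    # using scikit-learn on hypothetical labels and model scores.
    import numpy as np
    from sklearn.metrics import (
        average_precision_score,  # AuPRC
        confusion_matrix,
        log_loss,
        precision_score,
        recall_score,
    )

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                     # ground truth labels
    y_score = np.array([0.9, 0.2, 0.65, 0.4, 0.55, 0.1, 0.8, 0.3])  # model confidence scores

    print("AuPRC:", average_precision_score(y_true, y_score))
    print("Log loss:", log_loss(y_true, y_score))

    # Apply a confidence threshold: only scores at or above it count as positive.
    threshold = 0.5
    y_pred = (y_score >= threshold).astype(int)
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))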

Tabular

Classification

You can view and download schema files from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/modelevaluation/

  • AuPRC: The area under the precision-recall (PR) curve, also referred to as average precision. This value ranges from zero to one, where a higher value indicates a higher-quality model.
  • AuROC: The area under the receiver operating characteristic (ROC) curve. This ranges from zero to one, where a higher value indicates a higher-quality model.
  • Log loss: The cross-entropy between the model inferences and the target values. This ranges from zero to infinity, where a lower value indicates a higher-quality model.
  • Confidence threshold: A confidence score that determines which inferences to return. A model returns inferences that are at this value or higher. A higher confidence threshold increases precision but lowers recall. Vertex AI returns confidence metrics at different threshold values to show how the threshold affects precision and recall.
  • Recall: The fraction of inferences with this class that the model correctly predicted. Also called true positive rate.
  • Recall at 1: The recall (true positive rate) when considering only the label that has the highest inference score, and only if that score is not below the confidence threshold, for each example.
  • Precision: The fraction of classification inferences produced by the model that were correct.
  • Precision at 1: The precision when considering only the label that has the highest inference score, and only if that score is not below the confidence threshold, for each example.
  • F1 score: The harmonic mean of precision and recall. F1 is a useful metric if you're looking for a balance between precision and recall and there's an uneven class distribution.
  • F1 score at 1: The harmonic mean of recall at 1 and precision at 1.
  • Confusion matrix: A confusion matrix shows how often a model correctly predicted a result. For incorrectly predicted results, the matrix shows what the model predicted instead. The confusion matrix helps you understand where your model is "confusing" two results.
  • True negative count: The number of times a model correctly predicted a negative class.
  • True positive count: The number of times a model correctly predicted a positive class.
  • False negative count: The number of times a model mistakenly predicted a negative class.
  • False positive count: The number of times a model mistakenly predicted a positive class.
  • False positive rate: The fraction of actual negative cases that the model incorrectly predicted as positive (false positives divided by all actual negatives).
  • False positive rate at 1: The false positive rate when considering only the label that has the highest inference score, and only if that score is not below the confidence threshold, for each example.
  • Model feature attributions: Vertex AI shows you how much each feature impacts a model. The values are provided as a percentage for each feature: the higher the percentage, the more impact the feature had on model training. Review this information to ensure that all of the most important features make sense for your data and business problem.
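The sketch below illustrates, on made-up labels and scores, how sweeping the confidence threshold trades precision against recall and changes the true/false positive and negative counts. The values are hypothetical and unrelated to any Vertex AI API; the computation only mirrors the metric definitions above.

    # Sketch of how a confidence threshold trades precision against recall,
    # using scikit-learn on hypothetical labels and scores.
    import numpy as np
    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
    y_score = np.array([0.95, 0.30, 0.72, 0.44, 0.51, 0.08, 0.85, 0.35, 0.60, 0.20])

    for threshold in (0.3, 0.5, 0.7):
        y_pred = (y_score >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        print(
            f"threshold={threshold:.1f} "
            f"precision={precision_score(y_true, y_pred):.2f} "
            f"recall={recall_score(y_true, y_pred):.2f} "
            f"f1={f1_score(y_true, y_pred):.2f} "
            f"TP={tp} FP={fp} TN={tn} FN={fn} "
            f"FPR={fp / (fp + tn):.2f}"
        )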

Regression

You can view and download schema files from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/modelevaluation/

  • MAE: The mean absolute error (MAE) is the average absolute difference between the target values and the predicted values. This metric ranges from zero to infinity; a lower value indicates a higher quality model.
  • RMSE: The root-mean-squared error is the square root of the average squared difference between the target and predicted values. RMSE is more sensitive to outliers than MAE, so if you're concerned about large errors, then RMSE can be a more useful metric to evaluate. Similar to MAE, a smaller value indicates a higher quality model (0 represents a perfect predictor).
  • RMSLE: The root-mean-squared logarithmic error metric is similar to RMSE, except that it uses the natural logarithm of the predicted and actual values plus 1. RMSLE penalizes under-inference more heavily than over-inference. It can also be a good metric when you don't want to penalize differences for large inference values more heavily than for small inference values. This metric ranges from zero to infinity; a lower value indicates a higher quality model. The RMSLE evaluation metric is returned only if all label and predicted values are non-negative.
  • r^2: r squared (r^2) is the square of the Pearson correlation coefficient between the labels and predicted values. This metric ranges between zero and one. A higher value indicates a closer fit to the regression line.
  • MAPE: Mean absolute percentage error (MAPE) is the average absolute percentage difference between the labels and the predicted values. This metric ranges between zero and infinity; a lower value indicates a higher quality model.
    MAPE is not shown if the target column contains any 0 values. In this case, MAPE is undefined.
  • Model feature attributions: Vertex AI shows you how much each feature impacts a model. The values are provided as a percentage for each feature: the higher the percentage, the more impact the feature had on model training. Review this information to ensure that all of the most important features make sense for your data and business problem.
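As a concrete, hypothetical illustration of these regression metrics, the following computes them with scikit-learn and NumPy on made-up target and predicted values; Vertex AI computes the same metrics for you in an evaluation job.

    # Illustrative computation of the regression metrics described above,
    # using scikit-learn and NumPy on hypothetical target and predicted values.
    import numpy as np
    from sklearn.metrics import (
        mean_absolute_error,
        mean_absolute_percentage_error,
        mean_squared_error,
        mean_squared_log_error,
        r2_score,
    )

    y_true = np.array([3.0, 5.5, 2.0, 7.0, 4.5])
    y_pred = np.array([2.5, 6.0, 2.2, 6.4, 5.0])

    print("MAE:", mean_absolute_error(y_true, y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("RMSLE:", np.sqrt(mean_squared_log_error(y_true, y_pred)))  # requires non-negative values
    print("r^2:", r2_score(y_true, y_pred))
    print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))    # undefined if any target is 0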

Forecasting

You can view and download schema files from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/modelevaluation/

  • MAE: The mean absolute error (MAE) is the average absolute difference between the target values and the predicted values. This metric ranges from zero to infinity; a lower value indicates a higher quality model.
  • RMSE: The root-mean-squared error is the square root of the average squared difference between the target and predicted values. RMSE is more sensitive to outliers than MAE, so if you're concerned about large errors, then RMSE can be a more useful metric to evaluate. Similar to MAE, a smaller value indicates a higher quality model (0 represents a perfect predictor).
  • RMSLE: The root-mean-squared logarithmic error metric is similar to RMSE, except that it uses the natural logarithm of the predicted and actual values plus 1. RMSLE penalizes under-inference more heavily than over-inference. It can also be a good metric when you don't want to penalize differences for large inference values more heavily than for small inference values. This metric ranges from zero to infinity; a lower value indicates a higher quality model. The RMSLE evaluation metric is returned only if all label and predicted values are non-negative.
  • r^2: r squared (r^2) is the square of the Pearson correlation coefficient between the labels and predicted values. This metric ranges between zero and one. A higher value indicates a closer fit to the regression line.
  • MAPE: Mean absolute percentage error (MAPE) is the average absolute percentage difference between the labels and the predicted values. This metric ranges between zero and infinity; a lower value indicates a higher quality model.
    MAPE is not shown if the target column contains any 0 values. In this case, MAPE is undefined.
  • WAPE: Weighted absolute percentage error (WAPE) is the sum of the absolute differences between the predicted and observed values, divided by the sum of the observed values. Compared to RMSE, WAPE is weighted towards the overall differences rather than individual differences, which can be highly influenced by low or intermittent values. A lower value indicates a higher quality model.
  • RMSPE: Root-mean-squared percentage error (RMSPE) shows RMSE as a percentage of the actual values instead of an absolute number. A lower value indicates a higher quality model.
  • Quantile: The percent quantile, which indicates the probability that an observed value will be below the predicted value. For example, at the 0.5 quantile, the observed values are expected to be lower than the predicted values 50% of the time.
  • Observed quantile: Shows the percentage of true values that were less than the predicted value for a given quantile.
  • Scaled pinball loss: The scaled pinball loss at a particular quantile. A lower value indicates a higher quality model at the given quantile.
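To illustrate the forecasting-specific metrics on made-up numbers, the sketch below computes WAPE and RMSPE with NumPy and an ordinary (unscaled) pinball loss with scikit-learn. The scaled pinball loss that Vertex AI reports may use a different scaling, so treat the last value only as an illustration of the quantile loss idea.

    # Sketch of the forecasting-specific metrics described above, computed with
    # NumPy and scikit-learn on hypothetical observed and predicted values.
    import numpy as np
    from sklearn.metrics import mean_pinball_loss

    y_true = np.array([120.0, 80.0, 95.0, 130.0, 60.0])   # observed values
    y_pred = np.array([110.0, 85.0, 100.0, 120.0, 70.0])  # predicted values

    # WAPE: sum of absolute errors divided by the sum of observed values.
    wape = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))
    print("WAPE:", wape)

    # RMSPE: RMSE expressed relative to the observed values.
    rmspe = np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))
    print("RMSPE:", rmspe)

    # Pinball loss at the 0.5 quantile (unscaled; scaling conventions vary).
    print("Pinball loss (q=0.5):", mean_pinball_loss(y_true, y_pred, alpha=0.5))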

Notebook tutorials

  • AutoML: Tabular
  • Custom training: Tabular
  • Vertex AI Model Registry

