The ML.EVALUATE function
This document describes the ML.EVALUATE function, which you can use to evaluate model metrics.
Supported models
You can use the ML.EVALUATE function with all model types except for the following:
Syntax
The ML.EVALUATE function syntax differs depending on the type of model that you use the function with. Choose the option appropriate for your use case.
Time series
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [PERFORM_AGGREGATION AS perform_aggregation]
    [, HORIZON AS horizon]
    [, CONFIDENCE_LEVEL AS confidence_level]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.
- PERFORM_AGGREGATION: a BOOL value that indicates the level of evaluation for forecasting accuracy. If you specify TRUE, then the forecasting accuracy is on the time series level. If you specify FALSE, the forecasting accuracy is on the timestamp level. The default value is TRUE.
- HORIZON: an INT64 value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the CREATE MODEL statement for the time series model, or 1000 if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series.

  You can only use the HORIZON argument when the following conditions are met:

  - The model type is ARIMA_PLUS.
  - You have specified a value for either the TABLE or QUERY_STATEMENT argument.
- CONFIDENCE_LEVEL: a FLOAT64 value that specifies the percentage of the future values that fall in the prediction interval. The default value is 0.95. The valid input range is [0, 1).

  You can only use the CONFIDENCE_LEVEL argument when the following conditions are met:

  - The model type is ARIMA_PLUS.
  - You have specified a value for either the TABLE or QUERY_STATEMENT argument.
  - The PERFORM_AGGREGATION argument value is FALSE.

  The value of the CONFIDENCE_LEVEL argument affects the upper_bound and lower_bound values in the output.
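For example, the following query sketches a per-timestamp evaluation of a hypothetical ARIMA_PLUS model; the mydataset.my_arima_model and mydataset.new_data names are placeholders:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    TABLE `mydataset.new_data`,
    STRUCT(
      FALSE AS perform_aggregation,
      30 AS horizon,
      0.9 AS confidence_level))
```

Because perform_aggregation is FALSE, the output contains one row per forecasted timestamp, and the lower_bound and upper_bound columns reflect the 0.9 prediction interval.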
Classification & regression
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [THRESHOLD AS threshold]
    [, TRIAL_ID AS trial_id]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. The table must have a column that matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query. The query must have a column that matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used.
- THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary classification models. The default value is 0.5.

  A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

  You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.
- TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
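As an illustration, the following query evaluates a hypothetical binary classification model at a custom threshold; the model and table names are placeholders:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_classification_model`,
    TABLE `mydataset.eval_data`,
    STRUCT(0.55 AS threshold))
```

Raising the threshold from the default 0.5 to 0.55 typically trades recall for precision; comparing the output of several such queries is a common way to pick an operating point.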
Remote over Gemini
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_P AS top_p]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  If the remote model is configured to use supervised tuning, the following column naming requirements apply:

  - The table must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
  - The table must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

  You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  If the remote model is configured to use supervised tuning, the following column naming requirements apply:

  - The query must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
  - The query must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

  You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.
- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.
- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024.

  The MAX_OUTPUT_TOKENS value must be in the range [1,8192].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

  The TEMPERATURE value must be in the range [0.0,1.0]. The default value is 1.0.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The default value is 0.95.
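Putting these arguments together, the following hypothetical query evaluates a Gemini-based remote model on a classification task; the model and table names are placeholders, and the table is assumed to contain the required prompt and label columns:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.gemini_model`,
    TABLE `mydataset.eval_prompts`,
    STRUCT(
      'CLASSIFICATION' AS task_type,
      256 AS max_output_tokens,
      0.0 AS temperature))
```

Setting temperature to 0.0 makes the generated labels as deterministic as possible, which is usually what you want when computing classification metrics.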
Remote over Claude
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TOP_K AS top_k]
    [, TOP_P AS top_p]))
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.
- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024.

  The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TOP_K: an INT64 value in the range [1,40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.

  A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value.

  For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
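For example, a hypothetical evaluation of a Claude-based remote model on a summarization task might look like the following; the model and table names are placeholders, and the table is assumed to contain input_text and output_text columns:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.claude_model`,
    TABLE `mydataset.article_summaries`,
    STRUCT(
      'SUMMARIZATION' AS task_type,
      512 AS max_output_tokens,
      40 AS top_k,
      0.9 AS top_p))
```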
Remote over Llama or Mistral AI
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_P AS top_p]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.
- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024.

  The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

  The TEMPERATURE value must be in the range [0.0,1.0]. The default value is 1.0.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
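For example, a hypothetical evaluation of a Llama- or Mistral AI-based remote model on a question-answering task might look like the following; all names are placeholders, and the table is assumed to contain input_text and output_text columns:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.llama_model`,
    TABLE `mydataset.qa_pairs`,
    STRUCT(
      'QUESTION_ANSWERING' AS task_type,
      256 AS max_output_tokens,
      0.2 AS temperature,
      0.95 AS top_p))
```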
Remote over open models
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_K AS top_k]
    [, TOP_P AS top_p]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.
- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.
- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The model determines an appropriate value if you don't specify one.

  The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

  The TEMPERATURE value must be in the range [0.0,1.0]. The model determines an appropriate value if you don't specify one.
- TOP_K: an INT64 value in the range [1,40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.

  A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value.

  For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
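For example, the following hypothetical query evaluates a remote open model and sets all of the sampling arguments explicitly; the model and table names are placeholders, and the table is assumed to contain input_text and output_text columns:

```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.open_model`,
    TABLE `mydataset.eval_prompts`,
    STRUCT(
      'TEXT_GENERATION' AS task_type,
      512 AS max_output_tokens,
      0.7 AS temperature,
      40 AS top_k,
      0.95 AS top_p))
```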
All other models
ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [THRESHOLD AS threshold]
    [, TRIAL_ID AS trial_id]))

Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.
- THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary classification models. The default value is 0.5.

  A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

  You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.
- TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model. You can't use the TRIAL_ID argument with PCA models.
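For example, the following hypothetical queries show both forms: returning the training-time metrics, and evaluating a specific hyperparameter tuning trial against new data; all names are placeholders:

```sql
-- Return the evaluation metrics that were computed during training.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_model`);

-- Evaluate hyperparameter tuning trial 2 against held-out data.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_model`,
    TABLE `mydataset.eval_data`,
    STRUCT(2 AS trial_id))
```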
Output
ML.EVALUATE returns a single row of metrics applicable to the type of model specified.

For models that return them, the precision, recall, f1_score, log_loss, and roc_auc metrics are macro-averaged for all of the class labels. For a macro-average, metrics are calculated for each label and then an unweighted average is taken of those values.
Time series
ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is FALSE:

- time_series_id_col or time_series_id_cols: a value that contains the identifiers of a time series. time_series_id_col can be an INT64 or STRING value. time_series_id_cols can be an ARRAY<INT64> or ARRAY<STRING> value. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
- time_series_timestamp_col: a STRING value that contains the timestamp column for a time series. The column name and type are inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the CREATE MODEL statement.
- time_series_data_col: a STRING value that contains the data column for a time series. The column name and type are inherited from the TIME_SERIES_DATA_COL option as specified in the CREATE MODEL statement.
- forecasted_time_series_data_col: a STRING value that contains the same data as time_series_data_col but with forecasted_ prefixed to the column name.
- lower_bound: a FLOAT64 value that contains the lower bound of the prediction interval.
- upper_bound: a FLOAT64 value that contains the upper bound of the prediction interval.
- absolute_error: a FLOAT64 value that contains the absolute value of the difference between the forecasted value and the actual data value.
- absolute_percentage_error: a FLOAT64 value that contains the absolute value of the absolute error divided by the actual value.
Notes:

The following things are true for time series models when input data is provided and perform_aggregation is FALSE:

- ML.EVALUATE evaluates the forecasting accuracy of each forecasted timestamp.
- For history timestamps, the following columns are NULL:

  - forecasted_time_series_data_col
  - lower_bound
  - upper_bound
  - absolute_error
  - absolute_percentage_error
ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is TRUE:

- time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- root_mean_squared_error: a FLOAT64 value that contains the root mean squared error for the model.
- mean_absolute_percentage_error: a FLOAT64 value that contains the mean absolute percentage error for the model.
- symmetric_mean_absolute_percentage_error: a FLOAT64 value that contains the symmetric mean absolute percentage error for the model.
Notes:

The following things are true for time series models when input data is provided and perform_aggregation is TRUE:

- ML.EVALUATE evaluates the forecasting accuracy of each forecasted timestamp.
- The error metrics are aggregated over the forecasting error on each forecasted timestamp.
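To retrieve these aggregated metrics, a query like the following sketch can be used; the model and table names are placeholders:

```sql
SELECT
  mean_absolute_percentage_error,
  symmetric_mean_absolute_percentage_error
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    TABLE `mydataset.new_data`,
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))
```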
ML.EVALUATE returns the following columns for an ARIMA_PLUS model when input data isn't provided:

- time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement.
- non_seasonal_p: an INT64 value that contains the order for the autoregressive model. For more information, see Autoregressive integrated moving average.
- non_seasonal_d: an INT64 value that contains the degree of differencing for the non-seasonal model. For more information, see Autoregressive integrated moving average.
- non_seasonal_q: an INT64 value that contains the order for the moving average model. For more information, see Autoregressive integrated moving average.
- has_drift: a BOOL value that indicates whether the model includes a linear drift term.
- log_likelihood: a FLOAT64 value that contains the log likelihood for the model.
- aic: a FLOAT64 value that contains the Akaike information criterion for the model.
- variance: a FLOAT64 value that measures how far the observed value differs from the predicted value mean.
- seasonal_periods: a STRING value that contains the seasonal period for the model.
- has_holiday_effect: a BOOL value that indicates whether the model includes any holiday effects.
- has_spikes_and_dips: a BOOL value that indicates whether the model performs automatic spikes and dips detection and cleanup.
- has_step_changes: a BOOL value that indicates whether the model has step changes.
ML.EVALUATE without input data is deprecated. Use ML.ARIMA_EVALUATE instead.

Classification
The following types of models are classification models:
- Logistic regressor
- Boosted tree classifier
- Random forest classifier
- DNN classifier
- Wide & Deep classifier
- AutoML Tables classifier
ML.EVALUATE returns the following columns for classification models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply to AutoML Tables models.
- precision: a FLOAT64 value that contains the precision for the model.
- recall: a FLOAT64 value that contains the recall for the model.
- accuracy: a FLOAT64 value that contains the accuracy for the model. accuracy is computed as a global total or micro-average. For a micro-average, the metric is calculated globally by counting the total number of correctly predicted rows.
- f1_score: a FLOAT64 value that contains the F1 score for the model.
- log_loss: a FLOAT64 value that contains the logistic loss for the model.
- roc_auc: a FLOAT64 value that contains the area under the receiver operating characteristic curve for the model.
Regression
The following types of models are regression models:
- Linear regression
- Boosted tree regressor
- Random forest regressor
- Deep neural network (DNN) regressor
- Wide & Deep regressor
- AutoML Tables regressor
ML.EVALUATE returns the following columns for regression models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply to AutoML Tables models.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
- median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
- r2_score: a FLOAT64 value that contains the R2 score for the model.
- explained_variance: a FLOAT64 value that contains the explained variance for the model.
K-means
ML.EVALUATE returns the following columns for k-means models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- davies_bouldin_index: a FLOAT64 value that contains the Davies-Bouldin index for the model.
- mean_squared_distance: a FLOAT64 value that contains the mean squared distance for the model, which is the average of the distances from training data points to their closest centroid.
Matrix factorization
ML.EVALUATE returns the following columns for matrix factorization models with implicit feedback:

- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- recall: a FLOAT64 value that contains the recall for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- normalized_discounted_cumulative_gain: a FLOAT64 value that contains the normalized discounted cumulative gain for the model.
- average_rank: a FLOAT64 value that contains the average rank for the model.
ML.EVALUATE returns the following columns for matrix factorization models with explicit feedback:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
- median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
- r2_score: a FLOAT64 value that contains the R2 score for the model.
- explained_variance: a FLOAT64 value that contains the explained variance for the model.
Remote over pre-trained models
This section describes the output for the following types of models:
- Gemini
- Anthropic Claude
- Mistral AI
- Llama
- Open models
ML.EVALUATE returns different columns depending on the task_type value that you specify.
When you specify the TEXT_GENERATION task type, the following columns are returned:
- bleu4_score: a FLOAT64 column that contains the bilingual evaluation understudy (BLEU4) score for the model.
- rouge-l_precision: a FLOAT64 column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model.
- rouge-l_recall: a FLOAT64 column that contains the ROUGE-L recall for the model.
- rouge-l_f1: a FLOAT64 column that contains the ROUGE-L F1 score for the model.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the total number of input rows.
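The ROUGE-L metrics are based on the longest common subsequence (LCS) between the generated text and the reference text. The following is a minimal token-level sketch assuming whitespace tokenization; the actual Vertex AI implementation may tokenize and aggregate differently:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 for one candidate/reference pair."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    precision = lcs / len(cand)   # how much of the candidate is in the LCS
    recall = lcs / len(ref)       # how much of the reference is in the LCS
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

p, r, f1 = rouge_l("the cat sat on the mat", "the cat is on the mat")
```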
When you specify the CLASSIFICATION task type, the following columns are returned:
- precision: a FLOAT64 column that contains the precision for the model.
- recall: a FLOAT64 column that contains the recall for the model.
- f1: a FLOAT64 column that contains the F1 score for the model.
- label: a STRING column that contains the label generated for the input data.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the total number of input rows.
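As a rough sketch of how per-label precision, recall, and F1 are conventionally computed (the helper name and data here are illustrative, not BigQuery ML API):

```python
def per_label_metrics(actuals, predictions, label):
    """Precision, recall, and F1 for one label, as in the CLASSIFICATION output."""
    tp = sum(1 for a, p in zip(actuals, predictions) if a == p == label)
    fp = sum(1 for a, p in zip(actuals, predictions) if p == label and a != label)
    fn = sum(1 for a, p in zip(actuals, predictions) if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actuals = ["spam", "ham", "spam", "ham", "spam"]
preds   = ["spam", "spam", "spam", "ham", "ham"]
precision, recall, f1 = per_label_metrics(actuals, preds, "spam")
```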
When you specify the SUMMARIZATION task type, the following columns are returned:
- rouge-l_precision: a FLOAT64 column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model.
- rouge-l_recall: a FLOAT64 column that contains the ROUGE-L recall for the model.
- rouge-l_f1: a FLOAT64 column that contains the ROUGE-L F1 score for the model.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the total number of input rows.
When you specify the QUESTION_ANSWERING task type, the following columns are returned:
- exact_match: a FLOAT64 column that indicates whether the generated text exactly matches the ground truth. This value is 1 if the generated text equals the ground truth; otherwise, it is 0. This metric is an average across all of the input rows.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the total number of input rows.
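The exact_match averaging can be sketched as follows; the function name and data are illustrative only:

```python
def exact_match(generated, ground_truth):
    """Fraction of rows where the generated answer equals the ground truth exactly."""
    scores = [1 if g == t else 0 for g, t in zip(generated, ground_truth)]
    return sum(scores) / len(scores)

# Two of three answers match exactly.
em = exact_match(["Paris", "42", "blue"], ["Paris", "42", "red"])
```

Note that any difference, including case or whitespace, counts as a mismatch under a strict string comparison like this one.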
Remote over custom models
ML.EVALUATE returns the following column for remote models over custom models deployed to Vertex AI:
- remote_eval_metrics: a JSON column that contains metrics appropriate for the model type.
PCA
ML.EVALUATE returns the following column for PCA models:
- total_explained_variance_ratio: a FLOAT64 value that contains the percentage of the cumulative variance explained by all the returned principal components. For more information, see the ML.PRINCIPAL_COMPONENT_INFO function.
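Assuming you know the variance along each principal component (the eigenvalues of the covariance matrix, sorted descending), the ratio can be sketched as follows; the function name and numbers are hypothetical:

```python
def total_explained_variance_ratio(eigenvalues, num_components):
    """Share of total variance captured by the top principal components.

    `eigenvalues` are the variances along each principal component,
    sorted in descending order.
    """
    return sum(eigenvalues[:num_components]) / sum(eigenvalues)

# Hypothetical component variances for a 4-feature dataset:
# keeping 2 of 4 components captures (4 + 2) / (4 + 2 + 1 + 1) of the variance.
ratio = total_explained_variance_ratio([4.0, 2.0, 1.0, 1.0], num_components=2)
print(ratio)  # 0.75
```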
Autoencoder
ML.EVALUATE returns the following columns for autoencoder models:
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
Limitations
ML.EVALUATE is subject to the following limitations:
- ML.EVALUATE doesn't support imported TensorFlow models or remote models over Cloud AI services.
- For remote models over Vertex AI endpoints, ML.EVALUATE fetches the evaluation result from the Vertex AI endpoint and doesn't take any input data.
Costs
When used with remote models over Vertex AI LLMs, ML.EVALUATE costs are calculated based on the following:
- The bytes processed from the input table. These charges are billed from BigQuery to your project. For more information, see BigQuery pricing.
- The input to and output from the LLM. These charges are billed from Vertex AI to your project. For more information, see Vertex AI pricing.
Examples
The following examples show how to use ML.EVALUATE.
ML.EVALUATE with no input data specified
The following query evaluates a model with no input data specified:
```sql
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`)
```
ML.EVALUATE with a custom threshold and input data
The following query evaluates a model with input data and a custom threshold of 0.55:
```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.mymodel`,
    (
      SELECT
        custom_label,
        column1,
        column2
      FROM
        `mydataset.mytable`),
    STRUCT(0.55 AS threshold))
```
ML.EVALUATE to calculate forecasting accuracy of a time series
The following query evaluates the 30-point forecasting accuracy for a time series model:
```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    (
      SELECT
        timeseries_date,
        timeseries_metric
      FROM
        `mydataset.mytable`),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))
```
ML.EVALUATE to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp
The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9.
```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    (
      SELECT
        timeseries_date,
        timeseries_metric
      FROM
        `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))
```
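The exact interval computation used by BigQuery ML isn't documented here, but as a rough illustration of what a 0.9 confidence_level means, a symmetric Gaussian prediction interval around a point forecast looks like this (all names and numbers are hypothetical):

```python
from statistics import NormalDist

def prediction_interval(forecast, stderr, confidence_level):
    """Symmetric interval around a point forecast, assuming Gaussian errors."""
    # Two-sided interval: put (1 - confidence_level) / 2 in each tail.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    return forecast - z * stderr, forecast + z * stderr

# A 0.9 confidence level gives roughly forecast +/- 1.645 standard errors.
low, high = prediction_interval(100.0, stderr=5.0, confidence_level=0.9)
```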
ML.EVALUATE to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp
The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9. Note that you need to include the side features for the evaluation data.
```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_xreg_model`,
    (
      SELECT
        timeseries_date,
        timeseries_metric,
        feature1,
        feature2
      FROM
        `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))
```
ML.EVALUATE to calculate LLM text generation accuracy
The following query evaluates the LLM text generation accuracy for the classification task type for each label from the evaluation table.
```sql
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_llm`,
    (
      SELECT
        prompt,
        label
      FROM
        `mydataset.mytable`),
    STRUCT('classification' AS task_type))
```
What's next
- For more information about model evaluation, see BigQuery ML model evaluation overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.