The ML.EVALUATE function

This document describes the ML.EVALUATE function, which lets you evaluate model metrics.

Supported models

You can use the ML.EVALUATE function with all model types except for the following:

Syntax

The ML.EVALUATE function syntax differs depending on the type of model that you use the function with. Choose the option appropriate for your use case.

Time series

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [PERFORM_AGGREGATION AS perform_aggregation]
    [, HORIZON AS horizon]
    [, CONFIDENCE_LEVEL AS confidence_level]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.

  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.

  • PERFORM_AGGREGATION: a BOOL value that indicates the level of evaluation for forecasting accuracy. If you specify TRUE, then the forecasting accuracy is on the time series level. If you specify FALSE, the forecasting accuracy is on the timestamp level. The default value is TRUE.

  • HORIZON: an INT64 value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the CREATE MODEL statement for the time series model, or 1000 if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series.

    You can only use the HORIZON argument when the following conditions are met:

    • The model type is ARIMA_PLUS.
    • You have specified a value for either the TABLE or QUERY_STATEMENT argument.
  • CONFIDENCE_LEVEL: a FLOAT64 value that specifies the percentage of the future values that fall in the prediction interval. The default value is 0.95. The valid input range is [0, 1).

    You can only use the CONFIDENCE_LEVEL argument when the following conditions are met:

    • The model type is ARIMA_PLUS.
    • You have specified a value for either the TABLE or QUERY_STATEMENT argument.
    • The PERFORM_AGGREGATION argument value is FALSE.

      The value of the CONFIDENCE_LEVEL argument affects the upper_bound and lower_bound values in the output.
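To build intuition for how the confidence level relates to the interval bounds, here is a minimal sketch that computes a two-sided interval under a normal approximation. The forecast and standard-error inputs are hypothetical, and this is a conceptual illustration, not BigQuery's internal ARIMA_PLUS computation.

```python
from statistics import NormalDist

def prediction_interval(forecast, stderr, confidence_level):
    """Two-sided interval under a normal approximation (illustrative
    only; the forecast and stderr values are hypothetical inputs)."""
    # Two-sided quantile: e.g. ~1.96 standard errors for 0.95.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    return forecast - z * stderr, forecast + z * stderr

# A higher confidence level widens the gap between the bounds.
lower, upper = prediction_interval(100.0, 10.0, 0.95)
```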

Note:

For ARIMA_PLUS and ARIMA_PLUS_XREG models, the output columns differ depending on whether input data is provided. If no input data is provided, use ML.ARIMA_EVALUATE instead. Support for ML.EVALUATE without input data is deprecated.

Classification & regression

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [THRESHOLD AS threshold]
    [, TRIAL_ID AS trial_id]))

Arguments

ML.EVALUATE takes the following arguments:

  • THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary-class classification models. The default value is 0.5.

    A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

    You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.

  • TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
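The precision and recall behavior described for the THRESHOLD argument can be sketched as follows. The scores and labels are made up, and this is a conceptual illustration rather than BigQuery's implementation.

```python
import math

def precision_recall_at(scores, labels, threshold=0.5):
    """Precision and recall at a threshold. Precision is NaN when the
    threshold produces no positive predictions, true or false."""
    preds = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

With a threshold above every score, no positive labels are produced at all, so precision comes back as NaN, matching the behavior described above.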

Remote over Gemini

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_P AS top_p]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

    • The table must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

    If the remote model is configured to use supervised tuning, the following column naming requirements apply:

    • The table must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
    • The table must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

      You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.

    For more information, see AS SELECT.

  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

    • The query must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

    If the remote model is configured to use supervised tuning, the following column naming requirements apply:

    • The query must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
    • The query must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

      You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.

    For more information, see AS SELECT.

  • TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

    • TEXT_GENERATION
    • CLASSIFICATION
    • SUMMARIZATION
    • QUESTION_ANSWERING

    The default value is TEXT_GENERATION.

  • MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words.

    The default value is 1024.

    The MAX_OUTPUT_TOKENS value must be in the range [1, 8192].

  • TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

    The TEMPERATURE value must be in the range [0.0, 1.0].

    The default value is 1.0.

  • TOP_P: a FLOAT64 value in the range [0.0, 1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses.

    The default value is 0.95.
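The token-selection rule in the TOP_P description can be sketched in a few lines. The probabilities below are the A/B/C values from the example, and the helper function is hypothetical.

```python
def top_p_candidates(probs, top_p):
    """Keep tokens from most to least probable until their cumulative
    probability reaches top_p; remaining tokens are not considered."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

# Reproduces the example above: C is never considered.
candidates = top_p_candidates({"A": 0.3, "B": 0.2, "C": 0.1}, top_p=0.5)  # ["A", "B"]
```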

Remote over Claude

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TOP_K AS top_k]
    [, TOP_P AS top_p]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The table must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The query must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  • TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

    • TEXT_GENERATION
    • CLASSIFICATION
    • SUMMARIZATION
    • QUESTION_ANSWERING

    The default value is TEXT_GENERATION.

  • MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words.

    The default value is 1024.

    The MAX_OUTPUT_TOKENS value must be in the range [1, 4096].

  • TOP_K: an INT64 value in the range [1, 40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.

    A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value.

    For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.

  • TOP_P: a FLOAT64 value in the range [0.0, 1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses.

    The model determines an appropriate value if you don't specify one.
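The selection order described above (keep the top-K tokens, filter by cumulative top-P, then sample with temperature) can be sketched as follows. This is a hypothetical illustration of the general technique, not the model's actual implementation; the probabilities are made up.

```python
import random

def select_token(probs, top_k, top_p, temperature, seed=0):
    """Keep the top_k most probable tokens, filter them by cumulative
    top_p, then sample among the survivors using temperature."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    if temperature == 0:  # deterministic: the most probable survivor
        return kept[0][0]
    # Sharpen or flatten the surviving distribution, then sample.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    rng = random.Random(seed)
    return rng.choices([token for token, _ in kept], weights=weights)[0]
```

With top_k=1, or with a top_p small enough that only the most probable token survives, the output is always that token.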

Remote over Llama or Mistral AI

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_P AS top_p]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The table must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The query must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  • TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

    • TEXT_GENERATION
    • CLASSIFICATION
    • SUMMARIZATION
    • QUESTION_ANSWERING

    The default value is TEXT_GENERATION.

  • MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words.

    The default value is 1024.

    The MAX_OUTPUT_TOKENS value must be in the range [1, 4096].

  • TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

    The TEMPERATURE value must be in the range [0.0, 1.0].

    The default value is 1.0.

  • TOP_P: a FLOAT64 value in the range [0.0, 1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses.

    The model determines an appropriate value if you don't specify one.

Remote over open models

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [TASK_TYPE AS task_type]
    [, MAX_OUTPUT_TOKENS AS max_output_tokens]
    [, TEMPERATURE AS temperature]
    [, TOP_K AS top_k]
    [, TOP_P AS top_p]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The table must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.
  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    The following column naming requirements apply:

    • The query must have a column named input_text that contains the prompt text to use when evaluating the model.
    • The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  • TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

    • TEXT_GENERATION
    • CLASSIFICATION
    • SUMMARIZATION
    • QUESTION_ANSWERING

    The default value is TEXT_GENERATION.

  • MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words.

    The model determines an appropriate value if you don't specify one.

    The MAX_OUTPUT_TOKENS value must be in the range [1, 4096].

  • TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected.

    The TEMPERATURE value must be in the range [0.0, 1.0].

    The model determines an appropriate value if you don't specify one.

  • TOP_K: an INT64 value in the range [1, 40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.

    A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value.

    For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.

  • TOP_P: a FLOAT64 value in the range [0.0, 1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses.

    The model determines an appropriate value if you don't specify one.

All other models

ML.EVALUATE(
  MODEL `PROJECT_ID.DATASET.MODEL`
  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
  STRUCT(
    [THRESHOLD AS threshold]
    [, TRIAL_ID AS trial_id]))

Arguments

ML.EVALUATE takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL: the name of the model.
  • TABLE: the name of the input table that contains the evaluation data.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.

  • QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax.

    If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

    If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.

  • THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary-class classification models. The default value is 0.5.

    A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

    You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.

  • TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.

    You can't use the TRIAL_ID argument with PCA models.

Output

ML.EVALUATE returns a single row of metrics applicable to the type of model specified.

For models that return them, the precision, recall, f1_score, log_loss, and roc_auc metrics are macro-averaged for all of the class labels. For a macro-average, metrics are calculated for each label and then an unweighted average is taken of those values.
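A macro-average, as described above, is just the unweighted mean of the per-label values. A minimal sketch, using made-up per-label precision values:

```python
def macro_average(per_label_values):
    """Unweighted mean over class labels, regardless of how many
    examples each label has."""
    return sum(per_label_values.values()) / len(per_label_values)

# Hypothetical per-label precision values; each label counts equally.
macro_precision = macro_average({"cat": 0.9, "dog": 0.5, "bird": 0.7})  # ≈ 0.7
```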

Time series

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is FALSE:

  • time_series_id_col or time_series_id_cols: a value that contains the identifiers of a time series. time_series_id_col can be an INT64 or STRING value. time_series_id_cols can be an ARRAY<INT64> or ARRAY<STRING> value. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
  • time_series_timestamp_col: a STRING value that contains the timestamp column for a time series. The column name and type are inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the CREATE MODEL statement.
  • time_series_data_col: a STRING value that contains the data column for a time series. The column name and type are inherited from the TIME_SERIES_DATA_COL option as specified in the CREATE MODEL statement.
  • forecasted_time_series_data_col: a STRING value that contains the same data as time_series_data_col but with forecasted_ prefixed to the column name.
  • lower_bound: a FLOAT64 value that contains the lower bound of the prediction interval.
  • upper_bound: a FLOAT64 value that contains the upper bound of the prediction interval.
  • absolute_error: a FLOAT64 value that contains the absolute value of the difference between the forecasted value and the actual data value.
  • absolute_percentage_error: a FLOAT64 value that contains the absolute value of the absolute error divided by the actual value.

Notes:

The following things are true for time series models when input data is provided and perform_aggregation is FALSE:

  • ML.EVALUATE evaluates the forecasting accuracy of each forecasted timestamp.
  • For history timestamps, the following columns are NULL:
    • forecasted_time_series_data_col
    • lower_bound
    • upper_bound
    • absolute_error
    • absolute_percentage_error
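The absolute_error and absolute_percentage_error columns follow directly from their definitions above. A sketch with hypothetical actual and forecasted values:

```python
def forecast_errors(actual, forecast):
    """absolute_error and absolute_percentage_error for one timestamp,
    per the column definitions above."""
    absolute_error = abs(forecast - actual)
    absolute_percentage_error = abs(absolute_error / actual)
    return absolute_error, absolute_percentage_error

# Forecasting 180 against an actual value of 200 gives an absolute
# error of 20 and an absolute percentage error of 0.1 (10%).
errors = forecast_errors(200.0, 180.0)  # (20.0, 0.1)
```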

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is TRUE:

Notes:

The following things are true for time series models when input data is provided and perform_aggregation is TRUE:

  • ML.EVALUATE evaluates the forecasting accuracy of each forecasted timestamp.
  • The error metrics are aggregated over the forecasting error on each forecasted timestamp.

ML.EVALUATE returns the following columns for an ARIMA_PLUS model when input data isn't provided:

Note: Support for ML.EVALUATE without input data is deprecated. Use ML.ARIMA_EVALUATE instead.

Classification

The following types of models are classification models:

ML.EVALUATE returns the following columns for classification models:

Regression

The following types of models are regression models:

  • Linear regression
  • Boosted tree regressor
  • Random forest regressor
  • Deep neural network (DNN) regressor
  • Wide & Deep regressor
  • AutoML Tables regressor

ML.EVALUATE returns the following columns for regression models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
  • median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
  • r2_score: a FLOAT64 value that contains the R2 score for the model.
  • explained_variance: a FLOAT64 value that contains the explained variance for the model.
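For intuition, three of the error columns above can be computed as follows. The mean squared logarithmic error is shown in its common log1p form, which is an assumption here rather than BigQuery's documented formula, and the data is made up.

```python
import math

def regression_metrics(actual, predicted):
    """mean_absolute_error, mean_squared_error, and a common form of
    mean_squared_log_error (log1p variant; an assumption, see above)."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    msle = sum((math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)) / n
    return mae, mse, msle
```

Note how MSE penalizes the single large miss more heavily than MAE does, while MSLE looks at relative (logarithmic) distance between actual and predicted values.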

K-means

ML.EVALUATE returns the following columns for k-means models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • davies_bouldin_index: a FLOAT64 value that contains the Davies-Bouldin index for the model.
  • mean_squared_distance: a FLOAT64 value that contains the mean squared distance for the model, which is the average of the distances from training data points to their closest centroid.
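The mean_squared_distance column can be sketched as follows, interpreting it as the mean of squared Euclidean distances to the nearest centroid, consistent with the column name (an assumption); the points and centroids below are made up.

```python
def mean_squared_distance(points, centroids):
    """Mean of squared Euclidean distances from each point to its
    nearest centroid (squared form assumed from the column name)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = (min(squared_distance(p, c) for c in centroids) for p in points)
    return sum(nearest) / len(points)
```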

Matrix factorization

ML.EVALUATE returns the following columns for matrix factorization models with implicit feedback:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • recall: a FLOAT64 value that contains the recall for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • normalized_discounted_cumulative_gain: a FLOAT64 value that contains the normalized discounted cumulative gain for the model.
  • average_rank: a FLOAT64 value that contains the average rank for the model.

ML.EVALUATE returns the following columns for matrix factorization models with explicit feedback:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
  • r2_score: a FLOAT64 value that contains the R2 score for the model.
  • explained_variance: a FLOAT64 value that contains the explained variance for the model.

Remote over pre-trained models

This section describes the output for the following types of models:

  • Gemini
  • Anthropic Claude
  • Mistral AI
  • Llama
  • Open models

ML.EVALUATE returns different columns depending on the task_type value that you specify.

When you specify the TEXT_GENERATION task type, the following columns are returned:

When you specify the CLASSIFICATION task type, the following columns are returned:

  • precision: a FLOAT64 column that contains the precision for the model.
  • recall: a FLOAT64 column that contains the recall for the model.
  • f1: a FLOAT64 column that contains the F1 score for the model.
  • label: a STRING column that contains the label generated for the input data.
  • evaluation_status: a STRING column in JSON format that contains the following elements:

    • num_successful_rows: the number of successful inference rows returned from Vertex AI.
    • num_total_rows: the number of total input rows.

When you specify the SUMMARIZATION task type, the following columns are returned:

  • rouge-l_precision: a FLOAT64 column that contains the Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) precision for the model.
  • rouge-l_recall: a FLOAT64 column that contains the ROUGE-L recall for the model.
  • rouge-l_f1: a FLOAT64 column that contains the ROUGE-L F1 score for the model.
  • evaluation_status: a STRING column in JSON format that contains the following elements:

    • num_successful_rows: the number of successful inference rows returned from Vertex AI.
    • num_total_rows: the number of total input rows.

When you specify the QUESTION_ANSWERING task type, the following columns are returned:

  • exact_match: a FLOAT64 column that indicates if the generated text exactly matches the ground truth. This value is 1 if the generated text equals the ground truth, otherwise it is 0. This metric is an average across all of the input rows.
  • evaluation_status: a STRING column in JSON format that contains the following elements:

    • num_successful_rows: the number of successful inference rows returned from Vertex AI.
    • num_total_rows: the number of total input rows.
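The exact_match averaging described above is straightforward to sketch; the strings below are hypothetical examples.

```python
def exact_match_score(ground_truth, generated):
    """1 for each row where the generated text exactly equals the
    ground truth, 0 otherwise, averaged across all input rows."""
    per_row = [1.0 if g == t else 0.0 for g, t in zip(generated, ground_truth)]
    return sum(per_row) / len(per_row)

# One of two hypothetical rows matches exactly, so the score is 0.5.
score = exact_match_score(["yes", "no"], ["yes", "maybe"])  # 0.5
```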

Remote over custom models

ML.EVALUATE returns the following column for remote models over custom models deployed to Vertex AI:

  • remote_eval_metrics: a JSON column containing appropriate metrics for the model type.

PCA

ML.EVALUATE returns the following column for PCA models:

  • total_explained_variance_ratio: a FLOAT64 value that contains the percentage of the cumulative variance explained by all the returned principal components. For more information, see the ML.PRINCIPAL_COMPONENT_INFO function.

Autoencoder

ML.EVALUATE returns the following columns for autoencoder models:

  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.

Limitations

ML.EVALUATE is subject to the following limitations:

Costs

When used with remote models over Vertex AI LLMs, ML.EVALUATE costs are calculated based on the following:

  • The bytes processed from the input table. These charges are billed from BigQuery to your project. For more information, see BigQuery pricing.
  • The input to and output from the LLM. These charges are billed from Vertex AI to your project. For more information, see Vertex AI pricing.

Examples

The following examples show how to use ML.EVALUATE.

ML.EVALUATE with no input data specified

The following query evaluates a model with no input data specified:

SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`)

ML.EVALUATE with a custom threshold and input data

The following query evaluates a model with input data and a custom threshold of 0.55:

SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.mymodel`,
  (SELECT custom_label, column1, column2 FROM `mydataset.mytable`),
  STRUCT(0.55 AS threshold))

ML.EVALUATE to calculate forecasting accuracy of a time series

The following query evaluates the 30-point forecasting accuracy for a time series model:

SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_arima_model`,
  (SELECT timeseries_date, timeseries_metric FROM `mydataset.mytable`),
  STRUCT(TRUE AS perform_aggregation, 30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9.

SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_arima_model`,
  (SELECT timeseries_date, timeseries_metric FROM `mydataset.mytable`),
  STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9. Note that you need to include the side features for the evaluation data.

SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_arima_xreg_model`,
  (SELECT timeseries_date, timeseries_metric, feature1, feature2
   FROM `mydataset.mytable`),
  STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))

ML.EVALUATE to calculate LLM text generation accuracy

The following query evaluates the LLM text generation accuracy for the classification task type for each label from the evaluation table.

SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_llm`,
  (SELECT prompt, label FROM `mydataset.mytable`),
  STRUCT('classification' AS task_type))

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.