The ML.PREDICT function

This document describes the ML.PREDICT function, which you can use to predict outcomes by using a model. ML.PREDICT works with the model types described in the Output section of this document.

For PCA and autoencoder models, you can use the AI.GENERATE_EMBEDDING function as an alternative to the ML.PREDICT function. AI.GENERATE_EMBEDDING generates the same embedding data as ML.PREDICT, but as an array in a single column rather than in a series of columns. Having all of the embeddings in a single column lets you directly use the VECTOR_SEARCH function on the AI.GENERATE_EMBEDDING output.

You can run prediction during model creation, after model creation, or after a failure (as long as at least one iteration is finished). ML.PREDICT always uses the model weights from the last successful iteration.

Syntax

ML.PREDICT(
  MODEL `PROJECT_ID.DATASET.MODEL_NAME`,
  { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) },
  STRUCT(
    [THRESHOLD AS threshold]
    [, KEEP_ORIGINAL_COLUMNS AS keep_original_columns]
    [, TRIAL_ID AS trial_id]))

Arguments

ML.PREDICT takes the following arguments:

  • PROJECT_ID: the project that contains the resource.
  • DATASET: the dataset that contains the resource.
  • MODEL_NAME: the name of the model.
  • TABLE: the name of the input table that contains the data to run prediction on.

    If TABLE is specified, the input column names in the table must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.

    For TensorFlow Lite, Open Neural Network Exchange (ONNX), and XGBoost models, the input must be convertible to the type expected by the model.

    For remote models, the input columns must contain all Vertex AI endpoint input fields.

    If there are unused columns from the table, they are passed through as output columns.

  • QUERY_STATEMENT: the GoogleSQL query that is used to generate the data to run prediction on. See the GoogleSQL query syntax page for the supported SQL syntax of the QUERY_STATEMENT clause.

    If QUERY_STATEMENT is specified, the input column names from the query must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.

    For TensorFlow Lite, ONNX, and XGBoost models, the input must be convertible to the type expected by the model.

    For remote models, the input columns must contain all Vertex AI endpoint input fields.

    If there are unused columns from the query, they are passed through as output columns.

    If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then only the input columns present in the TRANSFORM clause must appear in QUERY_STATEMENT.

    If you are running inference on image data from an object table, you must use the ML.DECODE_IMAGE function to convert image bytes to a multi-dimensional ARRAY representation. You can use ML.DECODE_IMAGE output directly in an ML.PREDICT statement, or you can write the results from ML.DECODE_IMAGE to a table column and reference that column when you call ML.PREDICT. For more information, see Predict an outcome from image data with an imported TensorFlow model.

  • THRESHOLD: a FLOAT64 value that specifies a custom threshold for a binary classification model. It is used as the cutoff between the two labels. Predictions above the threshold are positive predictions. Predictions below the threshold are negative predictions. The default value is 0.5.

  • KEEP_ORIGINAL_COLUMNS: a BOOL value that specifies whether to output the input table columns. If TRUE, the columns from the input table are output. The default value is FALSE.

    KEEP_ORIGINAL_COLUMNS only applies to principal component analysis (PCA) models.

  • TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
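
The effect of the THRESHOLD argument can be pictured with a small Python sketch that runs outside BigQuery. The function name apply_threshold and the sample probabilities are hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption; the description above only covers values above and below the cutoff.

```python
def apply_threshold(positive_prob: float, threshold: float = 0.5) -> bool:
    """Return True for a positive prediction, False for a negative one.

    Assumption: a probability exactly equal to the threshold counts as
    positive. The argument description does not state the tie behavior.
    """
    return positive_prob >= threshold


# Hypothetical predicted probabilities of the positive label.
probs = [0.40, 0.52, 0.58, 0.91]

default_labels = [apply_threshold(p) for p in probs]
custom_labels = [apply_threshold(p, threshold=0.55) for p in probs]

print(default_labels)
print(custom_labels)
```

Raising the threshold from the default of 0.5 to 0.55 moves the 0.52 row from the positive label to the negative label and leaves the other rows unchanged.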

Output

The output of the ML.PREDICT function has as many rows as the input table, and it includes all columns from the input table and all output columns from the model. The output column names for the model are predicted_<label_column_name> and, for classification models, predicted_<label_column_name>_probs. In both columns, label_column_name is the name of the input label column that's used during training.
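
As a rough illustration of the naming rule above, the following Python sketch derives the model output column names from a label column name. The helper output_columns and the label name churned are hypothetical; the sketch only restates the pattern and is not part of BigQuery ML.

```python
from typing import List


def output_columns(label_column_name: str, is_classifier: bool) -> List[str]:
    """Build the model output column names for a given training label."""
    columns = [f"predicted_{label_column_name}"]
    if is_classifier:
        # Classification models also return per-label probabilities.
        columns.append(f"predicted_{label_column_name}_probs")
    return columns


print(output_columns("label", is_classifier=False))
print(output_columns("churned", is_classifier=True))
```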

Regression models

For the following types of regression models:

  • Linear regression
  • Boosted tree regressor
  • Random forest regressor
  • DNN regressor
  • Wide-and-deep regressor

The following column is returned:

  • predicted_<label_column_name>: a FLOAT64 value that contains the predicted value of the label.

Classification models

For the following types of binary-class classification models:

  • Logistic regression
  • Boosted tree classifier
  • Random forest classifier
  • DNN classifier
  • Wide-and-deep classifier

The following columns are returned:

  • predicted_<label_column_name>: a STRING value that contains one of the two input labels, depending on which label has the higher predicted probability.
  • predicted_<label_column_name>_probs: an ARRAY<STRUCT> value in the form [<label, probability>] that contains the predicted probability of each label.

For the following types of multiclass classification models:

  • Logistic regression
  • Boosted tree classifier
  • Random forest classifier
  • DNN classifier
  • Wide-and-deep classifier

The following columns are returned:

  • predicted_<label_column_name>: a STRING value that contains the label with the highest predicted probability score.
  • predicted_<label_column_name>_probs: a FLOAT64 value that contains the probability for each class label, calculated using a softmax function.
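
For readers unfamiliar with softmax, the following Python sketch shows how per-class scores turn into probabilities that sum to 1, and how the predicted label is the one with the highest probability. The labels and raw scores are hypothetical, and the sketch is a generic softmax rather than BigQuery ML's internal implementation.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class labels and raw model scores for one input row.
labels = ["cat", "dog", "bird"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)
predicted = labels[probs.index(max(probs))]

print(dict(zip(labels, probs)))
print(predicted)
```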

K-means models

For k-means models, the following columns are returned:

  • centroid_id: an INT64 value that identifies the centroid.
  • nearest_centroids_distance: an ARRAY<STRUCT> value that contains the distances to the nearest k clusters, where k is equal to the lesser of num_clusters or 5. If the model was created with the standardize_features option set to TRUE, then the model computes these distances using standardized features; otherwise, it computes these distances using non-standardized features.
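
The shape of nearest_centroids_distance can be sketched in Python as follows. The function name, the centroid coordinates, and the use of Euclidean distance are assumptions made for illustration; feature standardization is not modeled.

```python
import math


def nearest_centroids_distance(point, centroids, max_returned=5):
    """Return (centroid_id, distance) pairs for the nearest k centroids.

    k is the lesser of the number of clusters and max_returned.
    Assumption: Euclidean distance on the features as given.
    """
    k = min(len(centroids), max_returned)
    distances = [
        (centroid_id, math.dist(point, center))
        for centroid_id, center in centroids.items()
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Hypothetical 2-D centroids keyed by centroid_id.
centroids = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 8.0)}
result = nearest_centroids_distance((0.0, 0.0), centroids)

centroid_id = result[0][0]  # the closest centroid
print(centroid_id, result)
```

With three clusters, all three distances are returned; with more than five clusters, only the five smallest distances would be kept.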

PCA models

For PCA models, the following columns are returned:

  • principal_component_<index>: a FLOAT64 value that represents the projection of the input data onto each principal component. These values can also be considered as embedded low-dimensional features in the space that is spanned by the principal components.

The original input columns are appended if the keep_original_columns argument is set to TRUE.

Autoencoder models

For autoencoder models, the following columns are returned:

  • latent_col_<index>: a FLOAT64 value that represents the dimensions of the latent space.

The original input columns are appended after the latent space columns.

Imported models

For TensorFlow Lite models, the output is the output of the TensorFlow Lite model's predict method.

For ONNX models, the output is the output of the ONNX model's predict method.

For XGBoost models, the output is the output of the XGBoost model's predict method.

Remote models

For remote models, the output columns contain all Vertex AI endpoint output fields, and also a remote_model_status field that contains status messages from the Vertex AI endpoint.

Missing data imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data differently, based on the type of data in the column.

Column type | Imputation method
Numeric | In both training and prediction, NULL values in numeric columns are replaced with the mean value of the given column, as calculated by the feature column in the original input data.
One-hot/Multi-hot encoded | In both training and prediction, NULL values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction.
TIMESTAMP | TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces values with the mean Unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature.
STRUCT | In both training and prediction, each field of the STRUCT is imputed according to its type.
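
The numeric and one-hot rules in the table can be sketched in plain Python. All function names and sample values here are hypothetical, and the sketch only mirrors the rules as described; it is not BigQuery ML's implementation. None stands in for NULL.

```python
def fit_numeric_mean(training_values):
    """Mean of the non-NULL training values for one numeric column."""
    present = [v for v in training_values if v is not None]
    return sum(present) / len(present)


def impute_numeric(value, training_mean):
    """Replace a NULL numeric value with the training mean."""
    return training_mean if value is None else value


def one_hot(value, categories):
    """One-hot encode with an extra trailing slot for NULL.

    A previously unseen value matches no slot, so every weight is 0.
    Assumption: categories holds the values seen in training, without None.
    """
    slots = list(categories) + [None]
    return [1 if value == slot else 0 for slot in slots]


mean = fit_numeric_mean([1.0, None, 3.0])
print(impute_numeric(None, mean))   # NULL replaced by the training mean
print(one_hot("a", ["a", "b"]))     # known category
print(one_hot(None, ["a", "b"]))    # NULL maps to the additional category
print(one_hot("z", ["a", "b"]))     # unseen value gets all-zero weights
```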

Permissions

You must have the bigquery.models.getData Identity and Access Management (IAM) permission in order to run ML.PREDICT.

Examples

The following examples assume your model and input table are in your default project.

Predict an outcome

The following example predicts an outcome and returns the following columns:

  • predicted_label
  • label
  • column1
  • column2

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.mymodel`,
  (SELECT label, column1, column2
   FROM `mydataset.mytable`))

Compare predictions from two different models

The following example creates two models and then compares their output:

  1. Create the first model:

    CREATE MODEL `mydataset.mymodel1`
      OPTIONS(
        model_type='linear_reg',
        input_label_cols=['label'])
    AS SELECT label, input_column1
    FROM `mydataset.mytable`

  2. Create the second model:

    CREATE MODEL `mydataset.mymodel2`
      OPTIONS(
        model_type='linear_reg',
        input_label_cols=['label'])
    AS SELECT label, input_column2
    FROM `mydataset.mytable`

  3. Compare the output of the two models:

    SELECT label, predicted_label1, predicted_label AS predicted_label2
    FROM ML.PREDICT(
      MODEL `mydataset.mymodel2`,
      (SELECT * EXCEPT (predicted_label), predicted_label AS predicted_label1
       FROM ML.PREDICT(
         MODEL `mydataset.mymodel1`,
         TABLE `mydataset.mytable`)))

Specify a custom threshold

The following example runs prediction with input data and a custom threshold of 0.55:

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.mymodel`,
  (SELECT custom_label, column1, column2
   FROM `mydataset.mytable`),
  STRUCT(0.55 AS threshold))

Predict an outcome from structured data with an imported TensorFlow model

The following query predicts outcomes using an imported TensorFlow model. The input_data table contains inputs in the schema expected by my_model. See the CREATE MODEL statement for TensorFlow models for more information.

SELECT *
FROM ML.PREDICT(
  MODEL `my_project.my_dataset.my_model`,
  (SELECT * FROM input_data))

Predict an outcome from image data with an imported TensorFlow model

If you are running inference on image data from an object table, you must use the ML.DECODE_IMAGE function to convert image bytes to a multi-dimensional ARRAY representation. You can use ML.DECODE_IMAGE output directly in an ML.PREDICT function, or you can write the results from ML.DECODE_IMAGE to a table column and reference that column when you call ML.PREDICT. You can also pass ML.DECODE_IMAGE output to another image processing function for additional preprocessing during either of these procedures.

You can join the object table to standard BigQuery tables to limit the data used in inference, or to provide additional input to the model.

The following examples show different ways you can use the ML.PREDICT function with image data.

Example 1

The following example uses the ML.DECODE_IMAGE function directly in the ML.PREDICT function. It returns the inference results for all images in the object table, for a model with an input field of input and an output field of feature:

SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.vision_model`,
  (SELECT uri, ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 480, 480, FALSE) AS input
   FROM `my_dataset.object_table`));

Example 2

The following example uses the ML.DECODE_IMAGE function directly in the ML.PREDICT function, and uses the ML.CONVERT_COLOR_SPACE function in the ML.PREDICT function to convert the image color space from RGB to YIQ. It also shows how to use object table fields to filter the objects included in inference. It returns the inference results for all JPG images in the object table, for a model with an input field of input and an output field of feature:

SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.vision_model`,
  (SELECT
     uri,
     ML.CONVERT_COLOR_SPACE(
       ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 224, 280, TRUE), 'YIQ') AS input
   FROM `my_dataset.object_table`
   WHERE content_type = 'image/jpeg'));

Example 3

The following example uses results from ML.DECODE_IMAGE that have been written to a table column but not processed any further. It uses ML.RESIZE_IMAGE and ML.CONVERT_IMAGE_TYPE in the ML.PREDICT function to process the image data. It returns the inference results for all images in the decoded images table, for a model with an input field of input and an output field of feature.

Create the decoded images table. The uri column is kept so that the inference query can select it:

CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (
  SELECT uri, ML.DECODE_IMAGE(data) AS decoded_image
  FROM `my_dataset.object_table`);

Run inference on the decoded images table:

SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.vision_model`,
  (SELECT
     uri,
     ML.CONVERT_IMAGE_TYPE(
       ML.RESIZE_IMAGE(decoded_image, 480, 480, FALSE)) AS input
   FROM `my_dataset.decoded_images`));

Example 4

The following example uses results from ML.DECODE_IMAGE that have been written to a table column and preprocessed using ML.RESIZE_IMAGE. It returns the inference results for all images in the decoded images table, for a model with an input field of input and an output field of feature.

Create the table. The uri column is kept so that the inference query can select it:

CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (
  SELECT
    uri,
    ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 480, 480, FALSE) AS decoded_image
  FROM `my_dataset.object_table`);

Run inference on the decoded images table:

SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.vision_model`,
  (SELECT uri, decoded_image AS input
   FROM `my_dataset.decoded_images`));

Example 5

The following example uses the ML.DECODE_IMAGE function directly in the ML.PREDICT function. In this example, the model has an output field of embeddings and two input fields: one that expects an image, f_img, and one that expects a string, f_txt. The image input comes from the object table and the string input comes from a standard BigQuery table that is joined with the object table by using the uri column.

SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.mixed_model`,
  (SELECT
     obj.uri,
     ML.RESIZE_IMAGE(ML.DECODE_IMAGE(obj.data), 224, 224, FALSE) AS f_img,
     descr.description AS f_txt
   FROM `my_dataset.object_table` AS obj
   JOIN `my_dataset.image_description` AS descr
     ON obj.uri = descr.uri));

Predict an outcome with a model trained with the TRANSFORM clause

The following example trains a model using the TRANSFORM clause:

CREATE MODEL `mydataset.mymodel`
  TRANSFORM(f1 + f2 AS c, label)
  OPTIONS(...)
AS SELECT f1, f2, f3, label FROM t;

Because the f3 column doesn't appear in the TRANSFORM clause, the following prediction query omits that column in the QUERY_STATEMENT:

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.mymodel`,
  (SELECT f1, f2 FROM t1));

If f3 is provided in the SELECT statement, it isn't used for calculating predictions but is instead passed through for use in the rest of the SQL statement.

Predict dimensionality reduction results (latent space) with an autoencoder model

The following example runs prediction against a previously built autoencoder model, where the input was 4-dimensional (4 input columns) and the dimensionality reduction had 2 dimensions (2 output columns):

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.mymodel`,
  (SELECT f1, f2, f3, f4 FROM t1));

Last updated 2025-11-25 UTC.