The CREATE MODEL statement for AutoML models

This document describes theCREATE MODEL statement for creatingAutoML classification and regression modelsin BigQuery by using SQL. Alternatively, you can use theGoogle Cloud console user interface tocreate a model by using a UI(Preview) instead of constructing the SQLstatement yourself. AutoML lets you quickly build large-scalemachine learning models on tabular data.

You can use AutoML regressor models with theML.PREDICT functionto performregression, and you can useAutoML classifier models with theML.PREDICT function toperformclassification. You can useboth types of AutoML models with theML.PREDICT functionto performanomaly detection.

BigQuery ML uses the default values forAutoML training options,includingdata splitting andoptimization functions.

For more information about supported SQL statements and functions for thismodel, seeEnd-to-end user journeys for ML models.

CREATE MODEL syntax

{CREATE MODEL |CREATE MODEL IF NOT EXISTS |CREATE OR REPLACE MODEL}model_nameOPTIONS(model_option_list)ASquery_statementmodel_option_list:MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }    [,BUDGET_HOURS =float64_value ]    [,OPTIMIZATION_OBJECTIVE = {string_value |struct_value } ]    [,INPUT_LABEL_COLS =string_array ]    [,DATA_SPLIT_COL =string_value ]    [,KMS_KEY_NAME =string_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model nameexists,CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in thespecified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name inthe specified dataset.

model_name

The name of the model you're creating or replacing. The modelname must be unique in the dataset: no other model or table can have the samename. The model name must follow the same naming rules as aBigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is case-sensitive.

If you don't have a default project configured, then you must prepend theproject ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE={'AUTOML_REGRESSOR'|'AUTOML_CLASSIFIER'}

Description

Specifies the model type. This option is required.

Arguments

This option accepts the following values:

  • AUTOML_REGRESSOR: This creates a regression model that uses a label columnwith a numeric data type.
  • AUTOML_CLASSIFIER: This creates a classification model that uses a labelcolumn with either a string or a numeric data type.

BUDGET_HOURS

Syntax

BUDGET_HOURS =float64_value

Description

Sets the training budget in hours for AutoML training.

After training an AutoML model, BigQuery MLcompresses the model to ensure it is small enough to import, which can take upto 50% of the training time. The time to compress the model is not includedin the training budget time.

Arguments

AFLOAT64 value between1.0 and72.0. The default value is1.0.

OPTIMIZATION_OBJECTIVE

Syntax

OPTIMIZATION_OBJECTIVE = {string_value |struct_value }

Description

Sets the optimization objective function to use for AutoMLtraining.

For more details on the optimization objective functions, see theAutoML documentation.

Arguments

This option can be specified as aSTRING orSTRUCT value.

This option accepts the following string values for optimization objectivefunctions:

  • For regression:
    • MINIMIZE_RMSE(default)
    • MINIMIZE_MAE
    • MINIMIZE_RMSLE
  • For binary classification:
    • MAXIMIZE_AU_ROC(default)
    • MINIMIZE_LOG_LOSS
    • MAXIMIZE_AU_PRC
    • MAXIMIZE_PRECISION_AT_RECALL
    • MAXIMIZE_RECALL_AT_PRECISION
  • For multiclass classification:
    • MINIMIZE_LOG_LOSS

For example:

OPTIMIZATION_OBJECTIVE='MAXIMIZE_AU_ROC'

For binary classification models, you can alternatively specify a struct valuefor this option. The struct must contain aSTRING value and aFLOAT64 valuein one of the following combinations:

  • The string value isMAXIMIZE_PRECISION_AT_RECALL and the float valuespecifies the fixed recall value, which must be in the range of[0,1].

  • The string value isMAXIMIZE_RECALL_AT_PRECISION and the float valuespecifies the fixed precision value, which must be in the range of[0,1].

For example:

OPTIMIZATION_OBJECTIVE=STRUCT('MAXIMIZE_PRECISION_AT_RECALL',0.3)

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS =string_array

Description

The name of the label column in the training data. The label column contains theexpected machine learning result for the given record. For example, in a spamdetection dataset, the label column value would probably be eitherspam ornot spam. In a rainfall dataset, the label column value might be the amountof rain that fell during a certain period.

Arguments

A one-elementARRAY of string values. Defaults tolabel.

Supported data types forinput_label_cols include the following:

Model typeSupported label types
automl_regressorINT64
NUMERIC
BIGNUMERIC
FLOAT64
automl_classifierAnygroupable data type

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL =string_value

Description

The name of the column to use to sort input data into the training, validation,or test set. Defaults torandom splitting.

Arguments

The string value must be the name of one of the columns in the training data.This column must have either a timestamp or string data type. This column ispassed directly to AutoML.

If you use a string column, rows are assigned to the appropriate dataset basedon the column's value, which must be one of the following options:

  • TRAIN
  • VALIDATE
  • TEST
  • UNASSIGNED

For more information about how to use these values, seeManual split.

Timestamp columns are used to perform achronological split.

KMS_KEY_NAME

Syntax

KMS_KEY_NAME =string_value

Description

The Cloud Key Management Servicecustomer-managed encryption key (CMEK) touse to encrypt the model.

Arguments

ASTRING value containing the fully-qualified name of the CMEK. For example,

'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'

Supported data types for input columns

For columns other than the label column, anygroupabledata type is supported. The BigQuery column type is used todetermine the feature column type in AutoML.

BigQuery typeAutoML type
INT64
NUMERIC
BIGNUMERIC
FLOAT64
NUMERIC orTIMESTAMP if AutoML determines that it is a UNIX timestamp
BOOLCATEGORICAL
STRING
BYTES
EitherCATEGORICAL orTEXT, auto-selected by AutoML.
TIMESTAMP
DATETIME
TIME
DATE
EitherTIMESTAMP,CATEGORICAL, orTEXT, auto-selected by AutoML.

To force a numeric column to be treated as categorical, use theCAST functionto cast it to a BigQuery string. Arrays of supported types areallowed and remain arrays during AutoML training.

Locations

For information about supported locations, seeLocations for non-remote models.

Limitations

AutoML models have the following limitations:

  • The input data to AutoML must be between 1,000 and200,000,000 rows, and must be less than 100 GB.
  • Global region customer-managed encryption keys (CMEKs) and multi-regionCMEKs, for exampleeu orus, are not supported.
  • BigQuery ML AutoML models aren't visiblein the AutoML user interface, and aren't available forbatch or online predictions in AutoML.
  • Thedefault maximum number of concurrent training jobsis 5. Raising the Vertex AI quota does not modify this quota. If youreceive the errorToo many AutoML training queries have been issued withina short period of time, you can submit a request to raise the maximum numberof concurrent training jobs. To request an increase, emailbqml-feedback@google.com with your project ID and the details of your request.
  • Column names for feature columns must be 125 characters or fewer.
  • ForAUTOML_CLASSIFIER models, thelabel column can contain up to 1,000 unique values; that is, the number of classes is less than or equal to 1,000. If you need to classify into more than 1,000 labels, contactbqml-feedback@google.com.

CREATE MODEL example

The following example creates a model namedmymodel inmydataset in yourdefault project. It uses the publicnyc-tlc.yellow.trips taxi trip dataavailable in BigQuery. The job takes approximately 3 hours tocomplete, including training, model compression, temporary data movement (toAutoML), and setup tasks.

Create the model:

CREATEORREPLACEMODEL`project_id.mydataset.mymodel`OPTIONS(model_type='AUTOML_REGRESSOR',input_label_cols=['fare_amount'],budget_hours=1.0)ASSELECT(tolls_amount+fare_amount)ASfare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_countFROM`nyc-tlc.yellow.trips`WHEREABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetimeASSTRING)),100000))=1ANDtrip_distance>0ANDfare_amount>=2.5ANDfare_amount<=100.0ANDpickup_longitude>-78ANDpickup_longitude<-70ANDdropoff_longitude>-78ANDdropoff_longitude<-70ANDpickup_latitude>37ANDpickup_latitude<45ANDdropoff_latitude>37ANDdropoff_latitude<45ANDpassenger_count>0

Run predictions:

SELECT*FROMML.PREDICT(MODEL`project_id.mydataset.mymodel`,(SELECT*FROM`nyc-tlc.yellow.trips`LIMIT100))

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.