The CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a query that contains a CREATE MODEL statement, a query job is generated for you that processes the query. You can also use the Google Cloud console user interface to create a model by using a UI (Preview).

For more information about the supported SQL statements and functions for each model type, see the model-specific CREATE MODEL statement documentation.

Required permissions

  • To create a dataset to store the model, you need the bigquery.datasets.create IAM permission.

  • To create a model, you need the following permissions:

    • bigquery.jobs.create
    • bigquery.models.create
    • bigquery.models.getData
    • bigquery.models.updateData
    • bigquery.connections.delegate (for remote models)

Several predefined IAM roles grant these permissions.

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

CREATE MODEL syntax

Note: This syntax statement provides a comprehensive list of model types with their model options. When you create a model, use the model-specific CREATE MODEL statement for convenience. You can view specific CREATE MODEL statements by clicking the MODEL_TYPE name in the following list, in the table of contents in the left panel, or in the create model link in the End-to-end user journey for each model.
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[TRANSFORM (select_list)]
[INPUT (field_name field_type) OUTPUT (field_name field_type)]
[REMOTE WITH CONNECTION {`connection_name` | DEFAULT}]
[OPTIONS(model_option_list)]
[AS {query_statement |
  (
    training_data AS (query_statement),
    custom_holiday AS (holiday_statement)
  )}]

model_option_list:
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'MATRIX_FACTORIZATION' |
    'PCA' | 'AUTOENCODER' | 'AUTOML_CLASSIFIER' | 'AUTOML_REGRESSOR' |
    'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' |
    'RANDOM_FOREST_CLASSIFIER' | 'RANDOM_FOREST_REGRESSOR' |
    'DNN_CLASSIFIER' | 'DNN_REGRESSOR' |
    'DNN_LINEAR_COMBINED_CLASSIFIER' | 'DNN_LINEAR_COMBINED_REGRESSOR' |
    'ARIMA_PLUS' | 'ARIMA_PLUS_XREG' | 'TENSORFLOW' | 'TENSORFLOW_LITE' |
    'ONNX' | 'XGBOOST' | 'CONTRIBUTION_ANALYSIS' }
    [, MODEL_REGISTRY = { 'VERTEX_AI' } ]
    [, VERTEX_AI_MODEL_ID = string_value ]
    [, VERTEX_AI_MODEL_VERSION_ALIASES = string_array ]
    [, INPUT_LABEL_COLS = string_array ]
    [, MAX_ITERATIONS = int64_value ]
    [, EARLY_STOP = { TRUE | FALSE } ]
    [, MIN_REL_PROGRESS = float64_value ]
    [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
    [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
    [, DATA_SPLIT_TEST_FRACTION = float64_value ]
    [, DATA_SPLIT_COL = string_value ]
    [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
    [, L1_REG = float64_value ]
    [, L2_REG = float64_value ]
    [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
    [, LEARN_RATE = float64_value ]
    [, LS_INIT_LEARN_RATE = float64_value ]
    [, WARM_START = { TRUE | FALSE } ]
    [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
    [, CLASS_WEIGHTS = struct_array ]
    [, INSTANCE_WEIGHT_COL = string_value ]
    [, NUM_CLUSTERS = int64_value ]
    [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]
    [, KMEANS_INIT_COL = string_value ]
    [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
    [, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
    [, MODEL_PATH = string_value ]
    [, BUDGET_HOURS = float64_value ]
    [, OPTIMIZATION_OBJECTIVE = { string_value | struct_value } ]
    [, FEEDBACK_TYPE = { 'EXPLICIT' | 'IMPLICIT' } ]
    [, NUM_FACTORS = int64_value ]
    [, USER_COL = string_value ]
    [, ITEM_COL = string_value ]
    [, RATING_COL = string_value ]
    [, WALS_ALPHA = float64_value ]
    [, BOOSTER_TYPE = { 'gbtree' | 'dart' } ]
    [, NUM_PARALLEL_TREE = int64_value ]
    [, DART_NORMALIZE_TYPE = { 'tree' | 'forest' } ]
    [, TREE_METHOD = { 'auto' | 'exact' | 'approx' | 'hist' } ]
    [, MIN_TREE_CHILD_WEIGHT = float64_value ]
    [, COLSAMPLE_BYTREE = float64_value ]
    [, COLSAMPLE_BYLEVEL = float64_value ]
    [, COLSAMPLE_BYNODE = float64_value ]
    [, MIN_SPLIT_LOSS = float64_value ]
    [, MAX_TREE_DEPTH = int64_value ]
    [, SUBSAMPLE = float64_value ]
    [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } ]
    [, BATCH_SIZE = int64_value ]
    [, DROPOUT = float64_value ]
    [, HIDDEN_UNITS = int_array ]
    [, OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } ]
    [, TIME_SERIES_TIMESTAMP_COL = string_value ]
    [, TIME_SERIES_DATA_COL = string_value ]
    [, TIME_SERIES_ID_COL = { string_value | string_array } ]
    [, HORIZON = int64_value ]
    [, AUTO_ARIMA = { TRUE | FALSE } ]
    [, AUTO_ARIMA_MAX_ORDER = int64_value ]
    [, AUTO_ARIMA_MIN_ORDER = int64_value ]
    [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
    [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | ... } ]
    [, FORECAST_LIMIT_LOWER_BOUND = float64_value ]
    [, FORECAST_LIMIT_UPPER_BOUND = float64_value ]
    [, INCLUDE_DRIFT = { TRUE | FALSE } ]
    [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
    [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
    [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
    [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]
    [, HIERARCHICAL_TIME_SERIES_COLS = { string_array } ]
    [, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } ]
    [, APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE } ]
    [, INTEGRATED_GRADIENTS_NUM_STEPS = int64_value ]
    [, CALCULATE_P_VALUES = { TRUE | FALSE } ]
    [, FIT_INTERCEPT = { TRUE | FALSE } ]
    [, CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'DUMMY_ENCODING' |
        'LABEL_ENCODING' | 'TARGET_ENCODING' } ]
    [, { ENDPOINT = string_value | HUGGING_FACE_MODEL_ID = string_value | MODEL_GARDEN_MODEL_NAME = string_value } ]
    [, HUGGING_FACE_TOKEN = string_value ]
    [, MACHINE_TYPE = string_value ]
    [, MIN_REPLICA_COUNT = int64_value ]
    [, MAX_REPLICA_COUNT = int64_value ]
    [, RESERVATION_AFFINITY_TYPE = { 'NO_RESERVATION' | 'ANY_RESERVATION' | 'SPECIFIC_RESERVATION' } ]
    [, RESERVATION_AFFINITY_KEY = string_value ]
    [, RESERVATION_AFFINITY_VALUES = string_array ]
    [, ENDPOINT_IDLE_TTL = interval_value ]
    [, REMOTE_SERVICE_TYPE = { 'CLOUD_AI_VISION_V1' | 'CLOUD_AI_NATURAL_LANGUAGE_V1' |
        'CLOUD_AI_TRANSLATE_V3' } ]
    [, XGBOOST_VERSION = { '0.9' | '1.1' } ]
    [, TF_VERSION = { '1.15' | '2.8.0' } ]
    [, NUM_TRIALS = int64_value ]
    [, MAX_PARALLEL_TRIALS = int64_value ]
    [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]
    [, HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... } ]
    [, NUM_PRINCIPAL_COMPONENTS = int64_value ]
    [, PCA_EXPLAINED_VARIANCE_RATIO = float64_value ]
    [, SCALE_FEATURES = { TRUE | FALSE } ]
    [, PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' } ]
    [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
    [, MIN_TIME_SERIES_LENGTH = int64_value ]
    [, MAX_TIME_SERIES_LENGTH = int64_value ]
    [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ]
    [, SEASONALITIES = string_array ]
    [, PROMPT_COL = string_value ]
    [, LEARNING_RATE_MULTIPLIER = float64_value ]
    [, ACCELERATOR_TYPE = { 'GPU' | 'TPU' } ]
    [, EVALUATION_TASK = { 'TEXT_GENERATION' | 'CLASSIFICATION' | 'SUMMARIZATION' |
        'QUESTION_ANSWERING' | 'UNSPECIFIED' } ]
    [, DOCUMENT_PROCESSOR = string_value ]
    [, SPEECH_RECOGNIZER = string_value ]
    [, KMS_KEY_NAME = string_value ]
    [, CONTRIBUTION_METRIC = string_value ]
    [, DIMENSION_ID_COLS = string_array ]
    [, IS_TEST_COL = string_value ]
    [, MIN_APRIORI_SUPPORT = float64_value ]
    [, PRUNING_METHOD = { 'NO_PRUNING' | 'PRUNE_REDUNDANT_INSIGHTS' } ]
    [, TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = int64_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is case-sensitive.

If you don't have a default project configured, prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`; for example, `myproject.mydataset.mymodel`.

TRANSFORM

TRANSFORM lets you specify all preprocessing during model creation and have it automatically applied during prediction and evaluation.

For example, you can create the following model:

CREATE OR REPLACE MODEL `myproject.mydataset.mymodel`
  TRANSFORM(
    ML.FEATURE_CROSS(STRUCT(f1, f2)) AS cross_f,
    ML.QUANTILE_BUCKETIZE(f3) OVER() AS buckets,
    label_col)
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label_col'])
AS SELECT * FROM t

During prediction, you don't need to preprocess the input again, and the same transformations are automatically restored:

SELECT * FROM ML.PREDICT(MODEL `myproject.mydataset.mymodel`, (SELECT f1, f2, f3 FROM table))

When the TRANSFORM clause is present, only output columns from the TRANSFORM clause are used in training. Any results from query_statement that don't appear in the TRANSFORM clause are ignored.

The input columns of the TRANSFORM clause are the result of query_statement. So, the final input used in training is the set of columns generated by the following query:

SELECT (select_list) FROM (query_statement);

Input columns of the TRANSFORM clause can be of any SIMPLE type or an ARRAY of a SIMPLE type. SIMPLE types are non-STRUCT and non-ARRAY data types.

In prediction (ML.PREDICT), users only need to pass in the original columns from the query_statement that are used inside the TRANSFORM clause. The columns dropped in TRANSFORM don't need to be provided during prediction. TRANSFORM is automatically applied to the input data during prediction, including the statistics used in ML analytic functions (for example, ML.QUANTILE_BUCKETIZE).

To learn more about feature preprocessing, see the Feature preprocessing overview, or try the Feature Engineering Functions notebook.

To try using the TRANSFORM clause, see the Use the BigQuery ML TRANSFORM clause for feature engineering tutorial or the Create Model With Inline Transpose notebook.

select_list

You can pass columns from query_statement through to model training without transformation, either by using * EXCEPT(), or by listing the column names directly.

Not all columns from query_statement are required to appear in the TRANSFORM clause, so you can drop columns that appear in query_statement by omitting them from the TRANSFORM clause.

You can transform inputs from query_statement by using expressions in select_list. select_list is similar to a normal SELECT statement. select_list supports the following syntax:

  • *
  • * EXCEPT()
  • * REPLACE()
  • expression
  • expression.*

The following can't appear inside select_list:

  • Aggregation functions.
  • Non-BigQuery ML analytic functions. For more information about supported functions, see Manual feature preprocessing.
  • UDFs.
  • Subqueries.
  • Anonymous columns. For example, a + b AS c is allowed, while a + b isn't.

The output columns of select_list can be of any BigQuery supported data type.

If present, the following columns must appear in select_list without transformation:

  • label
  • data_split_col
  • kmeans_init_col
  • instance_weight_col

If these columns are returned by query_statement, you must reference them in select_list by column name outside of any expression, or by using *. You can't use aliases with these columns.
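
For example, the following sketch (the table and columns f1 through f4 and label_col are illustrative) passes every column except f4 through unchanged, including the label, and adds one transformed feature:

CREATE OR REPLACE MODEL `myproject.mydataset.mymodel`
  TRANSFORM(
    * EXCEPT(f4),  -- pass the remaining columns, including label_col, through untransformed
    ML.STANDARD_SCALER(f1) OVER() AS f1_scaled  -- add a scaled copy of f1 as a new feature
  )
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label_col'])
AS SELECT f1, f2, f3, f4, label_col FROM `myproject.mydataset.training_table`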

INPUT and OUTPUT

The INPUT and OUTPUT clauses are used to specify the input and output format for remote models or XGBoost models.

field_name

For remote models, INPUT and OUTPUT field names must be identical to the field names of the Vertex AI endpoint request and response. For examples, see the remote model INPUT and OUTPUT clause.

For XGBoost models, INPUT field names must be identical to the names in the feature_names field, if the feature_names field is populated in the XGBoost model file. For more details, see the XGBoost INPUT and OUTPUT clause.

field_type

Remote models support the following BigQuery data types for the INPUT and OUTPUT clauses:

XGBoost models only support numeric types for the INPUT field type and FLOAT64 for the OUTPUT field type.
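
For example, a minimal sketch of an imported XGBoost model with explicit INPUT and OUTPUT clauses might look like the following (the Cloud Storage path and field names are illustrative):

CREATE OR REPLACE MODEL `myproject.mydataset.imported_xgboost_model`
  INPUT (f1 FLOAT64, f2 FLOAT64)     -- must match feature_names in the model file, if populated
  OUTPUT (predicted_label FLOAT64)   -- XGBoost models only support FLOAT64 output
  OPTIONS (MODEL_TYPE = 'XGBOOST',
           MODEL_PATH = 'gs://mybucket/path/to/saved_model/*')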

connection_name

BigQuery uses a CLOUD_RESOURCE connection to interact with your Vertex AI endpoint. You need to grant the Vertex AI User role to the connection's service account in your Vertex AI endpoint project.

See examples in the remote model CONNECTION statement.

To use a default connection, specify DEFAULT instead of the connection name.
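
As a minimal sketch (the project, region, connection, and endpoint values are illustrative), a remote model over a hosted Vertex AI endpoint looks like the following:

CREATE OR REPLACE MODEL `myproject.mydataset.remote_model`
  INPUT (input STRING)    -- field names must match the endpoint request
  OUTPUT (output STRING)  -- field names must match the endpoint response
  REMOTE WITH CONNECTION `myproject.us.my_connection`  -- or DEFAULT for the default connection
  OPTIONS (ENDPOINT = 'https://us-central1-aiplatform.googleapis.com/v1/projects/myproject/locations/us-central1/endpoints/1234')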

model_option_list

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'PCA' |
    'MATRIX_FACTORIZATION' | 'AUTOENCODER' | 'AUTOML_REGRESSOR' |
    'AUTOML_CLASSIFIER' | 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' |
    'RANDOM_FOREST_CLASSIFIER' | 'RANDOM_FOREST_REGRESSOR' | 'DNN_CLASSIFIER' |
    'DNN_REGRESSOR' | 'DNN_LINEAR_COMBINED_CLASSIFIER' |
    'DNN_LINEAR_COMBINED_REGRESSOR' | 'ARIMA_PLUS' | 'ARIMA_PLUS_XREG' |
    'TENSORFLOW' | 'TENSORFLOW_LITE' | 'ONNX' | 'XGBOOST' | 'CONTRIBUTION_ANALYSIS' }

Description

Specify the model type. This argument is required.

Arguments

The supported values for this argument are listed in the Model type column of the following table.

Model category | Model type | Description | Model-specific CREATE MODEL statement
Regression | 'LINEAR_REG' | Linear regression for real-valued label prediction; for example, the sales of an item on a given day. | CREATE MODEL statement for generalized linear models
Regression | 'BOOSTED_TREE_REGRESSOR' | Create a boosted tree regressor model using the XGBoost library. | CREATE MODEL statement for boosted tree models
Regression | 'RANDOM_FOREST_REGRESSOR' | Create a random forest regressor model using the XGBoost library. | CREATE MODEL statement for random forest models
Regression | 'DNN_REGRESSOR' | Create a deep neural network (DNN) regressor model. | CREATE MODEL statement for DNN models
Regression | 'DNN_LINEAR_COMBINED_REGRESSOR' | Create a Wide-and-Deep regressor model. | CREATE MODEL statement for Wide-and-Deep models
Regression | 'AUTOML_REGRESSOR' | Create a regression model using AutoML. | CREATE MODEL statement for AutoML models
Classification | 'LOGISTIC_REG' | Logistic regression for binary-class or multi-class classification; for example, determining whether a customer will make a purchase. | CREATE MODEL statement for generalized linear models
Classification | 'BOOSTED_TREE_CLASSIFIER' | Create a boosted tree classifier model using the XGBoost library. | CREATE MODEL statement for boosted tree models
Classification | 'RANDOM_FOREST_CLASSIFIER' | Create a random forest classifier model using the XGBoost library. | CREATE MODEL statement for random forest models
Classification | 'DNN_CLASSIFIER' | Create a deep neural network (DNN) classifier model. | CREATE MODEL statement for DNN models
Classification | 'DNN_LINEAR_COMBINED_CLASSIFIER' | Create a Wide-and-Deep classifier model. | CREATE MODEL statement for Wide-and-Deep models
Classification | 'AUTOML_CLASSIFIER' | Create a classification model using AutoML. | CREATE MODEL statement for AutoML models
Clustering | 'KMEANS' | K-means clustering for data segmentation; for example, identifying customer segments. | CREATE MODEL statement for K-means models
Collaborative filtering | 'MATRIX_FACTORIZATION' | Matrix factorization for recommendation systems. For example, given a set of users, items, and some ratings for a subset of the items, creates a model to predict a user's rating for items they have not rated. | CREATE MODEL statement for matrix factorization models
Dimensionality reduction | 'PCA' | Principal component analysis for dimensionality reduction. | CREATE MODEL statement for PCA models
Dimensionality reduction | 'AUTOENCODER' | Create an autoencoder model for anomaly detection, dimensionality reduction, and embedding purposes. | CREATE MODEL statement for autoencoder models
Time series forecasting | 'ARIMA_PLUS' (previously 'ARIMA') | Univariate time series forecasting with many modeling components under the hood, such as an ARIMA model for the trend, STL and ETS for seasonality, and holiday effects. | CREATE MODEL statement for time series models
Time series forecasting | 'ARIMA_PLUS_XREG' | Multivariate time series forecasting using linear regression and ARIMA_PLUS as the underlying techniques. | CREATE MODEL statement for time series models
Augmented analytics | 'CONTRIBUTION_ANALYSIS' | Create a contribution analysis model to find key drivers of a change. | CREATE MODEL statement for contribution analysis models
Importing models | 'TENSORFLOW' | Create a model by importing a TensorFlow model into BigQuery. | CREATE MODEL statement for TensorFlow models
Importing models | 'TENSORFLOW_LITE' | Create a model by importing a TensorFlow Lite model into BigQuery. | CREATE MODEL statement for TensorFlow Lite models
Importing models | 'ONNX' | Create a model by importing an ONNX model into BigQuery. | CREATE MODEL statement for ONNX models
Importing models | 'XGBOOST' | Create a model by importing an XGBoost model into BigQuery. | CREATE MODEL statement for XGBoost models
Remote models | N/A | Create a model by specifying a Cloud AI service, or the endpoint for a Vertex AI model. | CREATE MODEL statement for remote models over Google models in Vertex AI; CREATE MODEL statement for remote models over hosted models in Vertex AI; CREATE MODEL statement for remote models over Cloud AI services

Note: We are deprecating ARIMA as the model type. While the model training pipelines of ARIMA and ARIMA_PLUS are the same, ARIMA_PLUS supports more capabilities, including support for a new training option, DECOMPOSE_TIME_SERIES, and table-valued functions including ML.ARIMA_EVALUATE and ML.EXPLAIN_FORECAST.

Other model options

The following table provides a comprehensive list of model options, with a brief description of each and the model types it applies to. You can find detailed descriptions in the model-specific CREATE MODEL statement by clicking the model type in the "Applied model types" column.

For supervised learning models, unless "regressor" or "classifier" is explicitly listed, a model option applies to both the regressor and the classifier. For example, "Boosted trees" means the option applies to both the boosted tree regressor and the boosted tree classifier, while "Boosted tree classifier" applies only to the classifier.

Name | Description | Applied model types
MODEL_REGISTRY | Specifies the Model Registry destination. | All model types
VERTEX_AI_MODEL_ID | The Vertex AI model ID to register the model with. | All model types
VERTEX_AI_MODEL_VERSION_ALIASES | The Vertex AI model alias to register the model with. | All model types
INPUT_LABEL_COLS | The label column names in the training data. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, AutoML
MAX_ITERATIONS | The maximum number of training iterations or steps. | Linear & logistic regression, Boosted trees, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
EARLY_STOP | Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
MIN_REL_PROGRESS | The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
DATA_SPLIT_METHOD | The method used to split input data into training and evaluation sets when not running hyperparameter tuning, or into training, evaluation, and test sets when running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization
DATA_SPLIT_EVAL_FRACTION | Specifies the fraction of the data used for evaluation. Accurate to two decimal places. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization
DATA_SPLIT_TEST_FRACTION | Specifies the fraction of the data used for testing when you are running hyperparameter tuning. Accurate to two decimal places. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization
DATA_SPLIT_COL | Identifies the column used to split the data. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization
OPTIMIZE_STRATEGY | The strategy used to train linear regression models. | Linear regression
L1_REG | The amount of L1 regularization applied. | Linear & logistic regression, Boosted trees, Random forest
L2_REG | The amount of L2 regularization applied. | Linear & logistic regression, Boosted trees, Random forest, Matrix factorization, ARIMA_PLUS_XREG
LEARN_RATE_STRATEGY | The strategy for specifying the learning rate during training. | Linear & logistic regression
LEARN_RATE | The learning rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. | Linear & logistic regression
LS_INIT_LEARN_RATE | Sets the initial learning rate that LEARN_RATE_STRATEGY=LINE_SEARCH uses. | Linear & logistic regression
WARM_START | Retrain a model with new training data, new model options, or both. | Linear & logistic regression, DNN, Wide & Deep, K-means, Autoencoder
AUTO_CLASS_WEIGHTS | Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class. | Logistic regression, Boosted tree classifier, Random forest classifier, DNN classifier, Wide & Deep classifier
CLASS_WEIGHTS | The weights to use for each class label. This option can't be specified if AUTO_CLASS_WEIGHTS is specified. Takes an ARRAY of STRUCTs; each STRUCT is a (STRING, FLOAT64) pair representing a class label and its weight. A weight must be present for every class label, and the weights aren't required to add up to one. For example: CLASS_WEIGHTS = [STRUCT('example_label', .2)]. | Logistic regression, Boosted tree classifier, Random forest classifier, DNN classifier, Wide & Deep classifier
INSTANCE_WEIGHT_COL | Identifies the column used to specify the weights for each data point in the training dataset. | Boosted trees, Random forest
NUM_CLUSTERS | The number of clusters to identify in the input data. | K-means
KMEANS_INIT_METHOD | The method of initializing the clusters. | K-means
KMEANS_INIT_COL | Identifies the column used to initialize the centroids. | K-means
DISTANCE_TYPE | The type of metric used to compute the distance between two points. | K-means
STANDARDIZE_FEATURES | Whether to standardize numerical features. | K-means
BUDGET_HOURS | Sets the training budget hours. | AutoML
OPTIMIZATION_OBJECTIVE | Sets the optimization objective function to use for AutoML. | AutoML
MODEL_PATH | Specifies the location of the model to import. | Imported TensorFlow, TensorFlow Lite, ONNX, and XGBoost models
FEEDBACK_TYPE | Specifies the feedback type for matrix factorization models, which changes the algorithm that is used during training. | Matrix factorization
NUM_FACTORS | Specifies the number of latent factors. | Matrix factorization
USER_COL | The user column name. | Matrix factorization
ITEM_COL | The item column name. | Matrix factorization
RATING_COL | The rating column name. | Matrix factorization
WALS_ALPHA | A hyperparameter for matrix factorization models with IMPLICIT feedback. | Matrix factorization
BOOSTER_TYPE | For boosted tree models, the booster type to use. Defaults to GBTREE. | Boosted trees
NUM_PARALLEL_TREE | The number of parallel trees constructed during each iteration. | Boosted trees, Random forest
DART_NORMALIZE_TYPE | The type of normalization algorithm for the DART booster. | Boosted trees
TREE_METHOD | The type of tree construction algorithm. | Boosted trees, Random forest
MIN_TREE_CHILD_WEIGHT | The minimum sum of instance weight needed in a child for further partitioning. | Boosted trees, Random forest
COLSAMPLE_BYTREE | The subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. | Boosted trees, Random forest
COLSAMPLE_BYLEVEL | The subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. | Boosted trees, Random forest
COLSAMPLE_BYNODE | The subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. | Boosted trees, Random forest
MIN_SPLIT_LOSS | The minimum loss reduction required to make a further partition on a leaf node of the tree. | Boosted trees, Random forest
MAX_TREE_DEPTH | The maximum depth of a tree. | Boosted trees, Random forest
SUBSAMPLE | The subsample ratio of the training instances. | Boosted trees, Random forest
ACTIVATION_FN | Specifies the activation function of the neural network. | DNN, Wide & Deep, Autoencoder
BATCH_SIZE | Specifies the mini-batch size of samples that are fed to the neural network. | DNN, Wide & Deep, Autoencoder
DROPOUT | Specifies the dropout rate of units in the neural network. | DNN, Wide & Deep, Autoencoder
HIDDEN_UNITS | Specifies the hidden layers of the neural network. | DNN, Wide & Deep, Autoencoder
OPTIMIZER | Specifies the optimizer for training the model. | DNN, Wide & Deep, Autoencoder
TIME_SERIES_TIMESTAMP_COL | The timestamp column name for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG
TIME_SERIES_DATA_COL | The data column name for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG
TIME_SERIES_ID_COL | The ID column names for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG
HORIZON | The number of time points to forecast. When forecasting multiple time series at once, this parameter applies to each time series. | ARIMA_PLUS, ARIMA_PLUS_XREG
AUTO_ARIMA | Whether the training process should use auto.ARIMA. | ARIMA_PLUS, ARIMA_PLUS_XREG
AUTO_ARIMA_MAX_ORDER | The maximum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. | ARIMA_PLUS, ARIMA_PLUS_XREG
AUTO_ARIMA_MIN_ORDER | The minimum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. | ARIMA_PLUS, ARIMA_PLUS_XREG
NON_SEASONAL_ORDER | The tuple of non-seasonal p, d, and q for the ARIMA_PLUS model. | ARIMA_PLUS, ARIMA_PLUS_XREG
DATA_FREQUENCY | The data frequency of the input time series. | ARIMA_PLUS, ARIMA_PLUS_XREG
FORECAST_LIMIT_LOWER_BOUND | The lower bound of the time series forecasting values. | ARIMA_PLUS
FORECAST_LIMIT_UPPER_BOUND | The upper bound of the time series forecasting values. | ARIMA_PLUS
INCLUDE_DRIFT | Whether the ARIMA_PLUS model should include a linear drift term. | ARIMA_PLUS, ARIMA_PLUS_XREG
HOLIDAY_REGION | The geographical region on which the holiday effect applied in modeling is based. | ARIMA_PLUS, ARIMA_PLUS_XREG
CLEAN_SPIKES_AND_DIPS | Whether spikes and dips should be cleaned. | ARIMA_PLUS, ARIMA_PLUS_XREG
ADJUST_STEP_CHANGES | Whether step changes should be adjusted. | ARIMA_PLUS, ARIMA_PLUS_XREG
DECOMPOSE_TIME_SERIES | Whether the separate components of both the history and forecast parts of the time series (such as seasonal components) should be saved. | ARIMA_PLUS
HIERARCHICAL_TIME_SERIES_COLS | The column names used to generate hierarchical time series forecasts. The column order represents the hierarchy structure. | ARIMA_PLUS
ENABLE_GLOBAL_EXPLAIN | Specifies whether to compute global explanations using explainable AI to evaluate global feature importance to the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep
APPROX_GLOBAL_FEATURE_CONTRIB | Specifies whether to use fast approximation for feature contribution computation. | Boosted trees, Random forest
INTEGRATED_GRADIENTS_NUM_STEPS | Specifies the number of steps to sample between the example being explained and its baseline for approximating the integral in integrated gradients attribution methods. | DNN, Wide & Deep
CALCULATE_P_VALUES | Specifies whether to compute p-values for the model during training. | Linear & logistic regression
FIT_INTERCEPT | Specifies whether to fit an intercept for the model during training. | Linear & logistic regression
CATEGORY_ENCODING_METHOD | Specifies the default encoding method for categorical features. | Linear & logistic regression, Boosted trees
ENDPOINT | Specifies the Vertex AI endpoint to use for a remote model. This can be the name of a Google model in Vertex AI or the HTTPS endpoint of a model deployed to Vertex AI. | Remote models over Google models in Vertex AI, Remote models over hosted models in Vertex AI, Remote models over open models
HUGGING_FACE_MODEL_ID | Specifies the model ID for a supported Hugging Face model. | Remote models over open models
MODEL_GARDEN_MODEL_NAME | Specifies the model ID and model version of a supported Vertex AI Model Garden model. | Remote models over open models
HUGGING_FACE_TOKEN | Specifies the Hugging Face user access token to use. | Remote models over open models
MACHINE_TYPE | Specifies the machine type to use when deploying the model to Vertex AI. | Remote models over open models
MIN_REPLICA_COUNT | Specifies the minimum number of machine replicas used when deploying the model to Vertex AI. | Remote models over open models
MAX_REPLICA_COUNT | Specifies the maximum number of machine replicas used when deploying the model to Vertex AI. | Remote models over open models
RESERVATION_AFFINITY_TYPE | Determines whether the deployed model uses Compute Engine reservations to provide assured virtual machine (VM) availability when serving predictions, and specifies whether the model uses VMs from all available reservations or just one specific reservation. | Remote models over open models
RESERVATION_AFFINITY_KEY | The key for a Compute Engine reservation. | Remote models over open models
RESERVATION_AFFINITY_VALUES | Specifies the full resource name of the Compute Engine reservation. | Remote models over open models
ENDPOINT_IDLE_TTL | Specifies the duration of inactivity after which a BigQuery-managed Vertex AI model is automatically undeployed from its Vertex AI endpoint. | Remote models over open models
REMOTE_SERVICE_TYPE | Specifies the Cloud AI service to use for a remote model. | Remote models over Cloud AI services
XGBOOST_VERSION | Specifies the XGBoost version for model training. | Boosted trees, Random forest
TF_VERSION | Specifies the TensorFlow (TF) version for model training. | DNN, Wide & Deep, Autoencoder
NUM_TRIALS | Specifies the maximum number of submodels to train when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
MAX_PARALLEL_TRIALS | Specifies the maximum number of trials to run at the same time when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
HPARAM_TUNING_ALGORITHM | Specifies the algorithm used to tune the hyperparameters when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
HPARAM_TUNING_OBJECTIVES | Specifies the hyperparameter tuning objective for the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder
NUM_PRINCIPAL_COMPONENTS | The number of principal components to keep. | PCA
PCA_EXPLAINED_VARIANCE_RATIO | The ratio for the explained variance. | PCA
SCALE_FEATURES | Determines whether to scale the numerical features to unit variance. | PCA
PCA_SOLVER | The solver to use to calculate the principal components. | PCA
TIME_SERIES_LENGTH_FRACTION | The fraction of the interpolated length of the time series that's used to model the time series trend component. | ARIMA_PLUS, ARIMA_PLUS_XREG
MIN_TIME_SERIES_LENGTH | The minimum number of time points that are used in modeling the trend component of the time series. | ARIMA_PLUS, ARIMA_PLUS_XREG
MAX_TIME_SERIES_LENGTH | The maximum number of time points that are used in modeling the trend component of the time series. | ARIMA_PLUS, ARIMA_PLUS_XREG
TREND_SMOOTHING_WINDOW_SIZE | The smoothing window size for the trend component. | ARIMA_PLUS, ARIMA_PLUS_XREG
SEASONALITIES | The seasonality of the time series data: the presence of variations that occur at regular intervals, such as weekly, monthly, or quarterly. | ARIMA_PLUS
PROMPT_COL | The name of the prompt column in the training data table to use when performing supervised tuning. | Remote models over Google models in Vertex AI
LEARNING_RATE_MULTIPLIER | A multiplier to apply to the recommended learning rate when performing supervised tuning. | Remote models over Google models in Vertex AI
EVALUATION_TASK | When performing supervised tuning, the type of task that you want to tune the model to perform. | Remote models over Google models in Vertex AI
DOCUMENT_PROCESSOR | Identifies the document processor to use when the REMOTE_SERVICE_TYPE option value is CLOUD_AI_DOCUMENT_V1. | Remote models over Cloud AI services
SPEECH_RECOGNIZER | Identifies the speech recognizer to use when the REMOTE_SERVICE_TYPE option value is CLOUD_AI_SPEECH_TO_TEXT_V2. | Remote models over Cloud AI services
KMS_KEY_NAME | Specifies the Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, AutoML, K-means, PCA, Autoencoder, Matrix factorization, ARIMA_PLUS, ARIMA_PLUS_XREG, ONNX, TensorFlow, TensorFlow Lite, XGBoost
CONTRIBUTION_METRIC | The expression to use when performing contribution analysis. | Contribution analysis
DIMENSION_ID_COLS | The names of the columns to use as dimensions when summarizing the contribution analysis metric. | Contribution analysis
IS_TEST_COL | The name of the column to use to determine whether a given row is test data or control data. | Contribution analysis
MIN_APRIORI_SUPPORT | The minimum apriori support threshold for including segments in the model output. | Contribution analysis
TOP_K_INSIGHTS_BY_APRIORI_SUPPORT | The number of top insights by apriori support to include in the model output. | Contribution analysis
PRUNING_METHOD | The pruning method to use for the contribution analysis model. | Contribution analysis
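
To illustrate how several of these options combine in practice, the following sketch (the dataset, table, and column names are illustrative) trains a boosted tree classifier with hyperparameter tuning enabled:

CREATE OR REPLACE MODEL `myproject.mydataset.tuned_classifier`
  OPTIONS (
    MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
    INPUT_LABEL_COLS = ['label_col'],
    NUM_TRIALS = 20,                           -- train up to 20 submodels
    MAX_PARALLEL_TRIALS = 4,                   -- run 4 trials at a time
    HPARAM_TUNING_ALGORITHM = 'VIZIER_DEFAULT',
    HPARAM_TUNING_OBJECTIVES = ['ROC_AUC']
  )
AS SELECT * FROM `myproject.mydataset.training_table`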

AS

All model types support the following AS clause syntax for specifying the training data:

AS query_statement

For time series forecasting models that have a DATA_FREQUENCY value of either DAILY or AUTO_FREQUENCY, you can optionally use the following AS clause syntax to perform custom holiday modeling in addition to specifying the training data:

AS (
  training_data AS (query_statement),
  custom_holiday AS (holiday_statement)
)

query_statement

The query_statement argument specifies the query that is used to generate the training data. For information about the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.

holiday_statement

The holiday_statement argument specifies the query that provides custom holiday modeling information for time series forecast models. This query must return 50,000 or fewer rows and must contain the following columns:

  • region: Required. A STRING value that identifies the region to target for holiday modeling. Use one of the following options:

    • An upper-case holiday region code. Use this option to overwrite or supplement the holidays for the specified region. You can see the holidays for a region by running SELECT * FROM `bigquery-public-data.ml_datasets.holidays_and_events_for_forecasting` WHERE region = region.
    • An arbitrary string. Use this option to specify a custom region that you want to model holidays for. For example, you could specify London if you are only modeling holidays for that city.

    Be sure not to use an existing holiday region code when you are trying to model a custom region. For example, if you want to model a holiday in California and specify CA as the region value, the service recognizes that as the holiday region code for Canada and targets that region. Because the argument is case-sensitive, you could instead specify ca, California, or some other value that isn't a holiday region code.

  • holiday_name: Required. A STRING value that identifies the holiday to target for holiday modeling. Use one of the following options:

    • The holiday name as it is represented in the bigquery-public-data.ml_datasets.holidays_and_events_for_forecasting public table, including case. Use this option to overwrite or supplement the specified holiday.
    • A string that represents a custom holiday. The string must be a valid column name so that it can be used in ML.EXPLAIN_FORECAST output. For example, it can't contain spaces. For more information about column naming, see Column names.
  • primary_date: Required. A DATE value that specifies the date the holiday falls on.

  • preholiday_days: Optional. An INT64 value that specifies the start of the holiday window around the holiday that is taken into account when modeling. Must be greater than or equal to 1. Defaults to 1.

  • postholiday_days: Optional. An INT64 value that specifies the end of the holiday window around the holiday that is taken into account when modeling. Must be greater than or equal to 1. Defaults to 1.

The preholiday_days and postholiday_days arguments together describe the holiday window around the holiday that is taken into account when modeling. The holiday window is defined as [primary_date - preholiday_days, primary_date + postholiday_days] and is inclusive of the pre- and post-holiday days. For example, with primary_date = '2023-10-31', preholiday_days = 5, and postholiday_days = 1, the holiday window runs from 2023-10-26 through 2023-11-01. The value for each holiday window must be less than or equal to 30 and must be the same across the given holiday. For example, if you are modeling Arbor Day for several different years, you must specify the same holiday window for all of those years.

To achieve the best holiday modeling result, provide as much historical and forecast information about the occurrences of each included holiday as possible. For example, if you have time series data from 2018 to 2022 and would like to forecast for 2023, you get the best result by providing the custom holiday information for all of those years, similar to the following:

CREATE OR REPLACE MODEL `mydataset.arima_model`
  OPTIONS (
    model_type = 'ARIMA_PLUS',
    holiday_region = 'US',
    ...)
AS (
  training_data AS (SELECT * FROM `mydataset.timeseries_data`),
  custom_holiday AS (
    SELECT
      'US' AS region,
      'Halloween' AS holiday_name,
      primary_date,
      5 AS preholiday_days,
      1 AS postholiday_days
    FROM
      UNNEST([DATE('2018-10-31'),
              DATE('2019-10-31'),
              DATE('2020-10-31'),
              DATE('2021-10-31'),
              DATE('2022-10-31'),
              DATE('2023-10-31')]) AS primary_date
  )
)

Supported inputs

The CREATE MODEL statement supports the following data types for input label columns, data split columns, and input feature columns.

Supported input feature types

For the input feature types that BigQuery ML supports, see Supported input feature types.

Supported data types for input label columns

BigQuery ML supports different GoogleSQL data types depending on the model type. Supported data types for input_label_cols include:

Model type | Supported label types
Regression models | INT64, NUMERIC, BIGNUMERIC, FLOAT64
Classification models | Any groupable data type
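
For example, because any groupable data type works as a classification label, a sketch like the following (the table and column names are illustrative) can train directly on a STRING label:

CREATE OR REPLACE MODEL `myproject.mydataset.churn_model`
  OPTIONS (MODEL_TYPE = 'LOGISTIC_REG',
           INPUT_LABEL_COLS = ['churn_status'])  -- a STRING column such as 'churned'/'retained'
AS SELECT * FROM `myproject.mydataset.customers`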

Supported data types for data split columns

BigQuery ML supports different GoogleSQL data types depending on the data split method. Supported data types for data_split_col include:

Data split method | Supported column types
CUSTOM | BOOL
SEQ | INT64, NUMERIC, BIGNUMERIC, FLOAT64, TIMESTAMP
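
For example, a CUSTOM split requires a BOOL column; the following sketch (the table and column names are illustrative) derives one deterministically so that roughly 10% of rows are held out for evaluation:

CREATE OR REPLACE MODEL `myproject.mydataset.custom_split_model`
  OPTIONS (MODEL_TYPE = 'LINEAR_REG',
           INPUT_LABEL_COLS = ['label_col'],
           DATA_SPLIT_METHOD = 'CUSTOM',
           DATA_SPLIT_COL = 'is_eval')  -- BOOL column that marks the evaluation rows
AS SELECT
     *,
     MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) = 0 AS is_eval
   FROM `myproject.mydataset.training_table`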

Limitations

CREATE MODEL statements must comply with the following rules:

  • Only one CREATE statement is allowed.
  • When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality depends on the cardinality and length of the string variables.
  • The label column can't contain NULL values. If the label column contains NULL values, then the query fails.
  • The CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.
  • Query statements used in the CREATE MODEL statement can't contain EXTERNAL_QUERY. If you want to use EXTERNAL_QUERY, materialize the query result and then use the CREATE MODEL statement with the newly created table, as in the sketch following this list.
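
For example, a minimal sketch of that workaround (the connection ID, external query, and table names are illustrative) runs as two separate statements:

-- Statement 1: materialize the federated query result into a BigQuery table.
CREATE OR REPLACE TABLE `myproject.mydataset.materialized_training_data` AS
SELECT * FROM EXTERNAL_QUERY(
  'myproject.us.my_cloudsql_connection',
  'SELECT * FROM training_table;');

-- Statement 2: train the model from the materialized table.
CREATE OR REPLACE MODEL `myproject.mydataset.mymodel`
  OPTIONS (MODEL_TYPE = 'LINEAR_REG', INPUT_LABEL_COLS = ['label_col'])
AS SELECT * FROM `myproject.mydataset.materialized_training_data`;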

Quotas

Use the following information to troubleshoot quota errors for CREATE MODEL statements.

Maximum number of CREATE MODEL statements

This error means that you have exceeded the quota for CREATE MODEL statements.

Error message

Quota exceeded: Your project exceeded quota for CREATE MODEL queries per project.

Resolution

If you exceed the quota for CREATE MODEL statements, send an email to bqml-feedback@google.com and request a quota increase.
