The CREATE MODEL statement for boosted trees models using XGBoost
This document describes the CREATE MODEL statement for creating boosted trees models in BigQuery by using SQL. Alternatively, you can use the Google Cloud console user interface to create a model by using a UI (Preview) instead of constructing the SQL statement yourself. Boosted trees models are trained using the XGBoost library.
You can use boosted trees regressor models with the ML.PREDICT function to perform regression, and you can use boosted trees classifier models with the ML.PREDICT function to perform classification. You can use both types of boosted trees models with the ML.PREDICT function to perform anomaly detection.
For more information about supported SQL statements and functions for this model, see End-to-end user journeys for ML models.
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' }
    [, APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE } ]
    [, CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'TARGET_ENCODING' | 'LABEL_ENCODING' } ]
    [, BOOSTER_TYPE = { {'GBTREE' | 'DART'} | HPARAM_CANDIDATES([candidates]) } ]
    [, DART_NORMALIZE_TYPE = { {'TREE' | 'FOREST'} | HPARAM_CANDIDATES([candidates]) } ]
    [, NUM_PARALLEL_TREE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, MAX_TREE_DEPTH = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, TREE_METHOD = { {'AUTO' | 'EXACT' | 'APPROX' | 'HIST'} | HPARAM_CANDIDATES([candidates]) } ]
    [, MIN_TREE_CHILD_WEIGHT = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYTREE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYLEVEL = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYNODE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, MIN_SPLIT_LOSS = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, SUBSAMPLE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, INSTANCE_WEIGHT_COL = string_value ]
    [, XGBOOST_VERSION = { '0.9' | '1.1' } ]
    [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
    [, CLASS_WEIGHTS = struct_array ]
    [, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } ]
    [, EARLY_STOP = { TRUE | FALSE } ]
    [, MIN_REL_PROGRESS = float64_value ]
    [, INPUT_LABEL_COLS = string_array ]
    [, MAX_ITERATIONS = int64_value ]
    [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
    [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
    [, DATA_SPLIT_TEST_FRACTION = float64_value ]
    [, DATA_SPLIT_COL = string_value ]
    [, NUM_TRIALS = int64_value ]
    [, MAX_PARALLEL_TRIALS = int64_value ]
    [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]
    [, HPARAM_TUNING_OBJECTIVES = { 'ROC_AUC' | 'R2_SCORE' | ... } ]
    [, KMS_KEY_NAME = string_value ]

CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in thespecified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name inthe specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name is case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' }

Description
Specifies the model type. This option is required.
APPROX_GLOBAL_FEATURE_CONTRIB
Syntax
APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE }

Description
Enables fast approximation for feature contributions. This capability is provided by the XGBoost library; BigQuery ML only passes this option through to it. For more information, see Package 'xgboost' and search for approxcontrib.
To use the fast approximation for feature contribution computations, you must set both ENABLE_GLOBAL_EXPLAIN and APPROX_GLOBAL_FEATURE_CONTRIB to TRUE.
Arguments
A BOOL value. The default value is TRUE when ENABLE_GLOBAL_EXPLAIN is TRUE and NUM_PARALLEL_TREE >= 10; otherwise it is FALSE.
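To use both options together, the following sketch (the table, column, and model names are placeholders) trains a model with fast approximate global explanations enabled and then retrieves them:

```sql
-- Sketch: assumes a table `mydataset.mytable` with a `label` column.
CREATE OR REPLACE MODEL `mydataset.explained_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
  ENABLE_GLOBAL_EXPLAIN = TRUE,          -- required for ML.GLOBAL_EXPLAIN
  APPROX_GLOBAL_FEATURE_CONTRIB = TRUE,  -- fast approximate contributions
  INPUT_LABEL_COLS = ['label'])
AS SELECT * FROM `mydataset.mytable`;

-- After training, retrieve the global feature importances:
SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `mydataset.explained_model`);
```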
CATEGORY_ENCODING_METHOD
Syntax
CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'TARGET_ENCODING' | 'LABEL_ENCODING' }

Description
Specifies which encoding method to use on non-numeric features. For more information about supported encoding methods, see BigQuery ML auto preprocessing.
Arguments
This option accepts the following values:
- LABEL_ENCODING. This is the default.
- TARGET_ENCODING
- ONE_HOT_ENCODING
BOOSTER_TYPE
Syntax
BOOSTER_TYPE = { { 'GBTREE' | 'DART'} | HPARAM_CANDIDATES([candidates]) }
Description
Specifies the booster type to use.
Arguments
This option accepts the following values:
- GBTREE: tree booster. This is the default.
- DART: dart booster.

If you are running hyperparameter tuning, you can provide more than one value for this option by using HPARAM_CANDIDATES and specifying an array. For example, BOOSTER_TYPE = HPARAM_CANDIDATES(['GBTREE', 'DART']).
DART_NORMALIZE_TYPE
Syntax
DART_NORMALIZE_TYPE = { { 'TREE' | 'FOREST'} | HPARAM_CANDIDATES([candidates]) }
Description
The type of normalization algorithm to use if you are using the DART booster.
Arguments
This option accepts the following values:
- TREE: New trees have the same weight as each of the dropped trees. This is the default value.
- FOREST: New trees have the same weight as the sum of the dropped trees (forest).

If you are running hyperparameter tuning, you can provide more than one value for this option by using HPARAM_CANDIDATES and specifying an array. For example, DART_NORMALIZE_TYPE = HPARAM_CANDIDATES(['TREE', 'FOREST']).
NUM_PARALLEL_TREE
Syntax
NUM_PARALLEL_TREE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The number of parallel trees constructed during each iteration. To train a boosted random forest model, set this value to larger than 1.
Arguments
If you aren't running hyperparameter tuning, then you can specify an INT64 value. The default value is 1.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two INT64 values that define the range of the hyperparameter. For example, NUM_PARALLEL_TREE = HPARAM_RANGE(1, 5).
- The HPARAM_CANDIDATES keyword and an array of INT64 values that provide discrete values to use for the hyperparameter. For example, NUM_PARALLEL_TREE = HPARAM_CANDIDATES([0, 1, 3, 5]).

When running hyperparameter tuning, the valid range is (1, ∞], there is no default range, and the scale type is LINEAR.
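As a concrete illustration of the boosted random forest case mentioned above, the following sketch (all table, column, and model names are placeholders) sets NUM_PARALLEL_TREE above 1 and limits boosting to a single iteration, so the model behaves like a random forest:

```sql
CREATE OR REPLACE MODEL `mydataset.rf_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  NUM_PARALLEL_TREE = 50,  -- many trees built per iteration
  MAX_ITERATIONS = 1,      -- a single boosting round
  SUBSAMPLE = 0.8,         -- row sampling, typical for random forests
  COLSAMPLE_BYNODE = 0.7,  -- column sampling per split
  INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```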
MAX_TREE_DEPTH
Syntax
MAX_TREE_DEPTH = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The maximum depth of a tree.
Arguments
If you aren't running hyperparameter tuning, then you can specify an INT64 value. The default value is 6.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two INT64 values that define the range of the hyperparameter. For example, MAX_TREE_DEPTH = HPARAM_RANGE(0, 4).
- The HPARAM_CANDIDATES keyword and an array of INT64 values that provide discrete values to use for the hyperparameter. For example, MAX_TREE_DEPTH = HPARAM_CANDIDATES([1, 5, 10, 15]).

When running hyperparameter tuning, the valid range is (1, 20], the default range is (1, 10], and the scale type is LINEAR.
DROPOUT
Syntax
DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
Specifies the dropout rate, which is the fraction of previous trees to drop during dropout.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 0.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, DROPOUT = HPARAM_RANGE(0, 0.6).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, DROPOUT = HPARAM_CANDIDATES([0, 0.1, 0.2, 0.6]).

When running hyperparameter tuning, the valid range is (0, 1.0], there is no default range, and the scale type is LINEAR.
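Because dropout is part of the DART algorithm, DROPOUT is typically paired with BOOSTER_TYPE = 'DART'. A sketch (placeholder names) combining the related options:

```sql
CREATE OR REPLACE MODEL `mydataset.dart_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
  BOOSTER_TYPE = 'DART',
  DART_NORMALIZE_TYPE = 'FOREST',
  DROPOUT = 0.2,  -- drop 20% of previous trees during dropout
  INPUT_LABEL_COLS = ['label'])
AS SELECT * FROM `mydataset.mytable`;
```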
L1_REG
Syntax
L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description
The amount of L1 regularization applied.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 0.
If you are running hyperparameter tuning, then you can use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range to use for the hyperparameter. For example, L1_REG = HPARAM_RANGE(0, 5.0).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, L1_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0]).

When running hyperparameter tuning, the valid range is (0, ∞), the default range is (0, 10.0], and the scale type is LOG.
L2_REG
Syntax
L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description
The amount of L2 regularization applied.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 1.0.
If you are running hyperparameter tuning, then you can use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range to use for the hyperparameter. For example, L2_REG = HPARAM_RANGE(1.5, 5.0).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, L2_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0]).

When running hyperparameter tuning, the valid range is (0, ∞), the default range is (0, 10.0], and the scale type is LOG.
LEARN_RATE
Syntax
LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description
LEARN_RATE is the step size shrinkage used in updates to prevent overfitting. After each boosting step, LEARN_RATE shrinks the feature weights to make the boosting process more conservative.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 0.3.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, LEARN_RATE = HPARAM_RANGE(0, 0.5).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, LEARN_RATE = HPARAM_CANDIDATES([0, 0.1, 0.3, 0.5]).

When running hyperparameter tuning, the valid range is (0, ∞], the default range is (0, 1.0], and the scale type is LINEAR.
TREE_METHOD
Syntax
TREE_METHOD = { { 'AUTO' | 'EXACT' | 'APPROX' | 'HIST'} | HPARAM_CANDIDATES([candidates]) }
Description
The type of tree construction algorithm.
HIST is recommended for large datasets in order to increase training speed and reduce resource consumption. For more information, see tree booster.
Arguments
This option accepts the following values:
- AUTO: Faster histogram optimized approximate greedy algorithm. This is the default.
- EXACT: Exact greedy algorithm. Enumerates all split candidates.
- APPROX: Approximate greedy algorithm using quantile sketch and gradient histogram.
- HIST: Same as the AUTO tree method.

If you are running hyperparameter tuning, you can provide more than one value for this option by using HPARAM_CANDIDATES and specifying an array. For example, TREE_METHOD = HPARAM_CANDIDATES(['APPROX', 'HIST']).
MIN_TREE_CHILD_WEIGHT
Syntax
MIN_TREE_CHILD_WEIGHT = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The minimum sum of instance weight needed in a child for further partitioning. If the tree partition step results in a leaf node whose sum of instance weight is less than MIN_TREE_CHILD_WEIGHT, then the building process stops partitioning. The larger the MIN_TREE_CHILD_WEIGHT value is, the more conservative the algorithm is.
Arguments
If you aren't running hyperparameter tuning, then you can specify an INT64 value greater than or equal to 0. The default value is 1.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two INT64 values that define the range of the hyperparameter. For example, MIN_TREE_CHILD_WEIGHT = HPARAM_RANGE(0, 5).
- The HPARAM_CANDIDATES keyword and an array of INT64 values that provide discrete values to use for the hyperparameter. For example, MIN_TREE_CHILD_WEIGHT = HPARAM_CANDIDATES([0, 1, 3, 5]).

When running hyperparameter tuning, the valid range is [0, ∞), there is no default range, and the scale type is LINEAR.
COLSAMPLE_BYTREE
Syntax
COLSAMPLE_BYTREE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The subsample ratio of columns when constructing each tree. Subsampling occursonce for every tree constructed.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 1.0.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, COLSAMPLE_BYTREE = HPARAM_RANGE(0, 0.3).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, COLSAMPLE_BYTREE = HPARAM_CANDIDATES([0, 0.1, 0.3, 0.5]).

When running hyperparameter tuning, the valid range is [0, 1.0], there is no default range, and the scale type is LINEAR.
COLSAMPLE_BYLEVEL
Syntax
COLSAMPLE_BYLEVEL = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The subsample ratio of columns for each level. Subsampling occurs once for everynew depth level reached in a tree. Columns are subsampled from the set ofcolumns chosen for the current tree.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 1.0.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, COLSAMPLE_BYLEVEL = HPARAM_RANGE(0, 0.3).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, COLSAMPLE_BYLEVEL = HPARAM_CANDIDATES([0, 0.1, 0.3, 0.5]).

When running hyperparameter tuning, the valid range is [0, 1.0], there is no default range, and the scale type is LINEAR.
COLSAMPLE_BYNODE
Syntax
COLSAMPLE_BYNODE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The subsample ratio of columns for each node (split). Subsampling occurs oncefor every time a new split is evaluated. Columns are subsampled from the set ofcolumns chosen for the current level.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 1.0.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, COLSAMPLE_BYNODE = HPARAM_RANGE(0, 0.3).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, COLSAMPLE_BYNODE = HPARAM_CANDIDATES([0, 0.1, 0.3, 0.5]).

When running hyperparameter tuning, the valid range is [0, 1.0], there is no default range, and the scale type is LINEAR.
MIN_SPLIT_LOSS
Syntax
MIN_SPLIT_LOSS = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the MIN_SPLIT_LOSS value is, the more conservative the algorithm is.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 0.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, MIN_SPLIT_LOSS = HPARAM_RANGE(0, 5.5).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, MIN_SPLIT_LOSS = HPARAM_CANDIDATES([0, 0.5, 1.5, 2.5]).

When running hyperparameter tuning, the valid range is [0, ∞), there is no default range, and the scale type is LINEAR.
SUBSAMPLE
Syntax
SUBSAMPLE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The subsample ratio of the training instances. Setting this value to 0.5 means that training randomly samples half of the training data prior to growing trees, which prevents overfitting. Subsampling occurs once in every boosting iteration. This is independent of the training-test data split used in the training options. The test data is not used in any iteration irrespective of the SUBSAMPLE value; subsampling is only applied to the training data.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 1.0, which specifies that subsampling uses all of the training data in each iteration.
If you are running hyperparameter tuning, use one of the following options:
- The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, SUBSAMPLE = HPARAM_RANGE(0, 0.6).
- The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, SUBSAMPLE = HPARAM_CANDIDATES([0, 0.1, 0.2, 0.6]).

When running hyperparameter tuning, the valid range is (0, 1.0], the default range is (0, 1.0], and the scale type is LINEAR.
INSTANCE_WEIGHT_COL
Syntax
INSTANCE_WEIGHT_COL = string_value

Description
The column used to specify the weights for each data point in the training dataset. The column you specify must be a numerical column. You can't use this column as a feature or label, and it is excluded from features automatically. You can't specify this option if AUTO_CLASS_WEIGHTS is TRUE or if CLASS_WEIGHTS is set.
The INSTANCE_WEIGHT_COL option is only supported for non-array features.
Arguments
A STRING value.
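For example, assuming the training table carries a numeric column named row_weight (a hypothetical name), per-row weights could be applied like this:

```sql
-- Sketch: `row_weight` holds the weight for each training row and is
-- excluded from the features automatically.
CREATE OR REPLACE MODEL `mydataset.weighted_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
  INSTANCE_WEIGHT_COL = 'row_weight',
  INPUT_LABEL_COLS = ['label'])
AS SELECT * FROM `mydataset.mytable`;
```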
XGBOOST_VERSION
Syntax
XGBOOST_VERSION = { '0.9' | '1.1' }

Description
The XGBoost version for model training.
Arguments
This option accepts the following values:
- 0.9. This is the default.
- 1.1
AUTO_CLASS_WEIGHTS
Syntax
AUTO_CLASS_WEIGHTS = { TRUE | FALSE }

Description
Determines whether to balance class labels by using weights for each class in inverse proportion to the frequency of that class.
Only use this option with classifier models.
By default, the training data used to create the model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which you might not want.
To balance every class, set this option to TRUE. Balance is accomplished using the following formula:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)

Arguments
A BOOL value. The default value is FALSE.
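To preview the weights that the formula above implies for your data, you can compute them directly; this query is a sketch in which `mydataset.mytable` and `label` are placeholders:

```sql
SELECT
  label,
  -- total_input_rows / (input_rows_for_class_n * number_of_unique_classes)
  (SELECT COUNT(*) FROM `mydataset.mytable`)
    / (COUNT(*) * (SELECT COUNT(DISTINCT label) FROM `mydataset.mytable`))
    AS implied_weight
FROM `mydataset.mytable`
GROUP BY label;
```

For instance, with 900 rows of one class and 100 of another, the formula yields 1000 / (900 * 2) ≈ 0.56 for the majority class and 1000 / (100 * 2) = 5.0 for the minority class.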
CLASS_WEIGHTS
Syntax
CLASS_WEIGHTS = struct_array

Description
The weights to use for each class label. You can't specify this option if AUTO_CLASS_WEIGHTS is TRUE.
Arguments
An ARRAY of STRUCT values. Each STRUCT contains a STRING value that specifies the class label and a FLOAT64 value that specifies the weight for that class label. A weight must be present for every class label. The weights are not required to add up to 1.
A CLASS_WEIGHTS value might look like the following example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]

ENABLE_GLOBAL_EXPLAIN
Syntax
ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE }

Description
Determines whether to compute global explanations by using explainable AI to evaluate the importance of global features to the model.
Global explanations are computed when you create the model. This option must be TRUE if you want to use the ML.GLOBAL_EXPLAIN function to retrieve the global explanations after the model is created.
Arguments
A BOOL value. The default value is FALSE.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }

Description
Determines whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.
Arguments
A BOOL value. The default value is TRUE.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value

Description
The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to TRUE. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.
Arguments
A FLOAT64 value. The default value is 0.01.
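The two early-stopping options work together. A sketch (placeholder names) that lets training run for up to 100 rounds but stops once an iteration improves the loss by less than 0.5%:

```sql
CREATE OR REPLACE MODEL `mydataset.early_stop_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  MAX_ITERATIONS = 100,      -- upper bound on boosting rounds
  EARLY_STOP = TRUE,
  MIN_REL_PROGRESS = 0.005,  -- stop below 0.5% relative loss improvement
  INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```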
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array

Description
The name of the label column in the training data. The label column contains the expected machine learning result for the given record. For example, in a spam detection dataset, the label column value would probably be either spam or not spam. In a rainfall dataset, the label column value might be the amount of rain that fell during a certain period.
Arguments
A one-element ARRAY of string values. Defaults to label.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value

Description
The maximum number of rounds for boosting.
Arguments
An INT64 value. The default value is 20.
DATA_SPLIT_METHOD
Syntax
DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }

Description
The method used to split input data into training, evaluation, and, if you are running hyperparameter tuning, test data sets. Training data is used to train the model. Evaluation data is used to avoid overfitting by using early stopping. Test data is used to test the hyperparameter tuning trial and record its metrics in the model.
The percentage sizes of the data sets produced by the various arguments for this option are approximate. Larger input data sets come closer to the percentages described than smaller input data sets do.
You can see the model's data split information in the following ways:
- The data split method and percentage are shown in the Training Options section of the model's Details page on the BigQuery page of the Google Cloud console.
- Links to temporary tables that contain the split data are available in the Model Details section of the model's Details page on the BigQuery page of the Google Cloud console. You can also return this information from the DataSplitResult field in the BigQuery API. These tables are saved for 48 hours. If you need this information for more than 48 hours, then you should export this data or copy it to permanent tables.
Arguments
This option accepts the following values:
- AUTO_SPLIT: This is the default value. This option splits the data as follows:

  - If there are fewer than 500 rows in the input data, then all rows are used as training data.
  - If you aren't running hyperparameter tuning, then data is randomized and split as follows:
    - If there are between 500 and 50,000 rows in the input data, then 20% of the data is used as evaluation data and 80% is used as training data.
    - If there are more than 50,000 rows, then 10,000 rows are used as evaluation data and the remaining rows are used as training data.
  - If you are running hyperparameter tuning and there are more than 500 rows in the input data, then the data is randomized and split as follows:
    - 10% of the data is used as evaluation data
    - 10% is used as test data
    - 80% is used as training data

  For more information, see Data split.

- RANDOM: Data is randomized before being split into sets. You can use this option with the DATA_SPLIT_EVAL_FRACTION and DATA_SPLIT_TEST_FRACTION options to customize the data split. If you don't specify either of those options, data is split in the same way as for the AUTO_SPLIT option. A random split is deterministic: different training runs produce the same split results if the same underlying training data is used.

  Note: A random split is based on the FARM_FINGERPRINT of the data (including the column name and schema), so tables with the same content but different column names and schemas might get different splitting and different evaluation metrics.

- CUSTOM: Split data using the value in a specified column:

  - If you aren't running hyperparameter tuning, then you must provide the name of a column of type BOOL. Rows with a value of TRUE or NULL are used as evaluation data, and rows with a value of FALSE are used as training data.
  - If you are running hyperparameter tuning, then you must provide the name of a column of type STRING. Rows with a value of TRAIN are used as training data, rows with a value of EVAL are used as evaluation data, and rows with a value of TEST are used as test data.

  Use the DATA_SPLIT_COL option to identify the column that contains the data split information.

- SEQ: Split data sequentially by using the value in a specified column of one of the following types:

  - NUMERIC
  - BIGNUMERIC
  - STRING
  - TIMESTAMP

  The data is sorted smallest to largest based on the specified column.

  When you aren't running hyperparameter tuning, the first n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The remaining rows are used as training data.

  When you are running hyperparameter tuning, the first n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The next m rows are used as test data, where m is the value specified for DATA_SPLIT_TEST_FRACTION. The remaining rows are used as training data.

  All rows with split values smaller than the threshold are used as training data. The remaining rows, including NULLs, are used as evaluation data.

  Use the DATA_SPLIT_COL option to identify the column that contains the data split information.

- NO_SPLIT: No data split; all input data is used as training data.
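For example, a CUSTOM split without hyperparameter tuning needs a BOOL column. The sketch below (all names, including the id column, are placeholders) derives one deterministically with FARM_FINGERPRINT so that roughly 20% of rows land in the evaluation set:

```sql
CREATE OR REPLACE MODEL `mydataset.custom_split_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
  DATA_SPLIT_METHOD = 'CUSTOM',
  DATA_SPLIT_COL = 'is_eval',  -- TRUE rows -> evaluation, FALSE -> training
  INPUT_LABEL_COLS = ['label'])
AS
SELECT
  *,
  MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 2 AS is_eval
FROM `mydataset.mytable`;
```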
DATA_SPLIT_EVAL_FRACTION
Syntax
DATA_SPLIT_EVAL_FRACTION = float64_value

Description
The fraction of the data to use as evaluation data. Use when you are specifying RANDOM or SEQ as the value for the DATA_SPLIT_METHOD option.
If you are running hyperparameter tuning and you specify a value for this option, you must also specify a value for DATA_SPLIT_TEST_FRACTION. In this case, the training dataset is 1 - eval_fraction - test_fraction. For example, if you specify 0.20 for DATA_SPLIT_EVAL_FRACTION and 0.08 for DATA_SPLIT_TEST_FRACTION, your training dataset is 72% of the input data.
Arguments
A FLOAT64 value. The default is 0.2. The service maintains the accuracy of the input value to two decimal places.
DATA_SPLIT_TEST_FRACTION
Syntax
DATA_SPLIT_TEST_FRACTION = float64_value

Description
The fraction of the data to use as test data. Use this option when you are running hyperparameter tuning and specifying either RANDOM or SEQ as the value for the DATA_SPLIT_METHOD option.
If you specify a value for this option, you must also specify a value for DATA_SPLIT_EVAL_FRACTION. In this case, the training dataset is 1 - eval_fraction - test_fraction. For example, if you specify 0.20 for DATA_SPLIT_EVAL_FRACTION and 0.08 for DATA_SPLIT_TEST_FRACTION, your training dataset is 72% of the input data.
Arguments
A FLOAT64 value. The default is 0. The service maintains the accuracy of the input value to two decimal places.
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value

Description
The name of the column to use to sort input data into the training, evaluation, or test set. Use when you are specifying CUSTOM or SEQ as the value for the DATA_SPLIT_METHOD option:
- If you aren't running hyperparameter tuning and you are specifying SEQ as the value for DATA_SPLIT_METHOD, then the data is first sorted smallest to largest based on the specified column. The last n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The remaining rows are used as training data.
- If you aren't running hyperparameter tuning and you are specifying CUSTOM as the value for DATA_SPLIT_METHOD, then you must provide the name of a column of type BOOL. Rows with a value of TRUE or NULL are used as evaluation data, and rows with a value of FALSE are used as training data.
- If you are running hyperparameter tuning and you are specifying SEQ as the value for DATA_SPLIT_METHOD, then the data is first sorted smallest to largest based on the specified column. The last n rows are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The next m rows are used as test data, where m is the value specified for DATA_SPLIT_TEST_FRACTION. The remaining rows are used as training data.
- If you are running hyperparameter tuning and you are specifying CUSTOM as the value for DATA_SPLIT_METHOD, then you must provide the name of a column of type STRING. Rows with a value of TRAIN are used as training data, rows with a value of EVAL are used as evaluation data, and rows with a value of TEST are used as test data.

The column you specify for DATA_SPLIT_COL can't be used as a feature or label, and is excluded from features automatically.
Arguments
A STRING value.
NUM_TRIALS
Syntax
NUM_TRIALS = int64_value

Description
The maximum number of submodels to train. The tuning stops when NUM_TRIALS submodels are trained, or when the hyperparameter search space is exhausted. You must specify this option in order to use hyperparameter tuning.
Arguments
An INT64 value between 1 and 100, inclusive. As a guideline, use (number_of_hyperparameters * 10) trials to tune a model.

MAX_PARALLEL_TRIALS
Syntax
MAX_PARALLEL_TRIALS = int64_value

Description
The maximum number of trials to run at the same time. If you specify a value for this option, you must also specify a value for NUM_TRIALS.
Arguments
An INT64 value between 1 and 5, inclusive. The default value is 1.
While larger MAX_PARALLEL_TRIALS values can accelerate the hyperparameter tuning process, acceleration can undermine the final model quality when you specify VIZIER_DEFAULT as the HPARAM_TUNING_ALGORITHM value. This is because the parallel trials can't benefit from concurrent training results.

HPARAM_TUNING_ALGORITHM
Syntax
HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' }

Description
The algorithm used to tune the hyperparameters. If you specify a value for this option, you must also specify a value for NUM_TRIALS.
Arguments
Specify one of the following values:
- VIZIER_DEFAULT: Use the default algorithm in Vertex AI Vizier to tune hyperparameters. This algorithm is the most powerful of those offered. It performs a mixture of advanced search algorithms, including Bayesian optimization with Gaussian processes. It also uses transfer learning to take advantage of previously tuned models. This is the default, and also the recommended approach.
- RANDOM_SEARCH: Use random search to explore the search space.
- GRID_SEARCH: Use grid search to explore the search space. You can only use this algorithm when every hyperparameter's search space is discrete.
HPARAM_TUNING_OBJECTIVES
Syntax
For BOOSTED_TREE_CLASSIFIER models:

HPARAM_TUNING_OBJECTIVES = { 'PRECISION' | 'RECALL' | 'ACCURACY' | 'F1_SCORE' | 'LOG_LOSS' | 'ROC_AUC' }

For BOOSTED_TREE_REGRESSOR models:

HPARAM_TUNING_OBJECTIVES = { 'MEAN_ABSOLUTE_ERROR' | 'MEAN_SQUARED_ERROR' | 'MEAN_SQUARED_LOG_ERROR' | 'MEDIAN_ABSOLUTE_ERROR' | 'R2_SCORE' | 'EXPLAINED_VARIANCE' }

Description
The hyperparameter tuning objective for the model; only one objective is supported. If you specify a value for this option, you must also specify a value for NUM_TRIALS.
Arguments
The possible objectives are a subset of the model evaluation metrics for the model type. If you aren't running hyperparameter tuning, or if you are and you don't specify an objective, the default objective is used. For BOOSTED_TREE_CLASSIFIER models, the default is ROC_AUC. For BOOSTED_TREE_REGRESSOR models, the default is R2_SCORE.
KMS_KEY_NAME
Syntax
KMS_KEY_NAME = string_value

Description
The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.
Arguments
A STRING value containing the fully qualified name of the CMEK. For example:

'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'

Internal parameter defaults
BigQuery ML uses the following default values when building models:
- loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE
- batch_norm = False

Hyperparameter tuning
Boosted trees classification and regression models support hyperparameter tuning, which you can use to improve model performance for your data. To use hyperparameter tuning, set the NUM_TRIALS option to the number of trials that you want to run. BigQuery ML then trains the model the number of times that you specify, using different hyperparameter values, and returns the model that performs the best.
Hyperparameter tuning defaults to improving the key performance metric for the given model type. You can use the HPARAM_TUNING_OBJECTIVES option to tune for a different metric if you need to.
For more information about the training objectives and hyperparameters supported for boosted trees classification models, see BOOSTED_TREE_CLASSIFIER. For more information about the training objectives and hyperparameters supported for boosted trees regression models, see BOOSTED_TREE_REGRESSOR. To try a tutorial that walks you through hyperparameter tuning, see Improve model performance with hyperparameter tuning.
Locations
For information about supported locations, see Locations for non-remote models.
Limitations
CREATE MODEL statements must comply with the following rules:
- For BOOSTED_TREE_CLASSIFIER models, the label column can contain up to 1,000 unique values; that is, the number of classes must be less than or equal to 1,000. If you need to classify into more than 1,000 labels, contact bqml-feedback@google.com.
Example
The following example trains a boosted trees classifier model against 'mytable' with 'mylabel' as the label column.

CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  BOOSTER_TYPE = 'GBTREE',
  NUM_PARALLEL_TREE = 1,
  MAX_ITERATIONS = 50,
  TREE_METHOD = 'HIST',
  EARLY_STOP = FALSE,
  SUBSAMPLE = 0.85,
  INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `project_id.mydataset.mytable`;
Train a boosted trees regression model with hyperparameter tuning
The following example creates and trains a boosted trees regression model. It uses hyperparameter tuning to improve model performance.
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
  num_trials = 10,
  max_parallel_trials = 2,
  HPARAM_TUNING_OBJECTIVES = ['R2_SCORE'])
AS
SELECT
  column1,
  column2,
  column3,
  label
FROM `mydataset.mytable`;
Supported regions
Training boosted trees models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see BigQuery ML locations.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.