The CREATE MODEL statement for K-means models

This document describes theCREATE MODEL statement for creatingk-meansmodels in BigQuery by using SQL. Alternatively, you can use theGoogle Cloud console user interface tocreate a model by using a UI(Preview) instead of constructing the SQLstatement yourself.

You can use k-means models with theML.PREDICT functionto cluster data, and you can use k-means models with theML.DETECT_ANOMALIES functionto performanomaly detection.

For more information about supported SQL statements and functions for thismodel, seeEnd-to-end user journeys for ML models.

`CREATE MODEL` syntax

{CREATE MODEL |CREATE MODEL IF NOT EXISTS |CREATE OR REPLACE MODEL}model_nameOPTIONS(model_option_list)ASquery_statementmodel_option_list:MODEL_TYPE = { 'KMEANS' },    [,NUM_CLUSTERS = {int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]    [,KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]    [,KMEANS_INIT_COL =string_value ]    [,DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]    [,STANDARDIZE_FEATURES = { TRUE | FALSE } ]    [,MAX_ITERATIONS =int64_value ]    [,EARLY_STOP = { TRUE | FALSE } ]    [,MIN_REL_PROGRESS =float64_value ]    [,WARM_START = { TRUE | FALSE } ]    [,NUM_TRIALS =int64_value ]    [,MAX_PARALLEL_TRIALS =int64_value ]    [,HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]    [,HPARAM_TUNING_OBJECTIVES = 'DAVIES_BOULDIN_INDEX' ]    [,MODEL_REGISTRY = { 'VERTEX_AI' } ]    [,VERTEX_AI_MODEL_ID =string_value ]    [,VERTEX_AI_MODEL_VERSION_ALIASES =string_array ]    [,KMS_KEY_NAME =string_value ]

`CREATE MODEL`

Creates and trains a new model in the specified dataset. If the model nameexists,CREATE MODEL returns an error.

`CREATE MODEL IF NOT EXISTS`

Creates and trains a new model only if the model doesn't exist in thespecified dataset.

`CREATE OR REPLACE MODEL`

Creates and trains a model and replaces an existing model with the same name inthe specified dataset.

`model_name`

The name of the model you're creating or replacing. The modelname must be unique in the dataset: no other model or table can have the samename. The model name must follow the same naming rules as aBigQuery table. A model name can:

Contain up to 1,024 characters
Contain letters (upper or lower case), numbers, and underscores

model_name is case-sensitive.

If you don't have a default project configured, then you must prepend theproject ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

`MODEL_TYPE`

Syntax

MODEL_TYPE={'KMEANS'}

Description

Specify the model type. This option is required.

Arguments

SpecifyKMEANS to use k-means clustering for data segmentation; for example,identifying customer segments. K-means is an unsupervised learning technique, somodel training does not require labels or split data for training orevaluation.

`NUM_CLUSTERS`

Syntax

NUM_CLUSTERS = {int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

The number of clusters to identify in the input data.

Arguments

If you aren't running hyperparameter tuning, then you can specify anINT64value between2 and100. The default value islog₁₀(n), wheren is the number of training examples.

If you are running hyperparameter tuning, use one of the following options:

TheHPARAM_RANGE keyword and twoINT64 values that definethe range of the hyperparameter. For example,NUM_CLUSTERS = HPARAM_RANGE(2, 25).
TheHPARAM_CANDIDATES keyword and an array ofINT64 valuesthat provide discrete values to use for the hyperparameter. For example,NUM_CLUSTERS = HPARAM_CANDIDATES([5, 10, 50, 100]).

When running hyperparameter tuning, the valid range is[2, 100],the default range is[2, 10], and the scale type isLINEAR.

`KMEANS_INIT_METHOD`

Syntax

KMEANS_INIT_METHOD={'RANDOM'|'KMEANS++'|'CUSTOM'}

Description

The method of initializing the clusters.

To use the same centroids in repeatedCREATE MODEL queries, specify the option'CUSTOM'.

Arguments

This option accepts the following values:

RANDOM: Initializes the centroids by randomly selectinga number of data points equal to theNUM_CLUSTERS valuefrom the input data. This is the default value.
KMEANS++: Initializes a number of centroids equal to theNUM_CLUSTERS value by using thek-means++algorithm. Using this approach usually trains a better model thanusing random cluster initialization.
CUSTOM: Initializes the centroids using a provided column of typeBOOL.BigQuery ML uses the rows with a value ofTRUE as the initial centroids. You specify the column to use by using theKMEANS_INIT_COL option.
When you use this option, if the values in the column identified by'KMEANS_INIT_COL' remain constant, then repeatedCREATE MODEL queriesuse the same centroids.

`KMEANS_INIT_COL`

Syntax

KMEANS_INIT_COL =string_value

Description

The name of the column to use to initialize the centroids. This column must havea type ofBOOL. If this column contains a value ofTRUE for a given row,then BigQuery ML uses that row as an initial centroid. The numberofTRUE rows in this column must be equal to the value you have specifiedfor theNUM_CLUSTERS option.

You can only use this option if you have specifiedCUSTOM for theKMEANS_INIT_METHOD option.

You can't use this column as a feature; BigQuery ML automatically excludes it.

Arguments

ASTRING value.

`DISTANCE_TYPE`

Syntax

DISTANCE_TYPE={'EUCLIDEAN'|'COSINE'}

Description

The type of metric to use to compute the distance between two points.

Arguments

This option accepts the following values:

EUCLIDEAN: Use the following equation to calculate the distance betweenpointsx andy:
$$ \lVert x-y\rVert_{2} $$
This is the default value.
COSINE: Use the following equation to calculate the distance between pointsx andy:
$$ \sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}} $$
where $ \lVert x\rVert_{2} $ represents the L2 norm forx.

`STANDARDIZE_FEATURES`

Syntax

STANDARDIZE_FEATURES={TRUE|FALSE}

Description

Determines whether tostandardize numerical features.This setting doesn't affectautomatic preprocessing of non-numericalfeatures.

Arguments

ABOOL value. The default value isTRUE.

`MAX_ITERATIONS`

Syntax

MAX_ITERATIONS =int64_value

Description

The maximum number of training iterations, where one iteration representsa single pass of the entire training data.

Arguments

AnINT64 value. The default value is20.

`EARLY_STOP`

Syntax

EARLY_STOP={TRUE|FALSE}

Description

Determines whether training should stop after the first iteration in which therelative loss improvement is less than the value specified forMIN_REL_PROGRESS.

Arguments

ABOOL value. The default value isTRUE.

`MIN_REL_PROGRESS`

Syntax

MIN_REL_PROGRESS =float64_value

Description

The minimum relative loss improvement that is necessary to continue trainingwhenEARLY_STOP is set toTRUE. For example, a value of0.01 specifies that each iteration must reduce the loss by 1% for trainingto continue.

Arguments

AFLOAT64 value. The default value is0.01.

`WARM_START`

Syntax

WARM_START={TRUE|FALSE}

Description

Determines whether to train a model with new training data, new model options,or both. Unless you explicitly override them, the initial options used to trainthe model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. Usethe training run or iteration information returned by theML.TRAINING_INFO functionto distinguish the warm start run from the original run.

The value of theMODEL_TYPE and the training data schema must remainconstant in a warm start models retrain.

Arguments

ABOOL value. The default value isFALSE.

`NUM_TRIALS`

Syntax

NUM_TRIALS =int64_value

Description

The maximum number of submodels to train. The tuning stops whenNUM_TRIALSsubmodels are trained, or when the hyperparameter search space is exhausted.You must specify this option in order to use hyperparameter tuning.

Arguments

AnINT64 value between1 and100, inclusive.

Note: We recommend using at least(number_of_hyperparameters * 10) trials to tune a model.

`MAX_PARALLEL_TRIALS`

Syntax

MAX_PARALLEL_TRIALS =int64_value

Description

The maximum number of trials to run at the same time. If you specify a valuefor this option, you must also specify a value for NUM_TRIALS.

Arguments

AnINT64 value between1 and5, inclusive. The default value is1.

Note: Although specifying largerMAX_PARALLEL_TRIALS values can accelerate thehyperparameter tuning process, acceleration can undermine the final modelquality when you specifyVIZIER_DEFAULT as the HPARAM_TUNING_ALGORITHMvalue. This is because the parallel trials can't benefit from concurrenttraining results.

`HPARAM_TUNING_ALGORITHM`

Syntax

HPARAM_TUNING_ALGORITHM={'VIZIER_DEFAULT'|'RANDOM_SEARCH'|'GRID_SEARCH'}

Description

The algorithm used to tune the hyperparameters. If you specify a valuefor this option, you must also specify a value forNUM_TRIALS.

Arguments

Specify one of the following values:

VIZIER_DEFAULT: Use the default algorithm inVertex AI Vizier to tune hyperparameters. This algorithm is the mostpowerful algorithm of those offered. It performs a mixture of advancedsearch algorithms, includingBayesian optimizationwithGaussian processes. Italso usestransfer learningto take advantage of previously tuned models. This is the default, and alsothe recommended approach.
RANDOM_SEARCH: Userandom searchto explore the search space.
GRID_SEARCH: Usegrid searchto explore the search space. You can only use this algorithm when everyhyperparameter's search space is discrete.

`HPARAM_TUNING_OBJECTIVES`

Syntax

HPARAM_TUNING_OBJECTIVES={'DAVIES_BOULDIN_INDEX'}

DescriptionThe hyperparameter tuning objective for the model. If you specify a valuefor this option, you must also specify a value forNUM_TRIALS.

Arguments

The possible objectives are a subset of themodel evaluation metricsfor the model type. If you aren't running hyperparameter tuning, or if you areand you don't specify an objective, theDAVIES_BOULDIN_INDEX objective isused.

`MODEL_REGISTRY`

TheMODEL_REGISTRY option specifies the model registry destination.VERTEX_AI is the only supported model registry destination. To learn more, seeRegister a BigQuery ML model.

`VERTEX_AI_MODEL_ID`

TheVERTEX_AI_MODEL_ID option specifies a Vertex AI model IDto register the model with. The model ID is associated with yourBigQuery ML model, and is visible from theModel Registry. If you don't specify aVertex AI model ID, the BigQuery MLmodel name is used.

TheVERTEX_AI_MODEL_ID value can have up to 63 characters, and validcharacters are[a-z0-9_-]. The first character cannot be a number or hyphen.If you don't specify a Vertex AI model ID, theBigQuery ML model name must meet these requirements.

You can only set theVERTEX_AI_MODEL_ID option when theMODEL_REGISTRYoption is set toVERTEX_AI.

`VERTEX_AI_MODEL_VERSION_ALIASES`

TheVERTEX_AI_MODEL_VERSION_ALIASES option specifies aVertex AI model alias to use when registering a model. Modelaliases are helpful for fetching or deploying a particular model version byreference without needing to know the specific version ID. To learn more abouthow Model Registry aliases work, seeHow to use model version aliases.

You can only set theVERTEX_AI_MODEL_VERSION_ALIASES option when theMODEL_REGISTRY option is set toVERTEX_AI.

`KMS_KEY_NAME`

Syntax

KMS_KEY_NAME =string_value

Description

The Cloud Key Management Servicecustomer-managed encryption key (CMEK) touse to encrypt the model.

Arguments

ASTRING value containing the fully-qualified name of the CMEK. For example,

'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'

`query_statement`

TheASquery_statement clause specifies theGoogleSQL query used to generate the training data. See theGoogleSQL query syntaxpage for the supported SQL syntax of thequery_statement clause.

Hyperparameter tuning

K-means models supporthyperparameter tuning, which you can useto improve model performance for your data. To usehyperparameter tuning, set theNUM_TRIALs option to thenumber of trials that you want to run. BigQuery ML then trains themodel the number of times that you specify, using different hyperparametervalues, and returns the model that performs the best.

Hyperparameter tuning defaults to improving the key performance metric for thegiven model type. You can use theHPARAM_TUNING_OBJECTIVES option to tune fora different metric if you need to.

For more information about the training objectives and hyperparameterssupported for k-means models, seeKMEANS.To try a tutorial that walks you through hyperparameter tuning, seeImprove model performance with hyperparameter tuning.

Locations

For information about supported locations, seeLocations for non-remote models.

`CREATE MODEL` examples

The following examples create models namedmymodel inmydataset in yourdefault project.

Train a k-means model

This example creates a k-means model with four clusters using the defaultdistance_type value ofeuclidean_distance.

CREATEMODEL`mydataset.mymodel`OPTIONS(MODEL_TYPE='KMEANS',NUM_CLUSTERS=4)ASSELECT*FROM`mydataset.mytable`

Note: Changing the order of columns in theSELECT statement can affect thecentroids in the final model.

Train a k-means model with random cluster initialization method.

This example creates a k-means model with three clusters using the randomcluster initialization method.

CREATEMODEL`mydataset.mymodel`OPTIONS(MODEL_TYPE='KMEANS',NUM_CLUSTERS=3,KMEANS_INIT_METHOD='RANDOM')ASSELECT*FROM`mydataset.mytable`

Train a k-means model with custom cluster initialization method

This example creates a k-means model with three clusters using the customcluster initialization method.init_col identifies the column of typeBOOLthat contains the values which specify whether a given row is an initialcentroid. This column should only contain three rows with the valueTRUE.

CREATEMODEL`mydataset.mymodel`OPTIONS(MODEL_TYPE='KMEANS',NUM_CLUSTERS=3,KMEANS_INIT_METHOD='CUSTOM',KMEANS_INIT_COL='init_col')ASSELECTinit_col,featuresFROM`mydataset.mytable`

Train a k-means model with hyperparameter tuning

This example creates a k-means model and uses hyperparameter tuning to improve model performance.

CREATEMODEL`mydataset.mymodel`OPTIONS(MODEL_TYPE='KMEANS',num_trials=10,max_parallel_trials=2,HPARAM_TUNING_OBJECTIVES=['DAVIES_BOULDIN_INDEX'])ASSELECT*FROM`mydataset.mytable`

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.

Movatterモバイル変換

The CREATE MODEL statement for K-means models

CREATE MODEL syntax

CREATE MODEL

CREATE MODEL IF NOT EXISTS

CREATE OR REPLACE MODEL

model_name

MODEL_TYPE

NUM_CLUSTERS

KMEANS_INIT_METHOD

KMEANS_INIT_COL

DISTANCE_TYPE

STANDARDIZE_FEATURES

MAX_ITERATIONS

EARLY_STOP

MIN_REL_PROGRESS

WARM_START

NUM_TRIALS

MAX_PARALLEL_TRIALS

HPARAM_TUNING_ALGORITHM

HPARAM_TUNING_OBJECTIVES

MODEL_REGISTRY

VERTEX_AI_MODEL_ID

VERTEX_AI_MODEL_VERSION_ALIASES

KMS_KEY_NAME

query_statement

Hyperparameter tuning

Locations

CREATE MODEL examples

Train a k-means model

Train a k-means model with random cluster initialization method.

Train a k-means model with custom cluster initialization method

Train a k-means model with hyperparameter tuning

`CREATE MODEL` syntax

`CREATE MODEL`

`CREATE MODEL IF NOT EXISTS`

`CREATE OR REPLACE MODEL`

`model_name`

`MODEL_TYPE`

`NUM_CLUSTERS`

`KMEANS_INIT_METHOD`

`KMEANS_INIT_COL`

`DISTANCE_TYPE`

`STANDARDIZE_FEATURES`

`MAX_ITERATIONS`

`EARLY_STOP`

`MIN_REL_PROGRESS`

`WARM_START`

`NUM_TRIALS`

`MAX_PARALLEL_TRIALS`

`HPARAM_TUNING_ALGORITHM`

`HPARAM_TUNING_OBJECTIVES`

`MODEL_REGISTRY`

`VERTEX_AI_MODEL_ID`

`VERTEX_AI_MODEL_VERSION_ALIASES`

`KMS_KEY_NAME`

`query_statement`

`CREATE MODEL` examples