The CREATE MODEL statement for PCA models
This document describes the CREATE MODEL statement for creating principal component analysis (PCA) models in BigQuery by using SQL. Alternatively, you can use the Google Cloud console user interface to create a model by using a UI (Preview) instead of constructing the SQL statement yourself.
You can use PCA models with the ML.PREDICT or AI.GENERATE_EMBEDDING functions to embed data into a lower-dimensional space, and with the ML.DETECT_ANOMALIES function to perform anomaly detection.
For more information about supported SQL statements and functions for this model, see End-to-end user journeys for ML models.
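For example, a trained PCA model can flag unusual rows with ML.DETECT_ANOMALIES. The following is a minimal sketch; the model name (mydataset.my_pca_model), the table name (mydataset.new_data), and the contamination value are illustrative placeholders, not values prescribed by this document.

-- Flag rows whose reconstruction from the principal components is poor.
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `mydataset.my_pca_model`,
    -- Fraction of the data assumed to be anomalous; 0.02 is only an illustration.
    STRUCT(0.02 AS contamination),
    TABLE `mydataset.new_data`);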
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
  MODEL_TYPE = 'PCA',
  NUM_PRINCIPAL_COMPONENTS = int64_value | PCA_EXPLAINED_VARIANCE_RATIO = float64_value
  [, SCALE_FEATURES = { TRUE | FALSE } ]
  [, PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' } ]
  [, MODEL_REGISTRY = { 'VERTEX_AI' } ]
  [, VERTEX_AI_MODEL_ID = string_value ]
  [, VERTEX_AI_MODEL_VERSION_ALIASES = string_array ]
  [, KMS_KEY_NAME = string_value ]

CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name is case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = { 'PCA' }

Description
Specify the model type. This option is required.
Arguments
'PCA': Principal component analysis computes principal components and uses them to perform a change of basis on the data. This approach is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components. This lets the model obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
PCA is an unsupervised learning technique, so model training doesn't require either labels or input data that is split into sets for training and evaluation.
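As an informal mathematical sketch (not part of this reference): for a centered feature matrix X with sample covariance matrix \Sigma, the first principal component is the unit vector

\mathbf{w}_1 = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Var}(X\mathbf{w}) = \arg\max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^{\top} \Sigma \mathbf{w},

and each subsequent component maximizes the same objective subject to being orthogonal to the components before it.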
NUM_PRINCIPAL_COMPONENTS
Syntax
NUM_PRINCIPAL_COMPONENTS = int64_value
Description
The number of principal components to keep.
You must specify either NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO, but not both.
Arguments
An INT64 value. This value can't be larger than the total number of rows or the total feature cardinalities after one-hot encoding the categorical features.
PCA_EXPLAINED_VARIANCE_RATIO
Syntax
PCA_EXPLAINED_VARIANCE_RATIO = float64_value
Description
The ratio for the explained variance. The number of principal components is selected such that the percentage of variance explained by the principal components is greater than the ratio specified by this argument.
You must specify either PCA_EXPLAINED_VARIANCE_RATIO or NUM_PRINCIPAL_COMPONENTS, but not both.
Arguments
A FLOAT64 value in the range (0, 1).
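After training, you can check how many principal components were kept and how much variance they explain by inspecting the model. The following is a minimal sketch, assuming a model named mydataset.mymodel and using the ML.PRINCIPAL_COMPONENT_INFO evaluation function:

-- Returns one row per retained principal component, including its explained variance.
SELECT
  *
FROM
  ML.PRINCIPAL_COMPONENT_INFO(MODEL `mydataset.mymodel`);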
SCALE_FEATURES
Syntax
SCALE_FEATURES = { TRUE | FALSE }

Description
Determines whether or not to scale the numerical features to unit variance. The input numerical features are always centered to have zero mean value. Separately, categorical features are one-hot encoded.
Arguments
A BOOL value. The default value is TRUE.
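As an informal sketch of what these settings mean (the formula below is not quoted from this reference), a numerical feature value x with training-data mean \mu and standard deviation \sigma is transformed as

x' = \frac{x - \mu}{\sigma} \; (\text{SCALE\_FEATURES} = \text{TRUE}), \qquad x' = x - \mu \; (\text{SCALE\_FEATURES} = \text{FALSE}).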
PCA_SOLVER
Syntax
PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' }

Description
The solver to use to calculate the principal components.
Arguments
This option accepts the following values:
- 'FULL': Run a full eigendecomposition algorithm. In this case, the maximum allowed feature cardinality after one-hot encoding the categoricals is dynamically estimated. The primary factor that determines the feature cardinality value is the lengths of the feature names, and this value isn't affected by the values of the NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO options. As a guideline, the maximum allowed feature cardinality typically falls between 1,000 and 1,500. If the total feature cardinality of the input data violates the estimated maximum value, then an invalid query error is returned.
- 'RANDOMIZED': Run a randomized PCA algorithm. In this case, the maximum allowed feature cardinality is 10,000. If the feature cardinality of the input data is less than 10,000, then the cap on the number of principal components to compute is dynamically determined based on resource constraints.
  - If you specify NUM_PRINCIPAL_COMPONENTS, then the value for NUM_PRINCIPAL_COMPONENTS must be less than or equal to 10,000. Larger values result in invalid query errors.
  - If you specify PCA_EXPLAINED_VARIANCE_RATIO, then all principal components under the 10,000 cap are computed. If their total explained variance ratio is less than the PCA_EXPLAINED_VARIANCE_RATIO value, then they are all returned; otherwise a subset is returned.
- 'AUTO': This is the default. In this case, the solver is selected by a default policy based on the input data. Typically, when the feature cardinality after one-hot encoding all the categoricals is less than a dynamically determined threshold, the exact full eigendecomposition is computed. Otherwise, randomized PCA is performed. The dynamically determined threshold typically falls between 1,000 and 1,500. The number of rows in the input data is not considered when choosing the solver.
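For example, the following is a minimal sketch of forcing the randomized solver on a wide input table; the table name and the component count are illustrative placeholders:

CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    -- Must be less than or equal to 10,000 when PCA_SOLVER = 'RANDOMIZED'.
    NUM_PRINCIPAL_COMPONENTS = 50,
    PCA_SOLVER = 'RANDOMIZED')
AS SELECT
  *
FROM
  `mydataset.wide_table`;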
MODEL_REGISTRY
The MODEL_REGISTRY option specifies the model registry destination. VERTEX_AI is the only supported model registry destination. To learn more, see Register a BigQuery ML model.
VERTEX_AI_MODEL_ID
The VERTEX_AI_MODEL_ID option specifies a Vertex AI model ID to register the model with. The model ID is associated with your BigQuery ML model, and is visible from the Model Registry. If you don't specify a Vertex AI model ID, the BigQuery ML model name is used.
The VERTEX_AI_MODEL_ID value can have up to 63 characters, and valid characters are [a-z0-9_-]. The first character cannot be a number or hyphen. If you don't specify a Vertex AI model ID, the BigQuery ML model name must meet these requirements.
You can only set the VERTEX_AI_MODEL_ID option when the MODEL_REGISTRY option is set to VERTEX_AI.
VERTEX_AI_MODEL_VERSION_ALIASES
The VERTEX_AI_MODEL_VERSION_ALIASES option specifies a Vertex AI model alias to use when registering a model. Model aliases are helpful for fetching or deploying a particular model version by reference without needing to know the specific version ID. To learn more about how Model Registry aliases work, see How to use model version aliases.
You can only set the VERTEX_AI_MODEL_VERSION_ALIASES option when the MODEL_REGISTRY option is set to VERTEX_AI.
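The following is a minimal sketch of registering a PCA model in Model Registry; the model ID and alias values are illustrative, not required names:

CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4,
    MODEL_REGISTRY = 'VERTEX_AI',
    -- Optional; defaults to the BigQuery ML model name if omitted.
    VERTEX_AI_MODEL_ID = 'my_pca_model',
    -- Illustrative alias; pick aliases that fit your versioning workflow.
    VERTEX_AI_MODEL_VERSION_ALIASES = ['experimental'])
AS SELECT
  *
FROM
  `mydataset.mytable`;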
KMS_KEY_NAME
Syntax
KMS_KEY_NAME = string_value
Description
The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.
Arguments
A STRING value containing the fully-qualified name of the CMEK. For example,
'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'
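The following is a minimal sketch of creating a CMEK-protected PCA model; the key path and table name are illustrative placeholders:

CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4,
    -- Fully-qualified Cloud KMS key name; placeholder value shown.
    KMS_KEY_NAME = 'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key')
AS SELECT
  *
FROM
  `mydataset.mytable`;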
query_statement

The AS query_statement clause specifies the GoogleSQL query used to generate the training data. See the GoogleSQL query syntax page for the supported SQL syntax of the query_statement clause.
Locations
For information about supported locations, see Locations for non-remote models.
Examples
The following examples create models named mymodel in mydataset in your default project.
Use the NUM_PRINCIPAL_COMPONENTS option
Example 1
This example creates a PCA model with four principal components.
CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4)
AS SELECT
  *
FROM
  `mydataset.mytable`
Example 2
This example creates the mydataset.iris_pca PCA model, which performs dimensionality reduction on the input features.
CREATE MODEL `mydataset.iris_pca`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 2,
    SCALE_FEATURES = FALSE)
AS SELECT
  sepal_length,
  sepal_width,
  petal_length,
  petal_width
FROM
  `bigquery-public-data.ml_datasets.iris`;
The following sample uses the mydataset.iris_pca model to transform the input features into a lower-dimensional space, which is then used to train the mydataset.iris_logistic model. mydataset.iris_logistic can be a better ML model than one trained on the original input features if those features are afflicted by the curse of dimensionality.
CREATE MODEL `mydataset.iris_logistic`
  OPTIONS (
    MODEL_TYPE = 'LOGISTIC_REG',
    INPUT_LABEL_COLS = ['species'])
AS SELECT
  *
FROM
  ML.PREDICT(
    MODEL `mydataset.iris_pca`,
    (
      SELECT
        sepal_length,
        sepal_width,
        petal_length,
        petal_width,
        species
      FROM
        `bigquery-public-data.ml_datasets.iris`));
Use the PCA_EXPLAINED_VARIANCE_RATIO option
This example creates a PCA model, where the number of principal components is selected such that the percentage of variance explained by them is greater than 0.8.
CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    PCA_EXPLAINED_VARIANCE_RATIO = 0.8)
AS SELECT
  *
FROM
  `mydataset.mytable`