The CREATE MODEL statement for PCA models
This document describes the CREATE MODEL statement for creating principal component analysis (PCA) models in BigQuery by using SQL. Alternatively, you can use the Google Cloud console user interface to create a model by using a UI (Preview) instead of constructing the SQL statement yourself.
You can use PCA models with the ML.PREDICT or AI.GENERATE_EMBEDDING functions to embed data into a lower-dimensional space, and with the ML.DETECT_ANOMALIES function to perform anomaly detection.
For more information about supported SQL statements and functions for this model, see End-to-end user journeys for ML models.
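For example, a trained PCA model can flag unusual rows with ML.DETECT_ANOMALIES. The following is a minimal sketch; the model name (mydataset.my_pca_model), the table name (mydataset.new_data), and the contamination value are illustrative placeholders, not values prescribed by this document.

-- Flag rows whose reconstruction from the principal components is poor.
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `mydataset.my_pca_model`,
    -- Fraction of the data assumed to be anomalous; 0.02 is only an illustration.
    STRUCT(0.02 AS contamination),
    TABLE `mydataset.new_data`);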
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
  MODEL_TYPE = 'PCA',
  NUM_PRINCIPAL_COMPONENTS = int64_value | PCA_EXPLAINED_VARIANCE_RATIO = float64_value
  [, SCALE_FEATURES = { TRUE | FALSE } ]
  [, PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' } ]
  [, MODEL_REGISTRY = { 'VERTEX_AI' } ]
  [, VERTEX_AI_MODEL_ID = string_value ]
  [, VERTEX_AI_MODEL_VERSION_ALIASES = string_array ]
  [, KMS_KEY_NAME = string_value ]

CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name is case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = { 'PCA' }

Description
Specify the model type. This option is required.
Arguments
'PCA': Principal component analysis computes principal components and uses them to perform a change of basis on the data. This approach is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components. This lets the model obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
PCA is an unsupervised learning technique, so model training doesn't require either labels or input data that is split into sets for training and evaluation.
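As an informal mathematical sketch (not part of this reference): for a centered feature matrix X with sample covariance matrix \Sigma, the first principal component is the unit vector

\mathbf{w}_1 = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Var}(X\mathbf{w}) = \arg\max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^{\top} \Sigma \mathbf{w},

and each subsequent component maximizes the same objective subject to being orthogonal to the components before it.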
NUM_PRINCIPAL_COMPONENTS
Syntax
NUM_PRINCIPAL_COMPONENTS = int64_value
Description
The number of principal components to keep.
You must specify either NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO, but not both.
Arguments
An INT64 value. This value can't be larger than the total number of rows or the total feature cardinalities after one-hot encoding the categorical features.
PCA_EXPLAINED_VARIANCE_RATIO
Syntax
PCA_EXPLAINED_VARIANCE_RATIO = float64_value
Description
The ratio for the explained variance. The number of principal components is selected such that the percentage of variance explained by the principal components is greater than the ratio specified by this argument.
You must specify either PCA_EXPLAINED_VARIANCE_RATIO or NUM_PRINCIPAL_COMPONENTS, but not both.
Arguments
A FLOAT64 value in the range (0, 1).
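After training, you can check how many principal components were kept and how much variance they explain by inspecting the model. The following is a minimal sketch, assuming a model named mydataset.mymodel and using the ML.PRINCIPAL_COMPONENT_INFO evaluation function:

-- Returns one row per retained principal component, including its explained variance.
SELECT
  *
FROM
  ML.PRINCIPAL_COMPONENT_INFO(MODEL `mydataset.mymodel`);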
SCALE_FEATURES
Syntax
SCALE_FEATURES = { TRUE | FALSE }

Description
Determines whether or not to scale the numerical features to unit variance. The input numerical features are always centered to have zero mean value. Separately, categorical features are one-hot encoded.
Arguments
A BOOL value. The default value is TRUE.
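As an informal sketch of what these settings mean (the formula below is not quoted from this reference), a numerical feature value x with training-data mean \mu and standard deviation \sigma is transformed as

x' = \frac{x - \mu}{\sigma} \; (\text{SCALE\_FEATURES} = \text{TRUE}), \qquad x' = x - \mu \; (\text{SCALE\_FEATURES} = \text{FALSE}).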
PCA_SOLVER
Syntax
PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' }

Description
The solver to use to calculate the principal components.
Arguments
This option accepts the following values:
- 'FULL': Run a full eigendecomposition algorithm. In this case, the maximum allowed feature cardinality after one-hot encoding the categoricals is dynamically estimated. The primary factor that determines the feature cardinality value is the lengths of the feature names, and this value isn't affected by the values of the NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO options. As a guideline, the maximum allowed feature cardinality typically falls between 1,000 and 1,500. If the total feature cardinality of the input data violates the estimated maximum value, then an invalid query error is returned.
- 'RANDOMIZED': Run a randomized PCA algorithm. In this case, the maximum allowed feature cardinality is 10,000. If the feature cardinality of the input data is less than 10,000, then the cap on the number of principal components to compute is dynamically determined based on resource constraints.
  - If you specify NUM_PRINCIPAL_COMPONENTS, then the value for NUM_PRINCIPAL_COMPONENTS must be less than or equal to 10,000. Larger values result in invalid query errors.
  - If you specify PCA_EXPLAINED_VARIANCE_RATIO, then all principal components under the 10,000 cap are computed. If their total explained variance ratio is less than the PCA_EXPLAINED_VARIANCE_RATIO value, then they are all returned; otherwise a subset is returned.
- 'AUTO': This is the default. In this case, the solver is selected by a default policy based on the input data. Typically, when the feature cardinality after one-hot encoding all the categoricals is less than a dynamically determined threshold, the exact full eigendecomposition is computed. Otherwise, randomized PCA is performed. The dynamically determined threshold typically falls between 1,000 and 1,500. The number of rows in the input data is not considered when choosing the solver.
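For example, the following is a minimal sketch of forcing the randomized solver on a wide input table; the table name and the component count are illustrative placeholders:

CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    -- Must be less than or equal to 10,000 when PCA_SOLVER = 'RANDOMIZED'.
    NUM_PRINCIPAL_COMPONENTS = 50,
    PCA_SOLVER = 'RANDOMIZED')
AS SELECT
  *
FROM
  `mydataset.wide_table`;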
MODEL_REGISTRY
The MODEL_REGISTRY option specifies the model registry destination. VERTEX_AI is the only supported model registry destination. To learn more, see Register a BigQuery ML model.
VERTEX_AI_MODEL_ID
The VERTEX_AI_MODEL_ID option specifies a Vertex AI model ID to register the model with. The model ID is associated with your BigQuery ML model, and is visible from the Model Registry. If you don't specify a Vertex AI model ID, the BigQuery ML model name is used.
The VERTEX_AI_MODEL_ID value can have up to 63 characters, and valid characters are [a-z0-9_-]. The first character cannot be a number or hyphen. If you don't specify a Vertex AI model ID, the BigQuery ML model name must meet these requirements.
You can only set the VERTEX_AI_MODEL_ID option when the MODEL_REGISTRY option is set to VERTEX_AI.
VERTEX_AI_MODEL_VERSION_ALIASES
The VERTEX_AI_MODEL_VERSION_ALIASES option specifies a Vertex AI model alias to use when registering a model. Model aliases are helpful for fetching or deploying a particular model version by reference without needing to know the specific version ID. To learn more about how Model Registry aliases work, see How to use model version aliases.
You can only set the VERTEX_AI_MODEL_VERSION_ALIASES option when the MODEL_REGISTRY option is set to VERTEX_AI.
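The following is a minimal sketch of registering a PCA model in Model Registry; the model ID and alias values are illustrative, not required names:

CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4,
    MODEL_REGISTRY = 'VERTEX_AI',
    -- Optional; defaults to the BigQuery ML model name if omitted.
    VERTEX_AI_MODEL_ID = 'my_pca_model',
    -- Illustrative alias; pick aliases that fit your versioning workflow.
    VERTEX_AI_MODEL_VERSION_ALIASES = ['experimental'])
AS SELECT
  *
FROM
  `mydataset.mytable`;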
KMS_KEY_NAME
Syntax
KMS_KEY_NAME = string_value
Description
The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.
Arguments
A STRING value containing the fully-qualified name of the CMEK. For example,
'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'
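The following is a minimal sketch of creating a CMEK-protected PCA model; the key path and table name are illustrative placeholders:

CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4,
    -- Fully-qualified Cloud KMS key name; placeholder value shown.
    KMS_KEY_NAME = 'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key')
AS SELECT
  *
FROM
  `mydataset.mytable`;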
query_statement

The AS query_statement clause specifies the GoogleSQL query used to generate the training data. See the GoogleSQL query syntax page for the supported SQL syntax of the query_statement clause.
Locations
For information about supported locations, see Locations for non-remote models.
Examples
The following examples create models named mymodel in mydataset in your default project.
Use the NUM_PRINCIPAL_COMPONENTS option
Example 1
This example creates a PCA model with four principal components.
CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 4)
AS SELECT
  *
FROM
  `mydataset.mytable`
Example 2
This example creates the mydataset.iris_pca PCA model, which performs dimensionality reduction on the input features.
CREATE MODEL `mydataset.iris_pca`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 2,
    SCALE_FEATURES = FALSE)
AS SELECT
  sepal_length,
  sepal_width,
  petal_length,
  petal_width
FROM
  `bigquery-public-data.ml_datasets.iris`;
The following sample uses the mydataset.iris_pca model to transform the input features into a lower-dimensional space, which is then used to train the mydataset.iris_logistic model. mydataset.iris_logistic can be a better ML model than one trained on the original input features if those features are afflicted by the curse of dimensionality.
CREATE MODEL `mydataset.iris_logistic`
  OPTIONS (
    MODEL_TYPE = 'LOGISTIC_REG',
    INPUT_LABEL_COLS = ['species'])
AS SELECT
  *
FROM
  ML.PREDICT(
    MODEL `mydataset.iris_pca`,
    (
      SELECT
        sepal_length,
        sepal_width,
        petal_length,
        petal_width,
        species
      FROM
        `bigquery-public-data.ml_datasets.iris`));
Use the PCA_EXPLAINED_VARIANCE_RATIO option
This example creates a PCA model, where the number of principal components is selected such that the percentage of variance explained by them is greater than 0.8.
CREATE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    PCA_EXPLAINED_VARIANCE_RATIO = 0.8)
AS SELECT
  *
FROM
  `mydataset.mytable`