Automatic feature preprocessing
BigQuery ML performs automatic preprocessing during training by using the CREATE MODEL statement. Automatic preprocessing consists of missing value imputation and feature transformations.
For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.
Missing data imputation
In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data differently, based on the type of data in the column.
| Column type | Imputation method |
|---|---|
| Numeric | In both training and prediction, NULL values in numeric columns are replaced with the mean value of the given column, as calculated from the feature column in the original input data. |
| One-hot/Multi-hot encoded | In both training and prediction, NULL values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction. |
| TIMESTAMP | TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces values with the mean Unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature. |
| STRUCT | In both training and prediction, each field of the STRUCT is imputed according to its type. |
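As an illustrative sketch (the dataset, table, and column names here are hypothetical), the following CREATE MODEL statement trains a linear regression model on data that contains NULL values. No manual imputation step is needed, because the rules in the preceding table are applied automatically:

```sql
-- Hypothetical names; NULLs in the numeric `age` column are mean-imputed,
-- and NULLs in the STRING `city` column map to an extra one-hot category.
CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['label']
) AS
SELECT
  age,    -- numeric: NULLs replaced with the column mean
  city,   -- string: NULLs become an additional category
  label
FROM `mydataset.training_data`;
```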
Feature transformations
By default, BigQuery ML transforms input features as follows:
| Input data type | Transformation method | Details |
|---|---|---|
| INT64, NUMERIC, BIGNUMERIC, FLOAT64 | Standardization | For most models, BigQuery ML standardizes numerical columns and centers them at zero before passing them into training. The exceptions are boosted tree and random forest models, for which no standardization occurs, and k-means models, where the STANDARDIZE_FEATURES option controls whether numerical features are standardized. |
| BOOL, STRING, BYTES, DATE, DATETIME, TIME | One-hot encoding | For all non-numerical, non-array columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation for all models other than boosted tree and random forest models. This transformation generates a separate feature for each unique value in the column. For boosted tree and random forest models, a label encoding transformation is applied instead, converting each unique value into a numerical value. |
| ARRAY | Multi-hot encoding | For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY. |
| TIMESTAMP | Timestamp transformation | When a linear or logistic regression model encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix epoch time in seconds component, BigQuery ML uses standardization. For all other components, it uses one-hot encoding. For more information, see the following timestamp feature transformation table. |
| STRUCT | Struct expansion | When BigQuery ML encounters a STRUCT column, it expands the fields inside the STRUCT into separate columns. All fields of the STRUCT must be named. Nested STRUCTs are not allowed. The column names after expansion are in the format {struct_name}_{field_name}. |
| ARRAY of STRUCT | No transformation | |
| ARRAY of NUMERIC | No transformation | |
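To see how BigQuery ML treated each input column after training, you can query ML.FEATURE_INFO, which returns per-feature statistics such as the mean used for imputation and the category count used for encoding. A sketch against the hypothetical model from the earlier example:

```sql
-- Inspect per-feature statistics (mean, stddev, category_count, null_count)
-- that BigQuery ML computed during automatic preprocessing.
SELECT *
FROM ML.FEATURE_INFO(MODEL `mydataset.sample_model`);
```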
TIMESTAMP feature transformation
The following table shows the components extracted from TIMESTAMP columns and the corresponding transformation method.
| TIMESTAMP component | processed_input result | Transformation method |
|---|---|---|
| Unix epoch time in seconds | [COLUMN_NAME] | Standardization |
| Day of month | _TS_DOM_[COLUMN_NAME] | One-hot encoding |
| Day of week | _TS_DOW_[COLUMN_NAME] | One-hot encoding |
| Month of year | _TS_MOY_[COLUMN_NAME] | One-hot encoding |
| Hour of day | _TS_HOD_[COLUMN_NAME] | One-hot encoding |
| Minute of hour | _TS_MOH_[COLUMN_NAME] | One-hot encoding |
| Week of year (weeks begin on Sunday) | _TS_WOY_[COLUMN_NAME] | One-hot encoding |
| Year | _TS_YEAR_[COLUMN_NAME] | One-hot encoding |
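As a sketch of how these generated features surface, the following query lists the processed inputs of a hypothetical linear regression model named mydataset.ts_model that was trained with a TIMESTAMP column; the _TS_ prefix comes from the table above:

```sql
-- List the timestamp-derived features (e.g., _TS_DOW_created_at)
-- that BigQuery ML generated for the model's TIMESTAMP column.
SELECT processed_input
FROM ML.WEIGHTS(MODEL `mydataset.ts_model`)
WHERE STARTS_WITH(processed_input, '_TS_');
```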
Category feature encoding
For features that are one-hot encoded, you can specify a different default encoding method by using the CATEGORY_ENCODING_METHOD model option. For generalized linear models (GLMs), you can set CATEGORY_ENCODING_METHOD to one of the values described in the following sections:
One-hot encoding
One-hot encoding maps each category that a feature has to its own binary feature, where 0 represents the absence of the feature and 1 represents the presence (known as a dummy variable). This mapping creates N new feature columns, where N is the number of unique categories for the feature across the training table.
For example, suppose your training table has a feature column that's called fruit with the categories Apple, Banana, and Cranberry, such as the following:
| Row | fruit |
|---|---|
| 1 | Apple |
| 2 | Banana |
| 3 | Cranberry |
In this case, the CATEGORY_ENCODING_METHOD='ONE_HOT_ENCODING' option transforms the table to the following internal representation:
| Row | fruit_Apple | fruit_Banana | fruit_Cranberry |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
One-hot encoding is supported by linear and logistic regression and boosted tree models.
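A minimal sketch of setting this option explicitly, using the hypothetical fruit table from the example above:

```sql
CREATE OR REPLACE MODEL `mydataset.fruit_model`
OPTIONS (
  model_type = 'logistic_reg',
  category_encoding_method = 'ONE_HOT_ENCODING',
  input_label_cols = ['label']
) AS
SELECT fruit, label
FROM `mydataset.fruit_table`;
```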
Dummy encoding
Dummy encoding is similar to one-hot encoding, where a categorical feature is transformed into a set of placeholder variables. Dummy encoding uses N-1 placeholder variables instead of N placeholder variables to represent N categories for a feature. For example, if you set CATEGORY_ENCODING_METHOD to 'DUMMY_ENCODING' for the same fruit feature column shown in the preceding one-hot encoding example, then the table is transformed to the following internal representation:
| Row | fruit_Apple | fruit_Banana |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
The category with the most occurrences in the training dataset is dropped. When multiple categories have the most occurrences, a random category within that set is dropped.
The final set of weights from ML.WEIGHTS still includes the dropped category, but its weight is always 0.0. For ML.ADVANCED_WEIGHTS, the standard error and p-value for the dropped variable are NaN.
If warm_start is used on a model that was initially trained with 'DUMMY_ENCODING', the same placeholder variable is dropped from the first training run. Models cannot change encoding methods between training runs.
Dummy encoding is supported by linear and logistic regression models.
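A minimal sketch of requesting dummy encoding and then inspecting the weights, where the dropped category should appear with a weight of 0.0 (table and column names are hypothetical):

```sql
CREATE OR REPLACE MODEL `mydataset.fruit_dummy_model`
OPTIONS (
  model_type = 'linear_reg',
  category_encoding_method = 'DUMMY_ENCODING',
  input_label_cols = ['label']
) AS
SELECT fruit, label
FROM `mydataset.fruit_table`;

-- The dropped category still appears in the output, with weight 0.0.
SELECT *
FROM ML.WEIGHTS(MODEL `mydataset.fruit_dummy_model`);
```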
Label encoding
Label encoding transforms the value of a categorical feature to an INT64 value in [0, <number of categories>].
For example, if you had a book dataset like the following:
| Title | Genre |
|---|---|
| Book 1 | Fantasy |
| Book 2 | Cooking |
| Book 3 | History |
| Book 4 | Cooking |
The label encoded values might look similar to the following:
| Title | Genre (text) | Genre (numeric) |
|---|---|---|
| Book 1 | Fantasy | 1 |
| Book 2 | Cooking | 2 |
| Book 3 | History | 3 |
| Book 4 | Cooking | 2 |
The encoding vocabulary is sorted alphabetically. NULL values and categories that aren't in the vocabulary are encoded to 0.
Label encoding is supported by boosted tree models.
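Label encoding is applied to boosted tree models by default, so you typically don't need to request it; assuming the option accepts 'LABEL_ENCODING' as an explicit value, a hypothetical sketch looks like this:

```sql
CREATE OR REPLACE MODEL `mydataset.genre_model`
OPTIONS (
  model_type = 'boosted_tree_classifier',
  category_encoding_method = 'LABEL_ENCODING',
  input_label_cols = ['label']
) AS
SELECT genre, label
FROM `mydataset.books`;
```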
Target encoding
Target encoding replaces the categorical feature value with the probability ofthe target for classification models, or with the expected value of the targetfor regression models.
Features that have been target encoded might look similar to the following example:
```
# Classification model
+------------------------+----------------------+
| original value         | target encoded value |
+------------------------+----------------------+
| (category_1, target_1) | 0.5                  |
| (category_1, target_2) | 0.5                  |
| (category_2, target_1) | 0.0                  |
+------------------------+----------------------+

# Regression model
+------------------------+----------------------+
| original value         | target encoded value |
+------------------------+----------------------+
| (category_1, 2)        | 2.5                  |
| (category_1, 3)        | 2.5                  |
| (category_2, 1)        | 1.5                  |
| (category_2, 2)        | 1.5                  |
+------------------------+----------------------+
```
Target encoding is supported by boosted tree models.
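A sketch of requesting target encoding for a boosted tree regressor, with hypothetical dataset, table, and column names:

```sql
CREATE OR REPLACE MODEL `mydataset.genre_target_model`
OPTIONS (
  model_type = 'boosted_tree_regressor',
  category_encoding_method = 'TARGET_ENCODING',
  input_label_cols = ['label']
) AS
SELECT genre, label
FROM `mydataset.books`;
```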
What's next
For more information about supported SQL statements and functions for models that support automatic feature preprocessing, see the following documents: