The ML.VALIDATE_DATA_SKEW function

This document describes theML.VALIDATE_DATA_SKEW function, which you can useto compute the data skew between a model's training and serving data. Thisfunction computes the statistics for the serving data, compares them to thestatistics that were computed for the training data at the time the model wascreated, and identifies where there are anomalous differences between the twodata sets.

You can optionally visualize the function output by usingVertex AI model monitoring.For more information, seeMonitoring visualization.

Statistics are only computed for feature columns in the serving data that matchfeature columns in the training data, in order to achieve better performance andlower cost. For models that were created with use of theTRANSFORM clause,the statistics are based on the raw feature data before feature preprocessingwithin theTRANSFORM clause.

Syntax

ML.VALIDATE_DATA_SKEW(MODEL`PROJECT_ID.DATASET.MODEL_NAME`,{TABLE`PROJECT_ID.DATASET.TABLE_NAME`|(QUERY_STATEMENT)},STRUCT([CATEGORICAL_DEFAULT_THRESHOLDAScategorical_default_threshold][,CATEGORICAL_METRIC_TYPEAScategorical_metric_type][,NUMERICAL_DEFAULT_THRESHOLDASnumerical_default_threshold][,NUMERICAL_METRIC_TYPEASnumerical_metric_type][,THRESHOLDSASthresholds][,ENABLE_VISUALIZATION_LINKASenable_visualization_link]))

Arguments

ML.VALIDATE_DATA_SKEW takes the following arguments:

PROJECT_ID: the BigQuery project thatcontains the resource.
DATASET: the BigQuery dataset thatcontains the resource.
MODEL_NAME: the name of the model.
TABLE_NAME: the name of the input table that containsthe serving data to calculate statistics for.
QUERY_STATEMENT: a query that generates the servingdata to calculate statistics for. For the supported SQL syntax of theQUERY_STATEMENT clause, seeGoogleSQL querysyntax.
CATEGORICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for categoricalandARRAY<categorical> features. The value must be in the range[0, 1).The default value is0.3.
CATEGORICAL_METRIC_TYPE: aSTRING value thatspecifies the metric used to compare statistics for categorical andARRAY<categorical> features. Valid values are as follows:
- L_INFTY: useL-infinitydistance. This value isthe default.
- JENSEN_SHANNON_DIVERGENCE: useJensen–Shannon divergence.
NUMERICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> features. The valuemust be in the range[0, 1). The default value is0.3.
NUMERICAL_METRIC_TYPE: aSTRING value that specifiesthe metric used to compare statistics for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> features. The only valid value isJENSEN_SHANNON_DIVERGENCE.
THRESHOLDS: anARRAY<STRUCT<STRING, FLOAT64>> valuethat specifies the anomaly detection thresholds for one or more columnsfor which you don't want to use the default threshold. TheSTRING value inthe struct specifies the column name, and theFLOAT64 value specifies thethreshold. TheFLOAT64 value must be in the range[0, 1). For example,[('col_a', 0.1), ('col_b', 0.8)].
ENABLE_VISUALIZATION_LINK: aBOOL value thatdetermines whether to return links to the visualized function output. When youspecifyTRUE for this argument, theML.VALIDATE_DATA_DRIFT output includesthevisualization_link column. Thevisualization_link column provides URLsthat link to visualizations of the function results inVertex AI monitoring.
When you specifyTRUE for this argument, themodel argument value mustrefer to a BigQuery ML model that isregistered withVertex AI. If the model isn't registered, an invalid queryerror is returned.

Output

ML.VALIDATE_DATA_SKEW returns one row for each column in the input data.ML.VALIDATE_DATA_SKEW output contains the following columns:

input: aSTRING column that contains the input column name.
metric: aSTRING column that contains the metric used to compare theinput column statistical value between the training and serving data sets.This column value isJENSEN_SHANNON_DIVERGENCE for numerical features, andeitherL_INFTY orJENSEN_SHANNON_DIVERGENCE for categorical features.
threshold: aFLOAT64 column that contains the threshold used to determinewhether the statistical difference in theinput column value between thetraining and serving data is anomalous.
value: aFLOAT64 column that contains the statistical difference intheinput column value between the serving and the training data sets.
is_anomaly: aBOOL column that indicates whether thevalue value ishigher than thethreshold value.

visualization_link: a URL thatlinks to a Vertex AI visualization of the results for the givenfeature. The URL is formatted as follows:

https://console.cloud.google.com/vertex-ai/model-monitoring/locations/region/model-monitors/vertex_model_monitor_id/model-monitoring-jobs/vertex_model_monitoring_job_id/feature-drift?project=project_id&featureName=feature_name

For example:

https://console.cloud.google.com/vertex-ai/model-monitoring/locations/us-central1/model-monitors/bq123456789012345647/model-monitoring-jobs/bqjob890123456789012/feature-drift?project=myproject&featureName=petal_length

This column is only returned when theenable_visualization_link argumentvalue isTRUE.

For more information, seeMonitoring visualization.

Examples

The following examples demonstrate how to use theML.VALIDATE_DATA_SKEWfunction.

Run`ML.VALIDATE_DATA_SKEW`

The following example computes data skew between the serving data and thetraining data used to create the model, with a categorical feature thresholdof0.2:

SELECT*FROMML.VALIDATE_DATA_SKEW(MODEL`myproject.mydataset.mymodel`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold));

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+| input            | metric                   | threshold |  value | is_anomaly |+------------------+--------------------------+-----------+--------+------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       |+------------------+--------------------------+-----------+--------+------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      |+------------------+--------------------------+-----------+--------+------------+

Run`ML.VALIDATE_DATA_SKEW` and visualize the results

The following example computes data skew between the serving data and thetraining data used to create the model, with a categorical feature thresholdof0.2:

SELECT*FROMML.VALIDATE_DATA_SKEW(MODEL`myproject.mydataset.mymodel`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold,TRUEASenable_visualization_link));

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| input            | metric                   | threshold |  value | is_anomaly | visualization_link                                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=dropoff_latitude                 |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=payment_type                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for numerical features:

A graph that visualizes the monitoring output for the `ML.VALIDATE_DATA_SKEW` function

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for categorical features:

A bar chart that visualizes the monitoring output for the `ML.VALIDATE_DATA_SKEW` function

Automate skew detection

The following example shows how to automate skew detection for alinear regression model:

DECLAREanomaliesARRAY<STRING>;SETanomalies=(SELECTARRAY_AGG(input)FROMML.VALIDATE_DATA_SKEW(MODELmydataset.model_linear_reg,TABLEmydataset.serving,STRUCT(0.3AScategorical_default_threshold,0.2ASnumerical_default_threshold,'JENSEN_SHANNON_DIVERGENCE'ASnumerical_metric_type,[STRUCT('fare',0.15),STRUCT('company',0.25)]ASthresholds))WHEREis_anomaly);IF(ARRAY_LENGTH(anomalies)>0)THENCREATEORREPLACEMODELmydataset.model_linear_regTRANSFORM(ML.MIN_MAX_SCALER(fare)OVER()ASf1,ML.ROBUST_SCALER(pickup_longitude)OVER()ASf2,ML.LABEL_ENCODER(company)OVER()ASf3,ML.ONE_HOT_ENCODER(payment_type)OVER()ASf4,label)OPTIONS(model_type='linear_reg',max_iterations=1)AS(SELECTfare,pickup_longitude,company,payment_type,2ASlabelFROMmydataset.new_training_data);SELECTERROR(CONCAT("Found data skew in features: ",ARRAY_TO_STRING(anomalies,", "),". Model is retrained with the latest data."));ELSESELECT*FROMML.PREDICT(MODELmydataset.model_linear_reg,TABLEmydataset.serving);ENDIF;

Limitations

ML.VALIDATE_DATA_SKEW doesn't support the following types of models:
- AutoML
- Matrix factorization
- ARIMA_PLUS
- Remote models overLLMs,Cloud AI services,orVertex AI endpoints
- ImportedOpen Neural Network Exchange (ONNX),TensorFlow,TensorFlow Lite,orXGBoostmodels
ML.VALIDATE_DATA_SKEW doesn't support models created before March 28, 2024,or models that use theWARM START option. To enable use ofML.VALIDATE_DATA_SKEW, retrain the model by running theCREATE OR REPLACE model statement.
Running theML.VALIDATE_DATA_SKEW function on a large amount of input datacan cause the query toreturn the errorDry run query timed out. To resolve the error,disable retrieval of cached results for the query.
ML.VALIDATE_DATA_SKEW doesn't conduct schema validation between the two setsof input data, and so handles data type mismatches as follows:
- If you specifyJENSEN_SHANNON_DIVERGENCE for thecategorical_default_threshold ornumerical_default_thresholdargument, the feature isn't included in the final anomaly report.
- If you specifyL_INFTY for thecategorical_default_thresholdargument, the function outputs the computed feature distance as expected.
However, when you run inference on the serving data, theML.PREDICT functionhandles schema validation.

Pricing

TheML.VALIDATE_DATA_SKEW function usesBigQuery on-demand compute pricing.

What's next

For more information about model monitoring in BigQuery ML, seeModel monitoring overview.
For more information about supported SQL statements and functions for MLmodels, seeEnd-to-end user journeys for ML models.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.

Movatterモバイル変換