The ML.VALIDATE_DATA_SKEW function
This document describes the ML.VALIDATE_DATA_SKEW function, which you can use to compute the data skew between a model's training and serving data. This function computes the statistics for the serving data, compares them to the statistics that were computed for the training data at the time the model was created, and identifies where there are anomalous differences between the two data sets.
You can optionally visualize the function output by using Vertex AI model monitoring. For more information, see Monitoring visualization.
To improve performance and lower cost, statistics are only computed for feature columns in the serving data that match feature columns in the training data. For models that were created with the TRANSFORM clause, the statistics are based on the raw feature data before feature preprocessing within the TRANSFORM clause.
Syntax
ML.VALIDATE_DATA_SKEW(
  MODEL `PROJECT_ID.DATASET.MODEL_NAME`,
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [CATEGORICAL_DEFAULT_THRESHOLD AS categorical_default_threshold]
    [, CATEGORICAL_METRIC_TYPE AS categorical_metric_type]
    [, NUMERICAL_DEFAULT_THRESHOLD AS numerical_default_threshold]
    [, NUMERICAL_METRIC_TYPE AS numerical_metric_type]
    [, THRESHOLDS AS thresholds]
    [, ENABLE_VISUALIZATION_LINK AS enable_visualization_link])
)
Arguments
ML.VALIDATE_DATA_SKEW takes the following arguments:
- PROJECT_ID: the BigQuery project that contains the resource.
- DATASET: the BigQuery dataset that contains the resource.
- MODEL_NAME: the name of the model.
- TABLE_NAME: the name of the input table that contains the serving data to calculate statistics for.
- QUERY_STATEMENT: a query that generates the serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
- CATEGORICAL_DEFAULT_THRESHOLD: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- CATEGORICAL_METRIC_TYPE: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen-Shannon divergence.
- NUMERICAL_DEFAULT_THRESHOLD: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- NUMERICAL_METRIC_TYPE: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- THRESHOLDS: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
- ENABLE_VISUALIZATION_LINK: a BOOL value that determines whether to return links to the visualized function output. When you specify TRUE for this argument, the ML.VALIDATE_DATA_SKEW output includes the visualization_link column. The visualization_link column provides URLs that link to visualizations of the function results in Vertex AI monitoring.
  When you specify TRUE for this argument, the model argument value must refer to a BigQuery ML model that is registered with Vertex AI. If the model isn't registered, an invalid query error is returned.
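As an illustration of how these arguments can be combined, the following sketch passes a query statement instead of a table, switches the categorical metric to Jensen-Shannon divergence, and overrides the threshold for two columns. The model and table names, the pickup_date column, and the fare and payment_type features are hypothetical placeholders, not names defined by the function.

```sql
-- A minimal sketch, assuming a hypothetical model and serving table.
SELECT *
FROM
  ML.VALIDATE_DATA_SKEW(
    MODEL `myproject.mydataset.mymodel`,
    -- QUERY_STATEMENT form: restrict the serving data to recent rows.
    -- pickup_date is an assumed DATE column in the serving table.
    (
      SELECT *
      FROM `myproject.mydataset.serving`
      WHERE pickup_date >= DATE '2024-06-01'
    ),
    STRUCT(
      -- Compare categorical features with Jensen-Shannon divergence
      -- instead of the default L_INFTY metric.
      'JENSEN_SHANNON_DIVERGENCE' AS categorical_metric_type,
      -- Default threshold for numerical features without an override.
      0.25 AS numerical_default_threshold,
      -- Per-column overrides; column names are assumptions.
      [STRUCT('fare', 0.15), STRUCT('payment_type', 0.1)] AS thresholds));
```

Columns that aren't listed in thresholds fall back to the default threshold for their feature type.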
Output
ML.VALIDATE_DATA_SKEW returns one row for each column in the input data. ML.VALIDATE_DATA_SKEW output contains the following columns:
- input: a STRING column that contains the input column name.
- metric: a STRING column that contains the metric used to compare the input column statistical value between the training and serving data sets. This column value is JENSEN_SHANNON_DIVERGENCE for numerical features, and either L_INFTY or JENSEN_SHANNON_DIVERGENCE for categorical features.
- threshold: a FLOAT64 column that contains the threshold used to determine whether the statistical difference in the input column value between the training and serving data is anomalous.
- value: a FLOAT64 column that contains the statistical difference in the input column value between the serving and the training data sets.
- is_anomaly: a BOOL column that indicates whether the value value is higher than the threshold value.
- visualization_link: a URL that links to a Vertex AI visualization of the results for the given feature. The URL is formatted as follows:
  https://console.cloud.google.com/vertex-ai/model-monitoring/locations/region/model-monitors/vertex_model_monitor_id/model-monitoring-jobs/vertex_model_monitoring_job_id/feature-drift?project=project_id&featureName=feature_name
  For example:
  https://console.cloud.google.com/vertex-ai/model-monitoring/locations/us-central1/model-monitors/bq123456789012345647/model-monitoring-jobs/bqjob890123456789012/feature-drift?project=myproject&featureName=petal_length
  This column is only returned when the enable_visualization_link argument value is TRUE. For more information, see Monitoring visualization.
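To make the relationship between the value, threshold, and is_anomaly columns concrete, the following self-contained sketch computes an L-infinity distance by hand for one categorical feature. It assumes the common definition of L-infinity distance as the largest absolute difference between the per-category frequencies of the two data sets; the category shares are invented for illustration, and the function's internal computation might differ in detail.

```sql
-- A minimal sketch with made-up category frequencies; it does not call
-- ML.VALIDATE_DATA_SKEW and only illustrates the comparison logic.
WITH
  training AS (
    SELECT *
    FROM UNNEST([
      STRUCT('cash' AS category, 0.50 AS share),
      STRUCT('card' AS category, 0.40 AS share),
      STRUCT('other' AS category, 0.10 AS share)])
  ),
  serving AS (
    SELECT *
    FROM UNNEST([
      STRUCT('cash' AS category, 0.25 AS share),
      STRUCT('card' AS category, 0.65 AS share),
      STRUCT('other' AS category, 0.10 AS share)])
  ),
  distance AS (
    -- Largest absolute per-category difference. Categories that are
    -- missing from one side count as a share of 0.
    SELECT
      MAX(ABS(IFNULL(t.share, 0) - IFNULL(s.share, 0))) AS value
    FROM training AS t
    FULL OUTER JOIN serving AS s
      USING (category)
  )
SELECT
  value,
  0.3 AS threshold,
  -- Mirrors the documented rule: anomalous when value is higher
  -- than threshold.
  value > 0.3 AS is_anomaly
FROM distance;
```

With these shares the largest difference is roughly 0.25, which stays under the default threshold of 0.3, so the feature would not be flagged.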
Examples
The following examples demonstrate how to use the ML.VALIDATE_DATA_SKEW function.
Run ML.VALIDATE_DATA_SKEW
The following example computes data skew between the serving data and the training data used to create the model, with a categorical feature threshold of 0.2:
SELECT *
FROM
  ML.VALIDATE_DATA_SKEW(
    MODEL `myproject.mydataset.mymodel`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(0.2 AS categorical_default_threshold));
The output looks similar to the following:
+------------------+---------------------------+-----------+-------+------------+
| input            | metric                    | threshold | value | is_anomaly |
+------------------+---------------------------+-----------+-------+------------+
| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE | 0.2       | 0.7   | true       |
+------------------+---------------------------+-----------+-------+------------+
| payment_type     | L_INFTY                   | 0.3       | 0.2   | false      |
+------------------+---------------------------+-----------+-------+------------+

Run ML.VALIDATE_DATA_SKEW and visualize the results
The following example computes data skew between the serving data and the training data used to create the model, with a categorical feature threshold of 0.2, and returns a visualization link for each feature:
SELECT *
FROM
  ML.VALIDATE_DATA_SKEW(
    MODEL `myproject.mydataset.mymodel`,
    TABLE `myproject.mydataset.serving`,
    STRUCT(
      0.2 AS categorical_default_threshold,
      TRUE AS enable_visualization_link));
The output looks similar to the following:
+------------------+---------------------------+-----------+-------+------------+--------------------------------------------------------+
| input            | metric                    | threshold | value | is_anomaly | visualization_link                                     |
+------------------+---------------------------+-----------+-------+------------+--------------------------------------------------------+
| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE | 0.2       | 0.7   | true       | https://console.cloud.google.com/vertex-ai/            |
|                  |                           |           |       |            | model-monitoring/locations/us-central1/model-monitors/ |
|                  |                           |           |       |            | bq1111222233334444555/model-monitoring-jobs/           |
|                  |                           |           |       |            | bqjob1234512345123451234/feature-drift?project=        |
|                  |                           |           |       |            | myproject&featureName=dropoff_latitude                 |
+------------------+---------------------------+-----------+-------+------------+--------------------------------------------------------+
| payment_type     | L_INFTY                   | 0.3       | 0.2   | false      | https://console.cloud.google.com/vertex-ai/            |
|                  |                           |           |       |            | model-monitoring/locations/us-central1/model-monitors/ |
|                  |                           |           |       |            | bq1111222233334444555/model-monitoring-jobs/           |
|                  |                           |           |       |            | bqjob1234512345123451234/feature-drift?project=        |
|                  |                           |           |       |            | myproject&featureName=payment_type                     |
+------------------+---------------------------+-----------+-------+------------+--------------------------------------------------------+

Copying and pasting the visualization link into a browser tab returns results similar to the following for numerical features:

Copying and pasting the visualization link into a browser tab returns results similar to the following for categorical features:

Automate skew detection
The following example shows how to automate skew detection for a linear regression model:
DECLARE anomalies ARRAY<STRING>;

SET anomalies = (
  SELECT ARRAY_AGG(input)
  FROM
    ML.VALIDATE_DATA_SKEW(
      MODEL mydataset.model_linear_reg,
      TABLE mydataset.serving,
      STRUCT(
        0.3 AS categorical_default_threshold,
        0.2 AS numerical_default_threshold,
        'JENSEN_SHANNON_DIVERGENCE' AS numerical_metric_type,
        [STRUCT('fare', 0.15), STRUCT('company', 0.25)] AS thresholds))
  WHERE is_anomaly
);

IF (ARRAY_LENGTH(anomalies) > 0) THEN
  CREATE OR REPLACE MODEL mydataset.model_linear_reg
    TRANSFORM(
      ML.MIN_MAX_SCALER(fare) OVER () AS f1,
      ML.ROBUST_SCALER(pickup_longitude) OVER () AS f2,
      ML.LABEL_ENCODER(company) OVER () AS f3,
      ML.ONE_HOT_ENCODER(payment_type) OVER () AS f4,
      label)
    OPTIONS (model_type = 'linear_reg', max_iterations = 1)
  AS (
    SELECT fare, pickup_longitude, company, payment_type, 2 AS label
    FROM mydataset.new_training_data
  );

  SELECT
    ERROR(
      CONCAT(
        "Found data skew in features: ",
        ARRAY_TO_STRING(anomalies, ", "),
        ". Model is retrained with the latest data."));
ELSE
  SELECT *
  FROM ML.PREDICT(MODEL mydataset.model_linear_reg, TABLE mydataset.serving);
END IF;
Limitations
ML.VALIDATE_DATA_SKEW doesn't support the following types of models:
- AutoML
- Matrix factorization
- ARIMA_PLUS
- Remote models over LLMs, Cloud AI services, or Vertex AI endpoints
- Imported Open Neural Network Exchange (ONNX), TensorFlow, TensorFlow Lite, or XGBoost models
ML.VALIDATE_DATA_SKEW doesn't support models created before March 28, 2024, or models that use the WARM_START option. To enable use of ML.VALIDATE_DATA_SKEW, retrain the model by running the CREATE OR REPLACE MODEL statement.
Running the ML.VALIDATE_DATA_SKEW function on a large amount of input data can cause the query to return the error Dry run query timed out. To resolve the error, disable retrieval of cached results for the query.
ML.VALIDATE_DATA_SKEW doesn't conduct schema validation between the two sets of input data, and so handles data type mismatches as follows:
- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.
However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.
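Because the function doesn't validate schemas, one possible way to reduce the chance of a silent type mismatch is to cast the serving columns explicitly in the query statement. The following is a minimal sketch under stated assumptions: the model and table names are placeholders, and it assumes that the model was trained with fare as FLOAT64 and payment_type as STRING. Neither assumption comes from the function's documentation.

```sql
-- A minimal sketch, assuming hypothetical names and training column types.
SELECT *
FROM
  ML.VALIDATE_DATA_SKEW(
    MODEL `myproject.mydataset.mymodel`,
    (
      SELECT
        -- Cast each serving column to the type assumed for training.
        CAST(fare AS FLOAT64) AS fare,
        CAST(payment_type AS STRING) AS payment_type
      FROM `myproject.mydataset.serving`
    ),
    STRUCT('L_INFTY' AS categorical_metric_type));
```

Only the columns selected in the query statement are compared, so a feature that is left out of the SELECT list doesn't appear in the output.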
Pricing
The ML.VALIDATE_DATA_SKEW function uses BigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.