The ML.VALIDATE_DATA_SKEW function

This document describes theML.VALIDATE_DATA_SKEW function, which you can useto compute the data skew between a model's training and serving data. Thisfunction computes the statistics for the serving data, compares them to thestatistics that were computed for the training data at the time the model wascreated, and identifies where there are anomalous differences between the twodata sets.

You can optionally visualize the function output by usingVertex AI model monitoring.For more information, seeMonitoring visualization.

Statistics are only computed for feature columns in the serving data that matchfeature columns in the training data, in order to achieve better performance andlower cost. For models that were created with use of theTRANSFORM clause,the statistics are based on the raw feature data before feature preprocessingwithin theTRANSFORM clause.

Syntax

ML.VALIDATE_DATA_SKEW(MODEL`PROJECT_ID.DATASET.MODEL_NAME`,{TABLE`PROJECT_ID.DATASET.TABLE_NAME`|(QUERY_STATEMENT)},STRUCT([CATEGORICAL_DEFAULT_THRESHOLDAScategorical_default_threshold][,CATEGORICAL_METRIC_TYPEAScategorical_metric_type][,NUMERICAL_DEFAULT_THRESHOLDASnumerical_default_threshold][,NUMERICAL_METRIC_TYPEASnumerical_metric_type][,THRESHOLDSASthresholds][,ENABLE_VISUALIZATION_LINKASenable_visualization_link]))

Arguments

ML.VALIDATE_DATA_SKEW takes the following arguments:

  • PROJECT_ID: the BigQuery project thatcontains the resource.
  • DATASET: the BigQuery dataset thatcontains the resource.
  • MODEL_NAME: the name of the model.
  • TABLE_NAME: the name of the input table that containsthe serving data to calculate statistics for.
  • QUERY_STATEMENT: a query that generates the servingdata to calculate statistics for. For the supported SQL syntax of theQUERY_STATEMENT clause, seeGoogleSQL querysyntax.
  • CATEGORICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for categoricalandARRAY<categorical> features. The value must be in the range[0, 1).The default value is0.3.
  • CATEGORICAL_METRIC_TYPE: aSTRING value thatspecifies the metric used to compare statistics for categorical andARRAY<categorical> features. Valid values are as follows:
  • NUMERICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> features. The valuemust be in the range[0, 1). The default value is0.3.
  • NUMERICAL_METRIC_TYPE: aSTRING value that specifiesthe metric used to compare statistics for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> features. The only valid value isJENSEN_SHANNON_DIVERGENCE.
  • THRESHOLDS: anARRAY<STRUCT<STRING, FLOAT64>> valuethat specifies the anomaly detection thresholds for one or more columnsfor which you don't want to use the default threshold. TheSTRING value inthe struct specifies the column name, and theFLOAT64 value specifies thethreshold. TheFLOAT64 value must be in the range[0, 1). For example,[('col_a', 0.1), ('col_b', 0.8)].
  • ENABLE_VISUALIZATION_LINK: aBOOL value thatdetermines whether to return links to the visualized function output. When youspecifyTRUE for this argument, theML.VALIDATE_DATA_DRIFT output includesthevisualization_link column. Thevisualization_link column provides URLsthat link to visualizations of the function results inVertex AI monitoring.

    When you specifyTRUE for this argument, themodel argument value mustrefer to a BigQuery ML model that isregistered withVertex AI. If the model isn't registered, an invalid queryerror is returned.

Output

ML.VALIDATE_DATA_SKEW returns one row for each column in the input data.ML.VALIDATE_DATA_SKEW output contains the following columns:

  • input: aSTRING column that contains the input column name.
  • metric: aSTRING column that contains the metric used to compare theinput column statistical value between the training and serving data sets.This column value isJENSEN_SHANNON_DIVERGENCE for numerical features, andeitherL_INFTY orJENSEN_SHANNON_DIVERGENCE for categorical features.
  • threshold: aFLOAT64 column that contains the threshold used to determinewhether the statistical difference in theinput column value between thetraining and serving data is anomalous.
  • value: aFLOAT64 column that contains the statistical difference intheinput column value between the serving and the training data sets.
  • is_anomaly: aBOOL column that indicates whether thevalue value ishigher than thethreshold value.
  • visualization_link: a URL thatlinks to a Vertex AI visualization of the results for the givenfeature. The URL is formatted as follows:

    https://console.cloud.google.com/vertex-ai/model-monitoring/locations/region/model-monitors/vertex_model_monitor_id/model-monitoring-jobs/vertex_model_monitoring_job_id/feature-drift?project=project_id&featureName=feature_name

    For example:

    https://console.cloud.google.com/vertex-ai/model-monitoring/locations/us-central1/model-monitors/bq123456789012345647/model-monitoring-jobs/bqjob890123456789012/feature-drift?project=myproject&featureName=petal_length

    This column is only returned when theenable_visualization_link argumentvalue isTRUE.

    For more information, seeMonitoring visualization.

Examples

The following examples demonstrate how to use theML.VALIDATE_DATA_SKEWfunction.

RunML.VALIDATE_DATA_SKEW

The following example computes data skew between the serving data and thetraining data used to create the model, with a categorical feature thresholdof0.2:

SELECT*FROMML.VALIDATE_DATA_SKEW(MODEL`myproject.mydataset.mymodel`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold));

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+| input            | metric                   | threshold |  value | is_anomaly |+------------------+--------------------------+-----------+--------+------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       |+------------------+--------------------------+-----------+--------+------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      |+------------------+--------------------------+-----------+--------+------------+

RunML.VALIDATE_DATA_SKEW and visualize the results

The following example computes data skew between the serving data and thetraining data used to create the model, with a categorical feature thresholdof0.2:

SELECT*FROMML.VALIDATE_DATA_SKEW(MODEL`myproject.mydataset.mymodel`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold,TRUEASenable_visualization_link));

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| input            | metric                   | threshold |  value | is_anomaly | visualization_link                                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=dropoff_latitude                 |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=payment_type                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for numerical features:

A graph that visualizes the monitoring output for the `ML.VALIDATE_DATA_SKEW` function

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for categorical features:

A bar chart that visualizes the monitoring output for the `ML.VALIDATE_DATA_SKEW` function

Automate skew detection

The following example shows how to automate skew detection for alinear regression model:

DECLAREanomaliesARRAY<STRING>;SETanomalies=(SELECTARRAY_AGG(input)FROMML.VALIDATE_DATA_SKEW(MODELmydataset.model_linear_reg,TABLEmydataset.serving,STRUCT(0.3AScategorical_default_threshold,0.2ASnumerical_default_threshold,'JENSEN_SHANNON_DIVERGENCE'ASnumerical_metric_type,[STRUCT('fare',0.15),STRUCT('company',0.25)]ASthresholds))WHEREis_anomaly);IF(ARRAY_LENGTH(anomalies)>0)THENCREATEORREPLACEMODELmydataset.model_linear_regTRANSFORM(ML.MIN_MAX_SCALER(fare)OVER()ASf1,ML.ROBUST_SCALER(pickup_longitude)OVER()ASf2,ML.LABEL_ENCODER(company)OVER()ASf3,ML.ONE_HOT_ENCODER(payment_type)OVER()ASf4,label)OPTIONS(model_type='linear_reg',max_iterations=1)AS(SELECTfare,pickup_longitude,company,payment_type,2ASlabelFROMmydataset.new_training_data);SELECTERROR(CONCAT("Found data skew in features: ",ARRAY_TO_STRING(anomalies,", "),". Model is retrained with the latest data."));ELSESELECT*FROMML.PREDICT(MODELmydataset.model_linear_reg,TABLEmydataset.serving);ENDIF;

Limitations

Pricing

TheML.VALIDATE_DATA_SKEW function usesBigQuery on-demand compute pricing.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.