The ML.VALIDATE_DATA_DRIFT function
This document describes theML.VALIDATE_DATA_DRIFT function, which you can useto compute the data drift between two sets of serving data. Thisfunction computes and compares the statistics for the two data sets, and thenidentifies where there are anomalous differences between the twodata sets.
For example, you might want to compare the current servingdata to historical serving data from atable snapshot, or to the featuresserved at a particular point in time, which you can get by using theML.FEATURES_AT_TIME function.
You can optionally visualize the function output by usingVertex AI model monitoring.For more information, seeMonitoring visualization.
Syntax
ML.VALIDATE_DATA_DRIFT({TABLE`PROJECT_ID.DATASET.BASE_TABLE_NAME`|(BASE_QUERY_STATEMENT)},{TABLE`PROJECT_ID.DATASET.STUDY_TABLE_NAME`|(STUDY_QUERY_STATEMENT)},STRUCT([NUM_HISTOGRAM_BUCKETSASnum_histogram_buckets][,NUM_QUANTILES_HISTOGRAM_BUCKETSASnum_quantiles_histogram_buckets][,NUM_VALUES_HISTOGRAM_BUCKETSASnum_values_histogram_buckets,][,NUM_RANK_HISTOGRAM_BUCKETSASnum_rank_histogram_buckets][,CATEGORICAL_DEFAULT_THRESHOLDAScategorical_default_threshold][,CATEGORICAL_METRIC_TYPEAScategorical_metric_type][,NUMERICAL_DEFAULT_THRESHOLDASnumerical_default_threshold][,NUMERICAL_METRIC_TYPEASnumerical_metric_type][,THRESHOLDSASthresholds])[,MODEL`PROJECT_ID.DATASET.MODEL_NAME`])
Arguments
ML.VALIDATE_DATA_DRIFT takes the following arguments:
PROJECT_ID: the BigQuery project thatcontains the resource.DATASET: the BigQuery dataset that containsthe resource.BASE_TABLE_NAME: the name of the input table ofserving data that you want to use as the baseline for comparison.BASE_QUERY_STATEMENT: a query that generates theserving data that you want to use as the baseline for comparison. For thesupported SQL syntax of theBASE_QUERY_STATEMENTclause, seeGoogleSQL querysyntax.STUDY_TABLE_NAME: the name of the input table thatcontains the serving data that you want to compare to the baseline.STUDY_QUERY_STATEMENT: a query that generates theserving data that you want to compare to the baseline. For the supported SQLsyntax of theSTUDY_QUERY_STATEMENTclause, seeGoogleSQL querysyntax.NUM_HISTOGRAM_BUCKETS: anINT64value that specifiesthe number of buckets to use for a histogram with equal-width buckets. Onlyapplies to numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64,numerical>>columns. TheNUM_HISTOGRAM_BUCKETSvalue must be in the range[1, 1,000]. The default value is10.NUM_QUANTILES_HISTOGRAM_BUCKETS: anINT64value thatspecifies the number of buckets to use for aquantiles histogram. Only applies tonumerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>columns.TheNUM_QUANTILES_HISTOGRAM_BUCKETSvalue must be in the range[1, 1,000].The default value is10.NUM_VALUES_HISTOGRAM_BUCKETS: anINT64value thatspecifies the number of buckets to use for a quantiles histogram. Only appliestoARRAYcolumns. TheNUM_VALUES_HISTOGRAM_BUCKETSvalue must be in therange[1, 1,000]. The default value is10.NUM_RANK_HISTOGRAM_BUCKETS: anINT64value thatspecifies the number of buckets to use for arank histogram. Onlyapplies to categorical andARRAY<categorical>columns. TheNUM_RANK_HISTOGRAM_BUCKETSvalue must be in the range[1, 10,000]. Thedefault value is50.CATEGORICAL_DEFAULT_THRESHOLD: aFLOAT64value thatspecifies the custom threshold to use for anomaly detection for categoricalandARRAY<categorical>features. The value must be in the range[0, 1).The default value is0.3.CATEGORICAL_METRIC_TYPE: aSTRINGvalue thatspecifies the metric used to compare statistics for categorical andARRAY<categorical>features. Valid values are as follows:L_INFTY: useL-infinity distance.This value is the default.JENSEN_SHANNON_DIVERGENCE: useJensen–Shannon divergence.
NUMERICAL_DEFAULT_THRESHOLD: aFLOAT64value thatspecifies the custom threshold to use for anomaly detection for numericalfeatures. The value must be in the range[0, 1). The default value is0.3.NUMERICAL_METRIC_TYPE: aSTRINGvalue that specifiesthe metric used to compare statistics for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>features. The only valid value isJENSEN_SHANNON_DIVERGENCE.THRESHOLDS: anARRAY<STRUCT<STRING, FLOAT64>>valuethat specifies the anomaly detection thresholds for one or more columns forwhich you don't want to use the default threshold. TheSTRINGvalue in thestruct specifies the column name, and theFLOAT64value specifies thethreshold. TheFLOAT64value must be in the range[0,1). For example,[('col_a', 0.1), ('col_b', 0.8)].MODEL:The name of a BigQuery ML model that isregistered withVertex AI. When you specify this argument, theML.VALIDATE_DATA_DRIFToutput includes thevisualization_linkcolumn.Thevisualization_linkcolumn provides URLsthat link to visualizations of the function results inVertex AI model monitoring.
Output
ML.VALIDATE_DATA_DRIFT returns one row for each column in the input data.ML.VALIDATE_DATA_DRIFT output contains the following columns:
input: aSTRINGcolumn that contains the input column name.metric: aSTRINGcolumn that contains the metric used to compare theinputcolumn statistical value between the two data sets. This column valueisJENSEN_SHANNON_DIVERGENCEfor numerical features, and eitherL_INFTYorJENSEN_SHANNON_DIVERGENCEfor categorical features.threshold: aFLOAT64column that contains the threshold used to determinewhether the statistical difference in theinputcolumn value between the twodata sets is anomalous.value: aFLOAT64column that contains the statistical difference in theinputcolumn value between the two data sets.is_anomaly: aBOOLcolumn that indicates whether thevaluevalue ishigher than thethresholdvalue.visualization_link: a URL that links to a Vertex AIvisualization of the results for the given feature. The URL is formattedas follows:https://console.cloud.google.com/vertex-ai/model-monitoring/locations/region/model-monitors/vertex_model_monitor_id/model-monitoring-jobs/vertex_model_monitoring_job_id/feature-drift?project=project_id&featureName=feature_name
For example:
https://console.cloud.google.com/vertex-ai/model-monitoring/locations/europe-west4/model-monitors/bq123456789012345647/model-monitoring-jobs/bqjob890123456789012/feature-drift?project=myproject&featureName=units_producedThis column is only returned when you provide a value for the
MODELargument.For more information, seeMonitoring visualization.
Examples
The following examples show how to use theML.VALIDATE_DATA_DRIFT function.
Compute data drift
The following example computes data drift between a snapshot of theserving data table and the current serving data table,with a categorical feature threshold of0.2:
SELECT*FROMML.VALIDATE_DATA_DRIFT(TABLE`myproject.mydataset.previous_serving_data`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold));
The output looks similar to the following:
+------------------+--------------------------+-----------+--------+------------+| input | metric | threshold | value | is_anomaly |+------------------+--------------------------+-----------+--------+------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2 | 0.7 | true |+------------------+--------------------------+-----------+--------+------------+| payment_type | L_INTFY | 0.3 | 0.2 | false |+------------------+--------------------------+-----------+--------+------------+Compute data drift and visualize
The following example computes data drift between a snapshot of theserving data table and the current serving data table,with a categorical feature threshold of0.2:
SELECT*FROMML.VALIDATE_DATA_DRIFT(TABLE`myproject.mydataset.previous_serving_data`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold),MODEL`myproject.mydataset.registered_model`);
The output looks similar to the following:
+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| input | metric | threshold | value | is_anomaly | visualization_link |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2 | 0.7 | true | https://console.cloud.google.com/vertex-ai/ || | | | | | model-monitoring/locations/us-central1/model-monitors/ || | | | | | bq1111222233334444555/model-monitoring-jobs/ || | | | | | bqjob1234512345123451234/feature-drift?project= || | | | | | myproject&featureName=dropoff_latitude |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| payment_type | L_INTFY | 0.3 | 0.2 | false | https://console.cloud.google.com/vertex-ai/ || | | | | | model-monitoring/locations/us-central1/model-monitors/ || | | | | | bq1111222233334444555/model-monitoring-jobs/ || | | | | | bqjob1234512345123451234/feature-drift?project= || | | | | | myproject&featureName=payment_type |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for numerical features:

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for categorical features:

Limitations
Running the
ML.VALIDATE_DATA_DRIFTfunction on a large amount of input datacan cause the query toreturn the errorDry run query timed out. To resolve the error,disable retrieval of cached results for the query.ML.VALIDATE_DATA_DRIFTdoesn't conduct schema validation between the twosets of input data, and so handles data type mismatches as follows:- If you specify
JENSEN_SHANNON_DIVERGENCEfor thecategorical_default_thresholdornumerical_default_thresholdargument,the feature isn't included in the final anomaly report. - If you specify
L_INFTYfor thecategorical_default_thresholdargument, the function outputs the computed feature distance as expected.
- If you specify
However, when you run inference on the serving data, theML.PREDICT functionhandles schema validation.
Pricing
TheML.VALIDATE_DATA_DRIFT function usesBigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, seeModel monitoring overview.
- For more information about supported SQL statements and functions for MLmodels, seeEnd-to-end user journeys for ML models.
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.