The ML.VALIDATE_DATA_DRIFT function

This document describes theML.VALIDATE_DATA_DRIFT function, which you can useto compute the data drift between two sets of serving data. Thisfunction computes and compares the statistics for the two data sets, and thenidentifies where there are anomalous differences between the twodata sets.

For example, you might want to compare the current servingdata to historical serving data from atable snapshot, or to the featuresserved at a particular point in time, which you can get by using theML.FEATURES_AT_TIME function.

You can optionally visualize the function output by usingVertex AI model monitoring.For more information, seeMonitoring visualization.

Syntax

ML.VALIDATE_DATA_DRIFT({TABLE`PROJECT_ID.DATASET.BASE_TABLE_NAME`|(BASE_QUERY_STATEMENT)},{TABLE`PROJECT_ID.DATASET.STUDY_TABLE_NAME`|(STUDY_QUERY_STATEMENT)},STRUCT([NUM_HISTOGRAM_BUCKETSASnum_histogram_buckets][,NUM_QUANTILES_HISTOGRAM_BUCKETSASnum_quantiles_histogram_buckets][,NUM_VALUES_HISTOGRAM_BUCKETSASnum_values_histogram_buckets,][,NUM_RANK_HISTOGRAM_BUCKETSASnum_rank_histogram_buckets][,CATEGORICAL_DEFAULT_THRESHOLDAScategorical_default_threshold][,CATEGORICAL_METRIC_TYPEAScategorical_metric_type][,NUMERICAL_DEFAULT_THRESHOLDASnumerical_default_threshold][,NUMERICAL_METRIC_TYPEASnumerical_metric_type][,THRESHOLDSASthresholds])[,MODEL`PROJECT_ID.DATASET.MODEL_NAME`])

Arguments

ML.VALIDATE_DATA_DRIFT takes the following arguments:

  • PROJECT_ID: the BigQuery project thatcontains the resource.
  • DATASET: the BigQuery dataset that containsthe resource.
  • BASE_TABLE_NAME: the name of the input table ofserving data that you want to use as the baseline for comparison.
  • BASE_QUERY_STATEMENT: a query that generates theserving data that you want to use as the baseline for comparison. For thesupported SQL syntax of theBASE_QUERY_STATEMENT clause, seeGoogleSQL querysyntax.
  • STUDY_TABLE_NAME: the name of the input table thatcontains the serving data that you want to compare to the baseline.
  • STUDY_QUERY_STATEMENT: a query that generates theserving data that you want to compare to the baseline. For the supported SQLsyntax of theSTUDY_QUERY_STATEMENT clause, seeGoogleSQL querysyntax.
  • NUM_HISTOGRAM_BUCKETS: anINT64 value that specifiesthe number of buckets to use for a histogram with equal-width buckets. Onlyapplies to numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64,numerical>> columns. TheNUM_HISTOGRAM_BUCKETS value must be in the range[1, 1,000]. The default value is10.
  • NUM_QUANTILES_HISTOGRAM_BUCKETS: anINT64 value thatspecifies the number of buckets to use for aquantiles histogram. Only applies tonumerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> columns.TheNUM_QUANTILES_HISTOGRAM_BUCKETS value must be in the range[1, 1,000].The default value is10.
  • NUM_VALUES_HISTOGRAM_BUCKETS: anINT64 value thatspecifies the number of buckets to use for a quantiles histogram. Only appliestoARRAY columns. TheNUM_VALUES_HISTOGRAM_BUCKETS value must be in therange[1, 1,000]. The default value is10.
  • NUM_RANK_HISTOGRAM_BUCKETS: anINT64 value thatspecifies the number of buckets to use for arank histogram. Onlyapplies to categorical andARRAY<categorical> columns. TheNUM_RANK_HISTOGRAM_BUCKETS value must be in the range[1, 10,000]. Thedefault value is50.
  • CATEGORICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for categoricalandARRAY<categorical> features. The value must be in the range[0, 1).The default value is0.3.
  • CATEGORICAL_METRIC_TYPE: aSTRING value thatspecifies the metric used to compare statistics for categorical andARRAY<categorical> features. Valid values are as follows:
  • NUMERICAL_DEFAULT_THRESHOLD: aFLOAT64 value thatspecifies the custom threshold to use for anomaly detection for numericalfeatures. The value must be in the range[0, 1). The default value is0.3.
  • NUMERICAL_METRIC_TYPE: aSTRING value that specifiesthe metric used to compare statistics for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> features. The only valid value isJENSEN_SHANNON_DIVERGENCE.
  • THRESHOLDS: anARRAY<STRUCT<STRING, FLOAT64>> valuethat specifies the anomaly detection thresholds for one or more columns forwhich you don't want to use the default threshold. TheSTRING value in thestruct specifies the column name, and theFLOAT64 value specifies thethreshold. TheFLOAT64 value must be in the range[0,1). For example,[('col_a', 0.1), ('col_b', 0.8)].
  • MODEL:The name of a BigQuery ML model that isregistered withVertex AI. When you specify this argument, theML.VALIDATE_DATA_DRIFT output includes thevisualization_link column.Thevisualization_link column provides URLsthat link to visualizations of the function results inVertex AI model monitoring.

Output

ML.VALIDATE_DATA_DRIFT returns one row for each column in the input data.ML.VALIDATE_DATA_DRIFT output contains the following columns:

  • input: aSTRING column that contains the input column name.
  • metric: aSTRING column that contains the metric used to compare theinput column statistical value between the two data sets. This column valueisJENSEN_SHANNON_DIVERGENCE for numerical features, and eitherL_INFTY orJENSEN_SHANNON_DIVERGENCE for categorical features.
  • threshold: aFLOAT64 column that contains the threshold used to determinewhether the statistical difference in theinput column value between the twodata sets is anomalous.
  • value: aFLOAT64 column that contains the statistical difference in theinput column value between the two data sets.
  • is_anomaly: aBOOL column that indicates whether thevalue value ishigher than thethreshold value.
  • visualization_link: a URL that links to a Vertex AIvisualization of the results for the given feature. The URL is formattedas follows:

    https://console.cloud.google.com/vertex-ai/model-monitoring/locations/region/model-monitors/vertex_model_monitor_id/model-monitoring-jobs/vertex_model_monitoring_job_id/feature-drift?project=project_id&featureName=feature_name

    For example:

    https://console.cloud.google.com/vertex-ai/model-monitoring/locations/europe-west4/model-monitors/bq123456789012345647/model-monitoring-jobs/bqjob890123456789012/feature-drift?project=myproject&featureName=units_produced

    This column is only returned when you provide a value for theMODELargument.

    For more information, seeMonitoring visualization.

Examples

The following examples show how to use theML.VALIDATE_DATA_DRIFT function.

Compute data drift

The following example computes data drift between a snapshot of theserving data table and the current serving data table,with a categorical feature threshold of0.2:

SELECT*FROMML.VALIDATE_DATA_DRIFT(TABLE`myproject.mydataset.previous_serving_data`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold));

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+| input            | metric                   | threshold |  value | is_anomaly |+------------------+--------------------------+-----------+--------+------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       |+------------------+--------------------------+-----------+--------+------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      |+------------------+--------------------------+-----------+--------+------------+

Compute data drift and visualize

The following example computes data drift between a snapshot of theserving data table and the current serving data table,with a categorical feature threshold of0.2:

SELECT*FROMML.VALIDATE_DATA_DRIFT(TABLE`myproject.mydataset.previous_serving_data`,TABLE`myproject.mydataset.serving`,STRUCT(0.2AScategorical_default_threshold),MODEL`myproject.mydataset.registered_model`);

The output looks similar to the following:

+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| input            | metric                   | threshold |  value | is_anomaly | visualization_link                                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| dropoff_latitude | JENSEN_SHANNON_DIVERGENCE| 0.2       | 0.7    | true       | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=dropoff_latitude                 |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+| payment_type     | L_INTFY                  | 0.3       | 0.2    | false      | https://console.cloud.google.com/vertex-ai/            ||                  |                          |           |        |            | model-monitoring/locations/us-central1/model-monitors/ ||                  |                          |           |        |            | bq1111222233334444555/model-monitoring-jobs/           ||                  |                          |           |        |            | bqjob1234512345123451234/feature-drift?project=        ||                  |                          |           |        |            | myproject&featureName=payment_type                     |+------------------+--------------------------+-----------+--------+------------+--------------------------------------------------------+

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for numerical features:

A graph that visualizes the monitoring output for the `ML.VALIDATE_DATA_DRIFT` function

Copying and pasting the visualization link into a browser tab returns resultssimilar to the following for categorical features:

A bar chart that visualizes the monitoring output for the `ML.VALIDATE_DATA_DRIFT function

Limitations

  • Running theML.VALIDATE_DATA_DRIFT function on a large amount of input datacan cause the query toreturn the errorDry run query timed out. To resolve the error,disable retrieval of cached results for the query.

  • ML.VALIDATE_DATA_DRIFT doesn't conduct schema validation between the twosets of input data, and so handles data type mismatches as follows:

    • If you specifyJENSEN_SHANNON_DIVERGENCE for thecategorical_default_threshold ornumerical_default_thresholdargument,the feature isn't included in the final anomaly report.
    • If you specifyL_INFTY for thecategorical_default_thresholdargument, the function outputs the computed feature distance as expected.

However, when you run inference on the serving data, theML.PREDICT functionhandles schema validation.

Pricing

TheML.VALIDATE_DATA_DRIFT function usesBigQuery on-demand compute pricing.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.