The ML.TFDV_VALIDATE function
This document describes the ML.TFDV_VALIDATE function, which you can use to compare the statistics for training and serving data, or two sets of serving data, in order to identify anomalous differences between the two data sets. Calling this function provides the same behavior as calling the TensorFlow validate_statistics API. You can use the data output by this function for model monitoring.
Syntax
ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)
Arguments
ML.TFDV_VALIDATE takes the following arguments:
- base_statistics: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- study_statistics: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- detection_type: a STRING value that specifies the type of comparison that you want to make. Valid values are as follows:
  - SKEW: returns the data skew, which represents the statistical variation between training and serving data.
  - DRIFT: returns the data drift, which represents the statistical variation between two different sets of serving data.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen–Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
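To make the two metric types and the threshold arguments more concrete, the following is a minimal Python sketch, not the function's actual implementation. It assumes that each categorical feature is summarized as a normalized value-frequency distribution, that Jensen–Shannon divergence is computed with base-2 logarithms (so it falls in [0, 1]), and that a feature is flagged when its distance exceeds its threshold. The helper names and the sample counts are hypothetical.

```python
import math


def normalize(counts):
    """Turn raw value counts into a probability distribution."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all values."""
    values = set(p) | set(q)
    return max(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence: 0 for identical, 1 for disjoint."""
    values = set(p) | set(q)
    mid = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in values}

    def kl_to_mid(dist):
        # Terms with zero probability contribute nothing to the sum.
        return sum(
            prob * math.log2(prob / mid[v])
            for v, prob in dist.items()
            if prob > 0.0
        )

    return 0.5 * kl_to_mid(p) + 0.5 * kl_to_mid(q)


def threshold_for(feature, default_threshold, overrides):
    """Per-column override from `thresholds`, else the default threshold."""
    return dict(overrides).get(feature, default_threshold)


# Hypothetical value counts for one categorical feature.
base = normalize({"red": 50, "green": 30, "blue": 20})
study = normalize({"red": 10, "green": 30, "blue": 60})

distance = l_infinity_distance(base, study)
overrides = [("col_a", 0.1), ("col_b", 0.8)]

# Assumption: a feature is anomalous when distance > threshold.
for feature in ("col_a", "col_b", "col_c"):
    limit = threshold_for(feature, 0.3, overrides)
    print(feature, round(distance, 3), "anomaly" if distance > limit else "ok")
```

In this sketch the same distance is judged against three different limits: the overrides for col_a and col_b, and the default of 0.3 for col_c, which has no entry in the overrides list.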
ML.TFDV_VALIDATE uses positional arguments, so if you specify an optional argument, you must also specify all arguments prior to that argument. For more information on argument types, see Named arguments.
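The positional rule can be sketched as a small Python helper that assembles the call text and rejects gaps in the optional arguments. The helper names (build_validate_call, to_sql_literal) are hypothetical and are not part of any BigQuery client library; only the argument order is taken from the syntax shown earlier.

```python
# Optional arguments in the order given by the function's syntax.
OPTIONAL_ARGS = [
    "detection_type",
    "categorical_default_threshold",
    "categorical_metric_type",
    "numerical_default_threshold",
    "numerical_metric_type",
    "thresholds",
]


def to_sql_literal(value):
    """Render a Python value as SQL text (strings, numbers, threshold pairs)."""
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    if isinstance(value, list):
        pairs = ", ".join(
            f"({to_sql_literal(name)}, {threshold})" for name, threshold in value
        )
        return f"[{pairs}]"
    return repr(value)


def build_validate_call(base_expr, study_expr, **optional):
    """Build the call text, requiring every argument before a given one."""
    unknown = set(optional) - set(OPTIONAL_ARGS)
    if unknown:
        raise TypeError(f"unknown arguments: {sorted(unknown)}")

    args = [base_expr, study_expr]
    first_missing = None
    for name in OPTIONAL_ARGS:
        if name in optional:
            if first_missing is not None:
                # A later argument was given while an earlier one was skipped.
                raise ValueError(f"{name} also requires {first_missing}")
            args.append(to_sql_literal(optional[name]))
        elif first_missing is None:
            first_missing = name
    return "ML.TFDV_VALIDATE(" + ", ".join(args) + ")"


print(build_validate_call("stats1", "stats2", detection_type="DRIFT"))
```

Calling the helper with only categorical_default_threshold raises ValueError, because detection_type comes first in the argument order and was not supplied.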
Output
ML.TFDV_VALIDATE returns a TensorFlow Anomalies protocol buffer in JSON format.
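As a rough illustration of consuming the result outside of BigQuery, the Python sketch below lists the features flagged in an Anomalies JSON document. The sample payload is invented, and the field names (anomaly_info or anomalyInfo, severity, short_description or shortDescription) are assumptions based on the TensorFlow Anomalies protocol buffer. Check them against the JSON that your query actually returns.

```python
import json


def flagged_features(anomalies_json):
    """Return (feature, severity, summary) tuples from an Anomalies JSON string."""
    doc = json.loads(anomalies_json)
    # Assumption: the per-feature map may use either naming style.
    info = doc.get("anomaly_info") or doc.get("anomalyInfo") or {}
    rows = []
    for feature, details in sorted(info.items()):
        summary = (
            details.get("short_description")
            or details.get("shortDescription", "")
        )
        rows.append((feature, details.get("severity", "UNKNOWN"), summary))
    return rows


# Hypothetical payload for illustration, not real function output.
sample = json.dumps({
    "anomaly_info": {
        "feature1": {
            "severity": "ERROR",
            "short_description": "High Linfty distance between training and serving",
        }
    }
})

for feature, severity, summary in flagged_features(sample):
    print(f"{feature}: {severity} ({summary})")
```

An empty or anomaly-free document yields an empty list, which a monitoring job could treat as a passing check.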
Examples
The following example returns the skew between training and serving data and also sets custom anomaly detection thresholds for two of the feature columns:

-- Compute statistics for the training (baseline) and serving (study) tables.
DECLARE stats1 JSON;
DECLARE stats2 JSON;

SET stats1 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET stats2 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

-- Compare the two sets of statistics, overriding the thresholds
-- for feature1 and feature2.
SELECT ML.TFDV_VALIDATE(
  stats1,
  stats2,
  'SKEW',
  0.3,
  'L_INFTY',
  0.3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('feature1', 0.2), ('feature2', 0.5)]
);

-- Save stats1 with the current timestamp.
INSERT `myproject.mydataset.serve_stats` (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;
The following example returns the drift between two sets of serving data:
SELECT ML.TFDV_VALIDATE(
  (SELECT dataset_feature_statistics_list FROM `myproject.mydataset.servingJan24`),
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);
Limitations
The ML.TFDV_VALIDATE function doesn't conduct schema validation.
ML.TFDV_VALIDATE handles type mismatch as follows:
- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.
Pricing
The ML.TFDV_VALIDATE function uses BigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-24 UTC.