The ML.TFDV_VALIDATE function

This document describes the ML.TFDV_VALIDATE function, which you can use to compare the statistics for training and serving data, or two sets of serving data, in order to identify anomalous differences between the two datasets. Calling this function provides the same behavior as calling the TensorFlow validate_statistics API. You can use the data output by this function for model monitoring.

Syntax

ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)

Arguments

ML.TFDV_VALIDATE takes the following arguments:

  • base_statistics: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
  • study_statistics: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
  • detection_type: a STRING value that specifies the type of comparison that you want to make. Valid values are as follows:
    • SKEW: returns the data skew, which represents the statistical variation between training and serving data.
    • DRIFT: returns the data drift, which represents the statistical variation between two different sets of serving data.
  • categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
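  • categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
    • L_INFTY: uses L-infinity distance.
    • JENSEN_SHANNON_DIVERGENCE: uses Jensen-Shannon divergence.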
  • numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
  • numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
  • thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
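
For intuition about what a threshold is compared against, the following sketch computes an L-infinity distance by hand for one categorical feature: the largest absolute difference between the share of each category in the baseline and in the study data. The category names and shares are made up for illustration, and the query only demonstrates the definition of the metric; it is not a description of how ML.TFDV_VALIDATE is implemented internally.

```sql
-- Hypothetical normalized category frequencies for a single feature.
WITH base AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS category, 0.5 AS share),
    STRUCT('b', 0.3),
    STRUCT('c', 0.2)
  ])
),
study AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS category, 0.1 AS share),
    STRUCT('b', 0.3),
    STRUCT('c', 0.6)
  ])
)
-- L-infinity distance: the maximum absolute per-category difference.
-- A category missing from one side counts as a share of 0.
SELECT
  MAX(ABS(IFNULL(base.share, 0) - IFNULL(study.share, 0))) AS l_infinity_distance
FROM base
FULL OUTER JOIN study USING (category);
```

With these made-up shares, the largest gap is on categories a and c (roughly 0.4 each), which is above the default threshold of 0.3.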

ML.TFDV_VALIDATE uses positional arguments, so if you specify an optional argument, you must also specify all arguments prior to that argument. For more information on argument types, see Named arguments.
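
As a sketch of the positional rule, the following script overrides only numerical_default_threshold. Because that argument is sixth in the signature, detection_type, categorical_default_threshold, and categorical_metric_type must be supplied as well. The table names and the 0.2 threshold are placeholders chosen for illustration, not recommendations.

```sql
DECLARE base_stats JSON;
DECLARE study_stats JSON;

SET base_stats = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET study_stats = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

SELECT ML.TFDV_VALIDATE(
  base_stats,
  study_stats,
  'SKEW',     -- detection_type: required here only because later arguments follow
  0.3,        -- categorical_default_threshold: restated at its documented default
  'L_INFTY',  -- categorical_metric_type
  0.2         -- numerical_default_threshold: the one value actually being changed
);
```

The trailing numerical_metric_type and thresholds arguments can be left out because nothing after them is specified.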

Output

ML.TFDV_VALIDATE returns a TensorFlow Anomalies protocol buffer in JSON format.
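
Because the result is a JSON value, it can be stored in a variable and inspected with the standard JSON functions. The sketch below is one possible way to do that. It assumes that the returned JSON uses the Anomalies protocol buffer field name anomaly_info as a top-level key; that key name is an assumption and should be checked against the actual output. The table names are placeholders.

```sql
DECLARE anomalies JSON;

SET anomalies = (
  SELECT ML.TFDV_VALIDATE(
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)),
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
    'SKEW'
  )
);

-- Assumed key: anomaly_info (taken from the Anomalies proto definition).
-- JSON_QUERY returns NULL if the key is absent from the output.
SELECT JSON_QUERY(anomalies, '$.anomaly_info') AS per_feature_anomalies;
```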

Examples

The following example returns the skew between training and serving data and also sets custom anomaly detection thresholds for two of the feature columns:

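DECLARE stats1 JSON;
DECLARE stats2 JSON;

-- Compute statistics for the training (baseline) and serving (study) tables.
SET stats1 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)
);
SET stats2 = (
  SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)
);

-- Compare the two sets of statistics, overriding the thresholds for two features.
SELECT ML.TFDV_VALIDATE(
  stats1,
  stats2,
  'SKEW',
  0.3,
  'L_INFTY',
  0.3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('feature1', 0.2), ('feature2', 0.5)]
);

-- Save the stats1 statistics with a timestamp so that later runs can reuse them.
INSERT `myproject.mydataset.serve_stats` (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;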

The following example returns the drift between two sets of serving data:

SELECT ML.TFDV_VALIDATE(
  (SELECT dataset_feature_statistics_list FROM `myproject.mydataset.servingJan24`),
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);

Limitations

The ML.TFDV_VALIDATE function doesn't conduct schema validation.

ML.TFDV_VALIDATE handles type mismatch as follows:

  • If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature isn't included in the final anomaly report.
  • If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.

Pricing

The ML.TFDV_VALIDATE function uses BigQuery on-demand compute pricing.

Last updated 2025-11-24 UTC.