The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow tfdv.generate_statistics_from_csv API. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.
Syntax
ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
)
Arguments
ML.TFDV_DESCRIBE takes the following arguments:
- PROJECT_ID: your project ID.
- DATASET: the BigQuery dataset that contains the table.
- TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
- QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
- NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
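As a sketch of how these options combine, the following query sets all four bucket counts explicitly. It reads the penguins public table that the example in this document also uses. The bucket values are illustrative choices within the documented ranges, not recommendations.

```sql
-- Illustrative values only; each one lies inside its documented range.
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(
    20 AS num_histogram_buckets,            -- equal-width histogram, range [1, 1,000]
    4 AS num_quantiles_histogram_buckets,   -- quantiles histogram, range [1, 1,000]
    10 AS num_values_histogram_buckets,     -- ARRAY columns only, range [1, 1,000]
    100 AS num_rank_histogram_buckets       -- categorical columns, range [1, 10,000]
  )
);
```

Any option left out of the STRUCT falls back to its default value (10, 10, 10, and 50 respectively).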
Output
ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.
Example
The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets));
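The syntax also accepts a query statement in place of a table, which lets you compute statistics over a subset of rows or columns. The following is a minimal sketch. It assumes that the penguins table has columns named species and body_mass_g, which this document does not state.

```sql
-- Assumption: species and body_mass_g are columns of the penguins table.
-- The parenthesized SELECT takes the place of the TABLE argument.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
    WHERE body_mass_g IS NOT NULL
  ),
  STRUCT(
    5 AS num_histogram_buckets,
    20 AS num_rank_histogram_buckets
  )
);
```

Filtering in the query statement keeps the statistics focused on the rows that you plan to use for training or serving.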
Limitations
Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types:
- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
- STRUCT columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTE
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
- ARRAY columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTE
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
Pricing
The ML.TFDV_DESCRIBE function uses BigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.
Last updated 2025-12-15 UTC.