The ML.TFDV_DESCRIBE function

This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow tfdv.generate_statistics_from_csv API. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets]))

Arguments

ML.TFDV_DESCRIBE takes the following arguments:

  • PROJECT_ID: your project ID.
  • DATASET: the BigQuery dataset that contains the table.
  • TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
  • QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
  • NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
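
As a rough sketch of how these arguments combine, the following query passes a QUERY_STATEMENT instead of a table and sets several options at once. The selected column names are assumptions about the schema of the penguins public table, and the bucket counts are arbitrary values chosen within the documented ranges.

```sql
-- Sketch: compute statistics for a subset of columns produced by a query.
-- The column names below are assumed, not confirmed by this page.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  STRUCT(
    20 AS num_histogram_buckets,           -- equal-width buckets, range [1, 1,000]
    4 AS num_quantiles_histogram_buckets,  -- quantile buckets, range [1, 1,000]
    10 AS num_rank_histogram_buckets       -- rank buckets, range [1, 10,000]
  )
);
```

Options that are omitted, such as num_values_histogram_buckets here, fall back to their default values.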

Output

ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.
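
One possible way to work with this output is to split it into one row per feature. The sketch below assumes that the JSON mirrors the protocol buffer's field names, with a top-level datasets array whose first element holds a features array. That layout is not confirmed by this page, so inspect the raw output first.

```sql
-- Sketch: return one row per feature from the statistics JSON.
-- Assumes the JSON paths `datasets` and `features` match the
-- DatasetFeatureStatisticsList protocol buffer fields.
SELECT feature
FROM
  ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(20 AS num_rank_histogram_buckets)
  ) AS stats,
  UNNEST(
    JSON_QUERY_ARRAY(
      stats.dataset_feature_statistics_list,
      '$.datasets[0].features'
    )
  ) AS feature;
```

Each returned feature value is itself JSON, so individual statistics can be extracted from it with further JSON functions once the field names are known.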

Example

The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:

SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets));

Limitations

Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types:

  • Numeric types
  • STRING
  • BOOL
  • BYTES
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTES
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTES
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
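
When a table contains columns of other types, one workaround is to pass a QUERY_STATEMENT that drops or converts them first. The following is a minimal sketch under stated assumptions: mydataset.mytable, location, and payload are hypothetical names, with location assumed to be a GEOGRAPHY column and payload a JSON column.

```sql
-- Sketch: convert columns of unsupported types before computing statistics.
-- `mydataset.mytable`, `location`, and `payload` are hypothetical names.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT
      * EXCEPT (location, payload),
      ST_ASTEXT(location) AS location_wkt,     -- GEOGRAPHY to STRING
      TO_JSON_STRING(payload) AS payload_text  -- JSON to STRING
    FROM `mydataset.mytable`
  ),
  STRUCT(10 AS num_histogram_buckets)
);
```

Converted columns are then summarized as STRING values, which may or may not be meaningful for a given column. Dropping the column with EXCEPT alone is the simpler choice when its statistics are not needed.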

Pricing

The ML.TFDV_DESCRIBE function uses BigQuery on-demand compute pricing.

Last updated 2025-12-15 UTC.