The ML.DESCRIBE_DATA function

This document describes theML.DESCRIBE_DATA function, which you can use togenerate descriptive statistics for the columns in a table or subquery. Forexample, you might want to know statistics for a table of training or servingdata that you plan to use with a machine learning (ML) model. You can use thedata output by this function for such purposes asfeature preprocessing ormodelmonitoring.

Syntax

ML.DESCRIBE_DATA({TABLE`PROJECT_ID.DATASET.TABLE_NAME`|(QUERY_STATEMENT)},STRUCT([NUM_QUANTILESASnum_quantiles][,NUM_ARRAY_LENGTH_QUANTILESASnum_array_length_quantiles][,TOP_KAStop_k]))

Arguments

ML.DESCRIBE_DATA takes the following arguments:

  • PROJECT_ID: your project ID.
  • DATASET: the BigQuery dataset that containsthe table.
  • TABLE_NAME: the name of the input table that containsthe training or serving data to calculate statistics for.
  • QUERY_STATEMENT: a query that generates the trainingor serving data to calculate statistics for. For the supported SQL syntax oftheQUERY_STATEMENT clause, seeGoogleSQL querysyntax.
  • NUM_QUANTILES: anINT64 value that specifies thenumber ofquantiles to return fornumerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>> columns.This affects the number of results shown in thequantiles output column.These quantiles describe the distribution of the data in the column. Specify alower value for coarser-grained distribution information and a higher valuefor finer-grained distribution information. TheNUM_QUANTILES value must bein the range[1, 100,000]. The default value is2.
  • NUM_ARRAY_LENGTH_QUANTILES: anINT64 value thatspecifies the number of quantiles to return forARRAY columns. This affectsthe number of results shown in thearray_length_quantiles output column.These quantiles describe the distribution of the length of the arrays in thecolumn. Specify a lower value for coarser-grained distribution information anda higher value for finer-grained distribution information. TheNUM_ARRAY_LENGTH_QUANTILES value must be in the range[1, 100,000]. Thedefault value is10.
  • TOP_K: anINT64 value that specifies the number oftop values to return for categorical andARRAY<categorical> columns. Thisaffects the number of results shown in thetop_values output column. The topvalues are the values that are shown most frequently in the column. TheTOP_K value must be in the range[1, 10,000]. The default value is1.

Details

ML.DESCRIBE_DATA handles input columns as follows:

  • ARRAY columns are unnested before statistics are computed on them.
  • ARRAY<STRUCT<INT64, numerical>>. TheINT64 value is the index, and thenumerical value is the value. For statistics computation,BigQuery ML treats columns of this type asARRAY<numerical>based on the value. The value of the dimension column in the output isMAX(index) + 1.
  • STRUCT fields are expanded, and then categorical columns are cast toSTRING and numerical columns are cast toFLOAT64.
  • Columns of the following data types arecast toSTRING and return the same statistics asSTRING columns:
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMPColumns of the following data types arecast toFLOAT64 and return the same statistics asFLOAT64 columns:
    • INT64
    • NUMERIC
    • BIGNUMERIC

Output

ML.DESCRIBE_DATA returns one row for each column in the input data.ML.DESCRIBE_DATA output contains the following columns:

  • name: aSTRING column that contains the name of the input column.
  • num_rows: anINT64 column that contains the total number of rows for theinput column.
  • num_nulls: anINT64 column that returns the number ofNULL values foundin the column.
  • num_zeros: anINT64 column that contains one of the following:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the number of0 values found in the column.
    • For categorical orARRAY<categorical> input columns, returnsNULL.
  • min: aSTRING column that contains theMINvalue for the column.
  • max: aSTRING column that contains theMAXvalue for the column.
  • mean: aFLOAT64 column that contains one of the following:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns themean value calculated for thecolumn.
    • For categorical orARRAY<categorical> input columns, returnsNULL.
  • stdev: aFLOAT64 column that contains one of the following:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns thestandard deviationvalue calculated for the column.
    • For categorical orARRAY<categorical> input columns, returnsNULL.
  • median: aFLOAT64 column that contains one of the following:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns themedian value calculated for thecolumn.
    • For categorical orARRAY<categorical> input columns, returnsNULL.
  • quantiles: anARRAY<FLOAT64> column that contains information about thequantiles in an input column, ascomputed by theAPPROX_QUANTILESfunction. Thequantiles column contains one of the following values:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns thequantiles computed for thecolumn.
    • For categorical orARRAY<categorical> input columns, returnsNULL.
  • unique: anINT64 column that contains information about the number ofunique values in an input column, as computed by theAPPROX_COUNT_DISTINCTfunction. Theunique column contains one of the following values:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
    • For categorical orARRAY<categorical> input columns, returns thenumber of unique values in the input column.
  • avg_string_length: aFLOAT64 column that contains one of the following:
    • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
    • For categorical orARRAY<categorical> input columns, returns theaverage length of the values in the column.
  • num_values: anINT64 column that contains the number of array elements forARRAY columns, and the number of values in the column for other types ofcolumns.
  • top_values: aARRAY<STRUCT<STRING, INT64>> column that containsinformation about the top values and number of occurrences in an input column,as computed by theAPPROX_TOP_COUNTfunction. Thetop_values column contains the following fields:
    • top_values.value: aSTRING field that contains one of the followingvalues:
      • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns,returnsNULL.
      • For categorical orARRAY<categorical> input columns, returns one ofthe top values in the input column.
    • top_values.count: anINT64 field that contains one of the followingvalues:
      • For numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL.
      • For categorical orARRAY<categorical> input columns, returns thenumber of times the related top value appears.
  • min_array_length: anINT64 column that contains one of the followingvalues:
    • ForARRAY input columns, returns the minimum length of anarray in the column.
    • For other types of input columns, returnsNULL.
  • max_array_length: anINT64 column that contains one of the followingvalues:
    • ForARRAY input columns, returns the maximum length of anarray in the column.
    • For other types of input columns, returnsNULL.
  • avg_array_length: aFLOAT64 column that contains one of the followingvalues:
    • ForARRAY input columns, returns the average length of anarray in the column.
    • For other types of input columns, returnsNULL.
  • total_array_length: anINT64 column that contains one of the followingvalues:
    • ForARRAY input columns, returns the sum of the size of the arrays inthe column.
    • For other types of input columns, returnsNULL.
  • array_length_quantiles: anARRAY<INT64> column that contains theinformation about the quantiles for the array length in an input column, ascomputed by theAPPROX_QUANTILESfunction. Thearray_length_quantiles column contains one of the followingvalues:
    • ForARRAY input columns, returns the quantiles for the array lengthcomputed for the column.
    • For other types of input columns, returns0.
  • dimension: anINT64 column that contains one of the following:
    • ForARRAY<STRUCT<INT64, numerical>> input columns, returns thedimension computed for the column, which isMAX(index) + 1forsparse input.
    • For other types of input columns, returnsNULL.

Example

The following example returns statistics for a table with five quantilescalculated for numeric columns and three top values returned for non-numericcolumns:

SELECT*FROMML.DESCRIBE_DATA(TABLE`myproject.mydataset.mytable`,STRUCT(5ASnum_quantiles,3AStop_k));

Limitations

Input data for theML.DESCRIBE_DATA function can only contain columns of thefollowing data types:

  • Numerictypes
  • STRING
  • BOOL
  • BYTE
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP

Pricing

TheML.DESCRIBE_DATA function usesBigQuery on-demand compute pricing.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.