The ML.DESCRIBE_DATA function
This document describes theML.DESCRIBE_DATA function, which you can use togenerate descriptive statistics for the columns in a table or subquery. Forexample, you might want to know statistics for a table of training or servingdata that you plan to use with a machine learning (ML) model. You can use thedata output by this function for such purposes asfeature preprocessing ormodelmonitoring.
Syntax
ML.DESCRIBE_DATA({TABLE`PROJECT_ID.DATASET.TABLE_NAME`|(QUERY_STATEMENT)},STRUCT([NUM_QUANTILESASnum_quantiles][,NUM_ARRAY_LENGTH_QUANTILESASnum_array_length_quantiles][,TOP_KAStop_k]))
Arguments
ML.DESCRIBE_DATA takes the following arguments:
PROJECT_ID: your project ID.DATASET: the BigQuery dataset that containsthe table.TABLE_NAME: the name of the input table that containsthe training or serving data to calculate statistics for.QUERY_STATEMENT: a query that generates the trainingor serving data to calculate statistics for. For the supported SQL syntax oftheQUERY_STATEMENTclause, seeGoogleSQL querysyntax.NUM_QUANTILES: anINT64value that specifies thenumber ofquantiles to return fornumerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>columns.This affects the number of results shown in thequantilesoutput column.These quantiles describe the distribution of the data in the column. Specify alower value for coarser-grained distribution information and a higher valuefor finer-grained distribution information. TheNUM_QUANTILESvalue must bein the range[1, 100,000]. The default value is2.NUM_ARRAY_LENGTH_QUANTILES: anINT64value thatspecifies the number of quantiles to return forARRAYcolumns. This affectsthe number of results shown in thearray_length_quantilesoutput column.These quantiles describe the distribution of the length of the arrays in thecolumn. Specify a lower value for coarser-grained distribution information anda higher value for finer-grained distribution information. TheNUM_ARRAY_LENGTH_QUANTILESvalue must be in the range[1, 100,000]. Thedefault value is10.TOP_K: anINT64value that specifies the number oftop values to return for categorical andARRAY<categorical>columns. Thisaffects the number of results shown in thetop_valuesoutput column. The topvalues are the values that are shown most frequently in the column. TheTOP_Kvalue must be in the range[1, 10,000]. The default value is1.
Details
ML.DESCRIBE_DATA handles input columns as follows:
ARRAYcolumns are unnested before statistics are computed on them.ARRAY<STRUCT<INT64, numerical>>. TheINT64value is the index, and thenumerical value is the value. For statistics computation,BigQuery ML treats columns of this type asARRAY<numerical>based on the value. The value of the dimension column in the output isMAX(index) + 1.STRUCTfields are expanded, and then categorical columns are cast toSTRINGand numerical columns are cast toFLOAT64.- Columns of the following data types arecast to
STRINGand return the same statistics asSTRINGcolumns:BOOLBYTEDATEDATETIMETIMETIMESTAMPColumns of the following data types arecast toFLOAT64and return the same statistics asFLOAT64columns:INT64NUMERICBIGNUMERIC
Output
ML.DESCRIBE_DATA returns one row for each column in the input data.ML.DESCRIBE_DATA output contains the following columns:
name: aSTRINGcolumn that contains the name of the input column.num_rows: anINT64column that contains the total number of rows for theinput column.num_nulls: anINT64column that returns the number ofNULLvalues foundin the column.num_zeros: anINT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the number of0values found in the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
min: aSTRINGcolumn that contains theMINvalue for the column.max: aSTRINGcolumn that contains theMAXvalue for the column.mean: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns themean value calculated for thecolumn. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
stdev: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns thestandard deviationvalue calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
median: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns themedian value calculated for thecolumn. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
quantiles: anARRAY<FLOAT64>column that contains information about thequantiles in an input column, ascomputed by theAPPROX_QUANTILESfunction. Thequantilescolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns thequantiles computed for thecolumn. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
unique: anINT64column that contains information about the number ofunique values in an input column, as computed by theAPPROX_COUNT_DISTINCTfunction. Theuniquecolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns thenumber of unique values in the input column.
- For numerical,
avg_string_length: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns theaverage length of the values in the column.
- For numerical,
num_values: anINT64column that contains the number of array elements forARRAYcolumns, and the number of values in the column for other types ofcolumns.top_values: aARRAY<STRUCT<STRING, INT64>>column that containsinformation about the top values and number of occurrences in an input column,as computed by theAPPROX_TOP_COUNTfunction. Thetop_valuescolumn contains the following fields:top_values.value: aSTRINGfield that contains one of the followingvalues:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns,returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns one ofthe top values in the input column.
- For numerical,
top_values.count: anINT64field that contains one of the followingvalues:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns thenumber of times the related top value appears.
- For numerical,
min_array_length: anINT64column that contains one of the followingvalues:- For
ARRAYinput columns, returns the minimum length of anarray in the column. - For other types of input columns, returns
NULL.
- For
max_array_length: anINT64column that contains one of the followingvalues:- For
ARRAYinput columns, returns the maximum length of anarray in the column. - For other types of input columns, returns
NULL.
- For
avg_array_length: aFLOAT64column that contains one of the followingvalues:- For
ARRAYinput columns, returns the average length of anarray in the column. - For other types of input columns, returns
NULL.
- For
total_array_length: anINT64column that contains one of the followingvalues:- For
ARRAYinput columns, returns the sum of the size of the arrays inthe column. - For other types of input columns, returns
NULL.
- For
array_length_quantiles: anARRAY<INT64>column that contains theinformation about the quantiles for the array length in an input column, ascomputed by theAPPROX_QUANTILESfunction. Thearray_length_quantilescolumn contains one of the followingvalues:- For
ARRAYinput columns, returns the quantiles for the array lengthcomputed for the column. - For other types of input columns, returns
0.
- For
dimension: anINT64column that contains one of the following:- For
ARRAY<STRUCT<INT64, numerical>>input columns, returns thedimension computed for the column, which isMAX(index) + 1forsparse input. - For other types of input columns, returns
NULL.
- For
Example
The following example returns statistics for a table with five quantilescalculated for numeric columns and three top values returned for non-numericcolumns:
SELECT*FROMML.DESCRIBE_DATA(TABLE`myproject.mydataset.mytable`,STRUCT(5ASnum_quantiles,3AStop_k));
Limitations
Input data for theML.DESCRIBE_DATA function can only contain columns of thefollowing data types:
- Numerictypes
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMPARRAY<STRUCT<INT64, FLOAT64>>(a sparse tensor)STRUCTcolumns that contain any of the following types:- Numeric types
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMP
ARRAYcolumns that contain any of the following types:- Numeric types
STRINGBOOLBYTEDATEDATETIMETIMETIMESTAMP
Pricing
TheML.DESCRIBE_DATA function usesBigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, seeModel monitoring overview.
- For more information about supported SQL statements and functions for MLmodels, seeEnd-to-end user journeys for ML models.
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.