The ML.DETECT_ANOMALIES function

This document describes theML.DETECT_ANOMALIES function, which lets youperform anomaly detection in BigQuery ML.

You can use the following types of models withML.DETECT_ANOMALIES, dependingon the type of input data you want to analyze:

If you don't want to manage your own times series anomaly detection model, you canuse theAI.DETECT_ANOMALIES function(Preview)with BigQuery ML's built-inTimesFM time series model to perform anomaly detection.

Syntax

# ARIMA_PLUS and ARIMA_PLUS_XREG models:ML.DETECT_ANOMALIES(  MODEL `PROJECT_ID.DATASET.MODEL_NAME`  [, STRUCT(ANOMALY_PROB_THRESHOLD AS anomaly_prob_threshold)]  [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }])#Autoencoder, k-means, or PCA models:ML.DETECT_ANOMALIES(  MODEL `PROJECT_ID.DATASET.MODEL_NAME`,  STRUCT(CONTAMINATION AS contamination),  { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) })

Arguments

ML.DETECT_ANOMALIES takes the following arguments:

  • PROJECT_ID: the project that contains theresource.
  • DATASET: the dataset that contains theresource.
  • MODEL: the name of the model.
  • TABLE: The name of the table to use to perform anomaly detection.
  • QUERY_STATEMENT: The GoogleSQL query thatgenerates the data to use to perform anomaly detection. For the supported SQLsyntax for theQUERY_STATEMENT clause in GoogleSQL, seeQuery syntax.
  • ANOMALY_PROB_THRESHOLD: aFLOAT64 value thatidentifies the custom threshold to use for anomaly detection. The value mustbe in the range[0, 1), with a default value of0.95.

    The value of the anomaly probabilityat each timestamp is calculated using the actual time series data value andthe values of the predicted time series data and the variance from the modeltraining. The actual time series data value at a specific timestamp isidentified as anomalous if the anomaly probability exceeds theANOMALY_PROB_THRESHOLD value. TheANOMALY_PROB_THRESHOLD value alsodetermines the lower and upper bounds, where a larger threshold valueresults in a larger interval size.

  • CONTAMINATION: aFLOAT64 value that identifies theproportion of anomalies in the training dataset that are used to create theautoencoder, k-means, or PCA input models. The value must be in the range[0,0.5].

    For example, aCONTAMINATION value of0.1 means that 10% of the trainingdata that was used to create the input model is anomalous. TheCONTAMINATION value determines the cutoff threshold of the target metricto become anomalous, and any input data with a target metric greater thanthe cutoff threshold is identified as anomalous. The target metric ismean squared error forautoencoder and PCA models, and the target metric is normalized distancefor k-means models. For more information on normalized distance, seeK-means model output.

Input

The input requirements for theML.DETECT_ANOMALIES function depend upon theinput model type.

Time series model input

Anomaly detection withARIMA_PLUS andARIMA_PLUS_XREG models has thefollowing requirements:

  • To detect anomalies in historical time-series data, theDECOMPOSE_TIME_SERIEStraining option must be set as its default value ofTRUE when the inputmodel is created. Neithertable_name norquery_statement is accepted.
  • Theanomaly_prob_threshold value must be specified to detect anomalies innew time-series data.
  • The column names of either thetable_name input table or thequery_statement clause must match the column names that are used to createthe input model.
  • The data types of theTIME_SERIES_ID_COL columns must match the data typesof the columns that are used to create the input model.

For a list of supported data types, seeTIME_SERIES_TIMESTAMP_COLandTIME_SERIES_DATA_COL.

Autoencoder, k-means, or PCA model input

Anomaly detection with autoencoder, k-means, or PCA models has the followingrequirements:

  • The column names of the input data from either thetable or thequery_statement argument must match the column names of the model. Thecolumn data types must be compatible according to BigQueryimplicit coercion rules.
  • If you used theTRANSFORM clausein theCREATE MODEL statement that created the model, then only the inputcolumns present in theTRANSFORM clause can appear inquery_statement.

Output

ML.DETECT_ANOMALIES always returns theis_anomaly column that contains theanomaly detection results. Other output columns differ based upon theinput model type and input data table.

Time series model output

ARIMA_PLUS andARIMA_PLUS_XREG model output includes the following columns,followed by the input table columns, if present. Output can include thefollowing:

  • time_series_id_col ortime_series_id_cols: the identifiers of a timeseries. Only present when forecasting multiple time series at once. The columnnames and types are inherited from theTIME_SERIES_ID_COL optionas specified in the model creation query.
  • time_series_timestamp: aSTRING value that containsthe timestamp column for a time series. Thecolumn name is inherited from theTIME_SERIES_TIMESTAMP_COL optionas specified in theCREATE MODEL statement. The column has a type ofTIMESTAMP, regardless of theTIME_SERIES_TIMESTAMP_COLinput column data type.
  • time_series_data: aSTRING value that containsthe data column for a time series. The column nameis inherited from theTIME_SERIES_DATA_COL option as specified intheCREATE MODEL statement. The column has a type ofFLOAT64, regardless of theTIME_SERIES_DATA_COLinput column data type.
  • is_anomaly: aBOOL value that indicates whether the value at a specifictimestamp is an anomaly. If theanomaly_probability value is above theanomaly_prob_threshold value, thenthetime_series_data value is out of the range for the lower and upperbounds and theis_anomaly value isTRUE.
  • lower_bound: aFLOAT64 value that contains the lower bound of theprediction result.
  • upper_bound: aFLOAT64 value that that contains the upper bound of theprediction result.
  • anomaly_probability: aFLOAT64 value that contains the probability thatthis point is an anomaly. For example, ananomaly_probability value of0.95 means that, among all possible valuesat the given timestamp, there is a 95% chance that the value is closer to thepredicted value than it is to the given time series data value. This indicatesa 95% probability that the given time series data value is an anomaly.

ML.DETECT_ANOMALIES output for time series models has thefollowing properties:

  • The function returnsNULL values in theis_anomaly,upper_bound,lower_bound andanomaly_probability columns for rows with invalid input,which include the following cases:
    • The value in theTIME_SERIES_ID_COL column does not exist in the model.
    • The value in theTIME_SERIES_TIMESTAMP_COL column is not in the range ofthe forecasthorizon.
    • The value in theTIME_SERIES_TIMESTAMP_COL column does not follow the samefrequencyas the one in the model.

Autoencoder and PCA model output

Autoencoder and PCA model output includes the following columns, followed bythe input table columns:

  • is_anomaly: aBOOL value that indicates whether the value is anomalous.
  • mean_squared_error: aFLOAT64 value that contains themean squared error.

K-means model output

K-means model output includes the following, followed by the input tablecolumns:

  • is_anomaly: aBOOL value that indicates whether the value is anomalous.
  • normalized_distance: aFLOAT64 value that contains theshortest distance among the normalized distances from the input datato each cluster centroid. Normalized distances are computed as theabsolute distance from the input data to a cluster centroid, divided bythe cluster's radius. The cluster radius is defined as the root meansquare of all of the distances from each cluster's assigned data points toits centroid. Normalized distance is used in favor of absolute distance todetermine anomalies because anomalies might not be detected as effectivelyusing absolute distances, since they don't account for cluster radius.The distance type is determined by theDISTANCE_TYPEvalue specified during model training.
  • centroid_id: anINT64 value that contains the centroid ID.

Examples

The following examples show how to useML.DETECT_ANOMALIES with differentinput models and settings.

ARIMA_PLUS model without specified settings

The following example detects anomalies using anARIMA_PLUS model that has theDECOMPOSE_TIME_SERIEStraining option set to its default value ofTRUE, without specifying theanomaly_prob_threshold argument.

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`)

If the time series input column names arets_timestamp andts_data, thenthis query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |+-------------------------+----------+------------+-------------+-------------+---------------------+| 2021-01-01 00:00:01 UTC |  125.3   |   FALSE    |  123.5      |  139.1      |  0.93               || 2021-01-02 00:00:01 UTC |  145.3   |   TRUE     |  128.5      |  143.1      |  0.96               |+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with a customanomaly_prob_threshold value

The following example detects anomalies using anARIMA_PLUS model that has theDECOMPOSE_TIME_SERIEStraining option set to its default value ofTRUE, using a customanomaly_prob_threshold value of0.8:

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.8ASanomaly_prob_threshold))

If the time series input column names arets_timestamp andts_data, thenthis query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |+-------------------------+----------+------------+-------------+-------------+---------------------+| 2021-01-01 00:00:01 UTC |  125.3   |    TRUE    |  129.5      |  133.6      |  0.93               || 2021-01-02 00:00:01 UTC |  145.3   |    TRUE    |  131.5      |  136.6      |  0.96               |+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with input data as a query statement

The following example detects anomalies using anARIMA_PLUS model, using acustomanomaly_prob_threshold value of0.9 and passing an input data tableinto the query:

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.9ASanomaly_prob_threshold),(SELECTstate,city,date,temperature,weatherFROM`mydataset.my_time_series_data_table`))

This example uses the following column values:

  • TIME_SERIES_ID_COL isstate,city.
  • TIME_SERIES_TIMESTAMP_COL isdate.
  • TIME_SERIES_DATA_COL istemperature.

This example returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" || "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

ARIMA_PLUS model with input data as a table

The following example detects anomalies using anARIMA_PLUS model, usinga customanomaly_prob_threshold value of0.9 and passing an input data tableinto the query:

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.9ASanomaly_prob_threshold),TABLE`mydataset.my_time_series_data_table`)

If theTIME_SERIES_ID_COL column names arestate,city, andTIME_SERIES_TIMESTAMP_COL, and theTIME_SERIES_DATA_COL column names aredate andtemperature, and one additional columnweather is in the inputdata table, then this query returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" || "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

ARIMA_PLUS_XREG model with a customanomaly_prob_threshold value

The following example detects anomalies using anARIMA_PLUS_XREG model thatuses a customanomaly_prob_threshold value of0.6:

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.myarima_plus_xreg_model`,STRUCT(0.6ASanomaly_prob_threshold))ORDERBYdateASC;

If the time series input column names aredate andtemperature, thenthis query returns results similar to the following:

+-------------------------+-------------+------------+---------------------+---------------------+----------------------+|      date               | temperature | is_anomaly | lower_bound         | upper_bound         | anomaly_probability  |+-------------------------+-------------+------------+---------------------+---------------------+----------------------+| 2009-08-11 00:00:00 UTC |  70.1       |    false   |  67.65879917809896  |  72.541200821901029 |  0.0                 || 2009-08-12 00:00:00 UTC |  73.4       |    false   |  71.714971312549849 |  76.597372956351919 |  0.20573021642489953 || 2009-08-13 00:00:00 UTC |  64.6       |    true    |  67.7428898975034   |  72.625291541305472 |  0.94632610424009034 |+-------------------------+-------------+------------+---------------------+---------------------+----------------------+

Autoencoder model

The following example detects anomalies using an autoencoder model and acontamination value of0.1.

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_autoencoder_model`,STRUCT(0.1AScontamination),TABLE`mydataset.mytable`)

If the feature column names aref1 andf2, then this query returns resultssimilar to the following:

+------------+--------------------+---------+--------+| is_anomaly | mean_squared_error |    f1   |   f2   |+------------+--------------------+---------+--------+|   FALSE    |     0.63456        |   120   |  "a"   ||   TRUE     |     11.342         |  15000  |  "b"   |+------------+--------------------+---------+--------+

K-means model

The following example detects anomalies using a k-means model and acontamination value of0.2.

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_kmeans_model`,STRUCT(0.2AScontamination),(SELECTf1,f2FROM`mydataset.mytable`))

This query returns results similar to the following:

+------------+---------------------+-------------+--------+--------+| is_anomaly | normalized_distance | centroid_id |   f1   |   f2   |+------------+---------------------+-------------+--------+--------+|   FALSE    |     0.63456         |     1       |  120   |  "a"   ||   TRUE     |     6.3243          |     2       | 15000  |  "b"   |+------------+---------------------+-------------+--------+--------+

PCA model

The following example detects anomalies using a PCA model and acontamination value of0.1.

SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_pca_model`,STRUCT(0.1AScontamination),TABLE`mydataset.mytable`)

If the feature column names aref1,f2 andf3, then this query returnsresults similar to the following:

+------------+--------------------+---------+--------+------+| is_anomaly | mean_squared_error |    f1   |   f2   |  f3  |+------------+--------------------+---------+--------+------+|   FALSE    |     0.63456        |   120   |  "a"   |  0.9 ||   TRUE     |     11.342         |  15000  |  "b"   |  25  |+------------+--------------------+---------+--------+------+

Pricing

All queries that use theML.DETECT_ANOMALIES function are billable, regardlessof the pricing model.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.