The ML.DETECT_ANOMALIES function
This document describes theML.DETECT_ANOMALIES function, which lets youperform anomaly detection in BigQuery ML.
You can use the following types of models withML.DETECT_ANOMALIES, dependingon the type of input data you want to analyze:
- For time series data, use one of the following models:
- Forindependent and identically distributed random variables (IID)data, use one of the following models:
If you don't want to manage your own times series anomaly detection model, you canuse theAI.DETECT_ANOMALIES function(Preview)with BigQuery ML's built-inTimesFM time series model to perform anomaly detection.
Syntax
# ARIMA_PLUS and ARIMA_PLUS_XREG models:ML.DETECT_ANOMALIES( MODEL `PROJECT_ID.DATASET.MODEL_NAME` [, STRUCT(ANOMALY_PROB_THRESHOLD AS anomaly_prob_threshold)] [, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }])#Autoencoder, k-means, or PCA models:ML.DETECT_ANOMALIES( MODEL `PROJECT_ID.DATASET.MODEL_NAME`, STRUCT(CONTAMINATION AS contamination), { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) })Arguments
ML.DETECT_ANOMALIES takes the following arguments:
PROJECT_ID: the project that contains theresource.DATASET: the dataset that contains theresource.MODEL: the name of the model.TABLE: The name of the table to use to perform anomaly detection.QUERY_STATEMENT: The GoogleSQL query thatgenerates the data to use to perform anomaly detection. For the supported SQLsyntax for theQUERY_STATEMENTclause in GoogleSQL, seeQuery syntax.ANOMALY_PROB_THRESHOLD: aFLOAT64value thatidentifies the custom threshold to use for anomaly detection. The value mustbe in the range[0, 1), with a default value of0.95.The value of the anomaly probabilityat each timestamp is calculated using the actual time series data value andthe values of the predicted time series data and the variance from the modeltraining. The actual time series data value at a specific timestamp isidentified as anomalous if the anomaly probability exceeds the
ANOMALY_PROB_THRESHOLDvalue. TheANOMALY_PROB_THRESHOLDvalue alsodetermines the lower and upper bounds, where a larger threshold valueresults in a larger interval size.CONTAMINATION: aFLOAT64value that identifies theproportion of anomalies in the training dataset that are used to create theautoencoder, k-means, or PCA input models. The value must be in the range[0,0.5].For example, a
CONTAMINATIONvalue of0.1means that 10% of the trainingdata that was used to create the input model is anomalous. TheCONTAMINATIONvalue determines the cutoff threshold of the target metricto become anomalous, and any input data with a target metric greater thanthe cutoff threshold is identified as anomalous. The target metric ismean squared error forautoencoder and PCA models, and the target metric is normalized distancefor k-means models. For more information on normalized distance, seeK-means model output.
Input
The input requirements for theML.DETECT_ANOMALIES function depend upon theinput model type.
Time series model input
Anomaly detection withARIMA_PLUS andARIMA_PLUS_XREG models has thefollowing requirements:
- To detect anomalies in historical time-series data, the
DECOMPOSE_TIME_SERIEStraining option must be set as its default value ofTRUEwhen the inputmodel is created. Neithertable_namenorquery_statementis accepted. - The
anomaly_prob_thresholdvalue must be specified to detect anomalies innew time-series data. - The column names of either the
table_nameinput table or thequery_statementclause must match the column names that are used to createthe input model. - The data types of the
TIME_SERIES_ID_COLcolumns must match the data typesof the columns that are used to create the input model.
For a list of supported data types, seeTIME_SERIES_TIMESTAMP_COLandTIME_SERIES_DATA_COL.
Autoencoder, k-means, or PCA model input
Anomaly detection with autoencoder, k-means, or PCA models has the followingrequirements:
- The column names of the input data from either the
tableor thequery_statementargument must match the column names of the model. Thecolumn data types must be compatible according to BigQueryimplicit coercion rules. - If you used the
TRANSFORMclausein theCREATE MODELstatement that created the model, then only the inputcolumns present in theTRANSFORMclause can appear inquery_statement.
Output
ML.DETECT_ANOMALIES always returns theis_anomaly column that contains theanomaly detection results. Other output columns differ based upon theinput model type and input data table.
Time series model output
ARIMA_PLUS andARIMA_PLUS_XREG model output includes the following columns,followed by the input table columns, if present. Output can include thefollowing:
time_series_id_colortime_series_id_cols: the identifiers of a timeseries. Only present when forecasting multiple time series at once. The columnnames and types are inherited from theTIME_SERIES_ID_COLoptionas specified in the model creation query.time_series_timestamp: aSTRINGvalue that containsthe timestamp column for a time series. Thecolumn name is inherited from theTIME_SERIES_TIMESTAMP_COLoptionas specified in theCREATE MODELstatement. The column has a type ofTIMESTAMP, regardless of theTIME_SERIES_TIMESTAMP_COLinput column data type.time_series_data: aSTRINGvalue that containsthe data column for a time series. The column nameis inherited from theTIME_SERIES_DATA_COLoption as specified intheCREATE MODELstatement. The column has a type ofFLOAT64, regardless of theTIME_SERIES_DATA_COLinput column data type.is_anomaly: aBOOLvalue that indicates whether the value at a specifictimestamp is an anomaly. If theanomaly_probabilityvalue is above theanomaly_prob_thresholdvalue, thenthetime_series_datavalue is out of the range for the lower and upperbounds and theis_anomalyvalue isTRUE.lower_bound: aFLOAT64value that contains the lower bound of theprediction result.upper_bound: aFLOAT64value that that contains the upper bound of theprediction result.anomaly_probability: aFLOAT64value that contains the probability thatthis point is an anomaly. For example, ananomaly_probabilityvalue of0.95means that, among all possible valuesat the given timestamp, there is a 95% chance that the value is closer to thepredicted value than it is to the given time series data value. This indicatesa 95% probability that the given time series data value is an anomaly.
ML.DETECT_ANOMALIES output for time series models has thefollowing properties:
- The function returns
NULLvalues in theis_anomaly,upper_bound,lower_boundandanomaly_probabilitycolumns for rows with invalid input,which include the following cases:
Autoencoder and PCA model output
Autoencoder and PCA model output includes the following columns, followed bythe input table columns:
is_anomaly: aBOOLvalue that indicates whether the value is anomalous.mean_squared_error: aFLOAT64value that contains themean squared error.
K-means model output
K-means model output includes the following, followed by the input tablecolumns:
is_anomaly: aBOOLvalue that indicates whether the value is anomalous.normalized_distance: aFLOAT64value that contains theshortest distance among the normalized distances from the input datato each cluster centroid. Normalized distances are computed as theabsolute distance from the input data to a cluster centroid, divided bythe cluster's radius. The cluster radius is defined as the root meansquare of all of the distances from each cluster's assigned data points toits centroid. Normalized distance is used in favor of absolute distance todetermine anomalies because anomalies might not be detected as effectivelyusing absolute distances, since they don't account for cluster radius.The distance type is determined by theDISTANCE_TYPEvalue specified during model training.centroid_id: anINT64value that contains the centroid ID.
Examples
The following examples show how to useML.DETECT_ANOMALIES with differentinput models and settings.
ARIMA_PLUS model without specified settings
The following example detects anomalies using anARIMA_PLUS model that has theDECOMPOSE_TIME_SERIEStraining option set to its default value ofTRUE, without specifying theanomaly_prob_threshold argument.
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`)
If the time series input column names arets_timestamp andts_data, thenthis query returns results similar to the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+| ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability |+-------------------------+----------+------------+-------------+-------------+---------------------+| 2021-01-01 00:00:01 UTC | 125.3 | FALSE | 123.5 | 139.1 | 0.93 || 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 128.5 | 143.1 | 0.96 |+-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS model with a customanomaly_prob_threshold value
The following example detects anomalies using anARIMA_PLUS model that has theDECOMPOSE_TIME_SERIEStraining option set to its default value ofTRUE, using a customanomaly_prob_threshold value of0.8:
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.8ASanomaly_prob_threshold))
If the time series input column names arets_timestamp andts_data, thenthis query returns results similar to the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+| ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability |+-------------------------+----------+------------+-------------+-------------+---------------------+| 2021-01-01 00:00:01 UTC | 125.3 | TRUE | 129.5 | 133.6 | 0.93 || 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 131.5 | 136.6 | 0.96 |+-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS model with input data as a query statement
The following example detects anomalies using anARIMA_PLUS model, using acustomanomaly_prob_threshold value of0.9 and passing an input data tableinto the query:
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.9ASanomaly_prob_threshold),(SELECTstate,city,date,temperature,weatherFROM`mydataset.my_time_series_data_table`))
This example uses the following column values:
TIME_SERIES_ID_COLisstate,city.TIME_SERIES_TIMESTAMP_COLisdate.TIME_SERIES_DATA_COListemperature.
This example returns results similar to the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" || "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
ARIMA_PLUS model with input data as a table
The following example detects anomalies using anARIMA_PLUS model, usinga customanomaly_prob_threshold value of0.9 and passing an input data tableinto the query:
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_arima_plus_model`,STRUCT(0.9ASanomaly_prob_threshold),TABLE`mydataset.my_time_series_data_table`)
If theTIME_SERIES_ID_COL column names arestate,city, andTIME_SERIES_TIMESTAMP_COL, and theTIME_SERIES_DATA_COL column names aredate andtemperature, and one additional columnweather is in the inputdata table, then this query returns results similar to the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+| "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" || "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" |+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
ARIMA_PLUS_XREG model with a customanomaly_prob_threshold value
The following example detects anomalies using anARIMA_PLUS_XREG model thatuses a customanomaly_prob_threshold value of0.6:
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.myarima_plus_xreg_model`,STRUCT(0.6ASanomaly_prob_threshold))ORDERBYdateASC;
If the time series input column names aredate andtemperature, thenthis query returns results similar to the following:
+-------------------------+-------------+------------+---------------------+---------------------+----------------------+| date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability |+-------------------------+-------------+------------+---------------------+---------------------+----------------------+| 2009-08-11 00:00:00 UTC | 70.1 | false | 67.65879917809896 | 72.541200821901029 | 0.0 || 2009-08-12 00:00:00 UTC | 73.4 | false | 71.714971312549849 | 76.597372956351919 | 0.20573021642489953 || 2009-08-13 00:00:00 UTC | 64.6 | true | 67.7428898975034 | 72.625291541305472 | 0.94632610424009034 |+-------------------------+-------------+------------+---------------------+---------------------+----------------------+
Autoencoder model
The following example detects anomalies using an autoencoder model and acontamination value of0.1.
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_autoencoder_model`,STRUCT(0.1AScontamination),TABLE`mydataset.mytable`)
If the feature column names aref1 andf2, then this query returns resultssimilar to the following:
+------------+--------------------+---------+--------+| is_anomaly | mean_squared_error | f1 | f2 |+------------+--------------------+---------+--------+| FALSE | 0.63456 | 120 | "a" || TRUE | 11.342 | 15000 | "b" |+------------+--------------------+---------+--------+
K-means model
The following example detects anomalies using a k-means model and acontamination value of0.2.
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_kmeans_model`,STRUCT(0.2AScontamination),(SELECTf1,f2FROM`mydataset.mytable`))
This query returns results similar to the following:
+------------+---------------------+-------------+--------+--------+| is_anomaly | normalized_distance | centroid_id | f1 | f2 |+------------+---------------------+-------------+--------+--------+| FALSE | 0.63456 | 1 | 120 | "a" || TRUE | 6.3243 | 2 | 15000 | "b" |+------------+---------------------+-------------+--------+--------+
PCA model
The following example detects anomalies using a PCA model and acontamination value of0.1.
SELECT*FROMML.DETECT_ANOMALIES(MODEL`mydataset.my_pca_model`,STRUCT(0.1AScontamination),TABLE`mydataset.mytable`)
If the feature column names aref1,f2 andf3, then this query returnsresults similar to the following:
+------------+--------------------+---------+--------+------+| is_anomaly | mean_squared_error | f1 | f2 | f3 |+------------+--------------------+---------+--------+------+| FALSE | 0.63456 | 120 | "a" | 0.9 || TRUE | 11.342 | 15000 | "b" | 25 |+------------+--------------------+---------+--------+------+
Pricing
All queries that use theML.DETECT_ANOMALIES function are billable, regardlessof the pricing model.
What's next
- For more information about model inference, seeModel inference overview.
For more information about supported SQL statements and functions for modelsthat work with
ML.DETECT_ANOMALIES, see the following documents:
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.