The CREATE MODEL statement for ARIMA_PLUS models

This document describes theCREATE MODEL statement for creating univariatetime series models in BigQuery by using SQL. Alternatively, you can use theGoogle Cloud console user interface tocreate a model by using a UI(Preview) instead of constructing the SQLstatement yourself. You can use a univariatetime series model toforecast thefuture value for a given column based on the analysis of historical values forthat column. You can also use univariate time series models to detectanomalies in time series data.

Note: If you don't want to create and manage your own model, you can useBigQuery ML's built-inTimesFM time series modelwith theAI.FORECAST function(Preview) to perform forecasting.

For more information about supported SQL statements and functions for thismodel, seeEnd-to-end user journeys for time series forecasting models.

Time series modeling pipeline

The BigQuery ML time series modeling pipeline includesmultiple modules. TheARIMAmodel is the most computationally expensive module, which is why the model isnamedARIMA_PLUS. For more information, seeARIMA_PLUS: Large-scale, Accurate, Automatic and Interpretable In-Database Time Series Forecasting and Anomaly Detection in Google BigQuery.

SINGLE_TIME_SERIES_DIAGRAM

The modeling pipeline for theARIMA_PLUS time series models performs thefollowing functions:

  • Infer the data frequency of the time series.
  • Handle irregular time intervals.
  • Handle duplicated timestamps by taking the mean value.
  • Interpolate missing data using local linear interpolation.
  • Detect and clean spike and dip outliers.
  • Detect and adjust abrupt step (level) changes.
  • Detect and adjust holiday effect.
  • Detect multiple seasonal patterns within a single time series by usingSeasonal and Trend decomposition using Loess (STL),and extrapolate seasonality by usingdouble exponential smoothing (ETS).
  • Detect and model the trend using the ARIMA model and theauto.ARIMAalgorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens ofcandidate models are trained and evaluated in parallel. The model with thelowestAkaike information criterion (AIC)is selected as the best model.

Large-scale time series

You can forecast up to 100,000,000 time series simultaneously with a single queryby using theTIME_SERIES_ID_COLoption. With this option, different modeling pipelines run in parallel, as longas enoughslots are available. The following diagramshows this process:

MULTIPLE_TIME_SERIES_DIAGRAM

Large-scale time series forecasting best practices

Forecasting many time series simultaneously can lead to long-runningqueries, because query processing isn't completely parallel due to limitedslot capacity. The following best practices can help you avoid long-runningqueries when forecasting many time series simultaneously:

  • When you have a large number (for example, 100,000) of time series toforecast, first forecast a small number of time series (for example, 1,000)to see how long the query takes. You can then estimate how long your entiretime series forecast will take.
  • You can use theAUTO_ARIMA_MAX_ORDER option to balance between queryrun time and forecast accuracy. IncreasingAUTO_ARIMA_MAX_ORDERexpands the hyperparameter search space to try more complex ARIMA models, thatis, ARIMA models with higher non-seasonal p and q. This increases forecastaccuracy but also increases query run time. Decreasing the value ofAUTO_ARIMA_MAX_ORDER decreases forecast accuracy but also decreases queryrun time. For example, if you specify a value of1 instead of using the2 for this option, the query run time is reducedby more than 50%. The forecast accuracy might drop slightly for some of thetime series. If a shorter training time is important to your use case, use asmaller value forAUTO_ARIMA_MAX_ORDER.
  • The model training time for each time series has a linear relationship to itslength, which is based on the number of data points. The longer thetime series, the longer the training takes. However, not all data pointscontribute equally to the model fitting process. Instead, the morerecent the data point is, the more it contributes to the process. Therefore,if you have a long time series, for example ten years of daily data, youdon't need to train a time series model using all of the data points. Themost recent two or three years of data points are enough.
  • You can use theTIME_SERIES_LENGTH_FRACTION,MIN_TIME_SERIES_LENGTH andMAX_TIME_SERIES_LENGTH training options to enable fast model trainingwith little to no loss of forecasting accuracy. The idea behind these optionsis that while periodic modeling, such as seasonality, requires a certainnumber of time points, trend modeling doesn't need many time points. However,trend modeling is much more computationally expensive than other time seriescomponents. By using the aforementioned training options, you can efficientlymodel the trend component with a subset of the time series, while the othertime series components use the entire time series.
  • To avoid a single long-running query, use BigQuerymulti-statement queries.
  • In some cases, your input table might be missing values in the feature columns.For example, in a multivariate time series with features A and B, most of therows contain valid values for both A and B, while some might haveNULLvalues for one or the other of the features. BigQuery ML fillsin the missing values by using the mean value of the entire data column, whichmight include values from different time series. This causes discrepancies inthe results when compared to single time series training, because the inputvalues are affected by other time series. The best practice is to analyze thedata and impute the missing feature values before runningCREATE MODEL.For example, using the mean value of that feature within each time series, orassigning a value of zero.

You can try these best practices by following theScalable forecasting with millions of time series in BigQuery tutorial.

CREATE MODEL syntax

{CREATE MODEL |CREATE MODEL IF NOT EXISTS |CREATE OR REPLACE MODEL}model_nameOPTIONS(model_option_list)AS {query_statement |  (    training_data AS (query_statement),    custom_holiday AS (holiday_statement)  )}model_option_list:MODEL_TYPE = 'ARIMA_PLUS'    [,TIME_SERIES_TIMESTAMP_COL =string_value ]    [,TIME_SERIES_DATA_COL =string_value ]    [,TIME_SERIES_ID_COL = {string_value |string_array } ]    [,HORIZON =int64_value ]    [,AUTO_ARIMA = { TRUE | FALSE } ]    [,AUTO_ARIMA_MAX_ORDER =int64_value ]    [,AUTO_ARIMA_MIN_ORDER =int64_value ]    [,NON_SEASONAL_ORDER =(int64_value, int64_value, int64_value) ]    [,DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]    [,INCLUDE_DRIFT = { TRUE | FALSE } ]    [,HOLIDAY_REGION =string_value |string_array ]    [,CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]    [,ADJUST_STEP_CHANGES = { TRUE | FALSE } ]    [,TIME_SERIES_LENGTH_FRACTION =float64_value ]    [,MIN_TIME_SERIES_LENGTH =int64_value ]    [,MAX_TIME_SERIES_LENGTH =int64_value ]    [,TREND_SMOOTHING_WINDOW_SIZE =int64_value ]    [,DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]    [,FORECAST_LIMIT_LOWER_BOUND =float64_value ]    [,FORECAST_LIMIT_UPPER_BOUND =float64_value ]    [,SEASONALITIES =string_array ]    [,HIERARCHICAL_TIME_SERIES_COLS = {string_array } ]    [,KMS_KEY_NAME =string_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model nameexists,CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in thespecified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name inthe specified dataset.

model_name

The name of the model you're creating or replacing. The modelname must be unique in the dataset: no other model or table can have the samename. The model name must follow the same naming rules as aBigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is case-sensitive.

If you don't have a default project configured, then you must prepend theproject ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE='ARIMA_PLUS'

Description

Specifies the model type. This option is required.

Note: TheARIMA model type is deprecated. While the model trainingpipelines ofARIMA andARIMA_PLUS models are the same,ARIMA_PLUS supportsmore capabilities, including use of theDECOMPOSE_TIME_SERIES optionand the ability to work with theML.ARIMA_EVALUATEandML.EXPLAIN_FORECASTfunctions.

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL =string_value

Description

The name of the column that provides the time points used intraining the model. The column must be of one of the following data types:

  • TIMESTAMP
  • DATE
  • DATETIME

Arguments

ASTRING value.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL =string_value

Description

The name of the column that contains the data to forecast. The column must beof one of the following data types:

  • INT64
  • NUMERIC
  • BIGNUMERIC
  • FLOAT64

Arguments

ASTRING value.

TIME_SERIES_ID_COL

Syntax

TIME_SERIES_ID_COL = {string_value |string_array }

Description

The names of the ID columns. Specify one or more values for this option whenyou want to fit and forecast multiple time series using a single query.Each ID identifies a unique time series. The columns must beof one of the following data types:

  • STRING
  • INT64
  • ARRAY<STRING>
  • ARRAY<INT64>

Arguments

ASTRING orARRAY<STRING> value.

HORIZON

Syntax

HORIZON =int64_value

Description

The number of time points to forecast.

When forecasting multiple time series at once, this parameter applies to eachtime series.

Arguments

AnINT64 value. The default value is1,000. The maximum value is10,000.

AUTO_ARIMA

Syntax

AUTO_ARIMA={TRUE|FALSE}

Description

Determines whether the training process uses auto.ARIMA or not. IfTRUE,training automatically finds the best non-seasonal order (that is, the p, d,q tuple) and decides whether or not to include a linear drift term when d is 1.IfFALSE, you must specify theNON_SEASONAL_ORDER option.

When forecastingmultiple time series at the same time, you must use the auto.ARIMA algorithmfor each time series, so this option must beTRUE.

Arguments

ABOOL value. The default value isTRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER =int64_value

Description

The maximum value for the sum of non-seasonal p and q. This value determinesthe parameter search space in the auto.ARIMA algorithm, in combination with theAUTO_ARIMA_MIN_ORDER value. This option is disabled when theAUTO_ARIMAvalue isFALSE.

Arguments

AnINT64 value between1 and5, inclusive. The default value is2.

If non-seasonal d is determined to be 0 or 2, the number of candidate modelsevaluated for each supported value is as follows:

  • 1: 3 candidate models
  • 2: 6 candidate models
  • 3: 10 candidate models
  • 4: 15 candidate models
  • 5: 21 candidate models

If non-seasonal d is determined to be 1, the number of candidate models toevaluate is doubled, because there's an additional drift term to consider forall of the existing candidate models.

Note: the number of bytes processed by the inputSELECT statement ismultiplied by the number of candidate models, which is controlled bytheAUTO_ARIMA_MAX_ORDER andAUTO_ARIMA_MIN_ORDER options.This affects the pricing. SeeBigQuery ML pricingfor details.

AUTO_ARIMA_MIN_ORDER

Syntax

AUTO_ARIMA_MIN_ORDER =int64_value

Description

The minimum value for the sum of non-seasonal p and q. This value determinesthe parameter search space in the auto.ARIMA algorithm, in combination with theAUTO_ARIMA_MAX_ORDER value. Setting this option to1 or greater lets themodel exclude some flat forecasting results. This option is disabled whenAUTO_ARIMA isFALSE.

Arguments

The value is aINT64. The default value is0.

Note: IfAUTO_ARIMA_MAX_ORDER is set, theAUTO_ARIMA_MIN_ORDER value mustbe less than theAUTO_ARIMA_MAX_ORDER value.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER =(p_value, d_value, q_value)

Description

The tuple of non-seasonal p, d, q for theARIMA_PLUS model. There are nodefault values, and you must specify all three values. p and q must be a valuebetween0 and5, inclusive. d must be a value between0 and2,inclusive.

You can't use this option when forecasting multiple time series atthe same time, because the auto.ARIMA algorithm must be used for each timeseries.

TheAUTO_ARIMA value must beFALSE to use this option.

Arguments

A tuple of threeINT64 values. For example,(1, 2, 1).

DATA_FREQUENCY

Syntax

DATA_FREQUENCY={'AUTO_FREQUENCY'|'PER_MINUTE'|'HOURLY'|'DAILY'|'WEEKLY'|'MONTHLY'|'QUARTERLY'|'YEARLY'}

Description

The data frequency of the input time series. The finest supported granularity isPER_MINUTE.

When forecasting multiple time series at once, this argumentapplies to all individual time series.

Arguments

This option accepts the following values:

  • AUTO_FREQUENCY: This is the default. The training process automaticallyinfers the data frequency, which can be any of the other supported values forthis option.
  • PER_MINUTE
  • HOURLY
  • DAILY
  • WEEKLY
  • MONTHLY
  • QUARTERLY
  • YEARLY

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT={TRUE|FALSE}

Description

Determines whether theARIMA_PLUS model should include a linear drift term ornot. The drift term is applicable when non-seasonal d is 1.

  • When theAUTO_ARIMA value isFALSE , this argument defaults toFALSE.You can set it toTRUE only when non-seasonal d is 1. Otherwise theCREATE MODEL statement returns an invalid query error.
  • When theAUTO_ARIMA value isTRUE, BigQuery MLautomatically determines whether or not to include a linear drift term, soyou can't use this option.

Arguments

ABOOL value. The default value isFALSE.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION =string_value |string_array

Description

The geographical region based on which the holiday effect is applied inmodeling. By default, holiday effect modeling isn't used. To use it,specify one or more holiday regions using this option. If you include more thanone region string, the union of the holidays in all the provided regions aretaken into account when modeling.

Holiday effect modeling is only applicable when the time series is dailyor weekly, and longer than a year. If the input time series doesn't meet theserequirements, holiday effect modeling isn't used even if you specifythis option.

For more information about the holidays included in each region, seeHoliday data.

Arguments

ASTRING orARRAY<STRING> value.

Use a single string value to identify one region. For example:

HOLIDAY_REGION='GLOBAL'

Use an array of string values to identify multiple regions. For example:

HOLIDAY_REGION=['US','GB']

This option accepts the following values:

Global

  • GLOBAL

Continental regions

  • NA: North America
  • JAPAC: Japan and Asia Pacific
  • EMEA: Europe, the Middle East and Africa
  • LAC: Latin America and the Caribbean

Countries

  • AE: United Arab Emirates
  • AR: Argentina
  • AT: Austria
  • AU: Australia
  • BE: Belgium
  • BR: Brazil
  • CA: Canada
  • CH: Switzerland
  • CL: Chile
  • CN: China
  • CO: Colombia
  • CZ: Czechia
  • DE: Germany
  • DK: Denmark
  • DZ: Algeria
  • EC: Ecuador
  • EE: Estonia
  • EG: Egypt
  • ES: Spain
  • FI: Finland
  • FR: France
  • GB: United Kingdom
  • GR: Greece
  • HK: Hong Kong
  • HU: Hungary
  • ID: Indonesia
  • IE: Ireland
  • IL: Israel
  • IN: India
  • IR: Iran
  • IT: Italy
  • JP: Japan
  • KR: South Korea
  • LV: Latvia
  • MA: Morocco
  • MX: Mexico
  • MY: Malaysia
  • NG: Nigeria
  • NL: Netherlands
  • NO: Norway
  • NZ: New Zealand
  • PE: Peru
  • PH: Philippines
  • PK: Pakistan
  • PL: Poland
  • PT: Portugal
  • RO: Romania
  • RS: Serbia
  • RU: Russia
  • SA: Saudi Arabia
  • SE: Sweden
  • SG: Singapore
  • SI: Slovenia
  • SK: Slovakia
  • TH: Thailand
  • TR: Turkey
  • TW: Taiwan
  • UA: Ukraine
  • US: United States
  • VE: Venezuela
  • VN: Vietnam
  • ZA: South Africa

CLEAN_SPIKES_AND_DIPS

Syntax

CLEAN_SPIKES_AND_DIPS={TRUE|FALSE}

Description

Determines whether or not to perform automatic spikes and dips detection andcleanup in theARIMA_PLUS model training pipeline. The spikes and dips arereplaced with local linear interpolated values when they're detected.

Arguments

ABOOL value. The default value isTRUE.

ADJUST_STEP_CHANGES

Syntax

ADJUST_STEP_CHANGES={TRUE|FALSE}

Description

Determines whether or not to perform automatic step change detection andadjustment in theARIMA_PLUS model training pipeline.

Arguments

ABOOL value. The default value isTRUE.

TIME_SERIES_LENGTH_FRACTION

Syntax

TIME_SERIES_LENGTH_FRACTION =float64_value

Description

The fraction of the interpolated length of the time series that's used tomodel the time series trend component. All of the time points of the time seriesare used to model the non-trend component. For example, if the time series has100 time points, then specifying aTIME_SERIES_LENGTH_FRACTION of0.5 usesthe most recent 50 time points for trend modeling. This training optionaccelerates modeling training without sacrificing much forecasting accuracy.

You can use theTIME_SERIES_LENGTH_FRACTION option with theMIN_TIME_SERIES_LENGTH option, but not with theMAX_TIME_SERIES_LENGTHoption.

Arguments

AFLOAT64 value in the range(0, 1). The default behavior is to use allthe points in the time series.

MIN_TIME_SERIES_LENGTH

Syntax

MIN_TIME_SERIES_LENGTH =int64_value

Description

The minimum number of time points that are used in modeling thetrend component of the time series. If you use this option, you must alsospecify a value for theTIME_SERIES_LENGTH_FRACTION option, and you can'tspecify a value for theMAX_TIME_SERIES_LENGTH option.

For example,if you useTIME_SERIES_ID_COL to forecast two time series, one with 100 timepoints and another with 30 time points, then settingTIME_SERIES_LENGTH_FRACTION to0.5 andMIN_TIME_SERIES_LENGTH to20results in the last 50 points of first time series being used for trendmodeling. For the second time series, the last 20 points rather than the last15 points (30 * 0.5) are used in trend modeling becausetheMIN_TIME_SERIES_LENGTH value is20. This option ensures that enoughtime points are available when you useTIME_SERIES_LENGTH_FRACTION in trendmodeling. This is particularly importantwhen forecasting multiple time seriesin a single query using theTIME_SERIES_ID_COL option. If the total number oftime points is less than theMIN_TIME_SERIES_LENGTH value, then the query usesall available time points.

Arguments

AnINT64 value greater than or equal to4. The default value is20.

MAX_TIME_SERIES_LENGTH

Syntax

MAX_TIME_SERIES_LENGTH =int64_value

Description

The maximum number of time points in a time series that can be used in modelingthe trend component of the time series.

For example, if you are forecastingtwo time series simultaneously by specifying theTIME_SERIES_ID_COL option,and one time series has 100 time points while the other one has 50 time points,then by settingMAX_TIME_SERIES_LENGTH to30, both of the time series usethe last 30 time points for trend modeling.

You can't use theMAX_TIME_SERIES_LENGTH with theTIME_SERIES_LENGTH_FRACTION orMIN_TIME_SERIES_LENGTH options.

Arguments

AnINT64 value greater than or equal to4. There is no defaultvalue. We recommend trying30 as a starting value.

TREND_SMOOTHING_WINDOW_SIZE

Syntax

TREND_SMOOTHING_WINDOW_SIZE =int64_value

Description

The smoothing window size for the trend component. When you specify avalue, a center moving average smoothing is applied on the historytrend. When the smoothing window is out of the boundary at the beginning or theend of the trend, the first element or the last element is padded to fillthe smoothing window before the average is applied.

Specifying a value forTREND_SMOOTHING_WINDOW_SIZE doesn't affect forecastingresults. It only affects the smoothness of the trend component, which you cansee by using theML.EXPLAIN_FORECAST function.

Arguments

AnINT64 value. There is no default value. You must specify a positive valueto smooth the trend.

DECOMPOSE_TIME_SERIES

Syntax

DECOMPOSE_TIME_SERIES={TRUE|FALSE}

DescriptionDetermines whether the separate components of both the history and forecastparts of the time series (such as holiday effect and seasonal components)are saved in theARIMA_PLUS model.

Time series decomposition takes place when you create the model. TheML.EXPLAIN_FORECASTfunction retrieves the separate components of both the training and theforecasting data and computes the confidence intervals. Because the decompositionresults are saved in the model, the training data can be partially or fullyrecovered from the decomposition results.

Arguments

ABOOL value. The default value isTRUE.

FORECAST_LIMIT_LOWER_BOUND

Syntax

FORECAST_LIMIT_LOWER_BOUND =float64_value

Description

The lower bound of the forecasting values. When you specify theFORECAST_LIMIT_LOWER_BOUND option, all of the forecast values must be greaterthan the specified value. For example, if you setFORECAST_LIMIT_LOWER_BOUNDto0, then all of the forecast values are larger than0. Also, all valuesless than or equal to theFORECAST_LIMIT_LOWER_BOUND value are excluded frommodelling. The forecasting limitensures that forecasts stay within limits.

If you specify a value for theFORECAST_LIMIT_UPPER_BOUND option,theFORECAST_LIMIT_UPPER_BOUND value must be greater than theFORECAST_LIMIT_LOWER_BOUND value.

Arguments

AFLOAT_64 value greater than or equal to-1.7976931348623157E+308.

FORECAST_LIMIT_UPPER_BOUND

Syntax

FORECAST_LIMIT_UPPER_BOUND =float64_value

Description

The upper bound of the forecasting values. When you specify theFORECAST_LIMIT_UPPER_BOUND option, all of the forecast values must be lessthan the specified value. For example, if you setFORECAST_LIMIT_UPPER_BOUNDto100, then all of the forecast values are less than100. Also, all valuesgreater than or equal to theFORECAST_LIMIT_UPPER_BOUND value are excludedfrom modelling. The forecasting limit ensures thatforecasts stay within limits.

If you specify a value for theFORECAST_LIMIT_LOWER_BOUND option,theFORECAST_LIMIT_LOWER_BOUND value must be less than theFORECAST_LIMIT_UPPER_BOUND value.

Arguments

AFLOAT_64 value less thanor equal to1.7976931348623157E+308.

SEASONALITIES

Syntax

SEASONALITIES =string_array

Description

The seasonality of the time series data refers to the presence of variationsthat occur at certain regular intervals such as weekly, monthly or quarterly.Specifying the seasonality helps the model more accurately learn and predict thecyclic trends in your data. The input seasonality you provide as an argument isignored if it's more granular than the finest seasonal granularity detected inthe time series data. For example, if you input['DAILY', 'WEEKLY', 'MONTHLY']for this option, but input time series contains weekly data, thentheDAILY variable is ignored during the model training.

Arguments

AnARRAY<STRING> value. The following string values are accepted:

  • AUTO: This is the default. The training process automatically infers theseasonalities by data frequency.
  • NO_SEASONALITY: Deactivates automatic seasonality detection.
  • DAILY
  • WEEKLY
  • MONTHLY
  • QUARTERLY
  • YEARLY

You can only use theNO_SEASONALITY orAUTO values by themselves. Forexample,['NO_SEASONALITY', 'DAILY'] isn't a valid value for this option.

HIERARCHICAL_TIME_SERIES_COLS

Syntax

HIERARCHICAL_TIME_SERIES_COLS = {string_array }

Description

The column names used to generate hierarchical time series forecasts. Specifyone or more values for this option to aggregate and roll up values for all timeseries. The column order represents the hierarchy structure, where the left-mostcolumn is the parent. The columns must be of one of the following data types:

  • STRING
  • INT64

Arguments

AnARRAY<STRING> value.

KMS_KEY_NAME

Syntax

KMS_KEY_NAME =string_value

Description

The Cloud Key Management Servicecustomer-managed encryption key (CMEK) touse to encrypt the model.

Arguments

ASTRING value containing the fully-qualified name of the CMEK. For example,

'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'

AS

All time series forecasting models support the followingAS clausesyntax for specifying the training data:
ASquery_statement

For time series forecasting models that have aDATA_FREQUENCY valueof eitherDAILY orAUTO_FREQUENCY, you can optionally use thefollowingAS clause syntax to performcustom holiday modelingin addition to specifying the training data:

AS (  training_data AS (query_statement),  custom_holiday AS (holiday_statement))

query_statement

Thequery_statement argument specifies the query that is used togenerate the training data. For information about the supported SQL syntax ofthequery_statement clause, seeGoogleSQL query syntax.

holiday_statement

Theholiday_statement argument specifies the query that provides customholiday modeling information for time series forecast models. This query mustreturn 50,000 rows or less and must contain the following columns:

  • region: Required. ASTRING value that identifies the region to target forholiday modeling. Use one of the following options:

    • An upper-caseholiday region code.Use this option tooverwrite or supplement the holidays for the specified region. Youcan see the holidays for a region by runningSELECT * FROMbigquery-public-data.ml_datasets.holidays_and_events_for_forecasting WHEREregion =region.
    • An arbitrary string. Use this option to specify a custom region thatyou want to model holidays for. For example, you could specifyLondonif you are only modeling holidays for that city.

    Be sure not to use an existing holiday region code when you aretrying to model for a custom region. For example, if you want to model aholiday in California, and specifyCA as theregion value, theservice recognizes that as the holiday region code for Canada andtargets that region. Because the argument is case-sensitive, you couldspecifyca,California, or some other value that isn't a holidayregion code.

  • holiday_name: Required. ASTRING value that identifies the holidayto target for holiday modeling. Use one of the following options:

    • The holiday name as it is represented in thebigquery-public-data.ml_datasets.holidays_and_events_for_forecastingpublic table, including case.Use this option tooverwriteorsupplementthe specified holiday.
    • A string that represents a custom holiday. The string must be a validcolumn name so that it can be used inML.EXPLAIN_FORECAST output. Forexample, it cannot contain space. For more information on column naming,seeColumn names.
  • primary_date: Required. ADATE value that specifies the date the holidayfalls on.

  • preholiday_days: Optional. AnINT64 value that specifies the start of theholiday window around the holiday that is taken into account whenmodeling. Must be greater than or equal to1. Defaults to1.

  • postholiday_days: Optional. AnINT64 value that specifies the end of theholiday window around the holiday that is taken into account whenmodeling. Must be greater than or equal to1. Defaults to1.

Thepreholiday_days andpostholiday_days arguments together describethe holiday window around the holiday that is taken into accountwhen modeling. The holiday window is defined as[primary_date - preholiday_days, primary_date + postholiday_days] and isinclusive of the pre- and post-holiday days. The value for each holiday windowmust be less than or equal to30 and must be the same across the givenholiday. For example, if you are modeling Arbor Day for several different years,you must specify the same holiday window for all of those years.

To achieve the best holiday modeling result, provide as much historical andforecast information about the occurrences of each included holiday as possible.For example, if you have time series data from 2018 to 2022 and would like toforecast for 2023, you get the best result by providing the custom holidayinformation for all of those years, similar to the following:

CREATEORREPLACEMODEL`mydataset.arima_model`OPTIONS(model_type='ARIMA_PLUS',holiday_region='US',...)AS(training_dataAS(SELECT*FROM`mydataset.timeseries_data`),custom_holidayAS(SELECT'US'ASregion,'Halloween'ASholiday_name,primary_date,5ASpreholiday_days,1ASpostholiday_daysFROMUNNEST([DATE('2018-10-31'),DATE('2019-10-31'),DATE('2020-10-31'),DATE('2021-10-31'),DATE('2022-10-31'),DATE('2023-10-31')])ASprimary_date))

Holiday data

When you perform holiday modeling by specifying theHOLIDAY_REGION option,the model uses holiday data from the region or regions you specify. For example,the following table describes the holiday data used in theUS region for theyear 2022-2023.

  • region specifies the geographic region to which the holiday applies. Thesupported regions are listed inHOLIDAY_REGION.
  • holiday_name contains the name of the holiday.
  • primary_date specifies the date of the holiday. For holidays that spanmultiple days, this is usually the first day of the holiday.
  • preholiday_days describes the number of days the holiday effect startsbefore theprimary_date value.
  • postholiday_days describes the number of days the holiday effect ends aftertheprimary_date value.
regionholiday_nameprimary_datepreholiday_dayspostholiday_days
USChristmas2022-12-25101
USChristmas2023-12-25101
USMothersDay2022-05-0861
USMothersDay2023-05-1461
USNewYear2022-01-0153
USNewYear2023-01-0153
USDaylightSavingEnd2022-11-0611
USDaylightSavingEnd2023-11-0511
USDaylightSavingStart2022-03-1311
USDaylightSavingStart2023-03-1211
USThanksgiving2022-11-2435
USThanksgiving2023-11-2335
USValentine2022-02-1431
USValentine2023-02-1431
USEasterMonday2022-04-1881
USEasterMonday2023-04-1081
USHalloween2022-10-3111
USHalloween2023-10-3111
USStPatrickDay2022-03-1711
USStPatrickDay2023-03-1711
USColumbusDay2022-10-1011
USColumbusDay2023-10-0911
USIndependenceDay2022-07-0411
USIndependenceDay2023-07-0411
USJuneteenth2022-06-1911
USJuneteenth2023-06-1911
USLaborDay2022-09-0511
USLaborDay2023-09-0411
USMemorialDay2022-05-3011
USMemorialDay2023-05-2911
USMLKDay2022-01-1711
USMLKDay2023-01-1611
USPresidentDay2022-02-2111
USPresidentDay2023-02-2011
USSuperbowl2022-02-1311
USSuperbowl2023-02-0511
USVeteranDay2022-11-1111
USVeteranDay2023-11-1111

You can also see the holidays for a region by runningSELECT * FROMbigquery-public-data.ml_datasets.holidays_and_events_for_forecasting WHEREregion =region.

Thebigquery-public-data.ml_datasets.holidays_and_events_for_forecastingtable only contains holidays and events from the following regions:

  • AU: Australia
  • CA: Canada
  • CH: Switzerland
  • CL: Chile
  • CZ: Czechia
  • DE: Germany
  • DK: Denmark
  • EMEA: Europe, the Middle East and Africa
  • ES: Spain
  • FR: France
  • GB: United Kingdom
  • GLOBAL
  • ID: Indonesia
  • IN: India
  • IT: Italy
  • JAPAC: Japan and Asia Pacific
  • JP: Japan
  • KR: South Korea
  • LAC: Latin America and the Caribbean
  • MX: Mexico
  • MY: Malaysia
  • NA: North America
  • NL: Netherlands
  • NZ: New Zealand
  • PT: Portugal
  • SK: Slovakia
  • US: United States
  • ZA: South Africa

Custom holidays

You can combine use of theholiday_statement argumentand theHOLIDAY_REGION option to enable several differentcustom holiday scenarios, as described in the following sections.

Supplement built-in holidays with additional custom holidays

To model one or more custom holidays in addition to a region's built-inholidays, specify the target holiday region with theHOLIDAY_REGION option,and then provide the new holiday metadata in theholiday_statement argument.

The following example models all built-in holidays for theUS holiday region,and additionally models the custom holidaymembers_day:

CREATEORREPLACEMODEL`mydataset.arima_model`OPTIONS(model_type='ARIMA_PLUS',holiday_region='US',...)AS(training_dataAS(SELECT*FROM`mydataset`.timeseries_data`),        custom_holiday AS (            SELECT              'US' AS region,              'members_day' AS holiday_name,              primary_date,              2 AS preholiday_days,              2 AS postholiday_days            FROM              UNNEST(                [                  DATE('2016-06-15'),                  DATE('2017-06-07'),                  DATE('2018-06-06')])                AS primary_date          )      );

Model only custom holidays

To model only custom holidays, don't specify a value for theHOLIDAY_REGIONoption, and provide the new holiday metadata in theholiday_statementargument.

The following example models only the custom holidaymembers_dayfor theUS holiday region:

CREATEORREPLACEMODEL`mydataset.arima_model`OPTIONS(model_type='ARIMA_PLUS',-- Don't specify HOLIDAY_REGION...)AS(training_dataAS(SELECT*FROM`mydataset.timeseries_data`),custom_holidayAS(SELECT'US'ASregion,'members_day'ASholiday_name,primary_date,2ASpreholiday_days,2ASpostholiday_daysFROMUNNEST([DATE('2016-06-15'),DATE('2017-06-07'),DATE('2018-06-06')])ASprimary_date));

Change the metadata for built-in holidays

You can change the primary date and holiday effect window used by the modelfor one or more built-in holidays. To do this, specify the target holidayregion with theHOLIDAY_REGION option, and then provide the modified holidaymetadata in theholiday_statement argument.

The following example models all built-in holidays for theUS holiday region,but models 3 years of theEasterMonday holiday with a 3-dayholiday effect window instead of the default 9-day holiday effect window:

OPTIONS(model_type='ARIMA_PLUS',holiday_region='US',...)AS(training_dataAS(SELECT*FROM`mydataset.timeseries_data`),custom_holidayAS(SELECT'US'ASregion,'EasterMonday'ASholiday_name,primary_date,1ASpreholiday_days,1ASpostholiday_daysFROMUNNEST([DATE('2016-03-28'),DATE('2017-04-17'),DATE('2018-04-02')])ASprimary_date));

Model a subset of built-in holidays

To model only a subset of built-in holidays, don't specify a value for theHOLIDAY_REGION option, and provide a query based on thebigquery-public-data.ml_datasets.holidays_and_events_for_forecasting publictable to specify the set of holidays to model.

The following example models all built-in holidays for theUS holiday regionexcept for theChristmas andNewYears holidays:

CREATEORREPLACEMODEL`mydataset.arima_model`OPTIONS(model_type='ARIMA_PLUS',-- Don't specify HOLIDAY_REGION...)AS(training_dataAS(SELECT*FROM`mydataset.timeseries_data`),custom_holidayAS(SELECT*FROM`bigquery-public-data.ml_datasets.holiday`WHEREregion='US'AND(holiday_name!='Christmas'ORholiday_name!='NewYear')));

Custom holiday limitations

  • Custom holiday modeling only works for models that have adata_frequencyvalue of eitherDAILY orAUTO_FREQUENCY. If you useAUTO_FREQUENCY, theactual frequency of the time series data needs to be daily.
  • You can't use theTRANSFORM clauseof theCREATE MODEL statement if you are performing customholiday modeling.
  • holiday_name column cannot have more than 500 distinct values.
  • Custom holiday modeling uses an algorithm that automaticallydetects the significance of the holiday effect within the provided holidayeffect window, and only extracts the holiday effect on the days that thealgorithm classifies as significant. For example, ifprimary date is01/02 withpreholiday_days andpostholiday_days set to1, the algorithmanalyzes the holiday effect for these three days:[01/01, 01/02, 01/03].In theML.EXPLAIN_FORECAST output, it is not guaranteed that all three ofthese days will have a holiday effect. Only those days within this window thathave a significant holiday effect are associated with a non-zero holidayeffect in the output.
  • To get a better result from custom holiday modeling, don't specify the sameholiday more than twice a year.
  • If you specify the same holiday more than once, make sure its occurrencesdon't overlap. For example, a holiday that happens twice a year, with thefirst occurrence from December 1 - December 5 and the second occurrence fromDecember 4 - December 8, won't work because these two occurrences overlap witheach other on December 4 and December 5.
  • Different holidays can have partial overlap with each other, but full overlapisn't allowed. For example, if holiday A has an occurrence from December 1 -December 5, then holiday B can have an occurrence from December 4 -December 8. However, holiday B can't have an occurrence from December 2 -December 4, which is fully within holiday A's occurrence, and holiday B can'thave an occurrence from December 1 - December 8, which overlaps with all ofholiday A's occurrence.
  • In cases where holidays overlap for a period, the overlap between the holidayscan't be identical in multiple years. If it is, then it isn'tpossible to determine each holiday's effect on the overlapped period. Forexample, suppose holiday A and holiday B overlap from December 2 - December 4in 2024, and the time series has three years of data from 2022-2024. HolidaysA and B can't overlap from December 2 - December 4 in 2023 and 2022 as well ifyou want to be able to determine the holiday effect for each.

Hierarchical reconciliation

You can break down or aggregate time series forecasts by specifying differentdimensions of interest. These are known as hierarchical time series. Forexample, census data that reveals the total population per state can be brokendown by city and zip code. Conversely, you can aggregate that data for eachcountry or continent.

There are several techniques that you can use to generate and reconcilehierarchical forecasts. Take the following example, which shows a simplifiedhierarchical structure for liquor sales in the state of Iowa:

Hierarchical Time Series.

The lowest level shows the store level, followed by the zip code level, city,county, and finally by state. The goal for hierarchical forecasts is to makesure that all forecasts across each level reconcile. For example, given theearlier figure, this means that the forecast values for the cities of Clive andDes Moines must add up to the forecast value for Polk county. Similarly, theforecasts in Polk, Linn, and Scott must add up to the forecast in Iowa.

A common technique that you can use to generate reconciled forecasts is knownas theBottom-Up approach. In this approach, the forecasts are generated atthe bottom level of the hierarchy first, before summing up the other levels.Taking the earlier example, the forecasts for each store are used to build theforecasting models for the other levels, so the store models are used to buildthe zip code models, the zip code models are used to build the city models,and so forth.In BigQuery ML, you use theTIME_SERIES_ID_COL option to identify the dimensionsthat you want to generate time series for, and theHIERACHICAL_TIME_SERIES_ID_COLS optionto identify the dimensions that you want to roll up and reconcile. To learnmore about generating a hierarchical time series, seeForecast multiple hierarchical time series with a univariate model.

Locations

For information about supported locations, seeLocations for non-remote models.

Limitations

ARIMA_PLUS models have the following limitations:

  • For the input time series, the minimum length is 3 time points. The maximumlength is 500,000 time points when theDECOMPOSE_TIME_SERIES optionis setTRUE and 1,000,000 when the option is set toFALSE. When forecastingmultiple time series at the same time, the limit applies to each time series.
  • The maximum number of time series to forecast simultaneously using the IDcolumns is 100,000,000.
  • When forecasting multiple time series simultaneously using the ID column,any invalid time series that fail the model fitting are ignored anddon't appear in the results of forecast. For example, a single pointtime series. A warning message is shown in this case, and you can use theML.ARIMA_EVALUATE functionto retrieve the error message.
  • The maximum time points to forecast is 10,000.
  • Holiday effect modeling is effective only for approximately 5 years.
  • After a multiple time series model is trained, the evaluation tab in theBigQuery page on the Google Cloud console only shows the evaluationmetrics for the first 100 time series. To see the evaluation metrics for allof the time series, use theML.ARIMA_EVALUATE function.
  • You can't exportARIMA_PLUS models.

Examples

The following examples show how to create different types ofARIMA_PLUStime series models.

Forecast a single time series

This example shows how to create a time series model that forecasts asingle time series:

CREATEMODEL`project_id.mydataset.mymodel`OPTIONS(MODEL_TYPE='ARIMA_PLUS',time_series_timestamp_col='date',time_series_data_col='transaction')ASSELECTdate,transactionFROM`mydataset.mytable`

Forecast multiple time series

This example shows how to create multiple time series models, one for eachinput time series:

CREATEMODEL`project_id.mydataset.mymodel`OPTIONS(MODEL_TYPE='ARIMA_PLUS',time_series_timestamp_col='date',time_series_data_col='transaction',time_series_id_col='company_name')ASSELECTdate,transaction,company_nameFROM`mydataset.mytable`

Forecast multiple time series using multiple time series ID columns

This example shows how to create multiple time series models for multiple IDs:

CREATEMODEL`project_id.mydataset.mymodel`OPTIONS(MODEL_TYPE='ARIMA_PLUS',time_series_timestamp_col='date',time_series_data_col='transaction',time_series_id_col=['company_name','department_name'])ASSELECTdate,transaction,company_name,department_nameFROM`mydataset.mytable`

Forecast multiple time series more quickly by using a fraction of the time points

This example shows how to create multiple time series models whileimproving training speed by using theTIME_SERIES_LENGTH_FRACTION andMIN_TIME_SERIES_LENGTH options:

CREATEMODEL`project_id.mydataset.mymodel`OPTIONS(MODEL_TYPE='ARIMA_PLUS',time_series_timestamp_col='date',time_series_data_col='transaction',time_series_id_col=['company_name','department_name'],time_series_length_fraction=0.5,min_time_series_length=30)ASSELECTdate,transaction,company_name,department_nameFROM`mydataset.mytable`

Forecast multiple time series more quickly by defining a maximum number of time points

This example shows how to create multiple time series models whileimproving training speed by usingMAX_TIME_SERIES_LENGTH option:

CREATEMODEL`project_id.mydataset.mymodel`OPTIONS(MODEL_TYPE='ARIMA_PLUS',time_series_timestamp_col='date',time_series_data_col='transaction',time_series_id_col=['company_name','department_name'],max_time_series_length=50)ASSELECTdate,transaction,company_name,department_nameFROM`mydataset.mytable`

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.