Scale an ARIMA_PLUS univariate time series model to millions of time series

In this tutorial, you will learn how to significantly accelerate the training of a set of ARIMA_PLUS univariate time series models, in order to perform multiple time series forecasts with a single query. You will also learn how to evaluate forecasting accuracy.

This tutorial forecasts for multiple time series. Forecasted values are calculated for each time point, for each value in one or more specified columns. For example, if you wanted to forecast weather and specified a column containing city data, the forecasted data would contain forecasts for all time points for City A, then forecasted values for all time points for City B, and so forth.

This tutorial uses data from the public bigquery-public-data.new_york.citibike_trips and iowa_liquor_sales.sales tables. The bike trips data only contains a few hundred time series, so it is used to illustrate various strategies to accelerate model training. The liquor sales data has more than 1 million time series, so it is used to show time series forecasting at scale.

Before reading this tutorial, you should read Forecast multiple time series with a univariate model and Large-scale time series forecasting best practices.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement with the ARIMA_PLUS model type, to create a set of time series forecasting models with a single query.
  • The ML.EVALUATE function, to evaluate the forecasting accuracy of the models.

For simplicity, this tutorial doesn't cover how to use the ML.FORECAST or ML.EXPLAIN_FORECAST functions to generate forecasts. To learn how to use those functions, see Forecast multiple time series with a univariate model.

Costs

This tutorial uses billable components of Google Cloud, including:

  • BigQuery
  • BigQuery ML

For more information about costs, see the BigQuery pricing page and the BigQuery ML pricing page.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to

    Enable the BigQuery API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

Required permissions

  • To create the dataset, you need the bigquery.datasets.create IAM permission.

  • To create the model, you need the following permissions:

    • bigquery.jobs.create
    • bigquery.models.create
    • bigquery.models.getData
    • bigquery.models.updateData
  • To run inference, you need the following permissions:

    • bigquery.models.getData
    • bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

    • Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

  1. Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
        --description "BigQuery ML tutorial dataset." \
        bqml_tutorial

    Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

  2. Confirm that the dataset was created:

    bq ls

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
    "datasetId": "bqml_tutorial"
  }
}

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Create a table of input data

The SELECT statement of the following query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) function to get the daily total number of Citi Bike trips.

table_1 has 679 time series. The query uses additional INNER JOIN logic to select all those time series that have more than 400 time points, resulting in a total of 383 time series.

Follow these steps to create the input data table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE TABLE `bqml_tutorial.nyc_citibike_time_series` AS
    WITH input_time_series AS
    (
      SELECT
        start_station_name,
        EXTRACT(DATE FROM starttime) AS date,
        COUNT(*) AS num_trips
      FROM
        `bigquery-public-data.new_york.citibike_trips`
      GROUP BY
        start_station_name, date
    )
    SELECT table_1.*
    FROM input_time_series AS table_1
    INNER JOIN
    (
      SELECT
        start_station_name,
        COUNT(*) AS num_points
      FROM input_time_series
      GROUP BY start_station_name
    ) table_2
    ON table_1.start_station_name = table_2.start_station_name
    WHERE num_points > 400;
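
    Optionally, to confirm the number of time series in the input table, you can count the distinct start_station_name values with a query like the following sketch; based on the description above, the result should be 383.

    SELECT
      COUNT(DISTINCT start_station_name) AS num_time_series
    FROM
      `bqml_tutorial.nyc_citibike_time_series`;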

Create a model to forecast multiple time series with default parameters

You want to forecast the number of bike trips for each Citi Bike station, which requires many time series models, one for each Citi Bike station that is included in the input data. You can write multiple CREATE MODEL queries to do this, but that can be a tedious and time-consuming process, especially when you have a large number of time series. Instead, you can use a single query to create and fit a set of time series models in order to forecast multiple time series at once.

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating a set of ARIMA-based ARIMA_PLUS time series models. The time_series_timestamp_col option specifies the column that contains the time points of the time series, the time_series_data_col option specifies the column to forecast, and the time_series_id_col option specifies one or more dimensions that you want to create time series for.

This example leaves out the time points in the time series after June 1, 2016 so that those time points can be used to evaluate the forecasting accuracy later by using the ML.EVALUATE function.

Follow these steps to create the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_default`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name'
      ) AS
    SELECT *
    FROM bqml_tutorial.nyc_citibike_time_series
    WHERE date < '2016-06-01';

    The query takes about 15 minutes to complete.

Evaluate forecasting accuracy for each time series

Evaluate the forecasting accuracy of the model by using the ML.EVALUATE function.

Follow these steps to evaluate the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    SELECT *
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
                  TABLE `bqml_tutorial.nyc_citibike_time_series`,
                  STRUCT(7 AS horizon, TRUE AS perform_aggregation));

    This query reports several forecasting metrics, including mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (sMAPE).

    The results should look similar to the following: [image: evaluation metrics for the time series model].

    The TABLE clause in the ML.EVALUATE function identifies a table containing the ground truth data. The forecasting results are compared to the ground truth data to compute accuracy metrics. In this case, nyc_citibike_time_series contains the time series points both before and after June 1, 2016. The points after June 1, 2016 are the ground truth data. The points before June 1, 2016 are used to train the model to generate forecasts after that date. Only the points after June 1, 2016 are necessary to compute the metrics; the points before June 1, 2016 are ignored in the metrics calculation.

    The STRUCT clause in the ML.EVALUATE function specifies parameters for the function. The horizon value is 7, which means the query is calculating the forecasting accuracy based on a seven-point forecast. Note that if the ground truth data has fewer than seven points for the comparison, then accuracy metrics are computed based on the available points only. The perform_aggregation value is TRUE, which means that the forecasting accuracy metrics are aggregated over the metrics computed at each time point. If you specify a perform_aggregation value of FALSE, forecasting accuracy is returned for each forecasted time point.

    For more information about the output columns, see the ML.EVALUATE function.
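
    For example, the following variation is a minimal sketch of how you might return per-time-point accuracy instead of aggregated metrics; it only changes the perform_aggregation value from TRUE to FALSE.

    SELECT *
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
                  TABLE `bqml_tutorial.nyc_citibike_time_series`,
                  STRUCT(7 AS horizon, FALSE AS perform_aggregation));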

Evaluate overall forecasting accuracy

Evaluate the forecasting accuracy for all 383 time series.

Of the forecasting metrics returned by ML.EVALUATE, only mean absolute percentage error and symmetric mean absolute percentage error are independent of the scale of the time series values. Therefore, to evaluate the overall forecasting accuracy of the set of time series, only the aggregate of these two metrics is meaningful.

Follow these steps to evaluate the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    SELECT
      AVG(mean_absolute_percentage_error) AS MAPE,
      AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
                  TABLE `bqml_tutorial.nyc_citibike_time_series`,
                  STRUCT(7 AS horizon, TRUE AS perform_aggregation));

This query returns a MAPE value of 0.3471 and a sMAPE value of 0.2563.

Create a model to forecast multiple time series with a smaller hyperparameter search space

In the Create a model to forecast multiple time series with default parameters section, you used the default values for all of the training options, including the auto_arima_max_order option. This option controls the search space for hyperparameter tuning in the auto.ARIMA algorithm.

In the model created by the following query, you use a smaller search space for the hyperparameters by changing the auto_arima_max_order option value from the default of 5 to 2.

Follow these steps to create the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name',
       auto_arima_max_order = 2
      ) AS
    SELECT *
    FROM `bqml_tutorial.nyc_citibike_time_series`
    WHERE date < '2016-06-01';

    The query takes about 2 minutes to complete. Recall that the previous model took about 15 minutes to complete when the auto_arima_max_order value was 5, so this change makes model training around 7x faster. If you wonder why the speed gain is not 5/2 = 2.5x, it is because when the auto_arima_max_order value increases, not only does the number of candidate models increase, but so does their complexity. This increases the training time of the model.
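
    If you want to see which candidate ARIMA models the auto.ARIMA algorithm considered within this smaller search space, you can inspect them with the ML.ARIMA_EVALUATE function, as in the following sketch. The show_all_candidate_models parameter shown here is an assumption about how you might use that function; with it set to TRUE, the function returns all evaluated candidates rather than only the best one per time series.

    SELECT *
    FROM
      ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`,
                        STRUCT(TRUE AS show_all_candidate_models));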

Evaluate forecasting accuracy for a model with a smaller hyperparameter search space

Follow these steps to evaluate the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    SELECT
      AVG(mean_absolute_percentage_error) AS MAPE,
      AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`,
                  TABLE `bqml_tutorial.nyc_citibike_time_series`,
                  STRUCT(7 AS horizon, TRUE AS perform_aggregation));

This query returns a MAPE value of 0.3337 and a sMAPE value of 0.2337.

In the Evaluate overall forecasting accuracy section, you evaluated a model with a larger hyperparameter search space, where the auto_arima_max_order option value is 5. This resulted in a MAPE value of 0.3471 and a sMAPE value of 0.2563. In this case, you can see that a smaller hyperparameter search space actually gives higher forecasting accuracy. One reason for this is that the auto.ARIMA algorithm only performs hyperparameter tuning for the trend module of the entire modeling pipeline. The best ARIMA model selected by the auto.ARIMA algorithm might not generate the best forecasting results for the entire pipeline.

Create a model to forecast multiple time series with a smaller hyperparameter search space and smart fast training strategies

In this step, you use both a smaller hyperparameter search space and the smart fast training strategy by using one or more of the max_time_series_length or time_series_length_fraction training options.

While modeling periodic components such as seasonality requires a certain number of time points, modeling the trend requires fewer time points. Meanwhile, trend modeling is much more computationally expensive than other time series components such as seasonality. By using the fast training options above, you can efficiently model the trend component with a subset of the time series, while the other time series components use the entire time series.

The following example uses the max_time_series_length option to achieve fast training. By setting the max_time_series_length option value to 30, for each of the 383 time series, only the 30 most recent time points are used to model the trend component; all time points are still used to model the non-trend components.

Follow these steps to create the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name',
       auto_arima_max_order = 2,
       max_time_series_length = 30
      ) AS
    SELECT *
    FROM `bqml_tutorial.nyc_citibike_time_series`
    WHERE date < '2016-06-01';

    The query takes about 35 seconds to complete. This is 3x faster compared to the query you used in the Create a model to forecast multiple time series with a smaller hyperparameter search space section. Due to the constant time overhead for the non-training part of the query, such as data preprocessing, the speed gain is much higher when the number of time series is much larger than in this example. For a million time series, the speed gain approaches the ratio of the time series length to the max_time_series_length option value. In that case, the speed gain is greater than 10x.
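
    As a minimal sketch of the alternative option mentioned earlier, you could instead use the time_series_length_fraction option, which uses a fraction of each time series for trend modeling rather than a fixed number of points. The 0.5 value and the model name nyc_citibike_arima_model_length_fraction below are illustrative assumptions, not values recommended by this tutorial.

    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_length_fraction`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name',
       auto_arima_max_order = 2,
       time_series_length_fraction = 0.5
      ) AS
    SELECT *
    FROM `bqml_tutorial.nyc_citibike_time_series`
    WHERE date < '2016-06-01';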

Evaluate forecasting accuracy for a model with a smaller hyperparameter search space and smart fast training strategies

Follow these steps to evaluate the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    SELECT
      AVG(mean_absolute_percentage_error) AS MAPE,
      AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training`,
                  TABLE `bqml_tutorial.nyc_citibike_time_series`,
                  STRUCT(7 AS horizon, TRUE AS perform_aggregation));

This query returns a MAPE value of 0.3515 and a sMAPE value of 0.2473.

Recall that without the use of fast training strategies, the forecasting accuracy results in a MAPE value of 0.3337 and a sMAPE value of 0.2337. The difference between the two sets of metric values is within 3%, which is statistically insignificant.

In short, you have used a smaller hyperparameter search space and smart fast training strategies to make your model training more than 20x faster without sacrificing forecasting accuracy. As mentioned earlier, with more time series, the speed gain from the smart fast training strategies can be significantly higher. Additionally, the underlying ARIMA library used by ARIMA_PLUS models has been optimized to run 5x faster than before. Together, these gains enable the forecasting of millions of time series within hours.

Create a model to forecast a million time series

In this step, you forecast liquor sales for over 1 million liquor products in different stores using the public Iowa liquor sales data. The model training uses a small hyperparameter search space as well as the smart fast training strategy.

Follow these steps to create the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_by_product`
    OPTIONS(
      MODEL_TYPE = 'ARIMA_PLUS',
      TIME_SERIES_TIMESTAMP_COL = 'date',
      TIME_SERIES_DATA_COL = 'total_bottles_sold',
      TIME_SERIES_ID_COL = ['store_number', 'item_description'],
      HOLIDAY_REGION = 'US',
      AUTO_ARIMA_MAX_ORDER = 2,
      MAX_TIME_SERIES_LENGTH = 30
    ) AS
    SELECT
      store_number,
      item_description,
      date,
      SUM(bottles_sold) AS total_bottles_sold
    FROM
      `bigquery-public-data.iowa_liquor_sales.sales`
    WHERE date BETWEEN DATE("2015-01-01") AND DATE("2021-12-31")
    GROUP BY store_number, item_description, date;

    The query takes about 1 hour 16 minutes to complete.
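
    If you're curious how many time series this model covers, you can count the distinct store_number and item_description combinations in the training data, as in the following sketch; based on this tutorial's description, the result should exceed 1 million.

    SELECT COUNT(*) AS num_time_series
    FROM (
      SELECT DISTINCT store_number, item_description
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      WHERE date BETWEEN DATE("2015-01-01") AND DATE("2021-12-31")
    );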

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset to delete the dataset, the table, and all of the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
