Continuous data integration in BigQuery
This document describes principles and techniques for implementing a repeatable workflow that will help you integrate data changes into your BigQuery-based data warehouse (DWH). These changes might include new datasets, new data sources, or updates and changes to existing datasets. The document also describes a reference architecture for this task.
The audience for this document is software and data architects and data engineers who use BigQuery as a DWH. The document assumes that you're familiar with basic concepts of CI/CD or similar application lifecycle management practices.
Introduction
Continuous integration and continuous delivery (CI/CD) has become an essential technique in the software development lifecycle. Adopting the principles of CI/CD lets teams deliver better software with fewer issues than by integrating features and deploying them manually. CI/CD can also be part of a strategy for data management when you modernize your data warehousing.
However, when you're working with a DWH like BigQuery, there are differences in how you implement CI/CD compared to implementing CI/CD in source code. These differences are in part because data warehousing is an inherently stateful system for managing the underlying data.
This document provides the following information:
- Techniques for implementing a continuous integration (CI) strategy in BigQuery.
- Guidance and methods that help you avoid pitfalls.
- Suggestions for BigQuery features that help with CI in BigQuery.
This document focuses on CI, because integration has more data-specific considerations for a data warehousing team than continuous delivery (CD) does.
When to use CI for a BigQuery DWH
In this document, data integration is a task that's usually performed by the DWH team, which includes incorporating new data into the DWH. This task might involve incorporating a new data source into the DWH, or changing the structure of a table that's already inside the DWH.
Integrating new data into the DWH is a task similar to integrating a new feature into existing software. It might introduce bugs and negatively affect the end-user experience. When you integrate data into BigQuery, downstream consumers of the data (for example, applications, BI dashboards, and individual users) might experience issues due to schema mismatches. Or the consumers might use incorrect data that doesn't reflect the data from the original data source.
Use this document to do the following:
- Describe key points in CI for a DWH system.
- Design and implement a CI strategy for your BigQuery environment.
- Learn how to use BigQuery features for implementing CI.
This guide doesn't describe how to manage CI for non-DWH products, including data products like Dataflow and Bigtable.
Example scenario
Example Company is a large retail company that maintains its DWH in BigQuery. In the upcoming year, the company wants to integrate new data sources into its DWH from companies that were recently acquired by Example Company. The new data sources to be integrated have different schemas. However, the DWH must keep its existing schema and must provide strong data consistency and data completeness so that the downstream consumers of data aren't negatively affected.
Currently, the Example Company DWH team performs data integration. The integration relies on having the existing data sources in a predefined schema. This workflow involves legacy data import processes, acquired databases, and application services.
To update their data integration processes to accommodate the new data sources, the DWH team must redesign their approach to data integration to comply with the requirements that were noted earlier, such as strong data consistency. The team must implement the changes in an isolated fashion so that the data changes can be tested and measured before the data is made available to downstream consumers.
After the DWH team adopts the new workflow, the team has a repeatable process. Each developer can create an isolated development environment that's based on production data. Using these isolated environments, developers can then make changes, test them, have them reviewed, and deliver the required changes to the production environment. If the changes cause bugs or unforeseen issues, the changes can be easily rolled back.
What continuous integration means for a DWH
Continuous integration (CI) is a set of practices that lets development teams shorten development cycles and find issues in the code faster than with manual systems. The main benefit of adopting a CI approach is the ability to develop rapidly, reducing the risks of interference between developers. This goal is achieved by making sure that the process is repeatable, while allowing each developer to work in isolation from other developers.
These principles also apply when an organization must integrate data into a DWH, with a few differences. In the context of typical software development, CI isolates changes to source code, which is stateless. In the context of CI in data, CI integrates data into a DWH system. However, data is stateful by definition. This difference has implications for how CI applies to DWH scenarios, as described in this document.
Additional scenarios that aren't covered in this document
Although this document focuses on isolating development changes from the production environment, the document doesn't cover the following aspects of data integration:
- Data testing: Are you able to verify that the data you have conforms to business requirements? Is the data reliable to serve as the source of truth? To increase your confidence level in the data that you're serving from your DWH, it's important to test the data. To test, you can run a set of queries that assert that the data isn't missing values or that it doesn't contain "bad" values (for an example, see the sketch after this list).
- Data lineage: Are you able to see any table in its context? For example, can you see where the data was gathered from, and which datasets were pre-computed in order to generate the table? In modern DWH architectures, data is split into many systems that use different, specialized data structures. These include relational databases, NoSQL databases, and external data sources. To fully understand the data that you have, you must keep track of that data. You must also understand how the data was generated and from where it was generated.
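For example, such a data test can be written as BigQuery ASSERT statements that fail a scripted check when a rule is violated. The following is a minimal sketch; the `sales_dwh.orders` table and its columns are hypothetical placeholders, not part of this architecture:

```sql
-- Fail the check if required values are missing.
ASSERT NOT EXISTS (
  SELECT 1
  FROM `sales_dwh.orders`
  WHERE customer_id IS NULL
) AS 'orders.customer_id must never be NULL';

-- Fail the check if the data contains "bad" values.
ASSERT (
  SELECT COUNT(*)
  FROM `sales_dwh.orders`
  WHERE order_total < 0
) = 0 AS 'orders.order_total must not be negative';
```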
These topics are out of scope for this guide. However, it will benefit your data strategy to plan for these topics when you're designing a workflow for your team.
Typical setup of BigQuery as a DWH
The following diagram illustrates a typical DWH design for BigQuery. It shows how data from external sources is ingested into the DWH, and how consumers consume data from the DWH.
The data starts at the data sources, where the data is in conventional transactional or low-latency databases such as OLTP SQL databases and NoSQL databases. A scheduled process copies the data into a staging area.
The data is stored temporarily in the staging area. If necessary, the data is transformed to fit an analytical system before it's loaded into the DWH tables. (In this diagram, the staging area is inside Google Cloud, but it doesn't have to be.) Transformations might include denormalization, enriching certain datasets, or handling malformed entries (for example, entries with missing values).
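To make the staging step concrete, the following sketch shows one way to load staged files and handle malformed entries by using BigQuery SQL. The dataset, table, and bucket names are hypothetical, and a scheduled process or ELT tool might perform these steps instead:

```sql
-- Load staged files from Cloud Storage into a raw staging table.
LOAD DATA INTO `staging.orders_raw` (
  order_id STRING,
  customer_id STRING,
  order_total NUMERIC,
  order_ts TIMESTAMP
)
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://example-staging-bucket/orders/*.csv']
);

-- Transform the raw data to fit the analytical schema, for example by
-- dropping malformed entries that are missing a customer_id.
CREATE OR REPLACE TABLE `staging.orders_clean` AS
SELECT order_id, customer_id, order_total, order_ts
FROM `staging.orders_raw`
WHERE customer_id IS NOT NULL;
```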
From the staging area, the new data is loaded into the DWH tables. The tables might be organized in different ways depending on the design of the DWH, and are usually referred to as core tables. Some examples of table design paradigms include the star schema paradigm, the denormalized paradigm, and multi-tier aggregates.
Regardless of the table design, these tables save data over time. The tables must adhere to the schema, and they're presumed to hold the source of truth for all analytical purposes. This role for the tables means that data in these tables must be complete, be consistent, and adhere to the predefined schemas.
These tables don't serve data directly to consumers. Instead, the data is served through an access layer, which is designed to encapsulate business logic that must be applied to the underlying data. Examples of this type of business logic are calculating a metric for each record, or filtering and grouping the data.
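As an illustration, an access-layer object can be a view that encapsulates such business logic on top of a core table. The following sketch assumes hypothetical core-table and column names:

```sql
-- An access-layer view: consumers query this view instead of the core table.
CREATE OR REPLACE VIEW `access_layer.daily_net_revenue` AS
SELECT
  DATE(order_ts) AS order_date,
  store_id,
  -- Business rule (hypothetical): net revenue excludes cancelled orders
  -- and subtracts discounts.
  SUM(order_total - discount_total) AS net_revenue
FROM `sales_dwh.orders`
WHERE status != 'CANCELLED'
GROUP BY order_date, store_id;
```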
The consumers of the data can connect to and read data from the DWH access layer. These data consumers might include systems like the following:
- Business intelligence (BI) dashboards
- Data science notebooks
- Operational systems that rely on data calculated in the DWH
- Human users for ad-hoc queries
The data consumers rely heavily on the DWH for providing consistent schemas and on the business logic that the DWH encapsulates. These schemas and business logic can be considered as the service level agreements (SLAs) of the DWH platform. Any change to the business logic, to the schema, or to the completeness of data might have large implications downstream. Given the ever-changing nature of modern data platforms, the DWH team might be required to make those changes while nevertheless strictly adhering to the SLAs. In order for the team to meet these SLAs and also keep the DWH up to date, they need a workflow that allows data integration while minimizing the friction that these changes might create.
Assets for continuous integration in a DWH
As with any other development or IT team, the DWH team must maintain assets that are essential to their responsibilities. These assets can typically be divided into the following categories:
The codebase for data pipelines: These assets usually consist of source code in a high-level programming language like Python or Java. For those types of assets, the CI/CD processes are built by using tools like Git and Jenkins, or by using Google Cloud solutions like Cloud Source Repositories and Cloud Build.
SQL scripts: These assets describe the structure and the business logic that's encapsulated inside the DWH. Within this category, the assets can be further divided into the following subcategories (a combined sketch of all three follows the list):
- Data definition language (DDL): These assets are used for defining the schema of tables and views.
- Data manipulation language (DML): These assets are used for manipulating data inside a table. DML commands are also used to create new tables based on existing tables.
- Data control language (DCL): These assets are used for controlling permissions and access to tables. Within BigQuery, you can control access by using SQL and the bq command-line tool, or by using the BigQuery REST API. However, we recommend that you use IAM.
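To make the distinction concrete, the following sketch shows one statement of each type. All table names and principals are hypothetical:

```sql
-- DDL: define the schema of a core table.
CREATE TABLE IF NOT EXISTS `sales_dwh.orders` (
  order_id STRING NOT NULL,
  customer_id STRING,
  order_total NUMERIC,
  order_ts TIMESTAMP
)
PARTITION BY DATE(order_ts);

-- DML: merge newly staged rows into the core table.
MERGE `sales_dwh.orders` AS target
USING `staging.orders_clean` AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
  UPDATE SET order_total = source.order_total
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, order_total, order_ts)
  VALUES (source.order_id, source.customer_id, source.order_total, source.order_ts);

-- DCL: grant read-only access with SQL (IAM is the recommended mechanism).
GRANT `roles/bigquery.dataViewer`
ON TABLE `sales_dwh.orders`
TO 'user:analyst@example.com';
```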
These assets, and others like Terraform scripts that are used to build components, are maintained inside code repositories. Tools like Dataform can help you construct a CI/CD pipeline that validates your SQL scripts and checks predefined validation rules on tables that are created by DDL scripts. These tools let you apply compilation and testing processes for SQL, which in most contexts doesn't have a natural testing environment.
In addition to the assets that are associated with tools and processes, the main asset for a DWH team is the data. Data isn't trackable by using asset-tracking systems like Git, which is designed to track source code. This document addresses the issues that are associated with tracking data.
Issues with integrating data
Because of the potential complexity of table relationships inside a DWH (for example, in one of the table design paradigms mentioned earlier), keeping the state of production data isolated from a testing environment is a challenge. Standard practices in software development can't be applied to the data integration scenario.
The following table summarizes the differences between the practices for integrating code and the practices for integrating data.
|  | Integrating code | Integrating data |
|---|---|---|
| Local development | Source code is easily cloneable due to its relatively small size. Generally the code fits most end-user machines (excluding cases of monorepos, which have other solutions). | Most tables in a DWH cannot fit on a development machine due to their size. |
| Central testing | Different states of the source code are cloned into a central system (a CI server) to undergo automated testing. Having different states of the code lets you compare results between a stable version and a development version. | Creating different states of the data in an isolated environment isn't straightforward. Moving data outside the DWH is a resource-intensive and time-consuming operation. It isn't practical to do as frequently as needed for testing. |
| Past versions | During the process of releasing new versions of software, you can track past versions. If you detect a problem in a new release, you can roll back to a safe version. | Taking backups of tables inside the DWH is a standard practice in case you must roll back. However, you must make sure that all affected tables are rolled back to the same point in time. That way, related tables are consistent with one another. |
Integrate data into BigQuery tables
BigQuery has two features that can help you design a workflow for data integration: table snapshots and table clones. You can combine these features to achieve a workflow that gives you a cost-effective development environment. Developers can manipulate data and its structure in isolation from the production environment, and they can roll back a change if necessary.
A BigQuery table snapshot is a read-only representation of a table (called the base table) at a given moment in time. Similarly, a BigQuery table clone is a read-write representation of a table at a given moment in time. In both cases, storage costs are minimized because only the differences from the base table are stored. Table clones start to incur costs when the base table changes or when the table clones change. Because table snapshots are read-only, they incur costs only when the base table changes.
For more information about the pricing of table snapshots and table clones, see Introduction to table snapshots and Introduction to table clones.
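For example, creating a snapshot or a clone of a hypothetical base table takes a single DDL statement:

```sql
-- A read-only snapshot of a base table, usable as a reference and rollback point.
CREATE SNAPSHOT TABLE `dev_dataset_1.orders_snapshot`
CLONE `sales_dwh.orders`;

-- A read-write clone of the same base table, usable as a working copy.
CREATE TABLE `dev_dataset_1.orders_clone`
CLONE `sales_dwh.orders`;
```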
You can take table snapshots and table clones using the BigQuery time travel feature (up to seven days in the past). This feature lets you capture snapshots and clones of multiple tables at the same point in time, which makes your working environment and snapshots consistent with one another. Using this feature can be helpful for tables that are updated frequently.
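For example, to keep related tables consistent with one another, you can pin the snapshots of several tables to the same timestamp within the time travel window. The table names and timestamp below are hypothetical:

```sql
-- Snapshot two related tables as of the same instant, so that the captured
-- state of the development environment is internally consistent.
CREATE SNAPSHOT TABLE `dev_dataset_1.orders_snap_t0`
CLONE `sales_dwh.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-15 08:00:00+00';

CREATE SNAPSHOT TABLE `dev_dataset_1.customers_snap_t0`
CLONE `sales_dwh.customers`
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-15 08:00:00+00';
```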
How to use table clones and table snapshots to allow isolation
To illustrate the integration workflow for CI in a DWH, imagine the following scenario. You're given a task of integrating a new dataset into the DWH. The task might be to create new DWH tables, to update existing tables, to change the structure of tables, or any combination of these tasks. The workflow might look like the following sequence:
- You identify the tables that might be affected by the changes and additional tables that you might want to check.
- You create a new BigQuery dataset to contain the assets for this change. This dataset helps isolate the changes and separates this task from other tasks that other team members work on. The dataset must be in the same region as the source dataset. However, the project can be separated from the production project to help with your organization's security and billing requirements.
- For each of the tables, you create both a clone and a snapshot in the new dataset, potentially for the same point in time. This approach offers the following benefits:
  - The table clone can act as a working copy where you can make changes freely without affecting the production table. You can create multiple table clones of the same base table in order to test different integration paths at the same time, with minimal overhead.
  - The snapshot can act as a restore and reference point, a point where the data is known to have worked before any change took place. Having this snapshot lets you perform a rollback in case an issue is detected later in the process.
- You use the table clones to implement the changes that are required for the tables. This action results in an updated version of the table clones, which you can test in an isolated dataset.
- Optionally, at the end of the implementation phase, you can present a dataset that can be used for the following tasks:
  - Unit testing with a validation tool like Dataform. Unit tests are self-contained, which means that the asset is tested in isolation. In this case, the asset is the table in BigQuery. Unit tests can check for null values, can verify that all strings meet length requirements, and can make sure that certain aggregates produce useful results. Unit tests can include any confidence test that makes sure that the table maintains the organization's business rules.
  - Integration testing with downstream consumers.
  - Peer review.
This workflow lets you test with production data, without affecting the downstream consumers.
- Before you merge the new data into BigQuery, you can create another snapshot. This snapshot is useful as another rollback option in case the data in the base table has changed.
- The process of merging the changes depends on the process that your organization wants to adopt and on what changes are required. For example, for a change in the SQL scripts, the new dataset might be accompanied by a pull request to the standard codebase. If the change is limited to a change in the data within a given table, you could just copy data using standard methods of BigQuery.
You can use a script of stored procedures to encapsulate and automate the steps for creating a dataset and creating the clones and snapshots. Automating these tasks reduces the risk of human error. For an example of a script that can help automate the processes, see the CI for Data in BigQuery CLI utility GitHub repository.
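The following is a minimal sketch of those steps in scripted BigQuery SQL; it is independent of the linked utility, and the dataset, table, and timestamp values are hypothetical:

```sql
-- 1. Create an isolated development dataset in the same region as production.
CREATE SCHEMA IF NOT EXISTS `dev_dataset_1`
OPTIONS (location = 'US');

-- 2. For each affected table, capture a read-only snapshot (the rollback and
--    reference point) and a read-write clone (the working copy), both pinned
--    to the same point in time.
CREATE SNAPSHOT TABLE `dev_dataset_1.orders_snapshot`
CLONE `sales_dwh.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-15 08:00:00+00';

CREATE TABLE `dev_dataset_1.orders_clone`
CLONE `sales_dwh.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-15 08:00:00+00';

-- 3. Implement the change against the clone, for example a schema change that
--    downstream consumers can then be tested against.
ALTER TABLE `dev_dataset_1.orders_clone`
ADD COLUMN acquisition_source STRING;
```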
Benefits of using table clones and table snapshots
When you use the workflow described in the preceding section, your developers can work in isolation and in parallel, without interfering with their colleagues. Developers can test and review changes, and if there's an issue, roll back the changes. Because you work with table snapshots and table clones instead of full tables, you minimize costs and storage.
This section provides more detail about how table snapshots and table clones let developers achieve this workflow. The following diagram shows how table snapshots and table clones relate to the data in the production dataset.
In the diagram, the production dataset contains all the tables that are being used in production. Every developer can create a dataset for their own development environment. The diagram shows two developer datasets, which are labeled Dev Dataset 1 and Dev Dataset 2. By using these developer datasets, developers can work simultaneously on the same tables without interfering with one another.
After developers have created a dataset, they can create clones and snapshots of the tables they are working on. The clones and snapshots represent the data at a particular point in time. At this point, developers are free to change the table clones, because changes aren't visible on the base table.
A developer can review the table clones, compare them to the snapshot, and test them for compatibility with downstream consumers. Other developers are able to work with other clones and snapshots, without interference, and without creating too many resource-consuming copies of the base table.
Developers can merge changes into the base table while keeping the snapshot available as a rollback option, if needed. This process can also be repeated for different environments, like development, test, and production.
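Using the same hypothetical names as earlier, merging a reviewed change and rolling it back might look like the following sketch. (For schema or SQL-script changes, a pull request against the codebase might be more appropriate than copying data.)

```sql
-- Merge: overwrite the production table with the tested, reviewed clone.
CREATE OR REPLACE TABLE `sales_dwh.orders` AS
SELECT * FROM `dev_dataset_1.orders_clone`;

-- Rollback: if an issue is detected later, restore the production table from
-- the snapshot that was taken before the change.
CREATE OR REPLACE TABLE `sales_dwh.orders` AS
SELECT * FROM `dev_dataset_1.orders_snapshot`;
```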
Alternatives to table clones and table snapshots
There are alternatives to using table clones and table snapshots that let you achieve a similar result. These alternative methods are typically used differently than clones and snapshots. It's important to understand the differences between these methods and where you might prefer one method over the other.
Copy entire tables into a different dataset
One alternative method is to use a different dataset and to copy the tables into that dataset. Using this method is similar to using table clones and snapshots, but you copy the entire set of tables. Depending on the sizes of the data being copied, the storage costs might be high. Some organizations used this method before table clones became available in BigQuery. However, this method doesn't present any advantages over using clones and snapshots.
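For reference, a full copy can be created with a single DDL statement (all names hypothetical); the bq cp command achieves the same result:

```sql
-- Copy the entire table, data and metadata, into a separate dataset.
-- Unlike a clone, the full data is duplicated, so storage costs scale with size.
CREATE TABLE `dev_copy_dataset.orders`
COPY `sales_dwh.orders`;
```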
Export and import tables to Cloud Storage
Another alternative method is to move the data through Cloud Storage. This method is also similar to using table clones and table snapshots. However, it includes the extra step of exporting the data to a Cloud Storage bucket. One advantage of this method is that it gives you an extra backup of your data. You might choose this method if you want a backup for disaster recovery or hybrid solutions.
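A minimal sketch of this alternative, assuming a hypothetical backup bucket, follows:

```sql
-- Export the table to Cloud Storage, for example as an extra backup.
EXPORT DATA OPTIONS (
  uri = 'gs://example-backup-bucket/orders/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT * FROM `sales_dwh.orders`;

-- Re-import the exported files into a development dataset when needed.
LOAD DATA INTO `dev_dataset_1.orders_from_backup`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://example-backup-bucket/orders/*.parquet']
);
```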
Use BigQuery sharing
BigQuery sharing (formerly Analytics Hub) lets you share datasets both outside and inside the organization in a way that's designed to be secure. It offers many features that let you publish datasets to provide subscribers with controlled, read-only access to those datasets. However, even though you can use BigQuery sharing to expose datasets in order to implement changes, a developer still must create table clones in order to work with the tables.
Summary of DWH continuous integration options
The following table summarizes the differences, advantages, and potential disadvantages between the options for DWH continuous integration. (Sharing offers a different feature set, and is therefore not measurable using the parameters listed in the table.)
|  | Costs | Rollbacks | Risks |
|---|---|---|---|
| Table snapshots and table clones | Minimal. You pay only for the difference between the snapshot or clone and the base table. | The snapshot acts as a backup to roll back to if necessary. | You control the amount of risk. Snapshots can be taken at a point in time for all tables, which reduces inconsistencies even if there is a rollback. |
| Table copy | Higher costs than using table snapshots and table clones. The entirety of the data is duplicated. To support rollbacks, you need multiple copies of the same table. | Possible, but requires two copies of the table: one copy to serve as backup and one copy to work with and make changes to. | Copying is harder to do for a point in time. If a rollback is necessary, not all tables are taken from the same point in time. |
| Export and import | Higher costs than using table snapshots and table clones. The data is duplicated. To support rollback, you need multiple copies of the same table. | The exported data serves as a backup. | Exported data is not a point-in-time export for multiple tables. |
What's next
- Read about BigQuery table snapshots in Introduction to table snapshots.
- Learn more about continuous integration for software development in DevOps tech: Continuous integration.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.