Scalable BigQuery backup automation

This architecture provides a framework and reference deployment to help you develop your BigQuery backup strategy. This recommended framework and its automation can help your organization do the following:

  • Adhere to your organization's disaster recovery objectives.
  • Recover data that was lost due to human errors.
  • Comply with regulations.
  • Improve operational efficiency.

The scope of BigQuery data can include (or exclude) folders, projects, datasets, and tables. This recommended architecture shows you how to automate the recurrent backup operations at scale. You can use two backup methods for each table: BigQuery snapshots and BigQuery exports to Cloud Storage.

This document is intended for cloud architects, engineers, and data governance officers who want to define and automate data policies in their organizations.

Architecture

The following diagram shows the automated backup architecture:

Architecture for the automated backup solution.

The workflow that's shown in the preceding diagram includes the following phases:

  1. Cloud Scheduler triggers a run of the dispatcher service through a Pub/Sub message, which contains the scope of the BigQuery data that's included and excluded. Runs are scheduled by using a cron expression.
  2. The dispatcher service, which is built on Cloud Run, uses the BigQuery API to list the tables that are within the BigQuery scope (a minimal sketch of steps 2 and 3 follows this list).
  3. The dispatcher service submits one request for each table to the configurator service through a Pub/Sub message.
  4. The Cloud Run configurator service computes the backup policy of the table from one of the following defined options:

    1. The table-level policy, which is defined by data owners.
    2. The fallback policy, which is defined by the data governance officer, for tables that don't have defined policies.

    For details about backup policies, see Backup policies.

  5. The configurator service submits one request for each table to the next service, based on the computed backup policy.

  6. Depending on the backup method, one of the following custom Cloud Run services submits a request to the BigQuery API and runs the backup process:

    1. The service for BigQuery snapshots backs up the table as a snapshot.
    2. The service for data exports backs up the table as a data export to Cloud Storage.
  7. When the backup method is a table data export, a Cloud Logging log sink listens to the export job completion events to enable the asynchronous execution of the next step.

  8. After the backup services complete their operations, Pub/Sub triggers the tagger service.

  9. For each table, the tagger service logs the results of the backup services and updates the backup state in the Cloud Storage metadata layer.
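
To make the fan-out in steps 2 and 3 concrete, the following minimal sketch lists the tables of one dataset and publishes one Pub/Sub message per table. It assumes the google-cloud-bigquery and google-cloud-pubsub client libraries; the project ID, topic name, and message fields are illustrative placeholders, not the deployment's actual contract.

```python
# Minimal sketch: list tables in scope and publish one backup request per table.
# Project, topic, dataset, and message fields are illustrative only.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-backup-project"        # assumption: host project for the solution
TABLES_TOPIC = "table-backup-requests"  # assumption: topic read by the configurator

bq_client = bigquery.Client(project=PROJECT_ID)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TABLES_TOPIC)

def dispatch(run_id: str, dataset_ref: str) -> None:
    """Publish one Pub/Sub message per table in the given dataset."""
    for table in bq_client.list_tables(dataset_ref):  # e.g. "my-project.sales_ds"
        message = {
            "run_id": run_id,  # run start time, used as the time travel reference point
            "table": f"{table.project}.{table.dataset_id}.{table.table_id}",
        }
        # Pub/Sub decouples the dispatcher from the downstream services and
        # retries delivery independently for each table request.
        publisher.publish(topic_path, data=json.dumps(message).encode("utf-8"))

dispatch(run_id="2024-09-17T03:00:00Z", dataset_ref="my-project.sales_ds")
```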

Products used

This reference architecture uses the following Google Cloud products:

  • BigQuery: An enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.
  • Cloud Logging: A real-time log management system with storage, search,analysis, and alerting.
  • Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
  • Cloud Run: A serverless compute platform that lets you runcontainers directly on top of Google's scalable infrastructure.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • Cloud Scheduler: A fully managed enterprise-grade cron job scheduler that lets you set up scheduled units of work to be executed at defined times or regular intervals.
  • Datastore: A highly scalable NoSQL database for your web and mobile applications.

Use cases

This section provides examples of use cases for which you can use this architecture.

Backup automation

As an example, your company might operate in a regulated industry and use BigQuery as the main data warehouse. Even when your company follows best practices in software development, code review, and release engineering, there's still a risk of data loss or data corruption due to human errors. In a regulated industry, you need to minimize this risk as much as possible.

Examples of these human errors include the following:

  • Accidental deletion of tables.
  • Data corruption due to erroneous data pipeline logic.

These types of human errors can usually be resolved with the time travel feature, which lets you recover data from up to seven days ago. In addition, BigQuery offers a fail-safe period, during which deleted data is retained in fail-safe storage for an additional seven days after the time travel window. That data is available for emergency recovery through Cloud Customer Care. However, if your company doesn't discover and fix such errors within this combined timeframe, the deleted data is no longer recoverable from its last stable state.
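
For illustration, the following sketch shows how a corrupted table could be restored from within the time travel window by materializing a point-in-time version of the data into a recovery table. The project, dataset, and table names are placeholders, and the approach assumes the table still exists and the requested timestamp is inside the time travel window.

```python
# Sketch: recover a table's earlier state from BigQuery time travel by
# materializing a point-in-time version. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

RECOVERY_SQL = """
CREATE OR REPLACE TABLE `my-project.recovery_ds.sales_recovered` AS
SELECT *
FROM `my-project.sales_ds.sales`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
"""

# Runs a standard query job; this works only while the requested point in time
# is still inside the table's time travel window (up to seven days).
client.query(RECOVERY_SQL).result()
```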

To mitigate this risk, we recommend that you execute regular backups for any BigQuery tables that can't be reconstructed from source data (for example, historical records or KPIs with evolving business logic).

Your company could use basic scripts to back up tens of tables. However, if you need to regularly back up hundreds or thousands of tables across the organization, you need a scalable automation solution that can do the following:

  • Handle different Google Cloud API limits.
  • Provide a standardized framework for defining backup policies.
  • Provide transparency and monitoring capabilities for the backup operations.

Backup policies

Your company might also require that the backup policies be defined by the following groups of people:

  • Data owners, who are most familiar with the tables and can set the appropriate table-level backup policies.
  • Data governance team, who ensure that a fallback policy is in place to cover any tables that don't have a table-level policy. The fallback policy ensures that certain datasets, projects, and folders are backed up to comply with your company's data retention regulations.

In the deployment for this reference architecture, there are two ways to define the backup policies for tables, and they can be used together:

  • Data owner configuration (decentralized): a table-level backup policy, which is manually attached to a table.

    • The data owner defines a table-level JSON file that's stored in a common bucket (an illustrative policy sketch follows this list).
    • Manual policies take precedence over fallback policies when the solution determines the backup policy of a table.
    • For details in the deployment, see Set table-level backup policies.
  • Organization default configuration (centralized): a fallback policy, which applies only to tables that don't have manually attached policies.

    • A data governance team defines a central JSON file in Terraform, as part of the solution.
    • The fallback policy offers default backup strategies at the folder, project, dataset, and table levels.
    • For details in the deployment, see Define fallback backup policies.
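
The following sketch illustrates what a table-level backup policy might look like as a JSON document, expressed here as a Python dict for readability. Only backup_cron and backup_time_travel_offset_days are named later in this document; the other fields and values are assumptions for illustration, and the deployment defines the authoritative schema.

```python
# Illustrative sketch of a table-level backup policy as a data owner might
# define it in a JSON file. Field names other than backup_cron and
# backup_time_travel_offset_days are hypothetical placeholders.
import json

table_level_policy = {
    "backup_cron": "0 3 * * *",            # daily at 03:00; exact cron format is deployment-specific
    "backup_time_travel_offset_days": 1,   # snapshot the state from one day before the run start
    "backup_method": "BigQuery Snapshot",  # assumption: could also be a Cloud Storage export
    "backup_storage_project": "backup-prj" # assumption: where snapshots or exports are stored
}

# The data owner would upload this JSON to the common policy bucket that the
# configurator service reads; the exact bucket layout is deployment-specific.
print(json.dumps(table_level_policy, indent=2))
```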

Backup versus replication

A backup process makes a copy of the table data from a certain point in time, so that it can be restored if the data is lost or corrupted. Backups can be run as a one-time occurrence or recurrently (through a scheduled query or workflow). In BigQuery, point-in-time backups can be achieved with snapshots. You can use snapshots to keep copies of the data beyond the seven-day time travel period within the same storage location as the source data. BigQuery snapshots are particularly helpful for recovering data after human errors that lead to data loss or corruption, rather than recovering from regional failures. BigQuery offers a Service Level Objective (SLO) of 99.9% to 99.99%, depending on the edition.

By contrast, replication is the continuous process of copying database changes to a secondary (or replica) database in a different location. In BigQuery, cross-region replication can help provide geo-redundancy by creating read-only copies of the data in secondary Google Cloud regions, which are different from the source data region. However, BigQuery cross-region replication isn't intended for use as a disaster recovery plan for total-region outage scenarios. For resilience against regional disasters, consider using BigQuery managed disaster recovery.

BigQuery cross-region replication provides a synchronized read-only copy of the data in a region that is close to the data consumers. These data copies enable collocated joins and avoid cross-regional traffic and cost. However, in cases of data corruption due to human error, replication alone can't help with recovery, because the corrupted data is automatically copied to the replica. In such cases, point-in-time backups (snapshots) are a better choice.

The following table shows a summarized comparison of backup methods and replication:

| Method | Frequency | Storage location | Use cases | Costs |
| --- | --- | --- | --- | --- |
| Backup (snapshots or Cloud Storage exports) | One-time or recurrent | Same as the source table data | Restore original data, beyond the time travel period | Snapshots incur storage charges for data changes in the snapshot only. Exports can incur standard storage charges. See Cost optimization. |
| Cross-region replication | Continuous | Remote | Create a replica in another region. One-time migrations between regions. | Incurs charges for storing data in the replica. Incurs data replication costs. |

Design considerations

This section provides guidance for you to consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost optimization, operational efficiency, and performance.

Security, privacy, and compliance

The deployment incorporates the following security measures in its design and implementation:

  • The network ingress setting for Cloud Run accepts only internal traffic, to restrict access from the internet. It also allows only authenticated users and service accounts to call the services.
  • Each Cloud Run service and Pub/Sub subscription uses a separate service account, which has only the required permissions assigned to it. This mitigates the risks associated with using one service account for the system and follows the principle of least privilege.

For privacy considerations, the solution doesn't collect or process personally identifiable information (PII). However, if the source tables have exposed PII, the backups taken of those tables also include this exposed data. The owner of the source data is responsible for protecting any PII in the source tables (for example, by applying column-level security, data masking, or redaction). The backups are secure only when the source data is secured. Another approach is to make sure that projects, datasets, or buckets that hold backup data with exposed PII have the required Identity and Access Management (IAM) policies that restrict access to only authorized users.

As a general-purpose solution, the reference deployment doesn't necessarily comply with a particular industry's specific requirements.

Reliability

This section describes features and design considerations for reliability.

Failure mitigation with granularity

To take backups of thousands of tables, you're likely to reach API limits for the underlying Google Cloud products (for example, snapshot and export operation limits for each project). However, if the backup of one table fails due to misconfiguration or other transient issues, that failure shouldn't affect the overall execution and the ability to back up other tables.

To mitigate potential failures, the reference deployment decouples the processing steps by using granular Cloud Run services and connecting them through Pub/Sub. If a table backup request fails at the final tagger service step, Pub/Sub retries only this step and it doesn't retry the entire process.

Breaking down the flow into multiple Cloud Run services, instead of multiple endpoints hosted under one Cloud Run service, helps provide granular control of each service configuration. The level of configuration depends on the service's capabilities and the APIs that it communicates with. For example, the dispatcher service executes once per run, but it requires a substantial amount of time to list all the tables within the BigQuery backup scope. Therefore, the dispatcher service requires higher time-out and memory settings. However, the Cloud Run service for BigQuery snapshots executes once per table in a single run, and completes in less time than the dispatcher service. Therefore, the Cloud Run service requires a different set of configurations at the service level.

Data consistency

Data consistency across tables and views is crucial for maintaining a reliable backup strategy. Because data is continuously updated and modified, backups taken at different times might capture different states of your dataset. These backups in different states can lead to inconsistencies when you restore data, particularly for tables that belong to the same functional dataset. For example, restoring a sales table to a point in time that's different from its corresponding inventory table could create a mismatch in available stock. Similarly, database views that aggregate data from multiple tables can be particularly sensitive to inconsistencies. Restoring these views without ensuring that the underlying tables are in a consistent state could lead to inaccurate or misleading results. Therefore, when you design your BigQuery backup policies and frequencies, it's imperative to consider this consistency and ensure that your restored data accurately reflects the real-world state of your dataset at a given point in time.

For example, in the deployment for this reference architecture, data consistency is controlled through the following two configurations in the backup policies. These configurations compute the exact table snapshot time through time travel, without necessarily backing up all tables at the same time.

  • backup_cron: Controls the frequency with which a table is backed up. The start timestamp of a run is used as a reference point for the time travel calculation for all tables that are backed up in this run.
  • backup_time_travel_offset_days: Controls how many days in the past should be subtracted from the reference point in time (run start time) to compute the exact time travel version of the table (a minimal sketch of this calculation follows this list).
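
The following minimal sketch shows this calculation under those two settings. The function name and the example values are illustrative.

```python
# Sketch: derive the snapshot point in time from the run start time and
# backup_time_travel_offset_days, so that every table in a run is captured
# at the same reference timestamp. The function name is illustrative.
from datetime import datetime, timedelta, timezone

def compute_snapshot_time(run_start: datetime, backup_time_travel_offset_days: int) -> datetime:
    """Return the time travel timestamp used for all tables in this run."""
    return run_start - timedelta(days=backup_time_travel_offset_days)

# Example: a run that starts at 2024-09-17 03:00 UTC with a 1-day offset backs
# up every table as it looked at 2024-09-16 03:00 UTC.
run_start = datetime(2024, 9, 17, 3, 0, tzinfo=timezone.utc)
print(compute_snapshot_time(run_start, backup_time_travel_offset_days=1))
```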

Automated backup restoration

Although this reference architecture focuses on backup automation at scale, you can consider restoring these backups in an automated way as well. This additional automation can provide similar benefits to those of the backup automation, including improved recovery efficiency and speed, with less downtime. Because the solution keeps track of all backup parameters and results through the tagger service, you could develop a similar architecture to apply the restoration operations at scale.

For example, you could create a solution based on an on-demand trigger that sends a scope of BigQuery data to a dispatcher service, which dispatches one request per table to a configurator service. The configurator service could fetch the backup history that you want for a particular table. The configurator service could then pass it on to either a BigQuery snapshot restoration service or a Cloud Storage restoration service to apply the restoration operation accordingly. Lastly, a tagger service could store the results of these operations in a state store. By doing so, the automated restoration framework can benefit from the same design objectives as the backup framework that's detailed in this document.

Cost optimization

The framework of this architecture provides backup policies that set the following parameters for overall cost optimization:

  • Backup method: The framework offers the following two backup methods:
    • BigQuery snapshots, which incur storage costs based on updated and deleted data compared to the base table. Therefore, snapshots are more cost-effective for tables that are append-only or have limited updates.
    • BigQuery exports to Cloud Storage, which incur standard storage charges. However, for large tables that follow a truncate-and-load approach, it's more cost-effective to back them up as exports in less expensive storage classes.
  • Snapshot expiration: The time to live (TTL) is set for a single table snapshot, to avoid incurring storage costs for the snapshot indefinitely. Storage costs can grow over time if tables have no expiration (see the sketch after this list).
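
The following sketch illustrates both backup methods with the BigQuery Python client: a snapshot created with an expiration (TTL) and an export to Cloud Storage. The resource names, the 30-day TTL, and the Avro format are illustrative assumptions, not values prescribed by the deployment.

```python
# Sketch of the two backup methods: a table snapshot with an expiration (TTL)
# and an export to Cloud Storage. Project, dataset, table, and bucket names
# are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1) Snapshot: charged only for data that later changes or is deleted in the
#    base table. The expiration avoids paying for the snapshot indefinitely.
SNAPSHOT_SQL = """
CREATE SNAPSHOT TABLE `my-project.backups_ds.sales_20240917`
CLONE `my-project.sales_ds.sales`
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY))
"""
client.query(SNAPSHOT_SQL).result()

# 2) Export: standard Cloud Storage charges apply; cheaper storage classes can
#    be configured on the destination bucket.
extract_job = client.extract_table(
    "my-project.sales_ds.sales",
    "gs://my-backup-bucket/sales/20240917/*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()
```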

Operational efficiency

This section describes features and considerations for operational efficiency.

Granular and scalable backup policies

One of the goals of this framework is operational efficiency by scaling up business output while keeping business input relatively low and manageable. For example, the output is a high number of regularly backed up tables, while the input is a small number of maintained backup policies and configurations.

In addition to allowing backup policies at the table level, the framework also allows for policies at the dataset, project, folder, and global level. This means that with a few configurations at higher levels (for example, the folder or project level), hundreds or thousands of tables can be backed up regularly, at scale.

Observability

With an automation framework, it's critical that you understand the statuses of the processes. For example, you should be able to find the information for the following common queries:

  • The backup policy that is used by the system for each table.
  • The backup history and backup locations of each table.
  • The overall status of a single run (the number of processed tables and failed tables).
  • The fatal errors that occurred in a single run, and the components or steps of the process in which they occurred.

To provide this information, the deployment writes structured logs to Cloud Logging at each execution step that uses a Cloud Run service. The logs include the input, output, and errors, along with other progress checkpoints. A log sink routes these logs to a BigQuery table. You can run a number of queries to monitor runs and get reports for common observability use cases. For more information about logs and queries in BigQuery, see View logs routed to BigQuery.
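
As an illustration of this approach, the following sketch writes a JSON log line to stdout from a Cloud Run service so that Cloud Logging captures it as a structured payload, which a log sink can then route to BigQuery. The field names are illustrative and don't reflect the deployment's actual log schema.

```python
# Sketch: emit a structured (JSON) log line from a Cloud Run service. Cloud Run
# forwards stdout to Cloud Logging, where JSON lines become structured payloads.
# Field names are illustrative placeholders.
import json
import sys

def log_step(run_id: str, table: str, step: str, status: str, error: str = "") -> None:
    entry = {
        "severity": "ERROR" if status == "FAILED" else "INFO",
        "run_id": run_id,
        "table": table,
        "step": step,      # e.g. dispatcher, configurator, snapshot, export, tagger
        "status": status,  # e.g. STARTED, COMPLETED, FAILED
        "error": error,
    }
    print(json.dumps(entry), file=sys.stdout)

log_step("2024-09-17T03:00:00Z", "my-project.sales_ds.sales", "tagger", "COMPLETED")
```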

Performance optimization

To handle thousands of tables at each run, the solution processes backup requests in parallel. The dispatcher service lists all of the tables that are included within the BigQuery backup scope and it generates one backup request per table at each run. This enables the application to process thousands of requests and tables in parallel, not sequentially.

Some of these requests might initially fail for temporary reasons, such as reaching the limits of the underlying Google Cloud APIs or experiencing network issues. Until the requests are completed, Pub/Sub automatically retries the requests with an exponential backoff retry policy. If there are fatal errors, such as invalid backup destinations or missing permissions, the errors are logged and the execution of that particular table request is terminated without affecting the overall run.
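
The following sketch shows one way such retry behavior could be configured: a Pub/Sub subscription with an exponential backoff retry policy and a long acknowledgement deadline. The project, topic, and subscription names and the specific values are illustrative assumptions.

```python
# Sketch: create a Pub/Sub subscription with exponential backoff so that failed
# (nacked or expired) backup requests are redelivered with increasing delays.
# Project, topic, and subscription names are placeholders.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-backup-project", "snapshoter-requests")
topic_path = subscriber.topic_path("my-backup-project", "snapshoter-topic")

retry_policy = pubsub_v1.types.RetryPolicy(
    minimum_backoff=duration_pb2.Duration(seconds=10),
    maximum_backoff=duration_pb2.Duration(seconds=600),
)

subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "ack_deadline_seconds": 600,  # give the Cloud Run service time to finish
        "retry_policy": retry_policy,
    }
)
```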

Limits

The following quotas and limits apply to this architecture.

For table snapshots, the following applies for each backup operation project that you specify:

  • One project can run up to 100 concurrent table snapshot jobs.
  • One project can run up to 50,000 table snapshot jobs per day.
  • One project can run up to 50 table snapshot jobs per table per day.

For details, see Table snapshots.

For export jobs (exports to Cloud Storage), the following applies:

  • You can export up to 50 TiB of data per day from a project for free, by using the shared slot pool.
  • One project can run up to 100,000 exports per day. To extend this limit, create a slot reservation.

For more information about extending these limits, see Export jobs.

Regarding concurrency limits, this architecture uses Pub/Sub to automatically retry requests that fail due to these limits, until they're served by the API. However, other limits, such as the number of operations per project per day, can be mitigated by either a quota-increase request or by spreading the backup operations (snapshots or exports) across multiple projects. To spread operations across projects, configure the backup policies as described in the deployment.

Deployment

To deploy this architecture, see Deploy scalable BigQuery backup automation.

What's next

Contributors

Author: Karim Wadie | Strategic Cloud Engineer

Other contributors:


Last updated 2024-09-17 UTC.