Understand reliability

This document provides an understanding of BigQuery reliability features, such as availability, durability, data consistency, consistency of performance, and data recovery, and reviews error handling considerations.

This introduction helps you address three primary considerations:

  • Determine whether BigQuery is the right tool for your job.
  • Understand the dimensions of BigQuery reliability.
  • Identify the reliability requirements of specific use cases.

Select BigQuery

BigQuery is a fully managed enterprise data warehouse built to store and analyze massive datasets. It provides a way to ingest, store, read, and query megabytes to petabytes of data with consistent performance without having to manage any of the underlying infrastructure. Because of its power and performance, BigQuery is well suited to be used in a range of solutions. Some of these are documented in detail as reference patterns.

Generally, BigQuery is very well suited for workloads where large amounts of data are being ingested and analyzed. Specifically, it can be effectively deployed for use cases such as real-time and predictive data analytics (with streaming ingestion and BigQuery ML), anomaly detection, and other use cases where analyzing large volumes of data with predictable performance is key. However, if you are looking for a database to support Online Transaction Processing (OLTP) style applications, you should consider other Google Cloud services such as Spanner, Cloud SQL, or Bigtable that may be better suited for these use cases.

Dimensions of reliability in BigQuery

Availability

Availability defines the user's ability to read data from BigQuery or write data to it. BigQuery is built to make both of these highly available with a 99.99% SLA. There are two components involved in both operations:

  • The BigQuery service
  • Compute resources required to execute the specific query

Reliability of the service is a function of the specific BigQuery API being used to retrieve the data. The availability of compute resources depends on the capacity available to the user at the time when the query is run. See Understand slots for more information about the fundamental unit of compute for BigQuery and the resulting slot resource economy.

Durability

Durability is discussed in the Implementing SLOs chapter of the SRE Workbook and is described as the "proportion of data that can be successfully read."

Data consistency

Consistency defines the expectations that users have for how the data can be queried once it's written or modified. One aspect of data consistency is ensuring "exactly-once" semantics for data ingestion. For more information, see Retry failed job insertions.

Consistency of performance

In general, performance can be expressed in two dimensions. Latency is a measure of the execution time of long data retrieval operations such as queries. Throughput is a measure of how much data BigQuery can process during a specific period of time. Due to BigQuery's multi-tenant, horizontally scalable design, its throughput can scale up to arbitrary data sizes. The relative importance of latency and throughput is determined by the specific use case.

Data recovery

Two ways to measure the ability to recover data after an outage are:

  • Recovery Time Objective (RTO). How long data can be unavailable after an incident.
  • Recovery Point Objective (RPO). How much of the data collected prior to the incident can acceptably be lost.

These considerations are specifically relevant in the unlikely case that a zoneor region experiences a multi-day or destructive outage.

Disaster planning

While the term "disaster" may evoke visions of natural disasters, this section addresses specific failures ranging from the loss of an individual machine through the catastrophic loss of a region. The former are everyday occurrences that BigQuery handles automatically, while the latter is something that customers may need to design their architecture to handle. Understanding at what scope disaster planning crosses over to customer responsibility is important.

BigQuery offers an industry-leading 99.99% uptime SLA. This is made possible by BigQuery's regional architecture that writes data in two different zones and provisions redundant compute capacity. It is important to keep in mind that the BigQuery SLA is the same for regions, for example us-central1, and multi-regions, for example US.

Automatic scenario handling

Because BigQuery is a regional service, handling the loss of a machine or even an entire zone is BigQuery's responsibility, and it does so automatically. The fact that BigQuery is built on top of zones is abstracted from users.

Loss of a machine

Machine failures are an everyday occurrence at the scale at which Google operates. BigQuery is designed to handle machine failures automatically without any impact to the containing zone.
Under the hood, execution of a query is broken up into small tasks that can be dispatched in parallel to many machines. The sudden loss or degradation of machine performance is handled automatically by redispatching work to a different machine. Such an approach is crucial to reducing tail latency.

BigQuery uses Reed–Solomon encoding to efficiently and durably store a zonal copy of your data. In the highly unlikely event that multiple machine failures cause the loss of a zonal copy, data is also stored in the same fashion in at least one other zone. In such a case, BigQuery detects the problem and makes a new zonal copy of the data to restore redundancy.

Loss of a zone

The underlying availability of compute resources in any given zone is not sufficient to meet BigQuery's 99.99% uptime SLA. Hence BigQuery provides automatic zonal redundancy for both data and compute slots. While short-lived zonal disruptions are not common, they do happen. BigQuery automation, however, will route new queries to another zone within a minute of any severe disruption. Already in-flight queries may not immediately recover, but newly issued queries will. This would manifest as in-flight queries taking a long time to finish while newly issued queries complete quickly.

Even if a zone were to be unavailable for a longer period of time, no data loss would occur because BigQuery synchronously writes data to two zones. So even in the face of zonal loss, customers will not experience a service disruption.

Types of failures

There are two types of failures: soft failures and hard failures.

Soft failure is an operational deficiency where hardware is not destroyed. Examples include power failure, network partition, or a machine crash. In general, BigQuery should never lose data in the event of a soft failure.

Hard failure is an operational deficiency where hardware is destroyed. Hard failures are more severe than soft failures. Hard failure examples include damage from floods, terrorist attacks, earthquakes, and hurricanes.

Availability and durability

When you create a BigQuery dataset, you select a location in which to store your data. This location is one of the following:

  • A region: a specific geographical location, such as Iowa (us-central1) or Montréal (northamerica-northeast1).
  • A multi-region: a large geographic area that contains two or more geographic places, such as the United States (US) or Europe (EU).

In either case, BigQuery automatically stores copies of your data in two different Google Cloud zones within a single region in the selected location.

In addition to storage redundancy, BigQuery also maintains redundant compute capacity across multiple zones. By combining redundant storage and compute across multiple availability zones, BigQuery provides both high availability and durability.

Note: Selecting a multi-region location does not provide cross-region replication or regional redundancy. Data is stored in a single region within the geographic location.

In the event of a machine-level failure, BigQuery continues to run with no more than a few milliseconds of delay. All currently running queries continue processing. In the event of either a soft or hard zonal failure, no data loss is expected. However, currently running queries might fail and need to be resubmitted. A soft zonal failure, such as one resulting from a power outage, a destroyed transformer, or a network partition, is a well-tested path and is automatically mitigated within a few minutes.

A soft regional failure, such as a region-wide loss of network connectivity, results in loss of availability until the region is brought back online, but it doesn't result in lost data. A hard regional failure, for example, if a disaster destroys the entire region, could result in loss of data stored in that region. BigQuery does not automatically provide a backup or replica of your data in another geographic region. You can use cross-region dataset replication or managed disaster recovery to enhance your resiliency to hard regional failures.

To learn more about BigQuery dataset locations, see Location considerations.

Scenario: Loss of region

BigQuery does not offer durability or availability in the extraordinarily unlikely and unprecedented event of physical region loss. This is true for both regions and multi-regions. Maintaining durability and availability under such a scenario therefore requires customer planning. In the case of temporary loss, such as a network outage, consider redundant availability if BigQuery's 99.99% SLA is not sufficient.

To avoid data loss in the face of destructive regional loss, you need to back up data to another geographic location. For example, you could use cross-region dataset replication to continuously replicate your data to a geographically distinct region.

In the case of BigQuery multi-regions, you should avoid backing up to regions within the scope of the multi-region. See BigQuery locations for information about the scope of the multi-regions. For example, if you are backing up data from the US multi-region, avoid choosing an overlapping region such as us-central1, given the chance of correlated failure during a disaster.

To avoid extended unavailability, you need to have both data replicated and slots provisioned in two geographically separate BigQuery locations. You can use managed disaster recovery to automatically provision slots in a secondary region and control failover of your workloads from one region to another.

Scenario: Accidental deletion or data corruption

By virtue of BigQuery's multiversion concurrency control architecture, BigQuery supports time travel. With this feature you can query data from any point in time over the last seven days. This allows for self-service restoration of any data that has been mistakenly deleted, modified, or corrupted within a seven-day window. Time travel even works on tables that have been deleted.
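
As an illustration, the following sketch runs a time travel query with the Python client library. The dataset and table names are placeholders, and the one-hour offset is an arbitrary point within the time travel window.

```python
# Minimal sketch: query a table as it existed one hour ago using time travel.
# `my_dataset.my_table` is a placeholder; the offset must fall within the
# time travel window.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT *
    FROM `my_dataset.my_table`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(sql).result():  # result() waits for the query job
    print(row)
```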

BigQuery also supports the ability to snapshot tables. With this feature you can explicitly back up data within the same region for longer than the seven-day time travel window. A snapshot is purely a metadata operation and results in no additional storage bytes. While this can add protection against accidental deletion, it does not increase the durability of the data.
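
For example, a snapshot can be created with a DDL statement issued through the Python client. This is only a sketch; the table, snapshot, and expiration values are placeholders.

```python
# Minimal sketch: create a table snapshot with a 30-day expiration.
# `my_dataset.my_table` and the snapshot name are placeholders; the snapshot
# must reside in the same region as the base table.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
    CREATE SNAPSHOT TABLE `my_dataset.my_table_snapshot`
    CLONE `my_dataset.my_table`
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    )
"""
client.query(ddl).result()  # DDL statements run as ordinary query jobs
```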

Use case: Real-time analytics

In this use case, streaming data is being ingested from endpoint logs into BigQuery continuously. Protecting against extended BigQuery unavailability for the entire region requires continuously replicating data and provisioning slots in a different region. Given that the use of Pub/Sub and Dataflow in the ingestion path makes the architecture resilient to BigQuery unavailability, this high level of redundancy is likely not worth the cost.

Assume the user has configured BigQuery data in us-east4 to be exported nightly by using extract jobs to Cloud Storage under the Archive storage class in us-central1. This provides a durable backup in case of catastrophic data loss in us-east4. In this case, the Recovery Point Objective (RPO) is 24 hours, as the last exported backup can be up to 24 hours old in the worst case. The Recovery Time Objective (RTO) is potentially days, as data needs to be restored from the Cloud Storage backup to BigQuery in us-central1. If BigQuery is to be provisioned in a different region from where backups are placed, data needs to be transferred to this region first. Also note that unless you have purchased redundant slots in the recovery region in advance, there may be an additional delay in getting the required BigQuery capacity provisioned, depending on the quantity requested.
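
As a rough sketch of the nightly export described above, the following code runs an extract job with the Python client library. The project, dataset, table, and bucket names are placeholders, and the bucket is assumed to already exist in us-central1 with Archive as its default storage class (the storage class is a bucket property, not an extract job setting).

```python
# Minimal sketch of a nightly backup: export a table located in us-east4 to a
# Cloud Storage bucket in us-central1. All names are placeholders; the bucket
# is assumed to default to the Archive storage class.
from google.cloud import bigquery

client = bigquery.Client()

destination_uri = "gs://my-backup-bucket-us-central1/daily/my_table-*.avro"
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)

extract_job = client.extract_table(
    "my_project.my_dataset.my_table",
    destination_uri,
    location="us-east4",  # extract jobs run in the source table's location
    job_config=job_config,
)
extract_job.result()  # waits for the export to complete
```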

Use case: Batch data processing

For this use case it is business critical that a daily report is completed by a fixed deadline to be sent to a regulator. Implementing redundancy by running two separate instances of the entire processing pipeline is likely worth the cost. Using two separate regions, for example us-west1 and us-east4, provides geographic separation and two independent failure domains in case of extended unavailability of a region or even the unlikely event of a permanent region loss.

Assuming we need the report to be delivered exactly once, we need to reconcile the expected case of both pipelines finishing successfully. A reasonable strategy is simply picking the result from the pipeline that finishes first, for example by notifying a Pub/Sub topic on successful completion. Alternatively, overwrite the result and re-version the Cloud Storage object. If the pipeline that finishes later writes corrupt data, you can recover by restoring the version written by the first pipeline from Cloud Storage.
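
The following is a minimal sketch of the "first pipeline to finish wins" notification under these assumptions; the project, topic, date, and object names are placeholders, and the message format is only illustrative.

```python
# Minimal sketch: each regional pipeline publishes a completion message to a
# shared Pub/Sub topic; downstream delivery is triggered by whichever message
# arrives first. Project, topic, and object names are placeholders.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "daily-report-complete")

payload = json.dumps({
    "report_date": "2025-01-31",
    "region": "us-west1",
    "report_uri": "gs://my-report-bucket/reports/2025-01-31.pdf",
}).encode("utf-8")

future = publisher.publish(topic_path, payload)
print("Published completion notification:", future.result())  # message ID
```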

Error handling

The following are best practices for addressing errors that affect reliability.

Retry failed API requests

Clients of BigQuery, including client libraries and partner tools, should use truncated exponential backoff when issuing API requests. This means that if a client receives a system error or a quota error, it should retry the request up to a certain number of times, but with a random and increasing backoff interval.

Employing this method of retries makes your application much more robust in the face of errors. Even under normal operating conditions, you can expect on the order of one in ten thousand requests to fail, as described in BigQuery's 99.99% availability SLA. Under abnormal conditions, this error rate may increase, but if errors are randomly distributed, the strategy of exponential backoff can mitigate all but the most severe cases.

If a request fails persistently with a 5XX error, you should escalate to Google Cloud Support. Be sure to clearly communicate the impact the failure is having on your business so that the issue can be triaged correctly. If, on the other hand, a request persistently fails with a 4XX error, the problem should be addressable by changes to your application. Read the error message for details.

Exponential backoff logic example

Exponential backoff logic retries a query or request by increasing the wait time between retries up to a maximum backoff time. For example:

  1. Make a request to BigQuery.

  2. If the request fails, wait 1 + random_number_milliseconds seconds and retry the request.

  3. If the request fails, wait 2 + random_number_milliseconds seconds and retry the request.

  4. If the request fails, wait 4 + random_number_milliseconds seconds and retry the request.

  5. And so on, up to a maximum_backoff time.

  6. Continue to wait and retry up to a maximum number of retries, but don't increase the wait period between retries.

Note the following:

  • The wait time is min(((2^n)+random_number_milliseconds), maximum_backoff), with n incremented by 1 for each iteration (request).

  • random_number_milliseconds is a random number of milliseconds less than or equal to 1000. This randomization helps to avoid situations where many clients are synchronized and all retry simultaneously, sending requests in synchronized waves. The value of random_number_milliseconds is recalculated after each retry request.

  • The maximum backoff interval (maximum_backoff) is typically 32 or 64 seconds. The appropriate value for maximum_backoff depends on the use case.

The client can continue retrying after it reaches the maximum backoff time. Retries after this point don't need to continue increasing backoff time. For example, if the client uses a maximum backoff time of 64 seconds, then after reaching this value the client can continue to retry every 64 seconds. At some point, clients should be prevented from retrying indefinitely.

The wait time between retries and the number of retries depend on your use case and network conditions.
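
As a minimal sketch, the following Python helper implements the formula above. The retry limit and the choice to treat every exception as retryable are illustrative assumptions; a real client should retry only transient (for example, 5XX or quota) errors.

```python
# Minimal sketch of truncated exponential backoff with jitter, following
# min((2^n) + random_number_milliseconds, maximum_backoff).
import random
import time


def call_with_backoff(request_fn, max_retries=8, maximum_backoff=64.0):
    """Calls request_fn, retrying failures with truncated exponential backoff."""
    for n in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:  # in practice, catch only retryable errors
            if n == max_retries:
                raise
            # random.uniform(0, 1) plays the role of random_number_milliseconds.
            wait = min((2 ** n) + random.uniform(0, 1), maximum_backoff)
            time.sleep(wait)
```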

Retry failed job insertions

If exactly-once insertion semantics are important for your application, there are additional considerations when it comes to inserting jobs. How to achieve at-most-once semantics depends on which WriteDisposition you specify. The write disposition tells BigQuery what it should do when encountering existing data in a table: fail, overwrite, or append.

With a WRITE_EMPTY or WRITE_TRUNCATE disposition, this is achieved by simply retrying any failed job insertion or execution. This is because all rows ingested by a job are atomically written to the table.

With a WRITE_APPEND disposition, the client needs to specify the job ID to guard against a retry appending the same data a second time. This works because BigQuery rejects job creation requests that attempt to use an ID from a previous job. This achieves at-most-once semantics for any given job ID. You can then achieve exactly-once by retrying under a new predictable job ID once you've confirmed with BigQuery that all previous attempts have failed.

In some cases, the API client or HTTP client might not receive the confirmation that the job is inserted due to transient issues or network interruptions. When the insertion is retried, that request fails with status=ALREADY_EXISTS (code=409 and reason="duplicate"). The existing job status can be retrieved with a call to jobs.get. After the status of the existing job is retrieved, the caller can determine whether a new job with a new job ID should be created.
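
The following sketch shows one way to apply this pattern with the Python client library: a load job is inserted under a caller-chosen job ID, and a 409 Conflict on retry is resolved by fetching the existing job instead of creating a duplicate. The table, Cloud Storage URI, and job ID are placeholders.

```python
# Minimal sketch: a WRITE_APPEND load job under a deterministic job ID so that
# retries cannot append the same data twice. All names are placeholders.
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

client = bigquery.Client()

job_id = "daily-load-2025-01-31"  # deterministic ID for this logical load
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

try:
    load_job = client.load_table_from_uri(
        "gs://my-bucket/data/2025-01-31/*.json",
        "my_project.my_dataset.my_table",
        job_id=job_id,
        job_config=job_config,
    )
except Conflict:
    # The job ID was already used: a previous attempt was accepted, so fetch
    # that job instead of creating a duplicate.
    load_job = client.get_job(job_id)

load_job.result()  # raises if the job failed; retry with a new job ID if so
```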

Use cases and reliability requirements

BigQuery might be a critical component of a variety of architectures. Depending on the use case and architecture deployed, different availability, performance, or other reliability requirements might need to be met. For the purposes of this guide, let's select two primary use cases and architectures to discuss in detail.

Real-time analytics

The first example is an event data processing pipeline. In this example, log events from endpoints are ingested using Pub/Sub. From there, a streaming Dataflow pipeline performs some operations on the data before writing it into BigQuery using the Storage Write API. The data is then used both for ad hoc querying, for example to recreate sequences of events that may have resulted in specific endpoint outcomes, and for feeding near real-time dashboards to allow the detection of trends and patterns in the data through visualization.

This example requires you to consider multiple aspects of reliability. Because the end-to-end data freshness requirements are quite high, latency of the ingestion process is critical. Once data is written to BigQuery, reliability is perceived as the ability of users to issue ad hoc queries with consistent and predictable latency, and as the assurance that dashboards using the data reflect the latest available information.

Batch data processing

The second example is a batch processing architecture based around regulatory compliance in the financial services industry. A key requirement is to deliver daily reports to regulators by a fixed nightly deadline. As long as the nightly batch process that generates the reports completes by this deadline, it is considered sufficiently fast.

Data needs to be made available in BigQuery and joined with other data sources for dashboarding, analysis, and ultimately generation of a PDF report. Having these reports delivered on time and without error is a critical business requirement. As such, it is key to ensure both the reliability of data ingestion and the correct production of the report in a consistent timeframe to meet the required deadlines.
