Disaster recovery use cases: locality-restricted data analytics applications
This document is part of a series that discusses disaster recovery (DR) in Google Cloud. This document describes how to apply the locality restrictions from the document, Architecting disaster recovery for locality-restricted workloads, to data analytics applications. Specifically, this document describes how the components that you use in a data analytics platform fit into a DR architecture that meets locality restrictions that your applications or data might be subject to.
The series consists of the following parts:
- Disaster recovery planning guide
- Disaster recovery building blocks
- Disaster recovery scenarios for data
- Disaster recovery scenarios for applications
- Architecting disaster recovery for locality-restricted workloads
- Architecting disaster recovery for cloud infrastructure outages
- Disaster recovery use cases: locality-restricted data analytics applications (this document)
This document is intended for systems architects and IT managers. It assumes that you have the following knowledge and skills:
- Basic familiarity with Google Cloud data analytics services such as Dataflow, Dataproc, Cloud Composer, Pub/Sub, Cloud Storage, and BigQuery.
- Basic familiarity with Google Cloud networking services such as Cloud Interconnect and Cloud VPN.
- Familiarity with disaster recovery planning.
- Familiarity with Apache Hive and Apache Spark.
Locality requirements for a data analytics platform
Data analytics platforms are typically complex, multi-tiered applications that store data at rest. These applications produce events that are processed and stored in your analytics system. Both the application itself and the data stored in the system might be subject to locality regulations. These regulations vary not just across countries, but also across industry verticals. Therefore, you should have a clear understanding of your data locality requirements before you start to design your DR architecture.
You can determine whether your data analytics platform has any locality requirements by answering the following two questions:
- Does your application need to be deployed to a specific region?
- Is your data at rest restricted to a specific region?
If you answer "no" to both questions, your data analytics platform doesn'thave any locality-specific requirements. Because your platform doesn't havelocality requirements, follow the DR guidance for compliant services andproducts outlined in theDisaster recovery planning guide.
However, if you answer "yes" to either of the questions, your application islocality-restricted. Because your analytics platform is locality-restricted, youmust ask the following question:Can you use encryption techniques to mitigate data-at-rest requirements?
If you're able to use encryption techniques, you can use the multi-regionaland dual-regional services of Cloud External Key Manager and Cloud Key Management Service. You can then alsofollow the standard high availability and disaster recovery (HA/DR) techniques outlined inDisaster recovery scenarios for data.
If you are unable to use encryption techniques, you must use custom solutionsor partner offerings for DR planning. For more information about techniques foraddressing locality restrictions for DR scenarios, seeArchitecting disaster recovery for locality-restricted workloads.
Components in a data analytics platform
When you understand locality requirements, the next step is to understand the components that your data analytics platform uses. Some common components of a data analytics platform are databases, data lakes, processing pipelines, and data warehouses, as described in the following list:
- Google Cloud has a set of database services that fit different use cases.
Data lakes, data warehouses, and processing pipelines can have slightly differing definitions. This document uses a set of definitions that reference Google Cloud services:
- A data lake is a scalable and secure platform for ingesting and storing data from multiple systems. A typical data lake might use Cloud Storage as the central storage repository.
- A processing pipeline is an end-to-end process that takes data or events from one or more systems, transforms that data or those events, and loads the result into another system. The pipeline could follow either an extract, transform, and load (ETL) or an extract, load, and transform (ELT) process. Typically, the system into which the processed data is loaded is a data warehouse. Pub/Sub, Dataflow, and Dataproc are commonly used components of a processing pipeline.
- A data warehouse is an enterprise system used for the analysis and reporting of data, which usually comes from an operational database. BigQuery is a commonly used data warehouse system that runs on Google Cloud.
Depending on the locality requirements and the data analytics components that you are using, the actual DR implementation varies. The following sections demonstrate this variation with two use cases.
Batch use case: a periodic ETL job
The first use case describes a batch process in which a retail platform periodically collects sales events as files from various stores and then writes the files to a Cloud Storage bucket. The following diagram illustrates the data analytics architecture for this retailer's batch job.
The architecture includes the following phases and components:
- The ingestion phase consists of the stores sending their point-of-sale (POS) data to Cloud Storage. This ingested data awaits processing.
- The orchestration phase uses Cloud Composer to orchestrate the end-to-end batch pipeline.
- The processing phase involves an Apache Spark job running on a Dataproc cluster. The Apache Spark job performs an ETL process on the ingested data. This ETL process provides business metrics.
- The data lake phase takes the processed data and stores information in the following components:
- The processed data is commonly stored in Cloud Storage in columnar formats such as Parquet and ORC because these formats allow efficient storage and faster access for analytical workloads.
- The metadata about the processed data (such as databases, tables, columns, and partitions) is stored in the Hive metastore service supplied by Dataproc Metastore.
In locality-restricted scenarios, it might be difficult to provide redundant processing and orchestration capacity to maintain availability. Because the data is processed in batches, the processing and orchestration pipelines can be recreated, and batch jobs could be restarted after a disaster scenario is resolved. Therefore, for disaster recovery, the core focus is on recovering the actual data: the ingested POS data, the processed data stored in the data lake, and the metadata stored in the data lake.
Ingestion into Cloud Storage
Your first consideration should be the locality requirements for the Cloud Storage bucket used to ingest the data from the POS system. Use this locality information when considering the following options:
- If the locality requirements allow data at rest to reside in one of the multi-region or dual-region locations, choose the corresponding location type when you create the Cloud Storage bucket. The location type that you choose defines which Google Cloud regions are used to replicate your data at rest. If an outage occurs in one of the regions, data that resides in multi-region or dual-region buckets is still accessible.
- Cloud Storage also supports both customer-managed encryption keys (CMEK) and customer-supplied encryption keys (CSEK). Some locality rules allow data at rest to be stored in multiple locations when you use CMEK or CSEK for key management. If your locality rules allow the use of CMEK or CSEK, you can design your DR architecture to use Cloud Storage regions.
- Your locality requirements might not permit you to use either location types or encryption-key management. When you can't use location types or encryption-key management, you can use the gcloud storage rsync command to synchronize data to another location, such as on-premises storage or storage solutions from another cloud provider (a minimal sketch follows this list).
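The following commands are a minimal sketch of the first and third options: creating the ingestion bucket in a dual-region location and, where location types aren't permitted, synchronizing the bucket to another location. The bucket name, project, dual-region code, and destination path are placeholders.

```
# Create the ingestion bucket in a predefined dual-region location (eur4 in this
# sketch) so that data at rest is replicated across two regions.
gcloud storage buckets create gs://example-pos-ingest --project=example-project --location=eur4

# If location types and CMEK or CSEK aren't permitted, periodically synchronize
# the bucket to another location instead, such as a mounted on-premises file system.
gcloud storage rsync gs://example-pos-ingest /mnt/onprem-backup/pos-ingest --recursive
```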
If a disaster occurs, the ingested POS data in the Cloud Storage bucket might have data that has not yet been processed and imported into the data lake. Alternatively, the POS data might not be able to be ingested into the Cloud Storage bucket. For these cases, you have the following disaster recovery options:
Let the POS system retry. In the event that the system is unable to write the POS data to Cloud Storage, the data ingestion process fails. To mitigate this situation, you can implement a retry strategy to allow the POS system to retry the data ingestion operation (a minimal retry sketch follows these options). Because Cloud Storage provides high durability, data ingestion and the subsequent processing pipeline will resume with little to no data loss after the Cloud Storage service resumes.
Make copies of ingested data. Cloud Storage supports both multi-region and dual-region location types. Depending on your data locality requirements, you might be able to set up a multi-region or dual-region Cloud Storage bucket to increase data availability. You can also use tools like the gcloud storage command in the Google Cloud CLI to synchronize data between Cloud Storage and another location.
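As an illustration of the retry option, the following sketch retries the upload with increasing back-off. The file and bucket names are placeholders, and a production POS system would typically implement this logic in its own export tooling.

```
# Retry the POS file upload a few times with increasing back-off if
# Cloud Storage is temporarily unreachable. File and bucket names are placeholders.
for attempt in 1 2 3 4 5; do
  if gcloud storage cp /var/pos/export/sales-20240713.csv gs://example-pos-ingest/; then
    break
  fi
  sleep $(( 2 ** attempt * 5 ))  # wait 10s, 20s, 40s, ... before the next attempt
done
```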
Orchestration and processing of ingested POS data
In the architecture diagram for the batch use case, Cloud Composer carries out the following steps:
- Validates that new files have been uploaded to the Cloud Storage ingestion bucket.
- Starts an ephemeral Dataproc cluster.
- Starts an Apache Spark job to process the data.
- Deletes the Dataproc cluster at the end of the job.
Cloud Composer uses directed acyclic graph (DAG) files that define this series of steps. These DAG files are stored in a Cloud Storage bucket that is not shown in the architecture diagram. In terms of dual-region and multi-region locality, the DR considerations for the DAG files bucket are the same as the ones discussed for the ingestion bucket.
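If the orchestration environment itself is lost, the same steps can be recreated outside of Cloud Composer. The following sketch shows a gcloud equivalent of the DAG's cluster lifecycle; the cluster, bucket, jar, and class names are placeholders, and the actual DAG would use the corresponding Airflow operators instead.

```
# Recreate the ephemeral processing stage manually: create a cluster, run the
# Spark ETL job, then delete the cluster. All resource names are placeholders.
gcloud dataproc clusters create example-etl-cluster --region=europe-west3 --num-workers=2

gcloud dataproc jobs submit spark \
    --cluster=example-etl-cluster \
    --region=europe-west3 \
    --jars=gs://example-dr-scripts/pos-etl.jar \
    --class=com.example.PosEtlJob \
    -- gs://example-pos-ingest/ gs://example-data-lake/processed/

gcloud dataproc clusters delete example-etl-cluster --region=europe-west3 --quiet
```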
Dataproc clusters are ephemeral. That is, the clusters only exist for as long as the processing stage runs. This ephemeral nature means that you don't have to explicitly do anything for DR in regard to the Dataproc clusters.
Data lake storage with Cloud Storage
The Cloud Storage bucket that you use for the data lake has the same locality considerations as the ones discussed for the ingestion bucket: dual-region and multi-region considerations, the use of encryption, and the use of the gcloud storage rsync command.
When designing the DR plan for your data lake, think about the following aspects:
Data size. The volume of data in a data lake can be large. Restoring a large volume of data takes time. In a DR scenario, you need to consider the impact that the data lake's volume has on the amount of time that it takes to restore data to a point that meets the following criteria:
- Your application is available.
- You meet your recovery time objective (RTO) value.
- You get the data checkpoint that you need to meet your recovery point objective (RPO) value.
For the current batch use case, Cloud Storage is used for the data lake location and provides high durability. However, ransomware attacks have been on the rise. To ensure that you have the ability to recover from such attacks, it would be prudent to follow the best practices that are outlined in How Cloud Storage delivers 11 nines of durability.
Data dependency. Data in data lakes is usually consumed by other components of a data analytics system, such as a processing pipeline. In a DR scenario, the processing pipeline and the data on which it depends should share the same RTO. In this context, focus on how long you can have the system be unavailable.
Data age. Historical data might allow for a higher RPO. This type of data might have already been analyzed or processed by other data analytics components and might have been persisted in another component that has its own DR strategies. For example, sales records that are exported to Cloud Storage are imported daily to BigQuery for analysis. With proper DR strategies for BigQuery, historical sales records that have been imported to BigQuery might have lower recovery objectives than those that haven't been imported.
Data lake metadata storage with Dataproc Metastore
Dataproc Metastore is an Apache Hive metastore that is fully managed, highly available, autohealing, and serverless. The metastore provides data abstraction and data discovery features. The metastore can be backed up and restored in the case of a disaster. The Dataproc Metastore service also lets you export and import metadata. You can add a task to export the metastore and maintain an external backup along with your ETL job.
If you encounter a situation where there is a metadata mismatch, you can recreate the metastore from the data itself by using the MSCK command.
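For example, a metastore export task and a metadata repair might look like the following sketch; the service, cluster, bucket, and table names are placeholders.

```
# Export the Dataproc Metastore metadata to Cloud Storage as part of the ETL job.
gcloud metastore services export gcs example-metastore \
    --location=europe-west3 \
    --destination-folder=gs://example-dr-backups/metastore/

# After a restore, if the metadata no longer matches the files in the data lake,
# rebuild the partition metadata from the data itself with the MSCK command.
gcloud dataproc jobs submit hive \
    --cluster=example-recovery-cluster \
    --region=europe-west3 \
    --execute="MSCK REPAIR TABLE sales_events;"
```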
Streaming use case: change data capture using ELT
The second use case describes a retail platform that uses change data capture (CDC) to update backend inventory systems and to track real-time sales metrics. The following diagram shows an architecture in which events from a transactional database, such as Cloud SQL, are processed and then stored in a data warehouse.
The architecture includes the following phases and components:
- The ingestion phase consists of the incoming change events being pushed to Pub/Sub. As a message delivery service, Pub/Sub is used to reliably ingest and distribute streaming analytics data. Pub/Sub has the additional benefits of being both scalable and serverless.
- The processing phase involves using Dataflow to perform an ELT process on the ingested events.
- The data warehouse phase uses BigQuery to store the processed events. The merge operation inserts or updates a record in BigQuery. This operation allows the information stored in BigQuery to stay up to date with the transactional database.
While this CDC architecture doesn't rely on Cloud Composer, some CDC architectures require Cloud Composer to orchestrate the integration of the stream into batch workloads. In those cases, the Cloud Composer workflow implements integrity checks, backfills, and projections that can't be done in real time because of latency constraints. DR techniques for Cloud Composer are discussed in the batch processing use case earlier in this document.
For a streaming data pipeline, you should consider the following when planning your DR architecture:
- Pipeline dependencies. Processing pipelines take input from one or more systems (the sources). Pipelines then process, transform, and store these inputs into some other systems (the sinks). It's important to design your DR architecture to achieve your end-to-end RTO. You need to ensure that the RPO of the data sources and sinks allows you to meet the RTO. In addition to designing your cloud architecture to meet your locality requirements, you'll also need to design your originating CDC sources to allow your end-to-end RTO to be met.
- Encryption and locality. If encryption lets you mitigate locality restrictions, you can use Cloud KMS to attain the following goals (a minimal key-creation sketch follows this list):
- Manage your own encryption keys.
- Take advantage of the encryption capability of individual services.
- Deploy services in regions that would not otherwise be available due to locality restrictions.
- Locality rules on data in motion. Some locality rules might apply only to data at rest but not to data in motion. If your locality rules don't apply to data in motion, design your DR architecture to use resources in other regions to improve the recovery objectives. You can supplement the regional approach by integrating encryption techniques.
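As a minimal sketch of the encryption option, the following commands create a regional Cloud KMS key ring and key that services in this pipeline could use as a CMEK key; the key ring, key, and region names are placeholders.

```
# Create a regional key ring and an encryption key to use as CMEK.
# Resource names and the region are placeholders.
gcloud kms keyrings create analytics-dr-keyring --location=europe-west3
gcloud kms keys create analytics-dr-key \
    --keyring=analytics-dr-keyring \
    --location=europe-west3 \
    --purpose=encryption
```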
Ingestion into Pub/Sub
If you don't have locality restrictions, you can publish messages to the global Pub/Sub endpoint. Global Pub/Sub endpoints are visible and accessible from any Google Cloud location.
If your locality requirements allow the use of encryption, it's possible to configure Pub/Sub to achieve a similar level of high availability as global endpoints. Although Pub/Sub messages are encrypted with Google-owned and Google-managed encryption keys by default, you can configure Pub/Sub to use CMEK instead. By using Pub/Sub with CMEK, you are able to meet locality rules about encryption while still maintaining high availability.
Some locality rules require messages to stay in a specific location even after encryption. If your locality rules have this restriction, you can specify the message storage policy of a Pub/Sub topic and restrict it to a region. If you use this message storage approach, messages that are published to a topic are never persisted outside of the set of Google Cloud regions that you specify. If your locality rules allow more than one Google Cloud region to be used, you can increase service availability by enabling those regions in the Pub/Sub topic. You need to be aware that implementing a message storage policy to restrict Pub/Sub resource locations does come with trade-offs concerning availability.
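A minimal sketch of such a topic might look like the following command, which restricts message storage to two regions and uses a CMEK key; the topic, project, and key names are placeholders.

```
# Create a topic whose messages are persisted only in the allowed regions and
# encrypted with a customer-managed key. All resource names are placeholders.
gcloud pubsub topics create pos-change-events \
    --message-storage-policy-allowed-regions=europe-west3,europe-west4 \
    --topic-encryption-key=projects/example-project/locations/europe-west3/keyRings/analytics-dr-keyring/cryptoKeys/analytics-dr-key
```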
A Pub/Sub subscription lets you store unacknowledged messages for up to 7 days without any restrictions on the number of messages. If your service level agreement allows delayed data, you can buffer the data in your Pub/Sub subscription if the pipelines stop running. When the pipelines are running again, you can process the backed-up events. This design has the benefit of having a low RPO. For more information about the resource limits for Pub/Sub, see Pub/Sub quotas and limits.
Event processing with Dataflow
Dataflow is a managed service for executing a wide variety of data processing patterns. The Apache Beam SDK is an open source programming model that lets you develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
When designing for locality restrictions, you need to consider where your sources, sinks, and temporary files are located. If these file locations are outside of your job's region, your data might be sent across regions. If your locality rules allow data to be processed in a different location, design your DR architecture to deploy workers in other Google Cloud regions to provide high availability.
If your locality restrictions limit processing to a single location, you can create a Dataflow job that is restricted to a specific region or zone. When you submit a Dataflow job, you can specify the regional endpoint, worker region, and worker zone parameters. These parameters control where workers are deployed and where job metadata is stored.
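For example, a streaming Beam pipeline might be launched with its workers and job metadata pinned to one region, as in the following sketch; the pipeline script, project, and bucket names are placeholders.

```
# Launch the Beam pipeline on Dataflow with workers restricted to a specific
# region and zone. The script and resource names are placeholders.
python cdc_pipeline.py \
    --runner=DataflowRunner \
    --project=example-project \
    --region=europe-west3 \
    --worker_zone=europe-west3-b \
    --temp_location=gs://example-dataflow-temp/tmp \
    --staging_location=gs://example-dataflow-temp/staging \
    --streaming
```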
Apache Beam provides a framework that allows pipelines to be executed across various runners. You can design your DR architecture to take advantage of this capability. For example, you might design a DR architecture that has a backup pipeline running on your local on-premises Spark cluster by using the Apache Spark Runner. For information about whether a specific runner is capable of carrying out a certain pipeline operation, see the Beam Capability Matrix.
If encryption lets you mitigate locality restrictions, you can use CMEK in Dataflow to both encrypt pipeline state artifacts and access sources and sinks that are protected with Cloud KMS keys. Using encryption, you can design a DR architecture that uses regions that would otherwise not be available due to locality restrictions.
Data warehouse built on BigQuery
Data warehouses support analytics and decision-making. Besides containing an analytical database, data warehouses contain multiple analytical components and procedures.
When designing the DR plan for your data warehouse, think about the following characteristics:
Size. Data warehouses are much larger than online transaction processing (OLTP) systems. It's not uncommon for data warehouses to have terabytes to petabytes of data. You need to consider how long it would take to restore this data to achieve your RPO and RTO values. When planning your DR strategy, you must also factor in the cost associated with recovering terabytes of data. For more information about DR mitigation techniques for BigQuery, see the BigQuery information in the section on backup and recovery mechanisms for the managed database services on Google Cloud.
Availability. When you create a BigQuery dataset, you select a location in which to store your data: regional or multi-region. A regional location is a single, specific geographical location, such as Iowa (us-central1). A multi-region location is a large geographic area, such as the United States (US) or Europe (EU), that contains two or more geographic places.
Note: In BigQuery, a multi-region location does not provide cross-region replication or regional redundancy. Data is stored in a single region within the geographic location.
When designing your DR plan to meet locality restrictions, the failure domain (that is, whether the failure occurs at the machine level, zonal level, or regional level) has a direct impact on whether you meet your defined RTO. For more information about these failure domains and how they affect availability, see Availability and durability of BigQuery.
Nature of the data. Data warehouses contain historic information, and most of the older data is often static. Many data warehouses are designed to be append-only. If your data warehouse is append-only, you might be able to achieve your RTO by restoring just the data that is being appended. In this approach, you back up just this appended data. If there is a disaster, you'll then be able to restore the appended data and have your data warehouse available to use, but with a subset of the data.
Data addition and update pattern. Data warehouses are typically updated using ETL or ELT patterns. When you have controlled update paths, you can reproduce recent update events from alternative sources.
Your locality requirements might limit whether you can use a single region or multiple regions for your data warehouse. Although BigQuery datasets can be regional, multi-region storage is the simplest and most cost-effective way to ensure the availability of your data if a disaster occurs. If multi-region storage is not available in your region, but you can use a different region, use the BigQuery Data Transfer Service to copy your dataset to a different region. If you can use encryption to mitigate the locality requirements, you can manage your own encryption keys with Cloud KMS and BigQuery.
If you can use only one region, consider backing up the BigQuery tables. The most cost-effective solution to back up tables is to use BigQuery exports. Use Cloud Scheduler or Cloud Composer to periodically schedule an export job that writes to Cloud Storage. You can use formats such as Avro with SNAPPY compression or JSON with GZIP compression. While you are designing your export strategies, take note of the limits on exports.
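For example, a scheduled backup job could run an export like the following sketch; the project, dataset, table, and bucket names are placeholders.

```
# Export a BigQuery table to Cloud Storage as compressed Avro files.
# Project, dataset, table, and bucket names are placeholders.
bq extract \
    --destination_format=AVRO \
    --compression=SNAPPY \
    'example-project:sales_dw.orders' \
    'gs://example-dr-backups/bigquery/orders/orders-*.avro'
```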
You might also want to store BigQuery backups in columnar formats such as Parquet and ORC. These formats provide high compression and also allow interoperability with many open source engines, such as Hive and Presto, that you might have in your on-premises systems. The following diagram outlines the process of exporting BigQuery data to a columnar format for storage in an on-premises location.
Specifically, this process of exporting BigQuery data to an on-premises storage location involves the following steps:
- The BigQuery data is sent to an Apache Spark job on Dataproc (see the sketch after these steps). The use of the Apache Spark job permits schema evolution.
- After the Dataproc job has processed the BigQuery files, the processed files are written to Cloud Storage and then transferred to an on-premises DR location.
- Cloud Interconnect is used to connect your Virtual Private Cloud network to your on-premises network.
- The transfer to the on-premises DR location can occur through the Spark job.
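The Spark step in this process might be submitted as in the following sketch, which assumes a PySpark script that reads the table through the spark-bigquery connector and writes Parquet; the script, cluster, table, and bucket names are placeholders.

```
# Submit a PySpark job that reads the BigQuery table and writes Parquet files
# to Cloud Storage for transfer on-premises. All names are placeholders.
gcloud dataproc jobs submit pyspark gs://example-dr-scripts/export_bq_to_parquet.py \
    --cluster=example-export-cluster \
    --region=europe-west3 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- --table=example-project.sales_dw.orders --output=gs://example-dr-backups/parquet/orders/
```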
If your warehouse design is append-only and is partitioned by date, you need to create a copy of the required partitions in a new table before you run a BigQuery export job on the new table. You can then use a tool such as the gcloud storage command in the Google Cloud CLI to transfer the updated files to your backup location on-premises or in another cloud. (Egress charges might apply when you transfer data out of Google Cloud.)
For example, suppose you have a sales data warehouse that consists of an append-only orders table in which new orders are appended to the partition that represents the current date. However, a return policy might allow items to be returned within 7 days. Therefore, records in the orders table from within the last 7 days might be updated. Your export strategies need to take the return policy into account. In this example, any export job to back up the orders table needs to also export the partitions representing orders within the last 7 days to avoid missing updates due to returns.
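In this example, the export job could copy each of the last 7 daily partitions into a staging table and then export that table, as in the following sketch; the dataset, table, partition date, and bucket names are placeholders.

```
# Copy one mutable partition (repeat for each of the last 7 days) into a staging
# table, then export the staging table. Names and dates are placeholders.
bq cp --append_table 'sales_dw.orders$20240713' 'sales_dw.orders_export_staging'

bq extract \
    --destination_format=AVRO \
    --compression=SNAPPY \
    'sales_dw.orders_export_staging' \
    'gs://example-dr-backups/bigquery/orders/20240713/orders-*.avro'
```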
What's next
- Read other articles in this DR series:
- Read the whitepaper: Learn about data residency, operational transparency, and privacy for European customers on Google Cloud.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Contributors
Authors:
- Grace Mollison | Solutions Lead
- Marco Ferrari | Cloud Solutions Architect