Migrate across Google Cloud regions: Prepare data and batch workloads for migration across regions

This document describes how to design a data platform on Google Cloud to minimize the impact of a future expansion to other regions or of a region-to-region migration. This document is part of a series that helps you to understand the impact of expanding your data platform to another region. It helps you learn how to do the following:

  • Prepare to move data and data pipelines.
  • Set up checks during the migration phases.
  • Create a flexible migration strategy by separating data storage and data computation.

The guidance in this series is also useful if you didn't plan for a migration across regions or for an expansion to multiple regions in advance. In this case, you might need to spend additional effort to prepare your infrastructure, workloads, and data for the migration across regions and for the expansion to multiple regions.

This document is part of a series:

This series assumes that you've read and are familiar with the following documents:

The following diagram illustrates the path of your migration journey.

Migration path with four phases.

During each migration step, you follow the phases defined in Migration to Google Cloud: Get started:

  1. Assess and discover your workloads.
  2. Plan and build a foundation.
  3. Deploy your workloads.
  4. Optimize your environment.

The modern data platform on Google Cloud

This section describes the different parts of a modern data platform, and how they're usually constructed in Google Cloud. Data platforms as a general concept can be divided into two layers:

  • The data storage layer is where data is saved. The data that you're saving might be in the form of files where you manage actual bytes on a file system like Hadoop Distributed File System (HDFS) or Cloud Storage, or you might use a domain-specific language (DSL) to manage the data in a database management system.
  • The data computation layer is any data processing that you might activate on top of the storage system. As with the storage layer, there are many possible implementations, and some data storage tools also handle data computation. The role of the data computation layer in the platform is to load data from the storage layer, process the data, and then save the results to a target system. The target system can be the source storage layer.

Some data platforms use multiple storage systems for their data storage layer, and multiple data computation systems for their data processing layer. In most cases, the data storage layer and the data computation layer are separated. For example, you might have implemented your data storage layer using these Google Cloud services:

You might have implemented the data computation layer using other Google Cloud services like these:

To reduce the time and latency of communication, the cost of outbound data transfer, and the number of I/O operations between the storage layer and the computation layer, we recommend that you store the data in the same zone that you process the data in.

We also recommend that you keep your data storage layer separate from your data computation layer. Keeping these layers separate improves your flexibility in changing computation layers and migrating data. Keeping the layers separate also reduces your resource use because you don't have to keep the computation layer running all the time. Therefore, we recommend that you deploy your data storage and data computation on separate platforms in the same zone and region. For example, you can move your data storage from HDFS to Cloud Storage and use a Dataproc cluster for computation.

Assess your environment

In the assessment phase, you determine the requirements and dependencies to migrate the batch data pipelines that you've deployed:

  1. Build a comprehensive inventory of your data pipelines.
  2. Catalog your pipelines according to their properties and dependencies.
  3. Train and educate your teams on Google Cloud.
  4. Build an experiment and proof of concept on Google Cloud.
  5. Calculate the total cost of ownership (TCO) of the target environment.
  6. Choose the workloads that you want to migrate first.

For more information about the assessment phase and these tasks, see Migration to Google Cloud: Assess and discover your workloads. The following sections are based on the information in that document.

Build your inventories

To scope your migration, you must understand the data platform environment where your data pipelines are deployed:

  1. Create an inventory of your data infrastructure—the different storage layers and different computation layers that you're using for data storage and batch data processing.
  2. Create an inventory of the data pipelines that are scheduled to be migrated.
  3. Create an inventory of the datasets that are being read by the data pipelines and that need to be migrated.

To build an inventory of your data platform, consider the following for each part of the data infrastructure (a basic inventory script sketch follows this list):

  • Storage layers. Along with standard storage platforms like Cloud Storage, consider other storage layers such as databases like Firebase, BigQuery, Bigtable, and Postgres, or other clusters like Apache Kafka. Each storage platform has its own strategy and method to complete migration. For example, Cloud Storage has data migration services, and a database might have a built-in migration tool. Make sure that each product that you're using for data storage is available to you in your target environment, or that you have a compatible replacement. Practice and verify the technical data transfer process for each of the involved storage platforms.
  • Computation layers. For each computation platform, verify the deployment plan and verify any configuration changes that you might have made to the different platforms.
  • Network latency. Test and verify the network latency between the source environment and the target environment. It's important for you to understand how long it will take for the data to be copied. You also need to test the network latency from clients and external environments (such as an on-premises environment) to the target environment in comparison to the source environment.
  • Configurations and deployment. Each data infrastructure product has its own setup methods. Take inventory of the custom configurations that you've made for each component, and which components you're using the default versions of for each platform (for example, which Dataproc version or Apache Kafka version you're using). Make sure that those configurations are deployable as part of your automated deployment process.

    You need to know how each component is configured because computational engines might behave differently when they're configured differently—particularly if the processing layer framework changes during the migration. For example, if the target environment is running a different version of Apache Spark, some configurations of the Spark framework might have changed between versions. This kind of configuration change can cause changes in outputs, serializations, and computation.

    During the migration, we recommend that you use automated deployments to ensure that versions and configurations stay the same. If you can't keep versions and configurations the same, then make sure to have tests that validate the data outputs that the framework calculates.

  • Cluster sizes. For self-managed clusters, such as a long-living Dataproc cluster or an Apache Kafka cluster running on Compute Engine, note the number of nodes and CPUs, and the memory for each node in the clusters. Migrating to another region might result in a change to the processor that your deployment uses. Therefore, we recommend that you profile and optimize your workloads after you deploy the migrated infrastructure to production. If a component is fully managed or serverless (for example, Dataflow), the sizing will be part of each individual job, and not part of the cluster itself.
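
To build the infrastructure part of the inventory, a small script can enumerate what exists in a project. The following is a minimal sketch, assuming the google-cloud-storage and google-cloud-bigquery Python client libraries are installed and that you have read access to the project; the project ID is a placeholder.

```python
# Minimal inventory sketch: list Cloud Storage buckets and BigQuery datasets
# in a project, along with their locations, so you can record them in your
# migration inventory. Assumes read permissions on the project.
from google.cloud import bigquery, storage

PROJECT_ID = "my-source-project"  # hypothetical project ID

storage_client = storage.Client(project=PROJECT_ID)
bigquery_client = bigquery.Client(project=PROJECT_ID)

print("Cloud Storage buckets (name, location):")
for bucket in storage_client.list_buckets():
    print(f"  {bucket.name}, {bucket.location}")

print("BigQuery datasets (ID, location):")
for item in bigquery_client.list_datasets():
    dataset = bigquery_client.get_dataset(item.reference)
    print(f"  {dataset.dataset_id}, {dataset.location}")
```

You can extend the same approach to other components of your inventory, such as Dataproc clusters, and record custom configurations alongside each entry.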

The following items that you assess in your inventory focus on the data pipelines:

  • Data sources and sinks. Make sure to account for the sources and sinks that each data pipeline uses for reading and writing data.
  • Service Level Agreements (SLAs) and Service Level Objectives (SLOs). SLAs and SLOs for batch data pipelines are usually measured in time to completion, but they can also be measured in other ways, such as compute power used. This business metadata is important in driving business continuity and disaster recovery (BCDR) processes, such as failing over a subset of your most critical pipelines to another region in the event of a zonal or regional failure.
  • Data pipeline dependencies. Some data pipelines rely on data that is generated by another data pipeline. When you split pipelines into migration sprints, make sure to consider data dependencies.
  • Datasets generated and consumed. For each data pipeline, identify datasets that the pipeline consumes, and which datasets it generates. Doing so can help you to identify dependencies between pipelines and between other systems or components in your overall architecture.

The following items that you assess in your inventory focus on the datasets to be migrated:

  • Datasets. Identify the datasets that need to be migrated to the target environment. You might consider some historical data as not needed for migration, or to be migrated at a different time, if the data is archived and isn't actively used. By defining the scope for the migration process and the migration sprints, you can reduce risks in the migration.
  • Data sizes. If you plan to compress files before you transfer them, make sure to note the file size before and after compression. The size of your data will affect the time and cost that's required to copy the data from the source to the destination. Considering these factors will help you to choose between downtime strategies, as described later in this document.
  • Data structure. Classify each dataset to be migrated and make sure that you understand whether the data is structured, semi-structured, or unstructured. Understanding data structure can inform your strategy for how to verify that data is migrated correctly and completely.

Complete the assessment

After you build the inventories related to your data pipelines and datasets, complete the rest of the activities of the assessment phase in Migration to Google Cloud: Assess and discover your workloads.

Plan and build your foundation

The plan and build phase of your migration to Google Cloud consists of the following tasks:

  1. Build a resource hierarchy.
  2. Configure Identity and Access Management (IAM).
  3. Set up billing.
  4. Set up network connectivity.
  5. Harden your security.
  6. Set up logging, monitoring, and alerting.

For more information about each of these tasks, see Migrate to Google Cloud: Plan and build your foundation.

Migrate data and data pipelines

The following sections describe some of the aspects of the plan for migrating data and batch data pipelines. They define some concepts around the characteristics of data pipelines that are important to understand when you create the migration plan. They also discuss some data testing concepts that can help increase your confidence in the data migration.

Migration plan

In your migration plan, you need to include time to complete the data transfer. Your plan should account for network latency, time to test the data completeness and get any data that failed to migrate, and any network costs. Because data will be copied from one region to another, your plan for network costs should include inter-region network costs.

We recommend that you divide the different pipelines and datasets into sprints and migrate them separately. This approach helps to reduce the risks for each migration sprint, and it allows for improvements in each sprint. To improve your migration strategy and uncover issues early, we recommend that you prioritize smaller, non-critical workloads before you migrate larger, more critical workloads.

Another important part of a migration plan is to describe the strategy, dependencies, and nature of the different data pipelines from the computation layer. If your data storage layer and data computation layer are built on the same system, we recommend that you monitor the performance of the system while data is being copied. Typically, the act of copying large amounts of data can cause I/O overhead on the system and degrade performance in the computation layer. For example, if you run a workload to extract data from a Kafka cluster in a batch fashion, the extra I/O operations to read large amounts of data can cause a degradation of performance on any active data pipelines that are still running in the source environment. In that kind of scenario, you should monitor the performance of the system by using any built-in or custom metrics. To avoid overwhelming the system, we recommend that you have a plan to decommission some workloads during the data copying process, or to throttle down the copy phase.

Because copying data makes the migration a long-running process, we recommend that you have contingency plans to address anything that might go wrong during the migration. For example, if data movement is taking longer than expected or if integrity tests fail before you put the new system online, consider whether you want to roll back or try to fix and retry failed operations. Although a rollback can be a cleaner solution, it can be time-consuming and expensive to copy large datasets multiple times. We recommend that you have a clear understanding and predefined tests to determine which action to take in which conditions, how much time to allow to try to create patches, and when to perform a complete rollback.

It's important to differentiate between the tooling and scripts that you're using for the migration, and the data that you're copying. Rolling back data movement means that you have to recopy data and either override or delete data that you already copied. Rolling back changes to the tooling and scripts is potentially easier and less costly, but changes to tooling might force you to recopy data. For example, you might have to recopy data if you create a new target path in a script that generates a Cloud Storage location dynamically. To help avoid recopying data, build your scripts to allow for resumability and idempotency.
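
As an illustration of resumability and idempotency, the following sketch copies objects between two Cloud Storage buckets and skips objects that already exist in the target with the same size, so the script can be rerun safely after a failure. Bucket names are placeholders, and for production-scale or very large transfers, Storage Transfer Service (discussed later in this document) is usually a better fit than a custom script.

```python
# Idempotent, resumable copy sketch between two Cloud Storage buckets.
# Objects that already exist in the target bucket with the same size are
# skipped, so rerunning the script only copies what's missing.
from google.cloud import storage

SOURCE_BUCKET = "my-source-bucket"  # hypothetical bucket in the source region
TARGET_BUCKET = "my-target-bucket"  # hypothetical bucket in the target region

client = storage.Client()
source_bucket = client.bucket(SOURCE_BUCKET)
target_bucket = client.bucket(TARGET_BUCKET)

for blob in client.list_blobs(SOURCE_BUCKET):
    existing = target_bucket.get_blob(blob.name)
    if existing is not None and existing.size == blob.size:
        continue  # already copied; safe to skip on rerun
    source_bucket.copy_blob(blob, target_bucket, blob.name)
    print(f"Copied {blob.name}")
```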

Data pipeline characteristics

To create an optimal migration plan, you need to understand the characteristics of different data pipelines. It's important to remember that batch pipelines that write data are different from batch pipelines that read data:

  • Data pipelines that write data: Because writing data changes the state of the source system, it can be difficult to write data to the source environment at the same time that data is being copied to the target environment. Consider the runtimes of pipelines that write data, and try to prioritize their migration earlier in the overall process. Doing so will let you have data ready on the target environment before you migrate the pipelines that read the data.
  • Data pipelines that read data: Pipelines that read data might have different requirements for data freshness. If the pipelines that generate data are stopped on the source system, then the pipelines that read data might be able to run while data is being copied to the target environment.

Data is state, and copying data between regions isn't an atomic operation. Therefore, you need to be aware of state changes while data is being copied.

It's also important in the migration plan to differentiate between systems. Your systems might have different functional and non-functional requirements (for example, one system for batch and another for streaming). Therefore, your plan should include different strategies to migrate each system. Make sure that you specify the dependencies between the systems and specify how you will reduce downtime for each system during each phase of the migration.

A typical plan for a migration sprint should include the following:

  • General strategy. Describe the strategy for handling the migration in this sprint. For common strategies, see Deploy your workloads.
  • List of tools and methods for data copy and resource deployment. Specify any tool that you plan to use to copy data or deploy resources to the target environment. This list should include custom scripts that are used to copy Cloud Storage assets, standard tooling such as the Google Cloud CLI, and Google Cloud tools such as Migration Services.
  • List of resources to deploy to the target environment. List all resources that need to be deployed in the target environment. This list should include all data infrastructure components such as Cloud Storage buckets, BigQuery datasets, and Dataproc clusters. In some cases, early migration sprints will include deployment of a sized cluster (such as a Dataproc cluster) in a smaller capacity, while later sprints will include resizing to fit new workloads. Make sure that your plan includes potential resizing.
  • List of datasets to be copied. For each dataset, make sure to specify the following information:
    • Order in copying (if applicable): For most strategies, the order of operation might be important. An exception is the scheduled maintenance strategy that's described later in this document.
    • Size
    • Key statistics: Chart key statistics, such as the row count, that can help you to verify that the dataset was copied successfully.
    • Estimated time to copy: The time to complete your data transfer, based on the migration plan.
    • Method to copy: Refer to the tools and methods list described earlier in this document.
    • Verification tests: Explicitly list the tests that you plan to complete to verify that the data was copied in full.
    • Contingency plan: Describe what to do if any verification tests fail. Your contingency plan should specify when to retry and resume the copy or fill in the gap, and when to do a complete rollback and recopy the entire dataset.

Testing

This section describes some typical types of tests that you can plan for. The tests can help you to ensure data integrity and completeness. They can also help you to ensure that the computational layer is working as expected and is ready to run your data pipelines.

  • Summary or hashing comparison: To validate data completeness after copying data over, you need to compare the original dataset against the new copy on the target environment. If the data is structured inside BigQuery tables, you can't join the two tables in a query to see if all data exists, because the tables reside in different regions. Because of the cost and latency, BigQuery doesn't allow queries to join data across regions. Instead, the method of comparison must summarize each dataset and compare the results. Depending on the dataset structure, the method for summarizing might be different. For example, a BigQuery table might use an aggregation query, but a set of files on Cloud Storage might use a Spark pipeline to calculate a hash of each file, and then aggregate the hashes.
  • Canary flows: Canary flows activate jobs that are built to validate data integrity and completeness. Before you continue to business use cases like data analytics, it can be useful to run canary flow jobs to make sure that input data complies with a set of prerequisites. You can implement canary flows as custom-made data pipelines, or as flows in a DAG based on Cloud Composer. Canary flows can help you to complete tasks like verifying that there are no missing values for certain fields, or validating that the row count of specific datasets matches the expected count. A minimal sketch of such a check appears after this list.

    You can also use canary flows to create digests or other aggregations of a column or a subset of the data. You can then use the canary flow to compare the data to a similar digest or aggregation that's taken from the copy of the data.

    Canary flow methods are valuable when you need to evaluate the accuracy of data that's stored and copied in file formats, like Avro files on top of Cloud Storage. Canary flows don't normally generate new data, but instead they fail if a set of rules isn't met within the input data.

  • Testing environment: After you complete your migration plan, you should test the plan in a testing environment. The testing environment should include copying sampled data or staging data to another region, to estimate the time that it takes to copy data over the network. This testing helps you to identify any issues with the migration plan, and helps to verify that the data can be migrated successfully. The testing should include both functional and non-functional testing. Functional testing verifies that the data is migrated correctly. Non-functional testing verifies that the migration meets performance, security, and other non-functional requirements. Each migration step in your plan should include validation criteria that detail when the step can be considered complete.
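
The following is a minimal sketch of a canary-style check against a BigQuery table in the target environment: it fails if a required field contains NULL values or if the row count falls below an expected minimum. The table name, column name, and threshold are illustrative; the same logic could also run as a task in a Cloud Composer DAG.

```python
# Canary check sketch: validate prerequisites on a migrated BigQuery table.
# The job raises (fails) instead of producing data when a rule isn't met.
from google.cloud import bigquery

TABLE = "my-project.my_dataset.orders"  # hypothetical table
EXPECTED_MIN_ROWS = 1_000_000           # hypothetical threshold

client = bigquery.Client()
query = f"""
    SELECT
      COUNT(*) AS row_count,
      COUNTIF(order_id IS NULL) AS missing_order_ids
    FROM `{TABLE}`
"""
row = next(iter(client.query(query).result()))

if row.missing_order_ids > 0:
    raise ValueError(f"{row.missing_order_ids} rows are missing order_id")
if row.row_count < EXPECTED_MIN_ROWS:
    raise ValueError(f"Row count {row.row_count} is below {EXPECTED_MIN_ROWS}")
print("Canary checks passed")
```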

To help with data validation, you can use the Data Validation Tool (DVT). The tool performs multi-leveled data validation functions, from the table level to the row level, and it helps you compare the results from your source and target systems.

Your tests should verify deployment of the computational layer, and test the datasets that were copied. One approach to do so is to construct a testing pipeline that can compute some aggregations of the copied datasets, and make sure that the source datasets and the target datasets match. A mismatch between source and target datasets is more common when the files that you copy between regions aren't exact byte-copy representations between the source and target systems (such as when you change file formats or file compressions).

For example, consider a dataset that's composed of newline-delimited JSON files. The files are stored in a Cloud Storage bucket, and are mounted as an external table in BigQuery. To reduce the amount of data moved over the network, you can perform Avro compression as part of the migration, before you copy files to the target environment. This conversion has many upsides, but it also has some risks, because the files that are being written to the target environment aren't a byte-copy representation of the files in the source environment.

To mitigate the risks from the conversion scenario, you can create a Dataflow job, or use BigQuery to calculate some aggregations and checksum hashes of the dataset (such as by calculating sums, averages, or quantiles for each numeric column). For string columns, you can compute aggregations on top of the string length, or on the hash code of that string. For each row, you can compute an aggregated hash from a combination of all the other columns, which can verify with high accuracy that one row is the same as its origin. These calculations are made on both the source and target environments, and then they're compared. In some cases, such as if your dataset is stored in BigQuery, you can't join tables from the source and target environments because they're in different regions, so you need to use a client that can connect to both environments.

You can implement the preceding testing methods either in BigQuery or as a batch job (such as in Dataflow). You can then run the aggregation jobs and compare the results calculated for the source environment to the results calculated for the target environment. This approach can help you to make sure that data is complete and accurate.
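
The following sketch illustrates the aggregation comparison for BigQuery: it runs the same summary query against the source table and the target table (which are in different regions and therefore can't be joined) and compares the results client-side. Table and column names are illustrative, and for floating-point columns you might compare within a tolerance instead of requiring exact equality.

```python
# Run the same aggregation query against the source and target tables, then
# compare the summaries client-side because cross-region joins aren't possible.
from google.cloud import bigquery

SOURCE_TABLE = "my-project.source_dataset.events"  # hypothetical, source region
TARGET_TABLE = "my-project.target_dataset.events"  # hypothetical, target region

SUMMARY_QUERY = """
    SELECT
      COUNT(*) AS row_count,
      SUM(amount) AS amount_sum,
      AVG(LENGTH(customer_name)) AS avg_name_length,
      BIT_XOR(FARM_FINGERPRINT(CONCAT(CAST(id AS STRING), '|', CAST(amount AS STRING)))) AS row_hash_agg
    FROM `{table}`
"""

client = bigquery.Client()
source = dict(next(iter(client.query(SUMMARY_QUERY.format(table=SOURCE_TABLE)).result())).items())
target = dict(next(iter(client.query(SUMMARY_QUERY.format(table=TARGET_TABLE)).result())).items())

for key in source:
    status = "OK" if source[key] == target[key] else "MISMATCH"
    print(f"{key}: source={source[key]} target={target[key]} [{status}]")
```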

Another important aspect of testing the computational layer is to run pipelines that include all varieties of the processing engines and computational methods. Testing the pipeline is less important for managed computational engines like BigQuery or Dataflow. However, it's important to test the pipeline for non-managed computational engines like Dataproc. For example, if you have a Dataproc cluster that handles several different types of computation, such as Apache Spark, Apache Hive, Apache Flink, or Apache MapReduce, you should test each runtime to make sure that the different workload types are ready to be transferred.

Migration strategies

After you verify your migration plan with proper testing, you can migrate data. When you migrate data, you can use different strategies for different workloads. The following are examples of migration strategies that you can use as is or customize for your needs:

  • Scheduled maintenance: You plan when your cutover window occurs. This strategy is a good fit when data is changed frequently, but SLOs and SLAs can withstand some downtime. This strategy offers high confidence in the transferred data because the data doesn't change while it's being copied. For more information, see Scheduled maintenance in "Migration to Google Cloud: Transferring your large datasets."
  • Read-only cutover: A slight variation of the scheduled maintenance strategy, where the source system data platform allows read-only data pipelines to continue reading data while data is being copied. This strategy is useful because some data pipelines can continue to work and provide insights to end systems. The disadvantage to this strategy is that the data that's produced is stale during the migration, because the source data doesn't get updated. Therefore, you might need to employ a catch-up strategy after the migration, to account for the stale data in the end systems.
  • Fully active: You copy the data at a specific timestamp, while the source environment is still active for both read and write data pipelines. After you copy the data and switch over to the new deployment, you perform a delta copy phase to get the data that was generated after the migration timestamp in the source environment. This approach requires more coordination and consideration compared to other strategies. Therefore, your migration plan must include how you will handle the update and delete operations on the source data.
  • Double-writes: The data pipelines can run on both the source and target environments, while data is being copied. This strategy avoids the delta copy phase that's required to backfill data if you use the fully active or read-only strategies. However, to help make sure that data pipelines are producing identical results, a double-writes strategy requires more testing before the migration. If you don't perform advance testing, you will encounter problems trying to consolidate a split-brain scenario. The double-writes strategy also introduces potential costs of running the same workloads twice in different regions. This strategy has the potential to migrate your platform with zero downtime, but it requires much more coordination to execute it correctly.

Post-migration testing

After the migration is complete, you should test data completeness and test the data pipelines for validity. If you complete your migration in sprints, you need to perform these tests after each sprint. The tests that you perform in this stage are similar to integration tests: you test the validity of a data pipeline that's running business use cases with full production-grade data as input, and then you inspect the output for validity. You can compare the output of the integration test to the output from the source environment by running the same data pipeline in both the source environment and in the target environment. This type of test works only if the data pipeline is deterministic, and if you can ensure that the input to both environments is identical.

You can confirm that the data is complete when it meets a set of predefined criteria, where the data in the source environment is equal (or similar enough) to the data in the target environment. Depending on the strategy that you used from the previous section, the data might not match one-to-one. Therefore, you need to predefine criteria to describe data completeness for your use case. For example, for time-series data, you might consider the data to be complete when the most up-to-date record is no more than five minutes behind the current timestamp.
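
A predefined criterion like the five-minute rule above can be checked with a small query against the migrated table. The following is a minimal sketch; the table and timestamp column names are illustrative.

```python
# Completeness check sketch for time-series data: fail if the newest record is
# more than five minutes behind the current timestamp.
from google.cloud import bigquery

TABLE = "my-project.target_dataset.sensor_readings"  # hypothetical table

client = bigquery.Client()
query = f"""
    SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_time), MINUTE) AS lag_minutes
    FROM `{TABLE}`
"""
lag_minutes = next(iter(client.query(query).result())).lag_minutes

if lag_minutes > 5:
    raise ValueError(f"Newest record is {lag_minutes} minutes old; criterion not met")
print(f"Newest record is {lag_minutes} minutes old; within the 5-minute criterion")
```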

Cutover

After you verify and test the data and data pipelines on the target environment, you can start the cutover phase. Starting this phase means that clients might need to change their configuration to reference the new systems. In some cases, the configuration can't be the same as the configuration that's pointing to the source system. For example, if a service needs to read data from a Cloud Storage bucket, clients need to change the configuration for which bucket to use. Cloud Storage bucket names are globally unique, so your target environment Cloud Storage bucket will be different from the source environment bucket.

During the cutover phase, you should decommission and unschedule the source environment workloads. We recommend that you keep the source environment data for some time, in case you need to roll back.

The pre-migration testing phase isn't as complete as a production run of a data pipeline. Therefore, after the cutover is complete and the target system is operational, you need to monitor the metrics, runtimes, and semantic outputs of your data pipelines. This monitoring will help you to catch errors that your testing phase might have missed, and it will help ensure the success of the migration.

Optimize your environment

Optimization is the last phase of your migration. In this phase, you make your environment more efficient by executing multiple iterations of a repeatable loop until your environment meets your optimization requirements:

  1. Assess your current environment, teams, and optimization loop.
  2. Establish your optimization requirements and goals.
  3. Optimize your environment and your teams.
  4. Tune the optimization loop.

For more information about how to optimize your Google Cloud environment, see Migration to Google Cloud: Optimize your environment.

Prepare your Google Cloud data and computing resources for a migration across regions

This section provides an overview of the data and computing resources on Google Cloud and of the design principles to prepare for a migration across regions.

BigQuery

Because BigQuery is a serverless SQL data warehouse, you don't have to deploy the computation layer. If some of your BigQuery clients specify regions for processing, you will need to adjust those clients. Otherwise, BigQuery is the same in the source environment and the target environment. BigQuery data is stored in two kinds of tables:

  • BigQuery tables: Tables in the BigQuery format. BigQuery manages the data files for you. For more information about migrating data in the BigQuery format, see Manage datasets.
  • BigQuery external tables: Tables for which the data is stored outside of BigQuery. After the data is moved, you will need to recreate the external tables in the new destination. For more information about migrating external tables, see Introduction to external tables. A minimal sketch of recreating an external table follows this list.
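
The following is a minimal sketch of recreating an external table in the target environment after the underlying files have been copied to a bucket in the target region. The table ID, file format, and Cloud Storage URI are illustrative.

```python
# Recreate a BigQuery external table over files that were copied to the
# target-region bucket. Assumes the google-cloud-bigquery client library.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-target-bucket/events/*.parquet"]  # hypothetical path

table = bigquery.Table("my-project.target_dataset.events_external")  # hypothetical table ID
table.external_data_configuration = external_config

client.create_table(table, exists_ok=True)
print(f"Created external table {table.table_id}")
```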

Cloud Storage

Cloud Storage offers a Storage Transfer Service that can help you migrate your data.
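
The following sketch, based on the Storage Transfer Service Python client library (google-cloud-storage-transfer), creates and runs a one-time transfer job between a source-region bucket and a target-region bucket. Project and bucket names are placeholders, and the Storage Transfer Service agent account needs access to both buckets.

```python
# Create and run a one-time Storage Transfer Service job that copies objects
# from a source bucket to a bucket in the target region.
from google.cloud import storage_transfer

PROJECT_ID = "my-project"           # hypothetical project
SOURCE_BUCKET = "my-source-bucket"  # hypothetical source-region bucket
TARGET_BUCKET = "my-target-bucket"  # hypothetical target-region bucket

client = storage_transfer.StorageTransferServiceClient()

transfer_job = client.create_transfer_job(
    storage_transfer.CreateTransferJobRequest(
        transfer_job={
            "project_id": PROJECT_ID,
            "description": "One-time region-to-region bucket copy",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": SOURCE_BUCKET},
                "gcs_data_sink": {"bucket_name": TARGET_BUCKET},
            },
        }
    )
)

client.run_transfer_job({"job_name": transfer_job.name, "project_id": PROJECT_ID})
print(f"Created and started transfer job: {transfer_job.name}")
```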

Dataflow (Batch)

Dataflow is a Google-managed data processing engine. To help simplify your Dataflow migration and ensure that your jobs can be deployed to any region, you should inject all inputs and outputs as parameters to your job. Instead of writing input and output data locations in your source code, we recommend that you pass Cloud Storage paths and database connection strings as arguments or parameters.
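
As a sketch of this parameterization, the following Apache Beam pipeline (Python SDK) takes its input and output locations as custom pipeline options instead of hardcoding them. The --input and --output option names and the trivial transform are illustrative.

```python
# Beam batch pipeline whose I/O locations arrive as pipeline options, so the
# job is region-agnostic; pass gs:// paths for --input and --output at launch.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class JobOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--input", required=True, help="Input path, for example gs://bucket/input/*")
        parser.add_argument("--output", required=True, help="Output prefix, for example gs://bucket/output/result")


def run(argv=None):
    pipeline_options = PipelineOptions(argv)
    job_options = pipeline_options.view_as(JobOptions)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(job_options.input)
            | "LineLengths" >> beam.Map(lambda line: str(len(line)))
            | "Write" >> beam.io.WriteToText(job_options.output)
        )


if __name__ == "__main__":
    run()
```

With this structure, the same code can target either region by changing only the path arguments and the standard Dataflow region option at launch time.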

Dataproc

Dataproc is a managed Apache Hadoop environment that can run any workload that's compatible with the Apache Hadoop framework. It's compatible with frameworks such as Apache Spark, Apache Flink, and Apache Hive.

You can use Dataproc in the following ways, which affect how you should migrate your Dataproc environment across regions:

  • Ephemeral clusters with data on Cloud Storage: Clusters are built to run specific jobs, and they're destroyed after the jobs are done. This means that the HDFS layer or any other state of the cluster is also destroyed. If your configuration meets the following criteria, then this type of usage is easier to migrate compared to other types of usage:
    • Inputs and outputs to your jobs aren't hardcoded in the source code. Instead, your jobs receive inputs and outputs as arguments (see the sketch after this list).
    • The Dataproc environment provisioning is automated, including the configurations for the individual frameworks that your environment is using.
  • Long-living clusters with external data: You have one or more clusters, but they're long-living clusters—even if there are no running jobs on the cluster, the cluster is still up and running. The data and compute are separate because the data is saved outside of the cluster on Google Cloud solutions like Cloud Storage or BigQuery. This model is usually effective when there are always jobs that are running on the cluster, so it doesn't make sense to tear down and set up clusters like in the ephemeral model. Because data and compute are separate, the migration is similar to migration of the ephemeral model.
  • Long-living clusters with data in the cluster: The cluster is long living, but the cluster is also keeping state, data, or both, inside the cluster, most commonly as data on HDFS. This type of use complicates the migration efforts because data and compute aren't separate; if you migrate one without the other, there is a high risk of creating inconsistencies. In this scenario, consider moving data and state outside of the cluster before the migration, to separate the two. If doing so is impossible, then we recommend that you use the scheduled maintenance strategy in order to reduce the risk of creating inconsistencies in your data.
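
For the ephemeral-cluster pattern, the job itself should be region-agnostic. The following PySpark sketch reads its input and output Cloud Storage paths from job arguments rather than hardcoding them; the paths and column name are illustrative, and you might submit it with gcloud dataproc jobs submit pyspark, passing the paths after the -- separator.

```python
# PySpark job for an ephemeral Dataproc cluster: input and output locations
# arrive as arguments, so nothing region-specific is hardcoded in the job.
import sys

from pyspark.sql import SparkSession


def main(input_path: str, output_path: str) -> None:
    spark = SparkSession.builder.appName("region-agnostic-batch-job").getOrCreate()

    df = spark.read.parquet(input_path)          # for example gs://<bucket>/input/
    result = df.groupBy("customer_id").count()   # hypothetical column

    result.write.mode("overwrite").parquet(output_path)
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```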

Because there are many potential frameworks, and many versions and configurations of those frameworks, you need to test thoroughly before you execute your migration plan.

Cloud Composer

Cloud Composer is Google Cloud's managed version of Apache Airflow, for orchestration and scheduling of flows. DAGs, configurations, and logs are managed in a Cloud Storage bucket that should be migrated with your Cloud Composer deployment. To migrate the state of your Cloud Composer deployment, you can save and load environment snapshots.

If you've deployed any custom plugins to your Cloud Composer instance, we recommend that you apply an infrastructure-as-code methodology to recreate the environment in a fully automated manner.

Cloud Composer doesn't manage data, but it activates other data processing frameworks and platforms. Therefore, migration of Cloud Composer can be completely isolated from the data. Cloud Composer also doesn't process data, so your deployment doesn't need to be in the same region as the data. As a result, you can create a Cloud Composer deployment in the target environment and still run pipelines on the source environment. In some cases, doing so can be useful for operating different pipelines in different regions while the entire platform is being migrated.

Cloud Data Fusion

Cloud Data Fusion is a visual integration tool that helps you build data pipelines using a visual editor. Cloud Data Fusion is based on the open source project CDAP. Like Cloud Composer, Cloud Data Fusion doesn't manage data itself, but it activates other data processing frameworks and platforms. Your Cloud Data Fusion pipelines should be exported from the source environment and imported to the target environment in one of these ways:

Depending on the number of flows that you need to migrate, you might prefer one method over the other. Using the CDAP API to build a migration script might be difficult, and it requires more software engineering skills. However, if you have a lot of flows, or if the flows change relatively frequently, an automated process might be the best approach.

What's Next

Contributors

Author: Eyal Ben Ivri | Cloud Solutions Architect

Other contributor: Marco Ferrari | Cloud Solutions Architect
