Migrate data pipelines

This document describes how you can migrate your upstream data pipelines, which load data into your data warehouse. You can use this document to better understand what a data pipeline is, what procedures and patterns a pipeline can employ, and which migration options and technologies are available for a data warehouse migration.

What is a data pipeline?

In computing, a data pipeline is a type of application that processes data through a sequence of connected processing steps. As a general concept, data pipelines can be applied, for example, to data transfer between information systems, extract, transform, and load (ETL), data enrichment, and real-time data analysis. Typically, data pipelines are operated as a batch process that executes and processes data when run, or as a streaming process that executes continuously and processes data as it becomes available to the pipeline.

In the context of data warehousing, data pipelines are commonly used to read data from transactional systems, apply transformations, and then write data to the data warehouse. Each of the transformations is described by a function, and the input for any given function is the output of the previous function or functions. These connected functions are described as a graph, and this graph is often referred to as a Directed Acyclic Graph (DAG)—that is, the graph follows a direction (from source to destination), and is acyclic—the input for any function cannot be dependent on the output of another function downstream in the DAG. In other words, loops are not permitted. Each node of the graph is a function, and each edge represents the data flowing from one function to the next. The initial functions are sources, or connections to source data systems. The final functions are sinks, or connections to destination data systems.
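
To make the concept concrete, the following minimal Python sketch models a linear data flow DAG as chained functions; the function and field names are purely illustrative.

# Minimal sketch of a linear data flow DAG: each node is a function, and each
# edge is the data passed from one function to the next. Names are illustrative.

def source():
    # Source node: connect to the source system and return raw rows.
    return [{"order_id": 1, "amount": "10.50"}, {"order_id": 2, "amount": "7.25"}]

def transform(rows):
    # Transformation node: clean and type-convert the raw rows.
    return [{**row, "amount": float(row["amount"])} for row in rows]

def sink(rows):
    # Sink node: write the transformed rows to the destination system.
    for row in rows:
        print("writing to data warehouse:", row)

# The DAG runs from source to sink; no function depends on a downstream output.
sink(transform(source()))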

In the context of data pipelines, sources are usually transactional systems—for example, an RDBMS—and the sink connects to a data warehouse. This type of graph is referred to as a data flow DAG. You can also use DAGs to orchestrate data movement between data pipelines and other systems. This usage is referred to as an orchestration or control flow DAG.

When to migrate the data pipelines

When you migrate a use case to BigQuery, you can choose to offload or fully migrate.

On the one hand, when you offload a use case, you don't need to migrate its upstream data pipelines up front. You first migrate the use case schema and data from your existing data warehouse into BigQuery. You then establish an incremental copy from the old to the new data warehouse to keep the data synchronized. Finally, you migrate and validate downstream processes such as scripts, queries, dashboards, and business applications.

At this point, your upstream data pipelines are unchanged and are still writing data to your existing data warehouse. You can include the offloaded use cases in the migration backlog again to be fully migrated in a subsequent iteration.

On the other hand, when you fully migrate a use case, the upstream data pipelines required for the use case are migrated to Google Cloud. Full migration requires you to offload the use case first. After the full migration, you can deprecate the corresponding legacy tables from the on-premises data warehouse because data is ingested directly into BigQuery.

During an iteration, you can choose one of the following options:

  • Offload only your use case.
  • Fully migrate a use case that was previously offloaded.
  • Fully migrate a use case from scratch by offloading it first in the same iteration.

When all of your use cases are fully migrated, you can elect to switch off theold warehouse, which is an important step for reducing overhead and costs.

How to migrate the data pipelines

The rest of this document addresses how to migrate your data pipelines, including which approach and procedures to use and which technologies to employ. Options range from repurposing existing data pipelines (redirecting them to load to BigQuery) to rewriting the data pipelines in order to take advantage of Google Cloud-managed services.

Procedures and patterns for data pipelines

You can use data pipelines to execute a number of procedures and patterns. The procedures and patterns described in this section are the most commonly used in data warehousing. You might have batch data pipelines or streaming data pipelines. Batch data pipelines run on data collected over a period of time (for example, once a day). Streaming data pipelines handle real-time events being generated by your operational systems—for example, CDC row changes being generated by your Online Transaction Processing (OLTP) databases.

Extract, transform, and load (ETL)

In the context of data warehousing, data pipelines often execute an extract, transform, and load (ETL) procedure. ETL technologies run outside of the data warehouse, which means the resources of the data warehouse can be used primarily for concurrent querying, instead of for preparing and transforming data. One downside of the transformation being executed outside of the data warehouse is that it requires you to learn additional tooling and languages (other than SQL) to express the transformations.

The following diagram shows a typical ETL procedure.

Flow showing source (extract) going to one or more transformations (transform), then to a sink, and finally to a data warehouse (load)

Figure 1. A typical ETL procedure.

Note: The arrows denote the direction of data flow; the sources within the DAG typically pull their data from the source systems.

A typical ETL data pipeline pulls data from one or more source systems (preferably, as few as possible to avoid failures caused by issues like unavailable systems). The pipeline then performs a series of transformations, including cleaning data, applying business rules to it, checking for data integrity, and creating aggregates or disaggregates. For more information, see Real-life ETL cycle.

It's common to have multiple data pipelines. The first pipeline focuses on copying data from the source system to the data warehouse. Subsequent pipelines apply business logic and transform the data for use in various data marts, which are subsets of the data warehouse focused on a specific business unit or business focus.

When you have multiple data pipelines, you need to orchestrate them. The following diagram shows what this orchestration process might look like.

Orchestrator (DAG) managing two ETL processes (Sub DAGs)

Figure 2. Orchestration process for multiple data pipelines.

In the diagram, each data pipeline is considered a sub-DAG of the orchestration DAG. Each orchestration DAG encompasses several data pipelines to align with the larger objective, for example, preparing data for a business unit so that business analysts can run their dashboards or reports.

Extract, load, and transform (ELT)

ELT is an alternative to ETL. With ELT, the data pipeline is split into two parts. First, an ELT technology extracts the data from the source system and loads it into the data warehouse. Second, SQL scripts on top of the data warehouse perform the transformations. The upside of this approach is that you can use SQL to express the transformations; the downside is that this might consume data warehouse resources that are needed for concurrent querying. For this reason, ELT batches often run during the night (or off-peak) when the data warehouse's system resources are in lesser demand.
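
As a minimal sketch of the transformation step in ELT, the following Python snippet runs a SQL transformation inside BigQuery by using the google-cloud-bigquery client library. The project, dataset, and table names are placeholders.

# Minimal sketch of the "T" in ELT: run a SQL transformation inside BigQuery
# by using the google-cloud-bigquery client library. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

transform_sql = """
CREATE OR REPLACE TABLE reporting.daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM staging.orders
GROUP BY order_date
"""

# The query runs inside the data warehouse, so it consumes BigQuery resources;
# schedule it for off-peak hours if concurrent querying is a concern.
client.query(transform_sql).result()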

The following diagram shows a typical ELT procedure.

Flow showing source (extract) going to a data warehouse (load), then to one or more transformations (transform) that run inside the data warehouse.

Figure 3. A typical ELT procedure.

When you adopt an ELT approach, it's common to separate the extract and load into one DAG and the transformations into their own DAGs. Data is loaded into the data warehouse once and then transformed multiple times to create the different tables that are used downstream in reporting and so on. These DAGs in turn become sub-DAGs in a larger orchestration DAG (as shown in the ETL section).

When you migrate data pipelines from a congested on-premises data warehouse to the cloud, it's important to remember that cloud data warehouse systems such as BigQuery are massively parallel data processing technologies. In fact, in the case of BigQuery, you can buy more resources to support both increasing demands for ELT and concurrent querying. For more information, see the Introduction to optimizing query performance.

Extract and load (EL)

You can use the extract and load (EL) procedure on its own or followed by transformations, in which case it becomes ELT. EL is mentioned separately because several automated services are available that perform this task, mitigating the need for you to create your own ingestion data pipeline. For more details, see What is BigQuery Data Transfer Service?.

Change data capture (CDC)

Change data capture (CDC) is one of several software design patterns used to track data changes. It's often used in data warehousing because the data warehouse is used to collate and track data and its changes from various source systems over time.

The following diagram shows an example of how CDC works with ELT.

ELT flow showing individual records with version information assigned at extraction and timestamps added at load.

Figure 4. How CDC works with ELT.

CDC works well with ELT because you want to store the original record before making any changes downstream.

To make the EL part happen, you can process database logs by using CDC software such as Datastream or open source tools like Debezium and writing the records to BigQuery using Dataflow. Then you can use a SQL query to determine the latest version before you apply further transformations. Here's an example:

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY RECORDKEY
      ORDER BY EVENTTIMESTAMP DESC
    ) AS rank
  FROM TABLENAME
)
SELECT * FROM ranked
WHERE rank = 1

When you are refactoring or creating new data pipelines, consider using the CDC pattern applied as an ELT procedure. This approach ensures that you have a complete history of data changes upstream and provides a good segregation of responsibilities—for example:

  • Source system teams ensure availability of their logs and publication of their data events.
  • The data platform team ensures that the original records are ingested and collated with timestamps in the data warehouse.
  • Data engineering and analyst teams schedule a series of transformations to populate their data marts.

Feedback loops with operational data pipelines

Operational data pipelines are data processing pipelines that take data from the data warehouse, transform it if needed, and write the result into operational systems.

Operational systems refer to systems that process the organization's day-to-day transactions, such as OLTP databases, Customer Relationship Management (CRM) systems, Product Catalog Management (PCM) systems, and so on. Because these systems often act as a source of data, the operational data pipelines implement a feedback loop pattern.

The operational data pipeline pattern is shown in the following diagram.

ETL pipeline feeding into data warehouse and then into an operational pipeline that feeds back into the source system that feeds the ETL pipeline.

Figure 5. Pattern for an operational data pipeline.

The following example describes an operational data pipeline that writes product prices into a PCM system. A PCM system is the authoritative system for sales-related product information such as colors, sales channels, price, and seasonality. Here's the end-to-end flow of data:

  • Price-related data is available from multiple sources. This data can include the current price by region from the PCM, competitor pricing from a third-party service, demand forecasting and supplier reliability from internal systems, and so on.
  • An ETL pipeline pulls the data from the sources, transforms it, and writes the result into the data warehouse. The transformation in this case is a complex calculation involving all the sources with the goal of producing an optimal base price for each product in the PCM.
  • Finally, the operational pipeline takes the base prices from the data warehouse, performs light transformations to adjust the prices for seasonal events, and writes the final prices back into the PCM.

PCM system feeding into ETL system.

Figure 6. An operational data pipeline that writes product prices into a PCM system.

An operational data pipeline is a type of downstream process, whereas data pipelines implementing ETL, ELT, or CDC are upstream processes. Nevertheless, the tools used to implement both can overlap. For instance, you can use Dataflow to define and run all the data processing DAGs, GoogleSQL to define transformations that execute within BigQuery, and Cloud Composer to orchestrate the end-to-end flow of data.

Choosing a migration approach

This section describes different approaches you can adopt to migrate your data pipelines.

Redirect data pipelines to write to BigQuery

If the following conditions apply, consider whether a technology that you use offers a built-in BigQuery sink (write connector):

  • The legacy data warehouse is fed by data pipelines executing an ETL procedure.
  • The transformation logic is executed before the data is stored in the data warehouse.

Independent software vendors (ISVs) offer data processing technologies with BigQuery connectors.

Note: It's important to check that the data processing software takes advantage of the BigQuery large-scale ingestion mechanisms, such as streaming inserts or batch loads from Cloud Storage. An approach that employs the Magnitude Simba JDBC or ODBC BigQuery drivers isn't suitable for large-scale ingestion operations, because these drivers implement the query interface. While the drivers can perform inserts, this interface is intended for querying and data manipulation language (DML) statements on BigQuery, not for large-scale inserts or updates.
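
As a minimal sketch of the large-scale ingestion path, the following Python snippet streams rows into BigQuery by using the google-cloud-bigquery client library instead of going through a JDBC or ODBC driver. The project, dataset, table, and field names are placeholders.

# Minimal sketch of a streaming insert into BigQuery by using the
# google-cloud-bigquery client library. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"

rows = [
    {"event_id": "a1", "event_timestamp": "2024-01-01T00:00:00Z"},
    {"event_id": "a2", "event_timestamp": "2024-01-01T00:00:05Z"},
]

# insert_rows_json uses the streaming ingestion path rather than DML.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")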

If the data pipeline technology doesn't support data ingestion to BigQuery, consider using a variation on this approach that writes the data temporarily to files that are subsequently ingested by BigQuery.

Data pipeline that's blocked from feeding into the legacy system and instead feeds into BigQuery.

Figure 7. Rewriting, or reconfiguring, the last function of a data pipeline to write data to BigQuery.

At a high level, the work involved concerns rewriting, or reconfiguring, the last function of the data pipeline to write data to BigQuery. However, you face a number of options that might require additional changes or new work, for example:

Functional

  • Data mappings: Given that the target database table schema might change, you might need to reconfigure these mappings.
  • Metric validation: You must validate both historic and new reports, because both the schema and the queries might change.

Nonfunctional

  • Firewalls might need to be configured to allow outbound data transfer from on-premises to BigQuery.
  • Network changes might be required to create additional bandwidth to accommodate outbound data transfer.

Redirect data pipelines by using files as an intermediate vehicle

When the existing on-premises data pipeline technology doesn't support Google APIs, or if you are restricted from using Google APIs, you can use files as an intermediate vehicle for your data to reach BigQuery.

This approach is similar to the redirect approach, but instead of using a native sink that can write to BigQuery, you use a sink that can write to an on-premises file system. When your data is in your file system, you copy the files to Cloud Storage. For more details, see the overview of the ingest options for Cloud Storage and the criteria that are involved in choosing an ingest option.

The final step is to load the data from Cloud Storage into BigQuery following the guidelines in Batch loading data.
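
As a minimal sketch of this final step, the following Python snippet runs a batch load job from Cloud Storage into BigQuery by using the google-cloud-bigquery client library. The bucket, dataset, and table names are placeholders.

# Minimal sketch of a batch load job from Cloud Storage into BigQuery by
# using the google-cloud-bigquery client library. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/sales_*.csv",
    "my-project.my_dataset.sales",
    job_config=job_config,
)
load_job.result()  # Wait for the load job to complete.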

The following diagram shows the approach outlined in this section.

ETL pipeline that feeds into a file system instead of into the legacy data warehouse; the file system in turn feeds into Cloud Storage and from there to BigQuery.

Figure 8. Redirecting data pipelines by using files as an intermediate vehicle.

With respect to the orchestration of the ETL pipeline, you need to perform two separate steps:

  1. Reuse your existing on-premises pipeline orchestration to write the transformed data into the file system. Extend this orchestration to copy the files from your on-premises file system into Cloud Storage, or create an additional script that runs regularly to perform the copy step.
  2. When the data is in Cloud Storage, use a Cloud Storage transfer to schedule recurring loads from Cloud Storage to BigQuery. Alternatives to Cloud Storage transfers are Cloud Storage triggers and Cloud Composer (a Cloud Composer sketch follows the note below).

In Figure 8, note how it's also possible for the orchestration on Google Cloud to use a pull model by retrieving the files using a protocol such as SFTP.
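
As a minimal sketch of the Cloud Composer alternative mentioned in step 2, the following Airflow DAG loads files from Cloud Storage into BigQuery on a daily schedule. The DAG ID, bucket, dataset, and table names are placeholders.

# Minimal Cloud Composer (Airflow) sketch: a DAG that loads files from
# Cloud Storage into BigQuery on a schedule. Names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_sales_files",
        bucket="my-bucket",
        source_objects=["exports/sales_*.csv"],
        destination_project_dataset_table="my-project.my_dataset.sales",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
    )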

Migrate existing ELT pipelines to BigQuery

ELT pipelines consist of two parts: the part that loads the data into your data warehouse, and the part that transforms the data by using SQL so it can be consumed downstream. When you migrate ELT pipelines, each of these parts has its own approach for migration.

For the part that loads data into your data warehouse (the EL part), you can follow the guidelines in the redirect data pipelines section, minus the advice on transformations, which are not part of an EL pipeline.

If your data sources are supported by the BigQuery Data Transfer Service (DTS) either directly or through third-party integrations, you can use DTS to replace your EL pipeline.

Migrate existing OSS data pipelines to Dataproc

When you migrate your data pipeline to Google Cloud, you might want to migrate some legacy jobs that are written with an open source software framework like Apache Hadoop, Apache Spark, or Apache Flink.

Dataproc lets you deploy fast, easy-to-use, fully managed Hadoop and Spark clusters in a simple, cost-efficient way. Dataproc integrates with the BigQuery connector, a Java library that enables Hadoop and Spark to directly write data to BigQuery by using abstracted versions of the Apache Hadoop InputFormat and OutputFormat classes.
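
The following is a minimal PySpark sketch that reads from and writes to BigQuery, assuming the Spark BigQuery connector is available on the Dataproc cluster (the connector described above is the Java library; this snippet uses the Spark DataFrame interface instead). The project, dataset, table, column, and bucket names are placeholders.

# Minimal PySpark sketch, assuming the Spark BigQuery connector is available
# on the Dataproc cluster. Names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

# Read the source table directly from BigQuery.
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.orders")
    .load()
)

# A simple transformation: total order amount per customer.
totals = orders.groupBy("customer_id").sum("amount")

# Write the result back to BigQuery; the indirect write method stages the
# data in a temporary Cloud Storage bucket.
(
    totals.write.format("bigquery")
    .option("table", "my-project.my_dataset.customer_totals")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)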

Dataproc makes it easy to create and delete clusters, so that instead of using one monolithic cluster, you can use many ephemeral clusters. This approach has several advantages:

  • You can use different cluster configurations for individual jobs, eliminating the administrative burden of managing tools across jobs.
  • You can scale clusters to suit individual jobs or groups of jobs.
  • You pay only for resources when your jobs are using them.
  • You don't need to maintain clusters over time, because they are freshly configured every time you use them.
  • You don't need to maintain separate infrastructure for development, testing, and production. You can use the same definitions to create as many different versions of a cluster as you need when you need them.

When you migrate your jobs, we recommend that you take an incremental approach. By migrating incrementally, you can do the following:

  • Isolate individual jobs in your existing Hadoop infrastructure from the complexity that's inherent in a mature environment.
  • Examine each job in isolation to evaluate its needs and to determine the best path for migration.
  • Handle unexpected problems as they arise without delaying dependent tasks.
  • Create a proof of concept for each complex process without affecting your production environment.
  • Move your jobs to the recommended ephemeral model thoughtfully and deliberately.

When you migrate your existing Hadoop and Spark jobs to Dataproc, check that your jobs' dependencies are covered by the supported Dataproc versions. If you need to install custom software, you might consider creating your own Dataproc image, using some of the available initialization actions (for example, for Apache Flink), writing your own initialization action, or specifying custom Python package requirements.

To get started, see the Dataproc quickstart guides and the BigQuery connector code samples.

Rehost third-party data pipelines to run on Google Cloud

A common scenario when building data pipelines on-premises is to use third-party software to manage execution of the pipeline and allocation of computing resources.

To move these pipelines to the cloud, you have several alternatives, depending on the capabilities of the software that you are using, and also depending on your licensing, support, and maintenance terms.

The following sections present some of these alternatives.

At a high level, you have the following alternatives for executing your third-party software in Google Cloud, from least to most complex:

  • Your software vendor has partnered with Google Cloud to offer their software in Google Cloud Marketplace.
  • Your third-party software can run on Kubernetes.
  • Your third-party software runs on one or more virtual machines (VMs).

If your third-party software provides a Cloud Marketplace solution, this alternative is the simplest: you onboard your data pipelines to the cloud by using the familiar platform provided by your vendor. You might also be able to use proprietary tools from your vendor to facilitate migration between your original environment and your new environment on Google Cloud.

If your vendor doesn't provide a Cloud Marketplace solution, but their product is able to run on top of Kubernetes, you can use Google Kubernetes Engine (GKE) to host your pipelines. The following work is involved:

  • Create a GKE cluster by following the recommendations from your vendor to make sure that the third-party product can take advantage of the task parallelization that Kubernetes offers.
  • Install your third-party software on your GKE cluster by following the vendor recommendations.
  • Select and migrate your use cases by following the iterative approach explained in Migrating data warehouses to BigQuery: Overview.

This alternative provides a middle ground in terms of complexity. It takes advantage of your vendor's native support for Kubernetes in order to scale and parallelize the execution of your pipelines. However, it requires you to create and manage a GKE cluster.

If your vendor doesn't support Kubernetes, you must install their software on a pool of VMs to enable scaling out and parallelizing the work. If your vendor software natively supports the distribution of work to several VMs, use those provided facilities, possibly grouping the VM instances in a managed instance group (MIG) for scaling in and out as required.

Handling the parallelization of the work is nontrivial. If your vendor doesn't provide capabilities for distributing tasks to different VMs, we recommend using a task-farming pattern to distribute work to VMs in a MIG. The following diagram illustrates this approach.

Multiple inputs go to Pub/Sub, which creates topics. The topics are read by different MIGs.

Figure 9. A managed instance group (MIG) with three VMs.

In this diagram, each VM in the MIG executes the third-party pipeline software. You can trigger a pipeline execution in several ways; in essence, each method sends a message to a predefined Pub/Sub topic. You create a simple agent to be installed in each VM. The agent listens to one or more Pub/Sub topics. Whenever a message arrives in the topic, the agent pulls the message from the topic, starts a pipeline in your third-party software, and listens for its completion. When the pipeline is completed, the agent retrieves the next message from the topics it's listening to.
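
The following is a minimal Python sketch of such an agent, assuming a hypothetical command-line entry point (run_pipeline.sh) for the third-party pipeline software. The project and subscription names are placeholders.

# Minimal sketch of the VM agent: pull one message at a time, run the
# pipeline, and acknowledge the message only after the pipeline finishes.
import subprocess

from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"
SUBSCRIPTION_ID = "pipeline-triggers-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

while True:
    try:
        # Synchronously pull at most one message so that only one pipeline
        # execution runs on this VM at a time.
        response = subscriber.pull(
            request={"subscription": subscription_path, "max_messages": 1},
            timeout=30,
        )
    except exceptions.DeadlineExceeded:
        continue  # No message arrived within the timeout; poll again.
    for received in response.received_messages:
        pipeline_name = received.message.data.decode("utf-8")
        # Start the third-party pipeline (hypothetical entry point) and wait
        # for it to finish. For long-running pipelines, also extend the
        # acknowledgement deadline so the message isn't redelivered.
        subprocess.run(["./run_pipeline.sh", pipeline_name], check=True)
        subscriber.acknowledge(
            request={
                "subscription": subscription_path,
                "ack_ids": [received.ack_id],
            }
        )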

In all scenarios, we recommend that you work with your vendor to comply with the appropriate licensing terms for your pipelines to work on Google Cloud.

Rewrite data pipelines to use Google Cloud-managed services

In some cases, you might elect to rewrite some of your existing data pipelines to use new frameworks and services that are fully managed on Google Cloud. This option works well if your existing pipelines were originally implemented with technologies that are now deprecated, or if you anticipate that porting and continuing to maintain those pipelines unmodified in the cloud would be too impractical or cost prohibitive.

The following sections present fully managed Google Cloud services that let you perform advanced data transformations at scale: Cloud Data Fusion and Dataflow.

Cloud Data Fusion

Cloud Data Fusion, which is built on the open source CDAP project, is a fully managed data integration service for building and managing data pipelines through a graphical interface.

You develop the data pipelines in the Cloud Data Fusion UI by connecting sources to transformations, sinks, and other nodes to form a DAG. When you deploy your data pipeline, the Cloud Data Fusion planner transforms this DAG into a series of parallel computations that are executed as an Apache Spark job on Dataproc.

When using Cloud Data Fusion, you can connect to a source system's database by using the Java Database Connectivity (JDBC) drivers to read data, transform it, and load it into a destination of your choice (for example, BigQuery), without having to write any code. To do this, you need to upload a JDBC driver to your Cloud Data Fusion instance and configure it so that you can use it in your data pipelines. For more details, see the guide on using JDBC drivers with Cloud Data Fusion.

Cloud Data Fusion exposes plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions as customizable components. Prebuilt plugins offer access to a wide range of data sources. If a plugin doesn't exist, you can build your own plugin by using the Cloud Data Fusion plugin APIs. For more information, see the Plugin overview.

With Cloud Data Fusion pipelines, you can create both batch and streaming data pipelines. By providing access to logs and metrics, data pipelines also offer ways for administrators to operationalize their data processing workflows without needing custom tooling.

To get started, see the Cloud Data Fusion conceptual overview. For practical examples, see the quickstart guide and the tutorial on creating a targeting campaign pipeline.

Dataflow

Dataflow is a fully managed service for running Apache Beam jobs at scale. Apache Beam is an open source framework that provides a rich set of windowing and session-analysis primitives as well as an ecosystem of source and sink connectors, including a connector for BigQuery. Apache Beam lets you transform and enrich data both in stream (real time) and batch (historical) modes with equal reliability and expressiveness.
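
As a minimal sketch, the following Apache Beam pipeline (Python SDK) reads CSV lines from Cloud Storage, parses them, and writes the rows to BigQuery. The bucket, dataset, table, and field names are placeholders; to run the pipeline on Dataflow, pass --runner=DataflowRunner together with the usual project, region, and temp_location options.

# Minimal Apache Beam sketch that reads CSV lines from Cloud Storage,
# parses them, and writes the rows to BigQuery. Names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv_line(line):
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Read files" >> beam.io.ReadFromText("gs://my-bucket/exports/sales_*.csv")
        | "Parse lines" >> beam.Map(parse_csv_line)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.sales",
            schema="order_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )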

The serverless approach of Dataflow removes operational overhead, with performance, scaling, availability, security, and compliance handled automatically. This lets you focus on programming instead of managing server clusters.

You can submit Dataflow jobs in different ways, either through the command-line interface, the Java SDK, or the Python SDK. Also, we are developing a portability framework to bring full interoperability between all SDKs and runners.

If you want to migrate your data queries and pipelines from other frameworks to Apache Beam and Dataflow, read about the Apache Beam programming model and browse the official Dataflow documentation.

For practical examples, see the Dataflow quickstarts and tutorials.

Orchestration and scheduling

At a high level, orchestration is the automated coordination of several systems, whereas scheduling refers to the automated triggering of orchestration work.

  • Zooming in: A data pipeline is in itself an orchestration of data transformations described by a DAG, which is a data processing DAG.
  • Zooming out: When a data pipeline depends on the output of other data pipelines, you need orchestration of multiple pipelines. Each pipeline constitutes a sub-DAG in a larger DAG, which is an orchestration DAG.

This setup is typical in data warehousing. Figure 2 in the ETL section shows an example setup. The following sections focus on the orchestration of several data pipelines.

Dependencies

Dependencies can be fan-in, where multiple data pipelines merge into a vertex of an orchestration DAG; fan-out, where a single data pipeline triggers multiple others; or often both, as shown in the following diagram.

Multiple pipelines labeled A, B, and C fan in to pipeline D. Pipeline D fans out to pipelines E, F, and G. All of this is orchestrated by an orchestration DAG.

Figure 10. Fan-in and fan-out dependencies used in combination.

In suboptimal environments, some dependencies are a result of limitations in the amount of available resources. For example, a data pipeline runs and produces some common data as a byproduct. Other data pipelines depend on this common data simply to avoid recalculating it, but are unrelated to the data pipeline that created the data. If this first pipeline encounters any functional or nonfunctional issues, failures cascade down to its dependent data pipelines—at best, forcing them to wait, or at worst, preventing them from running at all, as shown in the following diagram.

Pipeline A experiences a failure. Pipelines B and C are dependent on the output from pipeline A, so they fail as well.

Figure 11. Failures cascading down a data pipeline prevent dependent pipelines from running.

In Google Cloud, a wealth of compute resources and specialized tools are available to allow you to optimize the execution of your pipelines and their orchestration. The remaining sections discuss these resources and tools.

Migration work involved

It's a best practice to simplify your orchestration needs. Your orchestration increases in complexity with the number of dependencies between your data pipelines. Migrating to Google Cloud presents an opportunity to examine your orchestration DAGs, identify your dependencies, and determine how to optimize those dependencies.

We recommend optimizing your dependencies incrementally, as follows:

  1. In a first iteration, move your orchestration as is to Google Cloud.
  2. In later iterations, analyze your dependencies and parallelize them if feasible.
  3. Finally, reorganize your orchestration by extracting common tasks into their own DAGs.

The next section explains this method with a practical example.

A practical example

Suppose that an organization has two related pipelines:

  • The first pipeline calculates the profits and losses (P&L) for the whole organization. It's a complex pipeline involving many transformations. Part of the pipeline consists of calculating the monthly sales, which are used in subsequent transformation steps and eventually written to a table.
  • The second pipeline calculates the year-over-year and month-over-month sales growth for different products so that the marketing department can tune its ad campaign efforts. This pipeline needs the monthly sales data previously calculated by the P&L data pipeline.

The organization considers the P&L data pipeline to have higher priority than the marketing pipeline. Unfortunately, because P&L is a complex data pipeline, it consumes a large amount of resources, preventing other pipelines from running concurrently. In addition, if the P&L pipeline fails, the marketing pipeline and other dependent pipelines don't have the required data to be able to run, and must wait for a retry of P&L. The following diagram illustrates this situation.

The P&L pipeline creates a 'monthly sales' artifact that's required for the marketing pipeline. The P&L pipeline can experience delays and other issues.

Figure 12. Complex data pipelines can prevent lower-priority pipelines from running.

The organization is migrating to BigQuery. It has identified the two use cases—P&L and marketing sales growth—and included them in the migration backlog. When planning the next iteration, the organization prioritizes the P&L use case and includes it in the iteration backlog because it's severely limited by the current on-premises resources and regularly causes delays. Some of its dependent use cases are also included, among them the marketing use case.

The migration team runs the first iteration. They choose to move both the P&L and marketing use cases to Google Cloud by using a redirect approach. They make no changes to the pipeline steps or orchestration. An important difference is that the P&L pipeline now has almost unlimited compute power at its disposal, and therefore executes much faster than on-premises. The pipeline writes the monthly sales data to a BigQuery table that the marketing growth pipeline uses. The following diagram illustrates these changes.

The P&L pipeline is the same as before but does not experience delays.

Figure 13. Speeding up a complex data pipeline by using a redirect approach.

Although Google Cloud has helped with the nonfunctional P&L issues, functional issues still remain. Some unrelated tasks that precede the calculation of the monthly sales often cause errors that prevent that calculation from happening, and result in the dependent pipelines being unable to start.

In a second iteration, the team hopes to improve performance by including both use cases in the iteration backlog. The team identifies the pipeline steps to calculate the monthly sales in the P&L pipeline. The steps constitute a sub-DAG, as shown in the next diagram. The migration team copies the sub-DAG into the marketing pipeline so that the marketing pipeline can run independently of P&L. Having sufficient computing power in Google Cloud enables both pipelines to run concurrently.

The P&L pipeline and marketing pipeline now run as separate sub-DAGs, so the marketing pipeline is no longer affected if there are issues in the P&L pipeline.

Figure 14. Pipelines running concurrently by using a sub-DAG.

The downside is that duplicating the sub-DAG logic creates code management overhead, because now the team needs to keep both copies of the sub-DAG logic in sync.

In a third iteration, the team revisits the use cases and extracts the monthly sales sub-DAG into an independent pipeline. When the new monthly sales pipeline is done, it triggers or fans out into the P&L, marketing growth, and other dependent pipelines. This configuration creates a new overall orchestration DAG, with each of the pipelines being one of its sub-DAGs.

The monthly sales pipeline is now first, feeding the P&L pipeline and the marketing pipeline.

Figure 15. Overall orchestration DAG with each pipeline in its own sub-DAG.

In subsequent iterations, the migration team can solve any remaining functional issues and migrate the pipelines to Google Cloud-managed services, such as Cloud Composer for orchestration and Dataflow or Cloud Data Fusion for data processing.

Even though Airflow supports sub-DAGs natively, this functionality might limit its performance and is therefore discouraged. In their place, use independent DAGs with the TriggerDagRunOperator operator.
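
As a minimal sketch of this approach, the following Airflow DAG fans out from a monthly sales pipeline to the P&L and marketing pipelines by using the TriggerDagRunOperator operator. The DAG and task IDs are placeholders.

# Minimal Airflow sketch of the fan-out described above: when the monthly
# sales DAG finishes its work, it triggers the P&L and marketing DAGs as
# independent DAGs instead of sub-DAGs. IDs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="monthly_sales",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    # Placeholder for the transformation steps that produce the monthly sales table.
    calculate_monthly_sales = EmptyOperator(task_id="calculate_monthly_sales")

    trigger_pnl = TriggerDagRunOperator(
        task_id="trigger_pnl",
        trigger_dag_id="profit_and_loss",
    )
    trigger_marketing = TriggerDagRunOperator(
        task_id="trigger_marketing",
        trigger_dag_id="marketing_growth",
    )

    calculate_monthly_sales >> [trigger_pnl, trigger_marketing]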

What's next

Learn more about the following steps in data warehouse migration:

You can also learn about moving from specific data warehouse technologies to BigQuery:
