Introduction to Cloud Data Fusion: Studio

This page introduces Cloud Data Fusion: Studio, which is a visual, click-and-drag interface for building data pipelines from a library of prebuilt plugins, and an interface where you configure, execute, and manage your pipelines. Building a pipeline in the Studio typically follows this process:

  1. Connect to an on-premises or cloud data source.
  2. Prepare and transform your data.
  3. Connect to the destination.
  4. Test your pipeline.
  5. Execute your pipeline.
  6. Schedule and trigger your pipelines.

After you design and execute the pipeline, you can manage pipelines on the Cloud Data Fusion Pipeline Studio page:

  • Reuse pipelines by parameterizing them with preferences and runtime arguments.
  • Manage pipeline execution by customizing compute profiles, managing resources, and fine-tuning pipeline performance.
  • Manage pipeline lifecycle by editing pipelines.
  • Manage pipeline source control using Git integration.
Note: The Studio also provides administrative controls to centrally manage your configurations.

Figure: User journey in Cloud Data Fusion Studio


Cloud Data Fusion: Studio overview

The Studio includes the following components.

Administration

Cloud Data Fusion lets you have multiple namespaces in each instance. Within the Studio, administrators can manage all of the namespaces centrally, or each namespace individually.

The Studio provides the following administrator controls:

System Administration
The System Admin module in the Studio lets you create new namespaces and define the central compute profile configurations at the system level, which are applicable to each namespace in that instance. For more information, see Manage Studio administration.
Namespace Administration
The Namespace Admin module in the Studio lets you manage the configurations for a specific namespace. For each namespace, you can define compute profiles, runtime preferences, drivers, service accounts, and Git configurations. For more information, see Manage Studio administration.
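
These administrative settings can also be reached programmatically. The following is a minimal sketch, assuming your instance's API endpoint (shown on the instance details page) and a gcloud-authenticated shell; the /v3 paths come from the CDAP REST API that Cloud Data Fusion exposes, so verify them against the reference for your version:

    import subprocess

    import requests

    # Hypothetical endpoint; copy the real one from your instance details.
    CDF_ENDPOINT = "https://example-dot-usw1.datafusion.googleusercontent.com/api"

    token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True).strip()
    headers = {"Authorization": f"Bearer {token}"}

    # System-level view: list every namespace in the instance.
    resp = requests.get(f"{CDF_ENDPOINT}/v3/namespaces", headers=headers)
    resp.raise_for_status()
    print([ns["name"] for ns in resp.json()])

    # Namespace-level admin: set a runtime preference on one namespace.
    prefs = {"spark.executor.memory": "4g"}  # hypothetical preference
    requests.put(f"{CDF_ENDPOINT}/v3/namespaces/default/preferences",
                 headers=headers, json=prefs).raise_for_status()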

Pipeline Design Studio

You design and execute pipelines in the Pipeline Design Studio in the Cloud Data Fusion web interface. Designing and executing data pipelines includes the following steps:

  • Connect to a source: Cloud Data Fusion allows connections to on-premises and cloud data sources. The Studio interface has default system plugins, which come pre-installed in the Studio. You can download additional plugins from a plugin repository, known as the Hub. For more information, see the Plugins overview.
  • Data preparation: Cloud Data Fusion lets you prepare your data using its powerful data preparation plugin: Wrangler. Wrangler helps you view, explore, and transform a small sample of your data in one place before running the logic on the entire dataset in the Studio. This lets you quickly apply transformations to gain an understanding of how they affect the entire dataset. You can create multiple transformations and add them to a recipe (a sample recipe appears after this list). For more information, see the Wrangler overview.
  • Transform: Transform plugins change data after it's loaded from a source—for example, you can clone a record, change the file format to JSON, or use the JavaScript plugin to create a custom transformation (a sketch of one appears after this list). For more information, see the Plugins overview.
  • Connect to a destination: After you prepare the data and apply transformations, you can connect to the destination where you plan to load the data. Cloud Data Fusion supports connections to multiple destinations. For more information, see the Plugins overview.
  • Preview: After you design the pipeline, you can run a Preview job to debug issues before you deploy and run the pipeline. If you encounter any errors, you can fix them while in Draft mode. The Studio uses the first 100 rows of your source dataset to generate the preview. The Studio displays the status and duration of the Preview job. You can stop the job anytime. You can also monitor the log events as the Preview job runs. For more information, see Preview data.
  • Manage pipeline configurations: After you preview the data, you can deploy the pipeline and manage the following pipeline configurations:

    • Compute configuration: You can change the compute profile that runs the pipeline—for example, you want to run the pipeline against a customized Dataproc cluster rather than the default Dataproc cluster.
    • Pipeline configuration: For each pipeline, you can enable or disable instrumentation, such as timing metrics. By default, instrumentation is enabled.
    • Engine configuration: Spark is the default execution engine. You can pass custom parameters for Spark.
    • Resources: You can specify the memory and number of CPUs for the Spark driver and executor. The driver orchestrates the Spark job. The executor handles the data processing in Spark.
    • Pipeline alert: You can configure the pipeline to send alerts and start post-processing tasks after the pipeline run finishes. You create pipeline alerts when you design the pipeline. After you deploy the pipeline, you can view the alerts. To change alert settings, you can edit the pipeline.
    • Transformation pushdown: You can enable Transformation pushdown if you want a pipeline to execute certain transformations in BigQuery.

    For more information, see Manage pipeline configurations.
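
    As noted in the data preparation step, Wrangler transformations are collected into a recipe, which is an ordered list of directives. A minimal sketch of such a recipe follows; the column names are hypothetical, and the directive syntax should be checked against the Wrangler directives reference:

        parse-as-csv :body ',' true
        drop :body
        rename :col1 :customer_id
        fill-null-or-empty :state 'N/A'
        set-type :price double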
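
    Similarly, the JavaScript plugin from the transform step expects a script that defines a transform function, which receives each input record and an emitter. A minimal sketch, with hypothetical field names:

        /**
         * Called once per input record; emit zero or more output records.
         */
        function transform(input, emitter, context) {
          // Derive a new field from two existing ones (hypothetical names).
          input.full_name = input.first_name + ' ' + input.last_name;
          emitter.emit(input);
        }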
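
    Several of these configurations can also be supplied as runtime arguments rather than fixed at design time. The sketch below shows the general idea in Python; the reserved key names are assumptions based on the documented system arguments, so confirm them for your version before use:

        # Hypothetical runtime arguments for a single pipeline run.
        runtime_args = {
            # Run against a user-created compute profile instead of the default.
            "system.profile.name": "USER:my-custom-dataproc",
            # Give each Spark executor 4096 MB of memory.
            "task.executor.system.resources.memory": "4096",
        }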

  • Reuse pipelines using macros, preferences, and runtime arguments: Cloud Data Fusion lets you reuse data pipelines. With reusable data pipelines, you can have a single pipeline that can apply a data integration pattern to a variety of use cases and datasets. Reusable pipelines give you better manageability. They let you set most of the configuration of a pipeline at execution time, instead of hard-coding it at design time. In the Pipeline Design Studio, you can use macros to add variables to plugin configurations so that you can specify the variable substitutions at runtime. For more information, see Manage macros, preferences, and runtime arguments.
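
    A macro such as ${source.path} in a plugin property is resolved from the runtime arguments of each run. Below is a minimal sketch of starting a deployed batch pipeline with such arguments over the REST API; the pipeline name, endpoint, and macro key are hypothetical, while DataPipelineWorkflow is the standard program name for batch pipelines:

        import subprocess

        import requests

        CDF_ENDPOINT = "https://example-dot-usw1.datafusion.googleusercontent.com/api"
        token = subprocess.check_output(
            ["gcloud", "auth", "print-access-token"], text=True).strip()
        headers = {"Authorization": f"Bearer {token}"}

        # Values substituted for ${source.path} macros in this run only.
        run_args = {"source.path": "gs://my-bucket/input/2024-01-01/"}
        requests.post(
            f"{CDF_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"
            "/workflows/DataPipelineWorkflow/start",
            headers=headers, json=run_args).raise_for_status()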

  • Execute: Once you have reviewed the pipeline configurations, you can initiate the pipeline execution. You can see the status change during the phases of the pipeline run—for example, provisioning, starting, running, and success.
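
    Those phases can also be watched programmatically. A sketch that polls the runs endpoint for the latest run's status, under the same assumptions as the earlier sketches (endpoint and pipeline name are hypothetical; the status values follow the CDAP run states):

        import subprocess
        import time

        import requests

        CDF_ENDPOINT = "https://example-dot-usw1.datafusion.googleusercontent.com/api"
        token = subprocess.check_output(
            ["gcloud", "auth", "print-access-token"], text=True).strip()
        headers = {"Authorization": f"Bearer {token}"}

        runs_url = (f"{CDF_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"
                    "/workflows/DataPipelineWorkflow/runs")
        while True:
            latest = requests.get(runs_url, headers=headers).json()[0]
            print(latest["status"])  # e.g. STARTING, RUNNING, COMPLETED
            if latest["status"] in ("COMPLETED", "FAILED", "KILLED"):
                break
            time.sleep(30)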

  • Schedule and orchestrate: Batch data pipelines can be set to run on a specified schedule and frequency. After you create and deploy a pipeline, you can create a schedule. In the Pipeline Design Studio, you can orchestrate pipelines by creating a trigger on a batch data pipeline to have it run when one or more pipeline runs complete. These are called downstream and upstream pipelines. You create a trigger on the downstream pipeline so that it runs based on the completion of one or more upstream pipelines.

    Recommended: You can also use Cloud Composer to orchestrate pipelines in Cloud Data Fusion. For more information, see Schedule pipelines and Orchestrate pipelines.
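
    For instance, a Cloud Composer DAG can start a deployed pipeline with the Airflow Google provider's Data Fusion operator. A minimal sketch; the instance, region, and pipeline names are placeholders, and the runtime_args parameter is assumed to match your provider version:

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.google.cloud.operators.datafusion import (
            CloudDataFusionStartPipelineOperator,
        )

        with DAG(
            dag_id="start_cdf_pipeline",
            start_date=datetime(2024, 1, 1),
            schedule="@daily",
            catchup=False,
        ) as dag:
            # Starts the deployed batch pipeline once per day.
            CloudDataFusionStartPipelineOperator(
                task_id="start_pipeline",
                location="us-west1",          # Cloud Data Fusion region
                instance_name="my-instance",  # instance that hosts the pipeline
                pipeline_name="my-pipeline",  # deployed batch pipeline
                runtime_args={"source.path": "gs://my-bucket/input/"},
            )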

  • Edit pipelines: Cloud Data Fusion lets you edit a deployed pipeline. When you edit a deployed pipeline, it creates a new version of the pipeline with the same name and marks it as the latest version. This lets you develop pipelines iteratively rather than duplicating pipelines, which creates a new pipeline with a different name. For more information, see Edit pipelines.

  • Source Control Management: Cloud Data Fusion lets you better manage pipelines between development and production with Source Control Management of the pipelines using GitHub.

  • Logging and monitoring: To monitor pipeline metrics and logs, it's recommended that you enable the Stackdriver logging service to use Cloud Logging with your Cloud Data Fusion pipeline.

