Run a pipeline against an existing Dataproc cluster

This page describes how to run a pipeline in Cloud Data Fusion against an existing Dataproc cluster.

By default, Cloud Data Fusion creates an ephemeral cluster for each pipeline run: it creates the cluster at the beginning of the run and deletes it after the run completes. This behavior saves costs by ensuring that resources are created only when required, but it might not be desirable in the following scenarios:

If the time it takes to create a new cluster for every pipeline is prohibitive for your use case.
If your organization requires cluster creation to be managed centrally; for example, when you want to enforce certain policies for all Dataproc clusters.

In these scenarios, you can instead run pipelines against an existing cluster by following the steps on this page.

Before you begin

You need the following:

A Cloud Data Fusion instance. To create one, see Create a Cloud Data Fusion instance.
An existing Dataproc cluster. To create one, see Create a Dataproc cluster.
If you run your pipelines in Cloud Data Fusion version 6.2, use an older Dataproc image that runs with Hadoop 2.x (for example, 1.5-debian10), or upgrade to the latest Cloud Data Fusion version.

Connect to the existing cluster

In Cloud Data Fusion versions 6.2.1 and later, you can connect to an existing Dataproc cluster when you create a new Compute Engine profile.
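The version requirement above can be checked in a script before you try to create the profile. The following is a minimal Python sketch, assuming the instance version is available as a dotted numeric string such as 6.2.1. The function names are hypothetical and are not part of any Cloud Data Fusion API.

```python
from typing import Tuple

# Minimum Cloud Data Fusion version that can connect to an existing
# Dataproc cluster, as stated on this page.
MIN_VERSION = (6, 2, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric string such as '6.2.1' into (6, 2, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def supports_existing_dataproc(version: str) -> bool:
    """Return True if the given version is 6.2.1 or later."""
    return parse_version(version) >= MIN_VERSION


if __name__ == "__main__":
    for candidate in ("6.2.0", "6.2.1", "6.10.0"):
        print(candidate, supports_existing_dataproc(candidate))
```

Comparing integer tuples rather than raw strings keeps a version such as 6.10.0 ordered after 6.2.1.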

1. Go to your instance:
   1. In the Google Cloud console, go to the Cloud Data Fusion page.
   2. To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
2. Click System admin.
3. Click the Configuration tab.
4. Click System compute profiles.
5. Click Create new profile. A page of provisioners opens.
6. Click Existing Dataproc.
7. Enter the profile, cluster, and monitoring information.
8. Click Create.

Configure your pipeline to use the custom profile

1. Go to your instance:
   1. In the Google Cloud console, go to the Cloud Data Fusion page.
   2. To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
2. Go to your pipeline on the Studio page.
3. Click Configure.
4. Click Compute config.
5. Click the profile that you created.
   Figure 1: Click the custom profile
6. Run the pipeline. It runs against the existing Dataproc cluster.
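
If you start pipelines from a script rather than from the Studio, the profile is typically selected through a runtime argument instead of the Compute config dialog. The sketch below assumes the CDAP convention of a system.profile.name runtime argument whose value is the profile name prefixed with its scope (SYSTEM or USER). This page does not describe that convention, so verify it against the documentation for your version. The helper name and the example profile name are hypothetical.

```python
from typing import Dict

# Assumed scope prefixes for profile names (not confirmed by this page).
VALID_SCOPES = ("SYSTEM", "USER")


def profile_runtime_args(profile_name: str, scope: str = "SYSTEM") -> Dict[str, str]:
    """Build runtime arguments that select a compute profile for a run.

    The default scope is SYSTEM because the steps on this page create
    the profile under System compute profiles.
    """
    normalized_scope = scope.upper()
    if normalized_scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}, got {scope!r}")
    if not profile_name:
        raise ValueError("profile_name must not be empty")
    # Assumed key and value format: system.profile.name = SCOPE:name
    return {"system.profile.name": f"{normalized_scope}:{profile_name}"}


if __name__ == "__main__":
    # "existing-dataproc" is a placeholder profile name.
    print(profile_runtime_args("existing-dataproc"))
```

The resulting dictionary would be passed as the runtime arguments of the pipeline run. How you submit it depends on the client you use.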

What's next

Learn more about configuring clusters.
Troubleshoot deleting clusters.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.
