Dataproc cluster configuration

In Cloud Data Fusion, cluster configuration refers to defining how your data processing pipelines utilize computational resources when running Spark jobs on Dataproc. This page describes the main approaches to cluster configuration.

Default ephemeral clusters (recommended)

Using the default clusters is the recommended approach for Cloud Data Fusion pipelines.

  • Cloud Data Fusion automatically provisions and manages ephemeral Dataproc clusters for each pipeline execution. It creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.
  • Benefits of ephemeral clusters:
    • Simplicity: you don't need to manually configure or manage the cluster.
    • Cost-effectiveness: you only pay for the resources used during pipeline execution.
Note: By default, Cloud Data Fusion uses the Dataproc Autoscaling compute profile, which creates ephemeral clusters with the default configurations.

To adjust clusters and tune performance, see Cluster sizing.

Static clusters (for specific scenarios)

In the following scenarios, you can use static clusters:

  • Long-running pipelines: for pipelines that run continuously or for extended periods, a static cluster can be more cost-effective than repeatedly creating and tearing down ephemeral clusters.
  • Centralized cluster management: if your organization requires centralized control over cluster creation and management policies, static clusters can be used alongside tools like Terraform.
  • Cluster creation time: when the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

However, static clusters require more manual configuration and involve managing the cluster lifecycle yourself.

To use a static cluster, you must set the following properties on the Dataproc cluster:

dataproc:dataproc.conscrypt.provider.enable=false
capacity-scheduler:yarn.scheduler.capacity.resource-calculator="org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator"
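As a sketch, these properties can be set at cluster creation time with the gcloud CLI's `--properties` flag, which takes `prefix:property=value` pairs. The cluster name and region below are placeholders; substitute your own values.

```shell
# Create a static Dataproc cluster with the properties Cloud Data Fusion
# requires. Cluster name and region are hypothetical examples.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --properties='dataproc:dataproc.conscrypt.provider.enable=false,capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator'
```

The `capacity-scheduler:` prefix maps the property into `capacity-scheduler.xml`, and `dataproc:` sets a Dataproc service property.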

Cluster configuration options for static clusters

If you choose to use static clusters, Cloud Data Fusion offers configuration options for the following aspects:

  • Worker machine type: specify the virtual machine type for the worker nodes in your cluster. This determines the vCPUs and memory available for each worker.
  • Number of workers: define the initial number of worker nodes in your cluster. Dataproc might still autoscale this number, based on workload.
  • Zone: select your cluster's Google Cloud zone. Location can affect data locality and network performance.
  • Additional configurations: you can configure advanced options for your static cluster, such as preemption settings, network settings, and initialization actions.
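The options above correspond to flags on the gcloud CLI when you provision the static cluster yourself. A minimal sketch, with hypothetical machine type, worker count, and zone values:

```shell
# Provision a static cluster with an explicit worker machine type,
# worker count, and zone. All values shown are example choices.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --worker-machine-type=n2-standard-8 \
    --num-workers=5
```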

Best practices

When creating a static cluster for your pipelines, use the following configurations.

Parameter | Description
yarn.nodemanager.delete.debug-delay-sec | Retains YARN logs. Recommended value: 86400 (equivalent to one day).
yarn.nodemanager.pmem-check-enabled | Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. Recommended value: false.
yarn.nodemanager.vmem-check-enabled | Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory. Recommended value: false.
dataproc.scheduler.driver-size-mb | The average memory footprint of the driver. If the master node lacks sufficient memory to run the driver process, Dataproc queues the job, which can impact job concurrency. You can mitigate this by using a master node with more memory. Recommended value: 2048.
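One way to apply these recommended values is at cluster creation, again via `--properties`; the `yarn:` prefix writes into `yarn-site.xml` and `dataproc:` sets the scheduler property. The cluster name and region are placeholders.

```shell
# Create a static cluster with the best-practice property values from
# the table above. Cluster name and region are hypothetical.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.delete.debug-delay-sec=86400,yarn:yarn.nodemanager.pmem-check-enabled=false,yarn:yarn.nodemanager.vmem-check-enabled=false,dataproc:dataproc.scheduler.driver-size-mb=2048'
```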

For more information, see Run a pipeline against an existing Dataproc cluster.

Reusing clusters

You can reuse Dataproc clusters between runs to improve processing time. Cluster reuse is implemented in a model similar to connection pooling or thread pooling. Any cluster is kept up and running for a specified time after the run is finished. When a new run is started, it will try to find an idle cluster available that matches the configuration of the compute profile. If one is present, it will be used; otherwise, a new cluster will be started.

Considerations for reusing clusters

  • Clusters are not shared. Similar to the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time. A cluster is reused only if it is idle.
  • If you enable cluster reuse for all your runs, the necessary number of clusters to process all your runs will be created as needed. Similar to the ephemeral Dataproc provisioner, there is no direct control on the number of clusters created. You can still use Google Cloud quotas to manage resources. For example, if you run 100 runs with 7 maximum parallel runs, you will have up to 7 clusters at a given point in time.
  • Clusters are reused between different pipelines as long as those pipelines use the same profile and share the same profile settings. If profile customization is used, clusters will still be reused, but only if the customizations are exactly the same, including all cluster settings such as cluster labeling.

  • When cluster reuse is enabled, there are two main cost considerations:

    • Fewer resources are used for cluster startup and initialization.
    • More resources are used for clusters to sit idle between pipeline runs and after the last pipeline run.

While it's hard to predict the cost effect of cluster reuse, you can employ a strategy to get maximum savings. The strategy is to identify a critical path for chained pipelines and enable cluster reuse for this critical path. This would ensure the cluster is immediately reused, no idle time is wasted, and maximum performance benefits are achieved.

Enable cluster reuse

In the Compute Config section of the deployed pipeline configuration, or when creating a new compute profile:

  • Enable Skip Cluster Delete.
  • Max Idle Time is the time up to which a cluster waits for the next pipeline to reuse it. The default Max Idle Time is 30 minutes. For Max Idle Time, consider the cost versus cluster availability for reuse. The higher the value of Max Idle Time, the more clusters sit idle, ready for a run.

Troubleshoot: Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to the latest Cloud Data Fusion version and use one of the supported Dataproc versions.

Earlier versions of Cloud Data Fusion are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.

Cloud Data Fusion version | Dataproc version
6.11.1 | 2.3, 2.2***, 2.1
6.10.1.1 | 2.2***, 2.1, 2.0*
6.10 | 2.1, 2.0*
6.9 | 2.1, 2.0, 1.5*
6.7-6.8 | 2.0, 1.5*
6.4-6.6 | 2.0*, 1.3**
6.1-6.3 | 1.3**

* Cloud Data Fusion versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
To specify the OS version used in your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion version in the preceding table.

** Cloud Data Fusion versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.

*** Certain issues are detected with this image version. This Dataproc image version is not recommended for production use.

Troubleshoot: Container exited with a non-zero exit code 3

Problem: An autoscaling policy isn't used, and the static Dataproc clusters are encountering memory pressure, causing an out of memory exception to appear in the logs: Container exited with a non-zero exit code 3.

Recommended: Increase the executor memory.

Increase the memory by adding a task.executor.system.resources.memory runtime argument to the pipeline. The following example runtime argument sets the memory to 4096 MB:

"task.executor.system.resources.memory": 4096

For more information, see Cluster sizing.


Last updated 2025-12-15 UTC.