Dataproc driver node groups

A Dataproc NodeGroup resource is a group of Dataproc cluster nodes that execute an assigned role. This page describes the driver node group, which is a group of Compute Engine VMs that are assigned the Driver role for the purpose of running job drivers on the Dataproc cluster.

Caution: Driver node groups are typically used on shared, long-running clusters. Be cautious about using this approach, however: a shared, long-running cluster typically represents a single point of failure, and a cluster that is unhealthy or in an error state can block an entire data pipeline. Instead, use ephemeral clusters, which exist for the lifetime of a single job.

When to use driver node groups

  • Use driver node groups only when you need to run many concurrent jobs on a shared cluster.
  • Consider increasing master node resources before using driver node groups, to avoid driver node group limitations.

How driver nodes help you run concurrent jobs

Dataproc starts a job driver process on a Dataproc cluster master node for each job. The driver process, in turn, runs an application driver, such as spark-submit, as its child process. However, the number of concurrent jobs running on the master is limited by the resources available on the master node, and since Dataproc master nodes can't be scaled, a job can fail or get throttled when master node resources are insufficient to run a job.

Driver node groups are special node groups managed by YARN, so job concurrency is not limited by master node resources. In clusters with a driver node group, application drivers run on driver nodes. Each driver node can run multiple application drivers if the node has sufficient resources.

Benefits

Using a Dataproc cluster with a driver node group lets you:

  • Horizontally scale job driver resources to run more concurrent jobs
  • Scale driver resources separately from worker resources
  • Obtain faster scaledown on clusters with Dataproc image version 2.0 and later. On these clusters, the app master runs within a Spark driver in a driver node group (the spark.yarn.unmanagedAM.enabled property is set to true by default).
  • Customize driver node start-up. You can add {ROLE} == 'Driver' in an initialization script to have the script perform actions for a driver node group in node selection.

Limitations

  • Node groups are not supported in Dataproc workflow templates.
  • Node group clusters cannot be stopped, restarted, or autoscaled.
  • The MapReduce app master runs on worker nodes. A scale down of worker nodes can be slow if you enable graceful decommissioning.
  • Job concurrency is affected by the dataproc:agent.process.threads.job.max cluster property. For example, with three masters and this property set to the default value of 100, maximum cluster-level job concurrency is 300.
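The concurrency arithmetic in the last limitation can be sketched as follows (a hypothetical helper for illustration, not part of any Dataproc API):

```python
# Hypothetical helper illustrating the job concurrency limit above.
# The default value of dataproc:agent.process.threads.job.max is 100.
def max_cluster_job_concurrency(master_count, threads_per_master=100):
    """Each master node runs a Dataproc agent limited to
    `threads_per_master` concurrent job driver threads, so cluster-level
    concurrency is the product of the two."""
    return master_count * threads_per_master

# Three masters at the default of 100 allow up to 300 concurrent jobs.
print(max_cluster_job_concurrency(3))  # 300
```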

Driver node group compared to Spark cluster mode

Feature | Spark cluster mode | Driver node group
Worker node scale down | Long-lived drivers run on the same worker nodes as short-lived containers, making scale down of workers using graceful decommissioning slow. | Worker nodes scale down more quickly when drivers run on node groups.
Streamed driver output | Requires searching in YARN logs to find the node where the driver was scheduled. | Driver output is streamed to Cloud Storage, and is viewable in the Google Cloud console and in the gcloud dataproc jobs wait command output after a job completes.

Driver node group IAM permissions

The following IAM permissions are associated with Dataproc node group actions.

Permission | Action
dataproc.nodeGroups.create | Create Dataproc node groups. If a user has dataproc.clusters.create in the project, this permission is granted.
dataproc.nodeGroups.get | Get the details of a Dataproc node group.
dataproc.nodeGroups.update | Resize a Dataproc node group.

Driver node group operations

You can use the gcloud CLI and Dataproc API to create, get, resize, and delete a Dataproc driver node group, and to submit a job to it.

Create a driver node group cluster

A driver node group is associated with one Dataproc cluster. You create a node group as part of creating a Dataproc cluster. You can use the gcloud CLI or Dataproc REST API to create a Dataproc cluster with a driver node group.

Note: Dataproc driver node groups are supported in clusters created with image versions 2.0.52 and later.

gcloud

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --driver-pool-size=SIZE \
    --driver-pool-id=NODE_GROUP_ID

Required flags:

  • CLUSTER_NAME: The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
  • REGION: The region where the cluster will be located.
  • SIZE: The number of driver nodes in the node group. The number of nodes needed depends on job load and driver pool machine type. The minimum number of driver group nodes equals the total memory or vCPUs required by job drivers, divided by each driver pool machine's memory or vCPUs.
  • NODE_GROUP_ID: Optional and recommended. The ID must be unique within the cluster. Use this ID to identify the driver group in future operations, such as resizing the node group. If not specified, Dataproc generates the node group ID.
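The SIZE guidance above can be sketched as a small sizing calculation (a hypothetical helper; the per-node memory and vCPU figures you plug in depend on your machine type and YARN configuration):

```python
import math

def min_driver_pool_size(total_driver_memory_mb, total_driver_vcores,
                         node_memory_mb, node_vcores):
    """Minimum driver group size: the total memory (or vCPUs) required by
    concurrent job drivers divided by each driver node's memory (or vCPUs),
    taking the larger of the two constraints."""
    by_memory = math.ceil(total_driver_memory_mb / node_memory_mb)
    by_vcores = math.ceil(total_driver_vcores / node_vcores)
    return max(by_memory, by_vcores)

# Example: 30 concurrent drivers at 2048 MB / 2 vCPUs each, on nodes with
# 15360 MB and 4 vCPUs: memory needs 4 nodes, vCPUs need 15, so size is 15.
print(min_driver_pool_size(30 * 2048, 30 * 2, 15360, 4))  # 15
```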

Recommended flag:

  • --enable-component-gateway: Add this flag to enable the Dataproc Component Gateway, which provides access to the YARN web interface. The YARN UI Application and Scheduler pages display cluster and job status, application queue memory, core capacity, and other metrics.

Additional flags: The following optional driver-pool flags can be added to the gcloud dataproc clusters create command to customize the node group.

Flag | Default value
--driver-pool-id | A string identifier, generated by the service if not set by the flag. This ID can be used to identify the node group when performing future node pool operations, such as resizing the node group.
--driver-pool-machine-type | n1-standard-4
--driver-pool-accelerator | No default. When specifying an accelerator, the GPU type is required; the number of GPUs is optional.
--num-driver-pool-local-ssds | No default
--driver-pool-local-ssd-interface | No default
--driver-pool-boot-disk-type | pd-standard
--driver-pool-boot-disk-size | 1000 GB
--driver-pool-min-cpu-platform | AUTOMATIC

REST

Complete an AuxiliaryNodeGroup as part of a Dataproc API cluster.create request.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Google Cloud project ID.
  • REGION: Required. The Dataproc cluster region.
  • CLUSTER_NAME: Required. The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
  • SIZE: Required. Number of nodes in the node group.
  • NODE_GROUP_ID: Optional and recommended. The ID must be unique within the cluster. Use this ID to identify the driver group in future operations, such as resizing the node group. If not specified, Dataproc generates the node group ID.

Additional options: See NodeGroup.

Set the EndpointConfig.enableHttpPortAccess property to true to enable the Dataproc Component Gateway, which provides access to the YARN web interface. The YARN UI Application and Scheduler pages display cluster and job status, application queue memory, core capacity, and other metrics.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters

Request JSON body:

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "softwareConfig": {
      "imageVersion": ""
    },
    "endpointConfig": {
      "enableHttpPortAccess": true
    },
    "auxiliaryNodeGroups": [
      {
        "nodeGroup": {
          "roles": ["DRIVER"],
          "nodeGroupConfig": {
            "numInstances": SIZE
          }
        },
        "nodeGroupId": "NODE_GROUP_ID"
      }
    ]
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    ...
    "auxiliaryNodeGroups": [
      {
        "nodeGroup": {
          "name": "projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID",
          "roles": [
            "DRIVER"
          ],
          "nodeGroupConfig": {
            "numInstances": SIZE,
            "instanceNames": [
              "CLUSTER_NAME-np-q1gp",
              "CLUSTER_NAME-np-xfc0"
            ],
            "imageUri": "https://www.googleapis.com/compute/v1/projects/cloud-dataproc-ci/global/images/dataproc-2-0-deb10-...-rc01",
            "machineTypeUri": "https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/REGION-a/machineTypes/n1-standard-4",
            "diskConfig": {
              "bootDiskSizeGb": 1000,
              "bootDiskType": "pd-standard"
            },
            "managedGroupConfig": {
              "instanceTemplateName": "dataproc-2a8224d2-...",
              "instanceGroupManagerName": "dataproc-2a8224d2-..."
            },
            "minCpuPlatform": "AUTOMATIC",
            "preemptibility": "NON_PREEMPTIBLE"
          }
        },
        "nodeGroupId": "NODE_GROUP_ID"
      }
    ]
  }
}

Get driver node group cluster metadata

You can use the gcloud dataproc node-groups describe command or the Dataproc API to get driver node group metadata.

gcloud

gcloud dataproc node-groups describe NODE_GROUP_ID \
    --cluster=CLUSTER_NAME \
    --region=REGION

Required flags:

  • NODE_GROUP_ID: You can run gcloud dataproc clusters describe CLUSTER_NAME to list the node group ID.
  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Google Cloud project ID.
  • REGION: Required. The cluster region.
  • CLUSTER_NAME: Required. The cluster name.
  • NODE_GROUP_ID: Required. You can run gcloud dataproc clusters describe CLUSTER_NAME to list the node group ID.

HTTP method and URL:

GET https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID",
  "roles": [
    "DRIVER"
  ],
  "nodeGroupConfig": {
    "numInstances": 5,
    "imageUri": "https://www.googleapis.com/compute/v1/projects/cloud-dataproc-ci/global/images/dataproc-2-0-deb10-...-rc01",
    "machineTypeUri": "https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/REGION-a/machineTypes/n1-standard-4",
    "diskConfig": {
      "bootDiskSizeGb": 1000,
      "bootDiskType": "pd-standard"
    },
    "managedGroupConfig": {
      "instanceTemplateName": "dataproc-driver-pool-mcia3j656h2fy",
      "instanceGroupManagerName": "dataproc-driver-pool-mcia3j656h2fy"
    },
    "minCpuPlatform": "AUTOMATIC",
    "preemptibility": "NON_PREEMPTIBLE"
  }
}

Resize a driver node group

You can use the gcloud dataproc node-groups resize command or the Dataproc API to add or remove driver nodes from a cluster driver node group.

gcloud

gcloud dataproc node-groups resize NODE_GROUP_ID \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --size=SIZE

Required flags:

  • NODE_GROUP_ID: You can run gcloud dataproc clusters describe CLUSTER_NAME to list the node group ID.
  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.
  • SIZE: Specify the new number of driver nodes in the node group.

Optional flag:

  • --graceful-decommission-timeout=TIMEOUT_DURATION: When scaling down a node group, you can add this flag to specify a graceful decommissioning TIMEOUT_DURATION to avoid the immediate termination of job drivers. Recommendation: Set a timeout duration that is at least equal to the duration of the longest job running on the node group (recovery of failed drivers is not supported).
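The timeout recommendation above can be sketched as a small helper (hypothetical, for illustration; it takes the longest currently running job plus a safety margin and formats a gcloud duration string in seconds):

```python
def graceful_decommission_timeout(running_job_seconds, margin_s=60):
    """Return a --graceful-decommission-timeout value at least as long as
    the longest job currently running on the node group, plus a margin."""
    longest = max(running_job_seconds, default=0)
    return f"{longest + margin_s}s"

# Jobs running for 120s, 300s, and 45s: allow the 300s job to finish.
print(graceful_decommission_timeout([120, 300, 45]))  # 360s
```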

Example: gcloud CLI node group scale-up command:

gcloud dataproc node-groups resize NODE_GROUP_ID \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --size=4

Example: gcloud CLI node group scale-down command:

gcloud dataproc node-groups resize NODE_GROUP_ID \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --size=1 \
    --graceful-decommission-timeout="100s"

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Google Cloud project ID.
  • REGION: Required. The cluster region.
  • NODE_GROUP_ID: Required. You can run gcloud dataproc clusters describe CLUSTER_NAME to list the node group ID.
  • SIZE: Required. The new number of nodes in the node group.
  • TIMEOUT_DURATION: Optional. When scaling down a node group, you can add a gracefulDecommissionTimeout to the request body to avoid the immediate termination of job drivers. Recommendation: Set a timeout duration that is at least equal to the duration of the longest job running on the node group (recovery of failed drivers is not supported).

    Example:

    {
      "size": SIZE,
      "gracefulDecommissionTimeout": "TIMEOUT_DURATION"
    }

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID:resize

Request JSON body:

{
  "size": SIZE
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID:resize"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters/CLUSTER_NAME/nodeGroups/NODE_GROUP_ID:resize" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.NodeGroupOperationMetadata",
    "nodeGroupId": "NODE_GROUP_ID",
    "clusterUuid": "CLUSTER_UUID",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2022-12-01T23:34:53.064308Z"
    },
    "operationType": "RESIZE",
    "description": "Scale \"up\" or \"down\" a GCE node pool to SIZE nodes."
  }
}

Delete a driver node group cluster

When you delete a Dataproc cluster, node groups associated with the cluster are deleted.

Submit a job

You can use the gcloud dataproc jobs submit command or the Dataproc API to submit a job to a cluster with a driver node group.

gcloud

gcloud dataproc jobs submit JOB_COMMAND \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --driver-required-memory-mb=DRIVER_MEMORY \
    --driver-required-vcores=DRIVER_VCORES \
    DATAPROC_FLAGS \
    -- JOB_ARGS

Required flags:

  • JOB_COMMAND: Specify the job command.
  • CLUSTER_NAME: The cluster name.
  • DRIVER_MEMORY: The amount of job driver memory in MB needed to run a job (see YARN Memory Controls).
  • DRIVER_VCORES: The number of vCPUs needed to run a job.

Additional flags:

  • DATAPROC_FLAGS: Add any additional gcloud dataproc jobs submit flags related to the job type.
  • JOB_ARGS: Add any arguments, after the "--" separator, to pass to the job.

Examples: You can run the following examples from an SSH terminal session on a Dataproc driver node group cluster.

  • Spark job to estimate the value of pi:

    gcloud dataproc jobs submit spark \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --driver-required-memory-mb=2048 \
        --driver-required-vcores=2 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

  • Spark wordcount job:

    gcloud dataproc jobs submit spark \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --driver-required-memory-mb=2048 \
        --driver-required-vcores=2 \
        --class=org.apache.spark.examples.JavaWordCount \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 'gs://apache-beam-samples/shakespeare/macbeth.txt'

  • PySpark job to estimate the value of pi:

    gcloud dataproc jobs submit pyspark \
        file:///usr/lib/spark/examples/src/main/python/pi.py \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --driver-required-memory-mb=2048 \
        --driver-required-vcores=2 \
        -- 1000

  • Hadoop TeraGen MapReduce job:

    gcloud dataproc jobs submit hadoop \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --driver-required-memory-mb=2048 \
        --driver-required-vcores=2 \
        --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        -- teragen 1000 \
        hdfs:///gen1/test

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Google Cloud project ID.
  • REGION: Required. The Dataproc cluster region.
  • CLUSTER_NAME: Required. The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
  • DRIVER_MEMORY: Required. The amount of job driver memory in MB needed to run a job (see YARN Memory Controls).
  • DRIVER_VCORES: Required. The number of vCPUs needed to run a job.
Additional fields: Add additional fields related to the job type and job arguments (the sample request includes fields needed to submit a Spark job that estimates the value of pi).

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit

Request JSON body:

{
  "job": {
    "placement": {
      "clusterName": "CLUSTER_NAME"
    },
    "driverSchedulingConfig": {
      "memoryMb": DRIVER_MEMORY,
      "vcores": DRIVER_VCORES
    },
    "sparkJob": {
      "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
      "args": [
        "10000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi"
    }
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "reference": {
    "projectId": "PROJECT_ID",
    "jobId": "job-id"
  },
  "placement": {
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "cluster-Uuid"
  },
  "sparkJob": {
    "mainClass": "org.apache.spark.examples.SparkPi",
    "args": [
      "1000"
    ],
    "jarFileUris": [
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "status": {
    "state": "PENDING",
    "stateStartTime": "start-time"
  },
  "jobUuid": "job-Uuid"
}

Python

  1. Install the client library
  2. Set up application default credentials
  3. Run the code. See Setting Up a Python Development Environment.
    • Spark job to estimate the value of pi:

      import re

      from google.cloud import dataproc_v1 as dataproc
      from google.cloud import storage


      def submit_job(project_id: str, region: str, cluster_name: str) -> None:
          """Submits a Spark job to the specified Dataproc cluster with a
          driver node group and prints the output.

          Args:
              project_id: The Google Cloud project ID.
              region: The Dataproc region where the cluster is located.
              cluster_name: The name of the Dataproc cluster.
          """
          # Create the job client.
          with dataproc.JobControllerClient(
              client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
          ) as job_client:
              driver_scheduling_config = dataproc.DriverSchedulingConfig(
                  memory_mb=2048,  # Example memory in MB
                  vcores=2,  # Example number of vcores
              )

              # Create the job config. 'main_jar_file_uri' can also be a
              # Google Cloud Storage URL.
              job = {
                  "placement": {"cluster_name": cluster_name},
                  "spark_job": {
                      "main_class": "org.apache.spark.examples.SparkPi",
                      "jar_file_uris": [
                          "file:///usr/lib/spark/examples/jars/spark-examples.jar"
                      ],
                      "args": ["1000"],
                  },
                  "driver_scheduling_config": driver_scheduling_config,
              }

              operation = job_client.submit_job_as_operation(
                  request={"project_id": project_id, "region": region, "job": job}
              )
              response = operation.result()

              # Dataproc job output gets saved to the Cloud Storage bucket
              # allocated to the job. Use a regex to obtain the bucket and
              # blob info.
              matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
              if not matches:
                  print(
                      "Error: Could not parse driver output URI: "
                      f"{response.driver_output_resource_uri}"
                  )
                  raise ValueError

              output = (
                  storage.Client()
                  .get_bucket(matches.group(1))
                  .blob(f"{matches.group(2)}.000000000")
                  .download_as_bytes()
                  .decode("utf-8")
              )
              print(f"Job finished successfully: {output}")

    • PySpark job to print 'hello world':

      import re

      from google.cloud import dataproc_v1 as dataproc
      from google.cloud import storage


      def submit_job(project_id, region, cluster_name):
          """Submits a PySpark job to a Dataproc cluster with a driver node group.

          Args:
              project_id (str): The ID of the Google Cloud project.
              region (str): The region where the Dataproc cluster is located.
              cluster_name (str): The name of the Dataproc cluster.
          """
          # Create the job client.
          job_client = dataproc.JobControllerClient(
              client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
          )

          driver_scheduling_config = dataproc.DriverSchedulingConfig(
              memory_mb=2048,  # Example memory in MB
              vcores=2,  # Example number of vcores
          )

          # Create the job config. The main Python file URI points to the
          # script in a Google Cloud Storage bucket.
          job = {
              "placement": {"cluster_name": cluster_name},
              "pyspark_job": {
                  "main_python_file_uri": "gs://dataproc-examples/pyspark/hello-world/hello-world.py"
              },
              "driver_scheduling_config": driver_scheduling_config,
          }

          operation = job_client.submit_job_as_operation(
              request={"project_id": project_id, "region": region, "job": job}
          )
          response = operation.result()

          # Dataproc job output gets saved to the Cloud Storage bucket
          # allocated to the job. Use a regex to obtain the bucket and blob info.
          matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
          if not matches:
              raise ValueError(
                  f"Unexpected driver output URI: {response.driver_output_resource_uri}"
              )

          output = (
              storage.Client()
              .get_bucket(matches.group(1))
              .blob(f"{matches.group(2)}.000000000")
              .download_as_bytes()
              .decode("utf-8")
          )
          print(f"Job finished successfully: {output}")

View job logs

To view job status and help debug job issues, you can view driver logs usingthe gcloud CLI or the Google Cloud console.

gcloud

Job driver logs are streamed to the gcloud CLI output or the Google Cloud console during job execution. Driver logs persist in the Dataproc cluster staging bucket in Cloud Storage.

Run the following gcloud CLI command to list the location of driver logs in Cloud Storage:

gcloud dataproc jobs describe JOB_ID \
    --region=REGION

The Cloud Storage location of driver logs is listed as the driverOutputResourceUri field in the command output, in the following format:

driverOutputResourceUri: gs://CLUSTER_STAGING_BUCKET/google-cloud-dataproc-metainfo/CLUSTER_UUID/jobs/JOB_ID
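As a sketch, the bucket name and object prefix can be pulled out of this URI with the same regex the Python samples on this page use:

```python
import re

def parse_driver_output_uri(uri):
    """Split a driverOutputResourceUri of the form gs://BUCKET/OBJECT_PREFIX
    into its Cloud Storage bucket name and object prefix."""
    matches = re.match(r"gs://(.*?)/(.*)", uri)
    if not matches:
        raise ValueError(f"Unexpected driver output URI: {uri}")
    return matches.group(1), matches.group(2)

# Hypothetical staging bucket and job ID, for illustration only.
bucket, prefix = parse_driver_output_uri(
    "gs://my-staging-bucket/google-cloud-dataproc-metainfo/cluster-uuid/jobs/job-1234")
print(bucket)  # my-staging-bucket
```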

Console

To view node group cluster logs:

  1. Enable Logging.

  2. You can use the following Logs Explorer query format to find logs:

    resource.type="cloud_dataproc_cluster"
    resource.labels.project_id="PROJECT_ID"
    resource.labels.cluster_name="CLUSTER_NAME"
    log_name="projects/PROJECT_ID/logs/LOG_TYPE"

    Replace the following:

    • PROJECT_ID: Google Cloud project ID.
    • CLUSTER_NAME: The cluster name.
    • LOG_TYPE:
      • YARN user logs: yarn-userlogs
      • YARN resource manager logs: hadoop-yarn-resourcemanager
      • YARN node manager logs: hadoop-yarn-nodemanager
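As a convenience, the query format above can be assembled programmatically (a hypothetical helper; the filter lines match the Logs Explorer query shown above):

```python
def logs_explorer_query(project_id, cluster_name, log_type):
    """Build the Logs Explorer filter for node group cluster logs.
    log_type is one of: yarn-userlogs, hadoop-yarn-resourcemanager,
    hadoop-yarn-nodemanager."""
    return "\n".join([
        'resource.type="cloud_dataproc_cluster"',
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'log_name="projects/{project_id}/logs/{log_type}"',
    ])

print(logs_explorer_query("my-project", "my-cluster", "yarn-userlogs"))
```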

Monitor metrics

Dataproc node group job drivers run in a dataproc-driverpool-driver-queue child queue under a dataproc-driverpool partition.

Driver node group metrics

The following table lists the node group driver metrics, which are collected by default for driver node groups.

Driver node group metric | Description
yarn:ResourceManager:DriverPoolsQueueMetrics:AvailableMB | The amount of available memory in mebibytes in dataproc-driverpool-driver-queue under the dataproc-driverpool partition.
yarn:ResourceManager:DriverPoolsQueueMetrics:PendingContainers | The number of pending (queued) containers in dataproc-driverpool-driver-queue under the dataproc-driverpool partition.

Child queue metrics

The following table lists the child queue metrics. The metrics are collected by default for driver node groups, and can be enabled for collection on any Dataproc cluster.

Child queue metric | Description
yarn:ResourceManager:ChildQueueMetrics:AvailableMB | The amount of available memory in mebibytes in this queue under the default partition.
yarn:ResourceManager:ChildQueueMetrics:PendingContainers | The number of pending (queued) containers in this queue under the default partition.
yarn:ResourceManager:ChildQueueMetrics:running_0 | The number of jobs with a runtime between 0 and 60 minutes in this queue under all partitions.
yarn:ResourceManager:ChildQueueMetrics:running_60 | The number of jobs with a runtime between 60 and 300 minutes in this queue under all partitions.
yarn:ResourceManager:ChildQueueMetrics:running_300 | The number of jobs with a runtime between 300 and 1440 minutes in this queue under all partitions.
yarn:ResourceManager:ChildQueueMetrics:running_1440 | The number of jobs with a runtime greater than 1440 minutes in this queue under all partitions.
yarn:ResourceManager:ChildQueueMetrics:AppsSubmitted | The number of applications submitted to this queue under all partitions.
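The running_N metrics above partition jobs by runtime; as a sketch, mapping a job's runtime to its bucket looks like this (hypothetical helper, for illustration):

```python
def running_bucket(runtime_minutes):
    """Map a job runtime in minutes to its ChildQueueMetrics running_N
    bucket: 0-60, 60-300, 300-1440, or over 1440 minutes."""
    if runtime_minutes < 60:
        return "running_0"
    if runtime_minutes < 300:
        return "running_60"
    if runtime_minutes < 1440:
        return "running_300"
    return "running_1440"

print(running_bucket(100))  # running_60
```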

To view YARN ChildQueueMetrics and DriverPoolsQueueMetrics in the Google Cloud console:

Debug node group job driver

This section provides driver node group conditions and errors withrecommendations to fix the condition or error.

Conditions

  • Condition: yarn:ResourceManager:DriverPoolsQueueMetrics:AvailableMB is nearing 0, which indicates that cluster driver pool queues are running out of memory.

    Recommendation: Scale up the size of the driver pool.

  • Condition: yarn:ResourceManager:DriverPoolsQueueMetrics:PendingContainers is larger than 0, which can indicate that cluster driver pool queues are running out of memory and YARN is queueing jobs.

    Recommendation: Scale up the size of the driver pool.

Errors

  • Error: Cluster CLUSTER_NAME requires driver scheduling config to run SPARK job because it contains a node pool with role DRIVER. Positive values are required for all driver scheduling config values.

    Recommendation: Set driver-required-memory-mb and driver-required-vcores to positive numbers.

  • Error: Container exited with a non-zero exit code 137.

    Recommendation: Increase driver-required-memory-mb to match job memory usage.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.