Cluster rotation
Organization security policies, regulatory compliance rules, and other considerations can prompt you to "rotate" your Dataproc clusters at regular intervals by deleting, then recreating clusters on a schedule. As part of cluster rotation, new clusters can be provisioned with the latest Dataproc image versions while retaining the configuration settings of the replaced clusters.
This page shows you how to set up clusters that you plan to rotate ("rotated clusters"), submit jobs to them, and then rotate the clusters as needed.
Custom image cluster rotation: You can apply previous or new customizations to a previous or new Dataproc base image when recreating the custom image cluster.
Set up rotated clusters
To set up rotated clusters, create unique, timestamp-suffixed cluster names to distinguish previous from new clusters, and then attach labels to clusters that indicate if a cluster is part of a rotated cluster pool and actively receiving new job submissions. This example uses cluster-pool and cluster-state=active labels for these purposes, but you can use your own label names.
Set environment variables:
PROJECT_ID=project-id \
  REGION=region \
  CLUSTER_POOL=cluster-pool-name \
  CLUSTER_NAME=$CLUSTER_POOL-$(date '+%Y%m%d%H%M') \
  BUCKET=bucket-name
Notes:
- cluster-pool-name: The name of the cluster pool associated with one or more clusters. This name is used in the cluster name and with the cluster-pool label attached to the cluster to identify the cluster as part of the pool.
Create the cluster. You can add arguments and use different labels.
gcloud dataproc clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --bucket=${BUCKET} \
  --labels="cluster-pool=${CLUSTER_POOL},cluster-state=active"
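The naming and labeling convention can also be produced programmatically. The following is a minimal Python sketch, assuming the same `%Y%m%d%H%M` timestamp format and the example label names used on this page. The helper names (`make_cluster_name`, `make_labels`, `create_command`) are hypothetical, and the sketch only builds the gcloud argument list; it does not run it.

```python
from datetime import datetime


def make_cluster_name(pool: str, now: datetime) -> str:
    """Return the pool name suffixed with a minute-resolution timestamp."""
    return f"{pool}-{now.strftime('%Y%m%d%H%M')}"


def make_labels(pool: str, state: str = "active") -> dict:
    """Return the example labels used on this page (label names are a choice)."""
    return {"cluster-pool": pool, "cluster-state": state}


def create_command(project: str, region: str, bucket: str,
                   pool: str, now: datetime) -> list:
    """Build (but do not run) the gcloud arguments for a new pool member."""
    labels = ",".join(f"{k}={v}" for k, v in make_labels(pool).items())
    return [
        "gcloud", "dataproc", "clusters", "create",
        make_cluster_name(pool, now),
        f"--project={project}",
        f"--region={region}",
        f"--bucket={bucket}",
        f"--labels={labels}",
    ]
```

The returned list could be passed to `subprocess.run(..., check=True)`. Taking `now` as a parameter keeps the name deterministic for a given rotation run.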
Submit jobs to clusters
The following Google Cloud CLI and Apache Airflow directed acyclic graph (DAG) examples submit an Apache Pig job to a cluster. Cluster labels are used to submit the job to an active cluster within a cluster pool.
gcloud
Submit an Apache Pig job located in Cloud Storage. Pick the cluster using labels.
gcloud dataproc jobs submit pig \
  --region=${REGION} \
  --file=gs://${BUCKET}/scripts/script.pig \
  --cluster-labels="cluster-pool=${CLUSTER_POOL},cluster-state=active"

Airflow
Submit an Apache Pig job located in Cloud Storage using Airflow. Pick the cluster using labels.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Declare variables
project_id = "my-project"  # Replace with your project ID
region = "us-central1"
dag_id = "pig_wordcount"
# Replace cluster-pool-name with the pool name used when creating the cluster
cluster_labels = {"cluster-pool": "cluster-pool-name", "cluster-state": "active"}
wordcount_script = "gs://bucket-name/scripts/wordcount.pig"

# Define DAG
dag = DAG(
    dag_id,
    schedule_interval=None,  # Named `schedule` in Airflow 2.4 and later
    start_date=datetime(2023, 8, 16),
    catchup=False,
)

# Job definition: placement by label, not by cluster name
PIG_JOB = {
    "reference": {"project_id": project_id},
    "placement": {"cluster_labels": cluster_labels},
    "pig_job": {"query_file_uri": wordcount_script},
}

wordcount_task = DataprocSubmitJobOperator(
    task_id="wordcount",
    region=region,
    project_id=project_id,
    job=PIG_JOB,
    dag=dag,
)
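The pool name in the placement labels has to match the pool name used when the cluster was created. One way to keep the two in sync is to build the job definition from a single value, for example the `CLUSTER_POOL` environment variable. The following is a minimal sketch under that assumption; `build_pig_job` is a hypothetical helper, and whether the variable is visible to your Airflow workers depends on your deployment.

```python
import os
from typing import Optional


def build_pig_job(project_id: str, script_uri: str,
                  pool: Optional[str] = None) -> dict:
    """Return a job definition that targets active clusters in a pool."""
    # Assumption: fall back to the CLUSTER_POOL environment variable
    # when no pool name is passed explicitly.
    pool = pool or os.environ["CLUSTER_POOL"]
    return {
        "reference": {"project_id": project_id},
        "placement": {
            "cluster_labels": {"cluster-pool": pool, "cluster-state": "active"}
        },
        "pig_job": {"query_file_uri": script_uri},
    }
```

The returned dictionary has the same shape as `PIG_JOB` and could be passed as the `job` argument of `DataprocSubmitJobOperator`.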
Rotate clusters
Update the cluster labels attached to the clusters you are rotating out. This example uses the cluster-state=pendingfordeletion label to signify that the cluster is not receiving new job submissions and is being rotated out, but you can use your own label for this purpose.

gcloud dataproc clusters update ${CLUSTER_NAME} \
  --region=${REGION} \
  --update-labels="cluster-state=pendingfordeletion"

After the cluster label is updated, the cluster does not receive new jobs, since jobs are submitted only to clusters within a cluster pool that have the active label (see Submit jobs to clusters).

Delete clusters you are rotating out after they finish running jobs.
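The effect of relabeling on job placement can be modeled locally. In the following Python sketch, a cluster is eligible only if it carries every label named in the job's placement. This is an illustrative simulation under that assumption, not Dataproc's actual scheduling logic, and the cluster records are made-up sample data.

```python
def matches(cluster_labels: dict, selector: dict) -> bool:
    """True if the cluster carries every key/value pair in the selector."""
    return all(cluster_labels.get(k) == v for k, v in selector.items())


def eligible_clusters(clusters: list, selector: dict) -> list:
    """Return the names of clusters whose labels satisfy the selector."""
    return [c["name"] for c in clusters if matches(c["labels"], selector)]


# Hypothetical clusters: one rotated out, one active, one in another pool.
clusters = [
    {"name": "etl-202308010000",
     "labels": {"cluster-pool": "etl", "cluster-state": "pendingfordeletion"}},
    {"name": "etl-202308160000",
     "labels": {"cluster-pool": "etl", "cluster-state": "active"}},
    {"name": "adhoc-202308160000",
     "labels": {"cluster-pool": "adhoc", "cluster-state": "active"}},
]
selector = {"cluster-pool": "etl", "cluster-state": "active"}
print(eligible_clusters(clusters, selector))  # prints ['etl-202308160000']
```

Only the cluster that is both in the pool and still labeled active is selected, which is why changing a single label is enough to stop new submissions to a cluster.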
Note: You can automate this step with a monitoring script that fetches clusters with the cluster-state=pendingfordeletion label (or other label you added with the previous command), checks that no jobs are running on the cluster, and then deletes the cluster.
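A monitoring script of this kind might look like the following Python sketch. It assumes the gcloud CLI is installed and authenticated, that `--format=json` output exposes the cluster name as `clusterName`, and that the label filter expression shown is accepted for your label names; verify each of these against your environment before relying on it. The function names and the injectable `run` parameter are hypothetical and exist mainly so the logic can be exercised without calling gcloud.

```python
import json
import subprocess


def gcloud_json(args: list) -> list:
    """Run a gcloud command and parse its JSON output (empty output -> [])."""
    out = subprocess.run(
        ["gcloud", *args, "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out.strip() or "[]")


def delete_idle_rotated_clusters(region: str, run=gcloud_json,
                                 label: str = "cluster-state=pendingfordeletion"):
    """Delete labeled clusters that have no active jobs; return their names."""
    deleted = []
    clusters = run(["dataproc", "clusters", "list",
                    f"--region={region}", f"--filter=labels.{label}"])
    for cluster in clusters:
        name = cluster["clusterName"]
        active_jobs = run(["dataproc", "jobs", "list", f"--region={region}",
                           f"--cluster={name}", "--state-filter=active"])
        if active_jobs:
            continue  # Still running jobs; check again on the next pass.
        run(["dataproc", "clusters", "delete", name,
             f"--region={region}", "--quiet"])
        deleted.append(name)
    return deleted
```

Calling `delete_idle_rotated_clusters("us-central1")` on a schedule (for example from cron) would perform one pass. Clusters that still have active jobs are skipped and picked up on a later pass.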
Last updated 2025-12-15 UTC.