Dataproc staging and temp buckets
When you create a cluster, HDFS is used as the default filesystem. You can override this behavior by setting the defaultFS as a Cloud Storage bucket. By default, Dataproc also creates a Cloud Storage staging and a Cloud Storage temp bucket in your project, or reuses existing Dataproc-created staging and temp buckets from previous cluster creation requests.
- Staging bucket: Used to stage cluster job dependencies, job driver output, and cluster config files. Also receives output from snapshot diagnostic data collection.
- Temp bucket: Used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. Also stores checkpoint diagnostic data collected during the lifecycle of a cluster.
If you do not specify a staging or temp bucket when you create a cluster, Dataproc sets a Cloud Storage location in US, ASIA, or EU for your cluster's staging and temp buckets according to the Compute Engine zone where your cluster is deployed, and then creates and manages these project-level, per-location buckets. Dataproc-created staging and temp buckets are shared among clusters in the same region, and are created with a Cloud Storage soft delete retention duration set to 0 seconds. If you specify your own staging and temp buckets, consider tuning the soft delete retention to reduce storage charges incurred by soft-deleted objects.
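If you bring your own buckets, one way to mirror the Dataproc-created behavior is to turn off soft delete on them. The following is a minimal sketch, assuming a user-supplied bucket named my-dataproc-staging-bucket and that your gcloud version supports the --soft-delete-duration flag on gcloud storage buckets update (a value of 0 disables soft delete):

# Disable soft delete on a user-supplied bucket (placeholder name) so that
# soft-deleted objects do not accrue storage charges.
gcloud storage buckets update gs://my-dataproc-staging-bucket \
    --soft-delete-duration=0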
The temp bucket contains ephemeral data, and has a TTL of 90 days. The staging bucket, which can contain configuration data and dependency files needed by multiple clusters, does not have a TTL. However, you can apply a lifecycle rule to your dependency files (files with a ".jar" filename extension located in the staging bucket folder) to schedule the removal of your dependency files when they are no longer needed by your clusters.
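As an illustration of such a lifecycle rule (a sketch; the bucket name and the 30-day threshold are placeholder assumptions, and you can add a matchesPrefix condition to scope the rule to a specific folder), you can define a lifecycle configuration that deletes objects whose names end in ".jar" after a set age and apply it with gcloud storage buckets update:

# lifecycle.json: delete ".jar" objects 30 days after creation.
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30, "matchesSuffix": [".jar"]}
    }
  ]
}

# Apply the lifecycle configuration to the staging bucket (placeholder name).
gcloud storage buckets update gs://my-dataproc-staging-bucket \
    --lifecycle-file=lifecycle.json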
To locate the default Dataproc staging and temp buckets using the Google Cloud console Cloud Storage Browser, filter results using the "dataproc-staging-" and "dataproc-temp-" prefixes.

Create your own staging and temp buckets
Instead of relying on the creation of a default staging and temp bucket, you can specify existing Cloud Storage buckets that Dataproc will use as your cluster's staging and temp bucket.
Note: When you use an Assured Workloads environment for regulatory compliance, the cluster, VPC network, and Cloud Storage buckets must be contained within the Assured Workloads environment.

gcloud command
Run the gcloud dataproc clusters create command with the --bucket and/or --temp-bucket flags locally in a terminal window or in Cloud Shell to specify your cluster's staging and/or temp bucket.
gcloud dataproc clusters create cluster-name \
    --region=region \
    --bucket=bucket-name \
    --temp-bucket=bucket-name \
    other args ...
REST API
Use the ClusterConfig.configBucket and ClusterConfig.tempBucket fields in a clusters.create request to specify your cluster's staging and temp buckets.
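For example, a clusters.create request body might set these fields as follows (a partial sketch; the bucket names are placeholders, and other required cluster configuration is omitted):

{
  "clusterName": "cluster-name",
  "config": {
    "configBucket": "staging-bucket-name",
    "tempBucket": "temp-bucket-name"
  }
}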
Console
In the Google Cloud console, open the Dataproc Create a cluster page. Select the Customize cluster panel, then use the File storage field to specify or select the cluster's staging bucket.
Note: Currently, specifying a temp bucket using the Google Cloud console is not supported.
Dataproc uses a defined folder structure for Cloud Storage buckets attached to clusters. Dataproc also supports attaching more than one cluster to a Cloud Storage bucket. The folder structure used for saving job driver output in Cloud Storage is:
cloud-storage-bucket-name
  - google-cloud-dataproc-metainfo
    - list of cluster IDs
      - list of job IDs
        - list of output logs for a job
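For example, to inspect what a cluster has written under this structure, you can list the metainfo folder recursively (a sketch; the bucket name is a placeholder, and the cluster ID and job ID subfolders follow the layout shown above):

gcloud storage ls --recursive \
    gs://cloud-storage-bucket-name/google-cloud-dataproc-metainfo/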
You can use the gcloud command line tool, Dataproc API, or Google Cloud console to list the name of a cluster's staging and temp buckets.
Console
- View cluster details, which include the name of the cluster's staging bucket, on the Dataproc Clusters page in the Google Cloud console.
- On the Google Cloud console Cloud Storage Browser page, filter results that contain "dataproc-temp-".
gcloud command
Run the gcloud dataproc clusters describe command locally in a terminal window or in Cloud Shell. The staging and temp buckets associated with your cluster are listed in the output.
gcloud dataproc clusters describe cluster-name \
    --region=region \
...
clusterName: cluster-name
clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec...
config:
  configBucket: dataproc-...
  ...
  tempBucket: dataproc-temp-...
REST API
Call clusters.get to list the cluster details, including the name of the cluster's staging and temp buckets.
{ "projectId": "vigilant-sunup-163401", "clusterName": "cluster-name", "config": {"configBucket": "dataproc-...",..."tempBucket": "dataproc-temp-...",}defaultFS
You can set core:fs.defaultFS to a bucket location in Cloud Storage (gs://defaultFS-bucket-name) to set Cloud Storage as the default filesystem. This also sets core:fs.gs.reported.permissions, the reported permission returned by the Cloud Storage connector for all files, to 777.
If Cloud Storage is not set as the default filesystem, HDFS will be used, and the core:fs.gs.reported.permissions property will return 700, the default value.
gcloud dataproc clusters create cluster-name \
    --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=temp-bucket-name \
    other args ...
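To confirm that the property was applied, one option is to inspect the cluster's software configuration after creation (a sketch, assuming the property is surfaced under config.softwareConfig.properties in the describe output):

gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format="yaml(config.softwareConfig.properties)"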