Cluster caching Stay organized with collections Save and categorize content based on your preferences.
When you enable Dataproc cluster caching, the cluster cachesCloud Storage data frequently accessed by your Spark jobs.
Benefits
- Improved performance: Caching can improve job performance by reducing the amountof time spent retrieving data from storage.
- Reduced storage costs: Since hot data is cached on local disk,fewer API calls are made to storage to retrieve data.
- Spark job applicability: When cluster caching is enabled on a cluster,it applies to all Spark jobs run on the cluster, whether submitted to theDataproc service or run independently on the cluster.
Limitations and requirements
- Caching applies to Dataproc Spark jobs only.
- Only Cloud Storage data is cached.
- Caching only applies to clusters that meet the following requirements:
- The cluster has one master and
nworkers(High Availability (HA) andsingle node clusters are not supported). - This feature is available in Dataproc on Compute Engineimage versions
2.0.72+,2.1.20+, and2.2.0+. - Each cluster node must havelocal SSDsattached with theNVME (Non-Volatile Memory Express)interface (Persistent Disks (PDs) are not supported). Data is cached on NVMElocal SSDs only.
- The cluster uses thedefault VM service accountfor authentication.Custom VM service accountsare not supported.
- The cluster has one master and
Enable cluster caching
You can enable cluster caching when you create a Dataproc clusterusing the Google Cloud console, Google Cloud CLI, or the Dataproc API.
Google Cloud console
- Open the DataprocCreate a cluster on Compute Engine page in the Google Cloud console.
- TheSet up cluster panel is selected. In theSpark performance enhancements section, selectEnable Google Cloud Storage caching.
- After confirming and specifying cluster details in the cluster create panels, clickCreate.
gcloud CLI
Run thegcloud dataproc clusters create command locally in a terminal window or inCloud Shell using thedataproc:dataproc.cluster.caching.enabled=truecluster property.
Example:
gcloud dataproc clusters createCLUSTER_NAME \ --region=REGION \ --properties dataproc:dataproc.cluster.caching.enabled=true \ --num-master-local-ssds=2 \ --master-local-ssd-interface=NVME \ --num-worker-local-ssds=2 \ --worker-local-ssd-interface=NVME \ other args ...
REST API
SetSoftwareConfig.properties to include the"dataproc:dataproc.cluster.caching.enabled": "true"cluster property as part of aclusters.create request.
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.