Dataproc Hadoop data storage
Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS). The following features and considerations can be important when selecting compute and data storage options for Dataproc clusters and jobs:
- HDFS with Cloud Storage: Dataproc uses the Hadoop Distributed File System (HDFS) for storage. Additionally, Dataproc automatically installs the HDFS-compatible Cloud Storage connector, which enables the use of Cloud Storage in parallel with HDFS. Data can be moved in and out of a cluster through upload and download to HDFS or Cloud Storage.
- VM disks:
  - By default, when no local SSDs are provided, HDFS data and intermediate shuffle data are stored on VM boot disks, which are Persistent Disks.
  - If you use local SSDs, HDFS data and intermediate shuffle data are stored on the SSDs.
  - Persistent disk (PD) size and type affect performance and VM size, whether using HDFS or Cloud Storage for data storage.
  - VM boot disks are deleted when the cluster is deleted.
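
The storage choices above can be sketched as command lines. The following Python helper is a minimal, illustrative sketch rather than an official tool: it only assembles argument lists for `gcloud dataproc clusters create` and `hadoop fs -cp` and does not run them. The function names, cluster name, region, bucket, paths, and default disk values are all hypothetical placeholders. It assumes the `--worker-boot-disk-type`, `--worker-boot-disk-size`, and `--num-worker-local-ssds` flags of the gcloud CLI, and that the Cloud Storage connector lets `hadoop fs` accept `gs://` paths on the cluster.

```python
import shlex


def cluster_create_command(name, region, local_ssds=0,
                           boot_disk_type="pd-standard", boot_disk_size_gb=500):
    """Build an argument list for `gcloud dataproc clusters create`.

    The default disk type and size here are illustrative assumptions,
    not recommendations.
    """
    args = [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        # Boot disk type and size affect performance even when job data
        # lives in Cloud Storage.
        f"--worker-boot-disk-type={boot_disk_type}",
        f"--worker-boot-disk-size={boot_disk_size_gb}GB",
    ]
    if local_ssds > 0:
        # With local SSDs attached, HDFS and intermediate shuffle data are
        # stored on the SSDs instead of the boot disks.
        args.append(f"--num-worker-local-ssds={local_ssds}")
    return args


def copy_command(src, dst):
    """Build a `hadoop fs -cp` argument list; either side may be hdfs:// or gs://."""
    return ["hadoop", "fs", "-cp", src, dst]


if __name__ == "__main__":
    # Print the commands for review; nothing is executed.
    print(shlex.join(cluster_create_command("example-cluster", "us-central1", local_ssds=2)))
    print(shlex.join(copy_command("hdfs:///user/example/output", "gs://example-bucket/output")))
```

Because boot disks are deleted with the cluster, a copy from HDFS to Cloud Storage like the second command is one way to keep job output after the cluster is deleted.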
Last updated 2026-02-19 UTC.