Cloud Storage connector
The Cloud Storage connector open source Java library lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.
Benefits of the Cloud Storage connector
- Direct data access: Store your data in Cloud Storage and access it directly. You don't need to transfer it into HDFS first.
- HDFS compatibility: You can access your data in Cloud Storage using the gs:// prefix instead of hdfs:// (see the example after this list).
- Interoperability: Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility: When you shut down a Hadoop cluster, unlike HDFS, you continue to have access to your data in Cloud Storage.
- High data availability: Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead: Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
- Quick startup: In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.
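To illustrate the HDFS compatibility point, the following minimal sketch contrasts the two path schemes; the bucket, directory, and file names are placeholders:

```bash
# List a file in HDFS.
hadoop fs -ls hdfs://namenode/dir/file

# List the equivalent object in Cloud Storage through the connector;
# only the path scheme changes.
hadoop fs -ls gs://bucket/dir/file
```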
Connector setup on Dataproc clusters
The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.
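As a quick check, you can list the installed connector jar on a cluster node. This is a minimal sketch; the exact jar filename varies with the Dataproc image version:

```bash
# List the Cloud Storage connector jar installed on a Dataproc
# cluster node. The version in the filename depends on the image version.
ls -l /usr/local/share/google/dataproc/lib/gcs-connector*.jar
```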
VM service account
When running the connector on Dataproc cluster nodes and other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default, which means you don't need to configure the VM service account credentials for the connector; VM service account credentials are provided by the VM metadata server.

The Dataproc VM service account must have permission to access your Cloud Storage bucket.
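One way to grant the VM service account access to a bucket is an IAM policy binding on the bucket. The following is a hedged sketch; SERVICE_ACCOUNT_EMAIL and BUCKET are placeholders, and your workload may need a broader role than the read-only one shown here:

```bash
# Grant the Dataproc VM service account read access to a bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectViewer"
```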
User-selected connector versions
The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can perform one of the following actions to use your selected connector version:
- Create a cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version (see the sketch after this list).
- Include and relocate the connector classes and connector dependencies for the version you are using into your application's jar. Relocation is necessary to avoid a conflict between your deployed connector version and the default connector version installed on the Dataproc cluster. Also see the Maven dependencies relocation example.
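A minimal gcloud sketch of the first option; the cluster name, region, and the connector version 2.2.21 are placeholders:

```bash
# Create a cluster that pins the Cloud Storage connector to a
# specific version instead of the image default.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --metadata=GCS_CONNECTOR_VERSION=2.2.21
```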
Connector setup on non-Dataproc clusters
You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage.
Download the connector:

- To use a latest version, download it from the Cloud Storage bucket (using a latest version is not recommended for production applications).
- To use a specific version from your Cloud Storage bucket, substitute the Hadoop and Cloud Storage connector versions in the gcs-connector-HADOOP_VERSION-CONNECTOR_VERSION.jar name pattern, for example, gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar (see the sketch after this list).
- To use a specific version from the Apache Maven repository, download a shaded jar that has the -shaded suffix in its name.
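For example, a sketch of copying a specific connector version from the bucket named above to a Hadoop library directory; the destination path is an assumption that depends on your Hadoop installation:

```bash
# Download a specific Cloud Storage connector version. The destination
# directory /usr/lib/hadoop/lib/ is a placeholder for your cluster's
# Hadoop classpath location.
gcloud storage cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar \
    /usr/lib/hadoop/lib/
```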
Install the connector.
Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
Connector usage
You can use the connector to access Cloud Storage data in the following ways:

- In a Spark, PySpark, or Hadoop application with the gs:// prefix
- In a hadoop shell with hadoop fs -ls gs://bucket/dir/file
- In the Cloud Storage browser in the Google Cloud console
- Using Google Cloud SDK commands, such as the commands shown after this list
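The following commands sketch the shell-based options; the bucket, directory, and file names are placeholders:

```bash
# List Cloud Storage data from a hadoop shell.
hadoop fs -ls gs://bucket/dir/file

# Copy a local file to Cloud Storage with the Google Cloud CLI.
gcloud storage cp /tmp/file.txt gs://bucket/dir/
```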
Java usage
The Cloud Storage connector requires Java 8.
The following is a sample Maven POM dependency management section for the Cloud Storage connector. For additional information, see Dependency Management.
```xml
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <!-- Replace hadoopX-X.X.X with your connector version. -->
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
</dependency>
```
For a shaded version:
```xml
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <!-- Replace hadoopX-X.X.X with your connector version. -->
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>
```
Connector support
The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.
Connect to Cloud Storage using gRPC
By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.
Usage considerations
Using the Cloud Storage connector with gRPC involves the following considerations:

- Regional bucket location: gRPC can improve read latencies only when Compute Engine VMs and Cloud Storage buckets are located in the same Compute Engine region.
- Read-intensive jobs: gRPC can offer improved read latencies for long-running reads, and can help read-intensive workloads. It is not recommended for applications that create a gRPC channel, run a short computation, and then close the channel.
- Unauthenticated requests: gRPC does not support unauthenticated requests.
Requirements
The following requirements apply when using gRPC with the Cloud Storage connector:

- Your Dataproc cluster VPC network must support direct connectivity. This means that the network's routes and firewall rules must allow egress traffic to reach 34.126.0.0/18 and 2001:4860:8040::/42.
- If your Dataproc cluster uses IPv6 networking, you must set up an IPv6 subnet for VM instances. For more information, see Configuring IPv6 for instances and instance templates.
- When creating a Dataproc cluster, you must use Cloud Storage connector version 2.2.23 or later with image version 2.1.56+, or Cloud Storage connector version 3.0.0 or later with image version 2.2.0+. The Cloud Storage connector version installed on each Dataproc image version is listed in the Dataproc image version pages.
- If you create and use a Dataproc on GKE virtual cluster for your gRPC Cloud Storage requests, GKE version 1.28.5-gke.1199000 with gke-metadata-server 0.4.285 is recommended. This combination supports direct connectivity.
- You or your organization administrator must grant Identity and Access Management roles that include the permissions necessary to set up and make gRPC requests to the Cloud Storage connector. These roles can include the following (see the sketch after this list):
  - User role: the Dataproc Editor role, granted to users to allow them to create clusters and submit jobs.
  - Service account role: the Storage Object User role, granted to the Dataproc VM service account to allow applications running on cluster VMs to view, read, create, and write Cloud Storage objects.
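A hedged sketch of granting these roles with the gcloud CLI; PROJECT_ID, USER_EMAIL, and SERVICE_ACCOUNT_EMAIL are placeholders:

```bash
# Grant a user the Dataproc Editor role so they can create clusters
# and submit jobs.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/dataproc.editor"

# Grant the Dataproc VM service account the Storage Object User role
# so cluster VMs can read and write Cloud Storage objects.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectUser"
```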
Enable gRPC on the Cloud Storage connector
You can enable gRPC on the Cloud Storage connector at the cluster or job level. Once enabled on the cluster, Cloud Storage connector read requests use gRPC. If enabled on a job instead of at the cluster level, Cloud Storage connector read requests use gRPC for the job only.
Enable a cluster
To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.
gcloud CLI example:
```bash
gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=core:fs.gs.client.type=STORAGE_CLIENT
```
Replace the following:

- CLUSTER_NAME: Specify a name for your cluster.
- PROJECT_ID: The project ID of the project where the cluster is located. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: Specify a Compute Engine region where the cluster will be located.
Enable a job
To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.
Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.
1. Create a local /tmp/line-count.py PySpark script that uses gRPC to read a Cloud Storage text file and output the number of lines in the file.

   ```bash
   cat <<EOF >"/tmp/line-count.py"
   #!/usr/bin/python
   import sys
   from pyspark.sql import SparkSession

   path = sys.argv[1]
   spark = SparkSession.builder.getOrCreate()
   rdd = spark.read.text(path)
   lines_counter = rdd.count()
   print("There are {} lines in file: {}".format(lines_counter, path))
   EOF
   ```

2. Create a local /tmp/line-count-sample.txt text file.

   ```bash
   cat <<EOF >"/tmp/line-count-sample.txt"
   Line 1
   Line 2
   line 3
   EOF
   ```
3. Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.

   ```bash
   gcloud storage cp /tmp/line-count* gs://BUCKET
   ```
4. Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.

   ```bash
   gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
       --cluster=CLUSTER_NAME \
       --project=PROJECT_ID \
       --region=REGION \
       --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
       -- gs://BUCKET/line-count-sample.txt
   ```
Replace the following:

- CLUSTER_NAME: The name of an existing cluster.
- PROJECT_ID: Your project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: The Compute Engine region where the cluster is located.
- BUCKET: Your Cloud Storage bucket.
Generate gRPC client-side metrics
You can configure the Cloud Storage connector to generate gRPC-related metrics in Cloud Monitoring. The gRPC-related metrics can help you to do the following:
- Monitor and optimize the performance of gRPC requests to Cloud Storage
- Troubleshoot and debug issues
- Gain insights into application usage and behavior
For information about how to configure the Cloud Storage connector to generate gRPC-related metrics, see Use gRPC client-side metrics.
Resources
- See the GitHub Cloud Storage connector configuration properties.
- See Connect to Cloud Storage using gRPC to use the Cloud Storage connector with client libraries, VPC Service Controls, and other scenarios.
- Learn more about Cloud Storage.
- See Use the Cloud Storage connector with Apache Spark.
- Understand the Apache Hadoop file system.
- View the Javadoc reference.