Cloud Storage connector

The Cloud Storage connector, an open source Java library, lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.

Benefits of the Cloud Storage connector

  • Direct data access: Store your data in Cloud Storage and access it directly. You don't need to transfer it into HDFS first.
  • HDFS compatibility: You can access your data in Cloud Storage using the gs:// prefix instead of hdfs:// (for example, with the Hadoop shell commands shown after this list).
  • Interoperability: Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
  • Data accessibility: When you shut down a Hadoop cluster, unlike HDFS, you continue to have access to your data in Cloud Storage.
  • High data availability: Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
  • No storage management overhead: Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
  • Quick startup: In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.
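
Once the connector is installed and authenticated, Hadoop shell commands and jobs can address Cloud Storage objects through the gs:// scheme. The following is a minimal sketch; the bucket and object names are placeholders.

# List and copy Cloud Storage objects with the Hadoop shell by using the
# gs:// scheme instead of hdfs:// (bucket and paths are placeholders).
hadoop fs -ls gs://example-bucket/input/
hadoop fs -cp gs://example-bucket/input/data.txt gs://example-bucket/backup/data.txt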

Connector setup on Dataproc clusters

The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.

Note: To set up the connector on other clusters, see Non-Dataproc clusters.

VM service account

When running the connector on Dataproc cluster nodes and other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default, which means you don't need to configure the VM service account credentials for the connector; VM service account credentials are provided by the VM metadata server.

The Dataproc VM service account must have permission to access your Cloud Storage bucket.
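
For example, you might grant the VM service account a Cloud Storage role on the bucket. The following sketch uses the Storage Object Viewer role for read access; the bucket, project, and service account names are placeholders, and your workloads may need a broader role such as Storage Object User.

# Grant the Dataproc VM service account read access to a bucket
# (placeholder names; choose the role that matches your workload).
gcloud storage buckets add-iam-policy-binding gs://example-bucket \
    --member="serviceAccount:cluster-sa@example-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"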

User-selected connector versions

The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can perform one of the following actions to use your selected connector version:

  • Create a cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version (see the example after this list).
  • Include and relocate the connector classes and connector dependencies for the version you are using into your application's jar. Relocation is necessary to avoid a conflict between your deployed connector version and the default connector version installed on the Dataproc cluster. Also see the Maven dependencies relocation example.
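
For example, the following sketch creates a cluster that pins the connector to a selected version by using the --metadata flag from the first option; the cluster name, region, and version are placeholders.

# Pin the Cloud Storage connector to a user-selected version at cluster creation
# (x.y.z is a placeholder for the connector version your application needs).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --metadata=GCS_CONNECTOR_VERSION=x.y.z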

Connector setup on non-Dataproc clusters

You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage.

  1. Download the connector.

  2. Install the connector.

    Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
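
    The details depend on your Hadoop version and distribution; the following is only a minimal sketch of a common manual setup, and the download URL, file locations, property values, and keyfile path are assumptions to verify against the GitHub instructions.

# Download the connector jar and place it on the Hadoop classpath
# (URL and destination directory are assumptions for a Hadoop 3 cluster).
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
cp gcs-connector-hadoop3-latest.jar "$HADOOP_HOME/share/hadoop/common/lib/"

# Register the gs:// file system and credentials in core-site.xml
# (property names from the connector documentation; keyfile path is a placeholder):
#   fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#   fs.AbstractFileSystem.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
#   google.cloud.auth.service.account.json.keyfile = /path/to/keyfile.json

# Verify the setup by listing a bucket (placeholder bucket name).
hadoop fs -ls gs://example-bucket/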

Connector usage

You can use the connector to access Cloud Storage data from your Hadoop and Spark jobs by using the gs:// prefix. The following sections describe the Java dependency setup and the support provided for the connector.

Java usage

The Cloud Storage connector requires Java 8.

The following is a sample Maven POM dependency management section for the Cloud Storage connector. For additional information, see Dependency Management.

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>  <!-- CONNECTOR VERSION -->
    <scope>provided</scope>
</dependency>

For a shaded version:

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>  <!-- CONNECTOR VERSION -->
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>
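
The provided scope means the connector jar is expected on the runtime classpath rather than bundled into your application; on Dataproc it is preinstalled, while elsewhere you might supply it at submit time. The following is a hedged sketch; the jar path, application jar, and main class are placeholders.

# Supply a shaded connector jar at job submission when it is not preinstalled
# (paths and the main class are placeholders).
spark-submit \
    --class com.example.WordCount \
    --jars /opt/lib/gcs-connector-shaded.jar \
    my-application.jar gs://example-bucket/input/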

Connector support

The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.

Connect to Cloud Storage using gRPC

By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.

Usage considerations

Using the Cloud Storage connector with gRPC includes the following considerations:

  • Regional bucket location: gRPC can improve read latencies only when Compute Engine VMs and Cloud Storage buckets are located in the same Compute Engine region.
  • Read-intensive jobs: gRPC can offer improved read latencies for long-running reads, and can help read-intensive workloads. It is not recommended for applications that create a gRPC channel, run a short computation, and then close the channel.
  • Unauthenticated requests: gRPC does not support unauthenticated requests.

Requirements

The following requirements apply when using gRPC with the Cloud Storage connector:

  • Your Dataproc cluster VPC network must support direct connectivity. This means that the network's routes and firewall rules must allow egress traffic to reach 34.126.0.0/18 and 2001:4860:8040::/42.

  • When creating a Dataproc cluster, you must use Cloud Storage connector version 2.2.23 or later with image version 2.1.56+, or Cloud Storage connector version 3.0.0 or later with image version 2.2.0+. The Cloud Storage connector version installed on each Dataproc image version is listed in the Dataproc image version pages.

    • If you create and use a Dataproc on GKE virtual cluster for your gRPC Cloud Storage requests, GKE version 1.28.5-gke.1199000 with gke-metadata-server 0.4.285 is recommended. This combination supports direct connectivity.
  • You or your organization administrator must grant Identity and Access Management roles that include the permissions necessary to set up and make gRPC requests to the Cloud Storage connector. These roles can include the following (see the sketch after this list):

    • User role: Dataproc Editor role granted to users to allow them to create clusters and submit jobs.
    • Service account role: Storage Object User role granted to the Dataproc VM service account to allow applications running on cluster VMs to view, read, create, and write Cloud Storage objects.
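
A minimal sketch of granting these roles with the gcloud CLI; the project, user, bucket, and service account values are placeholders, and your organization's policies may require different bindings.

# Grant a user the Dataproc Editor role at the project level (placeholders).
gcloud projects add-iam-policy-binding example-project \
    --member="user:data-engineer@example.com" \
    --role="roles/dataproc.editor"

# Grant the Dataproc VM service account the Storage Object User role on the
# bucket that cluster jobs read from and write to (placeholders).
gcloud storage buckets add-iam-policy-binding gs://example-bucket \
    --member="serviceAccount:cluster-sa@example-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectUser"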

Enable gRPC on the Cloud Storage connector

You can enable gRPC on the Cloud Storage connector at the cluster or job level. Once enabled on the cluster, Cloud Storage connector read requests use gRPC. If enabled on a job instead of at the cluster level, Cloud Storage connector read requests use gRPC for that job only.

Enable a cluster

To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.

gcloud CLI example:

gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=core:fs.gs.client.type=STORAGE_CLIENT

Replace the following:

  • CLUSTER_NAME: Specify a name for your cluster.
  • PROJECT_ID: The project ID of the project where the cluster is located. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
  • REGION: Specify a Compute Engine region where the cluster will be located.

Enable a job

To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.

Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.

  1. Create a local /tmp/line-count.py PySpark script that uses gRPC to read a Cloud Storage text file and output the number of lines in the file.

cat <<EOF >"/tmp/line-count.py"
#!/usr/bin/python
import sys
from pyspark.sql import SparkSession
path = sys.argv[1]
spark = SparkSession.builder.getOrCreate()
rdd = spark.read.text(path)
lines_counter = rdd.count()
print("There are {} lines in file: {}".format(lines_counter, path))
EOF
  2. Create a local /tmp/line-count-sample.txt text file.

cat <<EOF >"/tmp/line-count-sample.txt"
Line 1
Line 2
line 3
EOF
  3. Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.

    gcloud storage cp /tmp/line-count* gs://BUCKET
  4. Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.

gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
    -- gs://BUCKET/line-count-sample.txt

    Replace the following:

    • CLUSTER_NAME: The name of an existing cluster.
    • PROJECT_ID: Your project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
    • REGION: The Compute Engine region where the cluster islocated.
    • BUCKET: Your Cloud Storage bucket.

Generate gRPC client-side metrics

You can configure the Cloud Storage connector to generate gRPC-related metrics in Cloud Monitoring. The gRPC-related metrics can help you do the following:

  • Monitor and optimize the performance of gRPC requests to Cloud Storage
  • Troubleshoot and debug issues
  • Gain insights into application usage and behavior

For information about how to configure the Cloud Storage connector to generate gRPC-related metrics, see Use gRPC client-side metrics.
