Use Apache Spark with HBase on Dataproc

Deprecated: Starting with Dataproc version 2.1, you can no longer use the optional HBase component. Dataproc version 1.5 and Dataproc version 2.0 offer a Beta version of HBase with no support. However, due to the ephemeral nature of Dataproc clusters, using HBase is not recommended.

Objectives

This tutorial shows you how to:

  1. Create a Dataproc cluster, installing Apache HBase and Apache ZooKeeper on the cluster
  2. Create an HBase table using the HBase shell running on the master node of the Dataproc cluster
  3. Use Cloud Shell to submit a Java or PySpark Spark job to the Dataproc service that writes data to, then reads data from, the HBase table

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

If you haven't already done so, create a Google Cloud Platform project.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc and Compute Engine APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs


Create a Dataproc cluster

  1. Run the following command in a Cloud Shell session terminal to:

    • Install the HBase and ZooKeeper components
    • Provision three worker nodes (three to five workers are recommended to run the code in this tutorial)
    • Enable the Component Gateway
    • Use image version 2.0
    • Use the --properties flag to add the HBase config and HBase library to the Spark driver and executor classpaths.

    gcloud dataproc clusters create cluster-name \
        --region=region \
        --optional-components=HBASE,ZOOKEEPER \
        --num-workers=3 \
        --enable-component-gateway \
        --image-version=2.0 \
        --properties='spark:spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark:spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'
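
    For example, a filled-in command might look like the following. The cluster name and region here are hypothetical values; substitute your own:

    gcloud dataproc clusters create my-hbase-cluster \
        --region=us-central1 \
        --optional-components=HBASE,ZOOKEEPER \
        --num-workers=3 \
        --enable-component-gateway \
        --image-version=2.0 \
        --properties='spark:spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark:spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'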

Verify connector installation

  1. From the Google Cloud console or a Cloud Shell session terminal, SSH into the Dataproc cluster master node.

  2. Verify the installation of the Apache HBase Spark connector on the master node:

    ls -l /usr/lib/spark/jars | grep hbase-spark
    Sample output:
    -rw-r--r-- 1 root root size date time hbase-spark-connector.version.jar

  3. Keep the SSH session terminal open to:

    1. Create an HBase table
    2. (Java users): run commands on the master node of the cluster to determine the versions of components installed on the cluster
    3. Scan your HBase table after you run the code

Create an HBase table

Run the commands listed in this section in the master node SSH session terminal that you opened in the previous step.

  1. Open the HBase shell:

    hbase shell

  2. Create an HBase table named 'my_table' with a 'cf' column family:

    create 'my_table','cf'

    1. To confirm table creation, in the Google Cloud console, click HBase in the Component Gateway links to open the Apache HBase UI. my_table is listed in the Tables section on the Home page.
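    2. Alternatively, you can confirm the table from the HBase shell itself. As an optional check, the standard HBase shell list and describe commands show the new table and its column family:

      list
      describe 'my_table'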

View the Spark code

Java

package hbase;

import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SparkHBaseMain {
  public static class SampleData implements Serializable {
    private String key;
    private String name;

    public SampleData(String key, String name) {
      this.key = key;
      this.name = name;
    }

    public SampleData() {
    }

    public String getName() {
      return name;
    }

    public void setName(String name) {
      this.name = name;
    }

    public String getKey() {
      return key;
    }

    public void setKey(String key) {
      this.key = key;
    }
  }

  public static void main(String[] args) {
    // Init SparkSession
    SparkSession spark = SparkSession
        .builder()
        .master("yarn")
        .appName("spark-hbase-tutorial")
        .getOrCreate();

    // Data Schema
    String catalog = "{"
        + "\"table\":{\"namespace\":\"default\", \"name\":\"my_table\"},"
        + "\"rowkey\":\"key\","
        + "\"columns\":{"
        + "\"key\":{\"cf\":\"rowkey\", \"col\":\"key\", \"type\":\"string\"},"
        + "\"name\":{\"cf\":\"cf\", \"col\":\"name\", \"type\":\"string\"}"
        + "}"
        + "}";

    Map<String, String> optionsMap = new HashMap<String, String>();
    optionsMap.put(HBaseTableCatalog.tableCatalog(), catalog);

    Dataset<Row> ds = spark.createDataFrame(
        Arrays.asList(new SampleData("key1", "foo"), new SampleData("key2", "bar")),
        SampleData.class);

    // Write to HBase
    ds.write()
        .format("org.apache.hadoop.hbase.spark")
        .options(optionsMap)
        .option("hbase.spark.use.hbasecontext", "false")
        .mode("overwrite")
        .save();

    // Read from HBase
    Dataset dataset = spark.read()
        .format("org.apache.hadoop.hbase.spark")
        .options(optionsMap)
        .option("hbase.spark.use.hbasecontext", "false")
        .load();
    dataset.show();
  }
}

Python

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession \
  .builder \
  .master('yarn') \
  .appName('spark-hbase-tutorial') \
  .getOrCreate()

data_source_format = ''

# Create some test data
df = spark.createDataFrame(
    [
        ("key1", "foo"),
        ("key2", "bar"),
    ],
    ["key", "name"])

# Define the schema for catalog
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"my_table"},
    "rowkey":"key",
    "columns":{
        "key":{"cf":"rowkey", "col":"key", "type":"string"},
        "name":{"cf":"cf", "col":"name", "type":"string"}
    }
}""".split())

# Write to HBase
df.write.format('org.apache.hadoop.hbase.spark') \
  .options(catalog=catalog) \
  .option("hbase.spark.use.hbasecontext", "false") \
  .mode("overwrite") \
  .save()

# Read from HBase
result = spark.read.format('org.apache.hadoop.hbase.spark') \
  .options(catalog=catalog) \
  .option("hbase.spark.use.hbasecontext", "false") \
  .load()
result.show()
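
The read returns an ordinary Spark DataFrame, so standard DataFrame operations apply to the result. For example, the following line (a minimal illustration that is not part of the repository script) filters the rows read back from HBase:

# Optional: filter the DataFrame returned by the HBase read (standard Spark DataFrame API).
result.filter(result["key"] == "key1").show()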

Run the code

  1. Open a Cloud Shell session terminal.

    Note: Run the commands listed in this section in a Cloud Shell session terminal. Cloud Shell has the tools required by this tutorial pre-installed, including the gcloud CLI, git, Apache Maven, Java, and Python, plus other tools.
  2. Clone the GitHub GoogleCloudDataproc/cloud-dataproc repository in your Cloud Shell session terminal:

    git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git

  3. Change to the cloud-dataproc/spark-hbase directory:

    cd cloud-dataproc/spark-hbase
    Sample output:
    user-name@cloudshell:~/cloud-dataproc/spark-hbase (project-id)$

  4. Submit the Dataproc job.

Java

  1. Set component versions in the pom.xml file.
    1. The Dataproc 2.0.x release versions page lists the Scala, Spark, and HBase component versions installed with the most recent and last four image 2.0 subminor versions.
      1. To find the subminor version of your 2.0 image version cluster, click the cluster name on the Clusters page in the Google Cloud console to open the Cluster details page, where the cluster Image version is listed.
    2. Alternatively, you can run the following commands in an SSH session terminal from the master node of your cluster to determine component versions:
      1. Check the Scala version:
        scala -version
      2. Check Spark version (control-D to exit):
        spark-shell
      3. Check HBase version:
        hbase version
      4. Identify the Spark, Scala, and HBase version dependencies in the Maven pom.xml:
        <properties>
          <scala.version>scala full version (for example, 2.12.14)</scala.version>
          <scala.main.version>scala main version (for example, 2.12)</scala.main.version>
          <spark.version>spark version (for example, 3.1.2)</spark.version>
          <hbase.client.version>hbase version (for example, 2.2.7)</hbase.client.version>
          <hbase-spark.version>1.0.0 (the current Apache HBase Spark Connector version)</hbase-spark.version>
        </properties>
        Note: The hbase-spark.version is the current Spark HBase connector version; leave this version number unchanged.
    3. Edit the pom.xml file in the Cloud Shell editor to insert the correct Scala, Spark, and HBase version numbers. Click Open Terminal when you finish editing to return to the Cloud Shell terminal command line.
      cloudshell edit .
    4. Switch to Java 8 in Cloud Shell. This JDK version is needed to build the code (you can ignore any plugin warning messages):
      sudo update-java-alternatives -s java-1.8.0-openjdk-amd64 && export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    5. Verify Java 8 installation:
      java -version
      Sample output:
      openjdk version "1.8..."
  2. Build the jar file:
    mvn clean package
    The .jar file is placed in the /target subdirectory (for example, target/spark-hbase-1.0-SNAPSHOT.jar).
  3. Submit the job.

    gcloud dataproc jobs submit spark \
        --class=hbase.SparkHBaseMain \
        --jars=target/filename.jar \
        --region=cluster-region \
        --cluster=cluster-name
    • --jars: Insert the name of your .jar file after "target/" and before ".jar".
    • If you did not set the Spark driver and executor HBase classpaths when you created your cluster, you must set them with each job submission by including the following --properties flag in your job submit command (a complete example command is shown after this list):
      --properties='spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'

  4. View the HBase table output in the Cloud Shell session terminal:

    Waiting for job output...
    ...
    +----+----+
    | key|name|
    +----+----+
    |key1| foo|
    |key2| bar|
    +----+----+
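
For reference, a complete submit command with the --properties flag included might look like the following. The cluster name and region are hypothetical placeholders, and the jar name assumes the spark-hbase-1.0-SNAPSHOT.jar example from the build step; substitute your own values, and omit the --properties flag if you already set the classpaths when you created the cluster:

    gcloud dataproc jobs submit spark \
        --class=hbase.SparkHBaseMain \
        --jars=target/spark-hbase-1.0-SNAPSHOT.jar \
        --region=us-central1 \
        --cluster=my-hbase-cluster \
        --properties='spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'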

Python

  1. Submit the job.

    gcloud dataproc jobs submit pyspark scripts/pyspark-hbase.py \
        --region=cluster-region \
        --cluster=cluster-name
    • If you did not set the Spark driver and executor HBase classpaths when you created your cluster, you must set them with each job submission by including the following --properties flag in your job submit command (a complete example command is shown after this list):
      --properties='spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'

  2. View the HBase table output in the Cloud Shell session terminal:

    Waiting for job output...
    ...
    +----+----+
    | key|name|
    +----+----+
    |key1| foo|
    |key2| bar|
    +----+----+
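
For reference, a complete PySpark submit command with the --properties flag included might look like the following. The cluster name and region are hypothetical placeholders; substitute your own values, and omit the --properties flag if you already set the classpaths when you created the cluster:

    gcloud dataproc jobs submit pyspark scripts/pyspark-hbase.py \
        --region=us-central1 \
        --cluster=my-hbase-cluster \
        --properties='spark.driver.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*,spark.executor.extraClassPath=/etc/hbase/conf:/usr/lib/hbase/*'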

Scan the HBase table

You can scan the content of your HBase table by running the following commands in the master node SSH session terminal that you opened in Verify connector installation:

  1. Open the HBase shell:
    hbase shell
  2. Scan 'my_table':
    scan 'my_table'
    Sample output:
    ROW               COLUMN+CELL
     key1             column=cf:name, timestamp=1647364013561, value=foo
     key2             column=cf:name, timestamp=1647364012817, value=bar
    2 row(s)
    Took 0.5009 seconds
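  3. Optional: to fetch a single row instead of scanning the whole table, you can use the standard HBase shell get command with a row key from the tutorial data:
    get 'my_table', 'key1'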

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the cluster

  • To delete your cluster:
    gcloud dataproc clusters delete cluster-name \
        --region=${REGION}
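    For example, with a hypothetical cluster name and region (substitute your own):
    gcloud dataproc clusters delete my-hbase-cluster \
        --region=us-central1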
