Create a Hadoop cluster

You can use Dataproc to create one or more Compute Engine instances that can connect to a Bigtable instance and run Hadoop jobs. This page explains how to use Dataproc to automate the following tasks:

  • Installing Hadoop and the HBase client for Java
  • Configuring Hadoop and Bigtable
  • Setting the correct authorization scopes for Bigtable

After you create your Dataproc cluster, you can use the cluster to run Hadoop jobs that read and write data to and from Bigtable.
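For example, a job running on the cluster can use the HBase client for Java to read and write Bigtable data through the standard HBase API. The following sketch is illustrative only and is not part of the sample code used later on this page; the project ID, instance ID, table name, column family, and values are hypothetical placeholders.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical placeholders; replace with your own project, instance, and table.
    String projectId = "my-project";
    String instanceId = "my-bigtable-instance";

    // BigtableConfiguration.connect() returns a standard HBase Connection that is
    // backed by Bigtable, so the rest of the code uses the ordinary HBase API.
    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
        Table table = connection.getTable(TableName.valueOf("my-table"))) {
      Put put = new Put(Bytes.toBytes("row-key-1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
      table.put(put);
    }
  }
}

Because BigtableConfiguration.connect() returns an ordinary HBase Connection, Hadoop code written against the HBase API can typically target Bigtable with only configuration changes.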

This page assumes that you are already familiar with Hadoop. For additional information about Dataproc, see the Dataproc documentation.

Before you begin

Before you begin, you'll need to complete the following tasks:

  • Create a Bigtable instance. Be sure to note the project ID and Bigtable instance ID.
  • Enable the Cloud Bigtable, Cloud Bigtable Admin, Dataproc, and Cloud Storage JSON APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

  • Verify that your user account is in a role that includes the permission storage.objects.get.

    Open the IAM page in the Google Cloud console.

  • Install the Google Cloud CLI. See the gcloud CLI setup instructions for details.
  • Install Apache Maven, which is used to run a sample Hadoop job.

    On Debian GNU/Linux or Ubuntu, run the following command:

    sudo apt-get install maven

    On Red Hat Enterprise Linux or CentOS, run the following command:

    sudo yum install maven

    On macOS, install Homebrew, then run the following command:

    brew install maven
  • Clone the GitHub repository GoogleCloudPlatform/cloud-bigtable-examples, which contains an example of a Hadoop job that uses Bigtable:
    git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git

Create a Cloud Storage bucket

Dataproc uses a Cloud Storage bucket to store temporary files. To prevent file-naming conflicts, create a new bucket for Dataproc.

Cloud Storage bucket names must be globally unique across all buckets.Choose a bucket name that is likely to be available, such as a name thatincorporates your Google Cloud project's name.

After you choose a name, use the following command to create a new bucket,replacing values in brackets with the appropriate values:

gcloud storage buckets create gs://[BUCKET_NAME] --project=[PROJECT_ID]

Create the Dataproc cluster

Run the following command to create a Dataproc cluster with four worker nodes, replacing values in brackets with the appropriate values:

gcloud dataproc clusters create [DATAPROC_CLUSTER_NAME] --bucket [BUCKET_NAME] \
    --region [REGION] --num-workers 4 --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4

See the gcloud dataproc clusters create documentation for additional settings that you can configure. If you get an error message that includes the text Insufficient 'CPUS' quota, try setting the --num-workers flag to a lower value.

Test the Dataproc cluster

After you set up your Dataproc cluster, you can test the cluster by running a sample Hadoop job that counts the number of times a word appears in a text file. The sample job uses Bigtable to store the results of the operation. You can use this sample job as a reference when you set up your own Hadoop jobs.
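Before you run the job, it can help to see the general shape of a word-count job that writes to Bigtable. The following sketch is a simplified illustration of the pattern, not the actual code in the cloud-bigtable-examples repository: a standard Hadoop mapper emits (word, 1) pairs, and an HBase TableReducer sums the counts and writes one Bigtable row per word. The column family and qualifier names are hypothetical.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountSketch {

  // Standard Hadoop mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        context.write(new Text(tokens.nextToken()), ONE);
      }
    }
  }

  // HBase TableReducer: sum the counts and write each word as a Bigtable row.
  public static class CountReducer
      extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    // Hypothetical column family and qualifier names.
    private static final byte[] FAMILY = Bytes.toBytes("cf");
    private static final byte[] QUALIFIER = Bytes.toBytes("count");

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      Put put = new Put(Bytes.toBytes(word.toString()));
      put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(String.valueOf(sum)));
      context.write(null, put);
    }
  }
}

The reducer emits Put mutations, which the HBase MapReduce output format sends to the configured output table in Bigtable.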

Run the sample Hadoop job

  1. In the directory where you cloned the GitHub repository, change to the directory java/dataproc-wordcount.
  2. Run the following command to build the project, replacing values in brackets with the appropriate values:

    mvn clean package -Dbigtable.projectID=[PROJECT_ID] \
        -Dbigtable.instanceID=[BIGTABLE_INSTANCE_ID]
  3. Run the following command to start the Hadoop job, replacing values in brackets with the appropriate values:

    ./cluster.sh start [DATAPROC_CLUSTER_NAME]

When the job is complete, it displays the name of the output table, which is the word WordCount followed by a hyphen and a unique number:

Output table is: WordCount-1234567890

Verify the results of the Hadoop job

Optionally, after you run the Hadoop job, you can use the cbt CLI to verify that the job ran successfully:

  1. Open a terminal window in Cloud Shell.

  2. Install the cbt CLI:
    gcloud components update
    gcloud components install cbt
  3. Scan the output table to view the results of the Hadoop job, replacing [TABLE_NAME] with the name of your output table:
    cbt -instance [BIGTABLE_INSTANCE_ID] read [TABLE_NAME]

Now that you've verified that the cluster is set up correctly, you can use it torun your own Hadoop jobs.
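When you write your own jobs, a driver class typically wires a mapper and reducer like the ones sketched earlier together and points the job's output at a Bigtable table. The following sketch shows one way to do that with BigtableConfiguration and TableMapReduceUtil; the class names and argument order are hypothetical, and your own job may be configured differently.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // Hypothetical argument order: project ID, instance ID, input path, output table.
    String projectId = args[0];
    String instanceId = args[1];
    String inputPath = args[2];
    String outputTable = args[3];

    // Build a Hadoop Configuration that routes HBase API calls to Bigtable.
    Configuration conf = BigtableConfiguration.configure(projectId, instanceId);
    Job job = Job.getInstance(conf, "wordcount-to-bigtable");
    job.setJarByClass(WordCountDriver.class);

    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.setMapperClass(WordCountSketch.TokenizerMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Configure the reducer to write its Put mutations to the output table.
    TableMapReduceUtil.initTableReducerJob(outputTable, WordCountSketch.CountReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}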

Delete the Dataproc cluster

When you are done using the Dataproc cluster, run the following command to shut down and delete the cluster, replacing [DATAPROC_CLUSTER_NAME] with the name of your Dataproc cluster:

gcloud dataproc clusters delete [DATAPROC_CLUSTER_NAME]
