Migrate data from HBase to Bigtable offline

This page describes considerations and processes for migrating data from an Apache HBase cluster to a Bigtable instance on Google Cloud.

The process described on this page requires you to take your application offline. If you want to migrate with no downtime, see the guidance for online migration at Replicate from HBase to Bigtable.

To migrate data to Bigtable from an HBase cluster that is hosted on a Google Cloud service, such as Dataproc or Compute Engine, see Migrating HBase hosted on Google Cloud to Bigtable.

Before you begin this migration, you should consider performance implications, Bigtable schema design, your approach to authentication and authorization, and the Bigtable feature set.

Pre-migration considerations

This section suggests a few things to review and think about before you begin your migration.

Performance

Under a typical workload, Bigtable delivers highly predictable performance. Make sure that you understand the factors that affect Bigtable performance before you migrate your data.

Bigtable schema design

In most cases, you can use the same schema design in Bigtable as you do in HBase. If you want to change your schema or if your use case is changing, review the concepts laid out in Designing your schema before you migrate your data.

Authentication and authorization

Before you design access control for Bigtable, review the existing HBase authentication and authorization processes.

Bigtable uses Google Cloud's standard mechanisms for authentication and Identity and Access Management to provide access control, so you convert your existing HBase authorization to IAM. You can map the existing Hadoop groups that provide access control for HBase to different service accounts.

Bigtable allows you to control access at the project, instance, and table levels. For more information, see Access Control.
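
For example, if a Hadoop group currently grants HBase read and write access to a set of applications, you might map that group to a service account and grant the account a Bigtable data role. The following is a minimal sketch; the service account name is hypothetical, and roles/bigtable.user is one possible choice of role.

# Grant a hypothetical service account read/write access to Bigtable data in
# the project. Choose a narrower or broader role to match your HBase policy.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:migrated-hbase-clients@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigtable.user"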

Downtime requirement

The migration approach that is described on this page involves taking your application offline for the duration of the migration. If your business can't tolerate downtime while you migrate to Bigtable, see the guidance for online migration at Replicate from HBase to Bigtable.

Migrate HBase to Bigtable

To migrate your data from HBase to Bigtable, you export an HBase snapshot for each table to Cloud Storage and then import the data into Bigtable. These steps are for a single HBase cluster and are described in detail in the next several sections.

  1. Stop sending writes to your HBase cluster.
  2. Take snapshots of the HBase cluster's tables.
  3. Export the snapshot files to Cloud Storage.
  4. Compute hashes and export them to Cloud Storage.
  5. Create destination tables in Bigtable.
  6. Import the HBase data from Cloud Storage into Bigtable.
  7. Validate the imported data.
  8. Route writes to Bigtable.

Before you begin

  1. Create a Cloud Storage bucket to store your snapshots. Create the bucket in the same location that you plan to run your Dataflow job in.

  2. Create a Bigtable instance to store your new tables.

  3. Identify the Hadoop cluster that you are exporting. You can run the jobs for your migration either directly on the HBase cluster or on a separate Hadoop cluster that has network connectivity to the HBase cluster's Namenode and Datanodes.

  4. Install and configure the Cloud Storage connector on every node in the Hadoop cluster as well as on the host from which you initiate the job. For detailed installation steps, see Installing the Cloud Storage connector.

  5. Open a command shell on a host that can connect to your HBase cluster and your Bigtable project. This is where you'll complete the next steps.

  6. Get the Schema Translation tool:

    wget BIGTABLE_HBASE_TOOLS_URL

    Replace BIGTABLE_HBASE_TOOLS_URL with the URL of the latest JAR with dependencies available in the tool's Maven repository. The URL is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-1.x-tools/1.24.0/bigtable-hbase-1.x-tools-1.24.0-jar-with-dependencies.jar.

    To find the URL or to manually download the JAR, do the following:

    1. Go to the repository.
    2. Click the most recent version number.
    3. Identify the JAR with dependencies file (usually at the top).
    4. Either right-click and copy the URL, or click to download the file.
  7. Get the Import tool:

    wget BIGTABLE_BEAM_IMPORT_URL

    Replace BIGTABLE_BEAM_IMPORT_URL with the URL of the latest shaded JAR available in the tool's Maven repository. The URL is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-beam-import/1.24.0/bigtable-beam-import-1.24.0-shaded.jar.

    To find the URL or to manually download the JAR, do the following:

    1. Go to the repository.
    2. Click the most recent version number.
    3. Click Downloads.
    4. Mouse over shaded.jar.
    5. Either right-click and copy the URL, or click to download the file.
  8. Set the following environment variables:

    # Google Cloud
    export PROJECT_ID=PROJECT_ID
    export INSTANCE_ID=INSTANCE_ID
    export REGION=REGION
    export CLUSTER_NUM_NODES=CLUSTER_NUM_NODES

    # JAR files
    export TRANSLATE_JAR=TRANSLATE_JAR
    export IMPORT_JAR=IMPORT_JAR

    # Cloud Storage
    export BUCKET_NAME="gs://BUCKET_NAME"
    export MIGRATION_DESTINATION_DIRECTORY="$BUCKET_NAME/hbase-migration-snap"

    # HBase
    export ZOOKEEPER_QUORUM=ZOOKEEPER_QUORUM
    export ZOOKEEPER_PORT=2181
    export ZOOKEEPER_QUORUM_AND_PORT="$ZOOKEEPER_QUORUM:$ZOOKEEPER_PORT"
    export MIGRATION_SOURCE_DIRECTORY=MIGRATION_SOURCE_DIRECTORY

    Replace the following:

    • PROJECT_ID: the Google Cloud project that your instance is in
    • INSTANCE_ID: the identifier of the Bigtable instance that you are importing your data to
    • REGION: a region that contains one of the clusters in your Bigtable instance. Example: northamerica-northeast2
    • CLUSTER_NUM_NODES: the number of nodes in your Bigtable instance
    • TRANSLATE_JAR: the name and version number of the bigtable hbase tools JAR file that you downloaded from Maven. The value should look something like bigtable-hbase-1.x-tools-1.24.0-jar-with-dependencies.jar.
    • IMPORT_JAR: the name and version number of the bigtable-beam-import JAR file that you downloaded from Maven. The value should look something like bigtable-beam-import-1.24.0-shaded.jar.
    • BUCKET_NAME: the name of the Cloud Storage bucket where you are storing your snapshots
    • ZOOKEEPER_QUORUM: the ZooKeeper host that the tool will connect to, in the format host1.myownpersonaldomain.com
    • MIGRATION_SOURCE_DIRECTORY: the directory on your HBase host that holds the data that you want to migrate, in the format hdfs://host1.myownpersonaldomain.com:8020/hbase
  9. (Optional) To confirm that the variables were set correctly, run the printenv command to view all environment variables.
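
For example, to spot-check the setup before you continue, you can confirm that the migration variables are set and that the bucket is reachable from the shell. This is an optional sketch, not part of the migration commands themselves.

# Print the migration-related variables and verify access to the bucket.
printenv | grep -E 'PROJECT_ID|INSTANCE_ID|REGION|CLUSTER_NUM_NODES|JAR|BUCKET_NAME|ZOOKEEPER|MIGRATION'
gsutil ls -b $BUCKET_NAME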

Stop sending writes to HBase

Before you take snapshots of your HBase tables, stop sending writes to yourHBase cluster.

Take HBase table snapshots

When your HBase cluster is no longer ingesting data, take a snapshot of each table that you plan to migrate to Bigtable.

A snapshot has a minimal storage footprint on the HBase cluster at first, but over time it might grow to the same size as the original table. The snapshot does not consume any CPU resources.

Run the following command for each table, using a unique name for each snapshot:

echo "snapshot 'TABLE_NAME', 'SNAPSHOT_NAME'" | hbase shell -n

Replace the following:

  • TABLE_NAME: the name of the HBase table that you are exporting data from.
  • SNAPSHOT_NAME: the name for the new snapshot
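
If you are migrating several tables, you can script this step. The following sketch assumes a hypothetical TABLES variable that lists the HBase tables to migrate and derives a unique snapshot name from each table name.

# Hypothetical list of tables to migrate; adjust for your environment.
TABLES="table1 table2 table3"

for TABLE_NAME in $TABLES; do
  # Use the table name plus a fixed suffix as the unique snapshot name.
  echo "snapshot '$TABLE_NAME', '$TABLE_NAME-migration-snap'" | hbase shell -n
done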

Export the HBase snapshots to Cloud Storage

After you create the snapshots, you need to export them. When you execute export jobs on a production HBase cluster, monitor the cluster and other HBase resources to ensure that they remain in a healthy state.

For each snapshot that you want to export, run the following:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM_AND_PORT -snapshot SNAPSHOT_NAME \
    -copy-from $MIGRATION_SOURCE_DIRECTORY \
    -copy-to $MIGRATION_DESTINATION_DIRECTORY/data

Replace SNAPSHOT_NAME with the name of the snapshot to export.

Note: If you need to limit the bandwidth that the export uses, use the -mappers parameter with an integer value that specifies how many mappers to run, and the -bandwidth parameter with an integer value in MB/sec. For example, to limit the concurrency of copy map tasks to 20, set -mappers 20, and to limit the bandwidth that each copy map task uses to 50 MB/sec, add -bandwidth 50 to the command. This makes the total bandwidth 1,000 MB/sec (50 MB/sec * 20 mappers = 1,000 MB/sec).
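
For example, the export command from this section with both limits applied looks like the following; adjust the values for your cluster and network.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM_AND_PORT -snapshot SNAPSHOT_NAME \
    -copy-from $MIGRATION_SOURCE_DIRECTORY \
    -copy-to $MIGRATION_DESTINATION_DIRECTORY/data \
    -mappers 20 -bandwidth 50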

Compute and export hashes

Next, create hashes to use for validation after the migration is complete. HashTable is a validation tool provided by HBase that computes hashes for row ranges and exports them to files. You can run a sync-table job on the destination table to match the hashes and gain confidence in the integrity of the migrated data.

Run the following command for each table that you exported:

hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=20 \
    TABLE_NAME $MIGRATION_DESTINATION_DIRECTORY/hashtable/TABLE_NAME

Replace the following:

  • TABLE_NAME: the name of the HBase table that you created a snapshot for and exported
Note: For bigger migrations where you're migrating multiple tables, you can run ExportSnapshot and HashTable jobs in parallel. This is possible because the export job is reading from disk, and the HashTable job is busy computing hashes and scanning HBase. This option can reduce the downtime required for the migration.
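
One way to run the two jobs in parallel from a shell is to start them in the background and wait for both to finish. This is a sketch for a single table, reusing the commands shown earlier.

# Export the snapshot and compute hashes concurrently, then wait for both jobs.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM_AND_PORT -snapshot SNAPSHOT_NAME \
    -copy-from $MIGRATION_SOURCE_DIRECTORY \
    -copy-to $MIGRATION_DESTINATION_DIRECTORY/data &

hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=20 \
    TABLE_NAME $MIGRATION_DESTINATION_DIRECTORY/hashtable/TABLE_NAME &

wait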

Create destination tables

The next step is to create a destination table in your Bigtable instance for each snapshot that you exported. Use an account that has the bigtable.tables.create permission for the instance.

This guide uses the Bigtable Schema Translation tool, which automatically creates the table for you. However, if you don't want your Bigtable schema to exactly match the HBase schema, you can create a table using the cbt command-line tool or the Google Cloud console.
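
For example, if you decide to define the schema yourself, a table with two column families might be created with the cbt command-line tool as follows. The column family names and garbage collection policies here are placeholders; match them to your HBase table.

# Create a destination table manually instead of using the Schema Translation tool.
cbt -project $PROJECT_ID -instance $INSTANCE_ID createtable TABLE_NAME \
    "families=cf1:maxversions=1,cf2:maxage=30d"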

The Bigtable Schema Translation tool captures the schema of the HBase table, including the table name, column families, garbage collection policies, and splits. Then it creates a similar table in Bigtable.

Note: If your HBase master is in a virtual private cloud or you can't connect to the internet, you can follow the alternative instructions instead of the instructions in this section. When you use the alternative instructions, you export the HBase schema to a file, and then use that file to create tables in Bigtable.

For each table that you want to import, run the following to copy the schema from HBase to Bigtable.

java \
    -Dgoogle.bigtable.project.id=$PROJECT_ID \
    -Dgoogle.bigtable.instance.id=$INSTANCE_ID \
    -Dgoogle.bigtable.table.filter=TABLE_NAME \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
    -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
    -jar $TRANSLATE_JAR

Replace TABLE_NAME with the name of the HBase table that you are importing. The Schema Translation tool uses this name for your new Bigtable table.

You can also optionally replace TABLE_NAME with a regular expression, such as ".*", that captures all the tables that you want to create, and then run the command only once.

Import the HBase data into Bigtable using Dataflow

After you have a table ready to migrate your data to, you are ready to import and validate your data.

Uncompressed tables

If your HBase tables are not compressed, run the following command for each table that you want to migrate:

java -jar $IMPORT_JAR importsnapshot \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableId=TABLE_NAME \
    --hbaseSnapshotSourceDir=$MIGRATION_DESTINATION_DIRECTORY/data \
    --snapshotName=SNAPSHOT_NAME \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/staging \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/temp \
    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
    --region=$REGION

Replace the following:

  • TABLE_NAME: the name of the HBase table that you are importing. The Schema Translation tool uses this name for your new Bigtable table. New table names are not supported.
  • SNAPSHOT_NAME: the name that you assigned to the snapshot of the table that you are importing

After you run the command, the tool restores the HBase snapshot to your Cloud Storage bucket, then starts the import job. It can take several minutes for the process of restoring the snapshot to finish, depending on the size of the snapshot.

Keep the following tips in mind when you import:

  • To improve the performance of data loading, be sure to set maxNumWorkers. This value helps to ensure that the import job has enough compute power to complete in a reasonable amount of time, but not so much that it would overwhelm the Bigtable instance.
    • If you are not also using the Bigtable instance for another workload, multiply the number of nodes in your Bigtable instance by 3, and use that number for maxNumWorkers.
    • If you are using the instance for another workload at the same time that you are importing your HBase data, reduce the value of maxNumWorkers appropriately.
  • Use the default worker type.
  • During the import, monitor the Bigtable instance's CPU usage. If the CPU utilization across the Bigtable instance is too high, you might need to add nodes. It can take up to 20 minutes for the cluster to provide the performance benefit of additional nodes.
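
If you do need to add nodes during the import, one way is to resize the cluster from the command line. The following is a sketch; CLUSTER_ID and the node count are placeholders for your cluster in the import region.

# Scale up the Bigtable cluster; allow up to 20 minutes for the added nodes
# to improve performance.
gcloud bigtable clusters update CLUSTER_ID \
    --instance=$INSTANCE_ID \
    --num-nodes=6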

For more information about monitoring the Bigtable instance, see Monitoring.

Snappy compressed tables

If you are importing Snappy compressed tables, you need to use a custom container image in the Dataflow pipeline. The custom container image that you use to import compressed data into Bigtable provides Hadoop native compression library support. You must have the Apache Beam SDK version 2.30.0 or later to use Dataflow Runner v2, and you must have version 2.3.0 or later of the HBase client library for Java.

To import Snappy compressed tables, run the same command that you run for uncompressed tables, but add the following option:

    --enableSnappy=true

Validate the imported data in Bigtable

To validate the imported data, you need to run the sync-table job. The sync-table job computes hashes for row ranges in Bigtable, then matches them with the HashTable output that you computed earlier.

To run the sync-table job, run the following in the command shell:

java -jar $IMPORT_JAR sync-table \
    --runner=dataflow \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableId=TABLE_NAME \
    --outputPrefix=$MIGRATION_DESTINATION_DIRECTORY/sync-table/output-TABLE_NAME-$(date +"%s") \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/sync-table/staging \
    --hashTableOutputDir=$MIGRATION_DESTINATION_DIRECTORY/hashtable/TABLE_NAME \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/sync-table/dataflow-test/temp \
    --region=$REGION

Replace TABLE_NAME with the name of the HBase table that you are importing.

When the sync-table job is complete, open the Dataflow Job details page and review the Custom counters section for the job. If the import job successfully imports all of the data, ranges_matched has a value and ranges_not_matched is 0.

Dataflow custom counters

If ranges_not_matched shows a value, open the Logs page, choose Worker Logs, and filter by Mismatch on range. The machine-readable output of these logs is stored in Cloud Storage at the output destination that you specify in the sync-table outputPrefix option.

Dataflow worker logs

You can try the import job again or write a script to read the output files to determine where the mismatches occurred. Each line in the output file is a serialized JSON record of a mismatched range.
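
For example, a shell sketch for pulling down those records might look like the following. The exact file layout under the outputPrefix location can vary, so list the directory first and adjust the path before reading the files.

# Inspect the sync-table mismatch output; each line is a serialized JSON record
# of a mismatched range.
gsutil ls -r "$MIGRATION_DESTINATION_DIRECTORY/sync-table/"
gsutil cat "$MIGRATION_DESTINATION_DIRECTORY/sync-table/output-TABLE_NAME-"*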

Route writes to Bigtable

After you've validated the data for each table in the cluster, you can configure your applications to route all their traffic to Bigtable, then deprecate the HBase instance.

When your migration is complete, you can delete the snapshots on your HBaseinstance.
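
For example, you can remove snapshots from the HBase shell once you no longer need them; list them first to confirm the names.

# List the remaining snapshots, then delete the ones created for the migration.
echo "list_snapshots" | hbase shell -n
echo "delete_snapshot 'SNAPSHOT_NAME'" | hbase shell -n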

