Migrate schema and data from Apache Hive

This document describes how to migrate your data, security settings, and pipelines from Apache Hive to BigQuery.

You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Apache HiveQL is fully supported by both SQL translation services.

Prepare for migration

The following sections describe how to collect information about table statistics, metadata, and security settings to help you migrate your data warehouse from Hive to BigQuery.

Collect source table information

Gather information about source Hive tables such as their number of rows, number of columns, column data types, size, input format of the data, and location. This information is useful in the migration process and also to validate the data migration. If you have a Hive table named employees in a database named corp, use the following commands to collect table information:

# Find the number of rows in the table
hive> SELECT COUNT(*) FROM corp.employees;

# Output all the columns and their data types
hive> DESCRIBE corp.employees;

# Output the input format and location of the table
hive> SHOW CREATE TABLE corp.employees;
Output:
…
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://demo_cluster/user/hive/warehouse/corp/employees'
TBLPROPERTIES (
…

# Get the total size of the table data in bytes
shell> hdfs dfs -du -s TABLE_LOCATION

Source table format conversion

Some of the formats that Hive supports cannot be ingested into BigQuery directly.

Hive supports storing data in the following formats:

  • Text file
  • RC file
  • Sequence file
  • Avro file
  • ORC file
  • Parquet file

BigQuery supports loading data from Cloud Storage in any of the following file formats:

  • CSV
  • JSON (Newline delimited)
  • Avro
  • ORC
  • Parquet

BigQuery can load data files in Avro, ORC, and Parquet formats directly, without the need for schema files. For text files that are not formatted as CSV or JSON (newline delimited), you can either copy the data to a Hive table in Avro format, or you can convert the table schema to a BigQuery JSON schema to provide when ingesting.
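
For example, a minimal HiveQL sketch of the first approach might look like the following; the table names corp.employees_text and corp.employees_avro are hypothetical:

# Stage a hypothetical text-format table as Avro so that BigQuery can load the
# resulting files without a separate schema file.
hive> CREATE TABLE corp.employees_avro STORED AS AVRO AS SELECT * FROM corp.employees_text;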

Collect Hive access control settings

Hive and BigQuery have different access control mechanisms. Collect all the Hive access control settings such as roles, groups, members, and privileges granted to them. Map out a security model in BigQuery on a per-dataset level and implement a fine-grained ACL. For example, a Hive user can be mapped to a Google account and an HDFS group can be mapped to a Google group. Access can be set on the dataset level. Use the following commands to collect access control settings in Hive:

# List all the users
> hdfs dfs -ls /user/ | cut -d/ -f3

# Show all the groups that a specific user belongs to
> hdfs groups user_name

# List all the roles
hive> SHOW ROLES;

# Show all the roles assigned to a specific group
hive> SHOW ROLE GRANT GROUP group_name;

# Show all the grants for a specific role
hive> SHOW GRANT ROLE role_name;

# Show all the grants for a specific role on a specific object
hive> SHOW GRANT ROLE role_name ON object_type object_name;

In Hive, you can access the HDFS files behind the tables directly if you have the required permissions. In standard BigQuery tables, after the data is loaded into the table, the data is stored in BigQuery storage. You can read data by using the BigQuery Storage Read API, but all IAM, row-level, and column-level security is still enforced. If you are using BigQuery external tables to query the data in Cloud Storage, access to Cloud Storage is also controlled by IAM.

You can create a BigLake table that lets you use connectors to query the data with Apache Spark, Trino, or Apache Hive. The BigQuery Storage API enforces row-level and column-level governance policies for all BigLake tables in Cloud Storage or BigQuery.

Data migration

Migrating Hive data from your on-premises or other cloud-based source cluster to BigQuery has two steps:

  1. Copying data from a source cluster to Cloud Storage
  2. Loading data from Cloud Storage into BigQuery

The following sections cover migrating Hive data, validating migrated data, and handling migration of continuously ingested data. The examples are written for non-ACID tables.

Partition column data

In Hive, data in partitioned tables is stored in a directory structure. Each partition of the table is associated with a particular value of the partition column. The data files themselves do not contain any data of the partition columns. Use the SHOW PARTITIONS command to list the different partitions in a partitioned table.

The example below shows that the source Hive table is partitioned on the columns joining_date and department. The data files under this table do not contain any data related to these two columns.

hive> SHOW PARTITIONS corp.employees_partitioned
joining_date="2018-10-01"/department="HR"
joining_date="2018-10-01"/department="Analyst"
joining_date="2018-11-01"/department="HR"

One way to copy these columns is to convert the partitioned table into a non-partitioned table before loading into BigQuery:

  1. Create a non-partitioned table with a schema similar to that of the partitioned table.
  2. Load data into the non-partitioned table from the source partitioned table.
  3. Copy the data files under the staged non-partitioned table to Cloud Storage.
  4. Load the data into BigQuery with the bq load command and provide the name of the TIMESTAMP or DATE type partition column, if any, as the time_partitioning_field argument. A sketch of these steps follows this list.
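
The following is a minimal sketch of these steps. The staging table corp.employees_flat, the bucket path, the Avro source format, and the assumption that joining_date is a DATE column are all illustrative, not prescriptive:

# Create and populate a non-partitioned staging copy of the partitioned table.
# CTAS keeps joining_date and department as regular columns in the data files.
hive> CREATE TABLE corp.employees_flat STORED AS AVRO AS SELECT * FROM corp.employees_partitioned;

# Copy the staged files to Cloud Storage.
> hadoop distcp hdfs://demo_cluster/user/hive/warehouse/corp/employees_flat gs://hive_data/corp/employees_flat

# Load into BigQuery, partitioning the destination table on joining_date.
> bq load --source_format=AVRO --use_avro_logical_types --time_partitioning_field=joining_date corp.employees_partitioned gs://hive_data/corp/employees_flat/*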

Copy data to Cloud Storage

The first step in data migration is to copy the data to Cloud Storage. Use Hadoop DistCp to copy data from your on-premises or other-cloud cluster to Cloud Storage. Store your data in a bucket in the same region or multi-region as the dataset where you want to store the data in BigQuery. For example, if you want to use an existing BigQuery dataset in the Tokyo region as the destination, you must choose a Cloud Storage regional bucket in Tokyo to hold the data.

After selecting the Cloud Storage bucket location, you can use the following command to list all the data files present at the employees Hive table location:

> hdfs dfs -ls hdfs://demo_cluster/user/hive/warehouse/corp/employees
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000000_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000001_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000002_0

Copy all the files from above to Cloud Storage:

> hadoop distcp hdfs://demo_cluster/user/hive/warehouse/corp/employees gs://hive_data/corp/employees

Note that you are charged for storing the data in Cloud Storage according to the Data storage pricing.

There might be staging directories that hold intermediate files created for query jobs. You must ensure that you delete any such directories before running the bq load command.
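
The staging directory names depend on your Hive configuration; .hive-staging is a common default prefix. A hypothetical cleanup command might look like the following:

# Remove any copied Hive staging directories before loading (the .hive-staging
# prefix is a common default; adjust the pattern to your configuration).
> gcloud storage rm --recursive "gs://hive_data/corp/employees/.hive-staging*"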

Loading data

BigQuery supports batch loading data in many formats from Cloud Storage. Ensure that the BigQuery dataset you want to load your data into exists before you create a load job.

The following command lists the data copied from Hive for a non-ACID table:

> gcloud storage ls gs://hive_data/corp/employees/
gs://hive_data/corp/employees/
gs://hive_data/corp/employees/000000_0
gs://hive_data/corp/employees/000001_0
gs://hive_data/corp/employees/000002_0

To load your Hive data into BigQuery, use the bq load command. You can use a wildcard character (*) in the URI to load data from multiple files that share a common object prefix. For example, use the following command to load all the files sharing the prefix gs://hive_data/corp/employees/:

bq load --source_format=AVRO corp.employees gs://hive_data/corp/employees/*

Because jobs can take a long time to complete, you can execute them asynchronously by setting the --sync flag to False. Running the bq load command outputs the job ID of the created load job, which you can use to poll the job status. This data includes details such as the job type, the job state, and the user who ran the job.
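
For example, assuming the same table and bucket as above, an asynchronous load and a status check might look like the following, where JOB_ID is the ID printed by the bq load command:

# Start the load job asynchronously; the command prints the ID of the created job.
> bq load --sync=false --source_format=AVRO corp.employees gs://hive_data/corp/employees/*

# Poll the job status by using the printed job ID.
> bq show -j JOB_ID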

Poll each load job's status by using its respective job ID and check for any job that has failed with errors. Because BigQuery uses an "All or None" approach when loading data into a table, a failed job does not leave partial data behind. You can resolve the errors and safely create another load job. For more information, see troubleshooting errors.

Ensure you have enough load job quota per table and project. If you exceed your quota, then the load job fails with a quotaExceeded error.

Note that you are not charged for loading data into BigQuery from Cloud Storage. After the data is loaded into BigQuery, it is subject to BigQuery's storage pricing. When the load jobs finish successfully, you can delete any remaining files in Cloud Storage to avoid incurring charges for storing redundant data.

Validation

After loading data successfully, you can validate your migrated data by comparing the number of rows in the Hive and BigQuery tables. View the table information to get details about BigQuery tables such as the number of rows, number of columns, partitioning fields, or clustering fields. For additional validation, consider trying the Data validation tool.
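
For example, a simple row-count comparison for the employees table might look like the following:

# Count the rows in the source Hive table.
hive> SELECT COUNT(*) FROM corp.employees;

# Count the rows in the migrated BigQuery table and compare the two values.
> bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM corp.employees'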

Continuous ingestion

If you continuously ingest data into a Hive table, perform an initial migration and then migrate only the incremental data changes to BigQuery. It is common to create scripts that run repeatedly to find and load new data. There are many ways to do this, and the following sections describe one possible approach.

You can keep track of the migration progress in a Cloud SQL database table, referred to as a tracking table in the following sections. During the first migration run, store the progress in the tracking table. For subsequent migration runs, use the tracking table information to detect whether any additional data has been ingested and can be migrated to BigQuery.
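
The exact schema of the tracking table depends on what you need to track. The following is a hypothetical sketch in MySQL syntax for Cloud SQL, with columns that mirror the simplified tracking tables shown later in this section:

-- Hypothetical tracking table; adjust the columns to the state you need to track.
CREATE TABLE hive_migration_tracking (
  table_name            VARCHAR(255),
  partition_information VARCHAR(255),   -- for example, partition_column=2018-10-01
  inc_col_min           VARCHAR(64),    -- minimum value of the incremental column
  inc_col_max           VARCHAR(64),    -- maximum value of the incremental column
  file_path             VARCHAR(1024),  -- source HDFS file
  gcs_copy_status       VARCHAR(32),
  gcs_file_path         VARCHAR(1024),
  bq_job_id             VARCHAR(255)
);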

Select an INT64, TIMESTAMP, or DATE type identifier column to distinguish the incremental data. This is referred to as an incremental column.

The following table is an example of a table with no partitioning that uses a TIMESTAMP type for its incremental column:

+-----------------------------+-----------+-----------+-----------+-----------+
| timestamp_identifier        | column_2  | column_3  | column_4  | column_5  |
+-----------------------------+-----------+-----------+-----------+-----------+
| 2018-10-10 21:56:41         |           |           |           |           |
| 2018-10-11 03:13:25         |           |           |           |           |
| 2018-10-11 08:25:32         |           |           |           |           |
| 2018-10-12 05:02:16         |           |           |           |           |
| 2018-10-12 15:21:45         |           |           |           |           |
+-----------------------------+-----------+-----------+-----------+-----------+

The following table is an example of a table partitioned on a DATE type column partition_column. It has an integer type incremental column int_identifier in each partition.

+---------------------+---------------------+----------+----------+-----------+
| partition_column    | int_identifier      | column_3 | column_4 | column_5  |
+---------------------+---------------------+----------+----------+-----------+
| 2018-10-01          | 1                   |          |          |           |
| 2018-10-01          | 2                   |          |          |           |
| ...                 | ...                 |          |          |           |
| 2018-10-01          | 1000                |          |          |           |
| 2018-11-01          | 1                   |          |          |           |
| 2018-11-01          | 2                   |          |          |           |
| ...                 | ...                 |          |          |           |
| 2018-11-01          | 2000                |          |          |           |
+---------------------+---------------------+----------+----------+-----------+

The following sections describe migrating Hive data based on whether or not it is partitioned and whether or not it has incremental columns.

Non-partitioned table without incremental columns

Assuming there are no file compactions in Hive, Hive creates new data files when ingesting new data. During the first run, store the list of files in the tracking table and complete the initial migration of the Hive table by copying these files to Cloud Storage and loading them into BigQuery.

> hdfs dfs -ls hdfs://demo_cluster/user/hive/warehouse/corp/employees
Found 3 items
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000000_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000001_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000002_0

After the initial migration, some data is ingested in Hive. You only need to migrate this incremental data to BigQuery. In the subsequent migration runs, list the data files again and compare them with the information from the tracking table to detect new data files that haven't been migrated.

> hdfs dfs -ls hdfs://demo_cluster/user/hive/warehouse/corp/employees
Found 5 items
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000000_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000001_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000002_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000003_0
hdfs://demo_cluster/user/hive/warehouse/corp/employees/000004_0

In this example, two new files are present at the table location. Migrate the data by copying these new data files to Cloud Storage and loading them into the existing BigQuery table.
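
One possible way to script this comparison is sketched below; tracked_files.txt is a hypothetical export of the already-migrated file paths from the tracking table:

# List the current data files and compare them with the tracked (already
# migrated) files to find new files.
> hdfs dfs -ls -C hdfs://demo_cluster/user/hive/warehouse/corp/employees | sort > current_files.txt
> sort tracked_files.txt > tracked_sorted.txt
> comm -23 current_files.txt tracked_sorted.txt > new_files.txt

# Copy each new file to Cloud Storage, then append the new data to the table.
> while read -r f; do hadoop distcp "$f" gs://hive_data/corp/employees/; done < new_files.txt
> bq load --source_format=AVRO corp.employees "gs://hive_data/corp/employees/000003_0,gs://hive_data/corp/employees/000004_0"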

Non-partitioned table with incremental columns

In this case, you can use the maximum value of the incremental column to determine whether any new data was added. While performing the initial migration, query the Hive table to fetch the maximum value of the incremental column and store it in the tracking table:

hive> SELECT MAX(timestamp_identifier) FROM corp.employees;
2018-12-31 22:15:04

In the subsequent migration runs, run the same query again to fetch the present maximum value of the incremental column and compare it with the previous maximum value from the tracking table to check whether incremental data exists:

hive> SELECT MAX(timestamp_identifier) FROM corp.employees;
2019-01-04 07:21:16

If the present maximum value is greater than the previous maximum value, it indicates that incremental data has been ingested into the Hive table, as in the example. To migrate the incremental data, create a staging table and load only the incremental data into it.

hive> CREATE TABLE stage_employees LIKE corp.employees;
hive> INSERT INTO TABLE stage_employees SELECT * FROM corp.employees
      WHERE timestamp_identifier > "2018-12-31 22:15:04" AND timestamp_identifier <= "2019-01-04 07:21:16";

Migrate the staging table by listing its HDFS data files, copying them to Cloud Storage, and loading them into the existing BigQuery table.
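
A minimal sketch of this step, assuming the staging table's files live under the same warehouse path as the other examples in this document, might look like the following:

# List the staging table's files, copy them to Cloud Storage, and append them
# to the existing BigQuery table (bq load appends by default).
> hdfs dfs -ls hdfs://demo_cluster/user/hive/warehouse/corp/stage_employees
> hadoop distcp hdfs://demo_cluster/user/hive/warehouse/corp/stage_employees gs://hive_data/corp/stage_employees
> bq load --source_format=AVRO corp.employees gs://hive_data/corp/stage_employees/*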

Partitioned table without incremental columns

Ingestion of data into a partitioned table might create new partitions, append incremental data to existing partitions, or do both. In this scenario, you can identify the updated partitions, but you cannot easily identify which data has been added to the existing partitions because there is no incremental column to distinguish it. Another option is to take and maintain HDFS snapshots, but snapshotting creates performance concerns for Hive, so it is generally disabled.

While migrating the table for the first time, run the SHOW PARTITIONS command and store the information about the different partitions in the tracking table.

hive> SHOW PARTITIONS corp.employees
partition_column=2018-10-01
partition_column=2018-11-01

The above output shows that the table employees has two partitions. A simplified version of the tracking table is provided below to show how this information can be stored.

+------------------------------+-----------+-----------------+---------------+-----------+-----+
| partition_information        | file_path | gcs_copy_status | gcs_file_path | bq_job_id | ... |
+------------------------------+-----------+-----------------+---------------+-----------+-----+
| partition_column=2018-10-01  |           |                 |               |           |     |
| partition_column=2018-11-01  |           |                 |               |           |     |
+------------------------------+-----------+-----------------+---------------+-----------+-----+

In the subsequent migration runs, run the SHOW PARTITIONS command again to list all the partitions and compare them with the partition information from the tracking table to check whether any new partitions are present that haven't been migrated.

hive> SHOW PARTITIONS corp.employees
partition_column=2018-10-01
partition_column=2018-11-01
partition_column=2018-12-01
partition_column=2019-01-01

If any new partitions are identified, as in the example, create a staging table and load only the new partitions into it from the source table. Migrate the staging table by copying the files to Cloud Storage and loading them into the existing BigQuery table.
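
For example, a hypothetical HiveQL sketch that stages only the two new partitions from the example might look like the following; the staged table keeps partition_column as a regular column in the data files:

# Stage only the partitions that are not yet recorded in the tracking table.
hive> CREATE TABLE stage_employees STORED AS AVRO AS SELECT * FROM corp.employees WHERE partition_column IN ("2018-12-01", "2019-01-01");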

Partitioned table with incremental columns

In this scenario, the Hive table is partitioned and an incremental column is present in every partition. Continuously ingested data increments this column's value. Here you can migrate the new partitions as described in the previous section, and you can also migrate incremental data that has been ingested into the existing partitions.

When migrating the table for the first time, store the minimum and maximum values of the incremental column in each partition, along with the information about the table partitions, in the tracking table.

hive> SHOW PARTITIONS corp.employees
partition_column=2018-10-01
partition_column=2018-11-01

hive> SELECT MIN(int_identifier), MAX(int_identifier) FROM corp.employees WHERE partition_column="2018-10-01";
1 1000

hive> SELECT MIN(int_identifier), MAX(int_identifier) FROM corp.employees WHERE partition_column="2018-11-01";
1 2000

The above output shows that the table employees has two partitions and the minimum and maximum values of the incremental column in each partition. A simplified version of the tracking table is provided below to show how this information can be stored.

+------------------------------+-------------+-------------+-----------+-----------------+-----+
| partition_information        | inc_col_min | inc_col_max | file_path | gcs_copy_status | ... |
+------------------------------+-------------+-------------+-----------+-----------------+-----+
| partition_column=2018-10-01  | 1           | 1000        |           |                 |     |
| partition_column=2018-11-01  | 1           | 2000        |           |                 |     |
+------------------------------+-------------+-------------+-----------+-----------------+-----+

In the subsequent runs, run the same queries to fetch the present maximum value in each partition and compare it with the previous maximum value from the tracking table.

hive> SHOW PARTITIONS corp.employees
partition_column=2018-10-01
partition_column=2018-11-01
partition_column=2018-12-01
partition_column=2019-01-01

hive> SELECT MIN(int_identifier), MAX(int_identifier) FROM corp.employees WHERE partition_column="2018-10-01";

In the example, two new partitions have been identified and some incremental data has been ingested in the existing partition partition_column=2018-10-01. If there is any incremental data, create a staging table, load only the incremental data into the staging table, copy the data to Cloud Storage, and load the data into the existing BigQuery table.
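
Continuing the example, a hypothetical HiveQL sketch that stages both the incremental rows of the existing partition (rows above the previously tracked maximum of 1000) and the rows of the two new partitions might look like the following:

# Stage the incremental rows from the existing partition and all rows from the
# new partitions; the previously tracked values come from the tracking table.
hive> CREATE TABLE stage_employees STORED AS AVRO AS SELECT * FROM corp.employees WHERE (partition_column = "2018-10-01" AND int_identifier > 1000) OR partition_column IN ("2018-12-01", "2019-01-01");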

Security settings

BigQuery uses IAM to manage access to resources. BigQuery predefined roles provide granular access for a specific service and are meant to support common use cases and access control patterns. You can use custom roles to provide even more fine-grained access by customizing a set of permissions.

Access controls on tables and datasets specify the operations that users, groups, and service accounts are allowed to perform on tables, views, and datasets. Authorized views let you share query results with particular users and groups without giving them access to the underlying source data. With row-level security and column-level security, you can restrict who can access which rows or columns within a table. Data masking lets you selectively obscure column data for groups of users, while still allowing access to the column.

When you apply access controls, you can grant access to the following users and groups (a command-line sketch follows this list):

  • User by e-mail: gives an individual Google account access to the dataset
  • Group by e-mail: gives all members of a Google group access to the dataset
  • Domain: gives all users and groups in a Google domain access to the dataset
  • All Authenticated Users: gives all Google account holders access to the dataset (makes the dataset public)
  • Project Owners: gives all project owners access to the dataset
  • Project Viewers: gives all project viewers access to the dataset
  • Project Editors: gives all project editors access to the dataset
  • Authorized View: gives a view access to the dataset
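
One way to apply dataset-level access controls from the command line is to export the dataset's metadata, edit its access entries, and update the dataset. The following is a sketch that assumes a dataset named corp:

# Export the dataset metadata, including its access entries.
> bq show --format=prettyjson corp > dataset_info.json

# Edit the "access" section in dataset_info.json (for example, add a
# userByEmail or groupByEmail entry), then apply the change.
> bq update --source dataset_info.json corp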

Data pipeline changes

The following sections discuss how to change your data pipelines when you migrate from Hive to BigQuery.

Sqoop

If your existing pipeline uses Sqoop to import data into HDFS or Hive for processing, modify the job to import data into Cloud Storage.

If you are importing data into HDFS, choose one of the following:

If you want Sqoop to import data into Hive running on Google Cloud, point it to the Hive table directly and use Cloud Storage as the Hive warehouse instead of HDFS. To do this, set the property hive.metastore.warehouse.dir to a Cloud Storage bucket.
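
For example, on a Dataproc cluster you might set this property at cluster creation time; the cluster name, region, and bucket below are hypothetical:

# Create a Dataproc cluster whose Hive warehouse directory points at a
# Cloud Storage bucket instead of HDFS.
> gcloud dataproc clusters create hive-cluster \
    --region=us-central1 \
    --properties=hive:hive.metastore.warehouse.dir=gs://hive-warehouse-bucket/datasets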

You can run your Sqoop job without managing a Hadoop cluster by using Dataproc to submit Sqoop jobs to import data into BigQuery.

Spark SQL and HiveQL

The batch SQL translator or interactive SQL translator can automatically translate your Spark SQL or HiveQL to GoogleSQL.

If you don't want to migrate your Spark SQL or HiveQL to BigQuery, you can use Dataproc or the BigQuery connector with Apache Spark.

Hive ETL

If there are any existing ETL jobs in Hive, you can modify them in the following ways to migrate them from Hive:

  • Convert the Hive ETL job to a BigQuery job by using the batch SQL translator.
  • Use Apache Spark to read from and write to BigQuery by using the BigQuery connector. You can use Dataproc to run your Spark jobs in a cost-efficient way with the help of ephemeral clusters.
  • Rewrite your pipelines using the Apache Beam SDK and run them on Dataflow.
  • Use Apache Beam SQL to rewrite your pipelines.

To manage your ETL pipeline, you can use Cloud Composer (Apache Airflow) and Dataproc Workflow Templates. Cloud Composer provides a tool for converting Oozie workflows to Cloud Composer workflows.

Dataflow

If you want to move your Hive ETL pipeline to fully managed cloud services,consider writing your data pipelines using the Apache Beam SDK and runningthem on Dataflow.

Dataflow is a managed service for executing data processing pipelines. It executes programs written using the open source framework Apache Beam. Apache Beam is a unified programming model that enables you to develop both batch and streaming pipelines.

If your data pipelines follow standard data movement patterns, you can use Dataflow templates to quickly create Dataflow pipelines without writing code. For example, you can refer to this Google-provided template, which lets you read text files from Cloud Storage, apply transformations, and write the results to a BigQuery table.

To further simplify data processing, you can also try Beam SQL, which lets you process data using SQL-like statements.
