Schema and data transfer overview

This document discusses the concepts and tasks for transferring the schema and data from your existing data warehouse to BigQuery.

Migrating your data warehouse to the cloud is a complex process that requires planning, resources, and time. To tame this complexity, you should approach data warehouse migration in a staged and iterative manner. Doing several iterations of schema and data migration can improve the result.

Schema and data migration process

At the start of your migration journey, you have upstream systems that feed your existing data warehouse, and downstream systems that use that data in reports, dashboards, and as feeds to other processes.

This general flow of data supports many analytics use cases, as shown in the following diagram:

Starting state before migration.

The end state of your journey is to have as many use cases as possible running on top of BigQuery. This state enables you to minimize the use of your existing data warehouse and to eventually phase it out. You're in control of which use cases are migrated and when, by prioritizing them during the prepare and discover phase of the migration.

Transfer schema and data to BigQuery

In the planning phase of the migration, you identify the use cases that you want to migrate. Then you start the migration iterations in the execute phase. To manage your iterations while running your analytics environment with minimal disruption, follow this high-level process:

  1. Transfer tables and configure and test downstream processes.

    • Transfer the group of tables for each use case to BigQuery without any changes, using BigQuery Data Transfer Service or another ETL tool. For information about tools, see the initial data transfer section.
    • Configure test versions of your downstream processes to read from the BigQuery tables.

    This initial step divides the flow of data. The following diagram shows the resulting flow. Some downstream systems now read from BigQuery, as shown in the flows labeled B. Others still read from the existing data warehouse, as shown in the flows labeled A.

    Upstream processes feed into the existing data warehouse. Some of those go to downstream processes, but others go to BigQuery by means of BigQuery Data Transfer Service, and from there to different downstream processes.

  2. Configure some test upstream processes to write data to BigQuery tables instead of (or in addition to) the existing data warehouse.

    After testing, configure your production upstream and downstream processes to write to and read from the BigQuery tables. These processes can connect to BigQuery using the BigQuery API and incorporate new cloud products such as Looker Studio and Dataflow.

    At this point, you have three flows of data:

    1. Existing. The data and processes are unchanged and still centered on your existing data warehouse.
    2. Offloaded. The upstream processes feed your existing data warehouse, the data is offloaded to BigQuery, and it then feeds downstream processes.
    3. Fully migrated. The upstream and downstream processes no longer write to or read from the existing data warehouse.

      The following diagram shows a system with all three of these flows:

      Flow of workloads through multiple paths.
  3. Select additional use cases for migration, then go to step 1 to start a new execution iteration. Continue iterating through these steps until all your use cases are fully migrated into BigQuery. When selecting use cases, you can revisit ones that remained in the offloaded state to move them to fully migrated. For the use cases that are fully migrated, consider continuing the evolution process by following the guidelines in Evolve your schema in BigQuery.

    Final step of migrated use cases.

Evolve your schema in BigQuery

The data warehouse schema defines how your data is structured and defines the relationships between your data entities. The schema is at the core of your data design, and it influences many processes, both upstream and downstream.

A data warehouse migration presents a unique opportunity to evolve your schema after it's moved to BigQuery. This section introduces guidelines for evolving your schema using a series of steps. These guidelines help you keep your data warehouse environment running during schema changes with minimal disruption to upstream and downstream processes.

The steps in this section focus on the schema transformation for a single use case.

Depending on how far you want to go with the evolution, you might stop at an intermediate step, or you might continue until your system is fully evolved.

  1. Transfer a use case as is to BigQuery.

    Before you continue with the next steps, make sure that the upstream and downstream processes of your use case are already writing and reading from BigQuery. However, it's also possible to start from an intermediate state where only the downstream process is reading from BigQuery. In this scenario, apply only the guidelines for the downstream part. The following diagram illustrates a use case where upstream and downstream processes write to and read from tables in BigQuery.

    Upstream processes feed into BigQuery tables and from there to downstream processes.

  2. Apply light optimizations.

    1. Re-create your tables, applying partitioning and clustering, as shown in the sketch after these steps. For this task, you can use the method of creating a table from a query result. For details, see the discussion and example for partitioned tables, and see the discussion and example for clustered tables.
    2. Redirect your upstream and downstream processes to the new tables.
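
    For example, a minimal sketch of this step, assuming a hypothetical mydataset.orders table with an order_timestamp column and frequent filters on customer_id and region, might look like the following. The table and column names are illustrative only:

        CREATE TABLE mydataset.orders_optimized
        PARTITION BY DATE(order_timestamp)   -- partition on the date of the timestamp
        CLUSTER BY customer_id, region       -- cluster on common filter columns
        AS
        SELECT * FROM mydataset.orders;
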
  3. Create facade views.

    If you want to further evolve your schema beyond light optimizations, create facade views for your tables. The facade pattern is a design method that masks the underlying code or structures to hide complexity. In this case, the facade views mask the underlying tables to hide the complexity caused by table changes from the downstream processes.

    The views can describe a new schema, free from technical debt, and modeled with new ingestion and consumption scenarios in mind.

    Under the hood, the tables and the view query definition itself can change. But the views abstract away these changes as an internal implementation detail of your data warehouse, and they always return the same results. This abstraction layer made of facade views isolates your upstream and downstream systems from change for as long as needed, and only surfaces the changes when appropriate.
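
    As an illustration, a facade view over the hypothetical orders_optimized table from the previous step might expose a cleaned-up schema to downstream processes. The names are assumptions, not a prescribed layout:

        CREATE OR REPLACE VIEW mydataset.orders_facade AS
        SELECT
          order_id,
          customer_id,
          order_timestamp AS ordered_at,  -- expose a cleaner name in the new schema
          total_amount
        FROM mydataset.orders_optimized;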

  4. Transform downstream processes.

    You can transform some of your downstream processes to read from the facade views instead of from the actual tables. These processes will already benefit from the evolved schema. It's transparent to these processes that under the hood, the facade views still get their data from the source BigQuery schema, as shown in the following diagram:

    Upstream processes feed into BigQuery tables. Some feed into downstream processes. Others feed into facade views, which feed into evolved downstream processes.

    We've described the downstream process transformation first. This lets you show business value more quickly, in the form of migrated dashboards or reports, than if you transformed upstream processes that aren't visible to non-technical stakeholders. However, it's possible to start the transformation with your upstream processes instead. The priority of these tasks is entirely dependent on your needs.

  5. Transform upstream processes.

    You can transform some of your upstream processes to write into the new schema. Because views are read only, you create tables based on the new schema, and you then modify the query definition of the facade views. Some views will still query the source schema, while others will query the newly created tables, or perform a SQL UNION operation on both, as shown in the following diagram:

    Upstream processes feed into BigQuery tables, but they no longer feed into downstream processes. Instead, the BigQuery tables feed into facade views, which in turn feed into evolved downstream processes.
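
    A sketch of such a facade view, assuming a hypothetical new-schema table orders_new fed by transformed upstream processes and a cutover boundary on the order timestamp, might combine both sources like this:

        CREATE OR REPLACE VIEW mydataset.orders_facade AS
        SELECT order_id, customer_id, ordered_at, total_amount
        FROM mydataset.orders_new                        -- new-schema table
        UNION ALL
        SELECT order_id, customer_id, order_timestamp AS ordered_at, total_amount
        FROM mydataset.orders                            -- source-schema table
        WHERE order_timestamp < TIMESTAMP '2024-01-01';  -- hypothetical cutover boundary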

    At this point, you can take advantage of nested and repeated fields when you create the new tables. This lets you further denormalize your model and take direct advantage of BigQuery's underlying columnar representation of data.

    A benefit of facade views is that your downstream processes can continue their transformation independently from these underlying schema changes and independently from changes in the upstream processes.

  6. Fully evolve your use case.

    Finally, you can transform the remaining upstream and downstream processes. When all of these are evolved to write into the new tables and to read from the new facade views, you modify the query definitions of the facade views to not read from the source schema anymore. You can then retire the tables in the source model from the data flow. The following diagram shows the state where source tables are no longer used.

    The original upstream processes are no longer in use. Only evolved upstream processes remain, which feed to evolved tables, which feed facade views, which feed all of the downstream processes.
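
    In terms of the earlier sketch, this step reduces the facade view to read only from the new-schema table, after which the source table can be removed from the data flow. Again, the names are illustrative:

        CREATE OR REPLACE VIEW mydataset.orders_facade AS
        SELECT order_id, customer_id, ordered_at, total_amount
        FROM mydataset.orders_new;  -- the source-schema table is no longer referenced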

    If the facade views don't aggregate fields or filter columns, you can configure your downstream processes to read from the evolved tables and then retire the facade views to reduce complexity, as shown in the following diagram:

    In the final configuration, both BigQuery tables and evolved tables feed into facade views, which are the only source for downstream processes.

Perform an initial transfer of your schema and data

This section discusses practical considerations for migrating your schema and data from an existing data warehouse to BigQuery.

We recommend that you transfer the schema without any changes during initial iterations of the migration. This gives you the following advantages:

  • The data pipelines that feed your data warehouse don't need to be adjusted for a new schema.
  • You avoid adding a new schema to the list of training material for your staff.
  • You can leverage automated tools to perform the schema and data transfer.

In addition, proofs of concept (PoCs) and other data exploration activities that leverage cloud capabilities can proceed unhindered, even while your migration occurs in parallel.

Choose a transfer method

You can make the initial transfer using one of several approaches.

  • For Amazon Redshift and Teradata data warehouses, you can use BigQuery Data Transfer Service to load schema and data directly from your existing system into BigQuery. Cloud Storage is still used to stage data as part of the migration process.
  • For any data warehouse, you can extract files that contain your schema and data, upload those files to Cloud Storage, and then load the schema and data from those files into BigQuery, either as a one-time load or through scheduled Cloud Storage transfers, as described later in this document.

For further considerations when choosing a data transfer method, see Choosing a data ingestion method.

Consider data transformation

Depending on your data extraction format and whether you want to trim or enrich your data before loading it into BigQuery, you might include a step to transform your data. You can transform the data in the existing environment or on Google Cloud:

  • If you transform the data in the current environment, consider how the available compute capacity and tooling might limit throughput. In addition, if you are enriching the data during the transformation process, consider whether you need additional transfer time or network bandwidth.
  • If you transform the data on Google Cloud, see Load data using an ETL tool for more information on your options.

Extract the existing schema and data into files

Your existing platform probably provides a tool to export data to a vendor-agnostic format like Apache Avro or CSV. To reduce the transfer complexity, we recommend using Avro, ORC, or Parquet, where schema information is embedded with the data. If you choose CSV or a similar simple, delimited data format, you must specify the schema separately. How you do this depends on the data transfer method you select. For example, for batch upload, you can either specify a schema at load time or allow auto-detection of the schema based on the CSV file contents.

As you extract these files from your existing platform, copy them into staging storage in your existing environment.

Upload the files to Cloud Storage

Unless you are using BigQuery Data Transfer Service to load data directly from an existing Amazon Redshift or Teradata data warehouse, you must upload the extracted files to a bucket in Cloud Storage. Depending on the amount of data you're transferring and the network bandwidth available, you can choose from the following transfer options:

  • If your extracted data is in another cloud provider, use Storage Transfer Service.
  • If the data is in an on-premises environment or in a colocation facility that has good network bandwidth, use the Google Cloud CLI. The gcloud CLI supports multi-threaded parallel uploads, resumes after errors, and encrypts traffic in transit for security.

  • If you cannot achieve sufficient network bandwidth, you can perform an offline transfer using a Transfer Appliance.

When you create the Cloud Storage bucket and are transferring data through the network, minimize network latency by choosing the location closest to your data center. If possible, choose the location of the bucket to be the same as the location of the BigQuery dataset.

For detailed information on best practices when moving data into Cloud Storage, including estimating costs, see Strategies for transferring big data sets.

Load the schema and data into BigQuery

Load the schema and data into BigQuery, using one of the options discussed in Choose a transfer method.

For more information on one-time loads, see Introduction to loading data from Cloud Storage in the BigQuery documentation. For more information on loads scheduled at regular intervals, see Overview of Cloud Storage transfers in the BigQuery Data Transfer Service documentation.
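
As one illustration of a one-time load, assuming your extracted Avro files are staged in a hypothetical Cloud Storage bucket, the GoogleSQL LOAD DATA statement can load them into a table. The bucket, dataset, and table names below are placeholders:

    LOAD DATA INTO mydataset.orders
    FROM FILES (
      format = 'AVRO',
      uris = ['gs://my-staging-bucket/orders/*.avro']  -- staged export files
    );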

Load data using an ETL tool

If your data needs further transformation as it is loaded into BigQuery, use one of the following options:

  • Cloud Data Fusion. This tool lets you graphically build fully managed ETL/ELT data pipelines using an open source library of preconfigured connectors and transformations.
  • Dataflow. This tool lets you define and run complex data transformation and enrichment graphs using the Apache Beam model. Dataflow is serverless and fully managed by Google, giving you access to virtually limitless processing capacity.
  • Dataproc. This service runs Apache Spark and Apache Hadoop clusters on a fully managed cloud service.
  • Third-party tools. Contact one of our partners. They can provide effective choices when you want to externalize the building of a data pipeline.

The following diagram shows the pattern described in this section. The data transfer tool is the gcloud CLI, and there's a transformation step that leverages Dataflow and writes directly to BigQuery, perhaps using the Apache Beam built-in Google BigQuery I/O connector.

The existing data warehouse copies data to temporary storage on-premises. The gcloud CLI copies the data to a Cloud Storage bucket. Dataflow reads from the staging bucket and moves the data to BigQuery.

After you've loaded an initial set of your data into BigQuery, you can start taking advantage of BigQuery's powerful features.

However, as when you transferred your schema, uploading your data is part of an iterative process. Subsequent iterations can start by expanding the footprint of the data being transferred to BigQuery. Then you can reroute your upstream data feeds to BigQuery to eliminate the need for keeping your existing data warehouse running. These topics are explored in the next section.

Validate the data

Now that your data is in BigQuery, you can verify the success of your data transfer with the Data Validation Tool (DVT). DVT is an open source, Python CLI tool that allows you to compare data from various sources to your target in BigQuery. You can specify what aggregations you would like to run (for example, count, sum, average) and the columns that these should apply to. For more information, see Automate Data Validation with DVT.

Iterate on the initial transfer

This section discusses how to proceed after your initial data transfer in order to take best advantage of BigQuery.

A subset of your data is now in BigQuery. You're preparing to increase the footprint of the data being used in BigQuery, and therefore to reduce the dependency on your existing data warehouse.

The method you choose for subsequent iterations depends on how important it is for your use case to have up-to-the-second data. If your data analysts can work with data that is incorporated into the data warehouse at recurring intervals, a scheduled option is the way to go. You can create scheduled transfers in a manner similar to the initial transfer. You can use BigQuery Data Transfer Service, other ETL tools such as Google's Storage Transfer Service, or third-party implementations.

If you use BigQuery Data Transfer Service, first you decide which tables to move. Then you create a table name pattern to include those tables. Finally, you set a recurring schedule for when to run the transfer.

On the other hand, if your use case requires near-instant access to new data,you require a streaming approach. You have two options:

In the short term, increasing the footprint of your BigQuery data and using it for downstream processes should be focused on satisfying your top-priority use cases, with the medium-term goal of moving off your existing data warehouse. Use the initial iterations wisely and don't spend a lot of resources perfecting the ingestion pipelines from your existing data warehouse into BigQuery. Ultimately, you'll need to adapt those pipelines not to use the existing data warehouse.

Optimize the schema

Simply migrating your tables as-is to BigQuery allows you to take advantage of its unique features. For instance, there is no need to rebuild indexes or reshuffle data blocks (vacuuming), and there is no downtime or performance degradation because of version upgrades.

However, keeping the data warehouse model intact beyond the initial iterations of the migration also has disadvantages:

  • Existing issues and technical debt associated with the schema remain.
  • Query optimizations are limited, and they might need to be redone if the schema is updated at a later stage.
  • You don't take full advantage of other BigQuery features, such as nested and repeated fields, partitioning, and clustering.

As you move towards doing a final transfer, we recommend that you improve system performance by applying partitioning and clustering to the tables you create in your schema.

Partitioning

BigQuery lets you divide your data into segments, called partitions, that make it easier and more efficient to manage and query your data. You can partition your tables based on a TIMESTAMP or DATE column, or BigQuery can add pseudocolumns to automatically partition your data during ingestion. Queries that involve smaller partitions can be more performant because they scan only a subset of the data, and can reduce costs by minimizing the number of bytes being read.

Partitioning does not impact the existing structure of your tables. Therefore, you should consider creating partitioned tables even if your schema is not modified.
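
For example, a partitioned table and a query that benefits from partition pruning might look like the following sketch. The dataset, table, and column names are hypothetical:

    -- Create a table partitioned on the date of a timestamp column.
    CREATE TABLE mydataset.events
    (
      event_id STRING,
      event_timestamp TIMESTAMP,
      payload STRING
    )
    PARTITION BY DATE(event_timestamp);

    -- A filter on the partitioning column scans only the matching partitions.
    SELECT event_id, payload
    FROM mydataset.events
    WHERE DATE(event_timestamp) = DATE '2024-06-01';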

Clustering

In BigQuery, no indexes are used to query your data. BigQuery's performance is optimized by its underlying model, storage and query techniques, and massively parallel architecture. When you run a query, the more data that is being processed, the more machines are added to scan data and aggregate results concurrently. This technique scales well to huge datasets, whereas rebuilding indexes does not.

Nevertheless, there is room for further query optimization with techniques like clustering. With clustering, BigQuery automatically sorts your data based on the values of a few columns that you specify and colocates them in optimally sized blocks. Clustering improves query performance compared to not using clustering, and it lets BigQuery better estimate the cost of running the query. With clustered columns, queries also eliminate scans of unnecessary data, and can calculate aggregates more quickly because the blocks colocate records with similar values.

Examine your queries for columns frequently used for filtering and create your tables with clustering on those columns. For more information about clustering, see Introduction to clustered tables.
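
As a sketch, if queries frequently filter on customer_id and region, a table clustered on those columns (and partitioned by date) might be declared as follows. All names and values are placeholders:

    CREATE TABLE mydataset.sales
    (
      sale_id STRING,
      sale_date DATE,
      customer_id STRING,
      region STRING,
      amount NUMERIC
    )
    PARTITION BY sale_date
    CLUSTER BY customer_id, region;

    -- Filters on the clustered columns let BigQuery skip blocks that can't match.
    SELECT SUM(amount) AS total_amount
    FROM mydataset.sales
    WHERE customer_id = 'C-1042' AND region = 'EMEA';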

Denormalization

Data migration is an iterative process. Therefore, once you've moved your initial schema to the cloud, performed light optimizations, and tested a few key use cases, it might be time to explore additional capabilities that benefit from the underlying design of BigQuery. These include nested and repeated fields.

Data warehouse schemas have historically used the following models:

  • Star schema. This is a denormalized model, where a fact table collects metrics such as order amount, discount, and quantity, along with a group of keys. These keys belong to dimension tables such as customer, supplier, region, and so on. Graphically, the model resembles a star, with the fact table in the center surrounded by dimension tables.
  • Snowflake schema. This is similar to the star schema, but with its dimension tables normalized, which gives the schema the appearance of a unique snowflake.

BigQuery supports both star and snowflake schemas, but its native schema representation is neither of those two. It uses nested and repeated fields instead for a more natural representation of the data. For more information, see the example schema in the BigQuery documentation.

Changing your schema to use nested and repeated fields is an excellent evolutionary choice. It reduces the number of joins required for your queries, and it aligns your schema with the BigQuery internal data representation. Internally, BigQuery organizes data using the Dremel model and stores it in a columnar storage format called Capacitor.
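
As an illustration, an orders table that embeds its line items as a repeated nested field removes the join to a separate line-items table. The schema below is a hypothetical sketch:

    CREATE TABLE mydataset.orders_nested
    (
      order_id STRING,
      customer STRUCT<customer_id STRING, name STRING>,
      line_items ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
    );

    -- Repeated fields are read with UNNEST instead of a join.
    SELECT
      order_id,
      item.sku,
      item.quantity * item.unit_price AS line_total
    FROM mydataset.orders_nested, UNNEST(line_items) AS item;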

To decide the best denormalization approach for your case, consider the guidance on using nested and repeated fields in BigQuery, as well as the techniques for handling schema changes.

What's next

Learn more about the following steps in data warehouse migration:

You can also learn about moving from specific data warehouse technologies to BigQuery:
