Amazon Redshift to BigQuery migration: Overview

This document provides guidance on migrating from Amazon Redshift to BigQuery, focusing on the following topics:

  • Strategies for migration
  • Best practices for query optimization and data modeling
  • Troubleshooting tips
  • User adoption guidance

The objectives of this document are as follows:

  • Provide high-level guidance for organizations migrating from Amazon Redshift to BigQuery, including helping you rethink your existing data pipelines to get the most out of BigQuery.
  • Help you compare the architectures of BigQuery and Amazon Redshift so that you can determine how to implement existing features and capabilities during migration. The goal is to show you new capabilities available to your organization through BigQuery, not to map features one to one with Amazon Redshift.

This document is intended for enterprise architects, database administrators, application developers, and IT security specialists. It assumes you are familiar with Amazon Redshift.

You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Amazon Redshift SQL is fully supported by both SQL translation services.

Pre-migration tasks

To help ensure a successful data warehouse migration, start planning your migration strategy early in your project timeline. This approach lets you evaluate the Google Cloud features that suit your needs.

Capacity planning

BigQuery uses slots to measure analytics throughput. A BigQuery slot is Google's proprietary unit of computational capacity required to execute SQL queries. BigQuery continuously calculates how many slots are required by queries as they execute, but it allocates slots to queries based on a fair scheduler.

You can choose between the following pricing models when capacity planning forBigQuery slots:

  • On-demand pricing: With on-demand pricing, BigQuery charges for the number of bytes processed (data size), so you pay only for the queries that you run. For more information about how BigQuery determines data size, see Data size calculation. Because slots determine the underlying computational capacity, you can pay for BigQuery usage depending on the number of slots you'll need (instead of bytes processed). By default, all Google Cloud projects are limited to a maximum of 2,000 slots. BigQuery might burst beyond this limit to accelerate your queries, but bursting is not guaranteed.
  • Capacity-based pricing: With capacity-based pricing, you purchase BigQuery slot reservations (a minimum of 100) instead of paying for the bytes processed by the queries that you run. We recommend capacity-based pricing for enterprise data warehouse workloads, which commonly see many concurrent reporting and extract-load-transform (ELT) queries that have predictable consumption.

To help with slot estimation, we recommend setting up BigQuery monitoring using Cloud Monitoring and analyzing your audit logs using BigQuery. You can use Looker Studio (here's an open source example of a Looker Studio dashboard) or Looker to visualize BigQuery's audit log data, specifically for slot usage across queries and projects. You can also use BigQuery's system tables data for monitoring slot utilization across jobs and reservations (here's an open source example of a Looker Studio dashboard). Regularly monitoring and analyzing your slot utilization helps you estimate how many total slots your organization needs as you grow on Google Cloud.
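As an illustrative sketch of this kind of analysis, the following query estimates average hourly slot consumption from the INFORMATION_SCHEMA jobs view. It assumes your jobs run in the US multi-region; adjust the region qualifier and lookback window for your environment.

```sql
-- Approximate slots consumed per hour over the last seven days.
SELECT
  TIMESTAMP_TRUNC(creation_time, HOUR) AS usage_hour,
  -- Summing total_slot_ms over an hour and dividing by the milliseconds
  -- in an hour approximates the average number of slots in use that hour.
  SUM(total_slot_ms) / (1000 * 60 * 60) AS approx_avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY usage_hour
ORDER BY usage_hour;
```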

For example, suppose you initially reserve 4,000 BigQuery slots to run 100 medium-complexity queries simultaneously. If you notice high wait times in your queries' execution plans, and your dashboards show high slot utilization, this could indicate that you need additional BigQuery slots to help support your workloads. If you want to purchase slots yourself through yearly or three-year commitments, you can get started with BigQuery reservations using the Google Cloud console or the bq command-line tool. For more details about workload management, query execution, and BigQuery architecture, see Migration to Google Cloud: An in-depth view.

Security in Google Cloud

The following sections describe common Amazon Redshift security controls and how you can help ensure that your data warehouse stays protected in a Google Cloud environment.

Identity and access management

Setting up access controls in Amazon Redshift involves writing Amazon Redshift API permissions policies and attaching them to Identity and Access Management (IAM) identities. The Amazon Redshift API permissions provide cluster-level access but don't provide access levels more granular than the cluster. If you want more granular access to resources like tables or views, you can use user accounts in the Amazon Redshift database.

BigQuery uses IAM to manage access to resources at a more granular level. The types of resources available in BigQuery are organizations, projects, datasets, tables, columns, and views. In the IAM policy hierarchy, datasets are child resources of projects. A table inherits permissions from the dataset that contains it.

To grant access to a resource, assign one or more IAM roles to a user, group, or service account. Organization and project roles affect the ability to run jobs or manage the project, whereas dataset roles affect the ability to access or modify the data inside a project.

IAM provides these types of roles:

  • Predefined roles, which are meant to support common use cases and access control patterns.
  • Custom roles, which provide granular access according to a user-specified list of permissions.

Within IAM, BigQuery provides table-level access control. Table-level permissions determine the users, groups, and service accounts that can access a table or view. You can give a user access to specific tables or views without giving the user access to the complete dataset. For more granular access, you can also look into implementing one or more of the following security mechanisms (a short example follows this list):

  • Column-level access control, which provides fine-grained access to sensitive columns using policy tags, or type-based classification of data.
  • Column-level dynamic data masking, which lets you selectively obscure column data for groups of users, while still allowing access to the column.
  • Row-level security, which lets you filter data and enables access to specific rows in a table based on qualifying user conditions.
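As a rough sketch of what table-level and row-level controls look like in GoogleSQL (the project, dataset, table, column, and principal names below are placeholders), you can grant access to a single table and restrict which rows a user can see:

```sql
-- Grant read access to one table without exposing the rest of the dataset.
GRANT `roles/bigquery.dataViewer`
ON TABLE `myproject.sales.orders`
TO "user:analyst@example.com";

-- Limit that user to rows that satisfy a qualifying condition.
CREATE ROW ACCESS POLICY us_orders_only
ON `myproject.sales.orders`
GRANT TO ("user:analyst@example.com")
FILTER USING (country = "US");
```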

Full-disk encryption

In addition to identity and access management, data encryption adds an extralayer of defense for protecting data. In the case of data exposure, encrypteddata is not readable.

On Amazon Redshift, encryption for both data at rest and data in transit is not enabled by default. Encryption for data at rest must be explicitly enabled when a cluster is launched or by modifying an existing cluster to use AWS Key Management Service encryption. Encryption for data in transit must also be explicitly enabled.

BigQuery encrypts all data at rest and in transit by default, regardless of the source or any other condition, and this cannot be turned off. BigQuery also supports customer-managed encryption keys (CMEK) if you want to control and manage key encryption keys in Cloud Key Management Service.
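For illustration, a table can reference a Cloud KMS key when it is created. The following is a minimal sketch; the dataset, table, and key resource names are placeholders, not values from this guide.

```sql
-- Create a table whose data is protected with a customer-managed key.
-- Replace kms_key_name with your own Cloud KMS key resource name.
CREATE TABLE `myproject.secure_dataset.payments`
(
  payment_id STRING,
  amount NUMERIC
)
OPTIONS (
  kms_key_name = 'projects/myproject/locations/us/keyRings/my-ring/cryptoKeys/my-key'
);
```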

For more information about encryption in Google Cloud, see the whitepapers about data-at-rest encryption and data-in-transit encryption.

For data in transit on Google Cloud, data is encrypted and authenticated when it moves outside of the physical boundaries controlled by Google or on behalf of Google. Inside these boundaries, data in transit is generally authenticated but not necessarily encrypted.

Data loss prevention

Compliance requirements might limit what data can be stored on Google Cloud. You can use Sensitive Data Protection to scan your BigQuery tables to detect and classify sensitive data. If sensitive data is detected, Sensitive Data Protection de-identification transformations can mask, delete, or otherwise obscure that data.

Migration to Google Cloud: The basics

Use this section to learn more about using tools and pipelines to help with migration.

Migration tools

The BigQuery Data Transfer Service provides an automated tool to migrate both schema and data from Amazon Redshift to BigQuery directly. The following list describes additional tools to assist in migrating from Amazon Redshift to BigQuery:

  • BigQuery Data Transfer Service: Perform an automated batch transfer of your Amazon Redshift data to BigQuery by using this fully managed service.
  • Storage Transfer Service: Quickly import Amazon S3 data into Cloud Storage and set up a repeating schedule for transferring data by using this fully managed service.
  • gcloud: Copy Amazon S3 files into Cloud Storage by using this command-line tool.
  • bq command-line tool: Interact with BigQuery by using this command-line tool. Common interactions include creating BigQuery table schemas, loading Cloud Storage data into tables, and executing queries.
  • Cloud Storage client libraries: Copy Amazon S3 files into Cloud Storage by using your custom tool, built on top of the Cloud Storage client library.
  • BigQuery client libraries: Interact with BigQuery by using your custom tool, built on top of the BigQuery client library.
  • BigQuery query scheduler: Schedule recurring SQL queries by using this built-in BigQuery feature.
  • Cloud Composer: Orchestrate transformations and BigQuery load jobs by using this fully managed Apache Airflow environment.
  • Apache Sqoop: Submit Hadoop jobs by using Sqoop and Amazon Redshift's JDBC driver to extract data from Amazon Redshift into either HDFS or Cloud Storage. Sqoop runs in a Dataproc environment.

For more information on using BigQuery Data Transfer Service, see Migrate schema and data from Amazon Redshift.

Migration using pipelines

Your data migration from Amazon Redshift to BigQuery can take different paths based on the available migration tools. While the list in this section is not exhaustive, it does provide a sense of the different data pipeline patterns available when moving your data.

For more high-level information about migrating data to BigQuery by using pipelines, see Migrate data pipelines.

Extract and load (EL)

You can fully automate an EL pipeline by using BigQuery Data Transfer Service, which can automatically copy your tables' schemas and data from your Amazon Redshift cluster to BigQuery. If you want more control over your data pipeline steps, you can create a pipeline using the options described in the following sections.

Use Amazon Redshift file extracts
  1. Export Amazon Redshift data to Amazon S3.
  2. Copy data from Amazon S3 to Cloud Storage by using any of the following options:

  3. Load Cloud Storage data into BigQuery by using any of the following options (see the sketch after this list):
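As a minimal sketch of steps 1 and 3 expressed in SQL (bucket names, IAM role, table names, and file format are placeholders), you might export from Amazon Redshift with UNLOAD and then load the copied files into BigQuery with the LOAD DATA statement:

```sql
-- Step 1 (run in Amazon Redshift): export a table to Amazon S3 as Parquet.
UNLOAD ('SELECT * FROM sales.orders')
TO 's3://my-redshift-export/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
FORMAT AS PARQUET;

-- Step 3 (run in BigQuery): after copying the files to Cloud Storage,
-- load them into a BigQuery table.
LOAD DATA INTO `myproject.sales.orders`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-migration-bucket/orders/*.parquet']
);
```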

Use an Amazon Redshift JDBC connection

Use any of the following Google Cloud products to export Amazon Redshift data by using the Amazon Redshift JDBC driver:

Extract, transform, and load (ETL)

If you want to transform some data before loading it into BigQuery, follow the same pipeline recommendations that are described in the Extract and load (EL) section, adding an extra step to transform your data before loading it into BigQuery.

Use Amazon Redshift file extracts
  1. Export Amazon Redshift data to Amazon S3.

  2. Copy data from Amazon S3 to Cloud Storage by using any of the following options:

  3. Transform and then load your data into BigQuery by using any of the following options:

Use an Amazon Redshift JDBC connection

Use any of the products described in the Extract and load (EL) section, modifying your pipeline to introduce one or more steps that transform your data before writing it to BigQuery.

Extract, load, and transform (ELT)

You can transform your data using BigQuery itself, using any of the Extract and load (EL) options to load your data into a staging table. You then transform the data in this staging table using SQL queries that write their output into your final production table.
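For instance, a minimal ELT sketch (the table and column names are assumptions) might transform rows from a staging table and upsert them into the production table:

```sql
-- Transform staged rows and merge them into the production table.
MERGE `myproject.sales.orders` AS prod
USING (
  SELECT
    order_id,
    CAST(order_ts AS TIMESTAMP) AS order_ts,
    UPPER(country_code) AS country_code,
    SAFE_CAST(amount AS NUMERIC) AS amount
  FROM `myproject.staging.orders_raw`
) AS stg
ON prod.order_id = stg.order_id
WHEN MATCHED THEN
  UPDATE SET order_ts = stg.order_ts, country_code = stg.country_code, amount = stg.amount
WHEN NOT MATCHED THEN
  INSERT (order_id, order_ts, country_code, amount)
  VALUES (stg.order_id, stg.order_ts, stg.country_code, stg.amount);
```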

Change data capture (CDC)

Change data capture is one of several software design patterns used to track data changes. It's often used in data warehousing because the data warehouse is used to collate and track data and its changes from various source systems over time.
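A common way to apply CDC records in BigQuery, sketched below with assumed table and column names, is to land change records in an append-only table and periodically merge only the most recent change for each key into the target table:

```sql
-- Keep only the latest change per key, then apply inserts, updates, and deletes.
MERGE `myproject.dw.customers` AS t
USING (
  SELECT * EXCEPT (row_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS row_num
    FROM `myproject.cdc.customers_changes`
  )
  WHERE row_num = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);
```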

Partner tools for data migration

There are several vendors in the extract, transform, and load (ETL) space. Refer to the BigQuery partner website for a list of key partners and their provided solutions.

Migration to Google Cloud: An in-depth view

Use this section to learn more about how your data warehouse architecture, schema, and SQL dialect affect your migration.

Architecture comparison

Both BigQuery and Amazon Redshift are based on a massively parallel processing (MPP) architecture. Queries are distributed across multiple servers to accelerate their execution. With regard to system architecture, Amazon Redshift and BigQuery primarily differ in how data is stored and how queries are executed. In BigQuery, the underlying hardware and configurations are abstracted away; its storage and compute allow your data warehouse to grow without any intervention from you.

Compute, memory, and storage

In Amazon Redshift, CPU, memory, and disk storage are tied together through compute nodes, as illustrated in this diagram from the Amazon Redshift documentation. Cluster performance and storage capacity are determined by the type and the quantity of compute nodes, both of which must be configured. To change compute or storage, you need to resize your cluster through a process that can take from a couple of hours up to two days or longer, and that creates a brand new cluster and copies the data over. Amazon Redshift also offers RA3 nodes with managed storage that help separate compute and storage. The largest node in the RA3 category caps at 64 TB of managed storage for each node.

From the start, BigQuery does not tie together compute, memory,and storage but instead treats each separately.

BigQuery compute is defined by slots, a unit of computational capacity required to execute queries. Google manages the entire infrastructure that a slot encapsulates, eliminating all but the task of choosing the proper slot quantity for your BigQuery workloads. Refer to the Capacity planning section for help deciding how many slots to purchase for your data warehouse. BigQuery memory is provided by a remote distributed service, connected to compute slots by Google's petabit network, all managed by Google.

BigQuery and Amazon Redshift both use columnar storage, but BigQuery uses variations and advancements on columnar storage. While columns are being encoded, various statistics about the data are persisted and later used during query execution to compile optimal plans and to choose the most efficient runtime algorithm. BigQuery stores your data in Google's distributed file system, where it's automatically compressed, encrypted, replicated, and distributed. This is all accomplished without affecting the computing power available for your queries. Separating storage from compute lets you scale up to dozens of petabytes in storage seamlessly, without requiring additional expensive compute resources. A number of other benefits of separating compute and storage exist as well.

Scaling up or down

When storage or compute become constrained, Amazon Redshift clusters must beresized by modifying the quantity or types of nodes in the cluster.

When you resize an Amazon Redshift cluster, there are two approaches:

  • Classic resize: Amazon Redshift creates a cluster to which data is copied, a process that can take a couple of hours or as much as two days or longer for large amounts of data.
  • Elastic resize: If you change only the number of nodes, then queries are temporarily paused and connections are held open if possible. During the resize operation, the cluster is read-only. Elastic resize typically takes 10 to 15 minutes but might not be available for all configurations.

Because BigQuery is a platform as a service (PaaS), you only have to worry about the number of BigQuery slots that you want to reserve for your organization. You reserve BigQuery slots in reservations and then assign projects to these reservations. To learn how to set up these reservations, see Capacity planning.

Query execution

BigQuery's execution engine is similar to Amazon Redshift's in that they both orchestrate your query by breaking it into steps (a query plan), executing the steps (concurrently where possible), and then reassembling the results. Amazon Redshift generates a static query plan, but BigQuery does not, because it dynamically optimizes query plans as your query executes. BigQuery shuffles data using its remote memory service, whereas Amazon Redshift shuffles data using local compute node memory. For more information about BigQuery's storage of intermediate data from various stages of your query plan, see In-memory query execution in Google BigQuery.

Workload management in BigQuery

BigQuery offers the following controls for workload management (WLM):

  • Interactive queries, which are executed as soon as possible (this is the default setting).
  • Batch queries, which are queued on your behalf and then start as soon as idle resources are available in the BigQuery shared resource pool.
  • Slot reservations through capacity-based pricing. Instead of paying for queries on demand, you can dynamically create and manage buckets of slots called reservations and assign projects, folders, or organizations to these reservations. You can purchase BigQuery slot commitments (starting at a minimum of 100) as flex, monthly, or yearly commitments to help minimize costs. By default, queries running in a reservation automatically use idle slots from other reservations.

    As the following diagram illustrates, suppose you purchased a total commitment capacity of 1,000 slots to share across three workload types: data science, ELT, and business intelligence (BI). To support these workloads, you could create the following reservations (a DDL sketch follows below):

    • You can create the reservation ds with 500 slots, and assign all Google Cloud data science projects to that reservation.
    • You can create the reservation elt with 300 slots, and assign the projects you use for ELT workloads to that reservation.
    • You can create the reservation bi with 200 slots, and assign projects connected to your BI tools to that reservation.

    The following graphic shows how slot commitments, reservations, and assignments work together.

    Instead of distributing reservations to your organization's workloads, for example to production and testing, you might choose to assign reservations to individual teams or departments, depending on your use case.

    For more information, see Workload management using Reservations.
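As a sketch of the ds, elt, and bi example above (the administration project, location, and assignee project names are placeholders), a reservation and an assignment can be created with reservation DDL statements:

```sql
-- Create a 500-slot reservation for data science workloads
-- in an administration project, then assign a project to it.
CREATE RESERVATION `admin-project.region-us.ds`
OPTIONS (slot_capacity = 500);

CREATE ASSIGNMENT `admin-project.region-us.ds.data-science-projects`
OPTIONS (
  assignee = 'projects/my-data-science-project',
  job_type = 'QUERY'
);
```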

Workload management in Amazon Redshift

Amazon Redshift offers two types of workload management (WLM):

  • Automatic: With automatic WLM, Amazon Redshift manages query concurrency and memory allocation. Up to eight queues are created with the service class identifiers 100–107. Automatic WLM determines the amount of resources that queries need and adjusts the concurrency based on the workload. For more information, see Query priority.
  • Manual: In contrast, manual WLM requires you to specify values for query concurrency and memory allocation. The default for manual WLM is a concurrency of five queries, and memory is divided equally between all five.

When concurrency scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need it to process an increase in concurrent read queries. Concurrency scaling has certain regional and query considerations. For more information, see Concurrency scaling candidates.

Dataset and table configurations

BigQuery offers a number of ways to configure your data and tables, such as partitioning, clustering, and data locality. These configurations can help you maintain large tables and reduce the overall data load and response time for your queries, thereby increasing the operational efficiency of your data workloads.

Partitioning

A partitioned table is a table that is divided into segments, called partitions, that make it easier to manage and query your data. Users typically split large tables into many smaller partitions, where each partition contains a day's worth of data. Partition management is a key determinant of BigQuery's performance and cost when querying over a specific date range, because it helps BigQuery scan less data per query.

There are three types of table partitioning in BigQuery:

  • Integer-range partitioning, which partitions a table based on ranges of values in an integer column.
  • Time-unit column partitioning, which partitions a table based on a DATE, TIMESTAMP, or DATETIME column.
  • Ingestion-time partitioning, which partitions a table based on the time that BigQuery ingests the data.

A column-based, time-partitioned table obviates the need to maintain partition awareness independent from the existing data filtering on the bound column. Data written to a column-based, time-partitioned table is automatically delivered to the appropriate partition based on the value of the data. Similarly, queries that express filters on the partitioning column can reduce the overall data scanned, which can yield improved performance and reduced query cost for on-demand queries.
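For example, a minimal sketch of a daily, column-based, time-partitioned table (the table and column names are assumptions) looks like the following:

```sql
-- Daily time-unit column partitioning on the order_date column.
CREATE TABLE `myproject.sales.orders`
(
  order_id STRING,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date
OPTIONS (require_partition_filter = TRUE);

-- Queries that filter on the partitioning column scan only matching partitions.
SELECT SUM(amount)
FROM `myproject.sales.orders`
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
```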

BigQuery column-based partitioning is similar to Amazon Redshift's column-based partitioning, with a slightly different motivation. Amazon Redshift uses column-based key distribution to try to keep related data stored together within the same compute node, ultimately minimizing data shuffling that occurs during joins and aggregations. BigQuery separates storage from compute, so it leverages column-based partitioning to minimize the amount of data that slots read from disk.

Once slot workers read their data from disk, BigQuery can automatically determine more optimal data sharding and quickly repartition data using BigQuery's in-memory shuffle service.

For more information, see Introduction to partitioned tables.

Clustering and sort keys

Amazon Redshift supports specifying table columns as either compound or interleaved sort keys. In BigQuery, you can specify compound sort keys by clustering your table. BigQuery clustered tables improve query performance because the table data is automatically sorted based on the contents of up to four columns specified in the table's schema. These columns are used to colocate related data. The order of the clustering columns you specify is important because it determines the sort order of the data.

Clustering can improve the performance of certain types of queries, such asqueries that use filter clauses and queries that aggregate data. When data iswritten to a clustered table by a query job or a load job, BigQueryautomatically sorts the data using the values in the clustering columns. Thesevalues are used to organize the data into multiple blocks in BigQuery storage.When you submit a query containing a clause that filters data based on theclustering columns, BigQuery uses the sorted blocks to eliminate scans ofunnecessary data.

Similarly, when you submit a query that aggregates data based on the values inthe clustering columns, performance is improved because the sorted blockscolocate rows with similar values.

Use clustering in the following circumstances:

  • Compound sort keys are configured in your Amazon Redshift tables.
  • Filtering or aggregation is configured against particular columns inyour queries.

When you use clustering and partitioning together, your data can be partitioned by a date, timestamp, or integer column and then clustered on a different set of columns (up to four total clustered columns). In this case, data in each partition is clustered based on the values of the clustering columns.
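As an illustrative sketch (the names are placeholders), the following DDL partitions a table by a date column and clusters it on two additional columns:

```sql
-- Partition by day and cluster by customer and country so that queries
-- filtering on these columns scan fewer blocks.
CREATE TABLE `myproject.sales.order_events`
(
  order_id STRING,
  customer_id STRING,
  country_code STRING,
  event_date DATE,
  amount NUMERIC
)
PARTITION BY event_date
CLUSTER BY customer_id, country_code;
```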

When you specify sort keys in tables in Amazon Redshift, depending on the load on the system, Amazon Redshift automatically initiates the sort using your own cluster's compute capacity. You might even need to manually run the VACUUM command if you want to fully sort your table data as soon as possible, for example, after a large data load. BigQuery automatically handles this sorting for you and does not use your allocated BigQuery slots, so the sorting does not affect the performance of any of your queries.

For more information about working with clustered tables, see Introduction to clustered tables.

Distribution keys

Amazon Redshift uses distribution keys to optimize the location of data blocks to execute its queries. BigQuery does not use distribution keys because it automatically determines and adds stages in a query plan (while the query is running) to improve data distribution throughout query workers.

External sources

If you use Amazon Redshift Spectrum to query data on Amazon S3, you can similarly use BigQuery's external data source feature to query data directly from files on Cloud Storage.
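As a sketch (the bucket, dataset, and table names are assumptions), an external table over Cloud Storage can be defined with DDL and queried like any other table:

```sql
-- Define an external table over Parquet files in Cloud Storage.
CREATE EXTERNAL TABLE `myproject.analytics.clickstream_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/clickstream/*.parquet']
);

-- Query it directly, without loading the data into BigQuery storage.
SELECT COUNT(*) FROM `myproject.analytics.clickstream_ext`;
```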

In addition to querying data in Cloud Storage, BigQuery offers federated query functions for querying directly from other products, such as Cloud SQL and Spanner.

Data locality

You can create your BigQuery datasets in both regional and multi-regional locations, whereas Amazon Redshift only offers regional locations. BigQuery determines the location to run your load, query, or extract jobs based on the datasets referenced in the request. Refer to the BigQuery location considerations for tips on working with regional and multi-regional datasets.

Data type mapping in BigQuery

Amazon Redshift data types differ from BigQuery data types. For more details on BigQuery data types, refer to the official documentation.

BigQuery also supports data types, such as ARRAY and STRUCT for nested and repeated data, that don't have a direct Amazon Redshift analog.

SQL comparison

GoogleSQL supports compliance with the SQL 2011 standard and has extensions that support querying nested and repeated data. Amazon Redshift SQL is based on PostgreSQL but has several differences, which are detailed in the Amazon Redshift documentation. For a detailed comparison between Amazon Redshift and GoogleSQL syntax and functions, see the Amazon Redshift SQL translation guide.
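For example, GoogleSQL's support for nested and repeated data lets you query ARRAY and STRUCT columns directly, as in this sketch (the table and field names are assumptions):

```sql
-- Flatten a repeated STRUCT column of line items into one row per item.
SELECT
  o.order_id,
  item.sku,
  item.quantity
FROM `myproject.sales.orders` AS o,
  UNNEST(o.line_items) AS item
WHERE item.quantity > 1;
```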

You can use the batch SQL translator to convert scripts and other SQL code from your current platform to BigQuery.

Post-migration

Because the scripts you've migrated weren't designed with BigQuery in mind, consider implementing techniques for optimizing query performance in BigQuery. For more information, see Introduction to optimizing query performance.
