Live data migration from MySQL
This document explains how to perform a live data migration from a source MySQL instance to Spanner with minimal downtime by using Terraform to deploy Dataflow and Datastream.
Once you perform the live data migration and are confident that all your data has been transferred, you've migrated your code and dependencies, and completed testing, you can switch your application to using Spanner instead of your source MySQL database.
You can perform a live data migration after creating your target Spanner database. You need to create a compatible schema on your target database before migrating data.
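For example, one way to create the target database with a compatible schema is with the gcloud CLI. The following is a minimal sketch; the Singers table is a hypothetical placeholder, and your actual DDL must match the tables you plan to migrate:

# Create the target Spanner database with a compatible schema (hypothetical example table).
gcloud spanner databases create SPANNER_DATABASE_ID \
  --instance=SPANNER_INSTANCE_ID \
  --ddl="CREATE TABLE Singers (SingerId INT64 NOT NULL, FirstName STRING(1024), LastName STRING(1024)) PRIMARY KEY (SingerId)"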
How it works
The live data migration consists of the following two phases:
Backfill migration:
- During backfill migration, Dataflow reads existing data from your source MySQL database and migrates the data to the target Spanner database. You need to use a bulk migration Dataflow template to move the data from your source MySQL instance to Spanner.
- When the backfill migration fails to write a row to Spanner, it writes that row to a dead-letter queue directory in a Cloud Storage bucket. You can have the backfill migration retry writing these rows to Spanner.
Change data capture (CDC) migration:
- This phase runs concurrently with the backfill migration, capturing changes occurring in the source MySQL instance in real time. These changes are then applied to Spanner after the backfill migration is complete.
- You need to use Datastream to capture changes occurring in your source MySQL instance in real time and write them to a Cloud Storage bucket.
- After the backfill migration is complete, you need to use Dataflow to move the CDC from the Cloud Storage bucket to Spanner. If Dataflow fails to write a row to Spanner for any reason, it writes that row to a dead-letter queue directory in a different Cloud Storage bucket. The CDC migration automatically retries writing the rows from the dead-letter queue directory to Spanner.
Plan for the live data migration
You need to configure the network infrastructure required for data to flow between your source MySQL instance, Datastream, Dataflow, Cloud Storage buckets, and the target Spanner database. We recommend configuring private network connectivity for a secure migration. Depending on your organization's compliance requirements, you might have to configure public or private network connectivity. For more information about Datastream connectivity, see Network connectivity options.
To plan for the live data migration, you might need your organization's network administrator to perform the following tasks (a sample gcloud sketch follows the list):
- Use the default VPC or create a new VPC in your project with the following requirements:
- The source MySQL instance must be available on this VPC. You might need to create an egress firewall rule on this VPC, and an ingress firewall rule on the VPC where the source MySQL instance is located.
- Datastream, Dataflow, Cloud Storage buckets, and the target Spanner database must be available on this VPC.
- You must create an allowlist on your source MySQL instance to allow connections from the VPC.
- Determine and allocate an IP address range in the VPC that Datastream can use.
- Create a subnetwork in the VPC for Dataflow to use to complete the backfill migration.
- Create a subnetwork in the VPC for Dataflow to use to complete the CDC migration later.
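The following gcloud sketch illustrates some of these tasks. It is not part of the migration templates; the firewall rule name, subnetwork name, IP range, and MySQL port are hypothetical placeholders that you must adapt to your own network design:

# Allow egress from the migration VPC to the source MySQL instance (hypothetical rule name and port).
gcloud compute firewall-rules create allow-egress-to-source-mysql \
  --network=VPC_NAME \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=tcp:3306 \
  --destination-ranges=MYSQL_HOST_IP_ADDRESS/32

# Create a subnetwork for the backfill Dataflow workers (hypothetical name and range).
gcloud compute networks subnets create dataflow-backfill-subnet \
  --network=VPC_NAME \
  --region=GCP_REGION \
  --range=10.128.10.0/24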
You can perform the live data migration by using the following steps:
- Set up CDC migration.
- Perform the backfill migration.
- Finish the CDC migration after the backfill migration is finished.
Performing the live data migration requires deploying and managing a significant number of resources. Spanner provides two sample Terraform templates, one for each phase of the live data migration.
The live migration template performs the CDC migration in two phases:
- Set up CDC migration to a Cloud Storage bucket using Datastream. You can use a Terraform variable to prevent the template from creating the Dataflow jobs.
- Migrate the CDC to Spanner from the Cloud Storage bucket using Dataflow. You must perform this phase only after the backfill migration Terraform template is finished with the backfill migration.
The backfill migration Terraform template performs the backfill migration from your source MySQL instance to Spanner.
Before you begin
- Ensure that Terraform is installed in your local shell.
- Create a service account to run the live data migration. For more information about creating a service account, see Create service accounts.
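For example, you can create the service account with the gcloud CLI. This is a minimal sketch; the account name live-migration-sa is a hypothetical placeholder:

# Create a service account to run the live data migration (hypothetical name).
gcloud iam service-accounts create live-migration-sa \
  --display-name="Live migration service account"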
To ensure that the service account has the necessary permissions to perform live migration, ask your administrator to grant the service account the following IAM roles on your project:
Important: You must grant these roles to the service account, not to your user account. Failure to grant the roles to the correct principal might result in permission errors.

- Dataflow Admin (roles/dataflow.admin)
- Datastream Admin (roles/datastream.admin)
- Security Admin (roles/iam.securityAdmin)
- Service Account Admin (roles/iam.serviceAccountAdmin)
- Pub/Sub Admin (roles/pubsub.admin)
- Storage Admin (roles/storage.admin)
- Compute Network Admin (roles/compute.networkAdmin)
- Viewer (roles/viewer)
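Your administrator might grant these roles with commands similar to the following sketch, which assumes the hypothetical service account name live-migration-sa; repeat the command for each role in the list:

# Grant a role to the migration service account (repeat for each required role).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:live-migration-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.admin"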
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to perform live migration. To see the exact permissions that are required, refer to the Required permissions section that follows:
Required permissions
The following permissions are required to perform live migration:
- compute.globalAddresses.create
- compute.globalAddresses.createInternal
- compute.globalAddresses.delete
- compute.globalAddresses.deleteInternal
- compute.globalAddresses.get
- compute.globalOperations.get
- compute.networks.addPeering
- compute.networks.get
- compute.networks.listPeeringRoutes
- compute.networks.removePeering
- compute.networks.use
- compute.routes.get
- compute.routes.list
- compute.subnetworks.get
- compute.subnetworks.list
- dataflow.jobs.cancel
- dataflow.jobs.create
- dataflow.jobs.updateContents
- datastream.connectionProfiles.create
- datastream.connectionProfiles.delete
- datastream.privateConnections.create
- datastream.privateConnections.delete
- datastream.streams.create
- datastream.streams.delete
- datastream.streams.update
- iam.roles.get
- iam.serviceAccounts.actAs
- pubsub.subscriptions.create
- pubsub.subscriptions.delete
- pubsub.topics.attachSubscription
- pubsub.topics.create
- pubsub.topics.delete
- pubsub.topics.getIamPolicy
- pubsub.topics.setIamPolicy
- resourcemanager.projects.setIamPolicy
- storage.buckets.create
- storage.buckets.delete
- storage.buckets.update
- storage.objects.delete
Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
Set up CDC migration
Spanner offers a Terraform template for setting up the CDC migration and, later, completing it. You can use a Terraform variable to prevent the template from creating the Dataflow jobs. The Terraform template deploys and manages the following resources to set up the CDC migration:
- Datastream private connection: a Datastream private connection deployed on your configured VPC.
- Source Datastream connection profile: a connection profile that lets Datastream connect to your source MySQL instance.
- Cloud Storage bucket: a Cloud Storage bucket that Datastream writes the data to.
- Target Datastream connection profile: this connection profile lets Datastream connect and write to the Cloud Storage bucket.
- Datastream stream: a Datastream stream that reads from your source MySQL instance and writes to the Cloud Storage bucket as defined in the connection profiles.
- Pub/Sub topic and subscription: the Cloud Storage bucket sends object notifications to the Pub/Sub topic, and Dataflow consumes the Pub/Sub subscription to write data to Spanner.
- Cloud Storage bucket notifications: a Cloud Storage bucket notification that publishes to the Pub/Sub topic.
Preparing the CDC Terraform configuration
You can prepare the Terraform template to include Dataflow variable configurations, but disable the creation of Dataflow jobs:
common_params = {
  project = "PROJECT_ID"
  region  = "GCP_REGION"
}
datastream_params = {
  mysql_host     = "MYSQL_HOST_IP_ADDRESS"
  mysql_username = "MYSQL_USERNAME"
  mysql_password = "MYSQL_PASSWORD"
  mysql_port     = 3306
  mysql_database = {
    database = "DATABASE_NAME"
  }
  private_connectivity = {
    vpc_name = "VPC_NAME"
    range    = "RESERVED_RANGE"
  }
}
dataflow_params = {
  skip_dataflow   = false
  enable_backfill = false
  template_params = {
    spanner_database_id = "SPANNER_DATABASE_ID"
    spanner_instance_id = "SPANNER_INSTANCE_ID"
  }
  runner_params = {
    max_workers = 10
    num_workers = 4
    on_delete   = "cancel"
    network     = "VPC_NETWORK"
    subnetwork  = "SUBNETWORK_NAME"
  }
}

The Terraform variables are described in the following list:
- project: the Google Cloud project ID.
- region: the Google Cloud region.
- mysql_host: your source MySQL instance IP address.
- mysql_username: your source MySQL instance username.
- mysql_password: your source MySQL instance password.
- mysql_port: the source MySQL instance port number.
- database: your source MySQL database name in the instance.
- vpc_name: the name of an existing VPC that's used by Datastream.
- range: the IP range on the VPC that you've reserved for Datastream to use.
- skip_dataflow: set this value to true to prevent the template from creating Dataflow jobs.
- enable_backfill: set this value to false to prevent the Terraform template from creating Dataflow jobs.
- spanner_database_id: the target Spanner database ID.
- spanner_instance_id: the target Spanner instance ID.
- max_workers: the maximum number of workers that Dataflow creates.
- num_workers: the initial number of workers that Dataflow creates.
- network: the name of an existing VPC to be used by Dataflow.
- subnetwork: the name of the designated subnetwork in the VPC where Dataflow can create workers.
Run the CDC Terraform template
To perform the CDC migration, you need to run the Terraform template:
Initialize Terraform by using the following command:
terraform init
Validate the Terraform files by using the following command:
terraform plan --var-file=terraform_simple.tfvars
Run the Terraform configuration using the following command:
terraform apply --var-file=terraform_simple.tfvars
The Terraform configuration produces output similar to the following:
Outputs:

resource_ids = {
  "datastream_source_connection_profile" = "source-mysql-thorough-wombat"
  "datastream_stream" = "mysql-stream-thorough-wombat"
  "datastream_target_connection_profile" = "target-gcs-thorough-wombat"
  "gcs_bucket" = "live-migration-thorough-wombat"
  "pubsub_subscription" = "live-migration-thorough-wombat-sub"
  "pubsub_topic" = "live-migration-thorough-wombat"
}
resource_urls = {
  "datastream_source_connection_profile" = "https://console.cloud.google.com/datastream/connection-profiles/locations/us-central1/instances/source-mysql-thorough-wombat?project=your-project-here"
  "datastream_stream" = "https://console.cloud.google.com/datastream/streams/locations/us-central1/instances/mysql-stream-thorough-wombat?project=your-project-here"
  "datastream_target_connection_profile" = "https://console.cloud.google.com/datastream/connection-profiles/locations/us-central1/instances/target-gcs-thorough-wombat?project=your-project-here"
  "gcs_bucket" = "https://console.cloud.google.com/storage/browser/live-migration-thorough-wombat?project=your-project-here"
  "pubsub_subscription" = "https://console.cloud.google.com/cloudpubsub/subscription/detail/live-migration-thorough-wombat-sub?project=your-project-here"
  "pubsub_topic" = "https://console.cloud.google.com/cloudpubsub/topic/detail/live-migration-thorough-wombat?project=your-project-here"
}
Datastream is now streaming the CDC to a Cloud Storage bucket. You must perform the backfill migration and finish the CDC migration later.
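To confirm that the stream created by the template is running, you can list the Datastream streams in the region, for example:

# List Datastream streams in the migration region and check the stream state.
gcloud datastream streams list --location=GCP_REGION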
Perform the backfill migration
Spanner offers a Terraform template to perform the backfill migration. The Terraform template deploys and manages the following resource:
- Dataflow job: the Dataflow job that reads from the source MySQL instance and writes to the target Spanner database.
Preparing the backfill migration Terraform configuration
job_name = "JOB_NAME" project = "PROJECT_ID" region = "GCP_REGION" working_directory_bucket = "WORKING_DIRECTORY_BUCKET" working_directory_prefix = "WORKING_DIRECTORY_PREFIX" source_config_url = "SOURCE_CONFIG_URL" username = "USERNAME" password = "PASSWORD" instance_id = "SPANNER_INSTANCE_ID" database_id = "SPANNER_DATABASE_ID" spanner_project_id = "SPANNER_PROJECT_ID"
The Terraform variables are described in the following list:
- job_name: the Dataflow job name.
- project: the Google Cloud project ID where the Dataflow job needs to run.
- region: the Google Cloud region.
- working_directory_bucket: the Cloud Storage bucket for uploading the session file and creating the output directory.
- working_directory_prefix: the Cloud Storage bucket prefix for the Dataflow working directory.
- source_config_url: your source MySQL instance IP address.
- username: your source MySQL instance username.
- password: your source MySQL instance password.
- instance_id: the target Spanner instance ID.
- database_id: the target Spanner database ID.
- spanner_project_id: the project ID where your Spanner instance is. This project ID can be different from the project you're running Dataflow on.
Run the backfill migration Terraform template
To perform the backfill migration, do the following:
Initialize Terraform by using the following command:
terraform init
Validate the Terraform files by using the following command:
terraform plan --var-file=terraform_simple.tfvars
Run the Terraform configuration using the following command:
terraform apply --var-file=terraform_simple.tfvars
The Terraform configuration produces an output similar to the following:
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

dataflow_job_id = [
  "2024-06-05_00_41_11-4759981257849547781",
]
dataflow_job_url = [
  "https://console.cloud.google.com/dataflow/jobs/gcp-region/2024-06-05_00_41_11-4759981257849547781",
]
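You can monitor the backfill job by using the job ID from the Terraform output. For example, assuming the gcloud CLI is configured for the same project, where DATAFLOW_JOB_ID is a placeholder for the ID in the output:

# Show the status of the backfill Dataflow job.
gcloud dataflow jobs show DATAFLOW_JOB_ID --region=GCP_REGION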
When the backfill migration cannot write a row to Spanner, it writes that row to a dead-letter queue directory in a Cloud Storage bucket.
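If you want to inspect the failed rows before retrying, you can list the contents of the dead-letter queue directory. This is a sketch; the bucket name and directory path are placeholders for your deployment:

# List the contents of the dead-letter queue directory (placeholder bucket and path).
gcloud storage ls --recursive gs://WORKING_DIRECTORY_BUCKET/DLQ_DIRECTORY/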
To retry writing these rows from the dead-letter queue directory to Spanner before finishing the CDC migration, run the following command:
gcloud dataflow flex-template run JOB_NAME \
  --region=GCP_REGION \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner \
  --additional-experiments=use_runner_v2 \
  --parameters inputFilePattern=INPUT_FILE_PATTERN,streamName="ignore",datastreamSourceType=SOURCE_TYPE,instanceId=INSTANCE_ID,databaseId=DATABASE_ID,sessionFilePath=SESSION_FILE_PATH,deadLetterQueueDirectory=DLQ_DIRECTORY,runMode="retryDLQ"
The gcloud CLI command variables are described in the following list:
- job_name: the Dataflow job name.
- region: the Google Cloud region.
- inputFilePattern: the Cloud Storage bucket location of the input file pattern.
- datastreamSourceType: the source type, for example, MySQL.
- instanceId: the target Spanner instance ID.
- databaseId: the target Spanner database ID.
- sessionFilePath: the Cloud Storage bucket path to the session file.
- deadLetterQueueDirectory: the Cloud Storage bucket path to the DLQ directory.
Finish the CDC migration
After the backfill migration is complete, you can use Dataflow to migrate the CDC to Spanner. The Dataflow job takes the change events from the Cloud Storage bucket and writes them to Spanner.
After almost all the data from the Cloud Storage bucket is written to Spanner, stop writes on the source MySQL instance to allow the remaining changes to be written to Spanner.
This causes a short downtime while Spanner catches up to the source MySQL instance. After all the changes are written to Spanner, your application can start using Spanner as its database.
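One possible way to stop writes on the source MySQL instance is to put it into read-only mode. The following sketch assumes a MySQL user with privileges to set global variables and that read-only mode fits your cutover procedure; coordinate this step with your application and database teams:

# Put the source instance into read-only mode so the remaining changes can drain to Spanner.
mysql -h MYSQL_HOST_IP_ADDRESS -u MYSQL_USERNAME -p \
  -e "SET GLOBAL read_only = ON; SET GLOBAL super_read_only = ON;"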
To finish the CDC migration, change the value of the skip_dataflow Terraform parameter to false and rerun the live migration Terraform template.
Run the Terraform configuration using the following command:
terraform apply --var-file=terraform_simple.tfvars
What's next