Stream data from MySQL databases

This section contains information about:

  • The behavior of how Datastream handles data that's being pulled from a source MySQL database
  • The versions of MySQL database that Datastream supports
  • Known limitations for using MySQL database as a source
  • An overview of how to setup a source MySQL database so that data can be streamed from it to a destination

Behavior

This section describes the behavior of MySQL sources when you replicate datausing Datastream. When you ingest data from MySQL databases, you canuse binlog-based replication or global transaction identifier (GTID)-basedreplication. You select your CDC method when youcreate a stream.

Binlog-based replication

Datastream can usebinary log files tokeep a record of data changes in MySQL databases. The information contained inthese log files is then replicated to the destination to reproduce the changesmade on the source.

The key characteristics of binlog-based replication in Datastream are:

  • All databases or specific databases from a given MySQL source, as well as alltables from the databases or specific tables, can be selected.
  • All historical data is replicated.
  • All data manipulation language (DML) changes, such as inserts, updates, anddeletes from the specified databases and tables, are replicated.
  • Only committed changes are replicated.

Global transaction identifier (GTID)-based replication

Use GTID-based replication for seamless failover support on high availability setups.

Datastream also supports global identifier (GTID)-based replication.

Global transaction identifier (GTID) is a unique identifier created andassociated with each transaction committed on a MySQL source. This identifier isunique not only to the source on which it originated, but also across all serversin a given replication topology, as opposed to the binary log-basedreplication where each node in the database cluster maintains its own binlogfiles, with its own numbering. Maintaining separate binlog files and numberingmight become an issue in the event of a failure or planned downtime, because thebinlog continuity is broken and the binlog-based replication fails.

GTID-based replication supports failovers, self-managed database clusters, andcontinues to work irrespective of changes in the database cluster.

The key characteristics of GTID-based replication in Datastream are:

Switch from binlog-based to GTID-based replication

If you want to update your stream and switch from binlog-based to GTID-basedreplication without the need to do a backfill, perform the following steps:

Note: These steps require database downtime. Similar steps might also be useful when you want to upgrade the major version of your MySQL source.
  1. Ensure that all requirements for GTID-based replication are satisfied. Formore information, seeConfigure a source MySQL database.
  2. Optionally, create and run atest GTID-based stream. For more information,seeCreate a stream.
  3. Create a GTID-based stream. Don't start it yet.
  4. Stop application traffic to the source database.
  5. Pause the existing binlog-based stream. For more information, seePause the stream.
  6. Wait for a few minutes to ensure that Datastream has caught up withthe database. You can check this using the metrics in theMonitoring tab, ontheStream details page for your stream. The values forData freshness andThroughput need to be0.
  7. Start the GTID-based stream. For more information, seeStart the stream.
  8. Resume traffic to the source database.

If performing a backfill isn't an issue, you can truncate your tables inBigQuery, delete the old stream, and start a new one with backfill. Formore information about managing backfill, seeManage backfill for the objects of a stream.

Versions

Datastream supports the following versions of MySQL database:

Datastream supports the following types of MySQL database:

Best practices

This section describes recommended best practices for configuring your MySQLsource for use with Datastream.

Use GTID for high availability setups

If your production MySQL source uses replicas or any other high availabilityconfiguration, use GTID-based replication.

Binlog file and position-based replication can break during a database failoverbecause when the primary fails, the new primary has a different binlog history.In such a case, Datastream loses its position and it can't resume.

GTID assigns a unique ID to every transaction across your entire replicationtopology (primary and replicas). After a failover, Datastream canresume from the last GTID logged on the new primary, without needing to knowthe binlog file or the position.

Recommendation: for any production MySQL source with a replica or highavailability configuration, using the GTID CDC method is mandatory forresilient and reliable data replication.

Properly size your read replica

If you configure Datastream to replicate from a read replica, you canencounter adouble lag, which is a combination of MySQL replication lag (fromprimary to replica) and Datastream replication lag (from replica todestination). Read replicas are often provisioned with fewer resources (CPU, RAM,IOPS) than primaries to save costs, which can cause them to lag behind theprimary during high-write periods.

Recommendation: when using a read replica as a source for Datastream,provision it with resources comparable to the primary, so that the replica cankeep up with the primary's write throughput.

Increase throughput for the binlog CDC method

If you are using binlog-based replication and experience high latency due tolarge source write volumes generating binlog files faster than a single task canprocess, increase throughput by tuning themaxConcurrentCdcTasks parameter.This parameter controls the number of CDC tasks a stream runs inparallel. Increasing the value for this parameter allows Datastream toprocess more binlog files concurrently.

Recommendation: To determine the appropriate value for data freshness, monitoryour MySQL server's binlog generation rate during peak hours. You can do this byobserving the rate at which new binlog files are created and rotated in theMySQL data directory, or by using MySQL monitoring tools to track the growth ofbinary logs. If, for example, your source generates 10 binlog files per minuteduring peak times, settingmaxConcurrentCdcTasks to a value like10-15allows Datastream to process these files in parallel, preventing a backlog.

You can increasemaxConcurrentCdcTasks up to the maximum supported value of50, provided that the load on the source database remains under control.For more information, seeStream concurrency controls.

Correctly size themax_allowed_packet parameter

The defaultmax_allowed_packet setting in MySQL (for example, 16MB-64MB)might be too small. If a single row with largeBLOB,JSON, orTEXT typefields, or a single large transaction exceeds this size, MySQL terminates theDatastream connection, causing the stream to fail with errors such asPacket for query is too large orGot a packet bigger than'max_allowed_packet' bytes.

Recommendation: set themax_allowed_packet parameter on your MySQL serverto its maximum allowed value of 1G. This ensures that the server can handle anylarge row or transaction that Datastream needs to read from the binlog.

Known limitations

Known limitations for using MySQL database as a source include:

  • Streams are limited to 10,000 tables.
  • Tables that have a primary key defined asINVISIBLE can't be backfilled.
  • A table that has more than 500 million rows can't be backfilled unless the following conditions are met:
    1. The table has a unique index.
    2. None of the columns of the index are nullable.
    3. The index isn'tdescending.
    4. All columns of the index are included in the stream.
  • Datastream periodically fetches the latest schema from the source as events are processed. If a schema changes, Datastream detects the schema change and triggers a schema fetch. However, some events might get processed incorrectly or get dropped between the schema fetches, which can cause data discrepancies.
  • Not all changes to the source schema can be detected automatically, in which case data corruption may occur. The following schema changes may cause data corruption or failure to process the events downstream:
    • Dropping columns
    • Adding columns to the middle of a table
    • Changing the data type of a column
    • Reordering columns
    • Dropping tables (relevant if the same table is then recreated with new data added)
    • Truncating tables
  • Datastream doesn't support replicating views.
  • Datastream doesn't support columns ofspatial data types. The values in these columns are replaced withNULL values.
  • Datastream doesn't support the zero value (0000-00-00 00:00:00) in columns of theDATETIME,DATE, orTIMESTAMP data types. The zero value is replaced with theNULL value.
  • Datastream doesn't support replicating rows which include the following values inJSON columns:DECIMAL,NEWDECIMAL,TIME,TIME2DATETIME,DATETIME2,DATE,TIMESTAMP orTIMESTAMP2. Events containing such values are discarded.
  • Datastream doesn't supportbinary log transaction compression.
  • Datastream doesn't support SSL certificate chains in the source MySQL connection profiles. Only single, x509 PEM-encoded certificates are supported.
  • Datastream doesn't support cascading deletes. Such events aren't written to the binary log, and as a result, aren't propagated to the destination.
  • Datastream doesn't supportDROP PARTITION operations. Such operations are metadata only operations and aren't replicated. Other events aren't affected and the stream runs successfully.
  • You might experience connectivity issues when replicatingFEDERATED tables. If that happens, remove allFEDERATED tables from the source database configuration and increase the values forconnect_timeout,net_read_timeout andmax_allowed_packet parameters to mitigate timeout issues during backfill.
  • Cloud SQL Enterprise Plus instances must use GTID-based replication because they are subject tonear-zero downtime maintenance. Binary log-based replication breaks on failovers, hence we recommend using GTID-based replication for high availability use cases.
  • For MySQL versions 8.0 and later, thebinlog_row_value_options variable must be set to an empty value. This is the default for most versions, but for some, for example MySQL sources on Oracle Cloud Infrastructure (OCI), you must set it explicitly. For more information, seeConfigure a self-managed MySQL database.

Additional limitations for the GTID-based replication

  • Recovering streams that use GTID-based replication is only available when usingthe Datastream API.
  • Creating tables from other tables using theCREATE TABLE ... SELECT statementsisn't supported.
  • Datastream doesn't support tagged GTIDs.
  • For MySQL restrictions that apply to GTID-based replication, seeMySQL documentation.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.