Stream data from MySQL databases
This section contains information about:
- How Datastream handles data that's pulled from a source MySQL database
- The versions of MySQL database that Datastream supports
- Known limitations for using MySQL database as a source
- An overview of how to set up a source MySQL database so that data can be streamed from it to a destination
Behavior
This section describes the behavior of MySQL sources when you replicate data using Datastream. When you ingest data from MySQL databases, you can use binlog-based replication or global transaction identifier (GTID)-based replication. You select your CDC method when you create a stream.
Binlog-based replication
Datastream can use binary log files to keep a record of data changes in MySQL databases. The information contained in these log files is then replicated to the destination to reproduce the changes made on the source.
The key characteristics of binlog-based replication in Datastream are:
- All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
- All historical data is replicated.
- All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
- Only committed changes are replicated.
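As a point of reference, binlog-based CDC relies on the binary log being enabled with row-based logging. The following diagnostic sketch checks the relevant standard MySQL server variables (managed providers may expose these as configuration flags instead; see Configure a source MySQL database for the authoritative prerequisites):

```sql
-- Confirm that the binary log is enabled and uses row-based logging.
SHOW VARIABLES LIKE 'log_bin';          -- ON means the binlog is enabled
SHOW VARIABLES LIKE 'binlog_format';    -- row-based logging is ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- FULL logs complete row images

-- List the current binlog files and their sizes.
SHOW BINARY LOGS;
```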
Global transaction identifier (GTID)-based replication
Use GTID-based replication for seamless failover support on high availability setups.
Datastream also supports global transaction identifier (GTID)-based replication.
Global transaction identifier (GTID) is a unique identifier created and associated with each transaction committed on a MySQL source. This identifier is unique not only to the source on which it originated, but also across all servers in a given replication topology, as opposed to binary log-based replication, where each node in the database cluster maintains its own binlog files, with its own numbering. Maintaining separate binlog files and numbering might become an issue in the event of a failure or planned downtime, because the binlog continuity is broken and the binlog-based replication fails.
GTID-based replication supports failovers and self-managed database clusters, and continues to work irrespective of changes in the database cluster.
The key characteristics of GTID-based replication in Datastream are:
- All databases or specific databases from a given MySQL source, as well as all tables from the databases or specific tables, can be selected.
- All historical data is replicated.
- All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
- Only committed changes are replicated.
- Seamless support for failovers.
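To see whether a source already runs with GTIDs, you can query the standard MySQL GTID variables. This is a diagnostic sketch only; the authoritative prerequisites are in Configure a source MySQL database:

```sql
-- Check whether GTIDs are enabled and GTID consistency is enforced.
SELECT @@global.gtid_mode, @@global.enforce_gtid_consistency;

-- Inspect the set of transactions this server has already executed.
-- After a failover, replication can resume from this set on the new primary.
SELECT @@global.gtid_executed;
```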
Switch from binlog-based to GTID-based replication
If you want to update your stream and switch from binlog-based to GTID-based replication without the need to do a backfill, perform the following steps:
Note: These steps require database downtime. Similar steps might also be useful when you want to upgrade the major version of your MySQL source.
- Ensure that all requirements for GTID-based replication are satisfied. For more information, see Configure a source MySQL database.
- Optionally, create and run a test GTID-based stream. For more information, see Create a stream.
- Create a GTID-based stream. Don't start it yet.
- Stop application traffic to the source database.
- Pause the existing binlog-based stream. For more information, see Pause the stream.
- Wait for a few minutes to ensure that Datastream has caught up with the database. You can check this using the metrics in the Monitoring tab, on the Stream details page for your stream. The values for Data freshness and Throughput need to be 0.
- Start the GTID-based stream. For more information, see Start the stream.
- Resume traffic to the source database.
If performing a backfill isn't an issue, you can truncate your tables in BigQuery, delete the old stream, and start a new one with backfill. For more information about managing backfill, see Manage backfill for the objects of a stream.
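One hedged way to verify that the source has gone quiet after you stop application traffic is to sample the binlog coordinates twice, a few seconds apart:

```sql
-- Run twice, a few seconds apart, after stopping application traffic.
-- If File, Position, and Executed_Gtid_Set are unchanged between runs,
-- no new transactions are reaching the binary log.
SHOW MASTER STATUS;
```

On MySQL 8.4, this statement has been replaced by `SHOW BINARY LOG STATUS`.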
Versions
Datastream supports the following versions of MySQL database:
- MySQL 5.6
- MySQL 5.7
- MySQL 8.0
- MySQL 8.4 (supported only for GTID-based replication)
Global transaction identifier (GTID)-based replication is only supported for MySQL versions 5.7 and later.
Datastream supports the following types of MySQL database:
- Self-hosted MySQL
- Cloud SQL for MySQL (Cloud SQL for MySQL Enterprise Plus sources are supported when using GTID-based replication)
- Amazon RDS for MySQL
- Amazon Aurora MySQL
- MariaDB
- Alibaba Cloud PolarDB
- Percona Server for MySQL
Best practices
This section describes recommended best practices for configuring your MySQL source for use with Datastream.
Use GTID for high availability setups
If your production MySQL source uses replicas or any other high availability configuration, use GTID-based replication.
Binlog file and position-based replication can break during a database failover because when the primary fails, the new primary has a different binlog history. In such a case, Datastream loses its position and it can't resume.
GTID assigns a unique ID to every transaction across your entire replication topology (primary and replicas). After a failover, Datastream can resume from the last GTID logged on the new primary, without needing to know the binlog file or the position.
Recommendation: for any production MySQL source with a replica or high availability configuration, using the GTID CDC method is mandatory for resilient and reliable data replication.
Properly size your read replica
If you configure Datastream to replicate from a read replica, you can encounter a double lag, which is a combination of MySQL replication lag (from primary to replica) and Datastream replication lag (from replica to destination). Read replicas are often provisioned with fewer resources (CPU, RAM, IOPS) than primaries to save costs, which can cause them to lag behind the primary during high-write periods.
Recommendation: when using a read replica as a source for Datastream, provision it with resources comparable to the primary, so that the replica can keep up with the primary's write throughput.
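To quantify the first half of that double lag, you can check replication health directly on the replica. The syntax below applies to MySQL 8.0.22 and later; older versions use `SHOW SLAVE STATUS` with `Seconds_Behind_Master` instead:

```sql
-- On the read replica: report replication health and lag.
SHOW REPLICA STATUS\G
-- Key fields to watch in the output:
--   Seconds_Behind_Source  -- replication lag behind the primary, in seconds
--   Replica_IO_Running     -- should be "Yes"
--   Replica_SQL_Running    -- should be "Yes"
```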
Increase throughput for the binlog CDC method
If you are using binlog-based replication and experience high latency due to large source write volumes generating binlog files faster than a single task can process, increase throughput by tuning the `maxConcurrentCdcTasks` parameter. This parameter controls the number of CDC tasks a stream runs in parallel. Increasing the value for this parameter allows Datastream to process more binlog files concurrently.
Recommendation: To determine the appropriate value for data freshness, monitor your MySQL server's binlog generation rate during peak hours. You can do this by observing the rate at which new binlog files are created and rotated in the MySQL data directory, or by using MySQL monitoring tools to track the growth of binary logs. If, for example, your source generates 10 binlog files per minute during peak times, setting `maxConcurrentCdcTasks` to a value like 10-15 allows Datastream to process these files in parallel, preventing a backlog.
You can increase `maxConcurrentCdcTasks` up to the maximum supported value of 50, provided that the load on the source database remains under control. For more information, see Stream concurrency controls.
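A simple way to estimate the binlog generation rate described above is to snapshot the binlog file list twice during a peak-hour window and compare. This uses only plain MySQL statements; no Datastream-specific tooling is assumed:

```sql
-- Run at the start and end of a peak-hour window; the number of new
-- entries divided by the window length gives files generated per minute.
SHOW BINARY LOGS;

-- The rotation threshold controls how quickly files fill up (default 1 GB).
SHOW VARIABLES LIKE 'max_binlog_size';
```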
Correctly size the `max_allowed_packet` parameter
The default `max_allowed_packet` setting in MySQL (for example, 16MB-64MB) might be too small. If a single row with large `BLOB`, `JSON`, or `TEXT` type fields, or a single large transaction, exceeds this size, MySQL terminates the Datastream connection, causing the stream to fail with errors such as `Packet for query is too large` or `Got a packet bigger than 'max_allowed_packet' bytes`.
Recommendation: set the `max_allowed_packet` parameter on your MySQL server to its maximum allowed value of 1G. This ensures that the server can handle any large row or transaction that Datastream needs to read from the binlog.
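A sketch of the check and the change on a self-managed server follows. On managed services such as Cloud SQL or Amazon RDS, set the equivalent database flag through the provider's console instead:

```sql
-- Check the current value, in bytes.
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Raise it to the 1 GB maximum for the running server.
-- Existing client sessions must reconnect to pick up the new value.
SET GLOBAL max_allowed_packet = 1073741824;
```

`SET GLOBAL` doesn't survive a server restart; to persist the change, also set `max_allowed_packet = 1G` in the `[mysqld]` section of the option file.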
Known limitations
Known limitations for using MySQL database as a source include:
- Streams are limited to 10,000 tables.
- Tables that have a primary key defined as `INVISIBLE` can't be backfilled.
- A table that has more than 500 million rows can't be backfilled unless the following conditions are met:
- The table has a unique index.
- None of the columns of the index are nullable.
- The index isn't descending.
- All columns of the index are included in the stream.
- Datastream periodically fetches the latest schema from the source as events are processed. If a schema changes, Datastream detects the schema change and triggers a schema fetch. However, some events might get processed incorrectly or get dropped between the schema fetches, which can cause data discrepancies.
- Not all changes to the source schema can be detected automatically, in which case data corruption may occur. The following schema changes may cause data corruption or failure to process the events downstream:
- Dropping columns
- Adding columns to the middle of a table
- Changing the data type of a column
- Reordering columns
- Dropping tables (relevant if the same table is then recreated with new data added)
- Truncating tables
- Datastream doesn't support replicating views.
- Datastream doesn't support columns of spatial data types. The values in these columns are replaced with `NULL` values.
- Datastream doesn't support the zero value (`0000-00-00 00:00:00`) in columns of the `DATETIME`, `DATE`, or `TIMESTAMP` data types. The zero value is replaced with the `NULL` value.
- Datastream doesn't support replicating rows which include the following values in `JSON` columns: `DECIMAL`, `NEWDECIMAL`, `TIME`, `TIME2`, `DATETIME`, `DATETIME2`, `DATE`, `TIMESTAMP`, or `TIMESTAMP2`. Events containing such values are discarded.
- Datastream doesn't support binary log transaction compression.
- Datastream doesn't support SSL certificate chains in the source MySQL connection profiles. Only single, x509 PEM-encoded certificates are supported.
- Datastream doesn't support cascading deletes. Such events aren't written to the binary log, and as a result, aren't propagated to the destination.
- Datastream doesn't support `DROP PARTITION` operations. Such operations are metadata-only operations and aren't replicated. Other events aren't affected and the stream runs successfully.
- You might experience connectivity issues when replicating `FEDERATED` tables. If that happens, remove all `FEDERATED` tables from the source database configuration and increase the values for the `connect_timeout`, `net_read_timeout`, and `max_allowed_packet` parameters to mitigate timeout issues during backfill.
- Cloud SQL Enterprise Plus instances must use GTID-based replication because they are subject to near-zero downtime maintenance. Binary log-based replication breaks on failovers, hence we recommend using GTID-based replication for high availability use cases.
- For MySQL versions 8.0 and later, the `binlog_row_value_options` variable must be set to an empty value. This is the default for most versions, but for some, for example MySQL sources on Oracle Cloud Infrastructure (OCI), you must set it explicitly. For more information, see Configure a self-managed MySQL database.
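For the `binlog_row_value_options` requirement, a quick check and fix on a self-managed server might look like the following; on managed platforms, set the flag through the provider's configuration instead:

```sql
-- An empty Value in the result is what Datastream expects.
SHOW VARIABLES LIKE 'binlog_row_value_options';

-- Clear the setting if it is PARTIAL_JSON; persist it in the
-- server option file as well so it survives a restart.
SET GLOBAL binlog_row_value_options = '';
```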
Additional limitations for GTID-based replication
- Recovering streams that use GTID-based replication is only available when using the Datastream API.
- Creating tables from other tables using `CREATE TABLE ... SELECT` statements isn't supported.
- Datastream doesn't support tagged GTIDs.
- For MySQL restrictions that apply to GTID-based replication, see the MySQL documentation.
What's next
- Learn how to configure a MySQL source for use with Datastream.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.