Key concepts and features
Behavior and use cases
Datastream gives users the ability to bring source data from a relational database management system (RDBMS) and other sources into destinations such as BigQuery, BigLake Iceberg tables, and Cloud Storage in near real time. This supports downstream use cases such as loading the data into BigQuery for data warehousing and analytics, or running Spark jobs over the data for artificial intelligence and machine learning.
Concepts
This section describes the main concepts that you need to understand to use Datastream effectively.
Change data capture
Change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. CDC is also an approach to data integration that's based on the identification, capture, and delivery of the changes made to enterprise data sources.
Event sourcing
Introduced in 2005, event sourcing is a design pattern in which every change to the state of an application is captured in an event object. Using event sourcing, an application can rebuild its state, perform point-in-time recovery (by processing the events up to that point), recompute the state in case of a change in logic, or enable a Command Query Responsibility Segregation (CQRS) design. With the evolution of tools for real-time event processing, many applications are moving to the event sourcing model. Historically, transactional databases have always been event-oriented, because of atomicity, consistency, isolation, and durability (ACID) requirements.
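As an illustration of the pattern (a minimal sketch with hypothetical names, not Datastream code), the following rebuilds an application's state by replaying event objects, and supports point-in-time recovery by replaying only the events up to a given instant:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical event object: every state change is recorded, never overwritten.
@dataclass(frozen=True)
class Event:
    timestamp: datetime
    kind: str      # for example, "deposit" or "withdrawal"
    amount: int

def rebuild_balance(events: Iterable[Event], until: Optional[datetime] = None) -> int:
    """Recompute the current balance by replaying events in order.

    Passing `until` gives point-in-time recovery: only events up to that
    instant are applied.
    """
    balance = 0
    for event in events:
        if until is not None and event.timestamp > until:
            break
        if event.kind == "deposit":
            balance += event.amount
        elif event.kind == "withdrawal":
            balance -= event.amount
    return balance
```

Recomputing the state after a change in logic amounts to editing the replay function and replaying the same events.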
Transactional databases
In a transactional database, the set of operations that the database is going to perform is usually written to a write-ahead log (WAL) before any operations are executed on the storage engine. After an operation is executed on the storage engine and is committed to the WAL, the operation is considered to be successful. Using a WAL enables atomicity and durability, and also allows high-fidelity replication of the database. Some databases write to the log the exact operation that will happen on the storage level (for example, write 0x41 on location 20), so those actions can only be replicated (or redone) on the same storage engine. Other databases log a complete logical statement (or row) that can be re-executed on a different storage engine.
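The following is a toy sketch of the write-ahead idea (not how any particular database implements it): the logical operation is appended and flushed to the log before it is applied to the storage engine, so the log can later be replayed to restore or replicate the state.

```python
import json

class ToyWriteAheadLog:
    """Minimal write-ahead log: append the logical operation first,
    apply it to the in-memory 'storage engine' only afterwards."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.storage = {}  # stands in for the storage engine

    def set(self, key: str, value: str) -> None:
        record = {"op": "set", "key": key, "value": value}
        # 1. Write the intended operation to the log and flush it.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
        # 2. Only then apply it to the storage engine.
        self.storage[key] = value

    def replay(self) -> None:
        """Rebuild (or replicate) the state by re-executing the logged
        logical operations, as a log-based replica would."""
        self.storage.clear()
        with open(self.log_path) as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "set":
                    self.storage[record["key"]] = record["value"]
```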
Events and streams
Datastream ingests data from a variety of sources in near real time and makes the data available for consumption in the destination. The unit of data stored by Datastream is an event. A stream represents continuous ingestion of events from a source and writing them to a destination.
Unified types
Data sources have their own types, some specific to the database itself, and some that are generic and shared across databases. Because many different sources generate streams to a unified destination, there must be a standard, unified way to represent the original source type across all sources. The unified type is a common and lossless way to represent data types across all sources so that they can be consumed in a cohesive manner. The unified types supported by Datastream represent the superset of all normalized types across all supported source systems, so that every type is supported losslessly.
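The actual mapping is documented in the unified type mappings reference; the sketch below only illustrates the idea, with made-up type names, showing how source-specific types from different databases can normalize to one shared vocabulary.

```python
# Illustrative only: hypothetical mapping of source-specific types from two
# different databases to a single, shared ("unified") type vocabulary.
UNIFIED_TYPE_BY_SOURCE_TYPE = {
    ("mysql", "TINYINT"):   "INTEGER",
    ("mysql", "DATETIME"):  "DATETIME",
    ("mysql", "VARCHAR"):   "STRING",
    ("oracle", "NUMBER"):   "DECIMAL",
    ("oracle", "VARCHAR2"): "STRING",
    ("oracle", "DATE"):     "DATETIME",
}

def to_unified_type(source: str, source_type: str) -> str:
    """Resolve a source-specific type to the shared representation."""
    return UNIFIED_TYPE_BY_SOURCE_TYPE[(source.lower(), source_type.upper())]

# Both of these resolve to the same unified STRING type, so a consumer in the
# destination doesn't need to know which database produced the data.
assert to_unified_type("mysql", "varchar") == "STRING"
assert to_unified_type("oracle", "varchar2") == "STRING"
```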
Entity context
Datastream has five entities:
- Private connectivity configurations enable Datastream to communicate with data sources over a secure, private network connection. This communication happens through Virtual Private Cloud (VPC) peering.
- Connection profiles represent connectivity information to a specific source or destination database.
- Streams represent a source and destination connection profile pair, along with stream-specific settings.
- Objects represent a sub-portion of a stream. For example, a database stream has a data object for every table being streamed.
- Events represent every data manipulation language (DML) change for a given object.
After creating a private connectivity configuration, you can connect to sources hosted in Google Cloud or elsewhere over a private communication channel. Private connectivity is optional; Datastream also supports other modes of connectivity over public networks.
After creating a connection profile for a source and a destination, you can create streams that use the information stored in the connection profiles to transfer data from the source to the destination.
After creating a stream, Datastream connects to the source directly, consumes content, and then processes and writes the events to the destination based on the event structure.
Private connectivity configurations and connection profiles can be managed separately from streams for reuse.
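For example, with the google-cloud-datastream Python client library, this flow roughly corresponds to creating connection profiles and then a stream that references them. The sketch below is only an outline: the project, region, and profile names are hypothetical, and several required settings (connectivity options, include/exclude lists, destination dataset configuration) are omitted; see the API reference for the full request shapes.

```python
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/my-project/locations/us-central1"  # hypothetical project/region

# 1. Connection profile describing how to reach the MySQL source.
source_profile = datastream_v1.ConnectionProfile(
    display_name="my-mysql-source",
    mysql_profile=datastream_v1.MysqlProfile(
        hostname="10.0.0.5", port=3306, username="datastream", password="..."
    ),
)
client.create_connection_profile(
    parent=parent,
    connection_profile=source_profile,
    connection_profile_id="my-mysql-source",
).result()  # long-running operation

# 2. Stream pairing the source connection profile with a destination profile
#    (a BigQuery destination profile named "my-bq-destination" is assumed to exist).
stream = datastream_v1.Stream(
    display_name="mysql-to-bigquery",
    source_config=datastream_v1.SourceConfig(
        source_connection_profile=f"{parent}/connectionProfiles/my-mysql-source",
        mysql_source_config=datastream_v1.MysqlSourceConfig(),
    ),
    destination_config=datastream_v1.DestinationConfig(
        destination_connection_profile=f"{parent}/connectionProfiles/my-bq-destination",
        bigquery_destination_config=datastream_v1.BigQueryDestinationConfig(),
    ),
    backfill_all=datastream_v1.Stream.BackfillAllStrategy(),
)
client.create_stream(
    parent=parent, stream=stream, stream_id="mysql-to-bigquery"
).result()
```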
Features
Features for Datastream include:
- Serverless: You can configure a stream and the data starts moving. There are no installation, resource allocation, or maintenance overheads. As data volumes grow and shrink, Datastream's autoscaling capabilities allocate resources to keep data moving in near real-time, automatically.
- Unified Avro-based type schema: Datastream enables source-independent processing by converting all source-specific data types into a unified Datastream type schema, based on Avro types.
- Stream historical and CDC data: Datastream streams both historical and CDC source data in near real-time, simultaneously.
- Oracle CDC without additional licenses: Datastream provides LogMiner-based CDC streaming from any Oracle source version 11.2g and later, without the need to pay for additional licenses or software installations.
- BigQuery destination: Changes in the source are replicated continuously to BigQuery tables in near real-time. Data in BigQuery is almost immediately available for analytics.
- Cloud Storage destination: CDC data is written continually to self-describing Avro or JSON files in Cloud Storage. This information is consumable for additional processing, either directly in place or by loading downstream to another destination such as Spanner (see the sketch after this list).
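Because the Avro files are self-describing (the schema is embedded in each file), downstream consumers can read them without any extra schema management. The following is a minimal sketch using the google-cloud-storage and fastavro libraries, with a hypothetical bucket and object path:

```python
import io

import fastavro
from google.cloud import storage

# Hypothetical bucket and object path written by a Datastream stream.
BUCKET = "my-datastream-bucket"
OBJECT = "my-stream/my_table/2025/01/01/00/00/part-0001.avro"

client = storage.Client()
blob = client.bucket(BUCKET).blob(OBJECT)

# The Avro file embeds its own schema, so fastavro can decode it directly.
with io.BytesIO(blob.download_as_bytes()) as buffer:
    for record in fastavro.reader(buffer):
        # Each record is one change event, including its payload and metadata.
        print(record)
```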
Use cases
There are three main scenarios for using Datastream:
- Data integration: Data streams from databases and Software-as-a-Service (SaaS) cloud services can feed a near real-time data integration pipeline by loading data into BigQuery.
- Streaming analytics: Changes in databases are ingested into streaming pipelines, such as Dataflow pipelines, for fraud detection, security event processing, and anomaly detection.
- Near real-time availability of data changes: Data changes available in near real time power artificial intelligence and machine learning applications that aim to prevent churn or increase engagement, for example through marketing efforts or by feeding data back into production systems.
Behavior overview
Datastream enables customers to stream ongoing changes from multiple data sources directly into Google Cloud.
Sources
- There is setup work required for a source to be used with Datastream, including authentication and additional configuration options.
- Each source generates events that reflect all data manipulation language (DML) changes.
- Each stream can backfill historical data, as well as stream ongoing changes into the destination.
Destinations
Datastream supports BigQuery, BigLake Iceberg tables, and Cloud Storage as destinations. When the stream is created, its destination configuration is defined.
Event delivery
- The event order isn't guaranteed. Event metadata includes information that can be used to order the events.
- Event delivery occurs at least once. Event metadata includes data that can be used to remove any duplicate data in the destination (see the sketch below).
- The event size is limited to 20 MB per event for BigQuery destinations and 100 MB per event for Cloud Storage destinations.
To learn more about events, see Events and streams.
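Because delivery is at least once, consumers that need exactly-once semantics typically deduplicate on a key built from the event metadata. The sketch below is schematic: the metadata field names (`primary_key`, `change_sequence`) are placeholders, not the actual Datastream metadata fields.

```python
from typing import Dict, Iterable, List

def deduplicate(events: Iterable[dict]) -> List[dict]:
    """Keep only the latest copy of each change, keyed on fields taken from
    the event metadata. The field names below are placeholders; the real
    metadata fields are described in the Datastream events documentation."""
    latest_by_key: Dict[str, dict] = {}
    for event in events:
        metadata = event["metadata"]
        key = metadata["primary_key"]        # identifies the source row
        order = metadata["change_sequence"]  # orders changes to that row
        current = latest_by_key.get(key)
        if current is None or order > current["metadata"]["change_sequence"]:
            latest_by_key[key] = event
    return list(latest_by_key.values())
```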
High availability and disaster recovery
This section contains information about how Datastream handles scenarios associated with high availability and disaster recovery.
- High availability: Datastream is a regional service, running on multiple zones in each region. A single-zone failure in any one region doesn't impact the availability or quality of the service in other zones.
- Disaster recovery: If there's a failure in a region, then any streams running in that region are down for the duration of the outage. After the outage is resolved, Datastream continues exactly where it left off, and any data that hasn't been written to the destination is retrieved again from the source. In this case, duplicates of data may reside in the destination. See Event delivery for more information on removing the duplicate data.
Initial data and CDC data
Because data sources contain data that existed before the source was connected to a stream (historical data), Datastream generates events both from the historical data and from data changes happening in real time.
To ensure fast data access, the historical data and the real-time data changes are replicated simultaneously to the destination. The event metadata indicates whether an event comes from the backfill or from CDC.
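As a schematic example of acting on that distinction (the `is_backfill` flag below is a placeholder, not the actual metadata field name):

```python
def classify_event(event: dict) -> str:
    """Label an event as historical backfill or ongoing CDC, based on a
    placeholder metadata flag; consult the events documentation for the
    real field names."""
    if event["metadata"].get("is_backfill", False):
        return "backfill"  # historical row emitted during the initial load
    return "cdc"           # ongoing change captured from the source
```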
What's next
- Find out more about Datastream.
- Learn about unified types mappings.
- Learn more about sources that Datastream supports.
- Learn more about destinations that Datastream supports.
- Find out how to create private connectivity configurations, connection profiles, and streams.