Build data products in a data mesh

Last reviewed 2024-09-03 UTC

To ensure that the use cases of data consumers are met, it's essential that data products in a data mesh are designed and built with care. The design of a data product starts with the definition of how data consumers would use that product, and how that product then gets exposed to consumers. Data products in a data mesh are built on top of a datastore (for example, a domain data warehouse or data lake). When you create data products in a data mesh, there are some key factors that we recommend you consider throughout this process. These considerations are described in this document.

This document is part of a series which describes how to implement a data mesh on Google Cloud. It assumes that you have read and are familiar with the concepts described in Architecture and functions in a data mesh and Build a modern, distributed Data Mesh with Google Cloud.

The series has the following parts:

When creating data products from a domain data warehouse, we recommend that data producers carefully design analytical (consumption) interfaces for those products. These consumption interfaces are a set of guarantees on data quality and operational parameters, along with a product support model and product documentation. The cost of changing consumption interfaces is usually high because both the data producer and potentially multiple data consumers need to change their consuming processes and applications. Given that the data consumers are most likely to be in organizational units that are separate from that of the data producers, coordinating the changes can be difficult.

The following sections provide background information on what you must consider when creating a domain warehouse, defining consumption interfaces, and exposing those interfaces to data consumers.

Create a domain data warehouse

There's no fundamental difference between building a standalone data warehouse and building a domain data warehouse from which the data producer team creates data products. The only real difference between the two is that the latter exposes a subset of its data through the consumption interfaces.

In many data warehouses, the raw data ingested from operational data sources goes through the process of enrichment and data quality verification (curation). In Dataplex Universal Catalog-managed data lakes, curated data typically is stored in designated curated zones. When curation is complete, a subset of the data should be ready for external-to-the-domain consumption through several types of interfaces. To define those consumption interfaces, an organization should provide a set of tools to domain teams who are new to adopting a data mesh approach. These tools let data producers create new data products on a self-service basis. For recommended practices, see Design a self-service data platform.

Additionally, data products must meet centrally defined data governance requirements. These requirements affect data quality, data availability, and lifecycle management. Because these requirements build the trust of data consumers in the data products and encourage data product usage, the benefits of implementing these requirements are worth the effort of supporting them.

Define consumption interfaces

We recommend that data producers use multiple types of interfaces, instead of defining just one or two. Each interface type in data analytics has advantages and disadvantages, and there's no single type of interface that excels at everything. When data producers assess the suitability of each interface type, they must consider the following:

  • Ability to perform the data processing needed.
  • Scalability to support current and future data consumer use cases.
  • Performance required by data consumers.
  • Cost of development and maintenance.
  • Cost of running the interface.
  • Support by the languages and tools that your organization uses.
  • Support for separation of storage and compute.

For example, if the business requirement is to be able to run analytical queries over a petabyte-size dataset, then the only practical interface is a BigQuery view. But if the requirements are to provide near real-time streaming data, then a Pub/Sub-based interface is more appropriate.

Many of these interfaces don't require you to copy or replicate existing data. Most of them also let you separate storage and compute, a critical feature of Google Cloud analytical tools. Consumers of data exposed through these interfaces process the data using the compute resources available to them. There's no need for data producers to do any additional infrastructure provisioning.

There's a wide variety of consumption interfaces. The following interfaces are the most common ones used in a data mesh and are discussed in the following sections:

The list of interfaces in this document is not exhaustive. There are also other options that you might consider for your consumption interfaces (for example, BigQuery sharing (formerly Analytics Hub)). However, these other interfaces are outside of the scope of this document.

Authorized views and functions

As much as possible, data products should be exposed through authorized views and authorized functions, including table-valued functions. Authorized datasets provide a convenient way to authorize several views automatically. Using authorized views prevents direct access to the base tables, and lets you optimize the underlying tables and queries against them, without affecting consumer use of these views. Consumers of this interface use SQL to query the data. The following diagram illustrates the use of authorized datasets as the consumption interface.

Diagram: authorized datasets used as the consumption interface.
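The following sketch shows one way a producer team might publish a view and authorize it on the private source dataset by using the BigQuery Python client. All project, dataset, and table names are hypothetical, and the snippet authorizes a single view; an authorized dataset would authorize every view in sales_v1 at once.

```python
from google.cloud import bigquery

# Hypothetical producer project and datasets.
client = bigquery.Client(project="domain-producer-project")

# 1. Create (or replace) the consumer-facing view in the product dataset.
client.query(
    """
    CREATE OR REPLACE VIEW `domain-producer-project.sales_v1.orders` AS
    SELECT order_id, customer_id, order_total
    FROM `domain-producer-project.sales_internal.orders_curated`
    """
).result()

# 2. Authorize the view on the private source dataset so that consumers who
#    query the view never need read access to sales_internal itself.
source = client.get_dataset("domain-producer-project.sales_internal")
entries = list(source.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": "domain-producer-project",
            "datasetId": "sales_v1",
            "tableId": "orders",
        },
    )
)
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```

Consumers are then granted read access on the sales_v1 dataset only, while the base tables stay private to the producer team.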

Authorized datasets and views help to enable easy versioning of interfaces. As shown in the following diagram, there are two primary versioning approaches that data producers can take:

Diagram: dataset versioning and view versioning.

The approaches can be summarized as follows:

  • Dataset versioning: In this approach, you version the dataset name. You don't version the views and functions inside the dataset. You keep the same names for the views and functions regardless of version. For example, the first version of a sales dataset is defined in a dataset named sales_v1 with two views, catalog and orders. For its second version, the sales dataset has been renamed sales_v2, and any previous views in the dataset keep their previous names but have new schemas. The second version of the dataset might also have new views added to it, or may remove any of the previous views.
  • View versioning: In this approach, the views inside the dataset are versioned instead of the dataset itself. For example, the sales dataset keeps the name of sales regardless of version. However, the names of the views inside the dataset change to reflect each new version of the view (such as catalog_v1, catalog_v2, orders_v1, orders_v2, and orders_v3).

The best versioning approach for your organization depends on your organization's policies and the number of views that are rendered obsolete with the update to the underlying data. Dataset versioning is best when a major product update is needed and most views must change. View versioning leads to fewer identically named views in different datasets, but can lead to ambiguities, for example, how to tell if a join between datasets works correctly. A hybrid approach can be a good compromise. In a hybrid approach, compatible schema changes are allowed within a single dataset, and incompatible changes require a new dataset.
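As an illustrative sketch of the two naming schemes (all project, dataset, and view names are hypothetical), the DDL statements might be run with the BigQuery Python client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="domain-producer-project")  # hypothetical project

# Dataset versioning: a breaking change ships as a new dataset (sales_v2)
# while the view names inside each dataset version stay the same.
client.create_dataset("domain-producer-project.sales_v2", exists_ok=True)
client.query(
    """
    CREATE OR REPLACE VIEW `domain-producer-project.sales_v2.orders` AS
    SELECT order_id, customer_id, order_total, currency  -- new column in v2
    FROM `domain-producer-project.sales_internal.orders_curated`
    """
).result()

# View versioning: the dataset keeps the name `sales`, and the breaking
# change is published as a new view name alongside the old one.
client.query(
    """
    CREATE OR REPLACE VIEW `domain-producer-project.sales.orders_v2` AS
    SELECT order_id, customer_id, order_total, currency
    FROM `domain-producer-project.sales_internal.orders_curated`
    """
).result()
```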

BigLake table considerations

Authorized views can be created not only on BigQuery tables, but also on BigLake tables. BigLake tables let consumers query the data stored in Cloud Storage by using the BigQuery SQL interface. BigLake tables support fine-grained access control without the need for data consumers to have read permissions for the underlying Cloud Storage bucket.

Data producers must consider the following for BigLake tables:

  • The design of the file formats and the data layout influences the performance of the queries. Column-based formats, for example, Parquet or ORC, generally perform much better for analytic queries than JSON or CSV formats.
  • A Hive partitioned layout lets you prune partitions and speeds up queries which use partitioning columns, as shown in the sketch after this list.
  • The number of files and the preferred file size for query performance must also be taken into account in the design stage.
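As a minimal sketch of these considerations, the following DDL creates a BigLake table over Hive-partitioned Parquet files. The connection name, bucket, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="domain-producer-project")  # hypothetical project

# Create a BigLake table over Hive-partitioned Parquet files in Cloud Storage.
client.query(
    """
    CREATE EXTERNAL TABLE `domain-producer-project.sales_internal.orders_lake`
    WITH PARTITION COLUMNS (order_date DATE)
    WITH CONNECTION `domain-producer-project.us.biglake-connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://sales-curated-zone/orders/*'],
      hive_partition_uri_prefix = 'gs://sales-curated-zone/orders'
    )
    """
).result()
```

Queries that filter on order_date can then prune partitions instead of scanning every file under the prefix.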

If queries using BigLake tables don't meet service-level agreement (SLA) requirements for the interface and can't be tuned, then we recommend the following actions:

  • For data that must be exposed to the data consumer, convert that data to BigQuery storage.
  • Redefine the authorized views to use the BigQuery tables.

Generally, this approach does not cause any disruption to the data consumers, or require any changes to their queries. The queries in BigQuery storage can be optimized using techniques that aren't possible with BigLake tables. For example, with BigQuery storage, consumers can query materialized views that have different partitioning and clustering than the base tables, and they can use the BigQuery BI Engine.

Direct read APIs

Although we don't generally recommend that data producers give data consumers direct read access to the base tables, it might occasionally be practical to allow such access for reasons such as performance and cost. In such cases, extra care should be taken to ensure that the table schema is stable.

There are two ways to directly access data in a typical warehouse. Data producers can either use the BigQuery Storage Read API, or the Cloud Storage JSON or XML APIs. The following diagram illustrates two examples of consumers using these APIs. One is a machine learning (ML) use case, and the other is a data processing job.

Diagram: an ML use case and a data processing job consuming data through direct read APIs.

Versioning a direct-read interface is complex. Typically, data producers must create another table with a different schema. They must also maintain two versions of the table, until all the data consumers of the deprecated version migrate to the new one. If the consumers can tolerate the disruption of rebuilding the table and switching to the new schema, then it's possible to avoid the data duplication. In cases where schema changes can be backward compatible, the migration of the base table can be avoided. For example, you don't have to migrate the base table if only new columns are added and the data in these columns is backfilled for all the rows.

The following is a summary of the differences between the Storage Read API and the Cloud Storage API. In general, whenever possible, we recommend that data producers use the BigQuery Storage Read API for analytical applications.

  • Storage Read API: The Storage Read API can be used to read data in BigQuery tables and to read BigLake tables. This API supports filtering and fine-grained access control, and can be a good option for stable data analytics or ML consumers. A minimal read sketch follows this list.

  • Cloud Storage API: Data producers might need to share a particular Cloud Storage bucket directly with data consumers. For example, data producers can share the bucket if data consumers can't use the SQL interface for some reason, or the bucket has data formats that aren't supported by the Storage Read API.
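The following sketch shows a consumer reading a shared base table through the Storage Read API with column selection and a row filter. The project, dataset, table, and field names are hypothetical, and reading Avro rows requires the fastavro package.

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Request only the needed columns and rows from the producer's base table.
requested_session = types.ReadSession(
    table="projects/domain-project/datasets/sales/tables/orders",  # hypothetical
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["order_id", "order_total"],
        row_restriction="order_date >= '2024-01-01'",
    ),
)
session = client.create_read_session(
    parent="projects/consumer-project",  # hypothetical consumer project
    read_session=requested_session,
    max_stream_count=1,
)

# Read the rows from the first (and only) stream in the session.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```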

In general, we don't recommend that data producers allow direct access through the Cloud Storage APIs, because direct access doesn't allow for filtering and fine-grained access control. However, the direct access approach can be a viable choice for stable, small-sized (gigabytes) datasets.

Allowing direct API access to the bucket gives data consumers an easy way to copy the data into their projects and process it there. In general, we don't recommend data copying if it can be avoided. Multiple copies of data increase storage cost, and add to the maintenance and lineage tracking overhead.
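If a producer team does decide to share a bucket directly, granting read access is a small IAM change. A minimal sketch, with hypothetical project, bucket, and group names:

```python
from google.cloud import storage

client = storage.Client(project="domain-producer-project")  # hypothetical project
bucket = client.bucket("sales-curated-exports")             # hypothetical bucket

# Grant the consumer group read-only access to the objects in the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:sales-data-consumers@example.com"},
    }
)
bucket.set_iam_policy(policy)
```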

Data as streams

A domain can expose streaming data by publishing that data to a Pub/Sub topic. Subscribers who want to consume the data create subscriptions to consume the messages published to that topic. Each subscriber receives and consumes data independently. The following diagram shows an example of such data streams.

Diagram: data streams published to a topic and consumed through independent subscriptions.

In the diagram, the ingest pipeline reads raw events, enriches (curates) them, and saves this curated data to the analytical datastore (BigQuery base table). At the same time, the pipeline publishes the enriched events to a dedicated topic. This topic is consumed by multiple subscribers, each of which might filter these events to get only the ones relevant to them. The pipeline also aggregates and publishes event statistics to its own topic to be processed by another data consumer.
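One way a consumer might implement such filtering is with a filtered Pub/Sub subscription, so that only matching messages are delivered. The project, topic, and attribute names below are hypothetical.

```python
from google.cloud import pubsub_v1

producer_project = "domain-producer-project"  # hypothetical producer project
consumer_project = "consumer-project"         # hypothetical consumer project

subscriber = pubsub_v1.SubscriberClient()
topic_path = f"projects/{producer_project}/topics/order-events-v1"
subscription_path = subscriber.subscription_path(
    consumer_project, "large-order-events"
)

# Deliver only messages whose attributes mark them as large orders; messages
# that don't match the filter are acknowledged automatically and never delivered.
with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "filter": 'attributes.order_tier = "large"',
        }
    )
```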

The following are example use cases for Pub/Sub subscriptions:

  • Enriched events, such as providing full customer profile information along with data on a particular customer order.
  • Close-to-real-time aggregation notifications, such as total order statistics for the last 15 minutes.
  • Business-level alerts, such as generating an alert if order volume dropped by 20% compared to a similar period on the previous day.
  • Data change notifications (similar in concept to change data capture notifications), such as when a particular order changes status.

The data format that data producers use for Pub/Sub messages affects costs and how these messages are processed. For high-volume streams in a data mesh architecture, Avro or Protobuf formats are good options. If data producers use these formats, they can assign schemas to Pub/Sub topics. The schemas help to ensure that the consumers receive well-formed messages.
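For example, a producer might register an Avro schema and attach it to the topic so that malformed messages are rejected at publish time. A minimal sketch, with hypothetical project, schema, and topic IDs:

```python
from google.cloud.pubsub import PublisherClient, SchemaServiceClient
from google.pubsub_v1.types import Encoding, Schema

project_id = "domain-producer-project"  # hypothetical project
schema_client = SchemaServiceClient()
schema_path = schema_client.schema_path(project_id, "order-events-schema")

# Register a small illustrative Avro record as the topic schema.
avsc = """
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "order_total", "type": "double"}
  ]
}
"""
schema_client.create_schema(
    request={
        "parent": f"projects/{project_id}",
        "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avsc),
        "schema_id": "order-events-schema",
    }
)

# Create the topic with the schema attached so that publishes are validated.
publisher = PublisherClient()
topic_path = publisher.topic_path(project_id, "order-events-v1")
publisher.create_topic(
    request={
        "name": topic_path,
        "schema_settings": {"schema": schema_path, "encoding": Encoding.JSON},
    }
)
```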

Because a streaming data structure can be constantly changing, versioning of this interface requires coordination between the data producers and the data consumers. There are several common approaches data producers can take, which are as follows:

  • A new topic is created every time the message structure changes. This topic often has an explicit Pub/Sub schema. Data consumers who need the new interface can start to consume the new data. The message version is implied by the name of the topic, for example, click_events_v1. Message formats are strongly typed. There's no variation on the message format between messages in the same topic. The disadvantage of this approach is that there might be data consumers who can't switch to the new subscription. In this case, the data producer must continue publishing events to all active topics for some time, and data consumers who subscribe to the topic must either deal with a gap in message flow, or de-duplicate the messages.
  • Data is always published to the same topic. However, the structure of the message can change. A Pub/Sub message attribute (separate from the payload) defines the version of the message, for example, v=1.0, as shown in the sketch after this list. This approach removes the need to deal with gaps or duplicates; however, all data consumers must be ready to receive messages of a new type. Data producers also can't use Pub/Sub topic schemas for this approach.
  • A hybrid approach. The message schema can have an arbitrary data section that can be used for new fields. This approach can provide a reasonable balance between having strongly typed data, and frequent and complex version changes.
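A sketch of the attribute-based approach, where the payload version travels as a message attribute (topic name and event fields are hypothetical):

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical producer project and topic.
topic_path = publisher.topic_path("domain-producer-project", "order-events")

event = {"order_id": "A-1001", "order_total": 99.5, "currency": "EUR"}

# The version attribute is separate from the payload, so consumers can
# branch on it before they parse the message body.
future = publisher.publish(
    topic_path,
    data=json.dumps(event).encode("utf-8"),
    v="2.0",
)
future.result()  # Block until the message is published.
```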

Data access API

Data producers can build a custom API to directly access the base tables in a data warehouse. Typically, these producers expose this custom API as a REST or a gRPC API, and deploy it on Cloud Run or a Kubernetes cluster. An API gateway like Apigee can provide additional features, such as traffic throttling or a caching layer. These features are useful when exposing the data access API to consumers outside of a Google Cloud organization. Potential candidates for a data access API are latency-sensitive, high-concurrency queries that return a relatively small result in a single API call and can be effectively cached.

Examples of such a custom API for data access include the following (a minimal endpoint sketch follows the list):

  • A combined view on the SLA metrics of the table or product.
  • The top 10 (potentially cached) records from a particular table.
  • A dataset of table statistics (total number of rows, or data distribution within key columns).
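As a minimal sketch of such an endpoint, the following Flask service serves a cached top-10 query over a product view. The project, dataset, route, and cache policy are hypothetical, and a production service would add authentication, error handling, and monitoring.

```python
import time

from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client(project="domain-producer-project")  # hypothetical project

# Very small in-process cache: one result set with a time-to-live.
_cache = {"rows": None, "expires": 0.0}
CACHE_TTL_SECONDS = 300


@app.get("/v1/orders/top10")
def top_orders():
    """Return the top 10 orders by total, cached for a few minutes."""
    now = time.time()
    if _cache["rows"] is None or now > _cache["expires"]:
        query = """
            SELECT order_id, customer_id, order_total
            FROM `domain-producer-project.sales_v1.orders`
            ORDER BY order_total DESC
            LIMIT 10
        """
        _cache["rows"] = [dict(row) for row in bq.query(query).result()]
        _cache["expires"] = now + CACHE_TTL_SECONDS
    return jsonify(_cache["rows"])


if __name__ == "__main__":
    app.run(port=8080)
```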

Any guidelines and governance that the organization has around building application APIs are also applicable to the custom APIs created by data producers. The organization's guidelines and governance should cover issues such as hosting, monitoring, access control, and versioning.

The disadvantage of a custom API is the fact that the data producers are responsible for any additional infrastructure that's required to host this interface, as well as custom API coding and maintenance. We recommend that data producers investigate other options before deciding to create custom data access APIs. For example, data producers can use BigQuery BI Engine to decrease response latency and increase concurrency.

Looker Blocks

For products such as Looker, which are heavily used business intelligence (BI) tools, it might be helpful to maintain a set of BI tool-specific widgets. Because the data producer team knows the underlying data model that is used in the domain, that team is best placed to create and maintain a prebuilt set of visualizations.

In the case of Looker, this visualization could be a set of Looker Blocks (prebuilt LookML data models). The Looker Blocks can be easily incorporated into dashboards hosted by consumers.

ML models

Because teams that work in data domains have a deep understanding and knowledge of their data, they are often the best teams to build and maintain ML models which are trained on the domain data. These ML models can be exposed through several different interfaces, including the following:

  • BigQuery ML models can be deployed in a dedicated dataset and shared with data consumers for BigQuery batch predictions, as shown in the sketch after this list.
  • BigQuery ML models can be exported into Vertex AI to be used for online predictions.
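A sketch of the batch prediction path, where a consumer scores its own table against a model shared in the producer's dedicated dataset (all project, dataset, and model names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="consumer-project")  # hypothetical consumer project

# Run batch predictions with a model the producer shared in a dedicated dataset.
query = """
    SELECT *
    FROM ML.PREDICT(
      MODEL `domain-producer-project.sales_models.order_churn_v1`,
      (SELECT * FROM `consumer-project.staging.customers_to_score`)
    )
"""
for row in client.query(query).result():
    print(dict(row))
```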

Data location considerations for consumption interfaces

An important consideration when data producers define consumption interfaces for data products is data location. In general, to minimize costs, data should be processed in the same region that it's stored in. This approach helps to prevent cross-region data egress charges. This approach also has the lowest data consumption latency. For these reasons, data stored in multi-regional BigQuery locations is usually the best candidate for exposing as a data product.

However, for performance reasons, data stored in Cloud Storage and exposed through BigLake tables or direct read APIs should be stored in regional buckets.

If data exposed in one product resides in one region and needs to be joined with data in another domain in another region, data consumers must consider the following limitations:

  • Cross-region queries that use BigQuery SQL are not supported. If the primary consumption method for the data is BigQuery SQL, all the tables in the query must be in the same location.
  • BigQuery flat-rate commitments are regional. If a project uses only a flat-rate commitment in one region but queries a data product in another region, on-demand pricing applies.
  • Data consumers can use direct read APIs to read data from another region. However, cross-regional network egress charges apply, and data consumers will most likely experience latency for large data transfers.

Data that's frequently accessed across regions can be replicated to those regions to reduce the cost and latency of queries incurred by the product consumers. For example, BigQuery datasets can be copied to other regions. However, data should only be copied when it's required. We recommend that data producers make only a subset of the available product data available in multiple regions when they copy data. This approach helps to minimize replication latency and cost. This approach can result in the need to provide multiple versions of the consumption interface with the data location region explicitly called out. For example, BigQuery authorized views can be exposed through naming such as sales_eu_v1 and sales_us_v1.

Data stream interfaces using Pub/Sub topics don't need any additional replication logic to consume messages in regions other than the region where the message is stored. However, additional cross-region egress charges apply in this case.

Expose consumption interfaces to data consumers

This section discusses how to make consumption interfaces discoverable by potential consumers. Data Catalog is a fully managed service that organizations can use to provide data discovery and metadata management services. Data producers must make the consumption interfaces of their data products searchable and annotate them with the appropriate metadata to enable product consumers to access them in a self-service manner.

The following sections discuss how each interface type is defined as a Data Catalog entry.

Caution: Data Catalog is deprecated in favor of Dataplex Universal Catalog. Dataplex Universal Catalog is also integrated with the same source systems, such as BigQuery, offering similar capabilities. It also allows creating custom entries.

BigQuery-based SQL interfaces

Technical metadata, such as a fully qualified table name or table schema, is automatically registered for authorized views, BigLake tables, and BigQuery tables that are available through the Storage Read API. We recommend that data producers also provide additional information in the data product documentation to help data consumers. For example, to help users find the product documentation for an entry, data producers can add a URL to one of the tags that has been applied to the entry (a minimal tagging sketch follows the list below). Producers can also provide the following:

  • Sets of clustered columns, which should be used in query filters.
  • Enumeration values for fields that have logical enumeration type, if the type is not provided as part of the field description.
  • Supported joins with other tables.
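The following sketch attaches a documentation URL to the catalog entry of an authorized view by using the Data Catalog client discussed in this section. It assumes a tag template named data_product_info with a string field documentation_url already exists; all names and the URL are hypothetical.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the entry that was created automatically for the authorized view.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/domain-producer-project"
            "/datasets/sales_v1/tables/orders"
        )
    }
)

# Attach a tag that points consumers at the product documentation.
tag = datacatalog_v1.Tag()
tag.template = (
    "projects/domain-producer-project/locations/us"
    "/tagTemplates/data_product_info"
)
tag.fields["documentation_url"] = datacatalog_v1.TagField()
tag.fields["documentation_url"].string_value = (
    "https://wiki.example.com/sales/orders-data-product"
)
client.create_tag(parent=entry.name, tag=tag)
```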

Data streams

Pub/Sub topics are automatically registered with Data Catalog. However, data producers must describe the schema in the data product documentation.

Cloud Storage API

Data Catalog supports the definition of Cloud Storage file entries and their schema. If a data lake fileset is managed by Dataplex Universal Catalog, the fileset is automatically registered in Data Catalog. Filesets that aren't associated with Dataplex Universal Catalog are added using a different approach.

Other interfaces

You can add other interfaces which don't have built-in support from Data Catalog by creating custom entries.

What's next

