Deploy an enterprise data management and analytics platform

An enterprise data management and analytics platform provides an enclave where you can store, analyze, and manipulate sensitive information while maintaining security controls. You can use the enterprise data mesh architecture to deploy a platform on Google Cloud for data management and analytics. The architecture is designed to work in a hybrid environment, where Google Cloud components interact with your existing on-premises components and operating processes.

The enterprise data mesh architecture includes the following:

  • A GitHub repository that contains a set of Terraform configurations, scripts, and code to build the following:
    • A governance project that lets you use Google's implementation of the Cloud Data Management Capabilities (CDMC) Key Controls Framework.
    • A data platform example that supports interactive and production workflows.
    • A producer environment within the data platform that supports multiple data domains. Data domains are logical groupings of data elements.
    • A consumer environment within the data platform that supports multiple consumer projects.
    • A data transfer service that uses Workload Identity Federation and the Tink encryption library to help you transfer data into Google Cloud securely.
    • A data domain example that contains ingestion, non-confidential, and confidential projects.
    • An example of a data access system that lets data consumers request access to data sets and data owners grant access to those data sets. The example also includes a workflow manager that changes the IAM permissions of those data sets accordingly.
  • A guide to the architecture, design, security controls, and operational processes that you implement with this architecture (this document).

The enterprise data mesh architecture is designed to be compatible with the enterprise foundations blueprint. The enterprise foundations blueprint provides a number of base-level services that this architecture relies on, such as VPC networks and logging. You can deploy this architecture without deploying the enterprise foundations blueprint if your Google Cloud environment provides the necessary functionality.

This document is intended for cloud architects, data scientists, data engineers, and security architects who can use the architecture to build and deploy comprehensive data services on Google Cloud. This document assumes that you are familiar with the concepts of data meshes, Google Cloud data services, and the Google Cloud implementation of the CDMC framework.

Architecture

The enterprise data mesh architecture takes a layered approach to provide the capabilities that enable data ingestion, data processing, and governance. The architecture is intended to be deployed and controlled through a CI/CD workflow. The following diagram shows how the data layer that is deployed by this architecture relates to other layers in your environment.

Data mesh architecture.

This diagram includes the following:

  • Google Cloud infrastructure provides security capabilities such as encryption at rest and encryption in transit, as well as basic building blocks such as compute and storage.
  • The enterprise foundation provides a baseline of resources such as identity, networking, logging, monitoring, and deployment systems that enable you to adopt Google Cloud for your data workloads.
  • The data layer provides various capabilities such as data ingestion, data storage, data access control, data governance, data monitoring, and data sharing.
  • The application layer represents the various applications that use the data layer assets.
  • CI/CD provides the tools to automate the provisioning, configuration, management, and deployment of infrastructure, workflows, and software components. These components help you ensure consistent, reliable, and auditable deployments; minimize manual errors; and accelerate the overall development cycle.

To show how the data environment is used, the architecture includes a sample data workflow. The sample data workflow takes you through the following processes: data governance, data ingestion, data processing, data sharing, and data consumption.

Key architectural decisions

The following table summarizes the high-level decisions of the architecture.

| Decision area | Decision |
| --- | --- |
| **Google Cloud architecture** | |
| Resource hierarchy | The architecture uses the resource hierarchy from the enterprise foundations blueprint. |
| Networking | The architecture includes an example data transfer service that uses Workload Identity Federation and the Tink library. |
| Roles and IAM permissions | The architecture includes segmented data producer roles, data consumer roles, data governance roles, and data platform roles. |
| **Common data services** | |
| Metadata | The architecture uses Data Catalog to manage data metadata. |
| Central policy management | To manage policies, the architecture uses Google Cloud's implementation of the CDMC framework. |
| Data access management | To control access to data, the architecture includes an independent process that requires data consumers to request access to data assets from the data owner. |
| Data quality | The architecture uses the Cloud Data Quality Engine to define and run data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. |
| Data security | The architecture uses tagging, encryption, masking, tokenization, and IAM controls to provide data security. |
| **Data domain** | |
| Data environments | The architecture includes three environments. Two environments (non-production and production) are operational environments that are driven by pipelines. One environment (development) is an interactive environment. |
| Data owners | Data owners ingest, process, expose, and grant access to data assets. |
| Data consumers | Data consumers request access to data assets. |
| **Onboarding and operations** | |
| Pipelines | The architecture uses the following pipelines to deploy resources: the foundation pipeline, the infrastructure pipeline, the artifact pipelines, and the Service Catalog pipeline. |
| Repositories | Each pipeline uses a separate repository to enable segregation of responsibility. |
| Process flow | The process requires that changes to the production environment include a submitter and an approver. |
| **Cloud operations** | |
| Data product scorecards | The Report Engine generates data product scorecards. |
| Cloud Logging | The architecture uses the logging infrastructure from the enterprise foundations blueprint. |
| Cloud Monitoring | The architecture uses the monitoring infrastructure from the enterprise foundations blueprint. |

Identity: Mapping roles to groups

The data mesh leverages the enterprise foundations blueprint's existing identity lifecycle management, authorization, and authentication architecture. Users are not assigned roles directly; instead, groups are the primary method of assigning roles and permissions in IAM. IAM roles and permissions are assigned during project creation through the foundation pipeline.

The data mesh associates groups with one of four key areas: infrastructure, data governance, domain-based data producers, and domain-based consumers.

The permission scopes for these groups are the following:

  • The infrastructure group's permission scope is the data mesh as a whole.
  • The data governance groups' permission scope is the data governance project.
  • Domain-based producer and consumer permissions are scoped to their data domain (see the sketch that follows this list).
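
In the deployed architecture, these bindings are created by the foundation pipeline's Terraform rather than by hand, but the following minimal Python sketch illustrates the underlying pattern: roles are granted to groups, never to individual users. The project ID and group address are hypothetical placeholders.

```python
from google.cloud import resourcemanager_v3
from google.iam.v1 import policy_pb2

# Hypothetical project and group. In the architecture, these bindings are
# created by the foundation pipeline during project creation, not by ad hoc
# scripts; this only illustrates the group-based pattern.
PROJECT = "projects/my-data-domain-project"
GROUP = "group:gcp-dm-governance-data-readers@example.com"

client = resourcemanager_v3.ProjectsClient()

# Read the current policy, append a binding that targets a group (never an
# individual user), and write the policy back.
policy = client.get_iam_policy(resource=PROJECT)
policy.bindings.append(policy_pb2.Binding(role="roles/viewer", members=[GROUP]))
client.set_iam_policy(request={"resource": PROJECT, "policy": policy})
```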

The following tables show the various roles that are used in this data mesh implementation and their associated permissions.

Infrastructure

| Group | Description | Roles |
| --- | --- | --- |
| data-mesh-ops@example.com | Overall administrators of the data mesh | roles/owner (data platform) |

Data governance

| Group | Description | Roles |
| --- | --- | --- |
| gcp-dm-governance-admins@example.com | Administrators of the data governance project | roles/owner on the data governance project |
| gcp-dm-governance-developers@example.com | Developers who build and maintain the data governance components | Multiple roles on the data governance project, including roles/viewer, BigQuery roles, and Data Catalog roles |
| gcp-dm-governance-data-readers@example.com | Readers of data governance information | roles/viewer |
| gcp-dm-governance-security-administrator@example.com | Security administrators of the governance project | roles/orgpolicy.policyAdmin and roles/iam.securityReviewer |
| gcp-dm-governance-tag-template-users@example.com | Group with permission to use tag templates | roles/datacatalog.tagTemplateUser |
| gcp-dm-governance-tag-users@example.com | Group with permission to use tag templates and add tags | roles/datacatalog.tagTemplateUser and roles/datacatalog.tagEditor |
| gcp-dm-governance-scc-notifications@example.com | Service account group for Security Command Center notifications | None. This is a group for membership, and a service account is created with this name, which has the necessary permissions. |

Domain-based data producers

| Group | Description | Roles |
| --- | --- | --- |
| gcp-dm-{data_domain_name}-admins@example.com | Administrators of a specific data domain | roles/owner on the data domain project |
| gcp-dm-{data_domain_name}-developers@example.com | Developers who build and maintain data products within a data domain | Multiple roles on the data domain project, including roles/viewer, BigQuery roles, and Cloud Storage roles |
| gcp-dm-{data_domain_name}-data-readers@example.com | Readers of the data domain information | roles/viewer |
| gcp-dm-{data_domain_name}-metadata-editors@{var.domain} | Editors of Data Catalog entries | Roles to edit Data Catalog entries |
| gcp-dm-{data_domain_name}-data-stewards@example.com | Data stewards for the data domain | Roles to manage metadata and data governance aspects |

Domain-based data consumers

| Group | Description | Roles |
| --- | --- | --- |
| gcp-dm-consumer-{project_name}-admins@example.com | Administrators of a specific consumer project | roles/owner on the consumer project |
| gcp-dm-consumer-{project_name}-developers@example.com | Developers working within a consumer project | Multiple roles on the consumer project, including roles/viewer and BigQuery roles |
| gcp-dm-consumer-{project_name}-data-readers@example.com | Readers of the consumer project information | roles/viewer |

Organization structure

To differentiate between production operations and production data, the architecture uses different environments to develop and release workflows. Production operations include the governance, traceability, and repeatability of a workflow and the auditability of the workflow's results. Production data refers to potentially sensitive data that you need to run your organization. All environments are designed to have security controls that let you ingest and operate on your data.

To help data scientists and engineers, the architecture includes an interactive environment, where developers can work with the environment directly and add services through a curated catalog of solutions. Operational environments are driven through pipelines that have codified architecture and configuration.

This architecture uses the organizational structure of the enterprise foundations blueprint as a basis for deploying data workloads. The following diagram shows the top-level folders and projects that are used in the enterprise data mesh architecture.

Data mesh organization structure.

The following table describes the top-level folders and projects that are part of the architecture.

| Folder | Component | Description |
| --- | --- | --- |
| common | prj-c-artifact-pipeline | Contains the deployment pipeline that's used to build out the code artifacts of the architecture. |
| | prj-c-service-catalog | Contains the infrastructure used by Service Catalog to deploy resources in the interactive environment. |
| | prj-c-datagovernance | Contains all the resources used by Google Cloud's implementation of the CDMC framework. |
| development | fldr-d-dataplatform | Contains the projects and resources of the data platform for developing use cases in interactive mode. |
| non-production | fldr-n-dataplatform | Contains the projects and resources of the data platform for testing use cases that you want to deploy in an operational environment. |
| production | fldr-p-dataplatform | Contains the projects and resources of the data platform for deployment into production. |

Data platform folder

The data platform folder contains all the data plane components and some of the CDMC resources. The CDMC resources are split between the data platform folder and the data governance project. The following diagram shows the folders and projects that are deployed in the data platform folder.

The data platform folder

Each environment folder (production, non-production, and development) includes a data platform folder. The following table describes the folders within each data platform folder.

| Folder | Description |
| --- | --- |
| Producers | Contains the data domains. |
| Consumers | Contains the consumer projects. |
| Data domain | Contains the projects associated with a particular domain. |

Producers folder

Each producers folder includes one or more data domains. A data domain refers to a logical grouping of data elements that share a common meaning, purpose, or business context. Data domains let you categorize and organize data assets within an organization. The following diagram shows the structure of a data domain. The architecture deploys projects in the data platform folder for each environment.

The producers folder.

The following table describes the projects that are deployed in the data platform folder for each environment.

| Project | Description |
| --- | --- |
| Ingestion | The ingestion project ingests data into the data domain. The architecture provides examples of how you can stream data into BigQuery, Cloud Storage, and Pub/Sub. The ingestion project also contains examples of Dataflow and Cloud Composer that you can use to orchestrate the transformation and movement of ingested data. |
| Non-confidential | The non-confidential project contains data that has been de-identified. You can mask, containerize, encrypt, tokenize, or obfuscate data. Use policy tags to control how the data is presented (see the sketch after this table). |
| Confidential | The confidential project contains plaintext data. You can control access through IAM permissions. |
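
As a concrete illustration of the policy-tag approach described for the non-confidential project, the following sketch attaches a policy tag to a sensitive column of a BigQuery table. The project, dataset, table, column, and taxonomy identifiers are hypothetical; in the architecture, equivalent configuration is applied through the pipelines rather than by ad hoc scripts.

```python
from google.cloud import bigquery

# Hypothetical project, table, column, and policy tag. In the architecture,
# taxonomies and policy tags are managed through the governance project.
client = bigquery.Client(project="my-domain-project")
table = client.get_table("my-domain-project.sales.orders")

policy_tag = bigquery.PolicyTagList(
    names=[
        "projects/my-governance-project/locations/us"
        "/taxonomies/1234567890/policyTags/9876543210"
    ]
)

# Rebuild the schema, attaching the policy tag to the sensitive column so that
# only principals with the corresponding fine-grained reader role can read it.
new_schema = []
for field in table.schema:
    if field.name == "customer_email":
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            policy_tags=policy_tag,
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```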

Consumer folder

The consumer folder contains consumer projects. Consumer projects provide a mechanism to segment data users based on their required trust boundary. Each project is assigned to a separate user group, and the group is assigned access to the required data assets on a project-by-project basis. You can use the consumer project to collect, analyze, and augment the data for the group.

Common folder

The common folder contains the services that are used by different environments and projects. This section describes the capabilities that are added to the common folder to enable the enterprise data mesh.

CDMC architecture

The architecture uses the CDMC architecture for data governance. The data governance functions reside in the data governance project in the common folder. The following diagram shows the components of the CDMC architecture. The numbers in the diagram represent the key controls that are addressed with Google Cloud services.

The CDMC architecture.

The following table describes the components of the CDMC architecture that the enterprise data mesh architecture uses.

| CDMC component | Google Cloud service | Description |
| --- | --- | --- |
| **Access and lifecycle components** | | |
| Key management | Cloud KMS | A service that securely manages encryption keys that protect sensitive data. |
| Record Manager | Cloud Run | An application that maintains comprehensive logs and records of data processing activities, ensuring organizations can track and audit data usage. |
| Archiving policy | BigQuery | A BigQuery table that contains the storage policy for data. |
| Entitlements | BigQuery | A BigQuery table that stores information about who can access sensitive data. This table ensures that only authorized users can access specific data based on their roles and privileges. |
| **Scanning components** | | |
| Data loss | Sensitive Data Protection | A service used to inspect assets for sensitive data (see the sketch after this table). |
| DLP findings | BigQuery | A BigQuery table that catalogs data classifications within the data platform. |
| Policies | BigQuery | A BigQuery table that contains consistent data governance practices (for example, data access types). |
| Billing export | BigQuery | A table that stores cost information that is exported from Cloud Billing to enable the analysis of cost metrics that are associated with data assets. |
| Cloud Data Quality Engine | Cloud Run | An application that runs data quality checks for tables and columns. |
| Data quality findings | BigQuery | A BigQuery table that records identified discrepancies between the defined data quality rules and the actual quality of the data assets. |
| **Reporting components** | | |
| Scheduler | Cloud Scheduler | A service that controls when the Cloud Data Quality Engine runs and when the Sensitive Data Protection inspection occurs. |
| Report Engine | Cloud Run | An application that generates reports that help track and measure adherence to the CDMC framework's controls. |
| Findings and assets | BigQuery and Pub/Sub | A BigQuery report of discrepancies or inconsistencies in data management controls, such as missing tags, incorrect classifications, or non-compliant storage locations. |
| Tag exports | BigQuery | A BigQuery table that contains extracted tag information from Data Catalog. |
| **Other components** | | |
| Policy management | Organization Policy Service | A service that defines and enforces restrictions on where data can be stored geographically. |
| Attribute-based access policies | Access Context Manager | A service that defines and enforces granular, attribute-based access policies so that only authorized users from permitted locations and devices can access sensitive information. |
| Metadata | Data Catalog | A service that stores metadata information about the tables that are in use in the data mesh. |
| Tag Engine | Cloud Run | An application that adds tags to data in BigQuery tables. |
| CDMC reports | Looker Studio | Dashboards that let your analysts view reports that were generated by the CDMC architecture engines. |
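
To make the scanning components more concrete, the following sketch shows how a Sensitive Data Protection inspection job could scan a BigQuery table and save its findings to a findings table, which is the general pattern behind the Data loss and DLP findings components. All project, dataset, and table names are hypothetical; the reference implementation configures its scans through the governance project's pipelines.

```python
from google.cloud import dlp_v2

# All project, dataset, and table names are hypothetical.
dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-governance-project/locations/us"

inspect_job = dlp_v2.InspectJobConfig(
    # Scan a BigQuery table in a data domain project.
    storage_config=dlp_v2.StorageConfig(
        big_query_options=dlp_v2.BigQueryOptions(
            table_reference=dlp_v2.BigQueryTable(
                project_id="my-domain-project", dataset_id="sales", table_id="orders"
            )
        )
    ),
    inspect_config=dlp_v2.InspectConfig(
        info_types=[
            dlp_v2.InfoType(name="EMAIL_ADDRESS"),
            dlp_v2.InfoType(name="CREDIT_CARD_NUMBER"),
        ],
        min_likelihood=dlp_v2.Likelihood.POSSIBLE,
    ),
    # Save findings to a findings table in the governance project.
    actions=[
        dlp_v2.Action(
            save_findings=dlp_v2.Action.SaveFindings(
                output_config=dlp_v2.OutputStorageConfig(
                    table=dlp_v2.BigQueryTable(
                        project_id="my-governance-project",
                        dataset_id="cdmc_findings",
                        table_id="dlp_findings",
                    )
                )
            )
        )
    ],
)

job = dlp.create_dlp_job(parent=parent, inspect_job=inspect_job)
print("Started inspection job:", job.name)
```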

CDMC implementation

The following table describes how the architecture implements the key controls in the CDMC framework.

| CDMC control requirement | Implementation |
| --- | --- |
| Data control compliance | The Report Engine detects non-compliant data assets and publishes findings to a Pub/Sub topic. These findings are also loaded into BigQuery for reporting using Looker Studio. |
| Data ownership is established for both migrated and cloud-generated data | Data Catalog automatically captures technical metadata from BigQuery. Tag Engine applies business metadata tags, like owner name and sensitivity level, from a reference table, which helps ensure that all sensitive data is tagged with owner information for compliance. This automated tagging process supports data governance and compliance by identifying and labeling sensitive data with the appropriate owner information. |
| Data sourcing and consumption are governed and supported by automation | Data Catalog classifies data assets by tagging them with an is_authoritative flag when they are an authoritative source. Data Catalog automatically stores this information, along with the technical metadata, in a data register. The Report Engine and the Tag Engine can validate and report on the data register of authoritative sources using Pub/Sub. |
| Data sovereignty and cross-border data movement are managed | Organization Policy Service defines permitted storage regions for data assets, and Access Context Manager restricts access based on user location. Data Catalog stores the approved storage locations as metadata tags. Report Engine compares these tags against the actual location of the data assets in BigQuery and publishes any discrepancies as findings using Pub/Sub. Security Command Center provides an additional layer of monitoring by generating vulnerability findings if data is stored or accessed outside the defined policies. |
| Data catalogs are implemented, used, and interoperable | Data Catalog stores and updates the technical metadata for all BigQuery data assets, effectively creating a continuously synchronized data catalog. Data Catalog ensures that any new or modified tables and views are immediately added to the catalog, maintaining an up-to-date inventory of data assets. |
| Data classifications are defined and used | Sensitive Data Protection inspects BigQuery data and identifies sensitive information types. These findings are then ranked based on a classification reference table, and the highest sensitivity level is assigned as a tag in Data Catalog at the column and table levels. Tag Engine manages this process by updating Data Catalog with sensitivity tags whenever new data assets are added or existing ones are modified. This process ensures a constantly updated classification of data based on sensitivity, which you can monitor and report on using Pub/Sub and integrated reporting tools (a tagging sketch follows this table). |
| Data entitlements are managed, enforced, and tracked | BigQuery policy tags control access to sensitive data at the column level, ensuring that only authorized users can access specific data based on their assigned policy tag. IAM manages overall access to the data warehouse, while Data Catalog stores sensitivity classifications. Regular checks are performed to ensure that all sensitive data has corresponding policy tags, with any discrepancies reported using Pub/Sub for remediation. |
| Ethical access, use, and outcomes of data are managed | Data sharing agreements for both providers and consumers are stored in a dedicated BigQuery data warehouse to control consumption purposes. Data Catalog labels data assets with the provider agreement information, while consumer agreements are linked to IAM bindings for access control. Query labels enforce consumption purposes, requiring consumers to specify a valid purpose when querying sensitive data, which is validated against their entitlements in BigQuery. An audit trail in BigQuery tracks all data access and ensures compliance with the data sharing agreements. |
| Data is secured, and controls are evidenced | Google's default encryption at rest helps protect data that is stored on disk. Cloud KMS supports customer-managed encryption keys (CMEK) for enhanced key management. BigQuery implements column-level dynamic data masking for de-identification and supports application-level de-identification during data ingestion. Data Catalog stores metadata tags for the encryption and de-identification techniques that are applied to data assets. Automated checks ensure that the encryption and de-identification methods align with predefined security policies, with any discrepancies reported as findings using Pub/Sub. |
| A data privacy framework is defined and operational | Data Catalog tags sensitive data assets with relevant information for impact assessment, such as subject location and assessment report links. Tag Engine applies these tags based on data sensitivity and a policy table in BigQuery, which defines the assessment requirements based on data and subject residency. This automated tagging process allows for continuous monitoring and reporting of compliance with impact assessment requirements, ensuring that data protection impact assessments (DPIAs) or protection impact assessments (PIAs) are conducted when necessary. |
| The data lifecycle is planned and managed | Data Catalog labels data assets with retention policies, specifying retention periods and expiration actions (such as archive or purge). Record Manager automates the enforcement of these policies by purging or archiving BigQuery tables based on the defined tags. This enforcement ensures adherence to the data lifecycle policies and maintains compliance with data retention requirements, with any discrepancies detected and reported using Pub/Sub. |
| Data quality is managed | The Cloud Data Quality Engine defines and runs data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. Results from these checks, including success percentages and thresholds, are stored as tags in Data Catalog. Storing these results allows for continuous monitoring and reporting of data quality, with any issues or deviations from acceptable thresholds published as findings using Pub/Sub. |
| Cost management principles are established and applied | Data Catalog stores cost-related metrics for data assets, such as query costs, storage costs, and data egress costs, which are calculated using billing information exported from Cloud Billing to BigQuery. Storing cost-related metrics allows for comprehensive cost tracking and analysis, ensuring adherence to cost policies and efficient resource utilization, with any anomalies reported using Pub/Sub. |
| Data provenance and lineage are understood | Data Catalog's built-in data lineage features track the provenance and lineage of data assets, visually representing the flow of data. Additionally, data ingestion scripts identify and tag the original source of the data in Data Catalog, enhancing the traceability of data back to its origin. |
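
The ownership and classification controls above rely on Data Catalog tags that the Tag Engine maintains. The following sketch shows the underlying API pattern for attaching a sensitivity tag to a BigQuery table's catalog entry; the tag template, fields, and resource names are hypothetical and would normally be managed by the Tag Engine rather than by hand.

```python
from google.cloud import datacatalog_v1

# Hypothetical table and tag template; the Tag Engine normally maintains these.
client = datacatalog_v1.DataCatalogClient()

entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/my-domain-project"
            "/datasets/sales/tables/orders"
        )
    }
)

tag = datacatalog_v1.Tag(
    template=(
        "projects/my-governance-project/locations/us"
        "/tagTemplates/data_sensitivity"
    ),
    fields={
        "sensitivity_level": datacatalog_v1.TagField(
            enum_value=datacatalog_v1.TagField.EnumValue(display_name="CONFIDENTIAL")
        ),
        "data_owner": datacatalog_v1.TagField(
            string_value="gcp-dm-sales-data-stewards@example.com"
        ),
    },
)
client.create_tag(parent=entry.name, tag=tag)
```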

Data access management

Access to data in the architecture is controlled through an independent process that separates operational control (for example, running Dataflow jobs) from data access control. A user's access to a Google Cloud service is an environmental or operational concern and is provisioned and approved by a cloud engineering group. A user's access to Google Cloud data assets (for example, a BigQuery table) is a privacy, regulatory, or governance concern; it is subject to an access agreement between the producing and consuming parties and is controlled through the following processes. The following diagram shows how data access is provisioned through the interaction of different software components.

Data access management

As shown in the previous diagram, onboarding for data access is handled by the following processes:

  • Cloud data assets are collected and inventoried by Data Catalog.
  • The workflow manager retrieves the data assets from Data Catalog.
  • Data owners are onboarded to the workflow manager.

Data access management operates as follows:

  1. A data consumer makes a request for a specific asset.
  2. The data owner of the asset is alerted to the request.
  3. The data owner approves or rejects the request.
  4. If the request is approved, the workflow manager passes the group, asset, and associated tag to the IAM mapper.
  5. The IAM mapper translates the workflow manager tags into IAM permissions and gives the specified group IAM permissions for the data asset (see the sketch after this list).
  6. When a user wants to access the data asset, IAM evaluates access to the Google Cloud asset based on the permissions of the group.
  7. If permitted, the user accesses the data asset.
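
The following sketch illustrates the kind of change the IAM mapper makes when a request is approved: it grants the consumer group read access to the approved BigQuery dataset. The project, dataset, and group names are hypothetical, and the workflow manager and IAM mapper in the repository may implement this differently.

```python
from google.cloud import bigquery

# Hypothetical project, dataset, and consumer group.
client = bigquery.Client(project="my-domain-project")
dataset = client.get_dataset("my-domain-project.sales_non_confidential")

# Append a group-based READER entry and update only the access list.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="gcp-dm-consumer-analytics-data-readers@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```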

Networking

The data security process begins at the source application, which might reside on-premises or in another environment external to the target Google Cloud project. Before any network transfer occurs, this application uses Workload Identity Federation to securely authenticate itself to Google Cloud APIs. Using these credentials, it interacts with Cloud KMS to obtain or wrap the necessary keys, and then employs the Tink library to perform initial encryption and de-identification on the sensitive data payload according to predefined templates.
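
The following minimal sketch shows this client-side protection pattern with Tink's envelope AEAD backed by Cloud KMS. It assumes that Workload Identity Federation credentials are available through Application Default Credentials; the key URI and associated data are hypothetical placeholders, and the repository's transfer service may structure this differently.

```python
from tink import aead
from tink.integration import gcpkms

# Hypothetical Cloud KMS key. Credentials come from Application Default
# Credentials, which is how Workload Identity Federation is typically surfaced
# to client libraries; an empty credentials path selects those defaults.
KEY_URI = (
    "gcp-kms://projects/my-ingestion-project/locations/us"
    "/keyRings/data-transfer/cryptoKeys/wrapping-key"
)

aead.register()

kms_client = gcpkms.GcpKmsClient(KEY_URI, "")
remote_aead = kms_client.get_aead(KEY_URI)

# Envelope encryption: each payload is encrypted locally with a fresh
# AES256-GCM data key, and that data key is wrapped by the Cloud KMS key.
env_aead = aead.KmsEnvelopeAead(aead.aead_key_templates.AES256_GCM, remote_aead)

ciphertext = env_aead.encrypt(b"sensitive record", b"ingestion-context")
plaintext = env_aead.decrypt(ciphertext, b"ingestion-context")
```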

After the data payload is protected, the payload must be securely transferred into the Google Cloud ingestion project. For on-premises applications, you can use Cloud Interconnect or potentially Cloud VPN. Within the Google Cloud network, use Private Service Connect to route the data towards the ingestion endpoint within the target project's VPC network. Private Service Connect lets the source application connect to Google APIs using private IP addresses, ensuring that traffic isn't exposed to the internet.

The entire network path and the target ingestion services (Cloud Storage, BigQuery, and Pub/Sub) within the ingestion project are secured by a VPC Service Controls perimeter. This perimeter enforces a security boundary, ensuring that the protected data originating from the source can only be ingested into the authorized Google Cloud services within that specific project.

Logging

This architecture uses the Cloud Logging capabilities that are provided by the enterprise foundations blueprint.

Pipelines

The enterprise data mesh architecture uses a series of pipelines to provision the infrastructure, orchestration, data sets, data pipelines, and application components. The architecture's resource deployment pipelines use Terraform as the infrastructure as code (IaC) tool and Cloud Build as the CI/CD service to deploy the Terraform configurations into the architecture environment. The following diagram shows the relationship between the pipelines.

Pipeline relationships

The foundation pipeline and the infrastructure pipeline are part of the enterprise foundations blueprint. The following table describes the purpose of the pipelines and the resources that they provision.

| Pipeline | Provisioned by | Resources |
| --- | --- | --- |
| Foundation pipeline | Bootstrap | Data platform folder and subfolders; common projects; infrastructure pipeline service account; Cloud Build trigger for the infrastructure pipeline; Shared VPC; VPC Service Controls perimeter |
| Infrastructure pipeline | Foundation pipeline | Consumer projects; Service Catalog service account; Cloud Build trigger for the Service Catalog pipeline; artifact pipeline service account; Cloud Build trigger for the artifact pipeline |
| Service Catalog pipeline | Infrastructure pipeline | Resources deployed in the Service Catalog bucket |
| Artifact pipelines | Infrastructure pipeline | The various containers and other components of the codebase used by the data mesh |

Each pipeline has its own set of repositories from which it pulls code and configuration files. Each repository enforces separation of duties: submitting and approving operational code deployments are the responsibilities of different groups.

Interactive deployment through Service Catalog

The interactive environment is the development environment within the architecture and exists under the development folder. The main interface for the interactive environment is Service Catalog, which lets developers use preconfigured templates to instantiate Google services. These preconfigured templates are known as service templates. Service templates help you enforce your security posture, such as making CMEK encryption mandatory, and also prevent your users from having direct access to Google APIs.

The following diagram shows the components of the interactive environment and how data scientists deploy resources.

Interactive environment with Service Catalog.

To deploy resources using the Service Catalog, the following steps occur:

  1. The MLOps engineer puts a Terraform resource template for Google Cloud into a Git repository.
  2. A Git commit triggers a Cloud Build pipeline.
  3. Cloud Build copies the template and any associated configuration files to Cloud Storage.
  4. The MLOps engineer manually sets up the Service Catalog and its solutions. The engineer then shares the Service Catalog with a service project in the interactive environment.
  5. The data scientist selects a resource from the Service Catalog.
  6. Service Catalog deploys the template into the interactive environment.
  7. The resource pulls any necessary configuration scripts.
  8. The data scientist interacts with the resources.

Artifact pipelines

The data ingestion process uses Cloud Composer and Dataflow to orchestrate the movement and transformation of data within the data domain. The artifact pipeline builds all necessary resources for data ingestion and moves the resources to the appropriate location for the services to access them. The artifact pipeline creates the container artifacts that the orchestrator uses.
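
As an illustration of this orchestration pattern, the following sketch is a minimal Cloud Composer (Airflow) DAG that launches a Dataflow Flex Template whose container spec was produced by the artifact pipeline. The DAG ID, project, bucket path, and parameters are hypothetical, and the repository's actual DAGs may be structured differently.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartFlexTemplateOperator,
)

# Hypothetical DAG, project, bucket, and parameters.
with DAG(
    dag_id="ingest_sales_orders",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_ingestion = DataflowStartFlexTemplateOperator(
        task_id="run_ingestion_pipeline",
        project_id="my-ingestion-project",
        location="us-central1",
        body={
            "launchParameter": {
                "jobName": "sales-orders-ingestion",
                # Container spec produced by the artifact pipeline.
                "containerSpecGcsPath": "gs://my-artifact-bucket/templates/ingestion.json",
                "parameters": {
                    "output_table": "my-domain-project:sales.orders_raw",
                },
            }
        },
    )
```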

Security controls

The enterprise data mesh architecture uses a layered defense-in-depth security model that includes default Google Cloud capabilities, Google Cloud services, and security capabilities that are configured through the enterprise foundations blueprint. The following diagram shows the layering of the various security controls for the architecture.

Security controls in the data mesh architecture.

The following table describes the security controls that are associated with the resources in each layer.

| Layer | Resource | Security control |
| --- | --- | --- |
| CDMC framework | Google Cloud CDMC implementation | Provides a governance framework that helps secure, manage, and control your data assets. For more information, see the CDMC Key Controls Framework. |
| Deployment | Infrastructure pipeline | Provides a series of pipelines that deploy infrastructure, build containers, and create data pipelines. The use of pipelines allows for auditability, traceability, and repeatability. |
| | Artifact pipeline | Deploys various components that are not deployed by the infrastructure pipeline. |
| | Terraform templates | Build out the system infrastructure. |
| | Open Policy Agent | Helps ensure that the platform conforms to selected policies. |
| Network | Private Service Connect | Provides data exfiltration protections around the architecture resources at the API layer and the IP layer. Lets you communicate with Google Cloud APIs using private IP addresses so that you can avoid exposing traffic to the internet. |
| | VPC network with private IP addresses | Helps remove exposure to internet-facing threats. |
| | VPC Service Controls | Helps protect sensitive resources against data exfiltration. |
| | Firewall | Helps protect the VPC network against unauthorized access. |
| Access management | Access Context Manager | Controls who can access which resources and helps prevent unauthorized use of your resources. |
| | Workload Identity Federation | Removes the need for external credentials to transfer data onto the platform from on-premises environments. |
| | Data Catalog | Provides an index of assets available to users. |
| | IAM | Provides fine-grained access control. |
| Encryption | Cloud KMS | Lets you manage your encryption keys and secrets, and helps protect your data through encryption at rest and encryption in transit. |
| | Secret Manager | Provides a secret store for pipelines that is controlled by IAM. |
| | Encryption at rest | By default, Google Cloud encrypts data at rest. |
| | Encryption in transit | By default, Google Cloud encrypts data in transit. |
| Detective | Security Command Center | Helps you detect misconfigurations and malicious activity in your Google Cloud organization. |
| | Continuous architecture | Continually checks your Google Cloud organization against a series of OPA policies that you have defined. |
| | IAM Recommender | Analyzes user permissions and provides suggestions for reducing permissions to help enforce the principle of least privilege. |
| | Firewall Insights | Analyzes firewall rules, identifies overly permissive firewall rules, and suggests more restrictive rules to help strengthen your overall security posture. |
| | Cloud Logging | Provides visibility into system activity and helps enable the detection of anomalies and malicious activity. |
| | Cloud Monitoring | Tracks key signals and events that can help identify suspicious activity. |
| Preventative | Organization Policy | Lets you control and restrict actions within your Google Cloud organization. |

Workflows

The following sections outline the data producer workflow and the data consumer workflow, which ensure appropriate access controls based on data sensitivity and user roles.

Data producer workflow

The following diagram shows how data is protected as it is transferred to BigQuery.

Data producer workflow

The workflow for data transfer is the following:

  1. An application that is integrated with Workload Identity Federation uses Cloud KMS to decrypt a wrapped encryption key.
  2. The application uses the Tink library to de-identify or encrypt the data using a template.
  3. The application transfers data to the ingestion project in Google Cloud.
  4. The data arrives in Cloud Storage, BigQuery, or Pub/Sub.
  5. In the ingestion project, the data is decrypted or re-identified using a template.
  6. The decrypted data is encrypted or masked based on another de-identification template, then placed in the non-confidential project (see the sketch after this list). Tags are applied by the tagging engine as appropriate.
  7. Data from the non-confidential project is transferred to the confidential project and re-identified.
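
Steps 5 and 6 rely on de-identification templates. The following sketch shows the general pattern with a Sensitive Data Protection de-identify call that applies a named template; the project, de-identify template, and inspect template names are hypothetical, and the architecture may apply its templates through Tink or Dataflow instead.

```python
from google.cloud import dlp_v2

# Hypothetical project and templates; the templates would be managed centrally.
dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-ingestion-project/locations/us"

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_template_name": (
            parent + "/deidentifyTemplates/non-confidential-masking"
        ),
        "inspect_template_name": parent + "/inspectTemplates/pii-inspection",
        "item": {"value": "Order 1234 placed by jane.doe@example.com"},
    }
)
print(response.item.value)  # masked or tokenized output
```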

The following data access is permitted:

  • Users who have access to the confidential project can access all the raw plaintext data.
  • Users who have access to the non-confidential project can access masked, tokenized, or encrypted data based on the tags associated with the data and their permissions.

Data consumer workflow

The following steps describe how a consumer can access data that is stored in BigQuery.

  1. The data consumer searches for data assets using Data Catalog (a search sketch follows this list).
  2. After the data consumer finds the assets that they are looking for, they request access to the data assets.
  3. The data owner decides whether to provide access to the assets.
  4. If the consumer obtains access, the consumer can use a notebook and Service Catalog to create an environment in which they can analyze and transform the data assets.
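
For step 1, the following sketch shows a minimal Data Catalog search from a consumer's point of view. The scope, project ID, and query string are hypothetical placeholders.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Hypothetical scope and query: search BigQuery assets in a data domain project.
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-domain-project"]
)
results = client.search_catalog(
    request={"scope": scope, "query": "system=bigquery sales orders"}
)
for result in results:
    print(result.relative_resource_name, result.linked_resource)
```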

Bringing it all together

The GitHub repository provides detailed instructions for deploying the data mesh on Google Cloud after you have deployed the enterprise foundation. The process to deploy the architecture involves modifying your existing infrastructure repositories and deploying new data mesh-specific components.

Complete the following:

  1. Complete all prerequisites, including the following:
    1. Install the Google Cloud CLI, Terraform, Tink, Java, and Go.
    2. Deploy the enterprise foundations blueprint (v4.1).
    3. Maintain the following local repositories:
      • gcp-data-mesh-foundations
      • gcp-bootstrap
      • gcp-environments
      • gcp-networks
      • gcp-org
      • gcp-projects
  2. Modify the existing foundation blueprint and then deploy the data mesh applications. For each repository, complete the following:
    1. In your target repository, check out the plan branch.
    2. To add data mesh components, copy the relevant files and directories from gcp-data-mesh-foundations into the appropriate foundation directory. Overwrite files when required.
    3. Update the data mesh variables, roles, and settings in the Terraform files (for example, *.tfvars and *.tf). Set the GitHub tokens as environment variables.
    4. Perform the Terraform initialize, plan, and apply operations on each repository.
    5. Commit your changes, push the code to your remote repository, create pull requests, and merge them to your development, nonproduction, and production environments.

What's next
