Deploy an enterprise data management and analytics platform
An enterprise data management and analytics platform provides an enclave where you can store, analyze, and manipulate sensitive information while maintaining security controls. You can use the enterprise data mesh architecture to deploy a platform on Google Cloud for data management and analytics. The architecture is designed to work in a hybrid environment, where Google Cloud components interact with your existing on-premises components and operating processes.
The enterprise data mesh architecture includes the following:
- A GitHub repository that contains a set of Terraform configurations, scripts, and code to build the following:
    - A governance project that lets you use Google's implementation of the Cloud Data Management Capabilities (CDMC) Key Controls Framework.
    - A data platform example that supports interactive and production workflows.
    - A producer environment within the data platform that supports multiple data domains. Data domains are logical groupings of data elements.
    - A consumer environment within the data platform that supports multiple consumer projects.
    - A data transfer service that uses Workload Identity Federation and the Tink encryption library to help you transfer data into Google Cloud in a secure manner.
    - A data domain example that contains ingestion, non-confidential, and confidential projects.
    - An example of a data access system that lets data consumers request access to data sets and data owners grant access to those data sets. The example also includes a workflow manager that changes the IAM permissions of those data sets accordingly.
- A guide to the architecture, design, security controls, and operational processes that you implement with this architecture (this document).
The enterprise data mesh architecture is designed to be compatible with the enterprise foundations blueprint. The enterprise foundations blueprint provides a number of base-level services that this architecture relies on, such as VPC networks and logging. You can deploy this architecture without deploying the enterprise foundations blueprint if your Google Cloud environment provides the necessary functionality.
This document is intended for cloud architects, data scientists, data engineers, and security architects who can use the architecture to build and deploy comprehensive data services on Google Cloud. This document assumes that you are familiar with the concepts of data meshes, Google Cloud data services, and the Google Cloud implementation of the CDMC framework.
Architecture
The enterprise data mesh architecture takes a layered approach to provide the capabilities that enable data ingestion, data processing, and governance. The architecture is intended to be deployed and controlled through a CI/CD workflow. The following diagram shows how the data layer that is deployed by this architecture relates to other layers in your environment.
This diagram includes the following:
- Google Cloud infrastructure provides security capabilities such as encryption at rest and encryption in transit, as well as basic building blocks such as compute and storage.
- The enterprise foundation provides a baseline of resources such as identity, networking, logging, monitoring, and deployment systems that enable you to adopt Google Cloud for your data workloads.
- The data layer provides various capabilities such as data ingestion, data storage, data access control, data governance, data monitoring, and data sharing.
- The application layer represents the various applications that use the data layer assets.
- CI/CD provides the tools to automate the provisioning, configuration, management, and deployment of infrastructure, workflows, and software components. These components help you ensure consistent, reliable, and auditable deployments; minimize manual errors; and accelerate the overall development cycle.
To show how the data environment is used, the architecture includes a sample data workflow. The sample data workflow takes you through the following processes: data governance, data ingestion, data processing, data sharing, and data consumption.
Key architectural decisions
The following table summarizes the high-level decisions of the architecture.
| Decision area | Decision |
|---|---|
| Google Cloud architecture | |
| Resource hierarchy | The architecture uses the resource hierarchy from the enterprise foundations blueprint. |
| Networking | The architecture includes an example data transfer service that uses Workload Identity Federation and the Tink encryption library. |
| Roles and IAM permissions | The architecture includes segmented data producer roles, data consumer roles, data governance roles, and data platform roles. |
| Common data services | |
| Metadata | The architecture uses Data Catalog to manage data metadata. |
| Central policy management | To manage policies, the architecture uses Google Cloud's implementation of the CDMC framework. |
| Data access management | To control access to data, the architecture includes an independent process that requires data consumers to request access to data assets from the data owner. |
| Data quality | The architecture uses the Cloud Data Quality Engine to define and run data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. |
| Data security | The architecture uses tagging, encryption, masking, tokenization, and IAM controls to provide data security. |
| Data domain | |
| Data environments | The architecture includes three environments. Two environments (non-production and production) are operational environments that are driven by pipelines. One environment (development) is an interactive environment. |
| Data owners | Data owners ingest, process, expose, and grant access to data assets. |
| Data consumers | Data consumers request access to data assets. |
| Onboarding and operations | |
| Pipelines | The architecture uses the foundation pipeline, the infrastructure pipeline, the Service Catalog pipeline, and the artifact pipelines to deploy resources. |
| Repositories | Each pipeline uses a separate repository to enable segregation of responsibility. |
| Process flow | The process requires that changes to the production environment include a submitter and an approver. |
| Cloud operations | |
| Data product scorecards | The Report Engine generates data product scorecards. |
| Cloud Logging | The architecture uses the logging infrastructure from the enterprise foundations blueprint. |
| Cloud Monitoring | The architecture uses the monitoring infrastructure from the enterprise foundations blueprint. |
Identity: Mapping roles to groups
The data mesh uses the existing identity lifecycle management, authorization, and authentication architecture of the enterprise foundations blueprint. Users are not assigned roles directly; instead, groups are the primary method of assigning roles and permissions in IAM. IAM roles and permissions are assigned during project creation through the foundation pipeline.
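For illustration, the following is a minimal Terraform sketch of a group-based role grant at the project level. The project ID, group address, and role are hypothetical placeholders; the actual bindings are defined by the foundation pipeline, not by this snippet.

```
# Hypothetical example: grant a data-domain developer group a role on its
# ingestion project. The group, project, and role values are placeholders.
resource "google_project_iam_member" "domain_developers_bq_editor" {
  project = "example-data-domain-ingest-prj"      # placeholder project ID
  role    = "roles/bigquery.dataEditor"           # example role only
  member  = "group:data-domain-developers@example.com"
}
```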
The data mesh associates groups with one of four key areas: infrastructure, data governance, domain-based data producers, and domain-based consumers.
The permission scopes for these groups are the following:
- The infrastructure group's permission scope is the data mesh as a whole.
- The data governance groups' permission scope is the data governance project.
- Domain-based producer and consumer permissions are scoped to their data domain.
The following tables show the various roles used in this data mesh implementation and their associated permissions.
Infrastructure
| Group | Description | Roles |
|---|---|---|
| | Overall administrators of the data mesh | |
Data governance
| Group | Description | Roles |
|---|---|---|
| | Administrators of the data governance project | |
| | Developers who build and maintain the data governance components | Multiple roles on the data governance project, including |
| | Readers of data governance information | |
| | Security administrators of the governance project | |
| | Group with permission to use tag templates | |
| | Group with permission to use tag templates and add tags | |
| | Service account group for Security Command Center notifications | None. This is a group for membership, and a service account is created with this name, which has the necessary permissions. |
Domain-based data producers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific data domain | |
| | Developers who build and maintain data products within a data domain | Multiple roles on the data domain project, including |
| | Readers of the data domain information | |
| | Editors of Data Catalog entries | Roles to edit Data Catalog entries |
| | Data stewards for the data domain | Roles to manage metadata and data governance aspects |
Domain-based data consumers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific consumer project | |
| | Developers working within a consumer project | Multiple roles on the consumer project, including |
| | Readers of the consumer project information | |
Organization structure
To differentiate between production operations and production data, the architecture uses different environments to develop and release workflows. Production operations include the governance, traceability, and repeatability of a workflow and the auditability of the results of the workflow. Production data refers to possibly sensitive data that you need to run your organization. All environments are designed to have security controls that let you ingest and operate your data.
To help data scientists and engineers, the architecture includes an interactive environment, where developers can work with the environment directly and add services through a curated catalog of solutions. Operational environments are driven through pipelines, which have codified architecture and configuration.
This architecture uses the organizational structure of the enterprise foundations blueprint as a basis for deploying data workloads. The following diagram shows the top-level folders and projects used in the enterprise data mesh architecture.
The following table describes the top-level folders and projects that are part of the architecture.
| Folder | Component | Description |
|---|---|---|
| Common | | Contains the deployment pipeline that's used to build out the code artifacts of the architecture. |
| Common | | Contains the infrastructure used by the Service Catalog to deploy resources in the interactive environment. |
| Common | | Contains all the resources used by Google Cloud's implementation of the CDMC framework. |
| Development | | Contains the projects and resources of the data platform for developing use cases in interactive mode. |
| Non-production | | Contains the projects and resources of the data platform for testing use cases that you want to deploy in an operational environment. |
| Production | | Contains the projects and resources of the data platform for deployment into production. |
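If you are not deploying the enterprise foundations blueprint, you can create equivalent top-level folders yourself. The following is a minimal, hypothetical Terraform sketch of one environment folder; the display name and organization ID are placeholders.

```
# Hypothetical sketch: create a top-level environment folder.
# The organization ID and display name are placeholders.
resource "google_folder" "development" {
  display_name = "development"
  parent       = "organizations/123456789012"   # replace with your organization ID
}
```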
Data platform folder
The data platform folder contains all the data plane components and some of the CDMC resources; the remaining CDMC resources reside in the data governance project. The following diagram shows the folders and projects that are deployed in the data platform folder.
A data platform folder exists for each environment (production, non-production, and development). The following table describes the folders within each data platform folder.
| Folder | Description |
|---|---|
| Producers | Contains the data domains. |
| Consumers | Contains the consumer projects. |
| Data domain | Contains the projects associated with a particular domain. |
Producers folder
Each producers folder includes one or more data domains. A data domain refers to a logical grouping of data elements that share a common meaning, purpose, or business context. Data domains let you categorize and organize data assets within an organization. The following diagram shows the structure of a data domain. The architecture deploys projects in the data platform folder for each environment.
The following table describes the projects that are deployed in the data platform folder for each environment.
| Project | Description |
|---|---|
| Ingestion | The ingestion project ingests data into the data domain. The architecture provides examples of how you can stream data into BigQuery, Cloud Storage, and Pub/Sub. The ingestion project also contains examples of Dataflow and Cloud Composer that you can use to orchestrate the transformation and movement of ingested data. |
| Non-confidential | The non-confidential project contains data that has been de-identified. You can mask, containerize, encrypt, tokenize, or obfuscate data. Use policy tags to control how the data is presented (see the sketch that follows this table). |
| Confidential | The confidential project contains plaintext data. You can control access through IAM permissions. |
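To illustrate the policy tags mentioned for the non-confidential project, the following hypothetical Terraform sketch defines a Data Catalog taxonomy and a policy tag that could be attached to de-identified columns. The project, region, and names are placeholders and are not the taxonomies shipped with the architecture.

```
# Hypothetical sketch: a taxonomy and policy tag for column-level access control.
resource "google_data_catalog_taxonomy" "sensitivity" {
  project                = "example-non-confidential-prj"   # placeholder project
  region                 = "us-central1"                    # placeholder region
  display_name           = "sensitivity"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "confidential" {
  taxonomy     = google_data_catalog_taxonomy.sensitivity.id
  display_name = "confidential"
  description  = "Columns that contain de-identified confidential data."
}
```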
Consumer folder
The consumer folder contains consumer projects. Consumer projects provide a mechanism to segment data users based on their required trust boundary. Each project is assigned to a separate user group, and the group is assigned access to the required data assets on a project-by-project basis. You can use the consumer project to collect, analyze, and augment the data for the group.
Common folder
The common folder contains the services that are used by different environments and projects. This section describes the capabilities that are added to the common folder to enable the enterprise data mesh.
CDMC architecture
The architecture uses the CDMC architecture for data governance. The data governance functions reside in the data governance project in the common folder. The following diagram shows the components of the CDMC architecture. The numbers in the diagram represent the key controls that are addressed with Google Cloud services.
The following table describes the components of the CDMC architecture that the enterprise data mesh architecture uses.
| CDMC component | Google Cloud service | Description |
|---|---|---|
| Access and lifecycle components | | |
| Key management | Cloud KMS | A service that securely manages encryption keys that protect sensitive data. |
| Record Manager | Cloud Run | An application that maintains comprehensive logs and records of data processing activities, ensuring organizations can track and audit data usage. |
| Archiving policy | BigQuery | A BigQuery table that contains the storage policy for data. |
| Entitlements | BigQuery | A BigQuery table that stores information about who can access sensitive data. This table ensures that only authorized users can access specific data based on their roles and privileges. |
| Scanning components | | |
| Data loss | Sensitive Data Protection | Service used to inspect assets for sensitive data. |
| DLP findings | BigQuery | A BigQuery table that catalogs data classifications within the data platform. |
| Policies | BigQuery | A BigQuery table that contains consistent data governance practices (for example, data access types). |
| Billing export | BigQuery | A BigQuery table that stores cost information that is exported from Cloud Billing to enable the analysis of cost metrics that are associated with data assets. |
| Cloud Data Quality Engine | Cloud Run | An application that runs data quality checks for tables and columns. |
| Data quality findings | BigQuery | A BigQuery table that records identified discrepancies between the defined data quality rules and the actual quality of the data assets. |
| Reporting components | | |
| Scheduler | Cloud Scheduler | A service that controls when the Cloud Data Quality Engine runs and when the Sensitive Data Protection inspection occurs. |
| Report Engine | Cloud Run | An application that generates reports that help track and measure adherence to the CDMC framework's controls. |
| Findings and assets | BigQuery and Pub/Sub | A BigQuery report of discrepancies or inconsistencies in data management controls, such as missing tags, incorrect classifications, or non-compliant storage locations. |
| Tag exports | BigQuery | A BigQuery table that contains extracted tag information from Data Catalog. |
| Other components | | |
| Policy management | Organization Policy Service | A service that defines and enforces restrictions on where data can be stored geographically. |
| Attribute-based access policies | Access Context Manager | A service that defines and enforces granular, attribute-based access policies so that only authorized users from permitted locations and devices can access sensitive information. |
| Metadata | Data Catalog | A service that stores metadata information about the tables that are in use in the data mesh. |
| Tag Engine | Cloud Run | An application that adds tags to data in BigQuery tables. |
| CDMC reports | Looker Studio | Dashboards that let your analysts view reports that were generated by the CDMC architecture engines. |
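To illustrate the kind of metadata that Tag Engine manages, the following hypothetical Terraform sketch defines a Data Catalog tag template with owner and sensitivity fields. The template ID, field names, project, and region are placeholders and do not reproduce the templates that ship with the CDMC implementation.

```
# Hypothetical sketch: a tag template for owner and sensitivity metadata.
resource "google_data_catalog_tag_template" "data_ownership" {
  project         = "example-data-governance-prj"   # placeholder project
  region          = "us-central1"                   # placeholder region
  tag_template_id = "data_ownership"
  display_name    = "Data ownership"

  fields {
    field_id     = "owner_email"
    display_name = "Owner email"
    is_required  = true
    type {
      primitive_type = "STRING"
    }
  }

  fields {
    field_id     = "sensitivity_level"
    display_name = "Sensitivity level"
    type {
      primitive_type = "STRING"
    }
  }
}
```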
CDMC implementation
The following table describes how the architecture implements the key controls in the CDMC framework.
| CDMC control requirement | Implementation |
|---|---|
| Data control compliance is monitored | The Report Engine detects non-compliant data assets and publishes findings to a Pub/Sub topic. These findings are also loaded into BigQuery for reporting using Looker Studio. |
| Data ownership is established for both migrated and cloud-generated data | Data Catalog automatically captures technical metadata from BigQuery. Tag Engine applies business metadata tags like owner name and sensitivity level from a reference table, which helps ensure that all sensitive data is tagged with owner information for compliance. This automated tagging process helps provide data governance and compliance by identifying and labeling sensitive data with the appropriate owner information. |
| Data sourcing and consumption are governed and supported by automation | Data Catalog classifies data assets by tagging them with an |
| Data sovereignty and cross-border data movement are managed | Organization Policy Service defines permitted storage regions for data assets and Access Context Manager restricts access based on user location. Data Catalog stores the approved storage locations as metadata tags. Report Engine compares these tags against the actual location of the data assets in BigQuery and publishes any discrepancies as findings using Pub/Sub. Security Command Center provides an additional layer of monitoring by generating vulnerability findings if data is stored or accessed outside the defined policies. |
| Data catalogs are implemented and kept current | Data Catalog stores and updates the technical metadata for all BigQuery data assets, effectively creating a continuously synchronized data catalog. Data Catalog ensures that any new or modified tables and views are immediately added to the catalog, maintaining an up-to-date inventory of data assets. |
| Data classifications are defined and used | Sensitive Data Protection inspects BigQuery data and identifies sensitive information types. These findings are then ranked based on a classification reference table, and the highest sensitivity level is assigned as a tag in Data Catalog at the column and table levels. Tag Engine manages this process by updating Data Catalog with sensitivity tags whenever new data assets are added or existing ones are modified. This process ensures a constantly updated classification of data based on sensitivity, which you can monitor and report on using Pub/Sub and integrated reporting tools. |
| Data entitlements are managed, enforced, and tracked | BigQuery policy tags control access to sensitive data at the column level, ensuring only authorized users can access specific data based on their assigned policy tag. IAM manages overall access to the data warehouse, while Data Catalog stores sensitivity classifications. Regular checks are performed to ensure all sensitive data has corresponding policy tags, with any discrepancies reported using Pub/Sub for remediation. |
| Data consumption purposes are managed | Data sharing agreements for both providers and consumers are stored in a dedicated BigQuery data warehouse to control consumption purposes. Data Catalog labels data assets with the provider agreement information, while consumer agreements are linked to IAM bindings for access control. Query labels enforce consumption purposes, requiring consumers to specify a valid purpose when querying sensitive data, which is validated against their entitlements in BigQuery. An audit trail in BigQuery tracks all data access and ensures compliance with the data sharing agreements. |
| Data is secured and controls are evidenced | Google's default encryption at rest helps protect data that is stored on disk. Cloud KMS supports customer-managed encryption keys (CMEK) for enhanced key management. BigQuery implements column-level dynamic data masking for de-identification and supports application-level de-identification during data ingestion. Data Catalog stores metadata tags for encryption and de-identification techniques that are applied to data assets. Automated checks help ensure that the encryption and de-identification methods align with predefined security policies, with any discrepancies reported as findings using Pub/Sub. |
| Data protection impact assessments are managed | Data Catalog tags sensitive data assets with relevant information for impact assessment, such as subject location and assessment report links. Tag Engine applies these tags based on data sensitivity and a policy table in BigQuery, which defines the assessment requirements based on data and subject residency. This automated tagging process allows for continuous monitoring and reporting of compliance with impact assessment requirements, ensuring that data protection impact assessments (DPIAs) or privacy impact assessments (PIAs) are conducted when necessary. |
| The data lifecycle is planned and managed | Data Catalog labels data assets with retention policies, specifying retention periods and expiration actions (such as archive or purge). Record Manager automates the enforcement of these policies by purging or archiving BigQuery tables based on the defined tags. This enforcement ensures adherence to the data lifecycle policies and maintains compliance with data retention requirements, with any discrepancies detected and reported using Pub/Sub. |
| Data quality is managed | The Cloud Data Quality Engine defines and runs data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. Results from these checks, including success percentages and thresholds, are stored as tags in Data Catalog. Storing these results allows for continuous monitoring and reporting of data quality, with any issues or deviations from acceptable thresholds published as findings using Pub/Sub. |
| Cost metrics are tracked for data assets | Data Catalog stores cost-related metrics for data assets, such as query costs, storage costs, and data egress costs, which are calculated using billing information exported from Cloud Billing to BigQuery. Storing cost-related metrics allows for comprehensive cost tracking and analysis, ensuring adherence to cost policies and efficient resource utilization, with any anomalies reported using Pub/Sub. |
| Data provenance and lineage are understood | Data Catalog's built-in data lineage features track the provenance and lineage of data assets, visually representing the flow of data. Additionally, data ingestion scripts identify and tag the original source of the data in Data Catalog, enhancing the traceability of data back to its origin. |
Data access management
The architecture's access to data is controlled through an independent process which separates operational control (for example, running Dataflow jobs) from data access control. A user's access to a Google Cloud service is defined by an environmental or operational concern and is provisioned and approved by a cloud engineering group. A user's access to Google Cloud data assets (for example, a BigQuery table) is a privacy, regulatory, or governance concern and is subject to an access agreement between the producing and consuming parties and controlled through the following processes. The following diagram shows how data access is provisioned through the interaction of different software components.
As shown in the previous diagram, onboarding for data access is handled by the following processes:
- Cloud data assets are collected and inventoried by Data Catalog.
- The workflow manager retrieves the data assets from Data Catalog.
- Data owners are onboarded to the workflow manager.
The operation of the data access management is as follows:
- A data consumer makes a request for a specific asset.
- The data owner of the asset is alerted to the request.
- The data owner approves or rejects the request.
- If the request is approved, the workflow manager passes the group, asset, and associated tag to the IAM mapper.
- The IAM mapper translates the workflow manager tags into IAM permissions, and gives the specified group IAM permissions for the data asset (see the sketch that follows this list).
- When a user wants to access the data asset, IAM evaluates access to the Google Cloud asset based on the permissions of the group.
- If permitted, the user accesses the data asset.
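The end result of the IAM mapper step is an IAM binding on the data asset. A minimal, hypothetical Terraform sketch of such a binding on a BigQuery dataset might look like the following; the project, dataset, role, and group are placeholders, and the actual architecture applies these bindings programmatically through the workflow manager rather than through static configuration.

```
# Hypothetical sketch: grant a consumer group read access to a shared dataset.
resource "google_bigquery_dataset_iam_member" "consumer_read_access" {
  project    = "example-non-confidential-prj"        # placeholder project
  dataset_id = "example_shared_dataset"              # placeholder dataset
  role       = "roles/bigquery.dataViewer"
  member     = "group:consumer-analytics@example.com"
}
```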
Networking
The data security process starts at the source application, which might reside on-premises or in another environment external to the target Google Cloud project. Before any network transfer occurs, this application uses Workload Identity Federation to securely authenticate itself to Google Cloud APIs. Using these credentials, it interacts with Cloud KMS to obtain or wrap the necessary keys and then employs the Tink library to perform initial encryption and de-identification on the sensitive data payload according to predefined templates.
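A hypothetical Terraform sketch of the Workload Identity Federation setup that such a source application could authenticate against is shown below. The pool ID, provider ID, issuer URI, and attribute mapping are placeholders rather than the values used by the deployed architecture.

```
# Hypothetical sketch: a workload identity pool and OIDC provider that an
# external application can use to authenticate to Google Cloud APIs without
# service account keys. All identifiers and the issuer URI are placeholders.
resource "google_iam_workload_identity_pool" "external_apps" {
  workload_identity_pool_id = "external-data-producers"
  display_name              = "External data producers"
}

resource "google_iam_workload_identity_pool_provider" "oidc" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.external_apps.workload_identity_pool_id
  workload_identity_pool_provider_id = "on-prem-oidc"
  display_name                       = "On-premises OIDC provider"

  attribute_mapping = {
    "google.subject" = "assertion.sub"
  }

  oidc {
    issuer_uri = "https://idp.example.com"   # placeholder issuer
  }
}
```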
After the data payload is protected, the payload must be securely transferred into the Google Cloud ingestion project. For on-premises applications, you can use Cloud Interconnect or potentially Cloud VPN. Within the Google Cloud network, use Private Service Connect to route the data towards the ingestion endpoint within the target project's VPC network. Private Service Connect lets the source application connect to Google APIs using private IP addresses, ensuring traffic isn't exposed to the internet.
The entire network path and the target ingestion services (Cloud Storage, BigQuery, and Pub/Sub) within the ingestion project are secured by a VPC Service Controls perimeter. This perimeter enforces a security boundary, ensuring that the protected data originating from the source can only be ingested into the authorized Google Cloud services within that specific project.
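The following is a minimal, hypothetical Terraform sketch of such a perimeter around an ingestion project. The access policy ID, project number, and restricted services are placeholders; the blueprint's actual perimeter configuration is more extensive.

```
# Hypothetical sketch: a VPC Service Controls perimeter that restricts the
# ingestion services in a single project. The policy ID, project number, and
# list of restricted services are placeholders.
resource "google_access_context_manager_service_perimeter" "ingestion" {
  parent = "accessPolicies/0123456789"                        # placeholder access policy
  name   = "accessPolicies/0123456789/servicePerimeters/data_ingestion"
  title  = "data_ingestion"

  status {
    resources = ["projects/111111111111"]                     # placeholder project number
    restricted_services = [
      "storage.googleapis.com",
      "bigquery.googleapis.com",
      "pubsub.googleapis.com",
    ]
  }
}
```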
Logging
This architecture uses the Cloud Logging capabilities that are provided by the enterprise foundations blueprint.
Pipelines
The enterprise data mesh architecture uses a series of pipelines to provision the infrastructure, orchestration, data sets, data pipelines, and application components. The architecture's resource deployment pipelines use Terraform as the infrastructure as code (IaC) tool and Cloud Build as the CI/CD service to deploy the Terraform configurations into the architecture environment. The following diagram shows the relationship between the pipelines.
The foundation pipeline and the infrastructure pipeline are part of the enterprise foundations blueprint. The following table describes the purpose of the pipelines and the resources that they provision.
| Pipeline | Provisioned by | Resources |
|---|---|---|
| Foundation pipeline | Bootstrap | |
| Infrastructure pipeline | Foundation pipeline | |
| Service Catalog pipeline | Infrastructure pipeline | |
| Artifact pipelines | Infrastructure pipeline | Artifact pipelines produce the various containers and other components of the codebase used by the data mesh. |
Each pipeline has its own set of repositories that it pulls code and configuration files from. Each repository has a separation of duties in which the submitters and approvers of operational code deployments are the responsibilities of different groups.
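As an illustration of how a pipeline can be wired to its repository, the following hypothetical Terraform sketch defines a Cloud Build trigger that runs a build configuration on pushes to a branch. The project, repository, branch pattern, and file name are placeholders, not the triggers created by the blueprint.

```
# Hypothetical sketch: a Cloud Build trigger that runs the pipeline's build
# configuration whenever code is pushed to the main branch of a repository.
# Project, repository, and file names are placeholders.
resource "google_cloudbuild_trigger" "data_mesh_pipeline" {
  project = "example-cicd-prj"                  # placeholder project
  name    = "data-mesh-terraform-apply"

  trigger_template {
    repo_name   = "example-data-mesh-repo"      # placeholder Cloud Source repository
    branch_name = "^main$"
  }

  filename = "cloudbuild.yaml"                  # build configuration in the repository
}
```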
Interactive deployment through Service Catalog
The interactive environment is the development environment within the architecture and exists under the development folder. The main interface for the interactive environment is Service Catalog, which lets developers use preconfigured templates to instantiate Google services. These preconfigured templates are known as service templates. Service templates help you enforce your security posture, such as making CMEK encryption mandatory, and also prevent your users from having direct access to Google APIs. A sketch of such a template appears after the deployment steps that follow.
The following diagram shows the components of the interactive environment and how data scientists deploy resources.
To deploy resources using the Service Catalog, the following steps occur:
- The MLOps engineer puts a Terraform resource template for Google Cloud into a Git repository.
- The Git commit command triggers a Cloud Build pipeline.
- Cloud Build copies the template and any associated configuration files to Cloud Storage.
- The MLOps engineer sets up the Service Catalog solutions and Service Catalog manually. The engineer then shares the Service Catalog with a service project in the interactive environment.
- The data scientist selects a resource from the Service Catalog.
- Service Catalog deploys the template into the interactive environment.
- The resource pulls any necessary configuration scripts.
- The data scientist interacts with the resources.
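The following hypothetical Terraform sketch shows the kind of service template that could sit behind a Service Catalog solution: a Cloud Storage bucket that always uses a customer-managed encryption key. The project variable, naming scheme, location, and key reference are placeholders, and the templates shipped with the architecture differ in detail.

```
# Hypothetical sketch of a service template: a Cloud Storage bucket that
# enforces CMEK encryption. The variables stand in for values supplied when a
# data scientist deploys the solution from Service Catalog.
variable "project_id" {
  type = string
}

variable "kms_key_name" {
  type        = string
  description = "Full resource name of the CMEK key to use."
}

resource "google_storage_bucket" "interactive_workspace" {
  project                     = var.project_id
  name                        = "${var.project_id}-interactive-workspace"  # placeholder naming scheme
  location                    = "US"                                       # placeholder location
  uniform_bucket_level_access = true

  encryption {
    default_kms_key_name = var.kms_key_name   # CMEK is mandatory in this template
  }
}
```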
Artifact pipelines
The data ingestion process uses Cloud Composer and Dataflow to orchestrate the movement and transformation of data within the data domain. The artifact pipeline builds all necessary resources for data ingestion and moves the resources to the appropriate location for the services to access them. The artifact pipeline creates the container artifacts that the orchestrator uses.
Security controls
The enterprise data mesh architecture uses a layered defense-in-depth security model that includes default Google Cloud capabilities, Google Cloud services, and security capabilities that are configured through the enterprise foundations blueprint. The following diagram shows the layering of the various security controls for the architecture.
The following table describes the security controls that are associated with the resources in each layer.
| Layer | Resource | Security control |
|---|---|---|
| CDMC framework | Google Cloud CDMC implementation | Provides a governance framework that helps secure, manage, and control your data assets. For more information, see the CDMC Key Controls Framework. |
| Deployment | Infrastructure pipeline | Provides a series of pipelines that deploy infrastructure, build containers, and create data pipelines. The use of pipelines allows for auditability, traceability, and repeatability. |
| | Artifact pipeline | Deploys various components not deployed by the infrastructure pipeline. |
| | Terraform templates | Builds out the system infrastructure. |
| | Open Policy Agent | Helps ensure that the platform conforms to selected policies. |
| Network | Private Service Connect | Provides data exfiltration protections around the architecture resources at the API layer and the IP layer. Lets you communicate with Google Cloud APIs using private IP addresses so that you can avoid exposing traffic to the internet. |
| | VPC network with private IP addresses | Helps remove exposure to internet-facing threats. |
| | VPC Service Controls | Helps protect sensitive resources against data exfiltration. |
| | Firewall | Helps protect the VPC network against unauthorized access. |
| Access management | Access Context Manager | Controls who can access what resources and helps prevent unauthorized use of your resources. |
| | Workload Identity Federation | Removes the need for external credentials to transfer data onto the platform from on-premises environments. |
| | Data Catalog | Provides an index of assets available to users. |
| | IAM | Provides fine-grained access. |
| Encryption | Cloud KMS | Lets you manage your encryption keys and secrets, and helps protect your data through encryption at rest and encryption in transit. |
| | Secret Manager | Provides a secret store for pipelines that is controlled by IAM. |
| | Encryption at rest | By default, Google Cloud encrypts data at rest. |
| | Encryption in transit | By default, Google Cloud encrypts data in transit. |
| Detective | Security Command Center | Helps you detect misconfigurations and malicious activity in your Google Cloud organization. |
| | Continuous architecture | Continually checks your Google Cloud organization against a series of OPA policies that you have defined. |
| | IAM Recommender | Analyzes user permissions and provides suggestions about reducing permissions to help enforce the principle of least privilege. |
| | Firewall Insights | Analyzes firewall rules, identifies overly permissive firewall rules, and suggests more restrictive firewalls to help strengthen your overall security posture. |
| | Cloud Logging | Provides visibility into system activity and helps enable the detection of anomalies and malicious activity. |
| | Cloud Monitoring | Tracks key signals and events that can help identify suspicious activity. |
| Preventative | Organization Policy | Lets you control and restrict actions within your Google Cloud organization. |
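As an example of the preventative layer, the following hypothetical Terraform sketch applies the resource locations constraint at the organization level. The organization ID and the allowed value group are placeholders, and the policies that the blueprint actually sets are broader than this single constraint.

```
# Hypothetical sketch: restrict where resources can be created by applying the
# gcp.resourceLocations organization policy constraint. The organization ID and
# the allowed value group are placeholders.
resource "google_organization_policy" "resource_locations" {
  org_id     = "123456789012"                  # placeholder organization ID
  constraint = "gcp.resourceLocations"

  list_policy {
    allow {
      values = ["in:us-locations"]             # example value group
    }
  }
}
```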
Workflows
The following sections outline the data producer workflow and the data consumer workflow, which help ensure appropriate access controls based on data sensitivity and user roles.
Data producer workflow
The following diagram shows how data is protected as it is transferred to BigQuery.
The workflow for data transfer is the following:
- An application that is integrated with Workload Identity Federation usesCloud KMS to decrypt a wrapped encryption key.
- The application uses the Tink library to de-identify or encrypt the datausing a template.
- The application transfers data to the ingestion project in Google Cloud.
- The data arrives in Cloud Storage, BigQuery, orPub/Sub.
- In the ingestion project, the data is decrypted or re-identified using atemplate.
- The decrypted data is encrypted or masked based on another de-identification template, then placed in the non-confidential project. Tags are applied by the tagging engine as appropriate.
- Data from the non-confidential project is transferred over to the confidential project and re-identified.
The following data access is permitted:
- Users who have access to the confidential project can access all the raw plaintext data.
- Users who have access to the non-confidential project can access masked, tokenized, or encrypted data based on the tags associated with the data and their permissions.
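Step 1 of this workflow relies on a Cloud KMS key that the source application uses to unwrap its data encryption key. A minimal, hypothetical Terraform sketch of such a key is shown below; the project, location, names, and rotation period are placeholders.

```
# Hypothetical sketch: a key ring and key used to wrap and unwrap the data
# encryption key that protects ingested data. Names and location are placeholders.
resource "google_kms_key_ring" "ingestion" {
  project  = "example-data-domain-ingest-prj"   # placeholder project
  name     = "data-ingestion"
  location = "us-central1"
}

resource "google_kms_crypto_key" "wrapping_key" {
  name            = "tink-key-wrapping"
  key_ring        = google_kms_key_ring.ingestion.id
  rotation_period = "7776000s"                  # example: rotate every 90 days
}
```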
Data consumer workflow
The following steps describe how a consumer can access data that is stored in BigQuery.
- The data consumer searches for data assets using Data Catalog.
- After the consumer finds the assets that they are looking for, the data consumer requests access to the data assets.
- The data owner decides whether to provide access to the assets.
- If the consumer obtains access, the consumer can use a notebook and the Solution Catalog to create an environment in which they can analyze and transform the data assets.
Bringing it all together
The GitHub repository provides you with detailed instructions on deploying the data mesh on Google Cloud after you have deployed the enterprise foundation. The process to deploy the architecture involves modifying your existing infrastructure repositories and deploying new data mesh-specific components.
Complete the following:
- Complete all prerequisites, including the following:
- Install the Google Cloud CLI, Terraform, Tink, Java, and Go.
- Deploy the enterprise foundations blueprint (v4.1).
- Maintain the following local repositories:
    - gcp-data-mesh-foundations
    - gcp-bootstrap
    - gcp-environments
    - gcp-networks
    - gcp-org
    - gcp-projects
- Modify the existing foundation blueprint and then deploy the data mesh applications. For each item, complete the following:
    - In your target repository, check out the Plan branch.
    - To add data mesh components, copy the relevant files and directories from gcp-data-mesh-foundations into the appropriate foundation directory. Overwrite files when required.
    - Update the data mesh variables, roles, and settings in the Terraform files (for example, *.tfvars and *.tf). Set the GitHub tokens as environment variables.
    - Perform the Terraform initialize, plan, and apply operations on each repository.
    - Commit your changes, push the code to your remote repository, create pull requests, and merge to your development, nonproduction, and production environments.
What's next
- Read about the architecture and functions in a data mesh.
- Import data from Google Cloud into a secured BigQuery data warehouse.
- Implement the CDMC key controls framework in a BigQuery data warehouse.
- Read about the enterprise foundations blueprint.