Deploy an enterprise data management and analytics platform
An enterprise data management and analytics platform provides an enclave where you can store, analyze, and manipulate sensitive information while maintaining security controls. You can use the enterprise data mesh architecture to deploy a platform on Google Cloud for data management and analytics. The architecture is designed to work in a hybrid environment, where Google Cloud components interact with your existing on-premises components and operating processes.
The enterprise data mesh architecture includes the following:
- A GitHub repository that contains a set of Terraform configurations, scripts, and code to build the following:
    - A governance project that lets you use Google's implementation of the Cloud Data Management Capabilities (CDMC) Key Controls Framework.
    - A data platform example that supports interactive and production workflows.
    - A producer environment within the data platform that supports multiple data domains. Data domains are logical groupings of data elements.
    - A consumer environment within the data platform that supports multiple consumer projects.
    - A data transfer service that uses Workload Identity Federation and the Tink encryption library to help you transfer data into Google Cloud in a secure manner.
    - A data domain example that contains ingestion, non-confidential, and confidential projects.
    - An example of a data access system that lets data consumers request access to data sets and data owners grant access to those data sets. The example also includes a workflow manager that changes the IAM permissions of those data sets accordingly.
- A guide to the architecture, design, security controls, and operational processes that you implement with this architecture (this document).
The enterprise data mesh architecture is designed to be compatible with the enterprise foundations blueprint. The enterprise foundations blueprint provides a number of base-level services that this architecture relies on, such as VPC networks and logging. You can deploy this architecture without deploying the enterprise foundations blueprint if your Google Cloud environment provides the necessary functionality.
This document is intended for cloud architects, data scientists, data engineers, and security architects who can use the architecture to build and deploy comprehensive data services on Google Cloud. This document assumes that you are familiar with the concepts of data meshes, Google Cloud data services, and the Google Cloud implementation of the CDMC framework.
Architecture
The enterprise data mesh architecture takes a layered approach to provide the capabilities that enable data ingestion, data processing, and governance. The architecture is intended to be deployed and controlled through a CI/CD workflow. The following diagram shows how the data layer that is deployed by this architecture relates to other layers in your environment.
This diagram includes the following:
- Google Cloud infrastructure provides security capabilities such as encryption at rest and encryption in transit, as well as basic building blocks such as compute and storage.
- The enterprise foundation provides a baseline of resources such as identity, networking, logging, monitoring, and deployment systems that enable you to adopt Google Cloud for your data workloads.
- The data layer provides various capabilities such as data ingestion, data storage, data access control, data governance, data monitoring, and data sharing.
- The application layer represents the various applications that use the data layer assets.
- CI/CD provides the tools to automate the provisioning, configuration, management, and deployment of infrastructure, workflows, and software components. These components help you ensure consistent, reliable, and auditable deployments; minimize manual errors; and accelerate the overall development cycle.
To show how the data environment is used, the architecture includes a sample data workflow. The sample data workflow takes you through the following processes: data governance, data ingestion, data processing, data sharing, and data consumption.
Key architectural decisions
The following table summarizes the high-level decisions of the architecture.
| Decision area | Decision |
|---|---|
| Google Cloud architecture | |
| Resource hierarchy | The architecture uses the resource hierarchy from the enterprise foundations blueprint. |
| Networking | The architecture includes an example data transfer service that uses Workload Identity Federation and the Tink encryption library. |
| Roles and IAM permissions | The architecture includes segmented data producer roles, data consumer roles, data governance roles, and data platform roles. |
| Common data services | |
| Metadata | The architecture uses Data Catalog to manage data metadata. |
| Central policy management | To manage policies, the architecture uses Google Cloud's implementation of the CDMC framework. |
| Data access management | To control access to data, the architecture includes an independent process that requires data consumers to request access to data assets from the data owner. |
| Data quality | The architecture uses the Cloud Data Quality Engine to define and run data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. |
| Data security | The architecture uses tagging, encryption, masking, tokenization, and IAM controls to provide data security. |
| Data domain | |
| Data environments | The architecture includes three environments. Two environments (non-production and production) are operational environments that are driven by pipelines. One environment (development) is an interactive environment. |
| Data owners | Data owners ingest, process, expose, and grant access to data assets. |
| Data consumers | Data consumers request access to data assets. |
| Onboarding and operations | |
| Pipelines | The architecture uses the foundation pipeline, the infrastructure pipeline, the Service Catalog pipeline, and the artifact pipelines to deploy resources. |
| Repositories | Each pipeline uses a separate repository to enable segregation of responsibility. |
| Process flow | The process requires that changes to the production environment include a submitter and an approver. |
| Cloud operations | |
| Data product scorecards | The Report Engine generates data product scorecards. |
| Cloud Logging | The architecture uses the logging infrastructure from the enterprise foundations blueprint. |
| Cloud Monitoring | The architecture uses the monitoring infrastructure from the enterprise foundations blueprint. |
Identity: Mapping roles to groups
The data mesh uses the existing identity lifecycle management, authorization, and authentication architecture of the enterprise foundations blueprint. Users are not assigned roles directly; instead, groups are the primary method of assigning roles and permissions in IAM. IAM roles and permissions are assigned during project creation through the foundation pipeline.
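For illustration, the following is a minimal Terraform sketch of a group-based role grant at the project level. The project ID, group address, and role are hypothetical placeholders; the actual bindings are defined by the foundation pipeline, not by this snippet.

```
# Hypothetical example: grant a data-domain developer group a role on its
# ingestion project. The group, project, and role values are placeholders.
resource "google_project_iam_member" "domain_developers_bq_editor" {
  project = "example-data-domain-ingest-prj"      # placeholder project ID
  role    = "roles/bigquery.dataEditor"           # example role only
  member  = "group:data-domain-developers@example.com"
}
```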
The data mesh associates groups with one of four key areas: infrastructure, data governance, domain-based data producers, and domain-based consumers.
The permission scopes for these groups are the following:
- The infrastructure group's permission scope is the data mesh as a whole.
- The data governance groups' permission scope is the data governance project.
- Domain-based producer and consumer permissions are scoped to their data domain.
The following tables show the various roles used in this data mesh implementation and their associated permissions.
Infrastructure
| Group | Description | Roles |
|---|---|---|
| | Overall administrators of the data mesh | |
Data governance
| Group | Description | Roles |
|---|---|---|
| | Administrators of the data governance project | |
| | Developers who build and maintain the data governance components | Multiple roles on the data governance project, including |
| | Readers of data governance information | |
| | Security administrators of the governance project | |
| | Group with permission to use tag templates | |
| | Group with permission to use tag templates and add tags | |
| | Service account group for Security Command Center notifications | None. This is a group for membership, and a service account is created with this name, which has the necessary permissions. |
Domain-based data producers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific data domain | |
| | Developers who build and maintain data products within a data domain | Multiple roles on the data domain project, including |
| | Readers of the data domain information | |
| | Editors of Data Catalog entries | Roles to edit Data Catalog entries |
| | Data stewards for the data domain | Roles to manage metadata and data governance aspects |
Domain-based data consumers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific consumer project | |
| | Developers working within a consumer project | Multiple roles on the consumer project, including |
| | Readers of the consumer project information | |
Organization structure
To differentiate between production operations and production data, the architecture uses different environments to develop and release workflows. Production operations include the governance, traceability, and repeatability of a workflow and the auditability of the results of the workflow. Production data refers to possibly sensitive data that you need to run your organization. All environments are designed to have security controls that let you ingest and operate your data.
To help data scientists and engineers, the architecture includes an interactive environment, where developers can work with the environment directly and add services through a curated catalog of solutions. Operational environments are driven through pipelines, which have codified architecture and configuration.
This architecture uses the organizational structure of the enterprise foundations blueprint as a basis for deploying data workloads. The following diagram shows the top-level folders and projects used in the enterprise data mesh architecture.
The following table describes the top-level folders and projects that are part of the architecture.
| Folder | Component | Description |
|---|---|---|
| Common | | Contains the deployment pipeline that's used to build out the code artifacts of the architecture. |
| Common | | Contains the infrastructure used by the Service Catalog to deploy resources in the interactive environment. |
| Common | | Contains all the resources used by Google Cloud's implementation of the CDMC framework. |
| Development | | Contains the projects and resources of the data platform for developing use cases in interactive mode. |
| Non-production | | Contains the projects and resources of the data platform for testing use cases that you want to deploy in an operational environment. |
| Production | | Contains the projects and resources of the data platform for deployment into production. |
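If you are not deploying the enterprise foundations blueprint, you can create equivalent top-level folders yourself. The following is a minimal, hypothetical Terraform sketch of one environment folder; the display name and organization ID are placeholders.

```
# Hypothetical sketch: create a top-level environment folder.
# The organization ID and display name are placeholders.
resource "google_folder" "development" {
  display_name = "development"
  parent       = "organizations/123456789012"   # replace with your organization ID
}
```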
Data platform folder
The data platform folder contains all the data plane components and some of the CDMC resources; the remaining CDMC resources reside in the data governance project. The following diagram shows the folders and projects that are deployed in the data platform folder.
A data platform folder exists for each environment (production, non-production, and development). The following table describes the folders within each data platform folder.
| Folder | Description |
|---|---|
| Producers | Contains the data domains. |
| Consumers | Contains the consumer projects. |
| Data domain | Contains the projects associated with a particular domain. |
Producers folder
Each producers folder includes one or more data domains. A data domain refers to a logical grouping of data elements that share a common meaning, purpose, or business context. Data domains let you categorize and organize data assets within an organization. The following diagram shows the structure of a data domain. The architecture deploys projects in the data platform folder for each environment.
The following table describes the projects that are deployed in the data platform folder for each environment.
| Project | Description |
|---|---|
| Ingestion | The ingestion project ingests data into the data domain. The architecture provides examples of how you can stream data into BigQuery, Cloud Storage, and Pub/Sub. The ingestion project also contains examples of Dataflow and Cloud Composer that you can use to orchestrate the transformation and movement of ingested data. |
| Non-confidential | The non-confidential project contains data that has been de-identified. You can mask, containerize, encrypt, tokenize, or obfuscate data. Use policy tags to control how the data is presented (see the sketch that follows this table). |
| Confidential | The confidential project contains plaintext data. You can control access through IAM permissions. |
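To illustrate the policy tags mentioned for the non-confidential project, the following hypothetical Terraform sketch defines a Data Catalog taxonomy and a policy tag that could be attached to de-identified columns. The project, region, and names are placeholders and are not the taxonomies shipped with the architecture.

```
# Hypothetical sketch: a taxonomy and policy tag for column-level access control.
resource "google_data_catalog_taxonomy" "sensitivity" {
  project                = "example-non-confidential-prj"   # placeholder project
  region                 = "us-central1"                    # placeholder region
  display_name           = "sensitivity"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "confidential" {
  taxonomy     = google_data_catalog_taxonomy.sensitivity.id
  display_name = "confidential"
  description  = "Columns that contain de-identified confidential data."
}
```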
Consumer folder
The consumer folder contains consumer projects. Consumer projects provide a mechanism to segment data users based on their required trust boundary. Each project is assigned to a separate user group, and the group is assigned access to the required data assets on a project-by-project basis. You can use the consumer project to collect, analyze, and augment the data for the group.
Common folder
The common folder contains the services that are used by different environments and projects. This section describes the capabilities that are added to the common folder to enable the enterprise data mesh.
CDMC architecture
The architecture uses the CDMC architecture for data governance. The data governance functions reside in the data governance project in the common folder. The following diagram shows the components of the CDMC architecture. The numbers in the diagram represent the key controls that are addressed with Google Cloud services.
The following table describes the components of the CDMC architecture that the enterprise data mesh architecture uses.
| CDMC component | Google Cloud service | Description |
|---|---|---|
| Access and lifecycle components | | |
| Key management | Cloud KMS | A service that securely manages encryption keys that protect sensitive data. |
| Record Manager | Cloud Run | An application that maintains comprehensive logs and records of data processing activities, ensuring organizations can track and audit data usage. |
| Archiving policy | BigQuery | A BigQuery table that contains the storage policy for data. |
| Entitlements | BigQuery | A BigQuery table that stores information about who can access sensitive data. This table ensures that only authorized users can access specific data based on their roles and privileges. |
| Scanning components | | |
| Data loss | Sensitive Data Protection | Service used to inspect assets for sensitive data. |
| DLP findings | BigQuery | A BigQuery table that catalogs data classifications within the data platform. |
| Policies | BigQuery | A BigQuery table that contains consistent data governance practices (for example, data access types). |
| Billing export | BigQuery | A BigQuery table that stores cost information that is exported from Cloud Billing to enable the analysis of cost metrics that are associated with data assets. |
| Cloud Data Quality Engine | Cloud Run | An application that runs data quality checks for tables and columns. |
| Data quality findings | BigQuery | A BigQuery table that records identified discrepancies between the defined data quality rules and the actual quality of the data assets. |
| Reporting components | | |
| Scheduler | Cloud Scheduler | A service that controls when the Cloud Data Quality Engine runs and when the Sensitive Data Protection inspection occurs. |
| Report Engine | Cloud Run | An application that generates reports that help track and measure adherence to the CDMC framework's controls. |
| Findings and assets | BigQuery and Pub/Sub | A BigQuery report of discrepancies or inconsistencies in data management controls, such as missing tags, incorrect classifications, or non-compliant storage locations. |
| Tag exports | BigQuery | A BigQuery table that contains extracted tag information from Data Catalog. |
| Other components | | |
| Policy management | Organization Policy Service | A service that defines and enforces restrictions on where data can be stored geographically. |
| Attribute-based access policies | Access Context Manager | A service that defines and enforces granular, attribute-based access policies so that only authorized users from permitted locations and devices can access sensitive information. |
| Metadata | Data Catalog | A service that stores metadata information about the tables that are in use in the data mesh. |
| Tag Engine | Cloud Run | An application that adds tags to data in BigQuery tables. |
| CDMC reports | Looker Studio | Dashboards that let your analysts view reports that were generated by the CDMC architecture engines. |
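To illustrate the kind of metadata that Tag Engine manages, the following hypothetical Terraform sketch defines a Data Catalog tag template with owner and sensitivity fields. The template ID, field names, project, and region are placeholders and do not reproduce the templates that ship with the CDMC implementation.

```
# Hypothetical sketch: a tag template for owner and sensitivity metadata.
resource "google_data_catalog_tag_template" "data_ownership" {
  project         = "example-data-governance-prj"   # placeholder project
  region          = "us-central1"                   # placeholder region
  tag_template_id = "data_ownership"
  display_name    = "Data ownership"

  fields {
    field_id     = "owner_email"
    display_name = "Owner email"
    is_required  = true
    type {
      primitive_type = "STRING"
    }
  }

  fields {
    field_id     = "sensitivity_level"
    display_name = "Sensitivity level"
    type {
      primitive_type = "STRING"
    }
  }
}
```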
CDMC implementation
The following table describes how the architecture implements the key controls in the CDMC framework.
| CDMC control requirement | Implementation |
|---|---|
| Data control compliance is monitored | The Report Engine detects non-compliant data assets and publishes findings to a Pub/Sub topic. These findings are also loaded into BigQuery for reporting using Looker Studio. |
| Data ownership is established for both migrated and cloud-generated data | Data Catalog automatically captures technical metadata from BigQuery. Tag Engine applies business metadata tags like owner name and sensitivity level from a reference table, which helps ensure that all sensitive data is tagged with owner information for compliance. This automated tagging process helps provide data governance and compliance by identifying and labeling sensitive data with the appropriate owner information. |
| Data sourcing and consumption are governed and supported by automation | Data Catalog classifies data assets by tagging them with an |
| Data sovereignty and cross-border data movement are managed | Organization Policy Service defines permitted storage regions for data assets and Access Context Manager restricts access based on user location. Data Catalog stores the approved storage locations as metadata tags. Report Engine compares these tags against the actual location of the data assets in BigQuery and publishes any discrepancies as findings using Pub/Sub. Security Command Center provides an additional layer of monitoring by generating vulnerability findings if data is stored or accessed outside the defined policies. |
| Data catalogs are implemented and kept current | Data Catalog stores and updates the technical metadata for all BigQuery data assets, effectively creating a continuously synchronized data catalog. Data Catalog ensures that any new or modified tables and views are immediately added to the catalog, maintaining an up-to-date inventory of data assets. |
| Data classifications are defined and used | Sensitive Data Protection inspects BigQuery data and identifies sensitive information types. These findings are then ranked based on a classification reference table, and the highest sensitivity level is assigned as a tag in Data Catalog at the column and table levels. Tag Engine manages this process by updating Data Catalog with sensitivity tags whenever new data assets are added or existing ones are modified. This process ensures a constantly updated classification of data based on sensitivity, which you can monitor and report on using Pub/Sub and integrated reporting tools. |
| Data entitlements are managed, enforced, and tracked | BigQuery policy tags control access to sensitive data at the column level, ensuring only authorized users can access specific data based on their assigned policy tag. IAM manages overall access to the data warehouse, while Data Catalog stores sensitivity classifications. Regular checks are performed to ensure all sensitive data has corresponding policy tags, with any discrepancies reported using Pub/Sub for remediation. |
| Data consumption purposes are managed | Data sharing agreements for both providers and consumers are stored in a dedicated BigQuery data warehouse to control consumption purposes. Data Catalog labels data assets with the provider agreement information, while consumer agreements are linked to IAM bindings for access control. Query labels enforce consumption purposes, requiring consumers to specify a valid purpose when querying sensitive data, which is validated against their entitlements in BigQuery. An audit trail in BigQuery tracks all data access and ensures compliance with the data sharing agreements. |
| Data is secured and controls are evidenced | Google's default encryption at rest helps protect data that is stored on disk. Cloud KMS supports customer-managed encryption keys (CMEK) for enhanced key management. BigQuery implements column-level dynamic data masking for de-identification and supports application-level de-identification during data ingestion. Data Catalog stores metadata tags for encryption and de-identification techniques that are applied to data assets. Automated checks help ensure that the encryption and de-identification methods align with predefined security policies, with any discrepancies reported as findings using Pub/Sub. |
| Data protection impact assessments are managed | Data Catalog tags sensitive data assets with relevant information for impact assessment, such as subject location and assessment report links. Tag Engine applies these tags based on data sensitivity and a policy table in BigQuery, which defines the assessment requirements based on data and subject residency. This automated tagging process allows for continuous monitoring and reporting of compliance with impact assessment requirements, ensuring that data protection impact assessments (DPIAs) or privacy impact assessments (PIAs) are conducted when necessary. |
| The data lifecycle is planned and managed | Data Catalog labels data assets with retention policies, specifying retention periods and expiration actions (such as archive or purge). Record Manager automates the enforcement of these policies by purging or archiving BigQuery tables based on the defined tags. This enforcement ensures adherence to the data lifecycle policies and maintains compliance with data retention requirements, with any discrepancies detected and reported using Pub/Sub. |
| Data quality is managed | The Cloud Data Quality Engine defines and runs data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. Results from these checks, including success percentages and thresholds, are stored as tags in Data Catalog. Storing these results allows for continuous monitoring and reporting of data quality, with any issues or deviations from acceptable thresholds published as findings using Pub/Sub. |
| Cost metrics are tracked for data assets | Data Catalog stores cost-related metrics for data assets, such as query costs, storage costs, and data egress costs, which are calculated using billing information exported from Cloud Billing to BigQuery. Storing cost-related metrics allows for comprehensive cost tracking and analysis, ensuring adherence to cost policies and efficient resource utilization, with any anomalies reported using Pub/Sub. |
| Data provenance and lineage are understood | Data Catalog's built-in data lineage features track the provenance and lineage of data assets, visually representing the flow of data. Additionally, data ingestion scripts identify and tag the original source of the data in Data Catalog, enhancing the traceability of data back to its origin. |
Data access management
The architecture's access to data is controlled through an independent process which separates operational control (for example, running Dataflow jobs) from data access control. A user's access to a Google Cloud service is defined by an environmental or operational concern and is provisioned and approved by a cloud engineering group. A user's access to Google Cloud data assets (for example, a BigQuery table) is a privacy, regulatory, or governance concern and is subject to an access agreement between the producing and consuming parties and controlled through the following processes. The following diagram shows how data access is provisioned through the interaction of different software components.
As shown in the previous diagram, onboarding for data access is handled by the following processes:
- Cloud data assets are collected and inventoried by Data Catalog.
- The workflow manager retrieves the data assets from Data Catalog.
- Data owners are onboarded to the workflow manager.
The operation of the data access management is as follows:
- A data consumer makes a request for a specific asset.
- The data owner of the asset is alerted to the request.
- The data owner approves or rejects the request.
- If the request is approved, the workflow manager passes the group, asset, and associated tag to the IAM mapper.
- The IAM mapper translates the workflow manager tags into IAM permissions, and gives the specified group IAM permissions for the data asset (see the sketch that follows this list).
- When a user wants to access the data asset, IAM evaluates access to the Google Cloud asset based on the permissions of the group.
- If permitted, the user accesses the data asset.
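The end result of the IAM mapper step is an IAM binding on the data asset. A minimal, hypothetical Terraform sketch of such a binding on a BigQuery dataset might look like the following; the project, dataset, role, and group are placeholders, and the actual architecture applies these bindings programmatically through the workflow manager rather than through static configuration.

```
# Hypothetical sketch: grant a consumer group read access to a shared dataset.
resource "google_bigquery_dataset_iam_member" "consumer_read_access" {
  project    = "example-non-confidential-prj"        # placeholder project
  dataset_id = "example_shared_dataset"              # placeholder dataset
  role       = "roles/bigquery.dataViewer"
  member     = "group:consumer-analytics@example.com"
}
```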
Networking
The data security process starts at the source application, which might reside on-premises or in another environment external to the target Google Cloud project. Before any network transfer occurs, this application uses Workload Identity Federation to securely authenticate itself to Google Cloud APIs. Using these credentials, it interacts with Cloud KMS to obtain or wrap the necessary keys and then employs the Tink library to perform initial encryption and de-identification on the sensitive data payload according to predefined templates.
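A hypothetical Terraform sketch of the Workload Identity Federation setup that such a source application could authenticate against is shown below. The pool ID, provider ID, issuer URI, and attribute mapping are placeholders rather than the values used by the deployed architecture.

```
# Hypothetical sketch: a workload identity pool and OIDC provider that an
# external application can use to authenticate to Google Cloud APIs without
# service account keys. All identifiers and the issuer URI are placeholders.
resource "google_iam_workload_identity_pool" "external_apps" {
  workload_identity_pool_id = "external-data-producers"
  display_name              = "External data producers"
}

resource "google_iam_workload_identity_pool_provider" "oidc" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.external_apps.workload_identity_pool_id
  workload_identity_pool_provider_id = "on-prem-oidc"
  display_name                       = "On-premises OIDC provider"

  attribute_mapping = {
    "google.subject" = "assertion.sub"
  }

  oidc {
    issuer_uri = "https://idp.example.com"   # placeholder issuer
  }
}
```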
After the data payload is protected, the payload must be securely transferred into the Google Cloud ingestion project. For on-premises applications, you can use Cloud Interconnect or potentially Cloud VPN. Within the Google Cloud network, use Private Service Connect to route the data towards the ingestion endpoint within the target project's VPC network. Private Service Connect lets the source application connect to Google APIs using private IP addresses, ensuring traffic isn't exposed to the internet.
The entire network path and the target ingestion services (Cloud Storage, BigQuery, and Pub/Sub) within the ingestion project are secured by a VPC Service Controls perimeter. This perimeter enforces a security boundary, ensuring that the protected data originating from the source can only be ingested into the authorized Google Cloud services within that specific project.
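The following is a minimal, hypothetical Terraform sketch of such a perimeter around an ingestion project. The access policy ID, project number, and restricted services are placeholders; the blueprint's actual perimeter configuration is more extensive.

```
# Hypothetical sketch: a VPC Service Controls perimeter that restricts the
# ingestion services in a single project. The policy ID, project number, and
# list of restricted services are placeholders.
resource "google_access_context_manager_service_perimeter" "ingestion" {
  parent = "accessPolicies/0123456789"                        # placeholder access policy
  name   = "accessPolicies/0123456789/servicePerimeters/data_ingestion"
  title  = "data_ingestion"

  status {
    resources = ["projects/111111111111"]                     # placeholder project number
    restricted_services = [
      "storage.googleapis.com",
      "bigquery.googleapis.com",
      "pubsub.googleapis.com",
    ]
  }
}
```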
Logging
This architecture uses the Cloud Logging capabilities that are provided by the enterprise foundations blueprint.
Pipelines
The enterprise data mesh architecture uses a series of pipelines to provision the infrastructure, orchestration, data sets, data pipelines, and application components. The architecture's resource deployment pipelines use Terraform as the infrastructure as code (IaC) tool and Cloud Build as the CI/CD service to deploy the Terraform configurations into the architecture environment. The following diagram shows the relationship between the pipelines.
The foundation pipeline and the infrastructure pipeline are part of the enterprise foundations blueprint. The following table describes the purpose of the pipelines and the resources that they provision.
| Pipeline | Provisioned by | Resources |
|---|---|---|
| Foundation pipeline | Bootstrap | |
| Infrastructure pipeline | Foundation pipeline | |
| Service Catalog pipeline | Infrastructure pipeline | |
| Artifact pipelines | Infrastructure pipeline | Artifact pipelines produce the various containers and other components of the codebase used by the data mesh. |
Each pipeline has its own set of repositories that it pulls code and configuration files from. Each repository has a separation of duties in which the submitters and approvers of operational code deployments are the responsibilities of different groups.
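As an illustration of how a pipeline can be wired to its repository, the following hypothetical Terraform sketch defines a Cloud Build trigger that runs a build configuration on pushes to a branch. The project, repository, branch pattern, and file name are placeholders, not the triggers created by the blueprint.

```
# Hypothetical sketch: a Cloud Build trigger that runs the pipeline's build
# configuration whenever code is pushed to the main branch of a repository.
# Project, repository, and file names are placeholders.
resource "google_cloudbuild_trigger" "data_mesh_pipeline" {
  project = "example-cicd-prj"                  # placeholder project
  name    = "data-mesh-terraform-apply"

  trigger_template {
    repo_name   = "example-data-mesh-repo"      # placeholder Cloud Source repository
    branch_name = "^main$"
  }

  filename = "cloudbuild.yaml"                  # build configuration in the repository
}
```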
Interactive deployment through Service Catalog
The interactive environment is the development environment within the architecture and exists under the development folder. The main interface for the interactive environment is Service Catalog, which lets developers use preconfigured templates to instantiate Google services. These preconfigured templates are known as service templates. Service templates help you enforce your security posture, such as making CMEK encryption mandatory, and also prevent your users from having direct access to Google APIs. A sketch of such a template appears after the deployment steps that follow.
The following diagram shows the components of the interactive environment and how data scientists deploy resources.
To deploy resources using the Service Catalog, the following steps occur:
- The MLOps engineer puts a Terraform resource template for Google Cloud into a Git repository.
- The Git commit command triggers a Cloud Build pipeline.
- Cloud Build copies the template and any associated configuration files to Cloud Storage.
- The MLOps engineer sets up the Service Catalog solutions and Service Catalog manually. The engineer then shares the Service Catalog with a service project in the interactive environment.
- The data scientist selects a resource from the Service Catalog.
- Service Catalog deploys the template into the interactive environment.
- The resource pulls any necessary configuration scripts.
- The data scientist interacts with the resources.
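The following hypothetical Terraform sketch shows the kind of service template that could sit behind a Service Catalog solution: a Cloud Storage bucket that always uses a customer-managed encryption key. The project variable, naming scheme, location, and key reference are placeholders, and the templates shipped with the architecture differ in detail.

```
# Hypothetical sketch of a service template: a Cloud Storage bucket that
# enforces CMEK encryption. The variables stand in for values supplied when a
# data scientist deploys the solution from Service Catalog.
variable "project_id" {
  type = string
}

variable "kms_key_name" {
  type        = string
  description = "Full resource name of the CMEK key to use."
}

resource "google_storage_bucket" "interactive_workspace" {
  project                     = var.project_id
  name                        = "${var.project_id}-interactive-workspace"  # placeholder naming scheme
  location                    = "US"                                       # placeholder location
  uniform_bucket_level_access = true

  encryption {
    default_kms_key_name = var.kms_key_name   # CMEK is mandatory in this template
  }
}
```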
Artifact pipelines
The data ingestion process uses Cloud Composer and Dataflow to orchestrate the movement and transformation of data within the data domain. The artifact pipeline builds all necessary resources for data ingestion and moves the resources to the appropriate location for the services to access them. The artifact pipeline creates the container artifacts that the orchestrator uses.
Security controls
The enterprise data mesh architecture uses a layered defense-in-depth security model that includes default Google Cloud capabilities, Google Cloud services, and security capabilities that are configured through the enterprise foundations blueprint. The following diagram shows the layering of the various security controls for the architecture.
The following table describes the security controls that are associated with the resources in each layer.
| Layer | Resource | Security control |
|---|---|---|
| CDMC framework | Google Cloud CDMC implementation | Provides a governance framework that helps secure, manage, and control your data assets. For more information, see the CDMC Key Controls Framework. |
| Deployment | Infrastructure pipeline | Provides a series of pipelines that deploy infrastructure, build containers, and create data pipelines. The use of pipelines allows for auditability, traceability, and repeatability. |
| | Artifact pipeline | Deploys various components not deployed by the infrastructure pipeline. |
| | Terraform templates | Builds out the system infrastructure. |
| | Open Policy Agent | Helps ensure that the platform conforms to selected policies. |
| Network | Private Service Connect | Provides data exfiltration protections around the architecture resources at the API layer and the IP layer. Lets you communicate with Google Cloud APIs using private IP addresses so that you can avoid exposing traffic to the internet. |
| | VPC network with private IP addresses | Helps remove exposure to internet-facing threats. |
| | VPC Service Controls | Helps protect sensitive resources against data exfiltration. |
| | Firewall | Helps protect the VPC network against unauthorized access. |
| Access management | Access Context Manager | Controls who can access what resources and helps prevent unauthorized use of your resources. |
| | Workload Identity Federation | Removes the need for external credentials to transfer data onto the platform from on-premises environments. |
| | Data Catalog | Provides an index of assets available to users. |
| | IAM | Provides fine-grained access. |
| Encryption | Cloud KMS | Lets you manage your encryption keys and secrets, and helps protect your data through encryption at rest and encryption in transit. |
| | Secret Manager | Provides a secret store for pipelines that is controlled by IAM. |
| | Encryption at rest | By default, Google Cloud encrypts data at rest. |
| | Encryption in transit | By default, Google Cloud encrypts data in transit. |
| Detective | Security Command Center | Helps you detect misconfigurations and malicious activity in your Google Cloud organization. |
| | Continuous architecture | Continually checks your Google Cloud organization against a series of OPA policies that you have defined. |
| | IAM Recommender | Analyzes user permissions and provides suggestions about reducing permissions to help enforce the principle of least privilege. |
| | Firewall Insights | Analyzes firewall rules, identifies overly permissive firewall rules, and suggests more restrictive firewalls to help strengthen your overall security posture. |
| | Cloud Logging | Provides visibility into system activity and helps enable the detection of anomalies and malicious activity. |
| | Cloud Monitoring | Tracks key signals and events that can help identify suspicious activity. |
| Preventative | Organization Policy | Lets you control and restrict actions within your Google Cloud organization. |
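As an example of the preventative layer, the following hypothetical Terraform sketch applies the resource locations constraint at the organization level. The organization ID and the allowed value group are placeholders, and the policies that the blueprint actually sets are broader than this single constraint.

```
# Hypothetical sketch: restrict where resources can be created by applying the
# gcp.resourceLocations organization policy constraint. The organization ID and
# the allowed value group are placeholders.
resource "google_organization_policy" "resource_locations" {
  org_id     = "123456789012"                  # placeholder organization ID
  constraint = "gcp.resourceLocations"

  list_policy {
    allow {
      values = ["in:us-locations"]             # example value group
    }
  }
}
```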
Workflows
The following sections outline the data producer workflow and the data consumer workflow, which help ensure appropriate access controls based on data sensitivity and user roles.
Data producer workflow
The following diagram shows how data is protected as it is transferred to BigQuery.
The workflow for data transfer is the following:
- An application that is integrated with Workload Identity Federation usesCloud KMS to decrypt a wrapped encryption key.
- The application uses the Tink library to de-identify or encrypt the datausing a template.
- The application transfers data to the ingestion project in Google Cloud.
- The data arrives in Cloud Storage, BigQuery, orPub/Sub.
- In the ingestion project, the data is decrypted or re-identified using atemplate.
- The decrypted data is encrypted or masked based on another de-identification template, then placed in the non-confidential project. Tags are applied by the tagging engine as appropriate.
- Data from the non-confidential project is transferred over to the confidential project and re-identified.
The following data access is permitted:
- Users who have access to the confidential project can access all the raw plaintext data.
- Users who have access to the non-confidential project can access masked, tokenized, or encrypted data based on the tags associated with the data and their permissions.
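Step 1 of this workflow relies on a Cloud KMS key that the source application uses to unwrap its data encryption key. A minimal, hypothetical Terraform sketch of such a key is shown below; the project, location, names, and rotation period are placeholders.

```
# Hypothetical sketch: a key ring and key used to wrap and unwrap the data
# encryption key that protects ingested data. Names and location are placeholders.
resource "google_kms_key_ring" "ingestion" {
  project  = "example-data-domain-ingest-prj"   # placeholder project
  name     = "data-ingestion"
  location = "us-central1"
}

resource "google_kms_crypto_key" "wrapping_key" {
  name            = "tink-key-wrapping"
  key_ring        = google_kms_key_ring.ingestion.id
  rotation_period = "7776000s"                  # example: rotate every 90 days
}
```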
Data consumer workflow
The following steps describe how a consumer can access data that is stored in BigQuery.
- The data consumer searches for data assets using Data Catalog.
- After the consumer finds the assets that they are looking for, the data consumer requests access to the data assets.
- The data owner decides whether to provide access to the assets.
- If the consumer obtains access, the consumer can use a notebook and the Solution Catalog to create an environment in which they can analyze and transform the data assets.
Bringing it all together
The GitHub repository provides you with detailed instructions on deploying the data mesh on Google Cloud after you have deployed the enterprise foundation. The process to deploy the architecture involves modifying your existing infrastructure repositories and deploying new data mesh-specific components.
Complete the following:
- Complete all prerequisites, including the following:
- Install the Google Cloud CLI, Terraform, Tink, Java, and Go.
- Deploy the enterprise foundations blueprint (v4.1).
- Maintain the following local repositories:
    - gcp-data-mesh-foundations
    - gcp-bootstrap
    - gcp-environments
    - gcp-networks
    - gcp-org
    - gcp-projects
- Modify the existing foundation blueprint and then deploy the data mesh applications. For each item, complete the following:
    - In your target repository, check out the Plan branch.
    - To add data mesh components, copy the relevant files and directories from gcp-data-mesh-foundations into the appropriate foundation directory. Overwrite files when required.
    - Update the data mesh variables, roles, and settings in the Terraform files (for example, *.tfvars and *.tf). Set the GitHub tokens as environment variables.
    - Perform the Terraform initialize, plan, and apply operations on each repository.
    - Commit your changes, push the code to your remote repository, create pull requests, and merge to your development, nonproduction, and production environments.
What's next
- Read about the architecture and functions in a data mesh.
- Import data from Google Cloud into a secured BigQuery data warehouse.
- Implement the CDMC key controls framework in a BigQuery data warehouse.
- Read about the enterprise foundations blueprint.