Import data into a secured BigQuery data warehouse

Last reviewed 2025-06-15 UTC

Many organizations deploy data warehouses that store sensitive data so that theycan analyze the data for a variety of business purposes. This document isintended for data engineers and security administrators who deploy and securedata warehouses using BigQuery. It's part of a blueprint thatincludes the following:

Two GitHub repositories(terraform-google-secured-data-warehouseandterraform-google-secured-data-warehouse-onprem-ingest)that contain Terraform configurations and scripts. The Terraform configurationsets up a Google Cloud environment in Google Cloud for a data warehouse thatstores confidential data.
A guide to the architecture, design, and security controls of this blueprint(this document).
Awalkthroughthat deploys a sample environment.

This document discusses the following:

The architecture and Google Cloud services that you can use to helpsecure a data warehouse in a production environment.
Best practices for importing data into BigQuery from anexternal network such as an on-premises environment.
Best practices for datagovernancewhen creating, deploying, and operating a data warehouse inGoogle Cloud, including the following:
- datade-identification
- differential handling of confidential data
- column-level encryption
- column-level access controls

This document assumes that you have already configured a foundational set ofsecurity controls as described in theenterprise foundationsblueprint. It helpsyou to layer additional controls onto your existing security controls to helpprotect confidential data in a data warehouse.

Data warehouse use cases

The blueprint supports the following use cases:

Use theterraform-google-secured-data-warehouserepository to import data from Google Cloud into aBigQuery data warehouse
Use theterraform-google-secured-data-warehouse-onprem-ingestrepository to import data from an on-premises environment or another cloudinto a BigQuery data warehouse

Overview

Data warehouses such asBigQuery let businessesanalyze their business data for insights. Analysts access the business data thatis stored in data warehouses to create insights. If your data warehouse includesconfidential data, you must take measures to preserve the security,confidentiality, integrity, and availability of the business data while it isstored, while it is in transit, or while it is being analyzed. In thisblueprint, you do the following:

When importing data from external data sources, encrypt your data that'slocated outside of Google Cloud (for example, in an on-premisesenvironment) and import it into Google Cloud.
Configure controls that help secure access to confidential data.
Configure controls that help secure the data pipeline.
Configure an appropriate separation of duties for different personas.
When importing data from other sources located in Google Cloud (alsoknown asinternal data sources), set up templates to find and de-identifyconfidential data.
Set up appropriate security controls and logging to help protect confidentialdata.
Use data classification, policy tags, dynamic data masking, and column-levelencryption to restrict access to specific columns in the data warehouse.

Architecture

To create a confidential data warehouse, you need to import data securely andthen store the data in a VPC Service Controls perimeter.

Architecture when importing data from Google Cloud

The following image shows how ingested data is categorized, de-identified, andstored when you import source data from Google Cloud using theterraform-google-secured-data-warehouse repository. It also shows how you canre-identify confidential data on demand for analysis.

The sensitive data warehouse architecture for internal sources.

Architecture when importing data from external sources

The following image shows how data is ingested and stored when you import datafrom an on-premises environment or another cloud into a BigQuerywarehouse using theterraform-google-secured-data-warehouse-onprem-ingestrepository.

The sensitive data warehouse architecture for external networks.

Google Cloud services and features

The architectures use a combination of the following Google Cloud servicesand features:

Service or feature	Description
BigQuery	Applicable to both internal and external data sources. However, differentstorage options exist, as follows: When importing data from Google Cloud, BigQuery storesthe confidential data in the confidential data perimeter. When importing data from an external source, BigQuery storesthe encrypted data and the wrapped encryption key in separate tables. BigQuery uses various security controls to help protectcontent, includingaccesscontrols,column-levelsecurity for confidential data, anddataencryption.
Cloud Key Management Service (Cloud KMS) withCloud HSM	Applicable to both internal and external sources. However, an additionaluse case for external data sources exists. Cloud HSM is acloud-based hardware security module (HSM) service that hosts the key encryptionkey (KEK). When importing data from an external source, you useCloud HSM to generate the encryption key that you use to encrypt thedata in your network before sending it to Google Cloud.
Cloud Logging	Applicable to both internal and external sources. Cloud Loggingcollects all the logs from Google Cloud services for storage and retrievalby your analysis and investigation tools.
Cloud Monitoring	Applicable to both internal and externalsources. Cloud Monitoring collects and stores performance informationand metrics about Google Cloud services.
Cloud Run functions	Applicable for external data sourcesonly. Cloud Run functions is triggered by Cloud Storageand writes the data that Cloud Storage uploads to the ingestion bucket intoBigQuery.
Cloud Storage andPub/Sub	Applicable to both internal and externalsources. Cloud Storage and Pub/Sub receive data asfollows: Cloud Storage: receives and stores batch data. Bydefault, Cloud Storage uses TLS to encrypt data in transit and AES-256to encrypt data in storage. The encryption key is acustomer-managed encryption key(CMEK). For more information about encryption, seeData encryptionoptions. You can help to secure access to Cloud Storage buckets usingsecurity controls such as Identity and Access Management, access control lists (ACLs), and policydocuments. For more information about supported access controls, seeOverview of accesscontrol. Pub/Sub: receives and stores streaming databefore de-identification. Pub/Sub usesauthentication,accesscontrols, andmessage-levelencryption with a CMEK to protect your data.
DataProfiler for BigQuery	Applicable to both internal and external sources. Data Profiler forBigQuery automatically scans for sensitive data in allBigQuery tables and columns across the entire organization,including all folders and projects.
Dataflowpipelines	Applicable to both internal and external sources; however, differentpipelines exist. Dataflow pipelines import data, asfollows: When importing data from Google Cloud, two Dataflowpipelines de-identify and re-identify confidential data. The first pipelinede-identifies confidential data usingpseudonymization.The second pipeline re-identifies confidential data when authorized usersrequire access. When importing data from an external source, one Dataflowpipeline writes streaming data into BigQuery.
Dataplex Universal Catalog	Applicable to both internal and externalsources. Dataplex Universal Catalog automaticallycategorizes confidential data with metadata, also known aspolicytags, during ingestion. Dataplex Universal Catalog also usesmetadata to manage access to confidential data. To control access to data withinthe data warehouse, you apply policy tags to columns that include confidentialdata.
Dedicated Interconnect	Applicable for external data sourcesonly. Dedicated Interconnect lets you move data betweenyour network and Google Cloud. You can use another connectivity option, asdescribed inChoosinga Network Connectivity product.
IAM andResource Manager	Applicable to both internal and externalsources. Identity and Access Management (IAM) and Resource Manager restrict accessand segment resources. The access controls and resource hierarchy follow theprinciple of least privilege.
Security Command Center	Applicable to both internal and external sources. Security Command Centermonitors and reviews security findings from across your Google Cloudenvironment in a central location.
Sensitive Data Protection	Applicable to both internal and external sources; however, differentscans occur. Sensitive Data Protection scans data, asfollows: When importing data from Google Cloud,Sensitive Data Protection de-identifies confidential data duringingestion. Sensitive Data Protection de-identifies structured andunstructured data based on the infoTypes or records that are detected. When importing data from an external source,Sensitive Data Protection scans data that is stored inBigQuery to find any sensitive data that isn't protected. Formore information, seeUsingSensitive Data Protection to scan BigQuerydata.
VPC Service Controls	Applicable to both internal and external sources; however, differentperimeters exist. VPC Service Controls creates security perimetersthat isolate services and resources by setting up authorization, accesscontrols, andsecuredata exchange. The perimeters are as follows: A data ingestion perimeter accepts incoming data (in batch or stream) andde-identifies it. A separate landing zone helps to protect the rest of yourworkloads from incoming data. When importing data from Google Cloud, a confidential data perimetercan re-identify the confidential data and store it in a restricted area. When importing external data, a data perimeter isolates the encryption datafrom other workloads. A governance perimeter stores the encryption keys and defines what isconsidered confidential data. These perimeters are designed to protect incoming content, isolateconfidential data by setting up additional access controls and monitoring, andseparate your governance from the actual data in the warehouse. Your governanceincludes key management, data catalog management, and logging.

Organization structure

You group your organization's resources so that you can manage them and separateyour testing environments from your production environment.Resource Manager lets youlogically group resources by project, folder, and organization.

The following diagrams show you a resource hierarchy with folders that representdifferent environments such as bootstrap, common, production, non-production (orstaging), and development. You deploy most of the projects in the architectureinto the production folder and the data governance project in the common folder,which is used for governance.

Organization structure when importing data from Google Cloud

The following diagram shows the organization structure when importing data fromGoogle Cloud using theterraform-google-secured-data-warehouse repository.

The resource hierarchy for a sensitive data warehouse for internal sources.

Organization structure when importing data from external sources

The following diagram shows the organization structure when importing data froman external source using theterraform-google-secured-data-warehouse-onprem-ingest repository.

The resource hierarchy for a sensitive data warehouse for external sources.

Folders

You use folders to isolate your production environment and governance servicesfrom your non-production and testing environments. The following table describesthe folders from the enterprise foundations blueprint that are used by thisarchitecture.

Folder	Description
Bootstrap	Contains resources required to deploy the enterprise foundationsblueprint.
Common	Contains centralized services for the organization, such as the Datagovernance project.
Production	Contains projects that have cloud resources that have been tested and areready to use. In this architecture, the Production folder contains the Dataingestion project and data-related projects.
Non-production	Contains projects that have cloud resources that are beingtested and staged for release. In this architecture, the Non-production foldercontains the Data ingestion project and data-related projects.
Development	Contains projects that have cloud resources that are beingdeveloped. In this architecture, the Development folder contains the Dataingestion project and data-related projects.

You can change the names of these folders to align with your organization'sfolder structure, but we recommend that you maintain a similar structure. Formore information, see theenterprise foundationsblueprint.

Projects

You isolate parts of your environment using projects. The following tabledescribes the projects that are needed within the organization. You create theseprojects when you run the Terraform code. You can change the names of theseprojects, but we recommend that you maintain a similar project structure.

Project	Description
Data ingestion	Common project for both internal and external sources. Containsservices that are required in order to receive data and de-identify confidentialdata.
Data governance	Common project for both internal and external sources. Containsservices that provide key management, logging, and data catalogingcapabilities.
Non-confidential data	Project for internal sources only. Contains services that arerequired in order to store data that has been de-identified.
Confidential data	Project for internal sources only. Contains services that arerequired in order to store and re-identify confidential data.
Data	Project for external sources only. Contains services that arerequired to store data.

In addition to these projects, your environment must also include a project thathosts a DataflowFlexTemplatejob. The Flex Template job is required for the streaming data pipeline.

Mapping roles and groups to projects

You must give different user groups in your organization access to the projectsthat make up the confidential data warehouse. The following sections describethe architecture recommendations for user groups and role assignments in theprojects that you create. You can customize the groups to match yourorganization's existing structure, but we recommend that you maintain a similarsegregation of duties and role assignment.

Data analyst group

Data analysts analyze the data in the warehouse. In theterraform-google-secured-data-warehouse-onprem-ingest repository, this group canview data after it has been loaded into the data warehouse and perform the sameoperations as theEncrypted data viewergroup.

The following table describes the group's roles in different projects for theterraform-google-secured-data-warehouse repository (internal data sources only).

Project mapping Roles

Project mapping	Roles
Data ingestion	Dataflow Developer (`roles/dataflow.developer`) Dataflow Viewer (`roles/dataflow.viewer`) LogsViewer (`roles/logging.viewer`) Additional role for data analysts that require access to confidentialdata: Fine-GrainedReader (`roles/datacatalog.categoryFineGrainedReader`)
Confidential data	BigQuery DataViewer (`roles/bigquery.dataViewer`) BigQuery JobUser (`roles/bigquery.jobUser`) BigQuery User (`roles/bigquery.user`) Dataflow Developer (`roles/dataflow.developer`) Dataflow Viewer (`roles/dataflow.viewer`) LogsViewer (`roles/logging.viewer`)
Non-confidential data	BigQuery DataViewer (`roles/bigquery.dataViewer`) BigQuery JobUser (`roles/bigquery.jobUser`) BigQuery User (`roles/bigquery.user`) LogsViewer (`roles/logging.viewer`)

Data ingestion

Additional role for data analysts that require access to confidentialdata:

Fine-GrainedReader (roles/datacatalog.categoryFineGrainedReader)

Confidential data

Non-confidential data

The following table describes the group's roles in different projects for theterraform-google-secured-data-warehouse-onprem-ingest repository (external datasources only).

Scope of assignment	Roles
Data ingestion project	Dataflow Developer (`roles/dataflow.developer`) Dataflow Viewer (`roles/dataflow.viewer`) LogsViewer (`roles/logging.viewer`)
Data project	BigQuery DataViewer (`roles/bigquery.dataViewer`) BigQuery JobUser (`roles/bigquery.jobUser`) BigQuery User (`roles/bigquery.user`) Dataflow Developer (`roles/dataflow.developer`) Dataflow Viewer (`roles/dataflow.viewer`) DLPAdministrator (`roles/dlp.admin`) LogsViewer (`roles/logging.viewer`)
Data policy level	MaskedReader (`roles/bigquerydatapolicy.maskedReader`)

Encrypted data viewer group (external sources only)

The Encrypted data viewer group in theterraform-google-secured-data-warehouse-onprem-ingest repository can viewencrypted data from BigQuery reporting tables throughLooker Studio and other reporting tools, such as SAP Business Objects.The encrypted data viewer group can't view cleartext data from encryptedcolumns.

This group requires theBigQuery User(roles/bigquery.jobUser) rolein the Data project. This group also requires theMasked Reader(roles/bigquerydatapolicy.maskedReader)role at the data policy level.

Plaintext reader group (external sources only)

The Plaintext reader group in theterraform-google-secured-data-warehouse-onprem-ingest repository has therequired permission to call the decryption user-defined function (UDF) to viewplaintext data and the additional permission to read unmasked data.

This group requires the following roles in the Data project:

In addition, this group requires theFine-Grained Reader(roles/datacatalog.categoryFineGrainedReader) role at theDataplex Universal Catalog level.

Data engineer group

Data engineers set up and maintain the data pipeline and warehouse.

The following table describes the group's roles in different projects for theterraform-google-secured-data-warehouse repository.

Scope of assignment	Roles
Data ingestion project	Cloud Build Editor (`roles/cloudbuild.builds.editor`) Cloud KMS Viewer (`roles/cloudkms.viewer`) ComposerUser (`roles/composer.user`) ComputeNetwork User (`roles/compute.networkUser`) Dataflow Admin (`roles/dataflow.admin`) LogsViewer (`roles/logging.viewer`)
Confidential data project	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery JobUser (`roles/bigquery.jobUser`) Cloud Build Editor (`roles/cloudbuild.builds.editor`) Cloud KMS Viewer (`roles/cloudkms.viewer`) ComputeNetwork User (`roles/compute.networkUser`) Dataflow Admin (`roles/dataflow.admin`) LogsViewer (`roles/logging.viewer`)
Non-confidential data project	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery JobUser (`roles/bigquery.jobUser`) Cloud KMS Viewer (`roles/cloudkms.viewer`) LogsViewer (`roles/logging.viewer`)

The following table describes the group's roles in different projects for theterraform-google-secured-data-warehouse-onprem-ingest repository.

Scope of assignment	Roles
Data ingestion project	Cloud Build Editor (`roles/cloudbuild.builds.editor`) Cloud KMS Viewer (`roles/cloudkms.viewer`) ComposerUser (`roles/composer.user`) ComputeNetwork User (`roles/compute.networkUser`) Dataflow Admin (`roles/dataflow.admin`) LogsViewer (`roles/logging.viewer`)
Data project	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery JobUser (`roles/bigquery.jobUser`) Cloud Build Editor (`roles/cloudbuild.builds.editor`) Cloud KMS Viewer (`roles/cloudkms.viewer`) ComputeNetwork User (`roles/compute.networkUser`) Dataflow Admin (`roles/dataflow.admin`) DLPAdministrator (`roles/dlp.admin`) LogsViewer (`roles/logging.viewer`)

Network administrator group

Network administrators configure the network. Typically, they are members of thenetworking team.

Network administrators require the following roles at the organization level:

Security administrator group

Security administrators administer security controls such as access, keys,firewall rules, VPC Service Controls, and theSecurity Command Center.

Security administrators require the following roles at the organization level:

Security analyst group

Security analysts monitor and respond to security incidents andSensitive Data Protection findings.

Security analysts require the following roles at the organization level:

Example group access flows for external sources

The following sections describe access flows for two groups when importing datafrom external sources using theterraform-google-secured-data-warehouse-onprem-ingest repository.

Access flow for Encrypted data viewer group

The following diagram shows what occurs when a user from theEncrypted dataviewergrouptries to access encrypted data in BigQuery.

The flow for the encrypted data viewer group.

The steps to access data in BigQuery are as follows:

The Encrypted data viewer executes the following query onBigQuery to access confidential data:
```
SELECTssn,panFROMcc_card_table
```
BigQuery verifies access as follows:
- The user is authenticated using valid, unexpired Google Cloudcredentials.
- The user identity and the IP address that the request originated from arepart of the allowlist in the access level or ingress rule on theVPC Service Controls perimeter.
- IAM verifies that the user has the appropriate roles and isauthorized to access selected encrypted columns on theBigQuery table.

BigQuery returns the confidential data in encrypted format.

Access flow for Plaintext reader group

The following diagram shows what occurs when a user from thePlaintext readergrouptries to access encrypted data in BigQuery.

The flow for the plaintext reader group.

The steps to access data in BigQuery are as follows:

The Plaintext reader executes the following query on BigQueryto access confidential data in decrypted format:
```
SELECTdecrypt_ssn(ssn)FROMcc_card_table
```
BigQuery calls the decryptuser-defined function(UDF) withinthe query to access protected columns.
Access is verified as follows:
- IAM verifies that the user has appropriate roles and isauthorized to access the decrypt UDF on BigQuery.
- The UDF retrieves the wrapped data encryption key (DEK) that was used toprotect sensitive data columns.
The decrypt UDF calls the key encryption key (KEK) in Cloud HSM tounwrap the DEK. The decrypt UDF uses theBigQuery AEAD decryptfunctionto decrypt the sensitive data columns.
The user is granted access to the plaintext data in the sensitive datacolumns.

Common security controls

The following sections describe the controls that apply to both internal andexternal sources.

Data ingestion controls

To create your data warehouse, you must transfer data from anotherGoogle Cloud source (for example, a data lake), your on-premisesenvironment, or another cloud. You can use one of the following options totransfer your data into the data warehouse on BigQuery:

A batch job that uses Cloud Storage.
A streaming job that uses Pub/Sub.

To help protect data during ingestion, you can use client-side encryption,firewall rules, and access level policies. The ingestion process is sometimesreferred to as anextract, transform, load (ETL)process.

Network and firewall rules

Virtual Private Cloud (VPC) firewallrules control the flow of datainto the perimeters. You create firewall rules that deny all egress, except forspecific TCP port 443 connections from therestricted.googleapis.com specialdomain names. Therestricted.googleapis.com domain has the following benefits:

It helps reduce your network attack surface by using Private Google Accesswhen workloads communicate to Google APIs and services.
It ensures that you only use services that support VPC Service Controls.

For more information, seeConfiguringPrivate Google Access.

When using theterraform-google-secured-data-warehouse repository, you mustconfigure separate subnets for each Dataflow job. Separatesubnets ensure that data that is being de-identified is properly separated fromdata that is being re-identified.

The data pipeline requires you to open TCP ports in the firewall, as defined inthedataflow_firewall.tf file in the respective repositories. For moreinformation, seeConfiguring internet access and firewallrules.

To deny resources the ability to use external IP addresses, theDefine allowedexternal IPs for VMinstances (compute.vmExternalIpAccess)organization policy is set to deny all.

Perimeter controls

As shown in thearchitecturediagram,you place the resources for the data warehouse into separate perimeters. Toenable services in different perimeters to share data, you createperimeterbridges.

Perimeter bridges let protected services make requests for resources outside oftheir perimeter. These bridges make the following connections for theterraform-google-secured-data-warehouse repository:

They connect the data ingestion project to the governance project so thatde-identification can take place during ingestion.
They connect the non-confidential data project and the confidential dataproject so that confidential data can be re-identified when a data analystrequests it.
They connect the confidential project to the data governance project so thatre-identification can take place when a data analyst requests it.

These bridges make the following connections for theterraform-google-secured-data-warehouse-onprem-ingest repository:

They connect the Data ingestion project to the Data project so that data canbe ingested into BigQuery.
They connect the Data project to the Data governance project so thatSensitive Data Protection can scan BigQuery forunprotected confidential data.
They connect the Data ingestion project to the Data governance project foraccess to logging, monitoring, and encryption keys.

In addition to perimeter bridges, you useegressrulesto let resources protected by service perimeters access resources that areoutside the perimeter. In this solution, you configure egress rules to obtainthe external Dataflow Flex Template jobs that are located inCloud Storage in an external project. For more information, seeAccessa Google Cloud resource outside theperimeter.

Access policy

To help ensure that only specific identities (user or service) can accessresources and data, you enable IAM groups and roles.

To help ensure that only specific sources can access your projects, you enableanaccesspolicy for yourGoogle organization. We recommend that you create an access policy thatspecifies the allowed IP address range for requests and only allows requestsfrom specific users or service accounts. For more information, seeAccess levelattributes.

Service accounts and access controls

Service accounts are identities that Google Cloud can use to run APIrequests on your behalf. Service accounts ensure that user identities don'thave direct access to services. To permit separation of duties, you createservice accounts with different roles for specific purposes. These serviceaccounts are defined in thedata-ingestion module and theconfidential-datamodule in each architecture.

For theterraform-google-secured-data-warehouse repository, the serviceaccounts are as follows:

A Dataflow controller service account for theDataflow pipeline that de-identifies confidential data.
A Dataflow controller service account for theDataflow pipeline that re-identifies confidential data.
A Cloud Storage service account to ingest data from a batch file.
A Pub/Sub service account to ingest data from a streamingservice.
A Cloud Scheduler service account to run the batch Dataflowjob that creates the Dataflow pipeline.

The following table lists the roles that are assigned to each service account:

Service Account	Name	Project	Roles
Dataflow controller This account is used forde-identification.	`sa-dataflow-controller`	Data ingestion	BigQuery Admin`roles/bigquery.admin` Cloud KMS Admin (`roles/cloudkms.admin`) Cloud KMS CryptoKeyDecrypter (`roles/cloudkms.cryptoKeyDecrypter`) ComputeViewer (`roles/compute.viewer`) Dataflow Worker (`roles/dataflow.worker`) DLPAdministrator (`roles/dlp.admin`) Pub/Sub Subscriber (`roles/pubsub.subscriber`) StorageAdmin (`roles/storage.admin`) Dataflow ServiceAgent (`roles/dataflow.serviceAgent`)
Dataflow controller This account is used forre-identification.	`sa-dataflow-controller-reid`	Confidential data	BigQuery Admin`roles/bigquery.admin` Cloud KMS Admin (`roles/cloudkms.admin`) Cloud KMS CryptoKeyDecrypter (`roles/cloudkms.cryptoKeyDecrypter`) ComputeViewer (`roles/compute.viewer`) Dataflow Worker (`roles/dataflow.worker`) DLPAdministrator (`roles/dlp.admin`) Pub/Sub Subscriber (`roles/pubsub.subscriber`) StorageAdmin (`roles/storage.admin`) Dataflow ServiceAgent (`roles/dataflow.serviceAgent`)
Cloud Storage	`sa-storage-writer`	Data ingestion	StorageObject Creator (`roles/storage.objectCreator`) StorageObject Viewer (`roles/storage.ObjectViewer`)
Pub/Sub	`sa-pubsub-writer`	Data ingestion	Pub/Sub Publisher (`roles/pubsub.publisher`) Pub/Sub Subscriber (`roles/pubsub.subscriber`)
Cloud Scheduler	`sa-scheduler-controller`	Data ingestion	ComputeViewer (`roles/compute.viewer`) Dataflow Developer (`roles/dataflow.developer`)

For theterraform-google-secured-data-warehouse-onprem-ingest repository, theservice accounts are as follows:

Cloud Storage service account runs the automated batch data uploadprocess to the ingestion storage bucket.
Pub/Sub service account enables streaming of data toPub/Sub service.
Dataflow controller service account is used by theDataflow pipeline to transform and write data fromPub/Sub to BigQuery.
Cloud Run functions service account writes subsequent batchdata uploaded from Cloud Storage to BigQuery.
Storage Upload service account allows the ETL pipeline to create objects.
Pub/Sub Write service Account lets the ETL pipeline write datato Pub/Sub.

The following table lists the roles that are assigned to each service account:

Name	Roles	Scope of Assignment
Dataflow controller service account	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery JobUser (`roles/bigquery.jobUser`) Dataflow Developer (`roles/dataflow.developer`) Dataflow Worker (`roles/dataflow.worker`) Pub/Sub Editor (`roles/pubsub.editor`) Pub/Sub Subscriber (`roles/pubsub.subscriber`) ServiceUsage Consumer (`roles/serviceusage.serviceUsageConsumer`) StorageObject Viewer (`roles/storage.ObjectViewer`)	Data ingestion project
	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery MetadataViewer (`roles/bigquery.metadataViewer`)	Data project
	DLPInspect Findings Reader (`roles/dlp.deidentifyTemplatesReader`) DLPInspect Templates Editor (`roles/dlp.inspectTemplatesReader`) DLPUser (`roles/dlp.user`)	Data governance
Cloud Run functions service account	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery JobUser (`roles/bigquery.jobUser`) Cloud Run Invoker (`roles/run.invoker`) EventarcEvent Receiver (`roles/eventarc.eventReceiver`)	Data ingestion project
	BigQuery DataEditor (`roles/bigquery.dataEditor`) BigQuery MetadataViewer (`roles/bigquery.metadataViewer`)	Data project
Storage Upload service account	StorageObject Creator (`roles/storage.objectCreator`) StorageObject Viewer (`roles/storage.ObjectViewer`)	Data ingestion Project
Pub/Sub Write service account	Pub/Sub Publisher (`roles/pubsub.publisher`) Pub/Sub Subscriber (`roles/pubsub.subscriber`)	Data ingestion Project

Organizational policies

This architecture includes the organization policy constraints that theenterprise foundations blueprint uses and adds additional constraints. For moreinformation about the constraints that the enterprise foundations blueprintuses, seeOrganization policyconstraints.

The following table describes the additional organizational policyconstraintsthat are defined in theorg_policies module for the respective repositories:

Policy	Constraint name	Recommended value
Restrictresource deployments to specific physical locations. For additional values,seeValuegroups.	`gcp.resourceLocations`	One of thefollowing: `in:us-locations` `in:eu-locations` `in:asia-locations`
Disableservice account creation	`iam.disableServiceAccountCreation`	`true`
Enable OSLogin for VMs created in the project.	`compute.requireOsLogin`	`true`
Restrictnew forwarding rules to be internal only, based on IP address.	`compute.restrictProtocolForwardingCreationForTypes`	`INTERNAL`
Definethe set of Shared VPC subnetworks that Compute Engine resources canuse.	`compute.restrictSharedVpcSubnetworks`	`projects//regions//s ubnetworks/`. Replace with theresource ID of the private subnet that you want the architecture touse.
Disableserial port output logging to Cloud Logging.	`compute.disableSerialPortLogging`	`true`
RequireCMEK protection (`terraform-google-secured-data-warehouse-onprem-ingest`only)	`gcp.restrictNonCmekServices`	`bigquery.googleapis.com`
Disable service account key creation(`terraform-google-secured-data-warehouse-onprem-ingest only`)	`disableServiceAccountKeyCreation`	true
Enable OSLogin for VMs created in the project(`terraform-google-secured-data-warehouse-onprem-ingest only`)	`compute.requireOsLogin`	true
Disableautomatic role grants to default service account(`terraform-google-secured-data-warehouse-onprem-ingest only`)	`automaticIamGrantsForDefaultServiceAccounts`	true
Allowed ingress settings (Cloud Run functions)(`terraform-google-secured-data-warehouse-onprem-ingest only`)	`cloudfunctions.allowedIngressSettings`	`ALLOW_INTERNAL_AND_GCLB`

Security controls for external data sources

The following sections describe the controls that apply to ingesting data fromexternal sources.

Encrypted connection to Google Cloud

When importing data from external sources, you can use Cloud VPN orCloud Interconnect to protect all data that flows between Google Cloudand your environment. This enterprise architecture recommendsDedicated Interconnect, because it provides a direct connectionand high throughput, which are important if you're streaming a lot of data.

To permit access to Google Cloud from your environment, you must defineallowlisted IP addresses in the access levels policy rules.

Client-side encryption

Before you move your sensitive data into Google Cloud, encrypt your datalocally to help protect it at rest and in transit. You can use theTink encryption library, or you can useother encryption libraries. The Tink encryption library is compatible withBigQuery AEADencryption, whichthe architecture uses to decrypt column-level encrypted data after the data isimported.

The Tink encryption library uses DEKs that you can generate locally or fromCloud HSM.To wrap or protect the DEK, you can use a KEK that is generated inCloud HSM. The KEK is asymmetricCMEK encryption keyset that is storedsecurely in Cloud HSM and managed using IAM roles andpermissions.

During ingestion, both the wrapped DEK and the data are stored inBigQuery. BigQuery includes two tables: one forthe data and the other for the wrapped DEK. When analysts need to viewconfidential data, BigQuery can useAEADdecryptionto unwrap the DEK with the KEK and decrypt the protected column.

Also, client-side encryption using Tink further protects your data by encryptingsensitive data columns in BigQuery. The architecture uses thefollowing Cloud HSM encryption keys:

A CMEK key for the ingestion process that's also used byPub/Sub, Dataflow pipeline for streaming,Cloud Storage batch upload, and Cloud Run functionsartifacts for subsequent batch uploads.
The cryptographic key wrapped by Cloud HSM for the data encrypted onyour network using Tink.
CMEK key for the BigQuery warehouse in the Data project.

You specify the CMEK location, which determines the geographical location thatthe key is stored and is made available for access. You must ensure that yourCMEK is in the same location as your resources. By default, the CMEK is rotatedevery 30 days.

If your organization's compliance obligations require that you manage your ownkeys externally from Google Cloud, you can enableCloud External Key Manager. If you use external keys,you're responsible for key management activities, including key rotation.

Dynamic data masking

To help with sharing and applying data access policies at scale, you canconfiguredynamic datamasking.Dynamic data masking lets existing queries automatically mask column data usingthe following criteria:

The masking rules that are applied to the column at query runtime.
The roles that are assigned to the user who is running the query. To accessunmasked column data, the data analyst must have theFine-GrainedReaderrole.

To define access for columns in BigQuery, you createpolicytags. Forexample, the taxonomy created in thestandaloneexamplecreates the1_Sensitive policy tag for columns that include data that cannotbe made public, such as the credit limit. The default data masking rule isapplied to these columns to hide the value of the column.

Anything that isn't tagged is available to all users who have access to the datawarehouse. These access controls ensure that, even after the data is written toBigQuery, the data in sensitive fields still cannot be read untilaccess is explicitly granted to the user.

Column-level encryption and decryption

Column-levelencryption lets youencrypt data in BigQuery at a more granular level. Instead ofencrypting an entire table, you select the columns that contain sensitive datawithin BigQuery, and only those columns are encrypted.BigQuery usesAEAD encryption anddecryptionfunctions that create the keysets that contain the keys for encryption anddecryption. These keys are then used to encrypt and decrypt individual values ina table, and rotate keys within a keyset. Column-level encryption providesdual-access control on encrypted data in BigQuery, because theuser must have permissions to both the table and the encryption key to read datain cleartext.

Data profiler for BigQuery with Sensitive Data Protection

Dataprofilerlets you identify the locations of sensitive and high risk data inBigQuery tables. Data profiler automatically scans and analyzesall BigQuery tables and columns across the entire organization,including all folders and projects. Data profiler then outputs metrics such asthe predictedinfoTypes,the assessed data risk and sensitivity levels, and metadata about your tables.Using these insights, you can make informed decisions about how you protect,share, and use your data.

Security controls for internal data sources

The following sections describe the controls that apply to ingesting data fromGoogle Cloud sources.

Key management and encryption for ingestion

Both ingestion options (Cloud Storage or Pub/Sub) useCloud HSM to manage the CMEK. You use the CMEK keys to help protectyour data during ingestion. Sensitive Data Protection further protectsyour data by encrypting confidential data, using the detectors that youconfigure.

To ingest data, you use the following encryption keys:

A CMEK key for the ingestion process that's also used by theDataflow pipeline and the Pub/Sub service.
The cryptographic keywrappedby Cloud HSM for the data de-identification process usingSensitive Data Protection.
Two CMEK keys, one for the BigQuery warehouse in thenon-confidential data project, and the other for the warehouse in theconfidential data project. For more information, seeKeymanagement.

You specify theCMEK location,which determines the geographical location that the key is stored and is madeavailable for access. You must ensure that your CMEK is in the same location asyour resources. By default, the CMEK is rotated every 30 days.

If your organization's compliance obligations require that you manage your ownkeys externally from Google Cloud, you can enableCloud EKM. If you use externalkeys, you are responsible for key management activities, including key rotation.

Data de-identification

You use Sensitive Data Protection to de-identify your structured andunstructured data during the ingestion phase. For structured data, you userecordtransformationsbased on fields to de-identify data. For an example of this approach, see the/examples/de_identification_template/folder. This example checks structured data for any credit card numbers and cardPINs. For unstructured data, you useinformationtypesto de-identify data.

To de-identify data that is tagged as confidential, you useSensitive Data Protection and a Dataflow pipeline totokenize it. This pipeline takes data from Cloud Storage, processes it,and then sends it to the BigQuery data warehouse.

For more information about the data de-identification process, seedatagovernance.

Column-level access controls

To help protect confidential data, you useaccess controls for specificcolumns inthe BigQuery warehouse. In order to access the data in thesecolumns, a data analyst must have the Fine-GrainedReader role.

To define access for columns in BigQuery, you create policytags. Forexample, thetaxonomy.tf file in thebigquery-confidential-data examplemodule creates the following tags:

A3_Confidential policy tag for columns that include very sensitiveinformation, such as credit card numbers. Users who have access to this tagalso have access to columns that are tagged with the2_Private or1_Sensitive policy tags.
A2_Private policy tag for columns that include sensitive personalidentifiable information (PII), such as a person's first name.Users who have access to this tag also have access to columns that are taggedwith the1_Sensitive policy tag. Users don't have access to columns thatare tagged with the3_Confidential policy tag.
A1_Sensitive policy tag for columns that include data that cannot be madepublic, such as the credit limit. Users who have access to this tag don'thave access to columns that are tagged with the2_Private or3_Confidential policy tags.

Anything that is not tagged is available to all users who have access to thedata warehouse.

These access controls ensure that, even after the data is re-identified, thedata still cannot be read until access is explicitly granted to the user.

Note: You can use the default definitions to run the examples. For more bestpractices, seeBest practices for using policy tags inBigQuery.

Service accounts with limited roles

You must limit access to the confidential data project so that only authorizedusers can view the confidential data. To do so, you create a service accountwith theService AccountUser (roles/iam.serviceAccountUser)role that authorized users must impersonate.Service Accountimpersonationhelps users to use service accounts without downloading the service accountkeys, which improves the overall security of your project. Impersonation createsa short-term token that authorized users who have theService Account TokenCreator (roles/iam.serviceAccountTokenCreator)role are allowed to download.

Key management and encryption for storage and re-identification

You manage separate CMEK keys for your confidential data so that you canre-identity the data. You use Cloud HSM to protect your keys. Tore-identify your data, use the following keys:

A CMEK key that the Dataflow pipeline uses for there-identification process.
The original cryptographic key that Sensitive Data Protection uses tode-identify your data.
A CMEK key for the BigQuery warehouse in the confidential dataproject.

As mentioned inKey management and encryption foringestion,you can specify the CMEK location and rotation periods. You can useCloud EKM if it is required by your organization.

Operations

You can enable logging andSecurity Command Center Premium or Enterprise tierfeaturessuch as Security Health Analytics and Event Threat Detection. These controls help you to do thefollowing:

Monitor who is accessing your data.
Ensure that proper auditing is put in place.
Generate findings for misconfigured cloud resources
Support the ability of your incident management and operations teams torespond to issues that might occur.

Access Transparency

Access Transparencyprovides you with real-time notification whenGooglepersonnelrequire access to your data. Access Transparency logs are generated whenever ahuman accesses content, and only Google personnel with valid businessjustifications (for example, a support case) can obtain access.

Logging

To help you to meet auditing requirements and get insight into your projects,you configure theGoogle Cloud Observabilitywith data logs for services you want to track. Thecentralized-logging modulein the repositories configures the following best practices:

Creating an aggregated logsink acrossall projects.
Storing your logs in the appropriateregion.
Adding CMEK keys to your logging sink.

For all services within the projects, your logs must include information aboutdata reads and writes, and information about what administrators read. Foradditional logging best practices, seeDetectivecontrols.

Alerts and monitoring

After you deploy the architecture, you can set up alerts to notify your securityoperations center (SOC) that a security incident might be occurring. Forexample, you can use alerts to let your security analyst know when anIAM permission has changed. For more information aboutconfiguring Security Command Center alerts, seeSetting up findingnotifications.For additional alerts that aren't published by Security Command Center, you can set upalerts with Cloud Monitoring.

Additional security considerations

In addition to the security controls described in this document, you shouldreview and manage the security and risk in key areas that overlap and interactwith your use of this solution. These include the following:

The security of the code that you use to configure, deploy, and runDataflow jobs and Cloud Run functions.
The data classification taxonomy that you use with this solution.
Generation and management of encryption keys.
The content, quality, and security of the datasets that you store and analyzein the data warehouse.
The overall environment in which you deploy the solution, including thefollowing:
- The design, segmentation, and security of networks that you connect to thissolution.
- The security and governance of your organization's IAMcontrols.
- The authentication and authorization settings for the actors to whom yougrant access to the infrastructure that's part of this solution, and whohave access to the data that's stored and managed in that infrastructure.

Bringing it all together

To implement the architecture described in this document, do the following:

Determine whether you will deploy the architecture with the enterprisefoundations blueprint or on its own. If you choose not to deploy theenterprise foundations blueprint, ensure that your environment has a similarsecurity baseline in place.
For importing data from external sources, set up aDedicated Interconnectconnectionwith your network.
Review theterraform-google-secured-data-warehouseREADMEorterraform-google-secured-data-warehouse-onprem-ingestREADMEand ensure that you meet all the prerequisites.
Verify that your user identity has theService Account User(roles/iam.serviceAccountUser)andService Account Token Creator(roles/iam.serviceAccountTokenCreator)roles for your organization's development folder, as described inOrganization structure. If you don't have a folderthat you use for testing,create afolderandconfigureaccess.
Record your billing account ID, organization's display name, folder ID foryour test or demo folder, and the email addresses for the followingusergroups:
- Data analysts
- Encrypted data viewer
- Plaintext reader
- Data engineers
- Network administrators
- Security administrators
- Security analysts
Create theprojects. For a list of APIsthat you must enable, see the README.
Create the service account for Terraform and assign the appropriate roles forall projects.
Set up the Access Control Policy.
For Google Cloud data sources using theterraform-google-secured-data-warehouse repository, in your testingenvironment, deploy thewalkthroughto see the solution in action. As part of your testing process, consider thefollowing:
1. Add your own sample data into the BigQuery warehouse.
2. Work with a data analyst in your enterprise to test their access to theconfidential data and whether they can interact with the data fromBigQuery in the way that they would expect.
For external data sources using theterraform-google-secured-data-warehouse-onprem-ingest repository, in yourtesting environment, deploy the solution:
1. Clone and run the Terraform scripts to set up an environment inGoogle Cloud.
2. Install the Tink encryptionlibrary on your network.
3. Set up Application DefaultCredentialsso that you can run the Tink library on your network.
4. Create encryptionkeys withCloud KMS.
5. Generate encryptedkeysets withTink.
6. Encrypt data with Tink using one of the following methods:
  - Usingdeterministicencryption.
  - Using ahelper script with sampledata.
7. Upload encrypted data to BigQuery using streaming or batch uploads.

For external data sources, verify that authorized users can read unencrypteddata from BigQuery using the BigQuery AEADdecrypt function. For example, run the following create decryption function:

Run the create view query:

CREATEORREPLACEVIEW`{project_id}.{bigquery_dataset}.decryption_view`ASSELECTCard_Type_Code,Issuing_Bank,Card_Number,`bigquery_dataset.decrypt`(Card_Number)ASCard_Number_DecryptedFROM`project_id.dataset.table_name`

Run the select query from view:

SELECTCard_Type_Code,Issuing_Bank,Card_Number,Card_Number_DecryptedFROM`{project_id}.{bigquery_dataset}.decrypted_view`

For additional queries and use cases, seeColumn-level encryption withCloud KMS.

Use Security Command Center to scan the newly created projects against yourcompliancerequirements.
Deploy the architecture into your production environment.

What's next

Review theenterprise foundationsblueprint for abaseline secure environment.
To see the details of the architecture, read theTerraform configurationREADMEfor internal data sources (terraform-google-secured-data-warehouserepository) or read theTerraform configurationREADMEfor external data sources (terraform-google-secured-data-warehouse-onprem-ingest repository).

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-06-15 UTC.

Movatterモバイル変換

Import data into a secured BigQuery data warehouse Stay organized with collections Save and categorize content based on your preferences.