De-identification of sensitive Cloud Storage data
This page describes how Sensitive Data Protection can create de-identified copies of data stored in Cloud Storage. It also lists the limitations of this operation and the points that you should consider before you start.
For information about how to use Sensitive Data Protection to create de-identified copies of your Cloud Storage data, see the following:
- Create de-identified copies of data stored in Cloud Storage using the Google Cloud console
- Create de-identified copies of data stored in Cloud Storage using the API
About de-identification
De-identification is the process of removing identifying information from data. Its goal is to enable the use and sharing of personal information, such as health, financial, or demographic information, while meeting privacy requirements. For more information about de-identification, see De-identifying sensitive data.
For more in-depth information about de-identification transformations in Sensitive Data Protection, see Transformation reference. For more information about how Sensitive Data Protection redacts sensitive data from images, see Image inspection and redaction.
When to use this feature
This feature is useful if the files that you use in your business operations contain sensitive data, such as personally identifiable information (PII). This feature lets you use and share information as part of your business processes, while keeping sensitive pieces of data obscured.
De-identification process
This section describes the de-identification process in Sensitive Data Protection for content in Cloud Storage.
To use this feature, you create an inspection job (DlpJob) that's configured to make de-identified copies of the Cloud Storage files. Sensitive Data Protection scans the files in the specified location, inspecting them according to your configuration. As it inspects each file, Sensitive Data Protection de-identifies any data that matches your criteria for sensitive data, and then writes the content to a new file. The new file always has the same filename as the original file. Sensitive Data Protection stores this new file in an output directory that you specify. If a file is included in your scan, but no data matches your de-identification criteria, and there are no errors in its processing, then the file is copied, unaltered, to the output directory.
The output directory that you set must be in a Cloud Storage bucket that's different from the bucket containing your input files. In your output directory, Sensitive Data Protection creates a file structure that mirrors the file structure of the input directory.
For example, suppose you set the following input and output directories:
- Input directory: gs://input-bucket/folder1/folder1a
- Output directory: gs://output-bucket/output-directory
During de-identification, Sensitive Data Protection stores the de-identified files in gs://output-bucket/output-directory/folder1/folder1a.
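The mirroring rule in this example can be sketched in a few lines of Python. This is a local illustration only, not part of the service or its client library: the helper name `mirrored_output_uri` and the file name `data.csv` are hypothetical, and the path logic is inferred from the single example above.

```python
from urllib.parse import urlparse


def mirrored_output_uri(input_uri: str, output_dir: str) -> str:
    """Return where a de-identified copy of input_uri would be expected.

    Hypothetical helper: the rule is inferred from the documented example,
    where the object path inside the input bucket is recreated under the
    output directory.
    """
    src = urlparse(input_uri)
    dst = urlparse(output_dir)
    if src.scheme != "gs" or dst.scheme != "gs":
        raise ValueError("both arguments must be gs:// URIs")
    # The output directory must be in a different bucket than the input.
    if src.netloc == dst.netloc:
        raise ValueError("the output bucket must differ from the input bucket")
    # Keep the full object path from the input bucket under the output directory.
    return f"{output_dir.rstrip('/')}/{src.path.lstrip('/')}"


expected_copy = mirrored_output_uri(
    "gs://input-bucket/folder1/folder1a/data.csv",  # hypothetical input file
    "gs://output-bucket/output-directory",
)
print(expected_copy)
```

A check like this can help you spot, before running a job, which existing objects in the output bucket share a path with an expected output file.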
If a file exists in the output directory with the same filename as a de-identified file, that file is overwritten. If you don't want existing files to be overwritten, change the output directory before running this operation. Alternatively, consider enabling object versioning on the output bucket.
File-level access control lists (ACLs) for the original files are copied to the new files, regardless of whether sensitive data was found and de-identified. However, if the output bucket is configured only for uniform bucket-level permissions, and not fine-grained (object-level) permissions, then the ACLs aren't copied to the de-identified files.
The following diagram shows the de-identification process for four files stored in a Cloud Storage bucket. Each file is copied regardless of whether Sensitive Data Protection detects any sensitive data. Each copied file is named the same as the original.
Pricing
For pricing information, see Inspection and transformation of data in storage.
Supported file types
Sensitive Data Protection can de-identify the following file type groups:
- CSV
- Image
- Text
- TSV
Default de-identification behavior
If you want to define how Sensitive Data Protection transforms the findings, you can provide de-identify templates for the following types of files:
- Unstructured files, like text files with freeform text
- Structured files, like CSV files
- Images
If you don't provide any de-identify template, Sensitive Data Protection transforms the findings as follows:
- In unstructured and structured files, Sensitive Data Protection replaces all findings with their corresponding infoType, as described in InfoType replacement.
- In images, Sensitive Data Protection covers all findings with a black box.
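As a rough sketch of how optional templates might be attached, the following Python helper builds the `deidentify` action of a job as a plain dictionary. The helper name and the template resource names are hypothetical. The field names (`transformation_config`, `deidentify_template`, `structured_deidentify_template`, `image_redact_template`, `cloud_storage_output`) follow the DLP API's `Deidentify` action as understood here; the dictionary is not validated against the API, so check the API reference before relying on it.

```python
def build_deidentify_action(output_dir, text_template=None,
                            structured_template=None, image_template=None):
    """Build a Deidentify action as a dict (hypothetical helper).

    Any template left as None is omitted, in which case the default
    behavior described above would apply to that kind of file.
    """
    transformation_config = {}
    if text_template:
        # Template for unstructured files, such as freeform text.
        transformation_config["deidentify_template"] = text_template
    if structured_template:
        # Template for structured files, such as CSV.
        transformation_config["structured_deidentify_template"] = structured_template
    if image_template:
        # Template for images.
        transformation_config["image_redact_template"] = image_template

    deidentify = {"cloud_storage_output": output_dir}
    if transformation_config:
        deidentify["transformation_config"] = transformation_config
    return {"deidentify": deidentify}


# No templates: every supported file type uses the default transformation.
default_action = build_deidentify_action("gs://output-bucket/output-directory")

# Only a structured template (hypothetical resource name): text files and
# images still fall back to the defaults.
csv_action = build_deidentify_action(
    "gs://output-bucket/output-directory",
    structured_template="projects/my-project/locations/global/deidentifyTemplates/my-csv-template",
)
```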
Limitations and considerations
Consider the following points before creating de-identified copies of Cloud Storage data.
Disk space
This operation only supports content stored in Cloud Storage.
This operation makes a copy of each file as Sensitive Data Protection inspects it. It does not modify or remove the original content. The copied data will take up roughly the same amount of additional disk space as the original data.
Write access to the storage
Because Sensitive Data Protection creates a copy of the original files, the service agent of your project must have write access on the Cloud Storage output bucket.
Sampling and setting finding limits
This operation doesn't support sampling. Specifically, you can't limit how much of each file Sensitive Data Protection scans and de-identifies. That is, if you're using the Cloud Data Loss Prevention API, you can't use bytesLimitPerFile and bytesLimitPerFilePercent in the CloudStorageOptions object of your DlpJob.
Also, you can't control the maximum number of findings to be returned. If you're using the DLP API, you can't set a FindingLimits object in your DlpJob.
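One way to catch these settings before submitting a job is a small local check over the job configuration. The sketch below assumes the configuration is held as a Python dictionary with snake_case keys (`bytes_limit_per_file`, `bytes_limit_per_file_percent`, and `limits` under `inspect_config`), which is how the Python client library usually spells the fields named above. The function name `unsupported_settings` is hypothetical.

```python
def unsupported_settings(job_config: dict) -> list:
    """List settings that this operation doesn't support (hypothetical helper).

    Assumes snake_case keys; adapt the names if your configuration uses
    the camelCase JSON form instead.
    """
    problems = []
    gcs_options = job_config.get("storage_config", {}).get("cloud_storage_options", {})
    # Sampling limits on how much of each file is scanned.
    for field in ("bytes_limit_per_file", "bytes_limit_per_file_percent"):
        if gcs_options.get(field):
            problems.append(f"cloud_storage_options.{field}")
    # A FindingLimits object that caps the number of findings.
    if job_config.get("inspect_config", {}).get("limits"):
        problems.append("inspect_config.limits")
    return problems


# Example configuration with two unsupported settings.
example_config = {
    "storage_config": {
        "cloud_storage_options": {
            "file_set": {"url": "gs://input-bucket/folder1/folder1a/**"},
            "bytes_limit_per_file": 1048576,
        }
    },
    "inspect_config": {"limits": {"max_findings_per_request": 100}},
}
print(unsupported_settings(example_config))
```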
Requirement to inspect data
When running your inspection job, Sensitive Data Protection first inspects the data, according to your inspection configuration, before it performs de-identification. It can't skip the inspection process.
Requirement to use file extensions
Sensitive Data Protection relies on file extensions to identify the file types of the files in your input directory. It might not de-identify files that don't have file extensions, even if those files are of supported types.
Skipped files
When de-identifying files in storage, Sensitive Data Protection skips the following files:
- Files that exceed 60,000 KB. If you have large files that exceed this limit, consider breaking them into smaller chunks.
- File types that aren't listed in Supported file types on this page.
- File types that you purposely excluded from the de-identification configuration. If you're using the DLP API, the file types that you excluded from the file_types_to_transform field of the Deidentify action of your DlpJob are skipped.
- Files that encountered transformation errors.
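The first three rules in this list can be approximated locally to estimate which files a job would skip. The sketch below is an illustration under stated assumptions: the extension-to-type mapping is a small made-up sample rather than the service's actual list, files with no recognized extension are treated as skipped even though the service only says it might not de-identify them, and transformation errors can't be predicted in advance. The function name `skip_reason` is hypothetical.

```python
import os

MAX_SIZE_KB = 60_000

# Illustrative mapping only (an assumption, not the service's actual list).
EXTENSION_GROUPS = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".txt": "TEXT_FILE",
    ".png": "IMAGE",
    ".jpg": "IMAGE",
}


def skip_reason(name, size_kb, file_types_to_transform):
    """Return why a file would likely be skipped, or None if it wouldn't."""
    if size_kb > MAX_SIZE_KB:
        return "exceeds 60,000 KB"
    extension = os.path.splitext(name)[1].lower()
    group = EXTENSION_GROUPS.get(extension)
    if group is None:
        return "unsupported or missing file extension"
    if group not in file_types_to_transform:
        return "file type excluded from file_types_to_transform"
    return None


# Hypothetical files: (name, size in KB).
files = [("report.csv", 1200), ("scan.PNG", 75_000), ("notes", 10), ("photo.jpg", 300)]
for file_name, size in files:
    print(file_name, skip_reason(file_name, size, {"CSV", "TEXT_FILE"}))
```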
Order of output rows in de-identified tables
There is no guarantee that the order of rows in a de-identified table matches the order of rows in the original table. If you want to compare the original table to the de-identified table, you can't rely on the row number to identify the corresponding rows. If you intend to compare rows of the tables, you must use a unique identifier to identify each record.
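A minimal sketch of matching rows by a unique identifier rather than by position follows. It assumes each row is a dictionary and that the identifier column (here the hypothetical `record_id`) is not itself transformed by your de-identification configuration; the sample data is made up.

```python
def pair_rows_by_id(original, deidentified, key="record_id"):
    """Pair each original row with its de-identified counterpart by key."""
    # Index the de-identified rows by identifier, because their order
    # isn't guaranteed to match the original table.
    deid_by_id = {row[key]: row for row in deidentified}
    return [
        (row, deid_by_id[row[key]])
        for row in original
        if row[key] in deid_by_id
    ]


original_rows = [
    {"record_id": "a1", "email": "alice@example.com"},
    {"record_id": "b2", "email": "bob@example.com"},
]
# Same records, returned in a different order.
deidentified_rows = [
    {"record_id": "b2", "email": "[EMAIL_ADDRESS]"},
    {"record_id": "a1", "email": "[EMAIL_ADDRESS]"},
]
pairs = pair_rows_by_id(original_rows, deidentified_rows)
```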
Note: The inspection findings summary shows the row number of each finding. The referenced rows belong to the original table, not the de-identified table.
Transient keys
If you choose a cryptographic method as your transformation method, you must first create a wrapped key using Cloud Key Management Service. Then, provide that key in your de-identification template. Transient (raw) keys aren't supported.
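The following sketch shows roughly how a wrapped key might be expressed inside a template's cryptographic transformation, along with a local check that rejects anything other than a wrapped key. The `kms_wrapped`, `wrapped_key`, and `crypto_key_name` field names follow the DLP API's `CryptoKey` message as understood here. The helper names, the sample key material, and the Cloud KMS resource name are hypothetical placeholders.

```python
import base64


def kms_wrapped_crypto_key(wrapped_key_b64, kms_key_name):
    """Build a CryptoKey dict that references a Cloud KMS-wrapped key."""
    return {
        "kms_wrapped": {
            # The wrapped key is usually handled as base64 text; decode it
            # to bytes here.
            "wrapped_key": base64.b64decode(wrapped_key_b64),
            "crypto_key_name": kms_key_name,
        }
    }


def is_wrapped_key(crypto_key):
    """Return True only if the key uses the kms_wrapped form."""
    return set(crypto_key) == {"kms_wrapped"}


crypto_key = kms_wrapped_crypto_key(
    "c2FtcGxl",  # placeholder value, not a real wrapped key
    "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key",
)
```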
What's next
- Learn how to de-identify sensitive data stored in Cloud Storage using the DLP API.
- Learn how to de-identify sensitive data stored in Cloud Storage using the Google Cloud console.
- Work through the Creating a De-identified Copy of Data in Cloud Storage codelab.
- Learn how to inspect storage for sensitive data.
Last updated 2025-12-15 UTC.