De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection
This document discusses how to use Sensitive Data Protection to create an automated data transformation pipeline to de-identify sensitive data like personally identifiable information (PII). De-identification techniques like tokenization (pseudonymization) let you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by obfuscating the raw sensitive identifiers. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified replicas. Sensitive Data Protection enables transformations such as redaction, masking, tokenization, bucketing, and other methods of de-identification. When a dataset hasn't been characterized, Sensitive Data Protection can also inspect the data for sensitive information by using more than 100 built-in classifiers.
This document is intended for a technical audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy, without the need to be an expert.
Reference architecture
The following diagram shows a reference architecture for using Google Cloud products to add a layer of security to sensitive datasets by using de-identification techniques.
The architecture consists of the following:
Data de-identification streaming pipeline: De-identifies sensitive data in text using Dataflow. You can reuse the pipeline for multiple transformations and use cases.
Configuration (Sensitive Data Protection template and key) management: A managed de-identification configuration that is accessible by only a small group of people—for example, security admins—to avoid exposing de-identification methods and encryption keys.
Data validation and re-identification pipeline: Validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at large scale.
Helping to secure sensitive data
One of the key tasks of any enterprise is to help ensure the security of their users' and employees' data. Google Cloud provides built-in security measures to facilitate data security, including encryption of stored data and encryption of data in transit.
Encryption at rest: Cloud Storage
Maintaining data security is critical for most organizations. Unauthorized access to even moderately sensitive data can damage the trust, relationships, and reputation that you have with your customers. Google encrypts data stored at rest by default. By default, any object uploaded to a Cloud Storage bucket is encrypted using a Google-owned and Google-managed encryption key. If your dataset uses a pre-existing encryption method and requires a non-default option before uploading, there are other encryption options provided by Cloud Storage. For more information, see Data encryption options.
Encryption in transit: Dataflow
When your data is in transit, the at-rest encryption isn't in place. In-transit data is protected by secure network protocols, referred to as encryption in transit. By default, Dataflow uses Google-owned and Google-managed encryption keys. The tutorials associated with this document use an automated pipeline that uses the default Google-owned and Google-managed encryption keys.
Sensitive Data Protection data transformations
There are two main types of transformations performed by Sensitive Data Protection: `recordTransformations` and `infoTypeTransformations`.
Both the `recordTransformations` and `infoTypeTransformations` methods can de-identify and encrypt sensitive information in your data. For example, you can transform the values in the `US_SOCIAL_SECURITY_NUMBER` column to be unidentifiable, or use tokenization to obscure them while keeping referential integrity.
The `infoTypeTransformations` method lets you inspect for sensitive data and transform the findings. For example, if you have unstructured or free-text data, the `infoTypeTransformations` method can help you identify an SSN inside of a sentence and encrypt the SSN value while leaving the rest of the text intact. You can also define custom infoTypes.
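As an illustrative sketch (not an official sample), the following Python function builds the request body that could be passed to the `deidentify_content` method of the `google-cloud-dlp` client library's `DlpServiceClient`. The project ID is a placeholder, and the choice of `replace_with_info_type_config` as the transformation is an assumption for this example.

```python
# Illustrative sketch: build a deidentify_content request body that inspects
# free text for SSNs and replaces each finding with its infoType name.
# In a real call, you would pass this dict to
# google.cloud.dlp_v2.DlpServiceClient().deidentify_content(request=...).
def build_infotype_deidentify_request(project_id: str, text: str) -> dict:
    return {
        "parent": f"projects/{project_id}",
        # Inspect only for US Social Security numbers.
        "inspect_config": {"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        # Replace each finding with its infoType name,
                        # leaving the rest of the text intact.
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        }
                    }
                ]
            }
        },
        "item": {"value": text},
    }
```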
The `recordTransformations` method lets you apply a transformation configuration per field when using structured or tabular data. With the `recordTransformations` method, you can apply the same transformation across every value in a field, such as hashing or tokenizing every value in a column that has `SSN` as the field or header name.
With the `recordTransformations` method, you can also mix in `infoTypeTransformations` methods that apply only to the values in the specified fields. For example, you can use an `infoTypeTransformations` method inside of a `recordTransformations` method for the field named `comments` to redact any findings for `US_SOCIAL_SECURITY_NUMBER` inside the text of that field.
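A minimal sketch of what such a mixed configuration could look like, expressed as the dict form accepted by the `google-cloud-dlp` client; the field names (`SSN`, `comments`) and the transient key name are assumptions for this example.

```python
# Illustrative sketch of a recordTransformations configuration that mixes a
# per-field primitive transformation with a nested infoTypeTransformations
# entry for a free-text field.
def build_record_transformations_config() -> dict:
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    # Hash every value in the SSN column; no inspection is
                    # needed because the column is known to contain only SSNs.
                    "fields": [{"name": "SSN"}],
                    "primitive_transformation": {
                        "crypto_hash_config": {
                            "crypto_key": {"transient": {"name": "example-key"}}
                        }
                    },
                },
                {
                    # For the free-text "comments" field, inspect the values
                    # and redact only US_SOCIAL_SECURITY_NUMBER findings.
                    "fields": [{"name": "comments"}],
                    "info_type_transformations": {
                        "transformations": [
                            {
                                "info_types": [
                                    {"name": "US_SOCIAL_SECURITY_NUMBER"}
                                ],
                                "primitive_transformation": {"redact_config": {}},
                            }
                        ]
                    },
                },
            ]
        }
    }
```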
In increasing order of complexity, the de-identification processes are asfollows:
- Redaction: Remove the sensitive content with no replacement of content.
- Masking: Replace the sensitive content with fixed characters.
- Encryption: Replace sensitive content with encrypted strings, possibly reversibly.
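To make the masking idea concrete, here is a minimal local sketch (plain Python, not a Sensitive Data Protection call) that masks all but the last four digits of an SSN-formatted string; the keep-last-four rule is an assumption of this example.

```python
def mask_ssn(ssn: str, mask_char: str = "*") -> str:
    # Walk the string from the end, keeping the last four digits and
    # replacing every earlier digit with the mask character.
    # Non-digit characters (hyphens) are preserved.
    digits_seen = 0
    out = []
    for ch in reversed(ssn):
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen <= 4 else mask_char)
        else:
            out.append(ch)
    return "".join(reversed(out))
```

For example, `mask_ssn("123-45-6789")` yields `"***-**-6789"`.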
Working with delimited data
Often, data consists of records delimited by a selected character, with fixed types in each column, like a CSV file. For this class of data, you can apply de-identification transformations (`recordTransformations`) directly, without inspecting the data. For example, you can expect a column labeled `SSN` to contain only SSN data. You don't need to inspect the data to know that the infoType detector is `US_SOCIAL_SECURITY_NUMBER`. However, a free-form column labeled `Additional Details` can contain sensitive information whose infoType class is unknown beforehand. For a free-form column, you need to inspect for infoTypes (`infoTypeTransformations`) before applying de-identification transformations. Sensitive Data Protection allows both of these transformation types to co-exist in a single de-identification template. Sensitive Data Protection includes more than 100 built-in infoType detectors. You can also create custom types or modify built-in infoType detectors to find sensitive data that is unique to your organization.
Determining transformation type
Determining when to use the `recordTransformations` or `infoTypeTransformations` method depends on your use case. Because using the `infoTypeTransformations` method requires more resources and is therefore more costly, we recommend using this method only for situations where the data type is unknown. You can evaluate the costs of running Sensitive Data Protection using the Google Cloud pricing calculator.
For examples of transformation, this document refers to a dataset that contains CSV files with fixed columns, as demonstrated in the following table.
| Column name | Inspection infoType (custom or built-in) | Sensitive Data Protection transformation type |
|---|---|---|
| Card Number | Not applicable | Deterministic encryption (DE) |
| Card Holder's Name | Not applicable | Deterministic encryption (DE) |
| Card PIN | Not applicable | Crypto hashing |
| SSN (Social Security Number) | Not applicable | Masking |
| Age | Not applicable | Bucketing |
| Job Title | Not applicable | Bucketing |
| Additional Details | Built-in: `IBAN_CODE`, `EMAIL_ADDRESS`, `PHONE_NUMBER`; Custom: `ONLINE_USER_ID` | Replacement |
This table lists the column names and describes which type of transformation is needed for each column. For example, the `Card Number` column contains credit card numbers that need to be encrypted; however, they don't need to be inspected, because the data type (infoType) is known.
The only column where an inspection transformation is recommended is the `Additional Details` column. This column is free-form and might contain PII, which, for the purposes of this example, should be detected and de-identified.
The examples in this table present five different de-identification transformations:
Two-way tokenization: Replaces the original data with a deterministic token, preserving referential integrity. You can use the token to join data or use the token in aggregate analysis. You can reverse, or de-tokenize, the data by using the same key that you used to create the token. There are two methods for two-way tokenization:
- Deterministic encryption (DE): Replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
- Format-preserving encryption with FFX (FPE-FFX): Replaces the original data with a token generated by using format-preserving encryption in FFX mode. By design, FPE-FFX preserves the length and character set of the input text, but it lacks authentication and an initialization vector (adding them would cause a length expansion in the output token). Other methods, like DE, provide stronger security and are recommended for tokenization use cases unless length and character-set preservation are strict requirements, such as backward compatibility with legacy data systems.
One-way tokenization, using cryptographic hashing: Replaces the original value with a hashed value, preserving referential integrity. However, unlike two-way tokenization, a one-way method isn't reversible. The hash value is generated by using an SHA-256-based message authentication code (HMAC-SHA-256) on the input value.
Masking: Replaces the original data with a specified character, either partially or completely.
Bucketing: Replaces a more identifiable value with a less distinguishing value.
Replacement: Replaces original data with a token or with the name of the infoType, if detected.
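As a local illustration of one-way tokenization (a sketch of the idea, not the Sensitive Data Protection implementation), the following uses Python's standard `hmac` module to compute an HMAC-SHA-256 token: equal inputs map to equal tokens under the same key, preserving referential integrity, while the original value can't be recovered from the token.

```python
import hashlib
import hmac

def one_way_token(value: str, key: bytes) -> str:
    # HMAC-SHA-256 keyed hash: deterministic for a given key, so tokens
    # preserve referential integrity, but the hash isn't reversible.
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the same key yields the same token, hashed columns from two de-identified datasets can still be joined on the token values.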
Method selection
Choosing the best de-identification method can vary based on your use case. Forexample, if a legacy app is processing the de-identified records, then formatpreservation might be important. If you're dealing with strictly formatted10-digit numbers, FPE preserves the length (10 digits) and character set(numeric) of an input for legacy system support.
However, if strict formatting isn't required for legacy compatibility, as is the case for values in the `Card Holder's Name` column, then DE is the preferred choice because it has a stronger authentication method. Both FPE and DE enable the tokens to be reversed or de-tokenized. If you don't need de-tokenization, then cryptographic hashing provides integrity but the tokens can't be reversed.
Other methods—like masking, bucketing, date-shifting, and replacement—are good for values that don't need to retain full integrity. For example, bucketing an age value (for example, 27) to an age range (20-30) can still be analyzed while reducing the uniqueness that might lead to the identification of an individual.
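The age-bucketing idea can be sketched locally as follows; the decade-wide bucket size is an assumption of this example, not a fixed Sensitive Data Protection behavior.

```python
def bucket_age(age: int) -> str:
    # Map an exact age to a decade-wide range, e.g. 27 -> "20-30",
    # reducing uniqueness while keeping the value useful for analysis.
    low = (age // 10) * 10
    return f"{low}-{low + 10}"
```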
Token encryption keys
For cryptographic de-identification transformations, a cryptographic key, also known as a token encryption key, is required. The token encryption key that is used for de-identification encryption is also used to re-identify the original value. The secure creation and management of token encryption keys are beyond the scope of this document. However, there are some important principles to consider that are used later in the associated tutorials:
- Avoid using plaintext keys in the template. Instead, use Cloud KMS to create a wrapped key.
- Use separate token encryption keys for each data element to reduce the risk of compromising keys.
- Rotate token encryption keys. Although you can rotate the wrapped key, rotating the token encryption key breaks the integrity of the tokenization. When the key is rotated, you need to re-tokenize the entire dataset.
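A sketch of how a KMS-wrapped token encryption key appears in a deterministic-encryption transformation, in the dict form accepted by the `google-cloud-dlp` client; the KMS key name, wrapped-key value, and surrogate infoType name are placeholders.

```python
def build_deterministic_encryption_transform(
    wrapped_key_b64: str, kms_key_name: str
) -> dict:
    return {
        "primitive_transformation": {
            "crypto_deterministic_config": {
                "crypto_key": {
                    # The token encryption key appears only in wrapped form;
                    # Cloud KMS unwraps it at transformation time, so no
                    # plaintext key is stored in the template.
                    "kms_wrapped": {
                        "wrapped_key": wrapped_key_b64,
                        "crypto_key_name": kms_key_name,
                    }
                },
                # Surrogate annotation that marks the tokens so they can be
                # located later for re-identification.
                "surrogate_info_type": {"name": "CARD_NUMBER_TOKEN"},
            }
        }
    }
```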
Sensitive Data Protection templates
For large-scale deployments, use Sensitive Data Protection templates to accomplish the following:
- Enable security control with Identity and Access Management (IAM).
- Decouple configuration information, and how you de-identify that information, from the implementation of your requests.
- Reuse a set of transformations. You can use the de-identify and re-identify templates over multiple datasets.
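A sketch of the request body for creating a reusable de-identification template with the `DlpServiceClient.create_deidentify_template` method of the `google-cloud-dlp` client; the template ID and display name are placeholders.

```python
def build_create_template_request(project_id: str, deidentify_config: dict) -> dict:
    # The created template's resource name can then be referenced by
    # de-identification jobs instead of repeating the configuration
    # in every request.
    return {
        "parent": f"projects/{project_id}/locations/global",
        "template_id": "deid-demo-template",  # placeholder ID
        "deidentify_template": {
            "display_name": "Demo de-identification template",
            "deidentify_config": deidentify_config,
        },
    }
```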
BigQuery
The final component of the reference architecture is viewing and working with the de-identified data in BigQuery. BigQuery is Google's data warehouse tool that includes serverless infrastructure, BigQuery ML, and the ability to run Sensitive Data Protection as a native tool. In the example reference architecture, BigQuery serves as a data warehouse for the de-identified data and as a backend to an automated re-identification data pipeline that can share data through Pub/Sub.
What's next
- Learn about using Sensitive Data Protection to inspect storage and databases for sensitive data.
- Learn about other pattern recognition solutions.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-06-07 UTC.