Use differential privacy

This document provides general information about differential privacy for BigQuery. For syntax, see the differential privacy clause. For a list of functions that you can use with this syntax, see differentially private aggregate functions.

Note: In this topic, the privacy parameters in the examples aren't recommendations. You should work with your privacy or security officer to determine the optimal privacy parameters for your dataset and organization.

What is differential privacy?

Differential privacy is a standard for computations on data that limits the personal information that's revealed by an output. Differential privacy is commonly used to share data and to allow inferences about groups of people while preventing someone from learning information about an individual.

Differential privacy is useful:

  • Where a risk of re-identification exists.
  • To quantify the tradeoff between risk and analytical utility.

To better understand differential privacy, let's look at a simple example.

This bar chart shows the busyness of a small restaurant on one particular evening. Lots of guests come at 7 PM, and the restaurant is completely empty at 1 AM:

Chart shows busyness of a small restaurant by mapping visitors at specific hours of the day.

This chart looks useful, but there's a catch. When a new guest arrives, this fact is immediately revealed by the bar chart. In the following chart, it's clear that there's a new guest, and that this guest arrived at roughly 1 AM:

Chart shows outlier arrival.

Showing this detail isn't great from a privacy perspective, as anonymized statistics shouldn't reveal individual contributions. Putting those two charts side by side makes it even more apparent: the orange bar chart has one extra guest that has arrived around 1 AM:

Chart comparison highlights an individual contribution.

Again, that's not great. To avoid this kind of privacy issue, you can add random noise to the bar charts by using differential privacy. In the following comparison chart, the results are anonymized and no longer reveal individual contributions.

Differential privacy is applied to comparisons.
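To make the example concrete, the following is a minimal sketch of how anonymized hourly counts like these could be computed with a differentially private query. The restaurant_visits table, its visitor_id and visit_hour columns, and the privacy parameters are all hypothetical and aren't recommendations.

-- Hypothetical table: restaurant_visits(visitor_id, visit_hour)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 1, delta = 1e-5, privacy_unit_column = visitor_id)
  visit_hour,
  -- Each visitor adds at most 1 to each hour's count; noise is added to the result.
  COUNT(*, contribution_bounds_per_group => (0, 1)) AS noisy_visits
FROM restaurant_visits
GROUP BY visit_hour;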

How differential privacy works on queries

The goal of differential privacy is to mitigate disclosure risk: the risk that someone can learn information about an entity in a dataset. Differential privacy balances the need to safeguard privacy against the need for statistical analytical utility. As privacy increases, statistical analytical utility decreases, and vice versa.

With GoogleSQL for BigQuery, you can transform the results of a query with differentially private aggregations. When the query is executed, it performs the following:

  1. Computes per-entity aggregations for each group if groups are specified with a GROUP BY clause. Limits the number of groups each entity can contribute to, based on the max_groups_contributed differential privacy parameter.
  2. Clamps each per-entity aggregate contribution to be within the clamping bounds. If the clamping bounds aren't specified, they are implicitly calculated in a differentially private way.
  3. Aggregates the clamped per-entity aggregate contributions for each group.
  4. Adds noise to the final aggregate value for each group. The scale of random noise is a function of all of the clamping bounds and privacy parameters.
  5. Computes a noisy entity count for each group and eliminates groups with few entities. A noisy entity count helps eliminate a non-deterministic set of groups.

The final result is a dataset where each group has noisy aggregate results and small groups have been eliminated.
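The following sketch shows where each of these steps applies in a query. The professors table, its columns, the clamping bounds, and the privacy parameters are hypothetical examples, not recommendations.

-- Hypothetical table: professors(id, subject, salary)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (
    epsilon = 10,
    delta = .01,
    privacy_unit_column = id,     -- the protected entity (step 1)
    max_groups_contributed = 2    -- each entity contributes to at most 2 groups (step 1)
  )
  subject,
  -- Each entity's per-group contribution is clamped to [0, 150000] (step 2),
  -- the clamped contributions are aggregated (step 3), and noise is added (step 4).
  SUM(salary, contribution_bounds_per_group => (0, 150000)) AS noisy_total_salary
FROM professors
GROUP BY subject;  -- groups with too few entities are eliminated (step 5)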

Note: BigQuery relies on Google's open source differential privacy library to implement differential privacy functionality. The library provides low-level differential privacy primitives that you can use to implement end-to-end privacy systems. For additional information on guarantees, see Limitations on privacy guarantees.

Note: BigQuery additionally supports differential privacy for BigQuery Omni data sources, including Amazon Simple Storage Service (Amazon S3). BigQuery remote functions can also call external differential privacy libraries like Tumult Analytics.

For additional context on what differential privacy is and its use cases, see the following articles:

Produce a valid differentially private query

The following rules must be met for the differentially private query to be valid:

Define a privacy unit column

A privacy unit is the entity in a dataset that's being protected using differential privacy. An entity can be an individual, a company, a location, or any column that you choose.

A differentially private query must include one and only one privacy unit column. A privacy unit column is a unique identifier for a privacy unit and can exist within multiple groups. Because multiple groups are supported, the data type for the privacy unit column must be groupable.

You can define a privacy unit column in the OPTIONS clause of a differential privacy clause with the unique identifier privacy_unit_column.

In the following examples, a privacy unit column is added to a differential privacy clause. id represents a column that originates from a table called students.

SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 10, delta = .01, privacy_unit_column = id)
  item,
  COUNT(*, contribution_bounds_per_group => (0, 100))
FROM students;

SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 10, delta = .01, privacy_unit_column = members.id)
  item,
  COUNT(*, contribution_bounds_per_group => (0, 100))
FROM (SELECT * FROM students) AS members;

Remove noise from a differentially private query

In the "Query syntax" reference, seeRemove noise.

Add noise to a differentially private query

In the "Query syntax" reference, seeAdd noise.

Limit the groups in which a privacy unit ID can exist

In the "Query syntax" reference, seeLimit the groups in which a privacy unit ID can exist.

Limitations

This section describes limitations of differential privacy.

Performance implications of differential privacy

Differentially private queries execute more slowly than standard queries because per-entity aggregation is performed and the max_groups_contributed limitation is applied. Limiting contribution bounds can help improve the performance of your differentially private queries.
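For example, you could tighten the contribution bounds by setting max_groups_contributed explicitly in the OPTIONS clause, as in the following sketch. The value shown is hypothetical; work with your privacy or security officer to choose parameters that fit your data.

SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (
    epsilon = 1,
    delta = 1e-10,
    privacy_unit_column = id,
    max_groups_contributed = 1  -- hypothetical bound; limits how many groups each entity contributes to
  )
  column_a,
  COUNT(column_b)
FROM table_a
GROUP BY column_a;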

The performance profiles of the following queries aren't similar:

SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 1, delta = 1e-10, privacy_unit_column = id)
  column_a,
  COUNT(column_b)
FROM table_a
GROUP BY column_a;

SELECT column_a, COUNT(column_b)
FROM table_a
GROUP BY column_a;

The reason for the performance difference is that an additional finer-granularity level of grouping is performed for differentially private queries, because per-entity aggregation must also be performed.

The performance profiles of the following queries should be similar, although the differentially private query is slightly slower:

SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 1, delta = 1e-10, privacy_unit_column = id)
  column_a,
  COUNT(column_b)
FROM table_a
GROUP BY column_a;

SELECT column_a, id, COUNT(column_b)
FROM table_a
GROUP BY column_a, id;

The differentially private query performs more slowly when the privacy unit column has a high number of distinct values.

Implicit bounding limitations for small datasets

Implicit bounding works best when computed using large datasets. Implicit bounding can fail with datasets that contain a low number of privacy units, returning no results. Furthermore, implicit bounding on a dataset with a low number of privacy units can clamp a large portion of non-outliers, leading to underreported aggregations and results that are altered more by clamping than by added noise. Datasets that have a low number of privacy units or are thinly partitioned should use explicit rather than implicit clamping.
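As a sketch, explicit clamping means passing the contribution bounds directly to the differentially private aggregate function instead of relying on implicitly computed bounds. The table, columns, bounds, and privacy parameters below are hypothetical.

-- Hypothetical table: orders(id, item, quantity)
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 10, delta = .01, privacy_unit_column = id)
  item,
  -- Explicit clamping: each entity's per-group contribution is clamped to [0, 100],
  -- so the bounds don't have to be computed from a small dataset.
  SUM(quantity, contribution_bounds_per_group => (0, 100)) AS noisy_quantity
FROM orders
GROUP BY item;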

Privacy vulnerabilities

Any differential privacy algorithm, including this one, incurs the risk of a private data leak when an analyst acts in bad faith, especially when computing basic statistics like sums, due to arithmetic limitations.

Limitations on privacy guarantees

While BigQuery differential privacy applies the differential privacy algorithm, it doesn't make a guarantee regarding the privacy properties of the resulting dataset.

Runtime errors

An analyst acting in bad faith with the ability to write queries or control input data could trigger a runtime error on private data.

Floating point noise

Vulnerabilities related to rounding, repeated rounding, and re-ordering attacks should be considered before using differential privacy. These vulnerabilities are particularly concerning when an attacker can control some of the contents of a dataset or the order of contents in a dataset.

Differentially private noise additions on floating-point data types are subject to the vulnerabilities described in Widespread Underestimation of Sensitivity in Differentially Private Libraries and How to Fix It. Noise additions on integer data types aren't subject to the vulnerabilities described in the paper.

Timing attack risks

An analyst acting in bad faith could execute a sufficiently complex query to make an inference about input data based on a query's execution duration.

Misclassification

Creating a differential privacy query assumes that your data is in a well-known and understood structure. If you apply differential privacy on the wrong identifiers, such as one that represents a transaction ID instead of an individual's ID, you could expose sensitive data.
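For example, in a hypothetical transactions table where transaction_id identifies a single purchase and customer_id identifies a person, the privacy unit column should be customer_id. Choosing transaction_id would treat every row as its own privacy unit, so a person with many transactions wouldn't be protected as a whole. The table, columns, and parameters below are illustrative only.

-- Hypothetical table: transactions(transaction_id, customer_id, store, amount)
-- The privacy unit is the customer, not the transaction.
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 1, delta = 1e-5, privacy_unit_column = customer_id)
  store,
  SUM(amount, contribution_bounds_per_group => (0, 500)) AS noisy_revenue
FROM transactions
GROUP BY store;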

If you need help understanding your data, consider using services and tools such as the following:

Pricing

There is no additional cost to use differential privacy, but standard BigQuery pricing for analysis applies.
