Extend differential privacy
This document provides examples of how to extend BigQuery differential privacy.
BigQuery lets you extend differential privacy to multi-cloud data sources and external differential privacy libraries. This document provides examples of how to apply differential privacy to multi-cloud data sources like AWS S3 with BigQuery Omni, how to call an external differential privacy library using a remote function, and how to perform differential privacy aggregations with PipelineDP, a Python library that can run with Apache Spark and Apache Beam.
Note: In this document, the privacy parameters in the examples are not recommendations. You should work with your privacy or security officer to determine the optimal privacy parameters for your dataset and organization. For more information about differential privacy, see Use differential privacy.
Differential privacy with BigQuery Omni
BigQuery differential privacy supports calls to multi-cloud data sources like AWS S3. The following example queries an external data source, foo.wikidata, and applies differential privacy. For more information about the syntax of the differential privacy clause, see Differential privacy clause.
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (
    epsilon = 1,
    delta = 1e-5,
    privacy_unit_column = foo.wikidata.es_description)
  COUNT(*) AS results
FROM foo.wikidata;
This example returns results similar to the following:
-- These results will change each time you run the query.
+---------+
| results |
+---------+
| 3465    |
+---------+
For more information about BigQuery Omni limitations, see Limitations.
Call external differential privacy libraries with remote functions
You can call external differential privacy libraries using remote functions. The following link uses a remote function to call an external library hosted by Tumult Analytics to use zero-concentrated differential privacy on a retail sales dataset.
For information about working with Tumult Analytics, see the Tumult Analytics launch post.
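The exact setup depends on the external library and the linked tutorial, but the general pattern is a BigQuery remote function backed by an HTTP endpoint (for example, a Cloud Run or Cloud Functions service) that wraps the library. The following sketch is illustrative only; the function name, signature, connection, endpoint URL, and the example table and column are hypothetical placeholders, not the API used in the Tumult Analytics tutorial.

-- Hypothetical remote function that forwards values to an endpoint wrapping
-- an external differential privacy library.
CREATE OR REPLACE FUNCTION `PROJECT_ID.DATASET_ID.external_dp_library_call`(value FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  -- Hypothetical endpoint; replace with the service that hosts the library.
  endpoint = 'https://REGION-PROJECT_ID.cloudfunctions.net/external-dp-endpoint'
);

-- Invoke the remote function from a query; BigQuery sends batches of rows
-- to the endpoint and reads back the returned values.
SELECT `PROJECT_ID.DATASET_ID.external_dp_library_call`(sale_amount) AS noised_value
FROM `PROJECT_ID.DATASET_ID.retail_sales`;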
Differential privacy aggregations with PipelineDP
PipelineDP is a Python library that performs differential privacy aggregations and can run with Apache Spark and Apache Beam. BigQuery can run Apache Spark stored procedures written in Python. For more information about running Apache Spark stored procedures, see Work with stored procedures for Apache Spark.
The following example performs a differential privacy aggregation using the PipelineDP library. It uses the Chicago Taxi Trips public dataset and, for each taxi, computes the number of trips and the sum and mean of tips for those trips.
Before you begin
A standard Apache Spark image does not include PipelineDP. You must create a Docker image that contains all necessary dependencies before running a PipelineDP stored procedure. This section describes how to create and push a Docker image to Google Cloud.
Before you begin, ensure you have installed Docker on your local machine and set up authentication for pushing Docker images to gcr.io. For more information about pushing Docker images, see Push and pull images.
Create and push a Docker image
To create and push a Docker image with required dependencies, follow these steps:
- Create a local folder DIR.
- Download the Miniconda installer, with the Python 3.9 version, to DIR.
- Save the following text to the Dockerfile.
# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini libjemalloc2

# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_23.1.0-1-Linux-x86_64.sh .
RUN bash Miniconda3-py39_23.1.0-1-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# The following packages are installed in the default image, it is
# strongly recommended to include all of them.
RUN apt install -y python3
RUN apt install -y python3-pip
RUN apt install -y libopenblas-dev
RUN pip install \
    cython \
    fastavro \
    fastparquet \
    gcsfs \
    google-cloud-bigquery-storage \
    google-cloud-bigquery[pandas] \
    google-cloud-bigtable \
    google-cloud-container \
    google-cloud-datacatalog \
    google-cloud-dataproc \
    google-cloud-datastore \
    google-cloud-language \
    google-cloud-logging \
    google-cloud-monitoring \
    google-cloud-pubsub \
    google-cloud-redis \
    google-cloud-spanner \
    google-cloud-speech \
    google-cloud-storage \
    google-cloud-texttospeech \
    google-cloud-translate \
    google-cloud-vision \
    koalas \
    matplotlib \
    nltk \
    numba \
    numpy \
    orc \
    pandas \
    pyarrow \
    pysal \
    regex \
    requests \
    rtree \
    scikit-image \
    scikit-learn \
    scipy \
    seaborn \
    sqlalchemy \
    sympy \
    tables \
    virtualenv

RUN pip install --no-input pipeline-dp==0.2.0

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark

Run the following command.
IMAGE=gcr.io/PROJECT_ID/DOCKER_IMAGE:0.0.1

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"

Replace the following:
- PROJECT_ID: the project in which you want to create the Docker image.
- DOCKER_IMAGE: the Docker image name.
The image is uploaded.
Run a PipelineDP stored procedure
To create a stored procedure, use the CREATE PROCEDURE statement.
CREATE OR REPLACE
  PROCEDURE `PROJECT_ID.DATASET_ID.pipeline_dp_example_spark_proc`()
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
    engine = "SPARK",
    container_image = "gcr.io/PROJECT_ID/DOCKER_IMAGE")
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession
import pipeline_dp

def compute_dp_metrics(data, spark_context):
    budget_accountant = pipeline_dp.NaiveBudgetAccountant(total_epsilon=10,
                                                          total_delta=1e-6)
    backend = pipeline_dp.SparkRDDBackend(spark_context)

    # Create a DPEngine instance.
    dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)

    params = pipeline_dp.AggregateParams(
        noise_kind=pipeline_dp.NoiseKind.LAPLACE,
        metrics=[
            pipeline_dp.Metrics.COUNT,
            pipeline_dp.Metrics.SUM,
            pipeline_dp.Metrics.MEAN],
        max_partitions_contributed=1,
        max_contributions_per_partition=1,
        min_value=0,
        # Tips that are larger than 100 will be clipped to 100.
        max_value=100)

    # Specify how to extract privacy_id, partition_key and value from an
    # element of the taxi dataset.
    data_extractors = pipeline_dp.DataExtractors(
        partition_extractor=lambda x: x.taxi_id,
        privacy_id_extractor=lambda x: x.unique_key,
        value_extractor=lambda x: 0 if x.tips is None else x.tips)

    # Run aggregation.
    dp_result = dp_engine.aggregate(data, params, data_extractors)

    budget_accountant.compute_budgets()
    dp_result = backend.map_tuple(
        dp_result,
        lambda pk, result: (pk, result.count, result.sum, result.mean))
    return dp_result

spark = SparkSession.builder.appName("spark-pipeline-dp-demo").getOrCreate()
spark_context = spark.sparkContext

# Load data from BigQuery.
taxi_trips = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data:chicago_taxi_trips.taxi_trips") \
    .load().rdd

dp_result = compute_dp_metrics(taxi_trips, spark_context).toDF(
    ["pk", "count", "sum", "mean"])

# Saving the data to BigQuery
dp_result.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .save("DATASET_ID.TABLE_NAME")
""";
Replace the following:
- PROJECT_ID: the project in which you want to create the stored procedure.
- DATASET_ID: the dataset in which you want to create the stored procedure.
- REGION: the region your project is located in.
- DOCKER_IMAGE: the Docker image name.
- CONNECTION_ID: the name of the connection.
- TABLE_NAME: the name of the table.
Use the CALL statement to call the procedure.
CALL `PROJECT_ID.DATASET_ID.pipeline_dp_example_spark_proc`()
Replace the following:
- PROJECT_ID: the project in which you want to create the stored procedure.
- DATASET_ID: the dataset in which you want to create the stored procedure.
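After the procedure completes, the differentially private metrics are in the output table. As an illustrative follow-up that is not part of the original example (the column names pk, count, sum, and mean match the schema that the procedure above writes), you could inspect the results with a query like the following:

-- Illustrative only: read back the differentially private aggregates.
SELECT pk AS taxi_id, count, sum, mean
FROM `PROJECT_ID.DATASET_ID.TABLE_NAME`
LIMIT 10;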
What's next
- Learn how to use differential privacy.
- Learn about the differential privacy clause.
- Learn how to use differentially private aggregate functions.