Cloud Pub/Sub to Cloud Storage template

Use the Serverless for Apache Spark Cloud Pub/Sub to Cloud Storage template to extract data from Pub/Sub to Cloud Storage.

Use the template

Run the template using the gcloud CLI or Dataproc API.

gcloud

Before using any of the command data below, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • PUBSUB_SUBSCRIPTION_PROJECT_ID: Required. The Google Cloud project ID listed in the IAM Settings that contains the input Pub/Sub subscription to be read.
  • SUBSCRIPTION: Required. Pub/Sub subscription name.
  • CLOUD_STORAGE_OUTPUT_BUCKET_NAME: Required. Cloud Storage bucket name where output will be stored.

    Note: The output files will be stored in the output/ folder inside the bucket.

  • FORMAT: Required. Output data format. Options: avro or json.

    Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • TIMEOUT: Optional. Time in milliseconds before termination of stream. Defaults to 60000.
  • DURATION: Optional. Frequency in seconds of writes to Cloud Storage. Defaults to 15 seconds.
  • NUM_RECEIVERS: Optional. Number of streams read from a Pub/Sub subscription in parallel. Defaults to 5.
  • BATCHSIZE: Optional. Number of records to insert in one round trip into Cloud Storage. Defaults to 1000.
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
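
Before running the template, you can optionally confirm that the input subscription and output bucket exist. The following is a minimal sketch using standard gcloud commands; my-subscription and my-output-bucket are hypothetical names, not values from this guide:

gcloud pubsub subscriptions describe my-subscription --project=PROJECT_ID
gcloud storage buckets describe gs://my-output-bucket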

Execute the following command:

Linux, macOS, or Cloud Shell

Note: Ensure you have initialized the Google Cloud CLI with authentication and a project by running either gcloud init; or gcloud auth login and gcloud config set project.
gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="PROJECT_ID" \
    --region="REGION" \
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=PUBSUBTOGCS \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" \
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" \
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" \
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" \
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" \
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" \
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" \
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"
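
For a concrete illustration, here is a minimal invocation of the same command with hypothetical values substituted for the placeholders (my-project, us-central1, my-subscription, and my-output-bucket are invented for this example). The optional flags are omitted, and the remaining template properties fall back to their defaults:

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="my-project" \
    --region="us-central1" \
    --jars="gs://dataproc-templates-binaries/latest/java/dataproc-templates.jar" \
    -- --template=PUBSUBTOGCS \
    --templateProperty pubsubtogcs.input.project.id="my-project" \
    --templateProperty pubsubtogcs.input.subscription="my-subscription" \
    --templateProperty pubsubtogcs.gcs.bucket.name="my-output-bucket" \
    --templateProperty pubsubtogcs.gcs.output.data.format="json"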

Windows (PowerShell)

Note: Ensure you have initialized the Google Cloud CLI with authentication and a project by running either gcloud init; or gcloud auth login and gcloud config set project.
gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --version="1.2" `
    --project="PROJECT_ID" `
    --region="REGION" `
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=PUBSUBTOGCS `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" `
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" `
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" `
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" `
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" `
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" `
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" `
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"

Windows (cmd.exe)

Note: Ensure you have initialized the Google Cloud CLI with authentication and a project by running either gcloud init; or gcloud auth login and gcloud config set project.
gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --version="1.2" ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=PUBSUBTOGCS ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" ^
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" ^
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" ^
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" ^
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" ^
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" ^
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" ^
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • PUBSUB_SUBSCRIPTION_PROJECT_ID: Required. The Google Cloud project ID listed in the IAM Settings that contains the input Pub/Sub subscription to be read.
  • SUBSCRIPTION: Required. Pub/Sub subscription name.
  • CLOUD_STORAGE_OUTPUT_BUCKET_NAME: Required. Cloud Storage bucket name where output will be stored.

    Note: The output files will be stored in the output/ folder inside the bucket.

  • FORMAT: Required. Output data format. Options: avro or json.

    Note: If avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Serverless for Apache Spark jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
  • TIMEOUT: Optional. Time in milliseconds before termination of stream. Defaults to 60000.
  • DURATION: Optional. Frequency in seconds of writes to Cloud Storage. Defaults to 15 seconds.
  • NUM_RECEIVERS: Optional. Number of streams read from a Pub/Sub subscription in parallel. Defaults to 5.
  • BATCHSIZE: Optional. Number of records to insert in one round trip into Cloud Storage. Defaults to 1000.
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:

{  "environmentConfig":{    "executionConfig":{      "subnetworkUri":"SUBNET",      "kmsKey": "KMS_KEY",      "serviceAccount": "SERVICE_ACCOUNT"    }  },  "labels": {    "LABEL": "LABEL_VALUE"  },  "runtimeConfig": {    "version": "1.2",    "properties": {      "PROPERTY": "PROPERTY_VALUE"    }  },  "sparkBatch":{    "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate",    "args":[      "--template","PUBSUBTOGCS",      "--templateProperty","log.level=LOG_LEVEL",      "--templateProperty","pubsubtogcs.input.project.id=PUBSUB_SUBSCRIPTION_PROJECT_ID",      "--templateProperty","pubsubtogcs.input.subscription=SUBSCRIPTION",      "--templateProperty","pubsubtogcs.gcs.bucket.name=CLOUD_STORAGE_OUTPUT_BUCKET_NAME",      "--templateProperty","pubsubtogcs.gcs.output.data.format=FORMAT",      "--templateProperty","pubsubtogcs.timeout.ms=TIMEOUT",      "--templateProperty","pubsubtogcs.streaming.duration.seconds=DURATION",      "--templateProperty","pubsubtogcs.total.receivers=NUM_RECEIVERS",      "--templateProperty","pubsubtogcs.batch.size=BATCHSIZE"    ],    "jarFileUris":[      "file:///usr/lib/spark/connector/spark-avro.jar", "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar"    ]  }}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",  "metadata": {    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",    "createTime": "2023-02-24T03:31:03.440329Z",    "operationType": "BATCH",    "description": "Batch"  }}
