Use workflows

You set up and run a workflow by:

  1. Creating a workflow template
  2. Configuring a managed (ephemeral) cluster or selecting an existing cluster
  3. Adding jobs
  4. Instantiating the template to run the workflow
You can parameterize a workflow template to use it dynamically for different workflows. You can also use YAML files or call the InstantiateInline API to define and run an inline workflow that does not create or modify workflow template resources.
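As an illustration of the inline form, the request body for workflowTemplates.instantiateInline carries the same structure as a stored template. The following sketch builds such a body as a plain dictionary; field names follow the WorkflowTemplate REST resource, while the cluster name, machine types, jar path, and args are placeholder values, not anything prescribed by the docs:

```python
# Illustrative inline workflow body for a workflowTemplates.instantiateInline
# request. Field names follow the WorkflowTemplate REST resource; the cluster
# name, machine types, and jar path are placeholder values.
import json

inline_workflow = {
    "placement": {
        "managedCluster": {
            "clusterName": "ephemeral-cluster",  # placeholder name
            "config": {
                "masterConfig": {"machineTypeUri": "n1-standard-4"},
                "workerConfig": {"numInstances": 2,
                                 "machineTypeUri": "n1-standard-4"},
            },
        }
    },
    "jobs": [
        {
            "stepId": "step-1",
            "hadoopJob": {
                "mainJarFileUri": "gs://my-bucket/my-job.jar",  # placeholder
                "args": ["arg1", "arg2"],
            },
        }
    ],
}

# The body must serialize cleanly to JSON for the REST request.
print(json.dumps(inline_workflow, indent=2))
```

Because the template is supplied inline, running it creates no workflowTemplates resource to manage or clean up afterward.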

Create a template

gcloud CLI

Run the following command to create a Dataproc workflow template resource.

gcloud dataproc workflow-templates create TEMPLATE_ID \
    --region=REGION

Notes:

  • REGION: Specify the region where your template will run.
  • TEMPLATE_ID: Provide an ID for your template, such as "workflow-template-1".
  • CMEK encryption: You can add the --kms-key flag to use CMEK encryption on workflow template job arguments.

REST API

Submit a WorkflowTemplate as part of a workflowTemplates.create request. You can add the WorkflowTemplate.EncryptionConfig.kmsKey field to use CMEK encryption on workflow template job arguments.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Configure or select a cluster

Dataproc can create and use a new, "managed" cluster for your workflow or an existing cluster.

  • Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.

  • Managed cluster: You must configure a managed cluster for your workflow. Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.

    You can configure a managed cluster for your workflow using the gcloud command-line tool or the Dataproc API.

    Google Cloud CLI

    Use flags inherited from gcloud dataproc clusters create to configure the managed cluster, such as the number of workers and the master and worker machine type. Dataproc will add a suffix to the cluster name to ensure uniqueness. You can use the --service-account flag to specify a VM service account for the managed cluster.

    gcloud dataproc workflow-templates set-managed-cluster TEMPLATE_ID \
        --region=REGION \
        --master-machine-type=MACHINE_TYPE \
        --worker-machine-type=MACHINE_TYPE \
        --num-workers=NUMBER \
        --cluster-name=CLUSTER_NAME \
        --service-account=SERVICE_ACCOUNT

    REST API

    See WorkflowTemplatePlacement.ManagedCluster, which you can provide as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.

    You can use the GceClusterConfig.serviceAccount field to specify a VM service account for the managed cluster.

    Console

    You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Add jobs to a template

All jobs run concurrently unless you specify one or more job dependencies. A job's dependencies are expressed as a list of other jobs that must finish successfully before the dependent job can start. You must provide a step-id for each job. The ID must be unique within the workflow, but does not need to be unique globally.
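The dependency rules above can be pictured as a small model: any job whose prerequisites have all finished is ready, and all ready jobs form a "wave" that can run concurrently. This is an illustrative sketch of the semantics, not Dataproc's actual scheduler; the execution_waves helper and step-ids are hypothetical:

```python
# Toy model of workflow job ordering: jobs whose prerequisiteStepIds are all
# satisfied by earlier waves can run concurrently. Illustrative only; this is
# not part of any Dataproc API.

def execution_waves(jobs):
    """Group jobs into waves of steps that can run concurrently."""
    remaining = {j["stepId"]: set(j.get("prerequisiteStepIds", []))
                 for j in jobs}
    waves, done = [], set()
    while remaining:
        # A step is ready once every prerequisite has completed.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle in job dependencies")
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves

jobs = [
    {"stepId": "foo"},
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "prerequisiteStepIds": ["foo", "bar"]},
]
print(execution_waves(jobs))  # [['foo'], ['bar'], ['baz']]
```

With no prerequisiteStepIds at all, every job lands in the first wave, which matches the "all jobs run concurrently" default.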

gcloud CLI

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the --start-after=JOB_ID flag, where JOB_ID is the step-id of another workflow job, to have the job start after the completion of one or more other jobs in the workflow.

The --max-failures-per-hour and --max-failures-total restartable job flags are not supported in Dataproc workflow template jobs.

Examples:

Add Hadoop job "foo" to the "my-workflow" template.

gcloud dataproc workflow-templates add-job hadoop \
    --region=REGION \
    --step-id=foo \
    --workflow-template=my-workflow \
    -- space separated job args

Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.

gcloud dataproc workflow-templates add-job JOB_TYPE \
    --region=REGION \
    --step-id=bar \
    --start-after=foo \
    --workflow-template=my-workflow \
    -- space separated job args

Add another job "baz" to "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.

gcloud dataproc workflow-templates add-job JOB_TYPE \
    --region=REGION \
    --step-id=baz \
    --start-after=foo,bar \
    --workflow-template=my-workflow \
    -- space separated job args

REST API

See WorkflowTemplate.OrderedJob. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
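For illustration, the "baz" gcloud example above corresponds to an OrderedJob entry like the following sketch; field names follow the WorkflowTemplate REST resource, while the jar path and args are placeholders:

```python
# Illustrative OrderedJob entry for a WorkflowTemplate "jobs" list, mirroring
# the gcloud add-job "baz" example: start only after "foo" and "bar" succeed.
# The jar path and args are placeholders.
import json

ordered_job = {
    "stepId": "baz",
    "prerequisiteStepIds": ["foo", "bar"],
    "hadoopJob": {
        "mainJarFileUri": "gs://my-bucket/my-job.jar",  # placeholder
        "args": ["arg1", "arg2"],
    },
}
print(json.dumps(ordered_job, indent=2))
```

The --start-after flag in the gcloud CLI maps directly to the prerequisiteStepIds list in the REST body.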

The maxFailuresPerHour and maxFailuresTotal OrderedJob.JobScheduling fields are not supported in Dataproc workflow template jobs.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

The Max restarts per hour restartable job option is not supported in Dataproc workflow template jobs.

Run a workflow

The instantiation of a workflow template runs the workflow defined by the template. Multiple instantiations of a template are supported; you can run a workflow multiple times.

gcloud CLI

gcloud dataproc workflow-templates instantiate TEMPLATE_ID \
    --region=REGION

The command returns an operation ID, which you can use to track workflow status.

Example command and output:

gcloud beta dataproc workflow-templates instantiate my-template-id \
    --region=us-central1
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
Deleted cluster: my-template-id-rg544az7mpbfa.
WorkflowTemplate [my-template-id] DONE

REST API

See workflowTemplates.instantiate.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Workflow job failures

A failure in any job in a workflow causes the workflow to fail. Dataproc mitigates the effect of failures by causing all concurrently executing jobs to fail and preventing subsequent jobs from starting.
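This fail-fast behavior can be sketched with a toy model that marks every transitive dependent of a failed step as unable to start. The downstream_of helper and step-ids are hypothetical, not part of any Dataproc API:

```python
# Toy model of fail-fast propagation: once a step fails, no step that depends
# on it (directly or transitively) can start. Illustrative only.

def downstream_of(failed, jobs):
    """Return step-ids that can no longer start because `failed` did not succeed."""
    deps_by_step = {j["stepId"]: set(j.get("prerequisiteStepIds", []))
                    for j in jobs}
    blocked = {failed}
    changed = True
    while changed:
        changed = False
        for step, deps in deps_by_step.items():
            # A step is blocked if any of its prerequisites is blocked.
            if step not in blocked and deps & blocked:
                blocked.add(step)
                changed = True
    return sorted(blocked - {failed})

jobs = [
    {"stepId": "foo"},
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "prerequisiteStepIds": ["foo", "bar"]},
]
print(downstream_of("foo", jobs))  # ['bar', 'baz']
```

Jobs already running concurrently with the failed job are also failed by Dataproc, which this dependency-only model does not capture.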

Monitor and list a workflow

gcloud CLI

To monitor a workflow:

gcloud dataproc operations describe OPERATION_ID \
    --region=REGION

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Run a workflow).

To list workflow status:

gcloud dataproc operations list \
    --region=REGION \
    --filter="labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

REST API

To monitor a workflow, use the Dataproc operations.get API.

To list running workflows, use the Dataproc operations.list API with a label filter.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Terminate a workflow

You can end a workflow using the Google Cloud CLI or by calling the Dataproc API.

Note: Ending a workflow cancels running workflow jobs and, if the workflow runs on a managed (ephemeral) cluster, deletes the managed cluster.

Update a workflow template

Updates don't affect running workflows. The new template version will only apply to new workflows.

gcloud CLI

Workflow templates can be updated by issuing new gcloud dataproc workflow-templates commands, such as the add-job and set-managed-cluster commands shown earlier, that reference an existing workflow template ID and apply changes to the existing template.

REST API

To make an update to a template with the REST API:

  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
  2. Make updates to the fetched template.
  3. Call workflowTemplates.update with the updated template.

As a guard against concurrent modifications, a request to update a template must specify the current server version in the workflowTemplate.version field.
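The version guard is classic optimistic concurrency control: an update only succeeds if no one else changed the template since you fetched it. A minimal sketch, using an in-memory stand-in for the template service (TemplateStore and its methods are hypothetical, not a Dataproc API):

```python
# Sketch of the get-modify-update cycle with a version guard. TemplateStore is
# an in-memory stand-in for the server, illustrative only.

class TemplateStore:
    """Mimics the server-side version check on an update request."""
    def __init__(self, template):
        self._template = dict(template, version=1)

    def get(self):
        # Returns the template with the version field filled in.
        return dict(self._template)

    def update(self, template):
        # Reject updates carrying a stale version number.
        if template["version"] != self._template["version"]:
            raise RuntimeError("version mismatch: template was modified concurrently")
        self._template = dict(template, version=self._template["version"] + 1)
        return self.get()

store = TemplateStore({"id": "my-template"})
t = store.get()                  # 1. fetch current template (version 1)
t["labels"] = {"env": "prod"}    # 2. modify the fetched copy
updated = store.update(t)        # 3. update succeeds; server bumps the version
stale = t                        # a second writer still holding version 1...
try:
    store.update(stale)
except RuntimeError as e:
    print(e)                     # ...is rejected
```

The lost-update a second writer would otherwise cause is turned into an explicit error, so the caller can re-fetch, re-apply its change, and retry.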

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Delete a workflow template

gcloud CLI

gcloud dataproc workflow-templates delete TEMPLATE_ID \
    --region=REGION


REST API

See workflowTemplates.delete.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.