Overview of Dataproc Workflow Templates
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs.
Key Points:
- Instantiating a Workflow Template launches a Workflow. A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster.
- If the workflow uses a managed cluster, it creates the cluster, runs the jobs, and then deletes the cluster when the jobs are finished.
- If the workflow uses a cluster selector, it runs jobs on a selected existing cluster.
- Workflows are ideal for complex job flows. You can create job dependencies so that a job starts only after its dependencies complete successfully.
- When you create a workflow template, Dataproc does not create a cluster or submit jobs to a cluster. Dataproc creates or selects a cluster and runs workflow jobs on the cluster when a workflow template is instantiated.
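The points above can be sketched as a template YAML file. This is a hypothetical example, not a template from this page: the bucket paths, cluster name, job files, and zone are placeholders.

```yaml
# Hypothetical workflow template: a managed cluster runs two jobs,
# where the second job starts only after the first succeeds.
jobs:
  - stepId: prepare-data
    pysparkJob:
      mainPythonFileUri: gs://my-bucket/prepare.py   # placeholder path
  - stepId: aggregate
    prerequisiteStepIds:
      - prepare-data                                 # DAG dependency
    sparkJob:
      mainClass: org.example.Aggregate               # placeholder class
      jarFileUris:
        - gs://my-bucket/aggregate.jar
placement:
  managedCluster:
    clusterName: ephemeral-cluster
    config:
      gceClusterConfig:
        zoneUri: us-central1-b
```

Instantiating this template (for example with `gcloud dataproc workflow-templates instantiate-from-file --file=template.yaml --region=us-central1`) creates the cluster, runs the job DAG in dependency order, and deletes the cluster when the jobs finish.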
Kinds of Workflow Templates
Managed cluster
A workflow template can specify a managed cluster. The workflow will create an "ephemeral" cluster to run workflow jobs, and then delete the cluster when the workflow is finished.
Cluster selector
A workflow template can specify an existing cluster on which to run workflow jobs by specifying one or more user labels previously attached to the cluster. The workflow will run on a cluster that matches all of the labels. If multiple clusters match all labels, Dataproc selects the cluster with the most YARN available memory to run all workflow jobs. At the end of the workflow, Dataproc does not delete the selected cluster. See Use cluster selectors with workflows for more information.
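A cluster-selector placement can be sketched as follows; the label keys and values here are placeholders, not labels defined on this page.

```yaml
# Hypothetical placement section: run the workflow's jobs on an existing
# cluster whose labels match ALL of the entries below.
placement:
  clusterSelector:
    clusterLabels:
      env: production       # a user label previously attached to the cluster
      team: analytics
```

If several clusters carry both labels, Dataproc picks the one with the most available YARN memory, and the selected cluster is left running when the workflow ends.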
A workflow can select a specific cluster by matching the goog-dataproc-cluster-name label (see Using Automatically Applied Labels).
Parameterized
If you will run a workflow template multiple times with different values, use parameters to avoid editing the workflow template for each run:
- define parameters in the template, then
- pass different values for the parameters for each run.
See Parameterization of Workflow Templates for more information.
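A parameterized template might look like this sketch; the parameter name, step ID, and field path are illustrative assumptions, not values from this page.

```yaml
# Hypothetical parameters section: INPUT_FILE substitutes its value into
# the first argument of the job whose stepId is "prepare-data".
parameters:
  - name: INPUT_FILE
    fields:
      - jobs['prepare-data'].pysparkJob.args[0]
```

Each run can then pass a different value, for example `gcloud dataproc workflow-templates instantiate my-template --region=us-central1 --parameters=INPUT_FILE=gs://my-bucket/day-2.csv`.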
Inline
Workflows can be instantiated inline using the gcloud command with workflow template YAML files or by calling the Dataproc InstantiateInline API (see Using inline Dataproc workflows). Inline workflows do not create or modify workflow template resources.
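An inline run with gcloud might look like the following sketch; the file name and region are placeholders.

```shell
# Hypothetical inline instantiation: the YAML file is read and the workflow
# runs immediately, but no workflow-template resource is stored in the project.
gcloud dataproc workflow-templates instantiate-from-file \
    --file=template.yaml \
    --region=us-central1
```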
Workflow Template use cases
Automation of repetitive tasks. Workflows encapsulate frequently used cluster configurations and jobs.
Transactional fire-and-forget API interaction model. Workflow Templates replace the steps involved in a typical flow, which include:
- creating the cluster
- submitting jobs
- polling
- deleting the cluster
Workflow Templates use a single token to track progress from cluster creation to deletion, and automate error handling and recovery. They also simplify the integration of Dataproc with other tools, such as Cloud Run functions and Cloud Composer.
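The fire-and-forget flow above can be sketched with the CLI, assuming a template named my-template: instantiating asynchronously returns a single operation, and that one operation is then polled instead of tracking cluster creation, jobs, and deletion separately.

```shell
# Start the workflow without blocking; the command prints an operation ID.
gcloud dataproc workflow-templates instantiate my-template \
    --region=us-central1 --async

# Poll that single operation to track cluster creation, job execution,
# and cluster deletion (substitute the printed operation ID).
gcloud dataproc operations describe OPERATION_ID --region=us-central1
```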
Support for ephemeral and long-lived clusters. A common complexity associated with running Apache Hadoop is tuning and right-sizing clusters. Ephemeral (managed) clusters are easier to configure since they run a single workload. Cluster selectors can be used with longer-lived clusters to repeatedly execute the same workload without incurring the amortized cost of creating and deleting clusters.
Granular IAM security. Creating Dataproc clusters and submitting jobs require all-or-nothing IAM permissions. Workflow Templates use a per-template workflowTemplates.instantiate permission, and do not depend on cluster or job permissions.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.