Dataproc best practices for production
This document discusses Dataproc best practices that can help you run reliable, efficient, and insightful data processing jobs on Dataproc clusters in production environments.
Specify cluster image versions
Dataproc uses image versions to bundle operating system, big data components, and Google Cloud connectors into a package that is deployed on a cluster. If you don't specify an image version when creating a cluster, Dataproc defaults to the most recent stable image version.
For production environments, associate your cluster with a specific major.minor Dataproc image version, as shown in the following gcloud CLI command.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=2.0
Dataproc resolves the major.minor version to the latest sub-minor version (2.0 is resolved to 2.0.x). Note: if you need to rely on a specific sub-minor version for your cluster, you can specify it: for example, --image-version=2.0.x. See How versioning works for more information.
Dataproc preview image versions
New minor versions of Dataproc images are available in a preview version prior to release in the standard minor image version track. Use a preview image to test and validate your jobs against a new minor image version prior to adopting the standard minor image version in production. See Dataproc versioning for more information.
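As a sketch of this workflow, you can create a short-lived validation cluster on the preview track and run your existing jobs against it. The cluster name and the preview version string below are illustrative; check the Dataproc version list for the actual preview image name.

```shell
# Sketch: create a test cluster on a preview image track to validate
# jobs before adopting the new minor version in production.
# "2.3-preview-debian12" is a placeholder; use the current preview
# version name from the Dataproc version list.
gcloud dataproc clusters create preview-test-cluster \
    --region=us-central1 \
    --image-version=2.3-preview-debian12
```

Run your production jobs against this cluster, compare results, then delete it when validation is complete.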
Use custom images when necessary
If you have dependencies to add to the cluster, such as native Python libraries, or security hardening or virus protection software, create a custom image from the latest image in your target minor image version track. This practice allows you to meet dependency requirements when you create clusters using your custom image. When you rebuild your custom image to update dependency requirements, use the latest available sub-minor image version within the minor image track.
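A cluster can then be created from the custom image with the `--image` flag, which takes the full image resource path. The project, image, and cluster names below are placeholders; custom images are typically built with the `generate_custom_image.py` tool from the Dataproc custom-images repository.

```shell
# Sketch: create a cluster from a prebuilt custom image
# (project, image, and cluster names are placeholders).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image=projects/my-project/global/images/my-custom-image
```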
Submit jobs to the Dataproc service
Submit jobs to the Dataproc service with a jobs.submit call using the gcloud CLI or the Google Cloud console. Set job and cluster permissions by granting Dataproc roles. Use custom roles to separate cluster access from job submit permissions.
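For example, a Spark job can be submitted through the service rather than by connecting to the cluster directly. The cluster name, bucket, class, and jar path below are placeholders:

```shell
# Sketch: submit a Spark job through the Dataproc jobs API
# instead of SSHing into the cluster (all names are placeholders).
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.example.MyJob \
    --jars=gs://my-bucket/jobs/my-job.jar \
    -- arg1 arg2
```

Because the job goes through the API, its status and driver output are tracked by the service and visible in the console and via `gcloud dataproc jobs list`.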
Benefits of submitting jobs to the Dataproc service:
- No complicated networking settings required - the API is widely reachable
- Easy to manage IAM permissions and roles
- Track job status easily - no Dataproc job metadata to complicate results.
In production, run jobs that only depend on cluster-level dependencies at a fixed minor image version (for example, --image-version=2.0). Bundle dependencies with jobs when the jobs are submitted. Submitting an uber jar to Spark or MapReduce is a common way to do this.
- Example: If a job jar depends on args4j and spark-sql, with args4j specific to the job and spark-sql a cluster-level dependency, bundle args4j in the job's uber jar.
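As a sketch, with a Maven project where job-specific dependencies (such as args4j) use the default scope and cluster-provided ones (such as spark-sql) are marked `provided` in the pom.xml, the build-and-submit flow looks like this. The build tool, class, and file names are assumptions for illustration:

```shell
# Sketch: build an uber jar that bundles only job-specific
# dependencies, then submit it (names are placeholders; assumes
# the Maven Shade plugin is configured in pom.xml).
mvn package

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.example.MyJob \
    --jars=target/my-job-uber.jar
```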
Control initialization action locations
Initialization actions allow you to automatically run scripts or install components when you create a Dataproc cluster (see the dataproc-initialization-actions GitHub repository for common Dataproc initialization actions). When using cluster initialization actions in a production environment, copy initialization scripts to Cloud Storage rather than sourcing them from a public repository. This practice avoids running initialization scripts that are subject to modification by others.
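The copy-then-reference flow can be sketched as follows; the bucket, script, and cluster names are placeholders:

```shell
# Sketch: copy an initialization action into a bucket you control,
# then reference your copy at cluster creation time
# (bucket, script, and cluster names are placeholders).
gcloud storage cp init-script.sh gs://my-bucket/init-actions/init-script.sh

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/init-actions/init-script.sh
```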
Monitor Dataproc release notes
Dataproc regularly releases new sub-minor image versions. View or subscribe to Dataproc release notes to be aware of the latest Dataproc image version releases and other announcements, changes, and fixes.
View the staging bucket to investigate failures
Look at your cluster's staging bucket to investigate cluster and job error messages. Typically, the staging bucket Cloud Storage location is shown in error messages, as in the following sample error message:
ERROR: (gcloud.dataproc.clusters.create) Operation ... failed:
... - Initialization action failed. Failed action ... see output in:
gs://dataproc-<BUCKET_ID>-us-central1/google-cloud-dataproc-metainfo/<CLUSTER_ID>/dataproc-initialization-script-0_output
Use the gcloud CLI to view staging bucket contents:
gcloud storage cat gs://STAGING_BUCKET

Sample output:

+ readonly RANGER_VERSION=1.2.0
...
Ranger admin password not set. Please use metadata flag - default-password
Get support
Google Cloud supports your production OSS workloads and helps you meet your business SLAs through tiers of support. Also, Google Cloud Consulting Services can provide guidance on best practices for your team's production deployments.
For more information
Read the Google Cloud blog Dataproc best practices guide.
View Democratizing Dataproc on YouTube.
Last updated 2025-12-15 UTC.