Dataproc best practices for production
This document discusses Dataproc best practices that can help you run reliable, efficient, and insightful data processing jobs on Dataproc clusters in production environments.
Specify cluster image versions
Dataproc uses image versions to bundle operating system, big data components, and Google Cloud connectors into a package that is deployed on a cluster. If you don't specify an image version when creating a cluster, Dataproc defaults to the most recent stable image version.
For production environments, associate your cluster with a specific major.minor Dataproc image version, as shown in the following gcloud CLI command.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=2.0
Dataproc resolves the major.minor version to the latest sub-minor version (2.0 is resolved to 2.0.x). Note: if you need to rely on a specific sub-minor version for your cluster, you can specify it: for example, --image-version=2.0.x. See How versioning works for more information.
Dataproc preview image versions
New minor versions of Dataproc images are available in a preview version prior to release in the standard minor image version track. Use a preview image to test and validate your jobs against a new minor image version prior to adopting the standard minor image version in production. See Dataproc versioning for more information.
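As a sketch of this workflow, you can create a short-lived validation cluster on the preview track and run your existing jobs against it. The cluster name and the preview version string below are illustrative; check the Dataproc version list for the actual preview image name.

```shell
# Sketch: create a test cluster on a preview image track to validate
# jobs before adopting the new minor version in production.
# "2.3-preview-debian12" is a placeholder; use the current preview
# version name from the Dataproc version list.
gcloud dataproc clusters create preview-test-cluster \
    --region=us-central1 \
    --image-version=2.3-preview-debian12
```

Run your production jobs against this cluster, compare results, then delete it when validation is complete.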
Use custom images when necessary
If you have dependencies to add to the cluster, such as native Python libraries, or security hardening or virus protection software, create a custom image from the latest image in your target minor image version track. This practice allows you to meet dependency requirements when you create clusters using your custom image. When you rebuild your custom image to update dependency requirements, use the latest available sub-minor image version within the minor image track.
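A cluster can then be created from the custom image with the `--image` flag, which takes the full image resource path. The project, image, and cluster names below are placeholders; custom images are typically built with the `generate_custom_image.py` tool from the Dataproc custom-images repository.

```shell
# Sketch: create a cluster from a prebuilt custom image
# (project, image, and cluster names are placeholders).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image=projects/my-project/global/images/my-custom-image
```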
Submit jobs to the Dataproc service
Submit jobs to the Dataproc service with a jobs.submit call using the gcloud CLI or the Google Cloud console. Set job and cluster permissions by granting Dataproc roles. Use custom roles to separate cluster access from job submit permissions.
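For example, a Spark job can be submitted through the service rather than by connecting to the cluster directly. The cluster name, bucket, class, and jar path below are placeholders:

```shell
# Sketch: submit a Spark job through the Dataproc jobs API
# instead of SSHing into the cluster (all names are placeholders).
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.example.MyJob \
    --jars=gs://my-bucket/jobs/my-job.jar \
    -- arg1 arg2
```

Because the job goes through the API, its status and driver output are tracked by the service and visible in the console and via `gcloud dataproc jobs list`.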
Benefits of submitting jobs to the Dataproc service:
- No complicated networking settings required - the API is widely reachable
- Easy to manage IAM permissions and roles
- Track job status easily - no Dataproc job metadata to complicate results.
In production, run jobs that only depend on cluster-level dependencies at a fixed minor image version (for example, --image-version=2.0). Bundle dependencies with jobs when the jobs are submitted. Submitting an uber jar to Spark or MapReduce is a common way to do this.
- Example: If a job jar depends on args4j and spark-sql, with args4j specific to the job and spark-sql a cluster-level dependency, bundle args4j in the job's uber jar.
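As a sketch, with a Maven project where job-specific dependencies (such as args4j) use the default scope and cluster-provided ones (such as spark-sql) are marked `provided` in the pom.xml, the build-and-submit flow looks like this. The build tool, class, and file names are assumptions for illustration:

```shell
# Sketch: build an uber jar that bundles only job-specific
# dependencies, then submit it (names are placeholders; assumes
# the Maven Shade plugin is configured in pom.xml).
mvn package

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.example.MyJob \
    --jars=target/my-job-uber.jar
```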
Control initialization action locations
Initialization actions allow you to automatically run scripts or install components when you create a Dataproc cluster (see the dataproc-initialization-actions GitHub repository for common Dataproc initialization actions). When using cluster initialization actions in a production environment, copy initialization scripts to Cloud Storage rather than sourcing them from a public repository. This practice avoids running initialization scripts that are subject to modification by others.
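The copy-then-reference flow can be sketched as follows; the bucket, script, and cluster names are placeholders:

```shell
# Sketch: copy an initialization action into a bucket you control,
# then reference your copy at cluster creation time
# (bucket, script, and cluster names are placeholders).
gcloud storage cp init-script.sh gs://my-bucket/init-actions/init-script.sh

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/init-actions/init-script.sh
```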
Monitor Dataproc release notes
Dataproc regularly releases new sub-minor image versions. View or subscribe to Dataproc release notes to be aware of the latest Dataproc image version releases and other announcements, changes, and fixes.
View the staging bucket to investigate failures
Look at your cluster's staging bucket to investigate cluster and job error messages. Typically, the staging bucket Cloud Storage location is shown in error messages, as in the following sample error message:
ERROR: (gcloud.dataproc.clusters.create) Operation ... failed:
... - Initialization action failed. Failed action ... see output in:
gs://dataproc-<BUCKET_ID>-us-central1/google-cloud-dataproc-metainfo/<CLUSTER_ID>/dataproc-initialization-script-0_output
Use the gcloud CLI to view staging bucket contents:
gcloud storage cat gs://STAGING_BUCKET

Sample output:

+ readonly RANGER_VERSION=1.2.0
...
Ranger admin password not set. Please use metadata flag - default-password
Get support
Google Cloud supports your production OSS workloads and helps you meet your business SLAs through tiers of support. Also, Google Cloud Consulting Services can provide guidance on best practices for your team's production deployments.
For more information
Read the Google Cloud blog Dataproc best practices guide.
View Democratizing Dataproc on YouTube.
Last updated 2025-12-15 UTC.