Run Spark jobs with DataprocFileOutputCommitter

The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.

Limitations

The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc on Compute Engine clusters created with the following image versions:

  • 2.1 image versions 2.1.10 and higher

  • 2.0 image versions 2.0.62 and higher
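The version limits above can be checked mechanically before a job is submitted. The following is a minimal sketch, not an official tool: the function name `supports_dataproc_committer` is hypothetical, and it assumes the image version string has the form `MAJOR.MINOR.PATCH` with an optional OS suffix such as `-debian11`. It treats any track other than 2.0 and 2.1 as unsupported, only because this page does not mention other tracks.

```shell
#!/usr/bin/env bash

# Hypothetical helper: exits with status 0 when the given image version
# meets the minimums listed on this page (2.1.10+ or 2.0.62+).
supports_dataproc_committer() {
  local version="${1%%-*}"   # drop an OS suffix such as "-debian11"
  local major minor patch
  IFS=. read -r major minor patch <<< "$version"

  # Reject anything whose patch component is missing or not a plain number.
  [[ "$patch" =~ ^[0-9]+$ ]] || return 1

  case "${major}.${minor}" in
    2.1) (( 10#$patch >= 10 )) ;;   # 2.1 track: 2.1.10 and higher
    2.0) (( 10#$patch >= 62 )) ;;   # 2.0 track: 2.0.62 and higher
    *)   return 1 ;;                # tracks not listed on this page
  esac
}
```

For example, `supports_dataproc_committer "2.1.27-debian11" && echo supported` prints `supported`, while the same call with `2.0.61` prints nothing.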

Use DataprocFileOutputCommitter

To use this feature:

  1. Create a Dataproc on Compute Engine cluster using image version 2.1.10 or higher, or 2.0.62 or higher.

  2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.

    • Google Cloud CLI example:
    gcloud dataproc jobs submit spark \
        --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
        --region=REGION \
        other args ...
    • Code example:
    // Keys set directly on the Hadoop configuration omit the "spark.hadoop." prefix.
    sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class", "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    When you use the Dataproc file output committer, you must set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false to avoid conflicts between success marker files created during concurrent writes. You can also set this property in spark-defaults.conf.
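The two settings can be kept in one place and rendered in either form mentioned above: as lines for spark-defaults.conf, or as the value of the `--properties` flag. The following is a minimal sketch. The helper names `committer_defaults_conf` and `committer_properties_flag` are hypothetical, and the sketch assumes the usual spark-defaults.conf layout of a property name followed by whitespace and a value.

```shell
#!/usr/bin/env bash

# Property names and values copied from step 2 above.
FACTORY_KEY="spark.hadoop.mapreduce.outputcommitter.factory.class"
FACTORY_CLASS="org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
MARKER_KEY="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"

# Hypothetical helper: prints both settings as spark-defaults.conf lines
# (property name, whitespace, value).
committer_defaults_conf() {
  printf '%s %s\n' "$FACTORY_KEY" "$FACTORY_CLASS"
  printf '%s %s\n' "$MARKER_KEY" "false"
}

# Hypothetical helper: prints the same settings as one comma-separated
# value suitable for the --properties flag shown in the CLI example.
committer_properties_flag() {
  printf '%s=%s,%s=%s\n' "$FACTORY_KEY" "$FACTORY_CLASS" "$MARKER_KEY" "false"
}
```

The output of `committer_defaults_conf` could be appended to the cluster's spark-defaults.conf, and `--properties="$(committer_properties_flag)"` could replace the long literal value in the gcloud command. Keeping the names in variables is only a convenience to avoid retyping them; it does not change how the properties behave.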


Last updated 2025-12-15 UTC.