Compare to Dataproc on Compute Engine

For Google Cloud customers who rely on Apache Spark to run their data processing and analytics workloads, a key decision is choosing between Dataproc on Compute Engine (referred to as "Dataproc" in this document) and Serverless for Apache Spark. Both services offer a managed, highly scalable, production-ready, and secure Spark environment that is OSS-compatible and supports common data formats, but the two platforms differ fundamentally in how the underlying infrastructure is managed and billed.

This document compares Google Cloud Serverless for Apache Spark to Dataproc and lists their features and capabilities to help you choose the best solution for your workloads.

Compare Serverless for Apache Spark to Dataproc

If you want to provision and manage infrastructure, and then execute workloads on Spark and other open source processing frameworks, use Dataproc on Compute Engine. The following table lists key differences between Dataproc on Compute Engine and Serverless for Apache Spark.

| Capability | Serverless for Apache Spark | Dataproc on Compute Engine |
|---|---|---|
| Processing frameworks | Spark (batch workloads and interactive sessions) | Spark, plus other open source frameworks such as Hive, Flink, Trino, and Kafka |
| Serverless | Yes | No |
| Startup time | 50s | 120s |
| Infrastructure control | No | Yes |
| Resource management | Serverless | YARN |
| GPU support | Yes | Yes |
| Interactive sessions | Yes | No |
| Custom containers | Yes | No |
| VM access (SSH) | No | Yes |
| Java versions | Java 17, 21 | Java 17 and previous versions |

Decide on the best Spark service

This section outlines the core strengths and primary use cases for each serviceto help you select the best service for your Spark workloads.

Overview

Dataproc and Serverless for Apache Spark differ in the degree of control, infrastructure management, and billing model that each offers.

  • Dataproc-managed Spark: Dataproc offers Spark-clusters-as-a-service, running managed Spark on your Compute Engine infrastructure. You pay for cluster uptime.
  • Serverless for Apache Spark: Serverless for Apache Spark offers Spark-jobs-as-a-service, running Spark on fully managed Google Cloud infrastructure. You pay for job runtime.
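The two models can be illustrated with the `gcloud` CLI. The following is a minimal sketch; the cluster name, region, and job file are placeholder values, and the commands assume an authenticated project with the Dataproc API enabled.

```shell
# Dataproc on Compute Engine: create a cluster, submit jobs to it,
# and delete it when done. Billing covers the cluster's uptime.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-workers=2
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=demo-cluster \
    --region=us-central1
gcloud dataproc clusters delete demo-cluster --region=us-central1

# Serverless for Apache Spark: submit the job directly as a batch
# workload; no cluster to create or delete. Billing covers job runtime.
gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1
```

In the serverless case, infrastructure provisioning, autoscaling, and teardown all happen behind the single `batches submit` call.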

Due to these differences, each service is best suited to the following use cases:

| Service | Use cases |
|---|---|
| Dataproc | Long-running, shared environments; workloads requiring granular control over infrastructure; migrating legacy Hadoop and Spark environments |
| Serverless for Apache Spark | Dedicated per-job environments; scheduled batch workloads; code management prioritized over infrastructure management |

Key differences

| Feature | Serverless for Apache Spark | Dataproc |
|---|---|---|
| Management model | Fully managed, serverless execution environment. | Cluster-based; you provision and manage clusters. |
| Control and customization | Less infrastructure control; the focus is on submitting code and specifying Spark parameters. | Greater control over cluster configuration, machine types, and software. Can use Spot VMs and reuse reservations and Compute Engine resource capacity. Suitable for workloads that depend on specific VM shapes, such as particular CPU architectures. |
| Use cases | Ad hoc queries, interactive analysis, new Spark pipelines, and workloads with unpredictable resource needs. | Long-running, shared clusters; migrating existing Hadoop and Spark workloads with custom configurations; workloads requiring deep customization. |
| Operational overhead | Lower overhead: Google Cloud manages infrastructure, provisioning, and scaling, enabling a NoOps model. Gemini Cloud Assist makes troubleshooting easier, and Serverless for Apache Spark autotuning helps provide optimal performance. | Higher overhead: requires cluster management, scaling, and maintenance. |
| Efficiency model | No idle compute overhead: resources are allocated only while a job is running, with no startup or shutdown cost. Shared interactive sessions are supported for improved efficiency. | Efficiency is gained by sharing clusters across jobs and teams in a multi-tenancy model. |
| Location control | Workloads are regional at no extra cost, providing greater reliability and availability. | Clusters are zonal. The zone can be auto-selected during cluster creation. |
| Cost | Billed only for the duration of the Spark job execution, not including startup and teardown, based on resources consumed. Billed as Data Compute Units (DCUs) used plus other infrastructure costs. | Billed for the time the cluster is running, including startup and teardown, based on the number of nodes. Includes the Dataproc license fee plus infrastructure cost. |
| Committed use discounts (CUDs) | BigQuery spend-based CUDs apply to Serverless for Apache Spark jobs. | Compute Engine CUDs apply to all resource usage. |
| Image and runtime control | Users can pin minor Serverless for Apache Spark runtime versions; subminor versions are managed by Serverless for Apache Spark. | Users can pin minor and subminor Dataproc image versions. |
| Resource management | Serverless | YARN |
| GPU support | Yes | Yes |
| Interactive sessions | Yes | No |
| Custom containers | Yes | No |
| VM access (SSH) | No | Yes |
| Java versions | Java 17, 21 | Java 17 and previous versions |
| Startup time | 50s | 120s |

When to choose Dataproc

Dataproc is a managed service you can use to run Apache Spark and other open source data processing frameworks. It offers a high degree of control and flexibility, making it the preferred choice in the following scenarios:

  • Migrating existing Hadoop and Spark workloads: Supports migrating on-premises Hadoop or Spark clusters to Google Cloud. Replicate existing configurations with minimal code changes, particularly when using older Spark versions.
  • Deep customization and control: Lets you customize cluster machine types, disk sizes, and network configurations. This level of control is critical for performance tuning and optimizing resource utilization for complex, long-running jobs.
  • Long-running and persistent clusters: Supports continuous, long-running Spark jobs and persistent clusters for multiple teams and projects.
  • Diverse open source ecosystem: Provides a unified environment to run Hadoop ecosystem tools, such as Hive, Pig, or Presto, alongside your Spark workloads.
  • Security compliance: Enables control over infrastructure to meet specific security or compliance standards, such as safeguarding personally identifiable information (PII) or protected health information (PHI).
  • Infrastructure flexibility: Offers Spot VMs and the ability to reuse reservations and Compute Engine resource capacity to balance resource use and facilitate your cloud infrastructure strategy.
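The customization and infrastructure-flexibility points above can be sketched as a single cluster creation command. This is an illustrative example, not a recommended configuration: the cluster name, region, machine types, and sizes are placeholder values you would tune for your workload.

```shell
# Create a customized Dataproc cluster: pinned image version, specific
# machine shapes, larger boot disks, and Spot VMs as secondary workers
# to reduce cost for fault-tolerant work.
gcloud dataproc clusters create analytics-cluster \
    --region=us-central1 \
    --image-version=2.2-debian12 \
    --master-machine-type=n2-standard-8 \
    --worker-machine-type=n2-highmem-16 \
    --num-workers=4 \
    --worker-boot-disk-size=500GB \
    --num-secondary-workers=8 \
    --secondary-worker-type=spot
```

None of these knobs are available with Serverless for Apache Spark, which is exactly the trade-off this section describes: you take on cluster management in exchange for control over the underlying VMs.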

When to choose Serverless for Apache Spark

Serverless for Apache Spark abstracts away the complexities of cluster management, allowing you to focus on Spark code. This makes it an excellent choice for the following data processing scenarios:

  • Ad-hoc and interactive analysis: For data scientists and analysts who run interactive queries and exploratory analysis using Spark, the serverless model provides a quick way to get started without focusing on infrastructure.
  • Spark-based applications and pipelines: When building new data pipelines or applications on Spark, Serverless for Apache Spark can significantly accelerate development by removing the operational overhead of cluster management.
  • Workloads with sporadic or unpredictable demand: For intermittent Spark jobs or jobs with fluctuating resource requirements, Serverless for Apache Spark autoscaling and pay-per-use pricing (charges apply to job resource consumption) can significantly reduce costs.
  • Developer productivity focus: By eliminating the need for cluster provisioning and management, Serverless for Apache Spark speeds the creation of business logic, provides faster insights, and increases productivity.
  • Simplified operations and reduced overhead: Serverless for Apache Spark infrastructure management reduces operational burdens and costs.
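In practice, the serverless scenarios above reduce to submitting a batch workload with, at most, a runtime version and some Spark properties. The following is a minimal sketch; the bucket path, region, and property values are placeholders, and the command assumes an authenticated project with the Dataproc API enabled.

```shell
# Submit a PySpark job as a Serverless for Apache Spark batch workload.
# Autoscaling handles executor count; properties tune the starting shape.
gcloud dataproc batches submit pyspark gs://my-bucket/etl_job.py \
    --region=us-central1 \
    --version=2.2 \
    --properties=spark.executor.memory=8g
```

There is no cluster to size, patch, or delete afterward; once the job finishes, billing stops.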

Summing up

The decision whether to use Dataproc or Serverless for Apache Spark depends on your workload requirements, operational preferences, and preferred level of control.

  • Choose Dataproc when you need maximum control, need to migrate Hadoop or Spark workloads, or require a persistent, customized, shared cluster environment.
  • Choose Serverless for Apache Spark for its ease of use, cost-efficiency for intermittent workloads, and its ability to accelerate development for new Spark applications by removing the overhead of infrastructure management.

After evaluating the factors listed in this section, select the most efficient and cost-effective service to run Spark to unlock the full potential of your data.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.