Dataproc monitoring and troubleshooting tools
Dataproc is a fully managed and highly scalable service for running open-source distributed processing platforms such as Apache Hadoop, Apache Spark, Apache Flink, and Trino. You can use the tools and files discussed in the following sections to investigate, troubleshoot, and monitor your Dataproc clusters and jobs.
AI-powered Investigations with Gemini Cloud Assist (Preview)
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Overview
The Gemini Cloud Assist Investigations preview feature uses Gemini's advanced capabilities to assist in creating and running Dataproc clusters and jobs. This feature analyzes failed clusters and failed and slow-running jobs to identify root causes and recommend fixes. It creates persistent analyses that you can review, save, and share with Google Cloud support to facilitate collaboration and accelerate issue resolution.
Features
Use this feature to create investigations from the Google Cloud console:
- Add a natural language context description to an issue before creating an investigation.
- Analyze failed clusters and slow and failed jobs.
- Get insights into issue root causes with recommended fixes.
- Create Google Cloud support cases with the full investigation context attached.
Before you begin
To get started using the Investigations feature, in your Google Cloud project, enable the Gemini Cloud Assist API.
Create an investigation
To create an investigation, do the following:
In the Google Cloud console, go to the Cloud Assist Investigations page.
Click Create.
Describe the issue: Provide a description of the cluster or job issue.
Select time range: Provide a time range when the issue occurred (default is 30 minutes).
Select resources:
- Click Add resource.
- In the Quick filters field, type "dataproc", and then select one or more of dataproc.Batch, dataproc.Job, or dataproc.Cluster as filters. You can also filter by Location.
- Select the listed batch, job, or cluster to investigate. You can add multiple resources that are affected by the issue.
Click Create.
Interpret investigation results
Once an investigation is complete, the Investigation details page opens. This page contains the full Gemini analysis, which is organized into the following sections:
- Issue: A collapsed section containing auto-populated details of the job being investigated.
- Relevant Observations: A collapsed section that lists key data points and anomalies that Gemini found during its analysis of logs and metrics.
- Hypotheses: This is the primary section, which is expanded by default. It presents a list of potential root causes for the observed issue. Each hypothesis includes:
  - Overview: A description of the possible cause, such as "High Shuffle Write Time and Potential Task Skew."
  - Recommended Fixes: A list of actionable steps to address the potential issue.
Take action
After reviewing the hypotheses and recommendations:
Apply one or more of the suggested fixes to the job configuration or code, and then rerun the job.
Provide feedback on the helpfulness of the investigation by clicking the thumbs-up or thumbs-down icons at the top of the panel.
Review and escalate investigations
To review the results of a previously run investigation, click the investigation name on the Cloud Assist Investigations page to open the Investigation details page.
If further assistance is needed, you can open a Google Cloud support case. This process provides the support engineer with the complete context of the previously performed investigation, including the observations and hypotheses generated by Gemini. This context sharing significantly reduces the back-and-forth communication required with the support team and leads to faster case resolution.
To create a support case from an investigation:
In the Investigation details page, click Request support.
Note: You can click Open chat to start a conversation and ask clarifying questions about investigation results before opening a support case.
Preview status and pricing
There is no charge for Gemini Cloud Assist investigations during public preview. Charges will apply to the feature when it becomes generally available (GA).
For more information about pricing after general availability, see Gemini Cloud Assist Pricing.
Open source web interfaces
Many Dataproc cluster open source components, such as Apache Hadoop and Apache Spark, provide web interfaces. These interfaces can be used to monitor cluster resources and job performance. For example, you can use the YARN Resource Manager UI to view YARN application resource allocation on a Dataproc cluster.
To enable access to component web interfaces available on a cluster, enable the Dataproc Component Gateway when you create the cluster.
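As a reference, the following minimal sketch enables the Component Gateway at cluster-creation time with the google-cloud-dataproc Python client; the project ID, region, and cluster name are placeholder assumptions.

```python
# Minimal sketch: create a Dataproc cluster with the Component Gateway
# enabled, using the google-cloud-dataproc Python client.
# The project ID, region, and cluster name are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"  # hypothetical project
region = "us-central1"     # hypothetical region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # enable_http_port_access turns on the Component Gateway, which
        # exposes component web UIs such as the YARN Resource Manager.
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```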
Persistent History Server
Open source web interfaces running on a cluster are available while the cluster is running, but they terminate when you delete the cluster. To view cluster and job data after a cluster is deleted, you can create a Persistent History Server (PHS).
Example: You encounter a job error or slowdown that you want to analyze. You stop or delete the job cluster, then view and analyze job history data using your PHS.
After you create a PHS, you enable it on a Dataproc cluster or Google Cloud Serverless for Apache Spark batch workload when you create the cluster or submit the batch workload. A PHS can access history data for jobs run on multiple clusters, letting you monitor jobs across a project instead of monitoring separate UIs running on different clusters.
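For illustration, here's a hedged sketch of one way to point a job cluster's Spark history files at the Cloud Storage bucket a PHS reads from; the bucket name is hypothetical, and the spark:spark.eventLog.dir and spark:spark.history.fs.logDirectory property values follow the patterns described in the PHS documentation, so verify them against that page.

```python
# Hedged sketch: write a job cluster's Spark event logs to a Cloud Storage
# bucket that a Persistent History Server reads. The bucket name is
# hypothetical; verify property names against the PHS documentation.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # placeholders
phs_bucket = "gs://my-phs-bucket"                 # hypothetical bucket

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "job-cluster",
    "config": {
        "software_config": {
            "properties": {
                # Write Spark event logs where the PHS expects to find them.
                "spark:spark.eventLog.dir": f"{phs_bucket}/job-cluster/spark-job-history",
                "spark:spark.history.fs.logDirectory": f"{phs_bucket}/*/spark-job-history",
            }
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```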
Dataproc logs
Dataproc collects the logs generated by Apache Hadoop, Spark, Hive, ZooKeeper, and other open source systems running on your clusters, and sends them to Logging. These logs are grouped based on the source of the logs, which lets you select and view the logs of interest to you: for example, YARN NodeManager and Spark executor logs generated on a cluster are labeled separately. See Dataproc logs for more information on Dataproc log contents and options.
Cloud Logging
Logging is a fully managed, real-time log management system. It provides storage for logs ingested from Google Cloud services and tools to search, filter, and analyze logs at scale. Dataproc clusters generate multiple logs, including Dataproc service agent logs, cluster startup logs, and OSS component logs, such as YARN NodeManager logs.
Logging is enabled by default on Dataproc clusters and Serverless for Apache Spark batch workloads. Logs are periodically exported to Logging, where they persist after the cluster is deleted or the workload is completed.
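For example, the following sketch reads recent log entries for a Dataproc cluster with the google-cloud-logging Python client; the project, cluster name, and log-name substring are illustrative assumptions (see the Dataproc logs page for exact log names).

```python
# Illustrative sketch: list recent log entries for a Dataproc cluster
# with the google-cloud-logging client. The project, cluster name, and
# log-name substring are assumptions; see Dataproc logs for exact names.
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project

log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'resource.labels.cluster_name="example-cluster" '
    'logName:"yarn"'  # substring match on a YARN log name (assumed)
)

for entry in client.list_entries(filter_=log_filter, max_results=10):
    print(entry.timestamp, entry.payload)
```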
Dataproc metrics
Dataproc cluster and job metrics, prefixed with dataproc.googleapis.com/, consist of time-series data that provide insights into the performance of a cluster, such as CPU utilization or job status. Dataproc custom metrics, prefixed with custom.googleapis.com/, include metrics emitted by open source systems running on the cluster, such as the YARN running applications metric. Gaining insight into Dataproc metrics can help you configure your clusters efficiently. Setting up metric-based alerts can help you recognize and respond to problems quickly.
Dataproc cluster and job metrics are collected by default without charge. The collection of custom metrics is charged to customers. You can enable the collection of custom metrics when you create a cluster. The collection of Spark metrics is enabled by default on Serverless for Apache Spark batch workloads.
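The snippet below is a hedged sketch of opting in to custom metric collection at cluster-creation time by setting dataproc_metric_config; the project, region, and chosen metric sources are assumptions.

```python
# Hedged sketch: enable custom (OSS) metric collection when creating a
# cluster by setting dataproc_metric_config. The project, region, and
# chosen metric sources are illustrative.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "metrics-cluster",
    "config": {
        "dataproc_metric_config": {
            "metrics": [
                # Collect YARN and Spark metrics emitted on the cluster.
                {"metric_source": dataproc_v1.DataprocMetricConfig.MetricSource.YARN},
                {"metric_source": dataproc_v1.DataprocMetricConfig.MetricSource.SPARK},
            ]
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```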
Cloud Monitoring
Monitoring uses cluster metadata and metrics, including HDFS, YARN, job, and operation metrics, to provide visibility into the health, performance, and availability of Dataproc clusters and jobs. You can use Monitoring to explore metrics, add charts, build dashboards, and create alerts.
Metrics Explorer
You can use the Metrics Explorer to view Dataproc metrics. Dataproc cluster, job, and Serverless for Apache Spark batch metrics are listed under the Cloud Dataproc Cluster, Cloud Dataproc Job, and Cloud Dataproc Batch resources. Dataproc custom metrics are listed under the VM Instances resource, Custom category.
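The same metric data can also be read programmatically. The following sketch queries one documented Dataproc cluster metric over the last hour with the google-cloud-monitoring Python client; the project and metric type are assumptions to adapt to your case.

```python
# Illustrative sketch: read a Dataproc cluster metric (the same data that
# Metrics Explorer charts) with the Cloud Monitoring API. The project and
# metric type are assumptions; check the Dataproc metrics list.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "dataproc.googleapis.com/cluster/hdfs/storage_utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(series.resource.labels.get("cluster_name"), point.value)
```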
Charts
You can use Metrics Explorer to create charts that visualize Dataproc metrics.
Example: You create a chart to see the number of active YARN applications running on your clusters, and then add a filter to select visualized metrics by cluster name or region.
Dashboards
You can build dashboards to monitor Dataproc clusters and jobs using metrics from multiple projects and different Google Cloud products. You can build dashboards in the Google Cloud console from the Dashboards Overview page by creating and then saving a chart from the Metrics Explorer page.
Alerts
You can create Dataproc metric alerts to receive timely notice of cluster or job issues.
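As one hedged example, the sketch below creates an alert policy on a Dataproc job metric with the google-cloud-monitoring Python client; the metric, threshold, and duration are illustrative, and in practice you would also attach a notification channel.

```python
# Hedged sketch: create a metric-based alert for failed Dataproc jobs with
# the Cloud Monitoring API. The metric, threshold, and duration are
# illustrative; a notification channel would normally also be attached.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"  # hypothetical project

policy = monitoring_v3.AlertPolicy(
    display_name="Dataproc failed jobs",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Failed job count above zero",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "dataproc.googleapis.com/cluster/job/failed_count" '
                    'AND resource.type = "cloud_dataproc_cluster"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0,
                duration={"seconds": 300},  # condition must hold for 5 minutes
            ),
        )
    ],
)

created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)
```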
What's next
- Learn how to troubleshoot Dataproc error messages.
- Learn how to view Dataproc cluster diagnostic data.
- See the Dataproc FAQ.