Dataproc monitoring and troubleshooting tools
Dataproc is a fully managed and highly scalable service for running open-source distributed processing platforms such as Apache Hadoop, Apache Spark, Apache Flink, and Trino. You can use the tools and files discussed in the following sections to investigate, troubleshoot, and monitor your Dataproc clusters and jobs.
AI-powered Investigations with Gemini Cloud Assist (Preview)
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Overview
The Gemini Cloud Assist Investigations preview feature uses Gemini's advanced capabilities to assist in creating and running Dataproc clusters and jobs. This feature analyzes failed clusters and failed and slow-running jobs to identify root causes and recommend fixes. It creates persistent analyses that you can review, save, and share with Google Cloud support to facilitate collaboration and accelerate issue resolution.
Features
Use this feature to create investigations from the Google Cloud console:
- Add a natural language context description to an issue before creating an investigation.
- Analyze failed clusters and slow and failed jobs.
- Get insights into issue root causes with recommended fixes.
- Create Google Cloud support cases with the full investigation context attached.
Before you begin
To get started using the Investigations feature, in your Google Cloud project, enable the Gemini Cloud Assist API.
Create an investigation
To create an investigation, do the following:
In the Google Cloud console, go to the Cloud Assist Investigations page.
Click Create.
Describe the issue: Provide a description of the cluster or job issue.
Select time range: Provide a time range when the issue occurred (default is 30 minutes).
Select resources:
- Click Add resource.
- In the Quick filters field, type "dataproc", and then select one or more of dataproc.Batch, dataproc.Job, or dataproc.Cluster as filters. You can also filter by Location.
- Select the listed batch, job, or cluster to investigate. You can add multiple resources that are affected by the issue.
Click Create.
Interpret investigation results
Once an investigation is complete, the Investigation details page opens. This page contains the full Gemini analysis, which is organized into the following sections:
- Issue: A collapsed section containing auto-populated details of the job being investigated.
- Relevant Observations: A collapsed section that lists key data points and anomalies that Gemini found during its analysis of logs and metrics.
- Hypotheses: This is the primary section, which is expanded by default. It presents a list of potential root causes for the observed issue. Each hypothesis includes:
  - Overview: A description of the possible cause, such as "High Shuffle Write Time and Potential Task Skew."
  - Recommended Fixes: A list of actionable steps to address the potential issue.
Take action
After reviewing the hypotheses and recommendations:
Apply one or more of the suggested fixes to the job configuration or code, and then rerun the job.
Provide feedback on the helpfulness of the investigation by clicking the thumbs-up or thumbs-down icons at the top of the panel.
Review and escalate investigations
To review the results of a previously run investigation, click the investigation name on the Cloud Assist Investigations page to open the Investigation details page.
If further assistance is needed, you can open a Google Cloud support case. This process provides the support engineer with the complete context of the previously performed investigation, including the observations and hypotheses generated by Gemini. This context sharing significantly reduces the back-and-forth communication required with the support team and leads to faster case resolution.
To create a support case from an investigation:
In the Investigation details page, click Request support.
Note: You can click Open chat to start a conversation and ask clarifying questions about investigation results before opening a support case.
Preview status and pricing
There is no charge for Gemini Cloud Assist investigations during public preview. Charges will apply to the feature when it becomes generally available (GA).
For more information about pricing after general availability, see Gemini Cloud Assist Pricing.
Open source web interfaces
Many Dataproc cluster open source components, such as Apache Hadoop and Apache Spark, provide web interfaces. These interfaces can be used to monitor cluster resources and job performance. For example, you can use the YARN Resource Manager UI to view YARN application resource allocation on a Dataproc cluster.
To enable access to component web interfaces available on a cluster, enable the Dataproc Component Gateway when you create the cluster.
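As a reference, the following minimal sketch enables the Component Gateway at cluster-creation time with the google-cloud-dataproc Python client; the project ID, region, and cluster name are placeholder assumptions.

```python
# Minimal sketch: create a Dataproc cluster with the Component Gateway
# enabled, using the google-cloud-dataproc Python client.
# The project ID, region, and cluster name are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"  # hypothetical project
region = "us-central1"     # hypothetical region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # enable_http_port_access turns on the Component Gateway, which
        # exposes component web UIs such as the YARN Resource Manager.
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```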
Persistent History Server
Open source web interfaces running on a cluster are available while the cluster is running, but they terminate when you delete the cluster. To view cluster and job data after a cluster is deleted, you can create a Persistent History Server (PHS).
Example: You encounter a job error or slowdown that you want to analyze. You stop or delete the job cluster, then view and analyze job history data using your PHS.
After you create a PHS, you enable it on a Dataproc cluster or Google Cloud Serverless for Apache Spark batch workload when you create the cluster or submit the batch workload. A PHS can access history data for jobs run on multiple clusters, letting you monitor jobs across a project instead of monitoring separate UIs running on different clusters.
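For illustration, here's a hedged sketch of one way to point a job cluster's Spark history files at the Cloud Storage bucket a PHS reads from; the bucket name is hypothetical, and the spark:spark.eventLog.dir and spark:spark.history.fs.logDirectory property values follow the patterns described in the PHS documentation, so verify them against that page.

```python
# Hedged sketch: write a job cluster's Spark event logs to a Cloud Storage
# bucket that a Persistent History Server reads. The bucket name is
# hypothetical; verify property names against the PHS documentation.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # placeholders
phs_bucket = "gs://my-phs-bucket"                 # hypothetical bucket

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "job-cluster",
    "config": {
        "software_config": {
            "properties": {
                # Write Spark event logs where the PHS expects to find them.
                "spark:spark.eventLog.dir": f"{phs_bucket}/job-cluster/spark-job-history",
                "spark:spark.history.fs.logDirectory": f"{phs_bucket}/*/spark-job-history",
            }
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```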
Dataproc logs
Dataproc collects the logs generated by Apache Hadoop, Spark, Hive, ZooKeeper, and other open source systems running on your clusters, and sends them to Logging. These logs are grouped based on the source of the logs, which lets you select and view the logs of interest to you: for example, YARN NodeManager and Spark executor logs generated on a cluster are labeled separately. See Dataproc logs for more information on Dataproc log contents and options.
Cloud Logging
Logging is a fully managed, real-time log management system. It provides storage for logs ingested from Google Cloud services and tools to search, filter, and analyze logs at scale. Dataproc clusters generate multiple logs, including Dataproc service agent logs, cluster startup logs, and OSS component logs, such as YARN NodeManager logs.
Logging is enabled by default on Dataproc clusters and Serverless for Apache Spark batch workloads. Logs are periodically exported to Logging, where they persist after the cluster is deleted or the workload is completed.
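For example, the following sketch reads recent log entries for a Dataproc cluster with the google-cloud-logging Python client; the project, cluster name, and log-name substring are illustrative assumptions (see the Dataproc logs page for exact log names).

```python
# Illustrative sketch: list recent log entries for a Dataproc cluster
# with the google-cloud-logging client. The project, cluster name, and
# log-name substring are assumptions; see Dataproc logs for exact names.
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project

log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'resource.labels.cluster_name="example-cluster" '
    'logName:"yarn"'  # substring match on a YARN log name (assumed)
)

for entry in client.list_entries(filter_=log_filter, max_results=10):
    print(entry.timestamp, entry.payload)
```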
Dataproc metrics
Dataproc cluster and job metrics, prefixed with dataproc.googleapis.com/, consist of time-series data that provide insights into the performance of a cluster, such as CPU utilization or job status. Dataproc custom metrics, prefixed with custom.googleapis.com/, include metrics emitted by open source systems running on the cluster, such as the YARN running applications metric. Gaining insight into Dataproc metrics can help you configure your clusters efficiently. Setting up metric-based alerts can help you recognize and respond to problems quickly.
Dataproc cluster and job metrics are collected by default without charge. The collection of custom metrics is charged to customers. You can enable the collection of custom metrics when you create a cluster. The collection of Spark metrics is enabled by default on Serverless for Apache Spark batch workloads.
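The snippet below is a hedged sketch of opting in to custom metric collection at cluster-creation time by setting dataproc_metric_config; the project, region, and chosen metric sources are assumptions.

```python
# Hedged sketch: enable custom (OSS) metric collection when creating a
# cluster by setting dataproc_metric_config. The project, region, and
# chosen metric sources are illustrative.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "metrics-cluster",
    "config": {
        "dataproc_metric_config": {
            "metrics": [
                # Collect YARN and Spark metrics emitted on the cluster.
                {"metric_source": dataproc_v1.DataprocMetricConfig.MetricSource.YARN},
                {"metric_source": dataproc_v1.DataprocMetricConfig.MetricSource.SPARK},
            ]
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```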
Cloud Monitoring
Monitoring uses cluster metadata and metrics, including HDFS, YARN, job, and operation metrics, to provide visibility into the health, performance, and availability of Dataproc clusters and jobs. You can use Monitoring to explore metrics, add charts, build dashboards, and create alerts.
Metrics Explorer
You can use the Metrics Explorer to view Dataproc metrics. Dataproc cluster, job, and Serverless for Apache Spark batch metrics are listed under the Cloud Dataproc Cluster, Cloud Dataproc Job, and Cloud Dataproc Batch resources. Dataproc custom metrics are listed under the VM Instances resource, Custom category.
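The same metric data can also be read programmatically. The following sketch queries one documented Dataproc cluster metric over the last hour with the google-cloud-monitoring Python client; the project and metric type are assumptions to adapt to your case.

```python
# Illustrative sketch: read a Dataproc cluster metric (the same data that
# Metrics Explorer charts) with the Cloud Monitoring API. The project and
# metric type are assumptions; check the Dataproc metrics list.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "dataproc.googleapis.com/cluster/hdfs/storage_utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(series.resource.labels.get("cluster_name"), point.value)
```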
Charts
You can use Metrics Explorer to create charts that visualize Dataproc metrics.
Example: You create a chart to see the number of active YARN applications running on your clusters, and then add a filter to select visualized metrics by cluster name or region.
Dashboards
You can build dashboards to monitor Dataproc clusters and jobs using metrics from multiple projects and different Google Cloud products. You can build dashboards in the Google Cloud console from the Dashboards Overview page by creating and then saving a chart from the Metrics Explorer page.
Alerts
You can create Dataproc metric alerts to receive timely notice of cluster or job issues.
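As one hedged example, the sketch below creates an alert policy on a Dataproc job metric with the google-cloud-monitoring Python client; the metric, threshold, and duration are illustrative, and in practice you would also attach a notification channel.

```python
# Hedged sketch: create a metric-based alert for failed Dataproc jobs with
# the Cloud Monitoring API. The metric, threshold, and duration are
# illustrative; a notification channel would normally also be attached.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"  # hypothetical project

policy = monitoring_v3.AlertPolicy(
    display_name="Dataproc failed jobs",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Failed job count above zero",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "dataproc.googleapis.com/cluster/job/failed_count" '
                    'AND resource.type = "cloud_dataproc_cluster"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0,
                duration={"seconds": 300},  # condition must hold for 5 minutes
            ),
        )
    ],
)

created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)
```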
What's next
- Learn how to troubleshoot Dataproc error messages.
- Learn how to view Dataproc cluster diagnostic data.
- See the Dataproc FAQ.