Create Dataproc metric alerts

You can create a Monitoring alert that notifies you when a Dataproc cluster or job metric exceeds a specified threshold.

Create an alert

  1. Open the Alerting page in the Google Cloud console.

  2. Click + Create Policy to open the Create alerting policy page.

    1. Click Select Metric. To see all available Dataproc metrics, not only those related to an existing cluster or job, clear "Show only active resources & metrics".
    2. In the "Filter by resource or metric name" input box, type "dataproc" to list Dataproc metrics. Navigate through the hierarchy of Cloud Dataproc metrics to select a cluster, job, batch, or session metric.
    3. Click Apply.
    4. Click Next to open the Configure alert trigger pane.
    5. Set a threshold value to trigger the alert.
    6. Click Next to open the Configure notifications and finalize alert pane.
    7. Set notification channels, documentation, and the alert policy name.
    8. Click Next to review the alert policy.
    9. Click Create Policy to create the alert.

Use Prometheus Query Language (PromQL) for custom alerts. You can select PromQL on the Create alerting policy page to create PromQL alerts.

Sample alerts

This section describes a sample alert for a job submitted to the Dataproc service and an alert for a job run as a YARN application.

Long-running Dataproc job alert

Dataproc emits the dataproc.googleapis.com/job/state metric, which tracks how long a job has been in different states. This metric appears in the Google Cloud console Metrics Explorer under the Cloud Dataproc Job (cloud_dataproc_job) resource. You can use this metric to set up an alert that notifies you when the job RUNNING state exceeds a duration threshold (maximum threshold: 7 days). To set up an alert for a job that is expected to run for more than 7 days, see Long-running YARN application alert.

Job duration alert setup

This example uses the Prometheus Query Language (PromQL) to create an alert policy. For more information, see Create PromQL-based alerting policies (Console).

sum by (job_id, state) ({
  "__name__"="dataproc.googleapis.com/job/state",
  "monitored_resource"="cloud_dataproc_job",
  "state"="RUNNING"
}) != 0

To trigger this alert when a job has been running for more than 30 minutes, in the Configure trigger tab, set the Evaluation Interval to 30 minutes.

You can modify the query by filtering on the job_id to apply it to a specific job:

sum by (job_id) ({
  "__name__"="dataproc.googleapis.com/job/state",
  "monitored_resource"="cloud_dataproc_job",
  "state"="RUNNING",
  "job_id"="1234567890"
}) != 0

Long-running YARN application alert

The previous sample shows an alert that is triggered when a Dataproc job runs longer than a specified duration, but it only applies to jobs submitted to the Dataproc service using the Google Cloud console, the Google Cloud CLI, or by direct calls to the Dataproc jobs API. You can also use OSS metrics to set up similar alerts that monitor the running time of YARN applications.

First, some background. YARN emits running time metrics into multiple buckets. By default, YARN maintains 60, 300, and 1440 minutes as bucket thresholds and emits four metrics: running_0, running_60, running_300, and running_1440:

  • running_0 records the number of jobs with a runtime between 0 and 60 minutes.

  • running_60 records the number of jobs with a runtime between 60 and 300 minutes.

  • running_300 records the number of jobs with a runtime between 300 and 1440 minutes.

  • running_1440 records the number of jobs with a runtime greater than 1440 minutes.

For example, a job running for 72 minutes is recorded in running_60, but not in running_0.

Note: These metrics only provide a count of the number of applications with running times that are within each bucket threshold; they don't identify the names of the applications.

These default bucket thresholds can be modified by passing new values to the yarn:yarn.resourcemanager.metrics.runtime.buckets cluster property during Dataproc cluster creation. When defining custom bucket thresholds, you must also define metric overrides. For example, to specify bucket thresholds of 30, 60, and 90 minutes, the gcloud dataproc clusters create command should include the following flags:

  • bucket thresholds: --properties=yarn:yarn.resourcemanager.metrics.runtime.buckets=30,60,90

  • metric overrides: --metric-overrides=yarn:ResourceManager:QueueMetrics:running_0,yarn:ResourceManager:QueueMetrics:running_30,yarn:ResourceManager:QueueMetrics:running_60,yarn:ResourceManager:QueueMetrics:running_90

Sample cluster creation command

gcloud dataproc clusters create test-cluster \
    --properties ^#^yarn:yarn.resourcemanager.metrics.runtime.buckets=30,60,90 \
    --metric-sources=yarn \
    --metric-overrides=yarn:ResourceManager:QueueMetrics:running_0,yarn:ResourceManager:QueueMetrics:running_30,yarn:ResourceManager:QueueMetrics:running_60,yarn:ResourceManager:QueueMetrics:running_90

These metrics are listed in the Google Cloud console Metrics Explorer under the VM Instance (gce_instance) resource.

Note: Recently defined metrics may not be immediately visible in Cloud Monitoring. Typically, they appear within ten to fifteen minutes after you create a cluster with custom buckets and metric overrides enabled.

YARN application alert setup

  1. Create a cluster with required buckets and metrics enabled.

  2. Create an alert policy that triggers when the number of applications in a YARN metric bucket exceeds a specified threshold, as shown in the sketch after this list.

    • Optionally, add a filter to alert on clusters that match a pattern.

    • Configure the threshold for triggering the alert.
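For example, a PromQL-based condition on one of the bucket metrics could look like the following sketch. The metric name shown here is a placeholder, not a confirmed identifier: the running_60 override is published under the VM Instance (gce_instance) resource, so check Metrics Explorer for the exact name it receives in your project before using this query.

# Sketch only. Replace the placeholder metric name with the running_60 metric
# name shown in Metrics Explorer under the VM Instance (gce_instance) resource.
sum ({
  "__name__"="custom.googleapis.com/yarn/ResourceManager/QueueMetrics/running_60",
  "monitored_resource"="gce_instance"
}) > 0

This condition fires while at least one YARN application is counted in the running_60 bucket.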

Failed Dataproc job alert

You can also use the dataproc.googleapis.com/job/state metric (see Long-running Dataproc job alert) to alert you when a Dataproc job fails.

Failed job alert setup

This example uses the Prometheus Query Language (PromQL) to create an alert policy. For more information, see Create PromQL-based alerting policies (Console).

Alert PromQL
sum by (job_id, state) ({
  "__name__"="dataproc.googleapis.com/job/state",
  "monitored_resource"="cloud_dataproc_job",
  "state"="ERROR"
}) != 0
Alert trigger configuration

In the following example, the alert triggers when any Dataproc job fails in your project.

You can modify the query by filtering on the job_id to apply it to a specific job:

sum by (job_id) ({
  "__name__"="dataproc.googleapis.com/job/state",
  "monitored_resource"="cloud_dataproc_job",
  "state"="ERROR",
  "job_id"="1234567890"
}) != 0

Cluster capacity deviation alert

Dataproc emits the dataproc.googleapis.com/cluster/capacity_deviation metric, which reports the difference between the expected node count in the cluster and the active YARN node count. You can find this metric in the Google Cloud console Metrics Explorer under the Cloud Dataproc Cluster resource. You can use this metric to create an alert that notifies you when cluster capacity deviates from expected capacity for longer than a specified threshold duration.

The following operations can cause a temporary underreporting of cluster nodes in the capacity_deviation metric. To avoid false positive alerts, set the metric alert threshold to account for these operations:

  • Cluster creation and updates: The capacity_deviation metric is not emitted during cluster create or update operations.

  • Cluster initialization actions: Initialization actions are performed after a node is provisioned.

  • Secondary worker updates: Secondary workers are added asynchronously, after the update operation completes.

The maximum lookback window for the cluster capacity_deviation metric is 7 days. If a cluster update operation does not occur during the previous 7 days, the metric will be empty.

Capacity deviation alert setup

This example uses the Prometheus Query Language (PromQL) to create an alert policy. For more information, see Create PromQL-based alerting policies (Console).

{  "__name__"="dataproc.googleapis.com/cluster/capacity_deviation",  "monitored_resource"="cloud_dataproc_cluster"} != 0

In the next example, the alert triggers when cluster capacity deviation is non-zero for more than 30 minutes.
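As with the job-based samples, you can narrow the query to a single cluster by filtering on a resource label. The cluster_name label used below is an assumption based on the Cloud Dataproc Cluster monitored resource; confirm the available label names in Metrics Explorer before relying on it:

{
  "__name__"="dataproc.googleapis.com/cluster/capacity_deviation",
  "monitored_resource"="cloud_dataproc_cluster",
  "cluster_name"="my-cluster"
} != 0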

View alerts

When an alert is triggered by a metric threshold condition, Monitoring creates an incident and a corresponding event. You can view incidents from the Monitoring Alerting page in the Google Cloud console.

If you defined a notification mechanism in the alert policy, such as an email or SMS notification, Monitoring sends a notification of the incident.

What's next
