Troubleshoot VM out-of-memory errors
This page provides information on Dataproc on Compute Engine VM out-of-memory (OOM) errors, and explains steps you can take to troubleshoot and resolve OOM errors.
OOM error effects
When Dataproc on Compute Engine VMs encounter out-of-memory (OOM) errors, the effects include the following conditions:
- Master and worker VMs freeze for a period of time.
- Master VM OOM errors cause jobs to fail with "task not acquired" errors.
- Worker VM OOM errors cause the loss of the node in YARN and HDFS, which delays Dataproc job execution.
YARN memory controls
Apache YARN provides the following types of memory controls:
- Polling based (legacy)
- Strict
- Elastic
By default, Dataproc doesn't set `yarn.nodemanager.resource.memory.enabled` to enable YARN memory controls, for the following reasons:
- Strict memory control can cause the termination of containers when there is sufficient memory if container sizes aren't configured correctly.
- Elastic memory control requirements can adversely affect job execution.
- YARN memory controls can fail to prevent OOM errors when processes aggressively consume memory.
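If you decide to enable YARN memory controls despite these caveats, one way to do so is to set the YARN property when you create the cluster. The following is a minimal sketch, assuming the `yarn:` prefix of the `--properties` flag of `gcloud dataproc clusters create`; the cluster name and region are placeholders:

```
# Hypothetical example: enable YARN memory controls at cluster creation.
# CLUSTER_NAME and REGION are placeholders.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties='yarn:yarn.nodemanager.resource.memory.enabled=true'
```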
Dataproc memory protection
When a Dataproc cluster VM is under memory pressure, Dataproc memory protection terminates processes or containers until the OOM condition is removed.
Dataproc provides memory protection for the following cluster nodes in the listed Dataproc on Compute Engine image versions:
| Role | 1.5 | 2.0 | 2.1 | 2.2 |
|---|---|---|---|---|
| Master VM | 1.5.74+ | 2.0.48+ | all | all |
| Worker VM | Not Available | 2.0.76+ | 2.1.24+ | all |
| Driver Pool VM | Not Available | 2.0.76+ | 2.1.24+ | all |
Identify and confirm memory protection terminations
You can use the following information to identify and confirm job terminations due to memory pressure.
Process terminations
Processes that Dataproc memory protection terminates exit with code `137` or `143`. When Dataproc terminates a process due to memory pressure, the following actions or conditions can occur:

- Dataproc increments the `dataproc.googleapis.com/node/problem_count` cumulative metric, and sets the `reason` to `ProcessKilledDueToMemoryPressure`. See Dataproc resource metric collection.
- Dataproc writes a `google.dataproc.oom-killer` log with the message: "A process is killed due to memory pressure: process name". To view these messages, enable Logging, then use the following log filter:

```
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"A process is killed due to memory pressure:"
```
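As an alternative to the Logs Explorer, you can run the same filter from the command line with `gcloud logging read`. This is a minimal sketch; the project ID, cluster name, and cluster UUID are placeholders:

```
# Query memory-pressure kill events for a cluster (placeholders in caps).
gcloud logging read '
  resource.type="cloud_dataproc_cluster"
  resource.labels.cluster_name="CLUSTER_NAME"
  resource.labels.cluster_uuid="CLUSTER_UUID"
  jsonPayload.message:"A process is killed due to memory pressure:"' \
  --project=PROJECT_ID \
  --limit=20
```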
Master node or driver node pool job terminations
When a Dataproc master node or driver node pool job terminates due to memory pressure, the job fails with the error `Driver received SIGTERM/SIGKILL signal and exited with INT code`. To view these messages, enable Logging, then use the following log filter:

```
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"Driver received SIGTERM/SIGKILL signal and exited with"
```
- Check the `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count` metric to confirm that Dataproc memory protection terminated the job (see Process terminations).
Solutions:
- If the cluster has a driver pool, increase `driver-required-memory-mb` to the actual job memory usage (see the example after this list).
- If the cluster does not have a driver pool, recreate the cluster, lowering the maximum number of concurrent jobs running on the cluster.
- Use a master node machine type with increased memory.
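For clusters that use a driver pool, the following sketch shows how you might request more driver memory at job submission time. It assumes the `--driver-required-memory-mb` flag of `gcloud dataproc jobs submit`; the cluster name, region, and memory value are placeholders to adjust to your job's actual usage:

```
# Hypothetical example: submit a Spark job that reserves 4 GiB for the driver
# on a cluster with a driver pool (placeholders in caps).
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --driver-required-memory-mb=4096 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```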
Worker node YARN container terminations
Dataproc writes the following message in the YARN resource manager: `container id exited with code EXIT_CODE`. To view these messages, enable Logging, then use the following log filter:

```
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"container" AND "exited with code" AND "which potentially signifies memory pressure on NODE"
```
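To dig into why a specific container exited, you can also pull its application logs from the master node with the YARN CLI. This is a sketch; the application ID is a placeholder that you can find in the YARN resource manager UI or with `yarn application -list`:

```
# Hypothetical example: fetch YARN logs for the application whose container
# exited (run on the master node; APPLICATION_ID is a placeholder).
yarn logs -applicationId APPLICATION_ID | less
```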
If a container exited with code `INT`, check the `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count` metric to confirm that Dataproc memory protection terminated the job (see Process terminations).

Solutions:
- Check that container sizes are configured correctly.
- Consider lowering `yarn.nodemanager.resource.memory-mb`. This property controls the amount of memory used for scheduling YARN containers (see the example after this list).
- If job containers consistently fail, check if data skew is causing increased usage of specific containers. If so, repartition the job or increase worker size to accommodate additional memory requirements.
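One way to apply a lower `yarn.nodemanager.resource.memory-mb` is to set it as a cluster property when you recreate the cluster. A minimal sketch, assuming the `yarn:` prefix of the `--properties` flag; the 24 GiB value is only an illustration:

```
# Hypothetical example: cap YARN's schedulable memory per node manager at 24 GiB.
# CLUSTER_NAME and REGION are placeholders; pick a value suited to your workers.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties='yarn:yarn.nodemanager.resource.memory-mb=24576'
```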
Fine-tune Linux memory protection on the master node (advanced)
Dataproc master nodes use the `earlyoom` utility to prevent system hangs by freeing memory when available memory is critically low. The default configuration is suitable for many workloads. However, you might need to adjust the configuration if your master node has a large amount of memory and experiences rapid memory consumption.
In scenarios with high memory pressure, the system can enter a state of "thrashing," where it spends most of its time on memory management and becomes unresponsive. This can happen so quickly that `earlyoom` fails to take action based on its default settings or fails to act before the kernel OOM response is invoked.
Before you begin
- This is an advanced tuning option. Before you adjust `earlyoom` settings, prioritize other solutions, such as using a master VM with more memory, reducing job concurrency, or optimizing job memory usage.
Customize earlyoom settings
The default `earlyoom` configuration uses a fixed amount of free memory as a trigger. On virtual machines with a large amount of RAM, for example 32 GB or more, this fixed amount might represent a small fraction of the total memory. This makes the system susceptible to sudden spikes in memory usage.
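To turn the percentage guideline in the procedure below (1-5% of total memory) into a concrete `-M` value, you can compute it from the node's total memory. A small sketch, run on the master node, that prints roughly 2% of total memory in KiB (the 2% figure is only an illustration):

```
# Print ~2% of total system memory in KiB as a candidate earlyoom -M value.
awk '/MemTotal/ { printf "%d\n", $2 * 0.02 }' /proc/meminfo
```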
To customize the `earlyoom` settings, connect to the master node and modify the configuration file.
1. Open the configuration file for editing:

   ```
   sudo nano /etc/default/earlyoom
   ```

2. Adjust the minimum memory threshold. Locate the `EARLYOOM_ARGS` line. The `-M <kbytes>` option sets the minimum amount of free memory in KiB that `earlyoom` tries to maintain. The default value is `-M 65536`, which is 64 MiB.

   For a master node with substantial memory, increase this value. For example, to set the threshold to 1 GiB (1048576 KiB), modify the line as follows:

   ```
   EARLYOOM_ARGS="-r 15 -M 1048576 -s 1"
   ```

   Notes:

   - `-r`: Memory report interval in seconds
   - `-s`: The swap space threshold to trigger `earlyoom`

   Tuning considerations: Setting the `-M` value too high can cause `earlyoom` to terminate processes more aggressively. This can impact applications even when sufficient memory might still be available for them. Monitor system behavior after making changes. Adjust the `-M` value based on your instance size and workload. A general guideline is to set `-M` to a value that represents 1-5% of total system memory.

3. Restart the `earlyoom` service to apply the changes:

   ```
   sudo systemctl restart earlyoom
   ```
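After the restart, you can optionally confirm that the service picked up the new arguments and watch its activity on the master node:

```
# Confirm the earlyoom service is running with the updated EARLYOOM_ARGS.
sudo systemctl status earlyoom

# Review recent earlyoom activity (memory reports and any terminations).
sudo journalctl -u earlyoom --since "1 hour ago"
```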