Perform proactive monitoring with Cloud Monitoring
Reacting to issues after they occur can lead to downtime. To maintain a resilient system in Google Kubernetes Engine (GKE), you need to identify potential problems before they affect your users.

Use this page to proactively monitor your GKE environment with Cloud Monitoring by tracking key performance indicators, visualizing trends, and setting up alerts to detect issues like rising error rates or resource constraints.

This information is important for Platform admins and operators responsible for ensuring the health, reliability, and efficiency of the GKE environment. It also helps Application developers understand their app's performance in real-world conditions, detect regressions across deployments, and gain insights for optimization. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Review useful metrics
GKE automatically sends a set of metrics to Cloud Monitoring. The following sections list some of the most important metrics for troubleshooting:
- Container performance and health metrics
- Node performance and health metrics
- Pod performance and health metrics
For a complete list of GKE metrics, see GKE system metrics.
Container performance and health metrics
Start with these metrics when you suspect a problem with a specific app. These metrics help you monitor the health of your app, including discovering if a container is restarting frequently, running out of memory, or being throttled by CPU limits.
| Metric | Description | Troubleshooting significance |
|---|---|---|
| kubernetes.io/container/cpu/limit_utilization | The fraction of the CPU limit that is currently in use on the instance. This value can be greater than 1 as a container might be allowed to exceed its CPU limit. | Identifies CPU throttling. High values can lead to performance degradation. |
| kubernetes.io/container/memory/limit_utilization | The fraction of the memory limit that is currently in use on the instance. This value cannot exceed 1. | Monitors for risk of OutOfMemory (OOM) errors. |
| kubernetes.io/container/memory/used_bytes | Actual memory consumed by the container in bytes. | Tracks memory consumption to identify potential memory leaks or risk of OOM errors. |
| kubernetes.io/container/memory/page_fault_count | Number of page faults, broken down by type: major and minor. | Indicates significant memory pressure. Major page faults mean memory is being read from disk (swapping), even if memory limits aren't reached. |
| kubernetes.io/container/restart_count | Number of times the container has restarted. | Highlights potential problems such as crashing apps, misconfigurations, or resource exhaustion through a high or increasing number of restarts. |
| kubernetes.io/container/ephemeral_storage/used_bytes | Local ephemeral storage usage in bytes. | Monitors temporary disk usage to prevent Pod evictions due to full ephemeral storage. |
| kubernetes.io/container/cpu/request_utilization | The fraction of the requested CPU that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned CPU requests to help you optimize resource allocation. |
| kubernetes.io/container/memory/request_utilization | The fraction of the requested memory that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned memory requests to improve scheduling and prevent OOM errors. |
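You can also query these metrics programmatically with the Cloud Monitoring API. The following is a minimal sketch, assuming the google-cloud-monitoring Python client, that flags containers approaching their memory limit; the project ID and the 0.9 threshold are illustrative placeholders, not values from this page:

```python
# A minimal sketch, using the google-cloud-monitoring Python client, that
# flags containers whose latest memory limit utilization is high. The project
# ID and the 0.9 threshold are illustrative placeholders.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Fetch the last hour of memory limit utilization for every container.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "kubernetes.io/container/memory/limit_utilization" '
            'AND resource.type = "k8s_container"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Points are returned newest-first; check the most recent sample.
for series in results:
    latest = series.points[0].value.double_value
    if latest > 0.9:
        labels = series.resource.labels
        print(
            f"{labels['namespace_name']}/{labels['pod_name']} "
            f"({labels['container_name']}): {latest:.0%} of memory limit"
        )
```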
Node performance and health metrics
Examine these metrics when you need to diagnose issues with the underlying GKE infrastructure. These metrics are crucial for understanding the overall health and capacity of your nodes, helping you investigate whether the node is unhealthy or under pressure, or whether the node has enough memory to schedule new Pods.
| Metric | Description | Troubleshooting significance |
|---|---|---|
| kubernetes.io/node/cpu/allocatable_utilization | The fraction of the allocatable CPU that is currently in use on the instance. | Indicates if the sum of Pod usage is straining the node's available CPU resources. |
| kubernetes.io/node/memory/allocatable_utilization | The fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 as usage cannot exceed allocatable memory bytes. | Suggests that the node lacks memory for scheduling new Pods or for existing Pods to operate, especially when values are high. |
| kubernetes.io/node/status_condition (BETA) | Condition of a node from the node status condition field. | Reports node health conditions like Ready, MemoryPressure, or DiskPressure. |
| kubernetes.io/node/ephemeral_storage/used_bytes | Local ephemeral storage bytes used by the node. | Helps prevent Pod startup failures or evictions by providing warnings about high ephemeral storage usage. |
| kubernetes.io/node/ephemeral_storage/inodes_free | Free number of index nodes (inodes) on local ephemeral storage. | Monitors the number of free inodes. Running out of inodes can halt operations even if disk space is available. |
| kubernetes.io/node/interruption_count (BETA) | Interruptions are system evictions of infrastructure while the customer is in control of that infrastructure. This metric is the current count of interruptions by type and reason. | Explains why a node might disappear unexpectedly due to system evictions. |
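As a hedged example, you could watch the Ready condition across nodes with the same Python client. The condition and status label names follow the metric's schema in the GKE system metrics reference, and the project ID is a placeholder:

```python
# A sketch that surfaces nodes whose Ready condition is not True. The
# "condition" and "status" label names follow the GKE system metrics schema;
# the project ID is a placeholder, and the metric is in beta.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
)

# Look for Ready=False or Ready=Unknown series in the last 10 minutes.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "kubernetes.io/node/status_condition" '
            'AND metric.labels.condition = "Ready" '
            'AND metric.labels.status != "True"'
        ),
        "interval": interval,
    }
)

for series in results:
    node = series.resource.labels["node_name"]
    status = series.metric.labels["status"]
    # A nonzero latest value means the node currently reports this status.
    if series.points and series.points[0].value.int64_value:
        print(f"Node {node} reports Ready={status}")
```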
Pod performance and health metrics
These metrics help you troubleshoot issues related to a Pod's interaction with its environment, such as networking and storage. Use these metrics when you need to diagnose slow-starting Pods, investigate potential network connectivity issues, or proactively manage storage to prevent write failures from full volumes.
| Metric | Description | Troubleshooting significance |
|---|---|---|
| kubernetes.io/pod/network/received_bytes_count | Cumulative number of bytes received by the Pod over the network. | Identifies unusual network activity (high or low) that can indicate app or network issues. |
| kubernetes.io/pod/network/policy_event_count (BETA) | Change in the number of network policy events seen in the dataplane. | Identifies connectivity issues caused by network policies. |
| kubernetes.io/pod/volume/utilization | The fraction of the volume that is currently being used by the instance. This value cannot be greater than 1 as usage cannot exceed the total available volume space. | Enables proactive management of volume space by warning when high utilization (approaching 1) might lead to write failures. |
| kubernetes.io/pod/latencies/pod_first_ready (BETA) | The Pod end-to-end startup latency (from Pod `Created` to `Ready`), including image pulls. | Diagnoses slow-starting Pods. |
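For instance, you might reduce pod_first_ready to a per-namespace 99th percentile to spot namespaces with slow-starting Pods. This sketch assumes the metric reports latency as a numeric gauge in seconds; verify the value type in Metrics Explorer before relying on it:

```python
# A sketch of tracking slow Pod startups fleet-wide: reduce per-Pod startup
# latencies to a per-namespace 99th percentile. This assumes the metric
# reports latency as a numeric gauge in seconds; verify the value type in
# Metrics Explorer first. The project ID is a placeholder.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
)

# Align each Pod's series to 10-minute means, then take the p99 per namespace.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 600},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_PERCENTILE_99,
        "group_by_fields": ["resource.labels.namespace_name"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "kubernetes.io/pod/latencies/pod_first_ready"',
        "interval": interval,
        "aggregation": aggregation,
    }
)

for series in results:
    namespace = series.resource.labels["namespace_name"]
    latest = series.points[0].value.double_value  # reducers emit doubles
    print(f"{namespace}: p99 Pod startup latency {latest:.1f}s")
```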
Visualize metrics with Metrics Explorer
To visualize the state of your GKE environment, create charts based on metrics with Metrics Explorer.
To use Metrics Explorer, complete the following steps:
In the Google Cloud console, go to the Metrics Explorer page.

In the Metrics field, select or enter the metric that you want to inspect.

View the results and observe any trends over time.
For example, to investigate the memory consumption of Pods in a specific namespace, you can do the following:
- In the Select a metric list, choose the metric kubernetes.io/container/memory/used_bytes and click Apply.
- Click Add filter and select namespace_name.
- In the Value list, select the namespace you want to investigate.
- In the Aggregation field, select Sum > pod_name and click OK. This setting displays a separate time series line for each Pod.
- Click Save chart.
The resulting chart shows you the memory usage for each Pod over time, which can help you visually identify any Pods with unusually high or spiking memory consumption.
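If you want the same view outside the console, for example in a script or notebook, a roughly equivalent query with the Python client might look like the following sketch. The aggregation settings mirror the Metrics Explorer steps, and the project ID and namespace are placeholders:

```python
# A roughly equivalent query through the Python client: one-minute means,
# summed across containers and grouped by Pod name, mirroring the
# Metrics Explorer settings above. Project ID and namespace are placeholders.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder
NAMESPACE = "your-namespace"    # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Sum > pod_name in the console maps to REDUCE_SUM grouped by pod_name here.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 60},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        "group_by_fields": ["resource.labels.pod_name"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "kubernetes.io/container/memory/used_bytes" '
            f'AND resource.labels.namespace_name = "{NAMESPACE}"'
        ),
        "interval": interval,
        "aggregation": aggregation,
    }
)

for series in results:
    pod = series.resource.labels["pod_name"]
    latest = series.points[0].value.double_value  # ALIGN_MEAN emits doubles
    print(f"{pod}: {latest / (1024 ** 2):.1f} MiB")
```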
Metrics Explorer has a great deal of flexibility in how to construct the metrics that you want to view. For more information about advanced Metrics Explorer options, see Create charts with Metrics Explorer in the Cloud Monitoring documentation.
Create alerts for proactive issue detection
To receive notifications when things go wrong or when metrics breach certain thresholds, set up alerting policies in Cloud Monitoring.
For example, to set up an alerting policy that notifies you when a container's CPU limit utilization exceeds 80% for five minutes, do the following:
In the Google Cloud console, go to the Alerting page.
Click Create policy.
In the Select a metric box, filter for CPU limit utilization and then select the following metric: kubernetes.io/container/cpu/limit_utilization. Click Apply.

Leave the Add a filter field blank. This setting triggers an alert when any cluster violates your threshold.
In the Transform data section, do the following:

- In the Rolling window list, select 1 minute. This setting means that Google Cloud calculates an average value every minute.
- In the Rolling window function list, select mean.

Both of these settings average the CPU limit utilization for each container every minute.
Click Next.
In the Configure alert section, do the following:

- For Condition type, select Threshold.
- For Alert trigger, select Any time series violates.
- For Threshold position, select Above threshold.
- For Threshold value, enter 0.8. This value represents the 80% threshold that you want to monitor for.
- Click Advanced options.
- In the Retest window list, select 5 min. This setting means that the alert triggers only if the CPU utilization stays over 80% for a continuous five-minute period, which reduces false alarms from brief spikes.
- In the Condition name field, give the condition a descriptive name.

Click Next.
In the Configure the notifications and finalize the alert section, do the following:

- In the Notification channels list, select the channel where you want to receive the alert. If you don't have a channel, click Manage notification channels to create one.
- In the Name the alert policy field, give the policy a clear and descriptive name.
- Leave all other fields with their default values.
- Click Next.
Review your policy, and if it all looks correct, click Create policy.
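If you manage alerting as code rather than through the console, the following is a minimal sketch of the same policy created with the Cloud Monitoring API's Python client. The project ID and notification channel are placeholders, and the settings mirror the console steps:

```python
# A minimal sketch of the same policy built with the google-cloud-monitoring
# Python client. The project ID and notification channel are placeholders;
# the aggregation, threshold, and duration mirror the console steps above.
from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder
CHANNEL = f"projects/{PROJECT_ID}/notificationChannels/CHANNEL_ID"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Container CPU limit utilization above 80%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "kubernetes.io/container/cpu/limit_utilization" '
            'AND resource.type = "k8s_container"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.8,        # the 80% threshold
        duration={"seconds": 300},  # the 5-minute retest window
        aggregations=[
            monitoring_v3.Aggregation(
                {
                    "alignment_period": {"seconds": 60},  # 1-minute rolling window
                    "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                }
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="GKE container CPU limit utilization",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    notification_channels=[CHANNEL],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(f"Created alert policy: {created.name}")
```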
To learn about the additional ways that you can create alerts, see Alerting overview in the Cloud Monitoring documentation.
What's next
Read Accelerate diagnosis with Gemini Cloud Assist (the next page in this series).

See these concepts applied in the example troubleshooting scenario.

For advice about resolving specific problems, review GKE's troubleshooting guides.

If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.