Troubleshooting compute instance performance issues

This document shows you how to diagnose and mitigate CPU, memory, network, and storage performance issues on Compute Engine virtual machine (VM) and bare metal instances.

Before you begin

  • Install the Ops Agent to view full instance performance metrics, such as memory and disk space utilization.
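
    If you need to install the agent on a single Linux instance, one option is the Ops Agent installation script. The following is a sketch; verify the commands against the current Ops Agent installation documentation before running them:

      # Download the Ops Agent repository script and install the agent
      # (verify against the current Ops Agent installation documentation).
      curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
      sudo bash add-google-cloud-ops-agent-repo.sh --also-install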

View performance metrics

To view performance metrics for your compute instances, use the Cloud Monitoring observability metrics available in the Google Cloud console.

  1. In the Google Cloud console, go to the VM Instances page.

    Go to VM Instances

  2. You can view metrics for individual instances or for the five instances that are consuming the largest amount of a resource.

    To view metrics for individual instances, do the following:

    1. Click the name of the instance that you want to view performance metrics for. The instance details page opens.

    2. Click the Observability tab to open the Observability Overview page.

    To view metrics for the five instances consuming the largest amount of a resource, click the Observability tab on the VM instances page.

  3. Explore the instance's performance metrics. View the Overview, CPU, Memory, Network, and Disk sections to see detailed metrics about each topic. The following are key metrics that indicate instance performance:

    • On the Overview page:

      • CPU Utilization. The percent of CPU used by the instance.

      • Memory Utilization. The percent of memory used by the instance, excluding disk caches. For instances that use a Linux OS, this also excludes kernel memory.

      • Network Traffic. The average rate of bytes sent and received in one-minute intervals.

      • New Connections with VMs/External/Google. The estimated number of distinct TCP/UDP flows in one minute, grouped by peer type.

      • Disk Throughput. The average rate of bytes written to and read from disks.

      • Disk IOPS. The average rate of I/O read and write operations to disks.

    • On the Network Summary page:

      • Sent to VMs/External/Google. The rate of network traffic sent to Google services, instances, and external destinations, based on a sample of packets. The metric is scaled so that the sum matches the total sent network traffic.

      • Received from VMs/External/Google. The rate of network traffic received from Google services, instances, and external sources, based on a sample of packets. The metric is scaled so that the sum matches the total received network traffic.

      • Network Packet Totals. The total rate of sent and received packets in one-minute intervals.

      • Packet Mean Size. The mean size of packets, in bytes, sent and received in one-minute intervals.

      • Firewall Incoming Packets Denied. The rate of incoming network packets sent to the instance, but not received by the instance, because they were denied by firewall rules.

    • On the Disks Performance page:

      • I/O Size Avg. The average size of I/O read and write operations to disks. Small (4 to 16 KiB) random I/Os are usually limited by IOPS, and sequential or large (256 KiB to 1 MiB) I/Os are limited by throughput.

      • Queue Length Avg. The number of queued and running disk I/O operations, also called queue depth, for the top 5 devices. To reach the performance limits of your disks, use a high I/O queue depth. Persistent Disk and Google Cloud Hyperdisk are networked storage and generally have higher latency compared to physical disks or Local SSD disks.

      • I/O Latency Avg. The average latency of I/O read and write operations, aggregated across operations of all disks attached to the instance, as measured by the Ops Agent. This value includes operating system and file system processing latency, and is dependent on queue length and I/O size.

Understand performance metrics

Instance performance is affected by the hardware that the instance runs on, the workload running on the instance, and the instance's machine type. If the hardware cannot support the workload or network traffic of your instance, your instance's performance might be affected.

CPU and memory performance

Hardware details

CPU and memory performance is affected by the following hardware constraints:

  • Each virtual CPU (vCPU) is implemented as a single hardware multithread on a CPU processor.
  • Intel Xeon processors support multiple application threads on a single processor core.
  • VMs that use C2 machine types have fixed virtual-to-physical core mapping and expose the NUMA cell architecture to the guest OS.
  • Most VMs get the all-core turbo frequency listed on CPU platforms, even if only the base frequency is advertised to the guest environment.
  • Shared-core machine types use context-switching to share a physical core between vCPUs for multitasking. They also offer bursting capabilities during which the CPU utilization for a VM can go over 100%. For more information, see Shared-core machine types.

To understand an instance's CPU and memory performance, view the performance metrics for CPU Utilization and Memory Utilization. You can additionally use process metrics to view running processes, attribute anomalies in resource consumption to a specific process, or identify your instance's most expensive resource consumers.
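
If you want to confirm the same information from inside the guest, rather than through Cloud Monitoring process metrics, a standard Linux ps listing is often enough to spot the most expensive processes. This is a minimal sketch assuming a Linux guest:

    # Top 10 processes by CPU usage.
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 11

    # Top 10 processes by resident memory usage.
    ps -eo pid,comm,%cpu,%mem --sort=-%mem | head -n 11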

Consistently high CPU or memory utilization indicates the need to scale up the size of a VM. If the VM consistently uses greater than 90% of its CPU or memory, change the VM's machine type to a machine type with more vCPUs or memory.
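
For example, you can change the machine type with the gcloud CLI. This is a minimal sketch; the instance name, zone, and target machine type are placeholders, and the instance must be stopped before its machine type can be changed:

    # Stop the instance, change its machine type, and start it again.
    # INSTANCE_NAME, ZONE, and the target machine type are placeholders.
    gcloud compute instances stop INSTANCE_NAME --zone=ZONE
    gcloud compute instances set-machine-type INSTANCE_NAME \
        --zone=ZONE \
        --machine-type=e2-standard-8
    gcloud compute instances start INSTANCE_NAME --zone=ZONE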

Unusually high or unusually low CPU utilization might indicate that your VM is experiencing a CPU soft lockup. For more information, see Troubleshooting vCPU soft lockups.

Network performance

Hardware details

Network performance is affected by the following hardware constraints:

  • Each machine type has a specific egress bandwidth cap. To find the maximum egress bandwidth for your instance's machine type, visit the page that corresponds to your instance's machine family.
  • Adding additional network interfaces or adding additional IP addresses per network interface to a VM doesn't increase the VM's ingress or egress network bandwidth, but you can configure some machine types for higher bandwidth, as sketched after this list. For more information, see Configuring a VM with higher bandwidth.
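
For example, on supported machine series you can request per-VM Tier_1 networking performance when you create an instance. The following gcloud sketch assumes a machine type and image that support Tier_1 bandwidth and gVNIC; the instance name, zone, and machine type are placeholders:

    # Create a VM with gVNIC and Tier_1 networking performance
    # (supported only on certain machine types).
    gcloud compute instances create INSTANCE_NAME \
        --zone=ZONE \
        --machine-type=n2-standard-32 \
        --image-family=debian-12 \
        --image-project=debian-cloud \
        --network-interface=nic-type=GVNIC \
        --network-performance-configs=total-egress-bandwidth-tier=TIER_1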

To understand an instance's network performance, view the performance metrics for Network Packet Totals, Packet Mean Size, New Connections with VMs/External/Google, Sent to VMs/External/Google, Received From VMs/External/Google, and Firewall Incoming Packets Denied.

Review whether Network Packet Totals, Packet Mean Size, and New Connections with VMs/External/Google are typical for your workload. For example, a web server might experience many connections and small packets, while a database might experience few connections and large packets.

Consistently high outgoing network traffic might indicate the need to change the VM's machine type to a machine type that has a higher egress bandwidth limit.

If you notice high numbers of incoming packets denied by firewalls, visit the Network Intelligence Firewall Insights page in the Google Cloud console to learn more about the origins of denied packets.

Go to the Firewall Insights page

If you think your own traffic is being incorrectly denied by firewalls, you can create and run connectivity tests.
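
One way to do this is with the gcloud CLI for Network Intelligence Center Connectivity Tests. The following is a sketch; the test name, project, zones, instance names, and port are placeholders:

    # Create a connectivity test between two instances and view the result.
    gcloud network-management connectivity-tests create my-connectivity-test \
        --source-instance=projects/PROJECT_ID/zones/ZONE/instances/SOURCE_INSTANCE \
        --destination-instance=projects/PROJECT_ID/zones/ZONE/instances/DEST_INSTANCE \
        --protocol=TCP \
        --destination-port=443

    gcloud network-management connectivity-tests describe my-connectivity-test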

If your instance sends and receives a high amount of traffic from instances in different zones or regions, consider modifying your workload to keep more data within a zone or region to reduce latency and costs. For more information, see VM-VM data transfer pricing within Google Cloud. If your instance sends a large amount of traffic to other instances within the same zone, consider a compact placement policy to achieve low network latency.
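
As a sketch, a compact placement policy can be created and attached with the gcloud CLI; the policy name, region, zone, instance name, and machine type below are placeholders, and only certain machine series support compact placement:

    # Create a compact placement policy.
    gcloud compute resource-policies create group-placement my-compact-policy \
        --collocation=collocated \
        --region=REGION

    # Create an instance that uses the placement policy.
    gcloud compute instances create INSTANCE_NAME \
        --zone=ZONE \
        --machine-type=c2-standard-8 \
        --resource-policies=my-compact-policy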

Bare metal instances

Similar to on-premises hardware, Compute Engine bare metal instances have all CPU sleep states enabled by default. This can cause idle cores to enter a sleep state and can reduce the network performance of bare metal instances. You can disable these sleep states in the operating system if you need full network bandwidth performance.

  • To disable the sleep states on a bare metal instance without needing to restart the instance, use the following script:

    for cpu in {0..191}; do
      echo "1" | sudo tee /sys/devices/system/cpu/cpu$cpu/cpuidle/state3/disable
      echo "1" | sudo tee /sys/devices/system/cpu/cpu$cpu/cpuidle/state2/disable
    done
  • Alternatively, you can update the GRUB configuration file to persist the changes across instance restarts.

    # add intel_idle.max_cstate=1 processor.max_cstate=1 to GRUB_CMDLINE_LINUX
    sudo vim /etc/default/grub
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo reboot
  • After the reboot, verify that the C6 and C1E sleep states are disabled:

    ls /sys/devices/system/cpu/cpu0/cpuidle/
    state0  state1
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    POLL
    C1

The Input-Output Memory Management Unit (IOMMU) is a CPU feature that provides address virtualization for PCI devices. IOMMU can negatively impact networking performance if there are many I/O translation lookaside buffer (IOTLB) misses.

  • You are more likely to have misses when small pages are used.
  • For best performance, use large pages (2 MB to 1 GB in size). To check the guest's huge page configuration, see the commands after this list.
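
To see whether a Linux guest is configured for huge pages, you can inspect the kernel's huge page counters and the transparent huge page setting. This is a quick check only; how the pages are actually used depends on your workload:

    # Show static huge page configuration and usage.
    grep Huge /proc/meminfo

    # Show whether transparent huge pages are enabled.
    cat /sys/kernel/mm/transparent_hugepage/enabled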

Storage performance

Hardware details

Storage is affected by the following hardware constraints:

  • The total size of all persistent disks, combined with the number of vCPUs, determines total storage performance. If different types of persistent disks are attached to a VM, the SSD persistent disk performance limit is shared by all disks on the VM. For more information, see Block storage performance.
  • When Persistent Disk and Hyperdisk compete with outbound data transfer traffic, 60% of the maximum outbound network bandwidth is used for Persistent Disk and Hyperdisk, and the remaining 40% can be used for outbound network data transfer. For more information, see Other factors that affect performance.
  • I/O size and queue depth performance are dependent on the workload. Some workloads might not be large enough to reach the full I/O size and queue depth performance limits.
  • A VM's machine type affects its storage performance. For more information, see Machine type and vCPU count.

To understand a VM's storage performance, view the performance metrics for Throughput, Operations (IOPS), I/O Size, I/O Latency, and Queue Length.

Disk throughput and IOPS indicate whether the VM workload is operating as expected. If throughput or IOPS is lower than the expected maximum listed in the disk type chart, then I/O size, queue length, or I/O latency performance issues might be present.
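
To check whether a disk can reach its documented limits independently of your application, you can run a synthetic benchmark such as fio. The following sketch assumes fio is installed and that /mnt/disks/test is a mounted directory on the disk under test; run benchmarks against a non-production disk, because the generated load can degrade performance for other workloads on the same disk:

    # Random-read IOPS test: small (4 KiB) I/Os with a deep queue.
    sudo fio --name=read_iops --directory=/mnt/disks/test --size=10G \
        --ioengine=libaio --direct=1 --bs=4K --iodepth=64 --rw=randread \
        --time_based --runtime=60s --ramp_time=2s

    # Sequential-read throughput test: large (1 MiB) I/Os.
    sudo fio --name=read_bw --directory=/mnt/disks/test --size=10G \
        --ioengine=libaio --direct=1 --bs=1M --iodepth=32 --rw=read \
        --time_based --runtime=60s --ramp_time=2s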

You can expect I/O size to be between 4 and 16 KiB for workloads that require high IOPS and low latency, and between 256 KiB and 1 MiB for workloads that involve sequential or large write sizes. I/O sizes outside of these ranges might indicate disk performance issues.

Queue length, also known as queue depth, is a factor of throughput and IOPS. When a disk performs well, its queue length should be about the same as the queue length recommended to achieve a particular throughput or IOPS level, listed in the Recommended I/O queue depth chart.

I/O latency is dependent on queue length and I/O size. If the queue length or I/O size for a disk is high, the latency will also be high.
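
To observe I/O size, queue length, and latency from inside a Linux guest, one option is iostat from the sysstat package; exact column names vary between sysstat versions:

    # Install sysstat (Debian/Ubuntu shown; use your distribution's package manager).
    sudo apt-get install -y sysstat

    # Report extended per-device statistics every 5 seconds.
    # Key columns: r/s and w/s (IOPS), rkB/s and wkB/s (throughput),
    # rareq-sz and wareq-sz (average I/O size), aqu-sz (average queue length),
    # r_await and w_await (average latency in milliseconds).
    iostat -dx 5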

If any storage performance metrics indicate disk performance issues, do one or more of the following:
