Monitoring GPU performance on Linux VMs

Tip: If you want to monitor A4 or A3 Ultra machine types that are deployed using the features provided by AI Hypercomputer, see Monitor Compute Engine instances and Slurm clusters in the AI Hypercomputer documentation instead.

You can track metrics such as GPU utilization and GPU memory from your virtual machine (VM) instances by using the Ops Agent, which is Google's recommended telemetry collection solution for Compute Engine. By using the Ops Agent, you can manage your GPU VMs as follows:

  • Visualize the health of your NVIDIA GPU fleet with our pre-configured dashboards.
  • Optimize costs by identifying underutilized GPUs and consolidating workloads.
  • Plan scaling by looking at trends to decide when to expand GPU capacity or upgrade existing GPUs.
  • Use NVIDIA Data Center GPU Manager (DCGM) profiling metrics to identify bottlenecks and performance issues within your GPUs.
  • Set up managed instance groups (MIGs) to autoscale resources.
  • Get alerts on metrics from your NVIDIA GPUs.

This document covers the procedures for monitoring GPUs on Linux VMs by using the Ops Agent. Alternatively, a reporting script available on GitHub can also be set up to monitor GPU usage on Linux VMs; see compute-gpu-monitoring monitoring script. This script is not actively maintained.

For monitoring GPUs on Windows VMs, see Monitoring GPU performance (Windows).

Overview

The Ops Agent, version 2.38.0 or later, can automatically track GPU utilization and GPU memory usage rates on your Linux VMs that have the agent installed. These metrics, obtained from the NVIDIA Management Library (NVML), are tracked per GPU and per process for any process that uses GPUs. To view the metrics that are monitored by the Ops Agent, see Agent metrics: gpu.
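The version requirement above can be checked on a VM before relying on the GPU metrics. The following is a minimal sketch, not an official procedure. It assumes the agent was installed from the `google-cloud-ops-agent` package through dpkg or rpm, and that any distribution suffix in the version string starts with `~` or `-`. The function names `version_at_least` and `installed_ops_agent_version` are hypothetical helpers introduced only for illustration.

```shell
# Minimum Ops Agent version that tracks NVML GPU metrics (from the text above).
MIN_OPS_AGENT_VERSION="2.38.0"

# Return 0 if version $1 is greater than or equal to version $2.
# Assumption: suffixes such as "~debian11" or "-1.el8" can be stripped
# before comparing, so only the dotted version number is compared.
version_at_least() {
  local have="${1%%[~-]*}"
  local want="${2%%[~-]*}"
  # With a version sort, the wanted version sorts first when have >= want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Print the installed Ops Agent version, or return 1 if it cannot be found.
# Assumption: the agent was installed from the google-cloud-ops-agent package.
installed_ops_agent_version() {
  local v
  if command -v dpkg-query >/dev/null 2>&1; then
    v="$(dpkg-query --show --showformat '${Version}' google-cloud-ops-agent 2>/dev/null)" || return 1
  elif command -v rpm >/dev/null 2>&1; then
    v="$(rpm --query --queryformat '%{VERSION}' google-cloud-ops-agent 2>/dev/null)" || return 1
  else
    return 1
  fi
  printf '%s\n' "$v"
}
```

A possible use is `v="$(installed_ops_agent_version)" && version_at_least "$v" "$MIN_OPS_AGENT_VERSION" && echo "GPU metrics supported"`. Stripping the suffix before comparing is a design choice: a version sort orders `~` before the end of the string, so an unstripped `2.38.0~debian11` would compare as older than `2.38.0`.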

You can also set up the NVIDIA Data Center GPU Manager (DCGM) integration with the Ops Agent. This integration allows the Ops Agent to track metrics using the hardware counters on the GPU. DCGM provides access to the GPU device-level metrics. These include Streaming Multiprocessor (SM) block utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate. To view the metrics monitored by the Ops Agent, see Third-party application metrics: NVIDIA Data Center GPU Manager (DCGM).

To review GPU metrics by using the Ops Agent, complete the following steps:

  1. On each VM, check that you have met the requirements.
  2. On each VM, install the Ops Agent.
  3. Optional: On each VM, set up the NVIDIA Data Center GPU Manager (DCGM) integration.
  4. Review metrics in Cloud Monitoring.

Limitations

  • The Ops Agent doesn't track GPU utilization on VMs that use Container-Optimized OS.
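To find out whether a given VM falls under this limitation, the operating system can be identified from its os-release file. The sketch below assumes that Container-Optimized OS reports `ID=cos` in `/etc/os-release`; the function name `is_container_optimized_os` and its optional path argument are hypothetical and exist only to make the check easy to try against a sample file.

```shell
# Return 0 if the os-release file identifies Container-Optimized OS.
# Assumption: Container-Optimized OS sets ID=cos in /etc/os-release.
# The optional first argument overrides the file path (useful for testing).
is_container_optimized_os() {
  local os_release="${1:-/etc/os-release}"
  [ -r "$os_release" ] && grep -Eq '^ID="?cos"?$' "$os_release"
}
```

For example, `is_container_optimized_os && echo "GPU utilization is not tracked on this VM"` prints the message only on a matching image.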

Requirements

On each of your VMs, check that you meet the following requirements:

Install the Ops Agent

To install the Ops Agent, complete the following steps:

  1. If you were previously using the compute-gpu-monitoring monitoring script to track GPU utilization, disable the service before installing the Ops Agent. To disable the monitoring script, run the following command:

    sudo systemctl --no-reload --now disable google_gpu_monitoring_agent

  2. Install the latest version of the Ops Agent. For detailed instructions, see Installing the Ops Agent.

  3. After you have installed the Ops Agent, if you need to install or upgrade your GPU drivers by using the installation scripts provided by Compute Engine, review the limitations section.
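Step 1 above only applies to VMs where the legacy monitoring script is present. As a hedged sketch, the disable command can be wrapped so that it runs only when systemd reports the unit as enabled or active. The wrapper name `disable_legacy_gpu_agent` is hypothetical; the unit name and the disable command are the ones shown in step 1.

```shell
# Unit name used by the legacy compute-gpu-monitoring script (from step 1).
LEGACY_UNIT="google_gpu_monitoring_agent"

# Disable the legacy monitoring service only if it is enabled or running.
disable_legacy_gpu_agent() {
  if systemctl is-enabled --quiet "$LEGACY_UNIT" 2>/dev/null ||
     systemctl is-active --quiet "$LEGACY_UNIT" 2>/dev/null; then
    # Same command as in step 1: stop the service now and disable it at boot.
    sudo systemctl --no-reload --now disable "$LEGACY_UNIT"
  else
    echo "$LEGACY_UNIT is not enabled or running; nothing to disable"
  fi
}
```

Calling `disable_legacy_gpu_agent` before installing the Ops Agent makes the step safe to repeat across a fleet where only some VMs ever ran the script.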

Review NVML metrics in Compute Engine

You can review the NVML metrics that the Ops Agent collects from the Observability tabs for Compute Engine Linux VM instances.

To view the metrics for a single VM, do the following:

  1. In the Google Cloud console, go to the VM instances page.

  2. Select a VM to open the Details page.

  3. Click the Observability tab to display information about the VM.

  4. Select the GPU quick filter.

To view the metrics for multiple VMs, do the following:

  1. In the Google Cloud console, go to the VM instances page.

  2. Click the Observability tab.

  3. Select the GPU quick filter.
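When the charts look unexpected, it can help to compare them with a local reading on the VM. The sketch below is an assumption-laden cross-check, not part of the Ops Agent: it relies on the `nvidia-smi` tool that ships with the NVIDIA driver and on its CSV query output. The function name `summarize_gpu_csv` is hypothetical. `nvidia-smi` reports an instantaneous sample, so its values are not expected to match the charted metrics exactly.

```shell
# Summarize CSV rows of "index, utilization.gpu, memory.used, memory.total"
# read from stdin. Rows without a positive numeric memory total are skipped.
summarize_gpu_csv() {
  awk -F', *' '
    NF >= 4 && ($4 + 0) > 0 {
      printf "gpu=%s util=%s%% mem_used=%.1f%%\n", $1, $2, 100 * $3 / $4
    }'
}

# Example invocation on a GPU VM (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#     --format=csv,noheader,nounits | summarize_gpu_csv
```

Keeping the parsing in a separate function that reads standard input means it can be tried on captured output from any VM without needing a GPU locally.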

Optional: Set up NVIDIA Data Center GPU Manager (DCGM) integration

The Ops Agent also provides integration for NVIDIA Data Center GPU Manager (DCGM) to collect key advanced GPU metrics such as Streaming Multiprocessor (SM) block utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate.

These advanced GPU metrics are not collected from NVIDIA P100 and P4 models.
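Before setting up the integration, it may be worth checking whether a VM's GPUs are among the two models named above. The following sketch covers only those two models and makes no claim about any other GPU. It assumes that the names reported by `nvidia-smi --query-gpu=name --format=csv,noheader` contain `P100`, or end with `P4`, for the affected models. Both function names are hypothetical.

```shell
# Return 0 if the GPU name matches one of the two models named above.
# Assumption: P100 names contain "P100"; P4 names end with "P4" or contain
# "P4 " so that other models whose names merely start with P4 do not match.
is_excluded_gpu_model() {
  case "$1" in
    *P100*) return 0 ;;
    *P4|*"P4 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read GPU names (one per line) from stdin and label each one.
label_gpu_models() {
  local name
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    if is_excluded_gpu_model "$name"; then
      printf '%s: advanced DCGM metrics not collected\n' "$name"
    else
      printf '%s: not in the excluded list\n' "$name"
    fi
  done
}

# Example invocation on a GPU VM (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=name --format=csv,noheader | label_gpu_models
```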

For detailed instructions on how to set up and use this integration on each VM, see NVIDIA Data Center GPU Manager (DCGM).

Review DCGM metrics in Cloud Monitoring

  1. In the Google Cloud console, go to the Monitoring > Dashboards page.

  2. Select the Sample Library tab.

  3. In the Filter field, type NVIDIA. The NVIDIA GPU Monitoring Overview (GCE and GKE) dashboard displays.

    If you have set up the NVIDIA Data Center GPU Manager (DCGM) integration, the NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only) dashboard also displays.

    [Figure: Cloud Monitoring dashboards]

  4. For the required dashboard, click Preview. The Sample dashboard preview page displays.

  5. From the Sample dashboard preview page, click Import sample dashboard.

    • The NVIDIA GPU Monitoring Overview (GCE and GKE) dashboard displays the GPU metrics such as GPU utilization, NIC traffic rate, and GPU memory usage.

      Your GPU utilization display is similar to the following output:

      [Figure: Cloud Monitoring (NVML)]

    • The NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only) dashboard displays key advanced metrics such as SM utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate.

      Your Advanced DCGM Metric display is similar to the following output:

      [Figure: Cloud Monitoring (DCGM)]

What's next?

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.