NVIDIA Data Center GPU Manager (DCGM)
The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM. The Ops Agent can be configured to collect one of two different sets of metrics by selecting the version of the dcgm receiver:
- Version 2 of the dcgm receiver provides a curated set of metrics for monitoring the performance and state of the GPUs attached to a given VM instance.
- Version 1 of the dcgm receiver provides a set of profiling metrics meant to be used in combination with the default GPU metrics. For information about the purpose and interpretation of these metrics, see Profiling Metrics in the DCGM feature overview.
For more information about the NVIDIA Data Center GPU Manager, see the DCGM documentation. This integration is compatible with DCGM version 3.1 through 3.3.9.
These metrics are available for Linux systems only. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.
Prerequisites
To collect NVIDIA DCGM metrics, you must do the following:
- Version 1 metrics: Ops Agent version 2.38.0 or higher. Only Ops Agent version 2.38.0 or versions 2.41.0 or higher are compatible with GPU monitoring. Do not install Ops Agent versions 2.39.0 and 2.40.0 on VMs with attached GPUs. For more information, see Agent crashes and report mentions NVIDIA. To check which agent version is installed on a VM, see the example after this list.
- Version 2 metrics: Ops Agent version 2.51.0 or higher.
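If you are not sure which Ops Agent version a VM is running, you can query the package manager. This is a minimal sketch that assumes a standard package-based installation of the google-cloud-ops-agent package:

```sh
# Debian/Ubuntu-based images:
dpkg-query --show --showformat '${Version}\n' google-cloud-ops-agent

# RHEL/CentOS/SLES-based images:
rpm -q google-cloud-ops-agent
```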
Install DCGM and verify installation
You must install DCGM version 3.1 through 3.3.9 and ensure that it runs as a privileged service. To install DCGM, see Installation in the DCGM documentation.
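For example, on a Debian-based VM that already has the NVIDIA GPU driver and NVIDIA's CUDA package repository configured, the installation typically looks like the following sketch. The package and service names are taken from the DCGM documentation, but verify them for your distribution and confirm that the installed version falls within the supported 3.1 through 3.3.9 range:

```sh
# Install the DCGM package from NVIDIA's repository
# (assumes the CUDA/NVIDIA apt repository is already configured).
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager

# Enable and start the DCGM host engine as a systemd service.
sudo systemctl enable --now nvidia-dcgm
```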
To verify that DCGM is running correctly, do the following:
Check the status of the DCGM service by running the following command:
sudo service nvidia-dcgm status
If the service is running, the nvidia-dcgm service is listed as active (running). The output resembles the following:

```
● nvidia-dcgm.service - NVIDIA DCGM service
   Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
   Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
 Main PID: 24388 (nv-hostengine)
    Tasks: 7 (limit: 14745)
   CGroup: /system.slice/nvidia-dcgm.service
           └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
```
Verify that the GPU devices are found by running the following command:
dcgmi discovery --list
If devices are found, the output resembles the following:
```
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                  |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:00:04.0                                         |
|        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
+--------+----------------------------------------------------------------------+
```
Configure the Ops Agent for DCGM
Following the guide for Configuring the Ops Agent, add the required elements to collect telemetry from your DCGM service, and restart the agent.
Example configuration
The following commands create the configuration to collect and ingest the receiver version 2 metrics for NVIDIA DCGM:
```sh
# Configures Ops Agent to collect telemetry from the app. You must restart the agent
# for the configuration to take effect.

set -e

# Check if the file exists
if [ ! -f /etc/google-cloud-ops-agent/config.yaml ]; then
  # Create the file if it doesn't exist.
  sudo mkdir -p /etc/google-cloud-ops-agent
  sudo touch /etc/google-cloud-ops-agent/config.yaml
fi

# Create a backup of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
```

If you want to collect only DCGM profiling metrics, then replace the value of the receiver_version field with 1. You can also remove the receiver_version entry entirely; the default version is 1. You can't use both versions at the same time.
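For reference, the resulting /etc/google-cloud-ops-agent/config.yaml for the version 1 (profiling-only) metrics would look like this sketch:

```yaml
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 1
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
```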
For these changes to take effect, you must restart the Ops Agent:
- To restart the agent, run the following command on your instance:
sudo systemctl restart google-cloud-ops-agent
- To confirm that the agent restarted, run the following command and verify that the components "Metrics Agent" and "Logging Agent" started:
sudo systemctl status "google-cloud-ops-agent*"
If you get an error message like "Unable to connect to DCGM daemon at localhost:5555 on libdcgm.so not Found; Is the DCGM daemon running?", then you have probably installed version 4.0 of the DCGM service. The DCGM shared library was renamed to libdcgm.so.4, which the Ops Agent DCGM receiver doesn't recognize. You must use DCGM version 3.1 through 3.3.9.
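To check which DCGM version is installed, you can query the host engine directly. For example:

```sh
# Reports the installed DCGM version; it must fall in the 3.1 through 3.3.9 range.
dcgmi --version
```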
If you restarted the agent by running sudo service google-cloud-ops-agent restart and the sub-agents did not restart correctly, then use the systemctl restart command shown in the configuration commands. For some Deep Learning VM Images, the Ops Agent doesn't restart all the dependencies when you restart with service restart. If you are using a custom service account instead of the default Compute Engine service account, or if you have a very old Compute Engine VM, then you might need to authorize the Ops Agent.
Configure metrics collection
To ingest metrics from NVIDIA DCGM, you must create a receiver for the metrics that NVIDIA DCGM produces and then create a pipeline for the new receiver.
This receiver does not support the use of multiple instances in the configuration, for example, to monitor multiple endpoints. All such instances write to the same time series, and Cloud Monitoring has no way to distinguish among them.
To configure a receiver for your dcgm metrics, specify the following fields:
| Field | Default | Description |
|---|---|---|
| collection_interval | 60s | A time duration, such as 30s or 5m. |
| endpoint | localhost:5555 | Address of the DCGM service, formatted as host:port. |
| receiver_version | 1 | Either 1 or 2. Version 2 has many more metrics available. |
| type | | This value must be dcgm. |
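For example, a dcgm receiver that sets every available field explicitly might look like the following sketch. The values shown here are the defaults, except for receiver_version, which is set to 2:

```yaml
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
      collection_interval: 60s
      endpoint: localhost:5555
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
```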
What is monitored
The following tables provide lists of metrics that the Ops Agent collects from the NVIDIA DCGM instance. Not all metrics are available for all GPU models. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.
Version 1 metrics
The following metrics are collected by using version 1 of the dcgm receiver.
| Metric type | Kind, Type, Monitored resources | Labels |
|---|---|---|
| workload.googleapis.com/dcgm.gpu.profiling.dram_utilization † | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate † | GAUGE, INT64, gce_instance | direction, gpu_number, model, uuid |
| workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate † | GAUGE, INT64, gce_instance | direction, gpu_number, model, uuid |
| workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization † | GAUGE, DOUBLE, gce_instance | gpu_number, model, pipe ‡, uuid |
| workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy † | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/dcgm.gpu.profiling.sm_utilization † | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
† Not available on GPU models P100 and P4.
‡ For L4, the pipe value fp64 is not supported.
Version 2 metrics
The following metrics are collected by using version 2 of the dcgm receiver.
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
| Metric type | Kind, Type, Monitored resources | Labels |
|---|---|---|
| workload.googleapis.com/gpu.dcgm.clock.frequency | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.clock.throttle_duration.time | CUMULATIVE, DOUBLE, gce_instance | gpu_number, model, uuid, violation † |
| workload.googleapis.com/gpu.dcgm.codec.decoder.utilization | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.codec.encoder.utilization | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.ecc_errors | CUMULATIVE, INT64, gce_instance | error_type, gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.energy_consumption | CUMULATIVE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.memory.bandwidth_utilization | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.memory.bytes_used | GAUGE, INT64, gce_instance | gpu_number, model, state, uuid |
| workload.googleapis.com/gpu.dcgm.nvlink.io ‡ | CUMULATIVE, INT64, gce_instance | direction, gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.pcie.io ‡ | CUMULATIVE, INT64, gce_instance | direction, gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.pipe.utilization ‡ | GAUGE, DOUBLE, gce_instance | gpu_number, model, pipe §, uuid |
| workload.googleapis.com/gpu.dcgm.sm.utilization ‡ | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.temperature | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
| workload.googleapis.com/gpu.dcgm.utilization | GAUGE, DOUBLE, gce_instance | gpu_number, model, uuid |
† For P100 and P4, only violation values power, thermal, and sync_boost are supported.
‡ Not available on GPU models P100 and P4.
§ For L4, the pipe value fp64 is not supported.
GPU metrics
In addition, the built-in configuration for the Ops Agent also collects agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You do not need any additional configuration in the Ops Agent to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see About the gpu metrics. The dcgm receiver version 1 metrics are designed to complement these default metrics, while dcgm receiver version 2 metrics are intended to be standalone.
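For example, to check one of these default metrics, you could run a PromQL query such as the following in Metrics Explorer (see Verify the configuration below for how to run queries); agent.googleapis.com/gpu/utilization is used here as an illustrative metric type from the gpu group:

```promql
{"agent.googleapis.com/gpu/utilization", monitored_resource="gce_instance"}
```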
Verify the configuration
This section describes how to verify that you correctly configured the NVIDIA DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.
To verify that NVIDIA DCGM metrics are being sent toCloud Monitoring, do the following:
In the Google Cloud console, go to the Metrics explorer page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
- Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.
- For v1 metrics, enter the following query in the editor, and then click Run query:
{"workload.googleapis.com/dcgm.gpu.profiling.sm_utilization", monitored_resource="gce_instance"}
- For v2 metrics, enter the following query in the editor, and then click Run query:
{"workload.googleapis.com/gpu.dcgm.sm.utilization", monitored_resource="gce_instance"}
View dashboard
To view your NVIDIA DCGM metrics, you must have a chart or dashboard configured. The NVIDIA DCGM integration includes one or more dashboards for you. Any dashboards are automatically installed after you configure the integration and the Ops Agent has begun collecting metric data.
You can also view static previews of dashboards without installing the integration.
To view an installed dashboard, do the following:
In the Google Cloud console, go to the Dashboards page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- Select the Dashboard List tab, and then choose the Integrations category.
- Click the name of the dashboard you want to view.
If you have configured an integration but the dashboard has not been installed, then check that the Ops Agent is running. When there is no metric data for a chart in the dashboard, installation of the dashboard fails. After the Ops Agent begins collecting metrics, the dashboard is installed for you.
To view a static preview of the dashboard, do the following:
In the Google Cloud console, go to the Integrations page:
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- Click the Compute Engine deployment-platform filter.
- Locate the entry for NVIDIA DCGM and click View Details.
- Select the Dashboards tab to see a static preview. If the dashboard is installed, then you can navigate to it by clicking View dashboard.
For more information about dashboards in Cloud Monitoring, see Dashboards and charts.
For more information about using the Integrations page, see Manage integrations.
DCGM limitations, and pausing profiling
Concurrent usage of DCGM can conflict with usage of some other NVIDIA developer tools, such as Nsight Systems or Nsight Compute. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see Profiling Sampling Rate in the DCGM feature overview.
When you need to use tools like Nsight Systems without significant disruption, you can temporarily pause or resume the metrics collection by using the following commands:
```
dcgmi profile --pause
dcgmi profile --resume
```
When profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.
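For example, a profiling session might wrap the Nsight Systems run between a pause and a resume, as in the following sketch; the nsys command, output name, and application are placeholders for whatever you actually profile:

```sh
# Pause DCGM metric collection so it doesn't conflict with Nsight Systems.
dcgmi profile --pause

# Run the profiling session (placeholder command and application).
nsys profile -o my_report ./my_gpu_app

# Resume DCGM metric collection when profiling is done.
dcgmi profile --resume
```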
What's next
For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the Install the Ops Agent to troubleshoot third-party applications video.