Performance Analysis #

NVIDIA Nsight Systems reports at the application level are highly informative. Metric sampling capabilities have increased over generations and provide a clean middle-ground between timing analysis and kernel-level deep dives with NVIDIA Nsight Compute.

Given the potential long runtimes of Large Languages Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, NVIDIA has added features to TensorRT LLM to get the most out of Nsight Systems capabilities. This document outlines those features as well as provides examples of how to best utilize them to understand your application.

Feature Descriptions#

The main functionality:

Relies on toggling the CUDA profiler runtime API on and off.
(PyTorch workflow only) Toggling the PyTorch profiler on and off.
Provides a means to understand which regions a user may want to focus on.

Toggling the CUDA profiler runtime API on and off:

Allows users to know specifically what the profiled region corresponds to.
Results in smaller files to post-process (for metric extraction or similar).

(PyTorch workflow only) Toggling the PyTorch profiler on and off:

Help users to analysis the performance breakdown in the model.
Results in smaller files to post-process (for metric extraction or similar).

Coordinating with NVIDIA Nsight Systems Launch#

Consult the Nsight Systems User Guide for full overview of options.

On the PyTorch workflow, basic NVTX markers are by default provided. On the C++/TensorRT workflow, append--nvtx when callingscripts/build_wheel.py script to compile, and clean build the code.

Only collect specific iterations#

To reduce the Nsight Systems profile size, and ensure that only specific iterations are collected, set environment variableTLLM_PROFILE_START_STOP=A-B, and append-ccudaProfilerApi tonsysprofile command.

Enable more NVTX markers for debugging#

Set environment variableTLLM_NVTX_DEBUG=1.

Enable garbage collection (GC) NVTX markers#

Set environment variableTLLM_PROFILE_RECORD_GC=1.

Enable GIL information in NVTX markers#

Append “python-gil” to Nsys “-t” option.

Coordinating with PyTorch profiler (PyTorch workflow only)#

Collect PyTorch profiler results#

Set environment variableTLLM_PROFILE_START_STOP=A-B to specify the range of the iterations to be collected.
Set environment variableTLLM_TORCH_PROFILE_TRACE=<path>, and the results will be saved to<path>.

Visualize the PyTorch profiler results#

Usechrome://tracing/ to inspect the saved profile.

Examples#

Consult the Nsight Systems User Guide for full overview of MPI-related options.

Profiling specific iterations on a`trtllm-bench`/`trtllm-serve` run#

Say we want to profile iterations 100 to 150 on atrtllm-bench/trtllm-serve run, we want to collect as much information as possible for debugging, such as GIL, debugging NVTX markers, etc:

#!/bin/bash# Prepare dataset for the benchmarkpython3benchmarks/cpp/prepare_dataset.py\--tokenizer=${MODEL_PATH}\--stdouttoken-norm-dist--num-requests=${NUM_SAMPLES}\--input-mean=1000--output-mean=1000--input-stdev=0--output-stdev=0>/tmp/dataset.txt# Benchmark and profileTLLM_PROFILE_START_STOP=100-150nsysprofile\-otrace-ftrue\-t'cuda,nvtx,python-gil'-ccudaProfilerApi\--cuda-graph-tracenode\-eTLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json\--trace-fork-before-exec=true\trtllm-bench\# or trtllm-serve command--modeldeepseek-ai/DeepSeek-V3\--model_path${MODEL_PATH}\throughput\--dataset/tmp/dataset.txt--warmup0\--backendpytorch\--streaming

The Nsight Systems reports will be saved totrace.nsys-rep. Use NVIDIA Nsight Systems application to open it.

The PyTorch profiler results will be saved totrace.json. Usechrome://tracing/ to inspect the saved profile.

On this page

Movatterモバイル変換

Performance Analysis#