GPU Debugging
General Debugging Options
The DebugFS section documents a number of files that aid in debugging issues on the GPU.
GPUVM Debugging
To aid in debugging GPU virtual memory related problems, the driver supports a number of module parameters:
vm_fault_stop - If non-0, halt the GPU memory controller on a GPU page fault.
vm_update_mode - If non-0, use the CPU to update GPU page tables rather than the GPU.
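To confirm which values are in effect on a running system, the parameters can usually be read back from sysfs. The following is a minimal sketch, assuming the driver is loaded as amdgpu and that sysfs is mounted at /sys; the helper name read_amdgpu_params is hypothetical and not part of the driver.

```python
from pathlib import Path

# Parameters discussed above. The sysfs location is an assumption about
# the running system (module loaded as "amdgpu", sysfs mounted at /sys).
PARAMS = ("vm_fault_stop", "vm_update_mode")


def read_amdgpu_params(base="/sys/module/amdgpu/parameters"):
    """Return {name: value-as-string or None} for each GPUVM debug parameter."""
    values = {}
    for name in PARAMS:
        path = Path(base) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Module not loaded, parameter not exported, or no permission.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_amdgpu_params().items():
        shown = value if value is not None else "(unavailable)"
        print(f"{name} = {shown}")
```

Module parameters are normally set when the module is loaded, for example with the generic module.parameter=value syntax on the kernel command line.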
Decoding a GPUVM Page Fault
If you see a GPU page fault in the kernel log, you can decode it to figure out what is going wrong in your application. A page fault in your kernel log may look something like this:
    [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
     in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
    VM_L2_PROTECTION_FAULT_STATUS:0x00301030
         Faulty UTCL2 client ID: TCP (0x8)
         MORE_FAULTS: 0x0
         WALKER_ERROR: 0x0
         PERMISSION_FAULTS: 0x3
         MAPPING_ERROR: 0x0
         RW: 0x0
First you have the memory hub, gfxhub and mmhub. gfxhub is the memory hub used for graphics, compute, and sdma on some chips. mmhub is the memory hub used for multi-media and sdma on some chips.
Next you have the vmid and pasid. If the vmid is 0, this fault was likely caused by the kernel driver or firmware. If the vmid is non-0, it is generally a fault in a user application. The pasid is used to link a vmid to a system process id. If the process is active when the fault happens, the process information will be printed.
The GPU virtual address that caused the fault comes next.
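The header fields described so far can be pulled out of a log line mechanically. The sketch below is illustrative only: the regular expressions are written against the single example message shown above, and the exact message format may differ between kernel versions and chips. The function name parse_fault_header is hypothetical.

```python
import re

# Patterns derived from the one example message in this document; treat
# them as assumptions about the log format, not a stable interface.
HEADER_RE = re.compile(
    r"\[(?P<hub>\w+)\].*?page fault \(src_id:(?P<src_id>\d+) "
    r"ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
)
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def parse_fault_header(text):
    """Extract hub, vmid, pasid and faulting address, or return None."""
    header = HEADER_RE.search(text)
    addr = ADDR_RE.search(text)
    if header is None or addr is None:
        return None
    vmid = int(header.group("vmid"))
    return {
        "hub": header.group("hub"),
        "vmid": vmid,
        "pasid": int(header.group("pasid")),
        "address": int(addr.group("addr"), 16),
        # vmid 0 likely points at the kernel driver or firmware;
        # a non-zero vmid generally points at a user application.
        "likely_origin": "kernel driver or firmware" if vmid == 0 else "user application",
    }


if __name__ == "__main__":
    sample = (
        "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, "
        "for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)\n"
        " in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)"
    )
    print(parse_fault_header(sample))
```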
The client ID indicates the GPU block that caused the fault. Some common client IDs:
CB/DB: The color/depth backend of the graphics pipe
CPF: Command Processor Frontend
CPC: Command Processor Compute
CPG: Command Processor Graphics
TCP/SQC/SQG: Shaders
SDMA: SDMA engines
VCN: Video encode/decode engines
JPEG: JPEG engines
PERMISSION_FAULTS describes which faults were encountered:
bit 0: the PTE was not valid
bit 1: the PTE read bit was not set
bit 2: the PTE write bit was not set
bit 3: the PTE execute bit was not set
Finally, RW indicates whether the access was a read (0) or a write (1).
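The bit meanings above translate directly into a small decoder. This is a minimal sketch based only on the bit assignments listed in this section; the function names are hypothetical.

```python
# Bit positions and meanings as listed for PERMISSION_FAULTS above.
PERMISSION_FAULT_BITS = {
    0: "PTE was not valid",
    1: "PTE read bit was not set",
    2: "PTE write bit was not set",
    3: "PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return the description of every fault bit set in value."""
    return [
        description
        for bit, description in PERMISSION_FAULT_BITS.items()
        if value & (1 << bit)
    ]


def decode_rw(value):
    """RW is 0 for a read access and 1 for a write access."""
    return "write" if value & 1 else "read"


if __name__ == "__main__":
    for reason in decode_permission_faults(0x3):
        print(reason)
    print("access type:", decode_rw(0x0))
```

For PERMISSION_FAULTS = 0x3, bits 0 and 1 are set, so the decoder reports that the PTE was not valid and that its read bit was not set.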
In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address 0x0000800102800000. The user can then inspect their shader code and resource descriptor state to determine what caused the GPU page fault.
UMR
umr is a general purpose GPU debugging and diagnostics tool. Please see the umr documentation for more information about its capabilities.
Debugging backlight brightness
Default backlight brightness is intended to be set via the policy advertised by the firmware. Firmware will often provide different defaults for AC or DC. Furthermore, some userspace software will save backlight brightness during the previous boot and attempt to restore it.
Some firmware also has support for a feature called “Custom Backlight Curves” where an input value for brightness is mapped along a linearly interpolated curve of brightness values that better match display characteristics.
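To illustrate what mapping along a linearly interpolated curve means, here is a generic sketch of piecewise-linear interpolation. It is not the driver's implementation: the curve points, value ranges, and the name map_brightness are all invented for illustration, and real firmware tables will differ.

```python
from bisect import bisect_right

# Hypothetical curve: (input brightness, output brightness) pairs,
# sorted by input. These values are made up for this example.
CURVE = [(0, 0), (64, 20), (128, 70), (255, 255)]


def map_brightness(value, curve=CURVE):
    """Map an input brightness onto the curve by linear interpolation."""
    inputs = [x for x, _ in curve]
    # Clamp requests that fall outside the range covered by the curve.
    if value <= inputs[0]:
        return curve[0][1]
    if value >= inputs[-1]:
        return curve[-1][1]
    # Find the segment [x0, x1] that contains the requested value.
    i = bisect_right(inputs, value)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)


if __name__ == "__main__":
    for requested in (0, 32, 96, 200, 255):
        print(requested, "->", map_brightness(requested))
```

With a curve like this, a request halfway between two points lands halfway between their outputs, which is why a logged brightness request may not match the value that finally reaches the panel.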
If problems occur with the backlight, a trace event can be enabled at boot to log every brightness change request. This can help isolate where the problem is. To enable the trace event, add the following to the kernel command line:
tp_printk trace_event=amdgpu_dm:amdgpu_dm_brightness:mod:amdgpu trace_buf_size=1M
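After rebooting, it can be useful to confirm that the options actually reached the kernel. The sketch below assumes the booted command line is readable from /proc/cmdline and only looks for the exact options shown above, accepting any value for trace_buf_size; the helper name missing_trace_options is hypothetical.

```python
# Options taken from the command line shown above.
EXACT_OPTIONS = (
    "tp_printk",
    "trace_event=amdgpu_dm:amdgpu_dm_brightness:mod:amdgpu",
)
PREFIX_OPTIONS = ("trace_buf_size=",)


def missing_trace_options(cmdline):
    """Return the options from the example that are absent from cmdline."""
    tokens = cmdline.split()
    missing = [opt for opt in EXACT_OPTIONS if opt not in tokens]
    for prefix in PREFIX_OPTIONS:
        if not any(token.startswith(prefix) for token in tokens):
            missing.append(prefix + "<size>")
    return missing


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""
    missing = missing_trace_options(current)
    if missing:
        print("missing options:", " ".join(missing))
    else:
        print("brightness trace options are present")
```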