Display Core Debug tools¶
In this section, you will find helpful information on debugging the amdgpu driver from the display perspective. This page introduces debug mechanisms and procedures to help you identify if some issues are related to display code.
Narrow down display issues¶
Since the display is the driver’s visual component, it is common to see users reporting issues as display issues when another component causes the problem. This section equips users to determine if a specific issue was caused by the display component or another part of the driver.
DC dmesg important messages¶
The dmesg log is the first source of information to be checked, and amdgpu takes advantage of this feature by logging some valuable information. When looking for issues associated with amdgpu, remember that each component of the driver (e.g., smu, PSP, dm, etc.) is loaded one by one, and this information can be found in the dmesg log. In this sense, look for the part of the log that looks like the snippet below:
[ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
[ 4.254718] [drm] register mmio base: 0xFCB00000
[ 4.254918] [drm] register mmio size: 1048576
[ 4.260095] [drm] add ip block number 0 <soc21_common>
[ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
[ 4.260510] [drm] add ip block number 2 <ih_v6_0>
[ 4.260696] [drm] add ip block number 3 <psp>
[ 4.260878] [drm] add ip block number 4 <smu>
[ 4.261057] [drm] add ip block number 5 <dm>
[ 4.261231] [drm] add ip block number 6 <gfx_v11_0>
[ 4.261402] [drm] add ip block number 7 <sdma_v6_0>
[ 4.261568] [drm] add ip block number 8 <vcn_v4_0>
[ 4.261729] [drm] add ip block number 9 <jpeg_v4_0>
[ 4.261887] [drm] add ip block number 10 <mes_v11_0>
From the above example, you can see the line that reports that <dm> (Display Manager) was loaded, which means that display can be part of the issue. If you do not see that line, something else might have failed before amdgpu loads the display component, indicating that we don’t have a display issue.
After you have identified that the DM was loaded correctly, you can check for the display version of the hardware in use, which can be retrieved from the dmesg log with the command:
dmesg | grep -i 'display core'
This command shows a message that looks like this:
[ 4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2
This message has two key pieces of information:
The DC version (e.g., v3.2.285): Display developers release a new DC version every week, and this information can be advantageous in a situation where a user/developer must find a good point versus a bad point based on a tested version of the display code. Remember from the Display Core page that every week the new patches for display are heavily tested with IGT and manual tests.
The DCN version (e.g., DCN 3.2): The DCN block is associated with the hardware generation, and the DCN version conveys the hardware generation that the driver is currently running. This information helps to narrow down the code debug area since each DCN version has its files in the DC folder per DCN component (in this example, the developer might want to focus on files/folders/functions/structs with the dcn32 label). However, keep in mind that DC reuses code across different DCN versions; for example, it is expected to have some callbacks set in one DCN that are the same as those from another DCN. In summary, use the DCN version just as a guide.
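When gathering these two fields for a bug report, a small filter can pull them out of the log line. The sketch below assumes the message keeps exactly the format shown above (and that DCE hardware prints the same message with a "DCE" label, which is an assumption); the function name parse_dc_version is hypothetical and not part of any amdgpu tooling.

```shell
# Hypothetical helper: read dmesg-style text on stdin and print
# "<DC version> <DCE/DCN version>" for each "Display Core" init line.
# Assumes the message format shown on this page; lines that do not
# match are silently skipped.
parse_dc_version() {
    sed -n 's/.*Display Core v\([0-9.]*\) initialized on \(DC[EN] [0-9.]*\).*/\1 \2/p'
}

# Example with the sample line from this page:
echo '[ 4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2' | parse_dc_version
# prints: 3.2.285 DCN 3.2
```

On a live system, the same function could be fed directly with `dmesg | parse_dc_version`.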
From the dmesg file, it is also possible to get the ATOM bios code by using:
dmesg | grep -i 'ATOM BIOS'
Which generates an output that looks like this:
[ 4.274534] amdgpu: ATOM BIOS: 113-D7020100-102
This type of information is useful to include when reporting a bug.
Avoid loading display core¶
Sometimes, it might be hard to figure out which part of the driver is causing the issue; if you suspect that the display is not part of the problem and your bug scenario is simple (e.g., some desktop configuration) you can try to remove the display component from the equation. First, you need to identify the dm ID from the dmesg log; for example, search for the following log:
[ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
[..]
[ 4.260095] [drm] add ip block number 0 <soc21_common>
[ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
[..]
[ 4.261057] [drm] add ip block number 5 <dm>
Notice from the above example that the dm ID is 5 for this specific hardware. Next, you need to run the following binary operation to identify the IP block mask:
0xffffffff & ~(1 << [DM ID])
From our example the IP mask is:
0xffffffff & ~(1 << 5) = 0xffffffdf
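The two steps above (finding the dm ID and clearing its bit) can be sketched in shell arithmetic. This is only an illustration: the function names dm_block_id and ip_block_mask are hypothetical, and the parsing assumes the "add ip block number N <dm>" wording shown in the sample log, which may differ on other kernel versions.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the IP block number of <dm> from dmesg-style text on stdin.
# Assumes the "add ip block number N <dm>" format shown on this page.
dm_block_id() {
    sed -n 's/.*add ip block number \([0-9][0-9]*\) <dm>.*/\1/p'
}

# Print the mask that keeps every IP block enabled except the one
# whose block number is given as the first argument.
ip_block_mask() {
    printf '0x%08x\n' "$(( 0xffffffff & ~(1 << $1) ))"
}

id=$(echo '[ 4.261057] [drm] add ip block number 5 <dm>' | dm_block_id)
ip_block_mask "$id"    # prints 0xffffffdf
```

On a live system, `dmesg | dm_block_id` would replace the echoed sample line.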
Finally, to disable DC, you just need to set the below parameter in your bootloader (written without spaces around the equals sign, since the kernel command line is split on spaces):
amdgpu.ip_block_mask=0xffffffdf
If you can boot your system with the DC disabled and still see the issue, it means you can rule DC out of the equation. However, if the bug disappears, you still need to consider the DC part of the problem and keep narrowing down the issue. In some scenarios, disabling DC is impossible since it might be necessary to use the display component to reproduce the issue (e.g., play a game).
Note: This will probably lead to the absence of a display output.
Display flickering¶
Display flickering might have multiple causes; one is the lack of proper power to the GPU or problems in the DPM switches. A good first generic verification is to force the GPU to the high performance level:
bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
The above command sets the GPU/APU to use the maximum power allowed, which disables DPM switches. If forcing DPM levels high does not fix the issue, it is less likely that the issue is related to power management. If the issue disappears, there is a good chance that other components might be involved, and the display should not be ignored since this could be a DPM issue. From the display side, if the power increase fixes the issue, it is worth debugging the clock configuration and the pipe split policy used in the specific configuration.
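Because the forced level stays in place until it is changed again, it can be convenient to restore the previous value once the test is done. The following is a minimal sketch, assuming the sysfs file can be read back to obtain the current level; the wrapper name with_dpm_high is hypothetical, and the path is passed as an argument so that the card index is not hard-coded.

```shell
# Hypothetical wrapper: run a command with the DPM level forced to
# "high", then write back whatever level was set before.
# $1 is the path to power_dpm_force_performance_level; the remaining
# arguments are the command to run. Needs root on a real system.
with_dpm_high() {
    level_file=$1
    shift
    previous=$(cat "$level_file") || return 1
    echo high > "$level_file" || return 1
    "$@"
    status=$?
    echo "$previous" > "$level_file"
    return "$status"
}

# Usage (assumes card0 is the amdgpu device):
# with_dpm_high /sys/class/drm/card0/device/power_dpm_force_performance_level sleep 60
```

The command's exit status is preserved, so the wrapper can be used inside a larger reproduction script.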
Display artifacts¶
Users may see some screen artifacts that can be categorized into two different types: localized artifacts and general artifacts. The localized artifacts happen in some specific areas, such as around the UI window corners; if you see this type of issue, there is a considerable chance that you have a userspace problem, likely Mesa or similar. The general artifacts usually happen on the entire screen. They might be caused by a misconfiguration at the driver level of the display parameters, but the userspace might also cause this issue. One way to identify the source of the problem is to take a screenshot or make a desktop video capture when the problem happens; after checking the screenshot/video recording, if you don’t see any of the artifacts, it means that the issue is likely on the driver side. If you can still see the problem in the data collected, it is an issue that probably happened during rendering, and the display code just got the framebuffer already corrupted.
Disabling/Enabling specific features¶
DC has a struct named dc_debug_options, which is statically initialized by all DCE/DCN components based on the specific hardware characteristic. This structure usually facilitates the bring-up phase since developers can start with many disabled features and enable them individually. This is also an important debug feature since users can change it when debugging specific issues.
For example, dGPU users sometimes see a problem where a horizontal band of flickering happens in some specific part of the screen. This could be an indication of Sub-Viewport issues; after the users have identified the target DCN, they can set the force_disable_subvp field to true in the statically initialized version of dc_debug_options to see if the issue gets fixed. Along the same lines, users/developers can also try to turn off fams2_config and enable_single_display_2to1_odm_policy. In summary, dc_debug_options is a useful tool for identifying the problem.
DC Visual Confirmation¶
Display core provides a feature named visual confirmation, which is a set of bars added at the scanout time by the driver to convey some specific information. In general, you can enable this debug option by using:
echo <N> > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
Where N is an integer that selects the specific scenario the developer wants to enable. You will see some of these debug cases in the following subsections.
Multiple Planes Debug¶
If you want to enable or debug multiple planes in a specific user-space application, you can leverage a debug feature named visual confirm. To enable it, use:
echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
You need to reload your GUI to see the visual confirmation. When the plane configuration changes or a full update occurs, there will be a colored bar at the bottom of each hardware plane being drawn on the screen.
The color indicates the format; for example, red is AR24 and green is NV12
The height of the bar indicates the index of the plane
Pipe split can be observed if there are two bars with a difference in height covering the same plane
Consider the video playback case in which a video is played in a specific plane, and the desktop is drawn in another plane. The video plane should feature one or two green bars at the bottom of the video depending on pipe split configuration.
There should not be any visual corruption
There should not be any underflow or screen flashes
There should not be any black screens
There should not be any cursor corruption
Multiple planes may be briefly disabled during window transitions or resizing, but should come back after the action has finished
Pipe Split Debug¶
Sometimes we need to debug if DCN is splitting pipes correctly, and visual confirmation is also handy for this case. Similar to the multiple planes (MPO) case, you can use the below command to enable visual confirmation:
echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
In this case, if you have a pipe split, you will see one small red bar at the bottom of the display covering the entire display width and another bar covering the second pipe. In other words, you will see a slightly taller bar in the second pipe.
DTN Debug¶
DC (DCN) provides an extensive log that dumps multiple details from our hardware configuration. You can capture those status values with the Display Test Next (DTN) log, which is exposed via debugfs:
cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
Since this log is updated according to the DCN status, you can also follow the changes in real time by using something like:
sudo watch -d cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
When reporting a bug related to DC, consider attaching this log before and after you reproduce the bug.
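One possible way to collect the "before" and "after" snapshots is a small helper that copies the log to a tagged file. This is a sketch only: the function name capture_dtn, the DTN_LOG override variable, and the output file naming are all assumptions made for illustration.

```shell
# Hypothetical helper: copy the DTN log to "<outdir>/dtn-<tag>.log".
# $1 is a tag such as "before" or "after"; $2 is an optional output
# directory (defaults to the current one). The log path defaults to
# DRI device 0 and can be overridden through the DTN_LOG variable.
capture_dtn() {
    tag=$1
    outdir=${2:-.}
    cat "${DTN_LOG:-/sys/kernel/debug/dri/0/amdgpu_dm_dtn_log}" > "$outdir/dtn-$tag.log"
}

# Usage (as root):
#   capture_dtn before
#   ... reproduce the bug ...
#   capture_dtn after
#   diff -u dtn-before.log dtn-after.log
```

Both files can then be attached to the bug report, and the diff gives a quick view of which hardware state changed.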
Collect Firmware information¶
When reporting issues, it is important to have the firmware information since it can be helpful for debugging purposes. To get all the firmware information, use the command:
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
From the display perspective, pay attention to the firmware of the DMCU and DMCUB.
DMUB Firmware Debug¶
Sometimes, dmesg logs aren’t enough. This is especially true if a feature is implemented primarily in DMUB firmware. In such cases, all we see in dmesg when an issue arises is some generic timeout error. So, to get more relevant information, we can trace DMUB commands by enabling the relevant bits in amdgpu_dm_dmub_trace_mask.
Currently, we support the tracing of the following groups:
Trace Groups¶
| Name | Mask Value |
|---|---|
| INFO | 0x1 |
| IRQ SVC | 0x2 |
| VBIOS | 0x4 |
| REGISTER | 0x8 |
| PHY DBG | 0x10 |
| PSR | 0x20 |
| AUX | 0x40 |
| SMU | 0x80 |
| MALL | 0x100 |
| ABM | 0x200 |
| ALPM | 0x400 |
| TIMER | 0x800 |
| HW LOCK MGR | 0x1000 |
| INBOX1 | 0x2000 |
| PHY SEQ | 0x4000 |
| PSR STATE | 0x8000 |
| ZSTATE | 0x10000 |
| TRANSMITTER CTL | 0x20000 |
| PANEL CNTL | 0x40000 |
| FAMS | 0x80000 |
| DPIA | 0x100000 |
| SUBVP | 0x200000 |
| INBOX0 | 0x400000 |
| SDP | 0x4000000 |
| REPLAY | 0x8000000 |
| REPLAY RESIDENCY | 0x20000000 |
| CURSOR INFO | 0x80000000 |
| IPS | 0x100000000 |
Note: Not all ASICs support all of the listed trace groups
So, to enable just PSR tracing (the PSR and PSR STATE groups, 0x20 | 0x8000 = 0x8020), you can use the following command:
# echo 0x8020 > /sys/kernel/debug/dri/0/amdgpu_dm_dmub_trace_mask
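A mask covering several groups is the bitwise OR of the individual values from the table. The helper below is a minimal sketch of that arithmetic; the name dmub_trace_mask is hypothetical and only the mask values themselves come from this page.

```shell
# Hypothetical helper: OR together any number of trace-group mask
# values (as listed in the Trace Groups table) and print the result
# in hexadecimal.
dmub_trace_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | $bit ))
    done
    printf '0x%x\n' "$mask"
}

dmub_trace_mask 0x20 0x8000          # PSR | PSR STATE, prints 0x8020
dmub_trace_mask 0x20 0x8000 0x200    # the above plus ABM, prints 0x8220
```

The printed value can then be written to amdgpu_dm_dmub_trace_mask in the same way as the 0x8020 example.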
Then, you need to enable logging trace events to the buffer, which you can do using the following:
# echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
Lastly, after you are able to reproduce the issue you are trying to debug, you can disable tracing and read the trace log by using the following:
# echo 0 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
# cat /sys/kernel/debug/dri/0/amdgpu_dm_dmub_tracebuffer
So, when reporting bugs related to features such as PSR and ABM, consider enabling the relevant bits in the mask before reproducing the issue and attach the log that you obtain from the trace buffer in any bug reports that you create.