18. Debugging AMD Zen systems

18.1. Introduction

This document describes techniques that are useful for debugging issues with AMD Zen systems. It is intended for use by developers and technical users to help identify and resolve issues.

18.2. S3 vs s2idle

On AMD systems, it’s not possible to simultaneously support suspend-to-RAM (S3) and suspend-to-idle (s2idle). To confirm which mode your system supports you can look at cat /sys/power/mem_sleep. If it shows s2idle [deep] then S3 is supported. If it shows [s2idle] then s2idle is supported.
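The check above can also be scripted. The following is a minimal Python sketch, not part of any existing tool; the function name parse_mem_sleep is made up for illustration. It takes the text of /sys/power/mem_sleep and reports the bracketed (selected) mode along with every mode listed.

```python
def parse_mem_sleep(text):
    """Split the contents of /sys/power/mem_sleep into (selected, available).

    The selected mode is the entry wrapped in square brackets,
    for example "s2idle [deep]" selects "deep".
    """
    modes = text.split()
    selected = next((m.strip("[]") for m in modes if m.startswith("[")), None)
    available = [m.strip("[]") for m in modes]
    return selected, available


if __name__ == "__main__":
    # On a real system, read the file instead of using a literal string:
    #   text = open("/sys/power/mem_sleep").read()
    for text in ("s2idle [deep]\n", "[s2idle]\n"):
        selected, available = parse_mem_sleep(text)
        print(f"selected={selected} available={available}")
```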

On systems that support S3, the firmware will be utilized to put all hardware into the appropriate low power state.

On systems that support s2idle, the kernel will be responsible for transitioning devices into the appropriate low power state. When all devices are in the appropriate low power state, the hardware will transition into a hardware sleep state.

After a suspend cycle you can tell how much time was spent in a hardware sleep state by looking at cat /sys/power/suspend_stats/last_hw_sleep.
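As a rough illustration, the value can be compared against how long the suspend cycle lasted. This sketch assumes the file reports microseconds, as described in the kernel's sysfs power ABI documentation, and that you measured the wall-clock length of the suspend cycle yourself. Both function names are hypothetical.

```python
def hw_sleep_seconds(text):
    """Convert the contents of last_hw_sleep (assumed microseconds) to seconds."""
    return int(text.strip()) / 1_000_000


def hw_sleep_ratio(text, suspend_seconds):
    """Fraction of a suspend cycle that was spent in the hardware sleep state.

    suspend_seconds is the wall-clock length of the cycle, measured separately.
    """
    return hw_sleep_seconds(text) / suspend_seconds


if __name__ == "__main__":
    # On a real system:
    #   text = open("/sys/power/suspend_stats/last_hw_sleep").read()
    text = "5000000\n"  # made-up sample value
    print(hw_sleep_seconds(text), hw_sleep_ratio(text, 10))
```

A ratio of zero after a suspend cycle suggests the hardware sleep state was never reached.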

This flowchart explains how the AMD s2idle suspend flow works.

../../_images/suspend.svg

This flowchart explains how the AMD s2idle resume flow works.

../../_images/resume.svg

18.3. s2idle debugging tool

As there are a lot of places that problems can occur, a debugging tool has been created at amd-debug-tools that can help test for common problems and offer suggestions.

If you have an s2idle issue, it’s best to start with this and follow instructions from its findings. If you continue to have an issue, raise a bug with the report generated from this script to drm/amd gitlab.

18.4. Spurious s2idle wakeups from an IRQ

Spurious wakeups will generally have the IRQ number recorded in /sys/power/pm_wakeup_irq. This can be matched to /proc/interrupts to determine what device woke the system.
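The matching step can be sketched in Python as follows. The helper name find_irq and the sample /proc/interrupts text are made up for illustration; real output will have different columns and devices.

```python
# Hypothetical excerpt of /proc/interrupts, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  1:          0          5  IR-IO-APIC    1-edge      i8042
  7:          0        120  IR-IO-APIC    7-fasteoi   pinctrl_amd
  9:          0         44  IR-IO-APIC    9-fasteoi   acpi
"""


def find_irq(interrupts_text, irq):
    """Return the /proc/interrupts line for the given IRQ number, or None."""
    wanted = f"{irq}:"
    for line in interrupts_text.splitlines():
        fields = line.split()
        # The first column is the IRQ number followed by a colon.
        if fields and fields[0] == wanted:
            return line.strip()
    return None


if __name__ == "__main__":
    # On a real system:
    #   irq = int(open("/sys/power/pm_wakeup_irq").read())
    #   text = open("/proc/interrupts").read()
    print(find_irq(SAMPLE, 9))
```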

If this isn’t enough to debug the problem, then the following sysfs files can be set to add more verbosity to the wakeup process:

# echo 1 | sudo tee /sys/power/pm_debug_messages
# echo 1 | sudo tee /sys/power/pm_print_times

After making those changes, the kernel will display messages that can be traced back to kernel s2idle loop code as well as display any active GPIO sources while waking up.

If the wakeup is caused by the ACPI SCI, additional ACPI debugging may be needed. These commands can enable additional trace data:

# echo enable | sudo tee /sys/module/acpi/parameters/trace_state
# echo 1 | sudo tee /sys/module/acpi/parameters/aml_debug_output
# echo 0x0800000f | sudo tee /sys/module/acpi/parameters/debug_level
# echo 0xffff0000 | sudo tee /sys/module/acpi/parameters/debug_layer

18.5. Spurious s2idle wakeups from a GPIO

If a GPIO is active when waking up the system, ideally you would look at the schematic to determine what device it is associated with. If the schematic is not available, another tactic is to look at the ACPI _EVT() entry to determine what device is notified when that GPIO is active.

For a hypothetical example, say that GPIO 59 woke up the system. You can look at the SSDT to determine what device is notified when GPIO 59 is active.

First convert the GPIO number into hex.

$ python3 -c "print(hex(59))"
0x3b

Next determine which ACPI table has the _EVT entry. For example:

$ sudo grep EVT /sys/firmware/acpi/tables/SSDT*
grep: /sys/firmware/acpi/tables/SSDT27: binary file matches

Decode this table:

$ sudo cp /sys/firmware/acpi/tables/SSDT27 .
$ sudo iasl -d SSDT27

Then look at the table and find the matching entry for GPIO 0x3b.

Case (0x3B)
{
    M000 (0x393B)
    M460 ("    Notify (\\_SB.PCI0.GP17.XHC1, 0x02)\n", Zero, Zero, Zero, Zero, Zero, Zero)
    Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake
}

You can see in this case that the device \_SB.PCI0.GP17.XHC1 is notified when GPIO 59 is active. It’s obvious this is an XHCI controller, but to go a step further you can figure out which XHCI controller it is by matching it to ACPI:

$ grep "PCI0.GP17.XHC1" /sys/bus/acpi/devices/*/path
/sys/bus/acpi/devices/device:2d/path:\_SB_.PCI0.GP17.XHC1
/sys/bus/acpi/devices/device:2e/path:\_SB_.PCI0.GP17.XHC1.RHUB
/sys/bus/acpi/devices/device:2f/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1
/sys/bus/acpi/devices/device:30/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM0
/sys/bus/acpi/devices/device:31/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM1
/sys/bus/acpi/devices/device:32/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT2
/sys/bus/acpi/devices/LNXPOWER:0d/path:\_SB_.PCI0.GP17.XHC1.PWRS

Here you can see it matches to device:2d. Look at the physical_node to determine what PCI device that actually is.

$ ls -l /sys/bus/acpi/devices/device:2d/physical_node
lrwxrwxrwx 1 root root 0 Feb 12 13:22 /sys/bus/acpi/devices/device:2d/physical_node -> ../../../../../pci0000:00/0000:00:08.1/0000:c2:00.4

So there you have it: the PCI device associated with this GPIO wakeup was 0000:c2:00.4.

The amd_s2idle.py script will capture most of these artifacts for you.
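If you prefer to search the decoded table programmatically, the lookup of the Case entry can be sketched in Python. This is only an illustration and is unrelated to how amd_s2idle.py works. The function name notify_targets is hypothetical, and the sketch assumes the disassembler prints the case value as two-digit uppercase hex, as in the example above.

```python
import re

# The Case block from the example above, as it appears in the decoded table.
SAMPLE = r'''
Case (0x3B)
{
    M000 (0x393B)
    M460 ("    Notify (\\_SB.PCI0.GP17.XHC1, 0x02)\n", Zero, Zero, Zero, Zero, Zero, Zero)
    Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake
}
'''


def notify_targets(dsl_text, gpio):
    """Return the ACPI paths notified inside the Case block for a GPIO number."""
    header = re.search(r"Case \(0x%02X\)\s*\{" % gpio, dsl_text)
    if header is None:
        return []
    # Walk forward to the brace that closes this Case block.
    # Braces inside string literals are not handled; this is only a sketch.
    depth, pos = 1, header.end()
    while pos < len(dsl_text) and depth:
        ch = dsl_text[pos]
        depth += (ch == "{") - (ch == "}")
        pos += 1
    body = dsl_text[header.end():pos]
    # Match Notify only at the start of a line, which skips the copy of the
    # statement embedded in the M460 debug string.
    return re.findall(r"^\s*Notify \((\\[\w.]+),", body, flags=re.M)


if __name__ == "__main__":
    print(notify_targets(SAMPLE, 59))
```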

18.6. s2idle PM debug messages

During the s2idle flow on AMD systems, the ACPI LPS0 driver is responsible for checking all uPEP constraints. Failing uPEP constraints does not prevent s0i3 entry. This means that if some constraints are not met, it is possible the kernel may attempt to enter s2idle even if there are some known issues.

To activate PM debugging, either specify the pm_debug_messages kernel command-line option at boot or write to /sys/power/pm_debug_messages. Unmet constraints will be displayed in the kernel log and can be viewed by logging tools that process the kernel ring buffer, like dmesg or journalctl.

If the system freezes on entry/exit before these messages are flushed, a useful debugging tactic is to unbind the amd_pmc driver to prevent notification to the platform to start s0i3 entry. This will stop the system from freezing on entry or exit and let you view all the failed constraints.

cd /sys/bus/platform/drivers/amd_pmc
ls | grep AMD | sudo tee unbind

After doing this, run the suspend cycle and look specifically for errors around:

ACPI: LPI: Constraint not met; min power state:%s current power state:%s
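To pull these messages out of a saved kernel log, a small filter along the following lines may help. It is a sketch only. The sample log lines, including the device paths and power state names in them, are invented to match the message format above, and unmet_constraints is a hypothetical helper name.

```python
import re

CONSTRAINT_RE = re.compile(
    r"LPI: Constraint not met; min power state:(\S+) current power state:(\S+)"
)

# Invented log excerpt for illustration; real messages will differ.
SAMPLE_LOG = """\
PM: suspend entry (s2idle)
ACPI: \\_SB_.PCI0.GP17.XHC1: LPI: Constraint not met; min power state:D3hot current power state:D0
ACPI: \\_SB_.PCI0.GPP6.NVME: LPI: Constraint not met; min power state:D3hot current power state:D0
PM: suspend exit
"""


def unmet_constraints(log_text):
    """Return (min_state, current_state, line) for each unmet constraint."""
    found = []
    for line in log_text.splitlines():
        match = CONSTRAINT_RE.search(line)
        if match:
            found.append((match.group(1), match.group(2), line.strip()))
    return found


if __name__ == "__main__":
    # A real log can be captured with, for example: dmesg > log.txt
    for min_state, current, line in unmet_constraints(SAMPLE_LOG):
        print(f"needs {min_state}, currently {current}: {line}")
```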

18.7. Historical examples of s2idle issues

To help understand the types of issues that can occur and how to debug them, here are some historical examples of s2idle issues that have been resolved.

18.7.1. Core offlining

An end user had reported that taking a core offline would prevent the system from properly entering s0i3. This was debugged using internal AMD tools to capture and display a stream of metrics from the hardware showing what changed when a core was offlined. It was determined that the hardware didn’t get notification the offline cores were in the deepest state, and so it prevented the CPU from going into the deepest state. The issue was debugged to a missing command to put cores into C3 upon offline.

commit d6b88ce2eb9d2 (“ACPI: processor idle: Allow playing dead in C3 state”)

18.7.2. Corruption after resume

A big problem that occurred with Rembrandt was that there was graphical corruption after resume. This happened because of a misalignment of PSP and driver responsibility. The PSP will save and restore DMCUB, but the driver assumed it needed to reset DMCUB on resume. This actually was a misalignment for earlier silicon as well, but was not observed.

commit 79d6b9351f086 (“drm/amd/display: Don’t reinitialize DMCUB on s0ix resume”)

18.7.3. Back to Back suspends fail

When using a wakeup source that triggers the IRQ to wake up, a bug in the pinctrl-amd driver may capture the wrong state of the IRQ and prevent the system going back to sleep properly.

commit b8c824a869f22 (“pinctrl: amd: Don’t save/restore interrupt status and wake status bits”)

18.7.4. Spurious timer based wakeup after 5 minutes

The HPET was being used to program the wakeup source for the system, however this was causing a spurious wakeup after 5 minutes. The correct alarm to use was the ACPI alarm.

commit 3d762e21d5637 (“rtc: cmos: Use ACPI alarm for non-Intel x86 systems too”)

18.7.5. Disk disappears after resume

After resuming from s2idle, the NVMe disk would disappear. This was due to the BIOS not specifying the _DSD StorageD3Enable property. This caused the NVMe driver not to put the disk into the expected state at suspend and to fail on resume.

commit e79a10652bbd3 (“ACPI: x86: Force StorageD3Enable on more products”)

18.7.6. Spurious IRQ1

A number of Renoir, Lucienne, Cezanne, & Barcelo platforms have a platform firmware bug where IRQ1 is triggered during s0i3 resume.

This was fixed in the platform firmware, but a number of systems didn’t receive any more platform firmware updates.

commit 8e60615e89321 (“platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN”)

18.7.7. Hardware timeout

The hardware performs many actions besides accepting the values from the amd-pmc driver. As the communication path with the hardware is a mailbox, it’s possible that it might not respond quickly enough. This issue manifested as a failure to suspend:

PM: dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110
amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110

The timing problem was identified by comparing the values of the idle mask.

commit 3c3c8e88c8712 (“platform/x86: amd-pmc: Increase the response register timeout”)

18.7.8. Failed to reach hardware sleep state with panel on

On some Strix systems certain panels were observed to block the system from entering a hardware sleep state if the internal panel was on during the sequence.

Even though the panel got turned off during suspend, it exposed a timing problem where an interrupt caused the display hardware to wake up and block low power state entry.

commit 40b8c14936bd2 (“drm/amd/display: Disable unneeded hpd interrupts during dm_init”)

18.8. Runtime power consumption issues

Runtime power consumption is influenced by many factors, including but not limited to the configuration of the PCIe Active State Power Management (ASPM), the display brightness, the EPP policy of the CPU, and the power management of the devices.

18.8.1. ASPM

For the best runtime power consumption, ASPM should be programmed as intended by the BIOS from the hardware vendor. To accomplish this the Linux kernel should be compiled with CONFIG_PCIEASPM_DEFAULT set to y and the sysfs file /sys/module/pcie_aspm/parameters/policy should not be modified.

Most notably, if L1.2 is not configured properly for any devices, the SoC will not be able to enter the deepest idle state.

18.8.2. EPP Policy

The energy_performance_preference sysfs file can be used to set a bias of efficiency or performance for a CPU. This has a direct effect on battery life: the more heavily the bias leans towards performance, the shorter the battery life.
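To see which bias each CPU is currently using, the files can be collected with a short Python sketch. The directory layout used here (cpuN/cpufreq/energy_performance_preference under /sys/devices/system/cpu) is an assumption about where the file lives on a typical system, and read_epp is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def read_epp(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: preference} for every CPU that exposes the file.

    The cpuN/cpufreq/ layout is assumed; adjust the pattern if your
    system places the file elsewhere.
    """
    pattern = "cpu[0-9]*/cpufreq/energy_performance_preference"
    result = {}
    for path in sorted(Path(cpu_root).glob(pattern)):
        # path.parts[-3] is the "cpuN" directory name.
        result[path.parts[-3]] = path.read_text().strip()
    return result


if __name__ == "__main__":
    # Summarize how many CPUs use each preference.
    print(Counter(read_epp().values()))
```

If the summary shows a mix of values, a tool or daemon may be setting the preference for only some CPUs.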

18.9. BIOS debug messages

Most OEM machines don’t have a serial UART for outputting kernel or BIOS debug messages. However, BIOS debug messages are useful for understanding both BIOS bugs and bugs with the Linux kernel drivers that call BIOS AML.

As the BIOS on most OEM AMD systems is based on an AMD reference BIOS, the infrastructure used for exporting debugging messages is often the same as in the AMD reference BIOS.

18.9.1. Manually Parsing

There is generally an ACPI method \M460 that different paths of the AML will call to emit a message to the BIOS serial log. This method takes 7 arguments, with the first being a string and the rest being optional integers:

Method (M460, 7, Serialized)

Here is an example of a string that BIOS AML may call out using \M460:

M460 ("  OEM-ASL-PCIe Address (0x%X)._REG (%d %d)  PCSA = %d\n", DADR, Arg0, Arg1, PCSA, Zero, Zero)

Normally when executed, the \M460 method would populate the additional arguments into the string. In order to get these messages from the Linux kernel a hook has been added into ACPICA that can capture the arguments sent to \M460 and print them to the kernel ring buffer. For example the following message could be emitted into kernel ring buffer:

extrace-0174 ex_trace_args         :  "  OEM-ASL-PCIe Address (0x%X)._REG (%d %d)  PCSA = %d\n", ec106000, 2, 1, 1, 0, 0

In order to get these messages, you need to compile with CONFIG_ACPI_DEBUG and then turn on the following ACPICA tracing parameters. This can be done either on the kernel command line or at runtime:

  • acpi.trace_method_name=\M460

  • acpi.trace_state=method

NOTE: These can be very noisy at bootup. If you turn these parameters on via the kernel command line, please also consider turning up CONFIG_LOG_BUF_SHIFT to a larger size such as 17 to avoid losing early boot messages.
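Filling the captured arguments back into the format string by hand can be sketched as follows. This Python snippet is an illustration only and is not the parser from amd-debug-tools. It assumes the arguments in the trace line are printed in hexadecimal (as ec106000 in the example suggests) and only handles the %d, %x and %X specifiers; render_m460 is a hypothetical name.

```python
import re

# The trace line from the example above.
SAMPLE = (
    r'extrace-0174 ex_trace_args         :  '
    r'"  OEM-ASL-PCIe Address (0x%X)._REG (%d %d)  PCSA = %d\n", '
    r'ec106000, 2, 1, 1, 0, 0'
)


def render_m460(trace_line):
    """Substitute the traced arguments into the \\M460 format string."""
    match = re.search(r'"(.*)"\s*,?\s*(.*)$', trace_line)
    if match is None:
        return None
    fmt, rest = match.groups()
    # Assumption: every argument is printed as hexadecimal.
    args = iter(int(tok.strip(), 16) for tok in rest.split(",") if tok.strip())

    def substitute(spec):
        value = next(args, 0)  # missing arguments are treated as zero
        return ("%" + spec.group(1)) % value

    text = re.sub(r"%([dxX])", substitute, fmt)
    # Drop the literal "\n" escape that the BIOS string carries.
    return text.replace(r"\n", "").strip()


if __name__ == "__main__":
    print(render_m460(SAMPLE))
```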

18.9.2. Tool assisted Parsing

As mentioned above, parsing by hand can be tedious, especially with a lot of messages. To help with this, a tool has been created at amd-debug-tools to help parse the messages.

18.10. Random reboot issues

When a random reboot occurs, the high-level reason for the reboot is stored in a register that will persist onto the next boot.

There are 6 classes of reasons for the reboot:
  • Software induced

  • Power state transition

  • Pin induced

  • Hardware induced

  • Remote reset

  • Internal CPU event

Bit  Type        Reason
0    Pin         thermal pin BP_THERMTRIP_L was tripped
1    Pin         power button was pressed for 4 seconds
2    Pin         shutdown pin was tripped
4    Remote      remote ASF power off command was received
9    Internal    internal CPU thermal limit was tripped
16   Pin         system reset pin BP_SYS_RST_L was tripped
17   Software    software issued PCI reset
18   Software    software wrote 0x4 to reset control register 0xCF9
19   Software    software wrote 0x6 to reset control register 0xCF9
20   Software    software wrote 0xE to reset control register 0xCF9
21   ACPI-state  ACPI power state transition occurred
22   Pin         keyboard reset pin KB_RST_L was tripped
23   Internal    internal CPU shutdown event occurred
24   Hardware    system failed to boot before failed boot timer expired
25   Hardware    hardware watchdog timer expired
26   Remote      remote ASF reset command was received
27   Internal    an uncorrected error caused a data fabric sync flood event
29   Internal    FCH and MP1 failed warm reset handshake
30   Internal    a parity error occurred
31   Internal    a software sync flood event occurred

This information is read by the kernel at bootup and printed into the syslog. When a random reboot occurs this message can be helpful to determine the next component to debug.
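If you only have the raw register value, the bit table above can be applied with a small Python sketch. The function name decode_reset_reason is hypothetical, and the sketch assumes you have already obtained the value as an integer; it simply lists every documented bit that is set.

```python
# Bit number -> (type, reason), copied from the table above.
RESET_REASONS = {
    0: ("Pin", "thermal pin BP_THERMTRIP_L was tripped"),
    1: ("Pin", "power button was pressed for 4 seconds"),
    2: ("Pin", "shutdown pin was tripped"),
    4: ("Remote", "remote ASF power off command was received"),
    9: ("Internal", "internal CPU thermal limit was tripped"),
    16: ("Pin", "system reset pin BP_SYS_RST_L was tripped"),
    17: ("Software", "software issued PCI reset"),
    18: ("Software", "software wrote 0x4 to reset control register 0xCF9"),
    19: ("Software", "software wrote 0x6 to reset control register 0xCF9"),
    20: ("Software", "software wrote 0xE to reset control register 0xCF9"),
    21: ("ACPI-state", "ACPI power state transition occurred"),
    22: ("Pin", "keyboard reset pin KB_RST_L was tripped"),
    23: ("Internal", "internal CPU shutdown event occurred"),
    24: ("Hardware", "system failed to boot before failed boot timer expired"),
    25: ("Hardware", "hardware watchdog timer expired"),
    26: ("Remote", "remote ASF reset command was received"),
    27: ("Internal", "an uncorrected error caused a data fabric sync flood event"),
    29: ("Internal", "FCH and MP1 failed warm reset handshake"),
    30: ("Internal", "a parity error occurred"),
    31: ("Internal", "a software sync flood event occurred"),
}


def decode_reset_reason(value):
    """Return (bit, type, reason) for every documented bit set in value."""
    return [
        (bit, *RESET_REASONS[bit])
        for bit in sorted(RESET_REASONS)
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    for bit, kind, reason in decode_reset_reason(0x00080000):
        print(f"bit {bit} ({kind}): {reason}")
```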