8. The PCI Express Advanced Error Reporting Driver Guide HOWTO
- Authors:
Long Nguyen <tom.l.nguyen@intel.com>
Yanmin Zhang <yanmin.zhang@intel.com>
- Copyright:
© 2006 Intel Corporation
8.1. Overview
8.1.1. About this guide
This guide describes the basics of the PCI Express (PCIe) Advanced Error Reporting (AER) driver and provides information on how to use it, as well as how to enable the drivers of Endpoint devices to conform with the PCIe AER driver.
8.1.2. What is the PCIe AER Driver?
PCIe error signaling can occur on the PCIe link itself or on behalf of transactions initiated on the link. PCIe defines two error reporting paradigms: the baseline capability and the Advanced Error Reporting capability. The baseline capability is required of all PCIe components, providing a minimum defined set of error reporting requirements. The Advanced Error Reporting capability is implemented with a PCIe Advanced Error Reporting extended capability structure, providing more robust error reporting.
The PCIe AER driver provides the infrastructure to support the PCIe Advanced Error Reporting capability. The PCIe AER driver provides three basic functions:
- Gathers comprehensive error information when errors occur.
- Reports errors to the users.
- Performs error recovery actions.
The AER driver only attaches to Root Ports and Root Complex Event Collectors (RCECs) that support the PCIe AER capability.
8.2. User Guide
8.2.1. Include the PCIe AER Root Driver into the Linux Kernel
The PCIe AER driver is a Root Port service driver attached via the PCIe Port Bus driver. To use it, the driver must be compiled into the kernel. It is enabled with CONFIG_PCIEAER, which depends on CONFIG_PCIEPORTBUS.
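As a rough illustration, the two options can be checked against the text of a kernel configuration file from user space. The following is a minimal Python sketch, not part of the kernel; the helper names `aer_config_status` and `aer_built` and the sample configuration text are assumptions made for this example.

```python
def aer_config_status(config_text):
    """Return the values of the two options named in this section.

    config_text is the content of a kernel .config file. An option that is
    absent or commented out ("# CONFIG_X is not set") maps to None.
    """
    wanted = ("CONFIG_PCIEPORTBUS", "CONFIG_PCIEAER")
    status = dict.fromkeys(wanted)
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name in status:
            status[name] = value
    return status


def aer_built(config_text):
    """True if both options are set to 'y'."""
    return all(v == "y" for v in aer_config_status(config_text).values())


# Hypothetical sample; a real file would be read from the build directory.
sample = """\
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
"""
print(aer_config_status(sample))
print(aer_built(sample))
```

Where the configuration file lives depends on the distribution and build setup, so the sketch takes the text as an argument rather than assuming a path.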
8.2.2. Load PCIe AER Root Driver
Some systems have AER support in firmware. Enabling Linux AER support at the same time the firmware handles AER would result in unpredictable behavior. Therefore, Linux does not handle AER events unless the firmware grants AER control to the OS via the ACPI _OSC method. See the PCI Firmware Specification for details regarding _OSC usage.
8.2.3. AER error output
When a PCIe AER error is captured, an error message is output to the console. A correctable error is output as a warning message; any other error is printed as an error. Users can therefore choose a log level that filters out correctable error messages.
An example is shown below:
    0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
    0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
    0000:50:00.0: [20] UnsupReq (First)
    0000:50:00.0: TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100
In the example, ‘Requester ID’ means the ID of the device that sent the error message to the Root Port. Please refer to the PCIe specifications for the other fields.
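For log post-processing, the fields of such a report can be pulled apart with a couple of regular expressions. This is a minimal Python sketch that assumes the line layout of the example above; the function name `parse_aer_report` and the dictionary keys are hypothetical, and real console output may carry extra prefixes (timestamps, driver names) or differ between kernel versions.

```python
import re

# First line of a report: device, severity, layer and reporting agent.
HEADER_RE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), "
    r"type=(?P<layer>[^,]+), \((?P<agent>[^)]+)\)"
)
# Second line: raw error status and mask registers in hex.
STATUS_RE = re.compile(
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def parse_aer_report(lines):
    """Collect the header fields and the set status bits from a report."""
    info = {}
    for line in lines:
        m = HEADER_RE.search(line)
        if m:
            info.update(m.groupdict())
            continue
        m = STATUS_RE.search(line)
        if m:
            status = int(m.group("status"), 16)
            info["status"] = status
            info["mask"] = int(m.group("mask"), 16)
            # Bit positions that are set in the status register.
            info["status_bits"] = [b for b in range(32) if (status >> b) & 1]
    return info


report = [
    "0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), "
    "type=Transaction Layer, (Requester ID)",
    "0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000",
    "0000:50:00.0: [20] UnsupReq (First)",
    "0000:50:00.0: TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100",
]
print(parse_aer_report(report))
```

The status value 00100000 has only bit 20 set, which is the bit number printed in brackets on the third line of the example.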
8.2.4. AER Ratelimits
Since error messages can be generated for each transaction, we may see large volumes of errors reported. To prevent spammy devices from flooding the console or stalling execution, messages are throttled by device and error type (correctable vs. non-fatal uncorrectable). Fatal errors, including DPC errors, are not ratelimited.
AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over DEFAULT_RATELIMIT_INTERVAL (5 seconds).
Ratelimits are exposed as sysfs attributes and are configurable. See ABI file testing/sysfs-bus-pci-devices-aer.
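To make the throttling behavior concrete, here is a small user-space model of a burst-over-interval ratelimit keyed by device and error type. It is a sketch under simplifying assumptions, not the kernel's implementation: the class names `RateLimit` and `AerLogFilter`, the string labels for error types, and the use of explicit timestamps in seconds are all invented for the example.

```python
class RateLimit:
    """Allow at most `burst` events per `interval` seconds (simplified)."""

    def __init__(self, burst=10, interval=5.0):
        self.burst = burst
        self.interval = interval
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0

    def allow(self, now):
        # Open a new window on the first event or once the interval expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        self.suppressed += 1
        return False


class AerLogFilter:
    """One RateLimit per (device, error type); fatal errors always pass."""

    def __init__(self):
        self.limits = {}

    def should_log(self, device, kind, now):
        if kind == "fatal":
            return True  # fatal errors are not ratelimited
        limit = self.limits.setdefault((device, kind), RateLimit())
        return limit.allow(now)


log_filter = AerLogFilter()
# 25 correctable errors from one device, 100 ms apart.
logged = sum(
    log_filter.should_log("0000:50:00.0", "correctable", i * 0.1)
    for i in range(25)
)
print(logged)
```

In this model the first 10 of the 25 events are logged and the rest are suppressed until the 5 second window has passed; a different device, or a non-fatal error from the same device, has its own independent budget.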
8.2.5. AER Statistics / Counters
When PCIe AER errors are captured, the counters / statistics are also exposed in the form of sysfs attributes, which are documented in ABI file testing/sysfs-bus-pci-devices-aer.
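A monitoring script might turn the text of such a counter attribute into a dictionary. The sketch below assumes that each line holds an error name followed by a count, as for a per-device attribute such as aer_dev_correctable; that layout, the sample values, and the function name `parse_aer_counters` are assumptions for illustration, and the ABI file remains the authority on the actual format.

```python
def parse_aer_counters(text):
    """Map each counter name to its integer value.

    Assumes one "<name> <count>" pair per line, where the name may itself
    contain spaces and the count is the last whitespace-separated field.
    """
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.rpartition(" ")
        counters[name.strip()] = int(value)
    return counters


# Hypothetical attribute content; on a real system the text would be read
# from the device's sysfs directory.
sample_counters = """\
Receiver Error 2
Bad TLP 0
Bad DLLP 0
TOTAL_ERR_COR 2
"""
counters = parse_aer_counters(sample_counters)
print(counters["Receiver Error"], counters["TOTAL_ERR_COR"])
```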
8.3. Developer Guide
To enable error recovery, a software driver must provide callbacks.
To support AER better, developers need to understand how AER works.
PCIe errors are classified into two types: correctable errors and uncorrectable errors. This classification is based on the impact of those errors, which may result in degraded performance or function failure.
Correctable errors have no impact on the functionality of the interface. The PCIe protocol can recover without any software intervention or any loss of data. These errors are detected and corrected by hardware.
Unlike correctable errors, uncorrectable errors impact the functionality of the interface. Uncorrectable errors can cause a particular transaction or a particular PCIe link to be unreliable. Depending on those error conditions, uncorrectable errors are further classified into non-fatal errors and fatal errors. Non-fatal errors cause the particular transaction to be unreliable, but the PCIe link itself is fully functional. Fatal errors, on the other hand, cause the link to be unreliable.
When PCIe error reporting is enabled, a device will automatically send an error message to the Root Port above it when it captures an error. The Root Port, upon receiving an error reporting message, internally processes and logs the error message in its AER Capability structure. Error information being logged includes storing the error reporting agent’s Requester ID into the Error Source Identification Registers and setting the error bits of the Root Error Status Register accordingly. If AER error reporting is enabled in the Root Error Command Register, the Root Port generates an interrupt when an error is detected.
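The Requester ID logged by the Root Port can be turned back into a bus/device/function address. The following Python sketch assumes the conventional 16-bit encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0, without ARI) and assumes that the Error Source Identification register holds the correctable source in its low half and the uncorrectable source in its high half; the function names and the sample register value are hypothetical.

```python
def decode_requester_id(rid):
    """Split a 16-bit Requester ID into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


def format_bdf(rid):
    """Render a Requester ID in the usual bus:device.function notation."""
    bus, dev, fn = decode_requester_id(rid)
    return f"{bus:02x}:{dev:02x}.{fn}"


def decode_error_source(err_src):
    """Split a 32-bit Error Source Identification value into two IDs.

    Assumption: low 16 bits identify the source of a correctable error
    message, high 16 bits the source of an uncorrectable one.
    """
    return {
        "correctable_source": format_bdf(err_src & 0xFFFF),
        "uncorrectable_source": format_bdf((err_src >> 16) & 0xFFFF),
    }


# Made-up register value for illustration only.
print(decode_error_source(0x500001A9))
```

With the sample value, the uncorrectable source decodes to 50:00.0 and the correctable source to 01:15.1.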
Note that the errors described above are related to the PCIe hierarchy and links. These errors do not include any device-specific errors, because device-specific errors will still get sent directly to the device driver.
8.3.1. Provide callbacks
8.3.1.1. PCI error-recovery callbacks
The PCIe AER Root driver uses error callbacks to coordinate with downstream device drivers associated with the hierarchy in question when performing error recovery actions.
The data structure struct pci_driver has a pointer, err_handler, to a struct pci_error_handlers, which consists of a couple of callback function pointers. The AER driver follows the rules defined in PCI Error Recovery except for the PCIe-specific parts (see below). Please refer to PCI Error Recovery for detailed definitions of the callbacks.
The sections below specify when to call the error callback functions.
8.3.1.2. Correctable errors
Correctable errors have no impact on the functionality of the interface. The PCIe protocol can recover without any software intervention or any loss of data. These errors do not require any recovery actions. The AER driver clears the device’s correctable error status register accordingly and logs these errors.
8.3.1.3. Uncorrectable (non-fatal and fatal) errors
The AER driver performs a Secondary Bus Reset to recover from uncorrectable errors. The reset is applied at the port above the originating device: if the originating device is an Endpoint, only the Endpoint is reset. If, on the other hand, the originating device has subordinate devices, those are all affected by the reset as well.
If the originating device is a Root Complex Integrated Endpoint, there’s no port above where a Secondary Bus Reset could be applied. In this case, the AER driver instead applies a Function Level Reset.
If an error message indicates a non-fatal error, performing a reset upstream is not required. The AER driver calls error_detected(dev, pci_channel_io_normal) for all drivers associated with the hierarchy in question. For example:
Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port
If Upstream Port A captures an AER error, the hierarchy consists of Downstream Port B and Endpoint.
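The notion of "the hierarchy in question" can be modeled as every device below the port that captured the error. This is a toy Python sketch of the example topology; the dictionary representation and the function name `hierarchy_below` are assumptions for illustration and do not mirror the kernel's data structures.

```python
from collections import deque

# Example topology from the diagram: each device maps to its children.
topology = {
    "Root Port": ["Upstream Port A"],
    "Upstream Port A": ["Downstream Port B"],
    "Downstream Port B": ["Endpoint"],
    "Endpoint": [],
}


def hierarchy_below(port, children):
    """Return all devices beneath `port`, in breadth-first order."""
    found = []
    queue = deque(children.get(port, []))
    while queue:
        dev = queue.popleft()
        found.append(dev)
        queue.extend(children.get(dev, []))
    return found


print(hierarchy_below("Upstream Port A", topology))
```

For an error captured at Upstream Port A the walk yields Downstream Port B and Endpoint, matching the description above; an error captured at the Endpoint yields an empty hierarchy.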
A driver may return PCI_ERS_RESULT_CAN_RECOVER, PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on whether it can recover without a reset, considers the device unrecoverable, or needs a reset for recovery. If all affected drivers agree that they can recover without a reset, the reset is skipped. Should one driver request a reset, it overrides all other drivers.
If an error message indicates a fatal error, the kernel will broadcast error_detected(dev, pci_channel_io_frozen) to all drivers within the hierarchy in question. Then, performing a reset upstream is necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER to indicate that recovery without a reset is possible, the error handling goes to mmio_enabled, but afterwards a reset is still performed.
In other words, for non-fatal errors, drivers may opt in to a reset. But for fatal errors, they cannot opt out of a reset, based on the assumption that the link is unreliable.
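The opt-in and opt-out rules can be summarized in a small decision model. This Python sketch only mirrors the rules stated in this section; the enum `Result`, the functions `merge_votes` and `reset_performed`, and the precedence of a disconnect vote over a can-recover vote are assumptions of the model, not the kernel's code.

```python
from enum import Enum


class Result(Enum):
    """Stand-ins for the PCI_ERS_RESULT_* values a driver may return."""
    CAN_RECOVER = "can_recover"
    NEED_RESET = "need_reset"
    DISCONNECT = "disconnect"


def merge_votes(votes):
    """Combine the error_detected() results of all affected drivers.

    A single NEED_RESET overrides everything else. DISCONNECT taking
    precedence over CAN_RECOVER is an assumption of this model.
    """
    votes = list(votes)
    if Result.NEED_RESET in votes:
        return Result.NEED_RESET
    if Result.DISCONNECT in votes:
        return Result.DISCONNECT
    return Result.CAN_RECOVER


def reset_performed(fatal, votes):
    """Decide whether a reset happens for this error."""
    if fatal:
        return True  # drivers cannot opt out of the reset
    return merge_votes(votes) is Result.NEED_RESET  # drivers may opt in


print(reset_performed(False, [Result.CAN_RECOVER, Result.CAN_RECOVER]))
print(reset_performed(False, [Result.CAN_RECOVER, Result.NEED_RESET]))
print(reset_performed(True, [Result.CAN_RECOVER]))
```

The three calls print False, True and True: a non-fatal error is reset only when some driver asks for it, while a fatal error is reset regardless of what the drivers return.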
8.3.2. Frequently Asked Questions
- Q:
What happens if a PCIe device driver does not provide an error recovery handler (pci_driver->err_handler is equal to NULL)?
- A:
The devices attached to the driver won’t be recovered. The kernel will print out informational messages to identify unrecoverable devices.
8.4. Software error injection
Debugging PCIe AER error recovery code is quite difficult because it is hard to trigger real hardware errors. Software-based error injection can be used to fake various kinds of PCIe errors.
First, enable PCIe AER software error injection in the kernel configuration; that is, the following item should be in your .config:
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
After rebooting with the new kernel or inserting the module, a device file named /dev/aer_inject should be created.
Then, you need a user space tool named aer-inject, which can be obtained from:
More information about aer-inject can be found in the documentation in its source code.