Recoverable Hardware Error Tracking in vmcoreinfo

Overview

This feature provides generic infrastructure within the Linux kernel to track and log recoverable hardware errors. These are errors that the hardware recovers from but that remain visible to the kernel. They might not cause an immediate panic, but they may affect system health, mainly because they cause the kernel to execute new code paths.

By recording counts and timestamps of recoverable errors into the vmcoreinfo crash dump notes, this infrastructure aids post-mortem crash analysis tools in correlating hardware events with kernel failures. This enables faster triage and better understanding of root causes, especially in large-scale cloud environments where hardware issues are common.

Benefits

  • Facilitates correlation of recoverable hardware errors with kernel panics or unusual code paths that lead to system crashes.

  • Provides operators and cloud providers with quick insights, improving reliability and reducing troubleshooting time.

  • Complements existing full hardware diagnostics without replacing them.

Data Exposure and Consumption

  • The tracked error data consists of per-error-type counts and timestamps of the last occurrence.

  • This data is stored in the hwerror_data array, categorized by error source types like CPU, memory, PCI, CXL, and others.

  • It is exposed via vmcoreinfo crash dump notes and can be read using tools like crash, drgn, or other kernel crash analysis utilities.

  • Crash dumps are the only way to read this data.

  • These errors are divided by area, which includes CPU, memory, PCI, CXL, and others.
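The layout described above can be pictured with a small user-space model. This is only an illustrative Python sketch, not kernel code: the enumerator names, their order, and the log_error() helper are hypothetical, and the timestamp is assumed to be wall-clock seconds. Only the hwerror_data name and the count and timestamp fields come from this document.

```python
import time
from dataclasses import dataclass
from enum import IntEnum


class HwErrSource(IntEnum):
    # Hypothetical names and ordering, chosen to mirror the areas
    # listed above; the kernel's actual enumerators may differ.
    CPU = 0
    MEMORY = 1
    PCI = 2
    CXL = 3
    OTHERS = 4
    MAX = 5


@dataclass
class HwErrorInfo:
    count: int = 0       # number of recoverable errors seen so far
    timestamp: int = 0   # time of the last occurrence (assumed: seconds)


# One slot per error source, like the hwerror_data array.
hwerror_data = [HwErrorInfo() for _ in range(HwErrSource.MAX)]


def log_error(source, now=None):
    """Record one recoverable error for the given source (hypothetical helper)."""
    entry = hwerror_data[source]
    entry.count += 1
    entry.timestamp = int(time.time()) if now is None else now
```

Each slot keeps only a running count and the most recent timestamp, so the record stays small and fixed-size regardless of how many errors occur.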

Typical usage example (in drgn REPL):

>>> prog['hwerror_data']
(struct hwerror_info[HWERR_RECOV_MAX]){
    {
        .count = (int)844,
        .timestamp = (time64_t)1752852018,
    },
    ...
}
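For a quicker overview than the raw array dump, the entries can be walked from a script. The following is a minimal sketch: summarize_hwerrors() is a hypothetical helper, it reports array indices rather than source names (the enumerator names are not shown in this document), and it assumes the timestamp holds seconds since the Unix epoch, which the sample value suggests.

```python
from datetime import datetime, timezone


def summarize_hwerrors(prog):
    """Return (index, count, last_seen_utc) for every slot that saw an error.

    `prog` is expected to behave like a drgn Program: indexing it with
    'hwerror_data' yields an iterable of entries that expose `count`
    and `timestamp` members.
    """
    rows = []
    for index, entry in enumerate(prog["hwerror_data"]):
        count = int(entry.count)
        if count == 0:
            continue  # no recoverable error was recorded for this source
        # Assumption: timestamp is in seconds since the Unix epoch.
        last_seen = datetime.fromtimestamp(int(entry.timestamp), tz=timezone.utc)
        rows.append((index, count, last_seen))
    return rows
```

In the drgn REPL, where prog is already defined, this could be used as:

```python
for index, count, last_seen in summarize_hwerrors(prog):
    print(index, count, last_seen.isoformat())
```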

Enabling

  • This feature is enabled when CONFIG_VMCORE_INFO is set.
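Whether a given kernel was built with this option can be checked against its build configuration. The sketch below is a hypothetical helper that scans configuration text for the option. Where that text comes from is an assumption: many distributions install it as /boot/config-<release>, but this is not guaranteed.

```python
import re


def vmcore_info_enabled(config_text):
    """Return True if the kernel config text sets CONFIG_VMCORE_INFO=y."""
    pattern = r"^CONFIG_VMCORE_INFO=y$"
    return re.search(pattern, config_text, re.MULTILINE) is not None
```

A disabled option appears in the config file as a comment ("# CONFIG_VMCORE_INFO is not set"), which the anchored pattern does not match.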