Scrub Control

Copyright (c) 2024-2025 HiSilicon Limited.

Author:

Shiju Jose <shiju.jose@huawei.com>

License:

The GNU Free Documentation License, Version 1.2 withoutInvariant Sections, Front-Cover Texts nor Back-Cover Texts.(dual licensed under the GPL v2)

  • Written for: 6.15

Introduction

Increasing DRAM size and cost have made memory subsystem reliability animportant concern. These modules are used where potentially corrupted datacould cause expensive or fatal issues. Memory errors are among the tophardware failures that cause server and workload crashes.

Memory scrubbing is a feature where an ECC (Error-Correcting Code) enginereads data from each memory media location, corrects if necessary and writesthe corrected data back to the same memory media location.

DIMMs can be scrubbed at a configurable rate to detect uncorrected memoryerrors and attempt recovery from detected errors, providing the followingbenefits:

  1. Proactively scrubbing DIMMs reduces the chance of a correctable errorbecoming uncorrectable.

  2. When detected, uncorrected errors caught in unallocated memory pages areisolated and prevented from being allocated to an application or the OS.

  3. This reduces the likelihood of software or hardware products encounteringmemory errors.

  4. The additional data on failures in memory may be used to build upstatistics that are later used to decide whether to use memory repairtechnologies such as Post Package Repair or Sparing.

There are 2 types of memory scrubbing:

  1. Background (patrol) scrubbing while the DRAM is otherwise idle.

  2. On-demand scrubbing for a specific address range or region of memory.

Several types of interfaces to hardware memory scrubbers have beenidentified, such as CXL memory device patrol scrub, CXL DDR5 ECS, ACPIRAS2 memory scrubbing, and ACPI NVDIMM ARS (Address Range Scrub).

The control mechanisms vary across different memory scrubbers. To enablestandardized userspace tooling, there is a need to present these controlsthrough a standardized ABI.

A generic memory EDAC scrub control allows users to manage underlyingscrubbers in the system through a standardized sysfs control interface. Itabstracts the management of various scrubbing functionalities into a unifiedset of functions.

Use cases of common scrub control feature

  1. Several types of interfaces for hardware memory scrubbers have beenidentified, including the CXL memory device patrol scrub, CXL DDR5 ECS,ACPI RAS2 memory scrubbing features, ACPI NVDIMM ARS (Address Range Scrub),and software-based memory scrubbers.

    Of the identified interfaces to hardware memory scrubbers some supportcontrol over patrol (background) scrubbing (e.g., ACPI RAS2, CXL) and/oron-demand scrubbing (e.g., ACPI RAS2, ACPI ARS). However, the scrub controlinterfaces vary between memory scrubbers, highlighting the need fora standardized, generic sysfs scrub control interface that is accessible touserspace for administration and use by scripts/tools.

  2. User-space scrub controls allow users to disable scrubbing if necessary,for example, to disable background patrol scrubbing or adjust the scrubrate for performance-aware operations where background activities need tobe minimized or disabled.

  3. User-space tools enable on-demand scrubbing for specific address ranges,provided that the scrubber supports this functionality.

  4. User-space tools can also control memory DIMM scrubbing at a configurablescrub rate via sysfs scrub controls. This approach offers several benefits:

    4.1. Detects uncorrectable memory errors early, before user access to affected

    memory, helping facilitate recovery.

    4.2. Reduces the likelihood of correctable errors developing into uncorrectable

    errors.

  5. Policy control for hotplugged memory is necessary because there may notbe a system-wide BIOS or similar control to manage scrub settings for a CXLdevice added after boot. Determining these settings is a policy decision,balancing reliability against performance, so userspace should control it.Therefore, a unified interface is recommended for handling this function ina way that aligns with other similar interfaces, rather than creating aseparate one.

Scrubbing features

CXL Memory Scrubbing features

CXL spec r3.1[1] section 8.2.9.9.11.1 describes the memory device patrolscrub control feature. The device patrol scrub proactively locates and makescorrections to errors in regular cycle. The patrol scrub control allows theuserspace request to change CXL patrol scrubber’s configurations.

The patrol scrub control allows the requester to specify the number ofhours in which the patrol scrub cycles must be completed, provided thatthe requested scrub rate must be within the supported range of thescrub rate that the device is capable of. In the CXL driver, thenumber of seconds per scrub cycles, which user requests via sysfs, isrescaled to hours per scrub cycles.

In addition, they allow the host to disable the feature in case it interfereswith performance-aware operations which require the background operations tobe turned off.

Error Check Scrub (ECS)

CXL spec r3.1[1] section 8.2.9.9.11.2 describes Error Check Scrub (ECS)- a feature defined in the JEDEC DDR5 SDRAM Specification (JESD79-5) andallowing DRAM to internally read, correct single-bit errors, and write backcorrected data bits to the DRAM array while providing transparency to errorcounts.

The DDR5 device contains number of memory media Field Replaceable Units (FRU)per device. The DDR5 ECS feature and thus the ECS control driver supportsconfiguring the ECS parameters per FRU.

ACPI RAS2 Hardware-based Memory Scrubbing

ACPI spec 6.5[2] section 5.2.21 ACPI RAS2 describes an ACPI RAS2 tablewhich provides interfaces for platform RAS features and supports independentRAS controls and capabilities for a given RAS feature for multiple instancesof the same component in a given system.

Memory RAS features apply to RAS capabilities, controls and operations thatare specific to memory. RAS2 PCC sub-spaces for memory-specific RAS featureshave a Feature Type of 0x00 (Memory).

The platform can use the hardware-based memory scrubbing feature to exposecontrols and capabilities associated with hardware-based memory scrubengines. The RAS2 memory scrubbing feature supports as per spec,

  1. Independent memory scrubbing controls for each NUMA domain, identifiedusing its proximity domain.

  2. Provision for background (patrol) scrubbing of the entire memory system,as well as on-demand scrubbing for a specific region of memory.

ACPI Address Range Scrubbing (ARS)

ACPI spec 6.5[2] section 9.19.7.2 describes Address Range Scrubbing (ARS).ARS allows the platform to communicate memory errors to system software.This capability allows system software to prevent accesses to addresses withuncorrectable errors in memory. ARS functions manage all NVDIMMs present inthe system. Only one scrub can be in progress system wide at any given time.

The following functions are supported as per the specification:

  1. Query ARS Capabilities for a given address range, indicates platformsupports the ACPI NVDIMM Root Device Unconsumed Error Notification.

  2. Start ARS triggers an Address Range Scrub for the given memory range.Address scrubbing can be done for volatile or persistent memory, or both.

  3. Query ARS Status command allows software to get the status of ARS,including the progress of ARS and ARS error record.

  4. Clear Uncorrectable Error.

  5. Translate SPA

  6. ARS Error Inject etc.

The kernel supports an existing control for ARS and ARS is currently notsupported in EDAC.

[1](1,2)

https://computeexpresslink.org/cxl-specification/

[2](1,2)

https://uefi.org/specs/ACPI/6.5/

Comparison of various scrubbing features

Name

ACPIRAS2

CXL patrolscrub

CXL ECS

ARS

On-demandScrubbing

Supported

No

No

Supported

Backgroundscrubbing

Supported

Supported

Supported

No

Mode ofscrubbing

Scrub ctrlper NUMAdomain.

per device

per memorymedia

Unknown

Query scrubcapabilities

Supported

Supported

Supported

Supported

Settingaddress range

Supported

No

No

Supported

Settingscrub rate

Supported

Supported

No

No

Unit forscrub rate

NotDefined

in hours

No

No

Scrubstatus/Completion

Supportedon-demandscrubbingonly

No

No

Supported

UC errorreporting

Exception

CXL generalmedia/DRAMevent/mediascan?

CXL generalmedia/DRAMevent/mediascan?

ACPI UCEnotify andqueryARS status

Support forEDAC control

Supported

Supported

Supported

No

The File System

The control attributes of a registered scrubber instance could beaccessed in:

/sys/bus/edac/devices/<dev-name>/scrubX/

sysfs

Sysfs files are documented inABI file testing/sysfs-edac-scrub

ABI file testing/sysfs-edac-ecs

Examples

The usage takes the form shown in these examples:

  1. CXL memory Patrol Scrub

The following are the use cases identified why we might increase the scrub rate.

  • Scrubbing is needed at device granularity because a device is showingunexpectedly high errors.

  • Scrubbing may apply to memory that isn’t online at all yet. Likely thisis a system wide default setting on boot.

  • Scrubbing at a higher rate because the monitor software has determined thatmore reliability is necessary for a particular data set. This is calledDifferentiated Reliability.

1.1. Device based scrubbing

CXL memory is exposed to memory management subsystem and ultimately userspacevia CXL devices. Device-based scrubbing is used for the first use casedescribed in “Section 1 CXL Memory Patrol Scrub”.

When combining control via the device interfaces and region interfaces,“see Section 1.2 Region based scrubbing”.

Sysfs files for scrubbing are documented inABI file testing/sysfs-edac-scrub

1.2. Region based scrubbing

CXL memory is exposed to memory management subsystem and ultimately userspacevia CXL regions. CXL Regions represent mapped memory capacity in systemphysical address space. These can incorporate one or more parts of multiple CXLmemory devices with traffic interleaved across them. The user may want to controlthe scrub rate via this more abstract region instead of having to figure out theconstituent devices and program them separately. The scrub rate for each devicecovers the whole device. Thus if multiple regions use parts of that device thenrequests for scrubbing of other regions may result in a higher scrub rate thanrequested for this specific region.

Region-based scrubbing is used for the third use case described in“Section 1 CXL Memory Patrol Scrub”.

Userspace must follow below set of rules on how to set the scrub rates for anymixture of requirements.

  1. Taking each region in turn from lowest desired scrub rate to highest and settheir scrub rates. Later regions may override the scrub rate on individualdevices (and hence potentially whole regions).

  2. Take each device for which enhanced scrubbing is required (higher rate) andset those scrub rates. This will override the scrub rates of individual devices,setting them to the maximum rate required for any of the regions they help back,unless a specific rate is already defined.

Sysfs files for scrubbing are documented inABI file testing/sysfs-edac-scrub

  1. CXL memory Error Check Scrub (ECS)

The Error Check Scrub (ECS) feature enables a memory device to perform errorchecking and correction (ECC) and count single-bit errors. The associatedmemory controller sets the ECS mode with a trigger sent to the memorydevice. CXL ECS control allows the host, thus the userspace, to change theattributes for error count mode, threshold number of errors per segment(indicating how many segments have at least that number of errors) forreporting errors, and reset the ECS counter. Thus the responsibility forinitiating Error Check Scrub on a memory device may lie with the memorycontroller or platform when unexpectedly high error rates are detected.

Sysfs files for scrubbing are documented inABI file testing/sysfs-edac-ecs