4.11.Reliability, Availability, and Serviceability (RAS) Extensions

This document describes TF-A support for the Arm Reliability, Availability, and Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and later CPUs, and an optional extension to the base Armv8.0 architecture.

For a description of the Arm RAS extensions, Standard Error Records, and the precise definition of RAS terminology, please refer to the Arm Architecture Reference Manual and RAS Supplement. The rest of this document assumes familiarity with the architecture and terminology.

IMPORTANT NOTE: The TF-A implementation assumes that if the RAS extension is present, then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world point of view.

4.11.1.Firmware First Handling (FFH)

4.11.1.1.Introduction

EAs and Error interrupts corresponding to NS nodes are handled first in firmware:

  • Errors signaled back to NS world via suitable mechanism

  • Kernel is prohibited from accessing the RAS error records directly

  • Firmware creates CPER records for kernel to navigate and process

  • Firmware signals error back to Kernel via SDEI

4.11.1.2.Overview

FFH works in conjunction with the Exception Handling Framework. Exceptions resulting from errors in the Non-secure world are routed to and handled in EL3. Said errors are Synchronous External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling and Error Recovery interrupts.

The RAS Framework in TF-A allows the platform to define an external abort handler and to register RAS nodes and interrupts. It also provides helpers for accessing Standard Error Records as introduced by the RAS extensions.

4.11.2.Kernel First Handling (KFH)

4.11.2.1.Introduction

EAs originating from or attributed to the NS world are handled first in NS, and the Kernel navigates the standard error records directly.

  • KFH is the default handling mode if the platform does not explicitly enable FFH mode.

  • KFH mode does not need any EL3 involvement except for the reflection of errors back to the lower EL. This happens when there is an error (EA) in the system which has not yet been signaled to the PE while executing at a lower EL. During entry into EL3 the errors (EA) are synchronized, causing the async EA to pend at EL3.

4.11.3.Error Synchronization at EL3 entry

During entry to EL3 from a lower EL, any pending async EAs are either reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).

Image 1
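The choice made at EL3 entry can be pictured as a small decision function. The following is only a model of that decision under stated assumptions: the enum, the function name, and both parameters are hypothetical and do not correspond to TF-A symbols.

```c
#include <stdbool.h>

/* Hypothetical names: this models the decision only; it is not TF-A code. */
enum ea_action {
    EA_NONE,                /* No pending error: service the original exception. */
    EA_HANDLE_IN_EL3,       /* FFH: firmware handles the error first. */
    EA_REFLECT_TO_LOWER_EL  /* KFH: hand the error back to the lower EL. */
};

/* Decide what happens to an async EA found pending after synchronization. */
enum ea_action el3_entry_ea_action(bool async_ea_pending, bool ffh_enabled)
{
    if (!async_ea_pending) {
        return EA_NONE;
    }
    return ffh_enabled ? EA_HANDLE_IN_EL3 : EA_REFLECT_TO_LOWER_EL;
}
```

In this model, ffh_enabled stands for the build-time FFH selection, so the branch would be resolved when the firmware is built rather than at run time.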

4.11.4.TF-A build options

  • ENABLE_FEAT_RAS: Enable RAS extension feature at EL3.

  • HANDLE_EA_EL3_FIRST_NS: Required for FFH

  • RAS_TRAP_NS_ERR_REC_ACCESS: Trap Non-secure access of RAS error record registers.

  • RAS_EXTENSION: Deprecated macro, equivalent to ENABLE_FEAT_RAS and HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros

  • FFH_SUPPORT: Gets enabled if HANDLE_EA_EL3_FIRST_NS is enabled.

The RAS feature depends on some other TF-A build flags:

  • EL3_EXCEPTION_HANDLING: Required for FFH

  • FAULT_INJECTION_SUPPORT: Required for testing the RAS feature on the FVP platform

4.11.5.TF-A Tests

RAS functionality is regularly tested in TF-A CI using the RAS test group, which has multiple configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as an NS-EL2 payload.

  • FFH without RAS extension

    fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug

    A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3. They inject External Aborts (sync/async) which trap to EL3. FVP has a handler which gracefully handles these errors and returns to TF-A Tests.

    Build Configs: HANDLE_EA_EL3_FIRST_NS, PLATFORM_TEST_EA_FFH

  • FFH with RAS extension

    Three tests:

    • fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug

      Inject an unrecoverable RAS error, which gets handled in EL3.

    • fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug

      Inject uncontainable RAS errors, which cause the platform to panic.

    • fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug

      Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower EL, which remains pending until EL3 is entered through an SMC call. At EL3 entry, on encountering a pending async EA, EL3 handles the async EA first (nested exception) before handling the original SMC call.

  • KFH with RAS extension

    A couple of tests in the group:

    • fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug

      Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

    • fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug

      Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one each for reflecting in the IRQ and SMC paths.

4.11.6.RAS Framework

../_images/ras.svg

4.11.6.1.Platform APIs

The RAS framework allows the platform to define handlers for External Abort, Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please refer to the RAS Porting Guide.

4.11.6.2.Registering RAS error records

RAS nodes are components in the system capable of signalling errors to PEs through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS nodes contain one or more error records, which are registers through which the nodes advertise various properties of the signalled error. Arm recommends that error records are implemented in the Standard Error Record format. The RAS architecture allows for error records to be accessible via system or memory-mapped registers.

The platform should enumerate the error records providing for each of them:

  • A handler to probe error records for errors;

  • When the probing identifies an error, a handler to handle it;

  • For a memory-mapped error record, its base address and size in KB; for a system register-accessed record, the start index of the record and the number of contiguous records from that index;

  • Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through one of these mechanisms, the RAS framework can iterate through and probe the error records for errors, and invoke the appropriate handler for each error found.
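The probe-then-handle iteration can be sketched in a few lines. This is a self-contained model, not the framework's code: struct record, probe_and_handle(), the demo callbacks, and the return conventions are all assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the framework's record descriptor (assumption). */
struct record {
    int (*probe)(const struct record *rec, int *probe_data);
    int (*handler)(const struct record *rec, int probe_data);
};

/*
 * Probe every record; for each one that reports an error, run its handler.
 * Returns the number of errors handled, or a negative handler error code.
 */
int probe_and_handle(const struct record *records, size_t count)
{
    int handled = 0;

    for (size_t i = 0; i < count; i++) {
        int probe_data = 0;

        if (records[i].probe(&records[i], &probe_data) == 0) {
            continue; /* No error in this record. */
        }

        int ret = records[i].handler(&records[i], probe_data);
        if (ret < 0) {
            return ret;
        }
        handled++;
    }
    return handled;
}

/* Demo callbacks: one clean record and one reporting an error at index 3. */
static int last_probe_data = -1;

static int probe_clean(const struct record *rec, int *probe_data)
{
    (void)rec;
    (void)probe_data;
    return 0;
}

static int probe_faulty(const struct record *rec, int *probe_data)
{
    (void)rec;
    *probe_data = 3;
    return 1;
}

static int handle_error(const struct record *rec, int probe_data)
{
    (void)rec;
    last_probe_data = probe_data; /* Remember what the probe passed along. */
    return 0;
}

static const struct record demo_records[] = {
    { probe_clean, handle_error },
    { probe_faulty, handle_error },
};
```

Calling probe_and_handle(demo_records, 2) runs the handler only for the second record, and the handler receives the value the probe stored in probe_data.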

The RAS framework provides macros to populate error record information. The macros are versioned, and the latest version as of this writing is 1. These macros create a structure of type struct err_record_info from their arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

typedef int (*err_record_probe_t)(const struct err_record_info *info, int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0 otherwise. The probe_data output parameter can be used to pass any useful information resulting from the probe to the error handler (see below). For example, it could return the index of the record.

The error handler must have the following prototype:

typedef int (*err_record_handler_t)(const struct err_record_info *info, int probe_data, const struct err_handler_data *const data);

The data constant parameter describes the various properties of the error, including the reason for the error, the exception syndrome, and also the flags, cookie, and handle parameters from the top-level exception handler.

The platform is expected to populate an array using the macros above, and register it with the RAS framework using the macro REGISTER_ERR_RECORD_INFO(), passing it the name of the array describing the records. Note that the macro must be used in the same file where the array is defined.
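As an illustration, a platform file might put these pieces together roughly as follows. So that the sketch builds on its own, the top half supplies simplified stand-ins for the framework's types and macros; their layouts and expansions, including the err_record_mappings symbol, are assumptions, and in a real port they come from the RAS framework headers. The plat_cpu_probe, plat_cpu_handler, and plat_err_records names are hypothetical.

```c
#include <stddef.h>

/* ---- Simplified stand-ins (assumptions) so the sketch builds alone ---- */
struct err_record_info;
struct err_handler_data; /* Left opaque: its fields are not needed here. */

typedef int (*err_record_probe_t)(const struct err_record_info *info,
                                  int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
                                    int probe_data,
                                    const struct err_handler_data *const data);

struct err_record_info {
    unsigned int idx_start;
    unsigned int num_idx;
    err_record_probe_t probe;
    err_record_handler_t handler;
    const void *aux_data;
};

struct err_record_mapping {
    const struct err_record_info *err_records;
    size_t num_err_records;
};

#define ERR_RECORD_SYSREG_V1(idx_start_, num_idx_, probe_, handler_, aux_) \
    { .idx_start = (idx_start_), .num_idx = (num_idx_),                    \
      .probe = (probe_), .handler = (handler_), .aux_data = (aux_) }

#define REGISTER_ERR_RECORD_INFO(array_)                            \
    const struct err_record_mapping err_record_mappings = {         \
        .err_records = (array_),                                    \
        .num_err_records = sizeof(array_) / sizeof((array_)[0]) }

/* ---- Platform code (hypothetical names) ---- */
static int plat_cpu_probe(const struct err_record_info *info, int *probe_data)
{
    (void)info;
    *probe_data = 0;
    return 0; /* A real probe would inspect the record's status. */
}

static int plat_cpu_handler(const struct err_record_info *info,
                            int probe_data,
                            const struct err_handler_data *const data)
{
    (void)info;
    (void)probe_data;
    (void)data;
    return 0; /* A real handler would log or report the error. */
}

/* One system register-accessed record starting at index 0. */
static struct err_record_info plat_err_records[] = {
    ERR_RECORD_SYSREG_V1(0, 1, plat_cpu_probe, plat_cpu_handler, NULL),
};

/* Same file as the array, as the text above requires. */
REGISTER_ERR_RECORD_INFO(plat_err_records);
```

The registration macro takes the array by name, which is why it has to appear in the file that defines the array: the element count is derived from the array's size at compile time.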

4.11.6.2.1.Standard Error Record helpers

The TF-A RAS framework provides probe handlers for Standard Error Records, for both memory-mapped and System Register accesses:

int ras_err_ser_probe_memmap(const struct err_record_info *info, int *probe_data);
int ras_err_ser_probe_sysreg(const struct err_record_info *info, int *probe_data);

When the platform enumerates error records, for those records in the Standard Error Record format, these helpers may be used instead of the platform writing its own. Both helpers above:

  • Return a non-zero value when an error is detected in a Standard Error Record;

  • Set probe_data to the index of the error record upon detecting an error.
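The contract of these helpers can be modelled as a scan over record status values. In the sketch below, the status array stands in for reading each record's status register, and ser_probe_model() is a hypothetical name; the only architectural detail used is that the V (valid) field of a Standard Error Record status register is bit 30. Whether the real helpers report an absolute or a relative index is not stated here, so the absolute index is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* ERR<n>STATUS.V (bit 30) is set when the record holds a valid error. */
#define ERR_STATUS_V_BIT (UINT64_C(1) << 30)

/*
 * Scan num_idx status values starting at idx_start. The status array is a
 * stand-in (assumption) for reading the records' status registers.
 * Returns 1 and stores the record index on the first valid error, else 0.
 */
int ser_probe_model(const uint64_t *status, unsigned int idx_start,
                    unsigned int num_idx, int *probe_data)
{
    for (unsigned int i = idx_start; i < idx_start + num_idx; i++) {
        if ((status[i] & ERR_STATUS_V_BIT) != 0U) {
            *probe_data = (int)i;
            return 1;
        }
    }
    return 0; /* probe_data is left untouched when nothing is found. */
}
```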

4.11.6.3.Registering RAS interrupts

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error Recovery interrupts. For the firmware-first handling paradigm for interrupts to work, the platform must set up and register with EHF. See Interaction with Exception Handling Framework.

For each RAS interrupt, the platform has to provide a structure of type struct ras_interrupt, containing:

  • Interrupt number;

  • The associated error record information (pointer to the corresponding struct err_record_info);

  • Optionally, a cookie.

The platform is expected to define an array of struct ras_interrupt, and register it with the RAS framework using the macro REGISTER_RAS_INTERRUPTS(), passing it the name of the array. Note that the macro must be used in the same file where the array is defined.

The array of struct ras_interrupt must be sorted in increasing order of interrupt number. This allows for fast lookup of handlers in order to service RAS interrupts.
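The reason for the ordering requirement is that a sorted table can be searched by bisection. The following sketch shows such a table and lookup; the field names of the stand-in struct ras_interrupt, the interrupt numbers, and find_ras_interrupt() are assumptions for illustration, not the framework's definitions.

```c
#include <stddef.h>

struct err_record_info; /* Opaque here: only a pointer to it is stored. */

/* Simplified stand-in for the framework's interrupt descriptor (assumption). */
struct ras_interrupt {
    unsigned int intr_number;
    struct err_record_info *err_record;
    void *cookie;
};

/* Must stay sorted by increasing interrupt number. Numbers are made up. */
static const struct ras_interrupt plat_ras_interrupts[] = {
    { .intr_number = 40U },
    { .intr_number = 45U },
    { .intr_number = 72U },
};

/* Binary search; returns NULL when intr is not a registered RAS interrupt. */
const struct ras_interrupt *find_ras_interrupt(const struct ras_interrupt *table,
                                               size_t count, unsigned int intr)
{
    size_t lo = 0;
    size_t hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].intr_number == intr) {
            return &table[mid];
        }
        if (table[mid].intr_number < intr) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return NULL;
}
```

If the table were not sorted, the search could skip over a registered interrupt and report it as unknown, which is why the ordering is a requirement rather than a hint.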

4.11.6.4.Double-fault handling

A Double Fault condition arises when an error is signalled to the PE while handling of a previously signalled error is still underway. When a Double Fault condition arises, the Arm RAS extensions only require the handler to perform an orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal with Double Fault conditions, specifically the NMEA and EASE bits in the SCR_EL3 register. These were introduced to assist EL3 software which runs part of its entry/exit routines with exceptions momentarily masked. In such systems, External Aborts/SErrors are not handled immediately when they occur, but only after the exceptions are unmasked again.

TF-A, for legacy reasons, executes all of its EL3 code with all exceptions unmasked. This means that all exceptions routed to EL3 are handled immediately. TF-A is thus able to detect Double Fault conditions in software, without needing the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler, which doesn't return.
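Software detection of a Double Fault amounts to remembering that an error is already being handled. The sketch below is a generic model of that idea and makes no claim about how TF-A records the state internally; every name in it is hypothetical, and the handler returns here only so that the model can be exercised.

```c
#include <stdbool.h>

/* Hypothetical model; names and structure are not TF-A's. */
static bool error_in_progress;
static bool double_fault_seen;

static void double_fault_handler(void)
{
    /* A real platform handler shuts the system down and never returns. */
    double_fault_seen = true;
}

/* Called when an error is signalled; returns false on a double fault. */
bool error_enter(void)
{
    if (error_in_progress) {
        double_fault_handler();
        return false;
    }
    error_in_progress = true;
    return true;
}

/* Called once handling of the current error has completed. */
void error_exit(void)
{
    error_in_progress = false;
}
```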

4.11.6.5.Engaging the RAS framework

Enabling RAS support is a platform choice.

The RAS support in TF-A introduces a default implementation of plat_ea_handler, the External Abort handler in EL3. When ENABLE_FEAT_RAS is set to 1, it'll first call the ras_ea_handler() function, which is the top-level RAS exception handler. ras_ea_handler is responsible for iterating through the platform-supplied error records, probing them, and, when an error is identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the plat_ea_handler function and intends to use the RAS framework, it must explicitly call ras_ea_handler() from within.
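A platform override might therefore look like the sketch below. The parameter list (reason, syndrome, cookie, handle, flags) and the "non-zero means handled" return convention are assumptions made here; check the framework headers for the actual prototypes. The ras_ea_handler() body is a stub so that the example builds on its own, and plat_unhandled_ea() is a hypothetical placeholder for the platform's fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Stub standing in for the framework's top-level handler (assumption). */
static int ras_calls;

int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                   void *handle, uint64_t flags)
{
    (void)ea_reason;
    (void)syndrome;
    (void)cookie;
    (void)handle;
    (void)flags;
    ras_calls++;
    return 1; /* Pretend a registered record handled the error. */
}

/* Hypothetical fallback; a real platform would typically panic here. */
static int unhandled_calls;

static void plat_unhandled_ea(void)
{
    unhandled_calls++;
}

/* Platform override of the External Abort handler. */
void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                     void *handle, uint64_t flags)
{
    /* Platform-specific work, if any, goes here. */

    /* Explicitly hand the abort to the RAS framework. */
    if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0) {
        return;
    }

    plat_unhandled_ea();
}
```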

Similarly, for RAS interrupts, the framework defines ras_interrupt_handler(). The RAS framework arranges for it to be invoked when a RAS interrupt is taken at EL3. The function bisects the platform-supplied sorted array of interrupts to look up the error record information associated with the interrupt number. The error handler for that record is then invoked to handle the error.

4.11.6.6.Interaction with Exception Handling Framework

As mentioned in earlier sections, the RAS framework interacts with the EHF to arbitrate handling of RAS exceptions with others that are routed to EL3. This means that the platform must partition a priority level for handling RAS exceptions. The platform must then define the macro PLAT_RAS_PRI to the priority level used for RAS exceptions. Platforms would typically want to allocate the highest secure priority for RAS handling.

Handling of both interrupt and non-interrupt exceptions follows the sequences outlined in the EHF documentation. I.e., for interrupts, the priority management is implicit; but for non-interrupt exceptions, it is explicit using EHF APIs.


Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.