How realtime kernels differ

Author:

Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface

With forced-threaded interrupts and sleeping spin locks, code paths that previously caused long scheduling latencies have been made preemptible and moved into process context. This allows the scheduler to manage them more effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking

Spinning locks such as spinlock_t are used to provide synchronization for data structures accessed from both interrupt context and process context. For this reason, locking functions are also available with the _irq() or _irqsave() suffixes, which disable interrupts before acquiring the lock. This ensures that the lock can be safely acquired in process context when interrupts are enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer run in hard IRQ context. As a result, there is no need to disable interrupts as part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves traditional semantics: it disables preemption and, when used with _irq() or _irqsave(), also disables interrupts. This ensures proper synchronization in critical sections that must remain non-preemptible or run with interrupts disabled.
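
As an illustration, here is a minimal sketch (lock and function names are hypothetical) contrasting the two lock types; the comments describe how each behaves on the two kernel flavours::

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(foo_lock);          /* hypothetical data lock */
  static DEFINE_RAW_SPINLOCK(foo_hw_lock);   /* hypothetical low-level lock */

  static void foo_update(void)
  {
          unsigned long flags;

          /*
           * Non-PREEMPT_RT: disables interrupts and preemption.
           * PREEMPT_RT: acquires a sleeping lock; interrupts stay
           * enabled, which is safe because the interrupt handler runs
           * in a preemptible thread.
           */
          spin_lock_irqsave(&foo_lock, flags);
          /* ... data shared with the (threaded) interrupt handler ... */
          spin_unlock_irqrestore(&foo_lock, flags);

          /*
           * raw_spinlock_t keeps the traditional semantics on both
           * kernel flavours: preemption and interrupts are disabled
           * for the duration of the critical section.
           */
          raw_spin_lock_irqsave(&foo_hw_lock, flags);
          /* ... short, bounded, non-preemptible critical section ... */
          raw_spin_unlock_irqrestore(&foo_hw_lock, flags);
  }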

Execution context

Interrupt handling in a PREEMPT_RT system is invoked in process context through the use of threaded interrupts. Other parts of the kernel also shift their execution into threaded context by different mechanisms. The goal is to keep execution paths preemptible, allowing the scheduler to interrupt them when a higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to threaded, preemptible execution.

Interrupt handling

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those registered using request_threaded_irq() and providing only a threaded handler. Its purpose is to keep the interrupt line masked until the threaded handler has completed.

If a primary handler is also provided in this case, it is essential that the handler does not acquire any sleeping locks, as it will not be threaded. The handler should be minimal and must avoid introducing delays, such as busy-waiting on hardware registers.
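
For illustration, a minimal sketch of such a registration (device name and handler are hypothetical): only a threaded handler is provided, so IRQF_ONESHOT is required to keep the line masked until it completes::

  #include <linux/interrupt.h>

  /* Hypothetical threaded handler: runs in process context, may sleep. */
  static irqreturn_t foo_thread_fn(int irq, void *dev_id)
  {
          /* ... access the device, sleeping locks are allowed here ... */
          return IRQ_HANDLED;
  }

  static int foo_setup_irq(unsigned int irq, void *dev_id)
  {
          /*
           * No primary handler is provided (NULL), therefore
           * IRQF_ONESHOT must be set: the interrupt line stays masked
           * until foo_thread_fn() has completed.
           */
          return request_threaded_irq(irq, NULL, foo_thread_fn,
                                      IRQF_ONESHOT, "foo", dev_id);
  }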

Soft interrupts, bottom half handling

Soft interrupts are raised by the interrupt handler and are executed after the handler returns. Since they run in thread context, they can be preempted by other threads. Do not assume that softirq context runs with preemption disabled. This means you must not rely on mechanisms like local_bh_disable() in process context to protect per-CPU variables. Because softirq handlers are preemptible under PREEMPT_RT, this approach does not provide reliable synchronization.

If this kind of protection is required for performance reasons, consider using local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier for readers and maintainers to understand.
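
A minimal sketch of the pattern, using hypothetical per-CPU data; the embedded local_lock_t makes the protected scope visible in the code::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* Hypothetical per-CPU data accessed from bottom-half context. */
  struct foo_bh_data {
          local_lock_t    bh_lock;
          u64             packets;
  };

  static DEFINE_PER_CPU(struct foo_bh_data, foo_bh_data) = {
          .bh_lock = INIT_LOCAL_LOCK(bh_lock),
  };

  /* Invoked with bottom halves disabled, e.g. from softirq context. */
  static void foo_count_packet(void)
  {
          /*
           * Non-PREEMPT_RT: lockdep verifies that bottom halves are
           * disabled. PREEMPT_RT: acquires a per-CPU lock so the
           * section is properly serialized despite being preemptible.
           */
          local_lock_nested_bh(&foo_bh_data.bh_lock);
          this_cpu_inc(foo_bh_data.packets);
          local_unlock_nested_bh(&foo_bh_data.bh_lock);
  }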

per-CPU variables

Protecting access to per-CPU variables solely by using preempt_disable() should be avoided, especially if the critical section has unbounded runtime or may call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons, consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies that the lock is only acquired in process context and never from softirq or hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t, which provides safe local protection for per-CPU data while keeping the system preemptible.
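
For illustration, a sketch with hypothetical per-CPU statistics protected by a local_lock_t::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* Hypothetical per-CPU statistics. */
  struct foo_stats {
          local_lock_t    lock;
          u64             events;
  };

  static DEFINE_PER_CPU(struct foo_stats, foo_stats) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void foo_account_event(void)
  {
          /*
           * Non-PREEMPT_RT: disables preemption, plus lockdep checks.
           * PREEMPT_RT: acquires the per-CPU spinlock_t and remains
           * preemptible.
           */
          local_lock(&foo_stats.lock);
          this_cpu_inc(foo_stats.events);
          local_unlock(&foo_stats.lock);
  }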

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used to protect per-CPU data by relying on implicit preemption disabling. If this inherited preemption disabling is essential, and if local_lock_t cannot be used due to performance constraints, brevity of the code, or abstraction boundaries within an API, then preempt_disable_nested() may be a suitable alternative. On non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already disabled. On PREEMPT_RT, it explicitly disables preemption.
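
A sketch of that alternative, assuming a hypothetical helper whose callers already run with preemption disabled on non-PREEMPT_RT kernels::

  #include <linux/preempt.h>
  #include <linux/percpu.h>

  static DEFINE_PER_CPU(u64, foo_counter);   /* hypothetical */

  /*
   * Hypothetical helper, documented to be called with preemption
   * already disabled on non-PREEMPT_RT kernels.
   */
  static void foo_counter_inc(void)
  {
          /*
           * Non-PREEMPT_RT: no-op apart from a lockdep assertion that
           * preemption is already disabled.
           * PREEMPT_RT: explicitly disables preemption.
           */
          preempt_disable_nested();
          __this_cpu_inc(foo_counter);
          preempt_enable_nested();
  }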

Timers

By default, an hrtimer is executed in hard interrupt context. The exception is timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in softirq context by default, typically within the ktimersd thread. This thread runs at the lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks but does not interfere with higher-priority real-time threads. To explicitly request execution in hard interrupt context on PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
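
For example, a timer that must expire in hard interrupt context even on PREEMPT_RT could be set up as sketched below (names are hypothetical; recent kernels also provide hrtimer_setup() as a combined initializer)::

  #include <linux/hrtimer.h>
  #include <linux/ktime.h>

  static struct hrtimer foo_timer;   /* hypothetical */

  static enum hrtimer_restart foo_timer_fn(struct hrtimer *t)
  {
          /*
           * Runs in hard interrupt context even on PREEMPT_RT because
           * of HRTIMER_MODE_REL_HARD: it must be short and bounded,
           * and must not use sleeping locks.
           */
          return HRTIMER_NORESTART;
  }

  static void foo_timer_start(void)
  {
          hrtimer_init(&foo_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
          foo_timer.function = foo_timer_fn;
          hrtimer_start(&foo_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);
  }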

Memory allocation

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is necessary to use GFP_ATOMIC when allocating memory from interrupt context or from sections where preemption is disabled. This is because the allocator must not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory allocator in PREEMPT_RT uses sleeping locks internally, which cannot be acquired when preemption is disabled. Fortunately, this is generally not a problem, because PREEMPT_RT moves most contexts that would traditionally run with preemption or interrupts disabled into threaded context, where sleeping is allowed.

What remains problematic is code that explicitly disables preemption or interrupts. In such cases, memory allocation must be performed outside the critical section.

This restriction also applies to memory deallocation routines such as kfree() and free_pages(), which may also involve internal locking and must not be called from non-preemptible contexts.
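
A sketch of the resulting pattern, using a hypothetical list protected by a raw_spinlock_t; both allocation and deallocation happen outside the non-preemptible section::

  #include <linux/slab.h>
  #include <linux/spinlock.h>

  /* Hypothetical list protected by a raw_spinlock_t. */
  struct foo_item {
          struct foo_item *next;
          int val;
  };

  static DEFINE_RAW_SPINLOCK(foo_lock);
  static struct foo_item *foo_list;

  static int foo_add(int val)
  {
          struct foo_item *item;
          unsigned long flags;

          /* Allocate before entering the non-preemptible section. */
          item = kmalloc(sizeof(*item), GFP_KERNEL);
          if (!item)
                  return -ENOMEM;
          item->val = val;

          raw_spin_lock_irqsave(&foo_lock, flags);
          item->next = foo_list;
          foo_list = item;
          raw_spin_unlock_irqrestore(&foo_lock, flags);
          return 0;
  }

  static void foo_del_first(void)
  {
          struct foo_item *item;
          unsigned long flags;

          raw_spin_lock_irqsave(&foo_lock, flags);
          item = foo_list;
          if (item)
                  foo_list = item->next;
          raw_spin_unlock_irqrestore(&foo_lock, flags);

          /* Free only after the raw lock has been dropped. */
          kfree(item);
  }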

IRQ work

The irq_work API provides a mechanism to schedule a callback in interrupt context. It is designed for use in contexts where traditional scheduling is not possible, such as from within NMI handlers or from inside the scheduler, where using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks may acquire sleeping locks or have unbounded execution time, they are handled in thread context by a per-CPU irq_work kernel thread. This thread runs at the lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be deferred until the next timer tick and are also executed by the per-CPU irq_work thread.
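
For illustration, a sketch with hypothetical work items showing the default behaviour and the IRQ_WORK_HARD_IRQ variant::

  #include <linux/irq_work.h>

  static void foo_work_fn(struct irq_work *work)
  {
          /*
           * For foo_work below: runs in the per-CPU irq_work thread on
           * PREEMPT_RT (where sleeping locks are allowed), in hard
           * interrupt context otherwise. For foo_hard_work it always
           * runs in hard interrupt context and must not sleep.
           */
  }

  /* Default mode: threaded on PREEMPT_RT. */
  static struct irq_work foo_work = IRQ_WORK_INIT(foo_work_fn);

  /* Always hard interrupt context, on both kernel flavours. */
  static struct irq_work foo_hard_work = IRQ_WORK_INIT_HARD(foo_work_fn);

  static void foo_trigger(void)
  {
          irq_work_queue(&foo_work);
          irq_work_queue(&foo_hard_work);
  }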

RCU callbacks

RCU callbacks are invoked by default in softirq context. Their execution is important because, depending on the use case, they either free memory or ensure progress in state transitions. Running these callbacks as part of the softirq chain can lead to undesired situations, such as contention for CPU resources with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a mechanism to execute them in process context instead. This behavior can be enabled by setting the boot command-line parameter rcutree.use_softirq=0. This setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready

The “spin until ready” pattern involves repeatedly checking (spinning on) the state of a data structure until it becomes available. This pattern assumes that preemption, soft interrupts, or interrupts are disabled. If the data structure is marked busy, it is presumed to be in use by another CPU, and spinning should eventually succeed as that CPU makes progress.

Examples include hrtimer_cancel() and timer_delete_sync(). These functions cancel timers that execute with interrupts or soft interrupts disabled. If a thread attempts to cancel a timer and finds it active, spinning until the callback completes is safe because the callback can only run on another CPU and will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This introduces a challenge: a higher-priority thread attempting to cancel the timer may preempt the timer callback thread. Since the scheduler cannot migrate the callback thread to another CPU due to affinity constraints, spinning can result in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake mechanism that supports priority inheritance. This allows the canceling thread to suspend until the callback completes, ensuring forward progress without risking livelock.

To solve the problem at the API level, the sequence locks were extended to allow a proper handover between the spinning reader and the possibly blocked writer.

Sequence locks

Sequence counters and sequential locks are documented in Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the writer and spinning reader contexts. This is achieved by embedding the writer serialization lock directly into the sequence counter type, resulting in composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively boost the writer’s priority to help it complete its update instead of spinning and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the reader with the writer during updates. The writer must ensure its update is serialized and non-preemptible relative to the reader. This cannot be achieved using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
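
A minimal sketch of that pattern with hypothetical data: the writer serializes updates with the embedded spinlock_t, and the reader retries until it observes a consistent snapshot::

  #include <linux/seqlock.h>
  #include <linux/spinlock.h>

  /* Hypothetical pair of values that must be read consistently. */
  static struct {
          spinlock_t              lock;
          seqcount_spinlock_t     seq;
          u64                     a, b;
  } foo = {
          .lock = __SPIN_LOCK_UNLOCKED(foo.lock),
          .seq  = SEQCNT_SPINLOCK_ZERO(foo.seq, &foo.lock),
  };

  static void foo_write(u64 a, u64 b)
  {
          spin_lock(&foo.lock);
          write_seqcount_begin(&foo.seq);
          foo.a = a;
          foo.b = b;
          write_seqcount_end(&foo.seq);
          spin_unlock(&foo.lock);
  }

  static void foo_read(u64 *a, u64 *b)
  {
          unsigned int seq;

          do {
                  /* On PREEMPT_RT this can boost a preempted writer. */
                  seq = read_seqcount_begin(&foo.seq);
                  *a = foo.a;
                  *b = foo.b;
          } while (read_seqcount_retry(&foo.seq, seq));
  }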

However, if there is no spinning involved, i.e., if the reader only needs to detect whether a write has started and does not need to serialize against it, then using seqcount_t is reasonable.