Entry/exit handling for exceptions, interrupts, syscalls and KVM
All transitions between execution domains require state updates which are subject to strict ordering constraints. State updates are required for the following:
Lockdep
RCU / Context tracking
Preemption counter
Tracing
Time accounting
The update order depends on the transition type and is explained below in the transition type sections: Syscalls, KVM, Interrupts and regular exceptions, NMI and NMI-like exceptions.
Non-instrumentable code - noinstr
Most instrumentation facilities depend on RCU, so instrumentation is prohibited for entry code before RCU starts watching and exit code after RCU stops watching. In addition, many architectures must save and restore register state, which means that (for example) a breakpoint in the breakpoint entry code would overwrite the debug registers of the initial breakpoint.
Such code must be marked with the ‘noinstr’ attribute, placing that code into a special section inaccessible to instrumentation and debug facilities. Some functions are partially instrumentable, which is handled by marking them noinstr and using instrumentation_begin() and instrumentation_end() to flag the instrumentable ranges of code:
    noinstr void entry(void)
    {
        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
        ...
        instrumentation_begin();
        handle_context();   // <-- instrumentable code
        instrumentation_end();
        ...
        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
    }
This allows verification of the ‘noinstr’ restrictions via objtool on supported architectures.
Invoking non-instrumentable functions from instrumentable context has no restrictions and is useful to protect e.g. state switching which would cause malfunction if instrumented.
All non-instrumentable entry/exit code sections before and after the RCU state transitions must run with interrupts disabled.
Syscalls
Syscall-entry code starts in assembly code and calls out into low-level C code after establishing low-level architecture-specific state and stack frames. This low-level C code must not be instrumented. A typical syscall handling function invoked from low-level assembly code looks like this:
    noinstr void syscall(struct pt_regs *regs, int nr)
    {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
            result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
    }
syscall_enter_from_user_mode() first invokes enter_from_user_mode() which establishes state in the following order:
Lockdep
RCU / Context tracking
Tracing
and then invokes the various entry work functions like ptrace, seccomp, audit, syscall tracing, etc. After all that is done, the instrumentable invoke_syscall function can be invoked. The instrumentable code section then ends, after which syscall_exit_to_user_mode() is invoked.
syscall_exit_to_user_mode() handles all work which needs to be done before returning to user space like tracing, audit, signals, task work etc. After that it invokes exit_to_user_mode() which again handles the state transition in the reverse order:
Tracing
RCU / Context tracking
Lockdep
syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also available as fine-grained subfunctions in cases where the architecture code has to do extra work between the various steps. In such cases it has to ensure that enter_from_user_mode() is called first on entry and exit_to_user_mode() is called last on exit.
Do not nest syscalls. Nested syscalls will cause RCU and/or context trackingto print a warning.
KVM
Entering or exiting guest mode is very similar to syscalls. From the host kernel point of view the CPU goes off into user space when entering the guest and returns to the kernel on exit.
guest_state_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() and guest_state_exit_irqoff() is the KVM variant of enter_from_user_mode(). The state operations have the same ordering.
Task work handling is done separately for guests at the boundary of the vcpu_run() loop via xfer_to_guest_mode_handle_work(), which is a subset of the work handled on return to user space.
Do not nest KVM entry/exit transitions because doing so is nonsensical.
Interrupts and regular exceptions
Interrupt entry and exit handling is slightly more complex than syscall and KVM transitions.
If an interrupt is raised while the CPU executes in user space, the entryand exit handling is exactly the same as for syscalls.
If the interrupt is raised while the CPU executes in kernel space the entry and exit handling is slightly different. RCU state is only updated when the interrupt is raised in the context of the CPU’s idle task. Otherwise, RCU will already be watching. Lockdep and tracing have to be updated unconditionally.
irqentry_enter() and irqentry_exit() provide the implementation for this.
The architecture-specific part looks similar to syscall handling:
    noinstr void interrupt(struct pt_regs *regs, int nr)
    {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();
        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();
        instrumentation_end();

        irqentry_exit(regs, state);
    }
Note that the invocation of the actual interrupt handler is within an irq_enter_rcu() and irq_exit_rcu() pair.
irq_enter_rcu() updates the preemption count which makes in_hardirq() return true, handles NOHZ tick state and interrupt time accounting. This means that up to the point where irq_enter_rcu() is invoked, in_hardirq() returns false.
irq_exit_rcu() handles interrupt time accounting, undoes the preemption count update and eventually handles soft interrupts and NOHZ tick state.
In theory, the preemption count could be updated in irqentry_enter(). In practice, deferring this update to irq_enter_rcu() allows the preemption-count code to be traced, while also maintaining symmetry with irq_exit_rcu() and irqentry_exit(), which are described in the next paragraph. The only downside is that the early entry code up to irq_enter_rcu() must be aware that the preemption count has not yet been updated with the HARDIRQ_OFFSET state.
Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count before it handles soft interrupts, whose handlers must run in BH context rather than irq-disabled context. In addition, irqentry_exit() might schedule, which also requires that HARDIRQ_OFFSET has been removed from the preemption count.
Even though interrupt handlers are expected to run with local interrupts disabled, interrupt nesting is common from an entry/exit perspective. For example, softirq handling happens within an irqentry_{enter,exit}() block with local interrupts enabled. Also, although uncommon, nothing prevents an interrupt handler from re-enabling interrupts.
Interrupt entry/exit code doesn’t strictly need to handle reentrancy, since it runs with local interrupts disabled. But NMIs can happen anytime, and a lot of the entry code is shared between the two.
NMI and NMI-like exceptions
NMIs and NMI-like exceptions (machine checks, double faults, debug interrupts, etc.) can hit any context and must be extra careful with the state.
State changes for debug exceptions and machine-check exceptions depend on whether these exceptions happened in user space (breakpoints or watchpoints) or in kernel mode (code patching). From user space, they are treated like interrupts, while from kernel mode they are treated like NMIs.
NMIs and other NMI-like exceptions handle state transitions withoutdistinguishing between user-mode and kernel-mode origin.
The state update on entry is handled in irqentry_nmi_enter() which updates state in the following order:
Preemption counter
Lockdep
RCU / Context tracking
Tracing
The exit counterpart irqentry_nmi_exit() does the reverse operation in the reverse order.
Note that the update of the preemption counter has to be the first operation on enter and the last operation on exit. The reason is that both lockdep and RCU rely on in_nmi() returning true in this case. The preemption count modification in the NMI entry/exit case must not be traced.
Architecture-specific code looks like this:
    noinstr void nmi(struct pt_regs *regs)
    {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
    }
and for e.g. a debug exception it can look like this:
    noinstr void debug(struct pt_regs *regs)
    {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
            state = irqentry_enter(regs);

            instrumentation_begin();
            user_mode_debug_handler(regs, debug_regs);
            instrumentation_end();

            irqentry_exit(regs, state);
        } else {
            state = irqentry_nmi_enter(regs);

            instrumentation_begin();
            kernel_mode_debug_handler(regs, debug_regs);
            instrumentation_end();

            irqentry_nmi_exit(regs, state);
        }
    }
There is no combined irqentry_nmi_if_kernel() function available as the above cannot be handled in an exception-agnostic way.
NMIs can happen in any context; for example, an NMI-like exception can be triggered while handling an NMI. So NMI entry code has to be reentrant and state updates need to handle nesting.