Light-weight System Calls for IA-64

Started: 13-Jan-2003

Last update: 27-Sep-2003

David Mosberger-Tang <davidm@hpl.hp.com>

Using the “epc” instruction effectively introduces a new mode of execution to the ia64 linux kernel. We call this mode the “fsys-mode”. To recap, the normal states of execution are:

  • kernel mode:
    Both the register stack and the memory stack have been switched over to kernel memory. The user-level state is saved in a pt-regs structure at the top of the kernel memory stack.
  • user mode:
    Both the register stack and the memory stack are in user memory. The user-level state is contained in the CPU registers.
  • bank 0 interruption-handling mode:
    This is the non-interruptible state in which all interruption-handlers start execution. The user-level state remains in the CPU registers and some kernel state may be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  • execution is at privilege level 0 (most-privileged)
  • CPU registers may contain a mixture of user-level and kernel-level state (it is the responsibility of the kernel to ensure that no security-sensitive kernel-level state is leaked back to user-level)
  • execution is interruptible and preemptible (an fsys-mode handler can disable interrupts and avoid all other interruption-sources to avoid preemption)
  • neither the memory-stack nor the register-stack can be trusted while in fsys-mode (they point to the user-level stacks, which may be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode than it is to running in kernel-mode. Of course, given that the privilege level is at level 0, this means that fsys-mode requires some care (see below).

How to tell fsys-mode

Linux operates in fsys-mode when (a) the privilege level is 0 (most privileged) and (b) the stacks have NOT been switched to kernel memory yet. For convenience, the header file <asm-ia64/ptrace.h> provides three macros:

    user_mode(regs)
    user_stack(task,regs)
    fsys_mode(task,regs)

The “regs” argument is a pointer to a pt_regs structure. The “task” argument is a pointer to the task structure to which the “regs” pointer belongs. user_mode() returns TRUE if the CPU state pointed to by “regs” was executing in user mode (privilege level 3). user_stack() returns TRUE if the state pointed to by “regs” was executing on the user-level stack(s). Finally, fsys_mode() returns TRUE if the CPU state pointed to by “regs” was executing in fsys-mode. The fsys_mode() macro is equivalent to the expression:

!user_mode(regs) && user_stack(task,regs)
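The relationship between the three predicates can be sketched with a toy model in C. This is only an illustration: struct toy_regs and the toy_*() helpers are hypothetical stand-ins, not the kernel's pt_regs structure or the real macros from <asm-ia64/ptrace.h> (which also take a “task” argument that is dropped here for brevity).

```c
#include <stdbool.h>

/* Toy stand-in for the saved CPU state; NOT the kernel's struct pt_regs. */
struct toy_regs {
    unsigned cpl;         /* privilege level when interrupted: 0 (most privileged) .. 3 */
    bool on_user_stacks;  /* true if the stacks had not yet been switched to kernel memory */
};

/* Models user_mode(): the state was executing at privilege level 3. */
static bool toy_user_mode(const struct toy_regs *regs)
{
    return regs->cpl == 3;
}

/* Models user_stack(): the state was executing on the user-level stack(s). */
static bool toy_user_stack(const struct toy_regs *regs)
{
    return regs->on_user_stacks;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
static bool toy_fsys_mode(const struct toy_regs *regs)
{
    return !toy_user_mode(regs) && toy_user_stack(regs);
}
```

With this model, a state of {cpl 3, user stacks} is plain user mode, {cpl 0, kernel stacks} is kernel mode, and only {cpl 0, user stacks} counts as fsys-mode.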

How to write an fsyscall handler

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers (fsyscall_table). This table contains one entry for each system call. By default, a system call is handled by fsys_fallback_syscall(). This routine takes care of entering (full) kernel mode and calling the normal Linux system call handler. For performance-critical system calls, it is possible to write a hand-tuned fsyscall_handler. For example, fsys.S contains fsys_getpid(), which is a hand-tuned version of the getpid() system call.

The entry and exit-state of an fsyscall handler is as follows:

Machine state on entry to fsyscall handler

    r10      0
    r11      saved ar.pfs (a user-level value)
    r15      system call number
    r16      “current” task pointer (in normal kernel-mode, this is in r13)
    r32-r39  system call arguments
    b6       return address (a user-level value)
    ar.pfs   previous frame-state (a user-level value)
    PSR.be   cleared to zero (i.e., little-endian byte order is in effect)

    All other registers may contain values passed in from user-mode.

Required machine state on exit from fsyscall handler

    r11      saved ar.pfs (as passed into the fsyscall handler)
    r15      system call number (as passed into the fsyscall handler)
    r32-r39  system call arguments (as passed into the fsyscall handler)
    b6       return address (as passed into the fsyscall handler)
    ar.pfs   previous frame-state (as passed into the fsyscall handler)
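The exit requirement amounts to an invariant: the listed registers must hold the same values on exit as on entry. A minimal sketch of that invariant in C follows. The struct toy_fsys_state type and toy_state_preserved() are hypothetical names used only for illustration; real handlers are written in assembly and no such check exists in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the registers an fsyscall handler must preserve. */
struct toy_fsys_state {
    uint64_t r11;      /* saved ar.pfs */
    uint64_t r15;      /* system call number */
    uint64_t args[8];  /* r32-r39: system call arguments */
    uint64_t b6;       /* return address */
    uint64_t ar_pfs;   /* previous frame-state */
};

/* Returns true if every register in the list is unchanged between entry and exit. */
static bool toy_state_preserved(const struct toy_fsys_state *in,
                                const struct toy_fsys_state *out)
{
    return in->r11 == out->r11 &&
           in->r15 == out->r15 &&
           memcmp(in->args, out->args, sizeof in->args) == 0 &&
           in->b6 == out->b6 &&
           in->ar_pfs == out->ar_pfs;
}
```

Comparing field by field (rather than using memcmp() on the whole struct) keeps the check independent of any padding the compiler might add.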

Fsyscall handlers can execute with very little overhead, but with that speed comes a set of restrictions:

  • Fsyscall-handlers MUST check for any pending work in the flags member of the thread-info structure and if any of the TIF_ALLWORK_MASK flags are set, the handler needs to fall back on doing a full system call (by calling fsys_fallback_syscall).
  • Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, r15, b6, and ar.pfs) because they will be needed in case of a system call restart. Of course, all “preserved” registers also must be preserved, in accordance with the normal calling conventions.
  • Fsyscall-handlers MUST check argument registers for containing a NaT value before using them in any way that could trigger a NaT-consumption fault. If a system call argument is found to contain a NaT value, an fsyscall-handler may return immediately with r8=EINVAL, r10=-1.
  • Fsyscall-handlers MUST NOT use the “alloc” instruction or perform any other operation that would trigger mandatory RSE (register-stack engine) traffic.
  • Fsyscall-handlers MUST NOT write to any stacked registers because it is not safe to assume that user-level called a handler with the proper number of arguments.
  • Fsyscall-handlers need to be careful when accessing per-CPU variables: unless proper safe-guards are taken (e.g., interruptions are avoided), execution may be pre-empted and resumed on another CPU at any given time.
  • Fsyscall-handlers must be careful not to leak sensitive kernel information back to user-level. In particular, before returning to user-level, care needs to be taken to clear any scratch registers that could contain sensitive information (note that the current task pointer is not considered sensitive: it’s already exposed through ar.k6).
  • Fsyscall-handlers MUST NOT access user-memory without first validating access-permission (this can be done typically via probe.r.fault and/or probe.w.fault) and without guarding against memory access exceptions (this can be done with the EX() macros defined by asmmacro.h).
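The first and third rules above describe the decisions a handler makes before it may take the fast path. The following C sketch models just that control flow, under stated assumptions: real handlers are IA-64 assembly, TOY_ALLWORK_MASK has an invented value (the real TIF_ALLWORK_MASK differs), the order of the two checks is a choice made for this sketch, and the r10 = 0 success value is assumed from the usual ia64 return convention rather than stated in this document.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_ALLWORK_MASK 0x0fu  /* hypothetical value, for illustration only */

enum toy_outcome { TOY_FAST_PATH, TOY_FALLBACK, TOY_NAT_ERROR };

struct toy_result {
    enum toy_outcome outcome;
    long r8;   /* return value, or error code on failure */
    long r10;  /* -1 on error; 0 on success (assumed convention) */
};

/* Models the checks an fsyscall handler performs on entry. */
static struct toy_result toy_fsys_entry(unsigned thread_flags, bool arg_is_nat,
                                        long fast_value)
{
    struct toy_result res = { TOY_FALLBACK, 0, 0 };

    /* Pending work: hand over to the full system call path. */
    if (thread_flags & TOY_ALLWORK_MASK)
        return res;

    /* A NaT argument must be rejected before it is consumed. */
    if (arg_is_nat) {
        res.outcome = TOY_NAT_ERROR;
        res.r8 = EINVAL;
        res.r10 = -1;
        return res;
    }

    /* Nothing stands in the way of the light-weight path. */
    res.outcome = TOY_FAST_PATH;
    res.r8 = fast_value;
    res.r10 = 0;
    return res;
}
```

The model leaves out everything that only makes sense in assembly (no “alloc”, no writes to stacked registers, clearing scratch registers before return), so it should be read as a summary of the decision logic, not as a template for a handler.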

The above restrictions may seem draconian, but remember that it’s possible to trade off some of the restrictions by paying a slightly higher overhead. For example, if an fsyscall-handler could benefit from the shadow register bank, it could temporarily disable PSR.i and PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as needed. In other words, following the above rules yields extremely fast system call execution (while fully preserving system call semantics), but there is also a lot of flexibility in handling more complicated cases.

Signal handling

The delivery of (asynchronous) signals must be delayed until fsys-mode is exited. This is accomplished with the help of the lower-privilege transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() checks whether the interrupted task was in fsys-mode and, if so, sets PSR.lp and returns immediately. When fsys-mode is exited via the “br.ret” instruction that lowers the privilege level, a trap will occur. The trap handler clears PSR.lp again and returns immediately. The kernel exit path then checks for and delivers any pending signals.
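The sequence can be pictured as a small state machine. The C model below is a deliberately simplified sketch: struct toy_cpu and both functions are hypothetical, the trap and the kernel exit path are collapsed into one function, and nothing here corresponds to actual code in process.c.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved in deferring a signal. */
struct toy_cpu {
    bool in_fsys_mode;
    bool psr_lp;            /* lower-privilege transfer trap enabled */
    bool signal_pending;
    int signals_delivered;
};

/* Models the notify-resume check: defer while in fsys-mode, else deliver. */
static void toy_notify_resume(struct toy_cpu *cpu)
{
    if (!cpu->signal_pending)
        return;
    if (cpu->in_fsys_mode) {
        cpu->psr_lp = true;  /* arrange for a trap when privilege is lowered */
        return;
    }
    cpu->signal_pending = false;
    cpu->signals_delivered++;
}

/* Models the "br.ret" that leaves fsys-mode and lowers the privilege level. */
static void toy_fsys_return(struct toy_cpu *cpu)
{
    cpu->in_fsys_mode = false;
    if (cpu->psr_lp) {
        cpu->psr_lp = false;     /* trap handler clears PSR.lp again */
        toy_notify_resume(cpu);  /* exit path delivers pending signals */
    }
}
```

In this model a signal that arrives during fsys-mode leaves signals_delivered unchanged until toy_fsys_return() runs, which mirrors the delay described above.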

PSR Handling

The “epc” instruction doesn’t change the contents of PSR at all. This is in contrast to a regular interruption, which clears almost all bits. Because of that, some care needs to be taken to ensure things work as expected. The following discussion describes how each PSR bit is handled.

    PSR.be   Cleared when entering fsys-mode. A srlz.d instruction is used
             to ensure the CPU is in little-endian mode before the first
             load/store instruction is executed. PSR.be is normally NOT
             restored upon return from an fsys-mode handler. In other
             words, user-level code must not rely on PSR.be being preserved
             across a system call.
    PSR.up   Unchanged.
    PSR.ac   Unchanged.
    PSR.mfl  Unchanged. Note: fsys-mode handlers must not write
             floating-point registers!
    PSR.mfh  Unchanged. Note: fsys-mode handlers must not write
             floating-point registers!
    PSR.ic   Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
    PSR.i    Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
    PSR.pk   Unchanged.
    PSR.dt   Unchanged.
    PSR.dfl  Unchanged. Note: fsys-mode handlers must not write
             floating-point registers!
    PSR.dfh  Unchanged. Note: fsys-mode handlers must not write
             floating-point registers!
    PSR.sp   Unchanged.
    PSR.pp   Unchanged.
    PSR.di   Unchanged.
    PSR.si   Unchanged.
    PSR.db   Unchanged. The kernel prevents user-level from setting a hardware
             breakpoint that triggers at any privilege level other than
             3 (user-mode).
    PSR.lp   Unchanged.
    PSR.tb   Lazy redirect. If a taken-branch trap occurs while in
             fsys-mode, the trap-handler modifies the saved machine state
             such that execution resumes in the gate page at
             syscall_via_break(), with privilege level 3. Note: the
             taken branch would occur on the branch invoking the
             fsyscall-handler, at which point, by definition, a syscall
             restart is still safe. If the system call number is invalid,
             the fsys-mode handler will return directly to user-level. This
             return will trigger a taken-branch trap, but since the trap is
             taken _after_ restoring the privilege level, the CPU has already
             left fsys-mode, so no special treatment is needed.
    PSR.rt   Unchanged.
    PSR.cpl  Cleared to 0.
    PSR.is   Unchanged (guaranteed to be 0 on entry to the gate page).
    PSR.mc   Unchanged.
    PSR.it   Unchanged (guaranteed to be 1).
    PSR.id   Unchanged. Note: the ia64 linux kernel never sets this bit.
    PSR.da   Unchanged. Note: the ia64 linux kernel never sets this bit.
    PSR.dd   Unchanged. Note: the ia64 linux kernel never sets this bit.
    PSR.ss   Lazy redirect. If set, “epc” will cause a Single Step Trap to
             be taken. The trap handler then modifies the saved machine
             state such that execution resumes in the gate page at
             syscall_via_break(), with privilege level 3.
    PSR.ri   Unchanged.
    PSR.ed   Unchanged. Note: This bit could only have an effect if an
             fsys-mode handler performed a speculative load that gets
             NaTted. If so, this would be the normal & expected behavior,
             so no special treatment is needed.
    PSR.bn   Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
             Doing so requires clearing PSR.i and PSR.ic as well.
    PSR.ia   Unchanged. Note: the ia64 linux kernel never sets this bit.

Using fast system calls

To use fast system calls, userspace applications simply need to call __kernel_syscall_via_epc(). For example:

– example fgettimeofday() call –

– fgettimeofday.S –

    #include <asm/asmmacro.h>

    GLOBAL_ENTRY(fgettimeofday)
    .prologue
    .save ar.pfs, r11
    mov r11 = ar.pfs
    .body

    mov r2 = 0xa000000000020660;;  // gate address
                                   // found by inspection of System.map for the
                                   // __kernel_syscall_via_epc() function.  See
                                   // below for how to do this for real.

    mov b7 = r2
    mov r15 = 1087                 // gettimeofday syscall
    ;;
    br.call.sptk.many b6 = b7
    ;;

    .restore sp

    mov ar.pfs = r11
    br.ret.sptk.many rp;;          // return to caller
    END(fgettimeofday)

– end fgettimeofday.S –

In reality, getting the gate address is accomplished by two extra values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

  • AT_SYSINFO : is the address of __kernel_syscall_via_epc()
  • AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at the gate page. It is a proper ELF shared object so, with a dynamic loader that recognises the library, you should be able to make calls to the exported functions within it as with any other shared library. AT_SYSINFO points into the kernel DSO at the __kernel_syscall_via_epc() function for historical reasons (it was used before the kernel DSO) and as a convenience.
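As a rough sketch of how a program might read these two values today, the C fragment below uses getauxval() from <sys/auxv.h>. This is an assumption layered on top of the text: getauxval() is a glibc interface (2.16 and later) that this document does not mention, and the helper names gate_entry_point(), gate_dso_base(), is_elf_image() and print_gate_info() are invented for the example. getauxval() returns 0 for an entry that is not present, which is what to expect for AT_SYSINFO on architectures that do not supply it.

```c
#include <elf.h>       /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/auxv.h>  /* getauxval(), glibc 2.16 and later */

/* Address stored in AT_SYSINFO, or 0 if the entry is absent. */
static unsigned long gate_entry_point(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the kernel-provided ELF DSO, or 0 if absent. */
static unsigned long gate_dso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the memory at 'base' starts with the ELF magic bytes. */
static int is_elf_image(unsigned long base)
{
    return memcmp((const void *)(uintptr_t)base, ELFMAG, SELFMAG) == 0;
}

/* Prints both values; intended to be called from a program's main(). */
static void print_gate_info(void)
{
    printf("AT_SYSINFO      = %#lx\n", gate_entry_point());
    printf("AT_SYSINFO_EHDR = %#lx\n", gate_dso_base());
}
```

Checking the ELF magic at the AT_SYSINFO_EHDR address is a cheap sanity test before handing the image to anything that parses its dynamic symbol table.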