Syscall User Dispatch

Background

Compatibility layers like Wine need a way to efficiently emulate system calls of only a part of their process - the part that has the incompatible code - while being able to execute native syscalls without a high performance penalty on the native part of the process. Seccomp falls short on this task, since it has limited support to efficiently filter syscalls based on memory regions, and it doesn’t support removing filters. Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher address back to userspace. The application is in control of a flip switch, indicating the current personality of the process. A multiple-personality application can then flip the switch without invoking the kernel, when crossing the compatibility layer API boundaries, to enable/disable the syscall redirection and execute syscalls directly (disabled) or send them to be emulated in userspace through a SIGSYS.

The goal of this design is to provide very quick compatibility layer boundary crosses, which is achieved by not executing a syscall to change personality every time the compatibility layer executes. Instead, a userspace memory region exposed to the kernel indicates the current personality, and the application simply modifies that variable to configure the mechanism.

There is a relatively high cost associated with handling signals on most architectures, like x86, but at least for Wine, syscalls issued by native Windows code are currently not known to be a performance problem, since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by non-native applications, it must function on syscalls whose invocation ABI is completely unexpected to Linux. Syscall User Dispatch therefore doesn’t rely on any part of the syscall ABI to do the filtering. It uses only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface

A thread can set up this mechanism on supported kernels by executing the following prctl:

prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

For PR_SYS_DISPATCH_EXCLUSIVE_ON, [<offset>, <offset>+<length>) delimits a memory region interval from which syscalls are always executed directly, regardless of the userspace selector. This provides a fast path for the C library, which includes the most common syscall dispatchers in native code applications, and also provides a way for the signal handler to return without triggering a nested SIGSYS on (rt_)sigreturn. Users of this interface should make sure that at least the signal trampoline code is included in this region. In addition, for syscalls that implement the trampoline code on the vDSO, that trampoline is never intercepted.

For PR_SYS_DISPATCH_INCLUSIVE_ON, [<offset>, <offset>+<length>) delimits a memory region interval from which syscalls are dispatched based on the userspace selector. Syscalls from outside of the range are always executed directly.
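The range rules for the two modes, combined with the selector byte passed as the last prctl argument, can be summarized as a small decision function. The following is only a userspace model for illustration, not kernel code: the names sud_mode, sud_outcome, sud_in_range and sud_decide are hypothetical, the selector values 0 and 1 are assumed to mirror SYSCALL_DISPATCH_FILTER_ALLOW and SYSCALL_DISPATCH_FILTER_BLOCK from <linux/prctl.h>, and the vDSO trampoline exemption is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; they are not kernel identifiers. */
enum sud_mode { SUD_MODE_EXCLUSIVE, SUD_MODE_INCLUSIVE };
enum sud_outcome { SUD_EXECUTE, SUD_DISPATCH_SIGSYS, SUD_KILL };

/* Assumed to mirror SYSCALL_DISPATCH_FILTER_ALLOW / _BLOCK. */
#define SUD_FILTER_ALLOW 0
#define SUD_FILTER_BLOCK 1

/* True when ip lies in [offset, offset + length). A single unsigned
 * comparison is used, so offset + length is never computed. */
bool sud_in_range(uint64_t ip, uint64_t offset, uint64_t length)
{
    return ip - offset < length;
}

/* Model of what happens to a syscall issued from address ip. */
enum sud_outcome sud_decide(enum sud_mode mode, uint64_t offset,
                            uint64_t length, uint64_t ip, char selector)
{
    bool inside = sud_in_range(ip, offset, length);

    /* Exclusive: the range is exempt. Inclusive: everything else is. */
    if (mode == SUD_MODE_EXCLUSIVE ? inside : !inside)
        return SUD_EXECUTE;

    /* Not exempt: the selector byte decides. */
    if (selector == SUD_FILTER_ALLOW)
        return SUD_EXECUTE;
    if (selector == SUD_FILTER_BLOCK)
        return SUD_DISPATCH_SIGSYS;
    return SUD_KILL; /* any other selector value */
}
```

For example, with an exclusive range of [0x1000, 0x1100) and the selector set to block, a syscall issued from 0x1080 runs directly in this model, while one issued from 0x2000 is turned into a SIGSYS.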

[selector] is a pointer to a char-sized region in the process memory, which provides a quick way to enable/disable syscall redirection thread-wide, without the need to invoke the kernel directly. The selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK. Any other value should terminate the program with a SIGSYS.
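A minimal sketch of how a thread might enable the mechanism and then flip the selector is shown here. The helper names (sud_selector, sud_enable_exclusive, sud_disable, sud_block, sud_allow) are hypothetical, and the numeric fallback values are assumptions taken from recent <linux/prctl.h> for toolchains whose headers predate the feature. On kernels or architectures without support, the prctl is expected to fail and return -1.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers. These values are assumptions based on
 * recent <linux/prctl.h>; the system header wins when it defines them. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_EXCLUSIVE_ON
#define PR_SYS_DISPATCH_EXCLUSIVE_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical per-thread selector byte that the kernel is pointed at. */
static _Thread_local volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Enable dispatch for the calling thread. Syscalls issued from
 * [offset, offset + length) always run directly; all others consult
 * sud_selector. Returns 0 on success, -1 with errno set on failure. */
int sud_enable_exclusive(unsigned long offset, unsigned long length)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW; /* start in native mode */
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_EXCLUSIVE_ON,
                 offset, length, (unsigned long)&sud_selector);
}

/* Disable dispatch; the remaining arguments must be zero. */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}

/* Flip the personality with a plain store; no syscall is involved. */
void sud_block(void) { sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK; }
void sud_allow(void) { sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW; }
```

A real user would install a SIGSYS handler before calling sud_block() while dispatch is enabled, and would pass a range that covers at least the signal trampoline so that the handler can return; without a handler, the first blocked syscall would terminate the process through the default SIGSYS action.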

Additionally, a task’s syscall user dispatch configuration can be peeked and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace requests. This is useful for checkpoint/restart software.

Security Notes

Syscall User Dispatch provides functionality for compatibility layers to quickly capture system calls issued by a non-native part of the application, while not impacting the Linux native regions of the process. It is not a mechanism for sandboxing system calls, and it should not be seen as a security mechanism, since it is trivial for a malicious application to subvert the mechanism by jumping to an allowed dispatcher region prior to executing the syscall, or to discover the address and modify the selector value. If the use case requires any kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to PR_SYS_DISPATCH_OFF.