Adding a New System Call¶
This document describes what’s involved in adding a new system call to the Linux kernel, over and above the normal submission advice in Documentation/process/submitting-patches.rst.
System Call Alternatives¶
The first thing to consider when adding a new system call is whether one of the alternatives might be suitable instead. Although system calls are the most traditional and most obvious interaction points between userspace and the kernel, there are other possibilities -- choose what fits best for your interface.
If the operations involved can be made to look like a filesystem-like object, it may make more sense to create a new filesystem or device. This also makes it easier to encapsulate the new functionality in a kernel module rather than requiring it to be built into the main kernel.
If the new functionality involves operations where the kernel notifies userspace that something has happened, then returning a new file descriptor for the relevant object allows userspace to use poll/select/epoll to receive that notification.
However, operations that don’t map to read(2)/write(2)-like operations have to be implemented as ioctl(2) requests, which can lead to a somewhat opaque API.
If you’re just exposing runtime system information, a new node in sysfs (see Documentation/filesystems/sysfs.rst) or the /proc filesystem may be more appropriate. However, access to these mechanisms requires that the relevant filesystem is mounted, which might not always be the case (e.g. in a namespaced/sandboxed/chrooted environment). Avoid adding any API to debugfs, as this is not considered a ‘production’ interface to userspace.
If the operation is specific to a particular file or file descriptor, then an additional fcntl(2) command option may be more appropriate. However, fcntl(2) is a multiplexing system call that hides a lot of complexity, so this option is best for when the new function is closely analogous to existing fcntl(2) functionality, or the new functionality is very simple (for example, getting/setting a simple flag related to a file descriptor).
If the operation is specific to a particular task or process, then an additional prctl(2) command option may be more appropriate. As with fcntl(2), this system call is a complicated multiplexor so is best reserved for near-analogs of existing prctl() commands or getting/setting a simple flag related to a process.
Designing the API: Planning for Extension¶
A new system call forms part of the API of the kernel, and has to be supported indefinitely. As such, it’s a very good idea to explicitly discuss the interface on the kernel mailing list, and it’s important to plan for future extensions of the interface.
(The syscall table is littered with historical examples where this wasn’t done, together with the corresponding follow-up system calls -- eventfd/eventfd2, dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so learn from the history of the kernel and plan for extensions from the start.)
For simpler system calls that only take a couple of arguments, the preferred way to allow for future extensibility is to include a flags argument to the system call. To make sure that userspace programs can safely use flags between kernel versions, check whether the flags value holds any unknown flags, and reject the system call (with EINVAL) if it does:
if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
    return -EINVAL;
(If no flags values are used yet, check that the flags argument is zero.)
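The flags check above can be tried out in an ordinary userspace program. The following is a minimal sketch, not kernel code: the THING_FLAG* values, the THING_VALID_FLAGS mask and the thing_check_flags() helper are all hypothetical names chosen for illustration.

```c
#include <errno.h>

/* Hypothetical flag bits for an imaginary xyzzy(2) call. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u

/* Every bit the current implementation understands. */
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/* Returns 0 if only known flag bits are set, -EINVAL otherwise. */
static int thing_check_flags(unsigned int flags)
{
	if (flags & ~THING_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Because unknown bits are rejected today, a later kernel can give one of them a meaning without changing the behaviour of any program that previously succeeded.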
For more sophisticated system calls that involve a larger number of arguments, it’s preferred to encapsulate the majority of the arguments into a structure that is passed in by pointer. Such a structure can cope with future extension by including a size argument in the structure:
struct xyzzy_params {
    u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
    u32 param_1;
    u64 param_2;
    u64 param_3;
};
As long as any subsequently added field, say param_4, is designed so that a zero value gives the previous behaviour, then this allows both directions of version mismatch:
To cope with a later userspace program calling an older kernel, the kernel code should check that any memory beyond the size of the structure that it expects is zero (effectively checking that param_4 == 0).
To cope with an older userspace program calling a newer kernel, the kernel code can zero-extend a smaller instance of the structure (effectively setting param_4 = 0).
See perf_event_open(2) and the perf_copy_attr() function (in kernel/events/core.c) for an example of this approach.
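The two directions of version mismatch can be illustrated with a userspace simulation. This is only a sketch under stated assumptions: xyzzy_copy_params() is a hypothetical helper, a plain memcpy() stands in for the copy from user memory, and the choice of -E2BIG for non-zero trailing bytes is an assumption of this example rather than a requirement.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The layout that this (hypothetical) kernel version knows about. */
struct xyzzy_params {
	uint32_t size;		/* caller sets this to sizeof() of its own version */
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/*
 * Copy a caller-supplied structure of usize bytes (the value the caller
 * put in its size field) into *dst.
 */
static int xyzzy_copy_params(struct xyzzy_params *dst, const void *src,
			     size_t usize)
{
	const unsigned char *bytes = src;
	const size_t ksize = sizeof(*dst);
	size_t i;

	/* The structure must at least be able to hold the size field. */
	if (usize < sizeof(dst->size))
		return -EINVAL;

	/* Newer userspace: fields we do not know about must all be zero. */
	for (i = ksize; i < usize; i++) {
		if (bytes[i] != 0)
			return -E2BIG;
	}

	/* Older userspace: zero-fill first so missing fields read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, usize < ksize ? usize : ksize);
	dst->size = (uint32_t)ksize;
	return 0;
}
```

A caller built against a larger structure succeeds as long as its extra fields are zero, and a caller built against a smaller one sees the missing fields treated as zero.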
Designing the API: Other Considerations¶
If your new system call allows userspace to refer to a kernel object, it should use a file descriptor as the handle for that object -- don’t invent a new type of userspace object handle when the kernel already has mechanisms and well-defined semantics for using file descriptors.
If your new xyzzy(2) system call does return a new file descriptor, then the flags argument should include a value that is equivalent to setting O_CLOEXEC on the new FD. This makes it possible for userspace to close the timing window between xyzzy() and calling fcntl(fd, F_SETFD, FD_CLOEXEC), where an unexpected fork() and execve() in another thread could leak a descriptor to the exec’ed program. (However, resist the temptation to re-use the actual value of the O_CLOEXEC constant, as it is architecture-specific and is part of a numbering space of O_* flags that is fairly full.)
If your system call returns a new file descriptor, you should also consider what it means to use the poll(2) family of system calls on that file descriptor. Making a file descriptor ready for reading or writing is the normal way for the kernel to indicate to userspace that an event has occurred on the corresponding kernel object.
If your new xyzzy(2) system call involves a filename argument:
int sys_xyzzy(const char __user *path, ..., unsigned int flags);
you should also consider whether an xyzzyat(2) version is more appropriate:
int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);
This allows more flexibility for how userspace specifies the file in question; in particular it allows userspace to request the functionality for an already-opened file descriptor using the AT_EMPTY_PATH flag, effectively giving an fxyzzy(3) operation for free:
- xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path, ...)
- xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)
(For more details on the rationale of the *at() calls, see the openat(2) man page; for an example of AT_EMPTY_PATH, see the fstatat(2) man page.)
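The two equivalences can be observed with the existing stat family, where fstatat(2) plays the role of xyzzyat(), stat(2) of xyzzy() and fstat(2) of fxyzzy(). The following is a userspace sketch; the helper names are hypothetical, and the fallback definition of AT_EMPTY_PATH is an assumption that only matters if the libc headers do not expose the constant.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* asks glibc to expose AT_EMPTY_PATH */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux value, used only if the headers hide it */
#endif

/* Returns 1 if both results describe the same file (same device and inode). */
static int same_file(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/* The xyzzyat(AT_FDCWD, path, ..., 0) form: behaves like stat(path, st). */
static int stat_by_path(const char *path, struct stat *st)
{
	return fstatat(AT_FDCWD, path, st, 0);
}

/* The xyzzyat(fd, "", ..., AT_EMPTY_PATH) form: behaves like fstat(fd, st). */
static int stat_by_fd(int fd, struct stat *st)
{
	return fstatat(fd, "", st, AT_EMPTY_PATH);
}
```

A single *at() entry point therefore covers the path-based and the descriptor-based variants without needing three separate system calls.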
If your new xyzzy(2) system call involves a parameter describing an offset within a file, make its type loff_t so that 64-bit offsets can be supported even on 32-bit architectures.
If your new xyzzy(2) system call involves privileged functionality, it needs to be governed by the appropriate Linux capability bit (checked with a call to capable()), as described in the capabilities(7) man page. Choose an existing capability bit that governs related functionality, but try to avoid combining lots of only vaguely related functions together under the same bit, as this goes against capabilities’ purpose of splitting the power of root. In particular, avoid adding new uses of the already overly-general CAP_SYS_ADMIN capability.
If your new xyzzy(2) system call manipulates a process other than the calling process, it should be restricted (using a call to ptrace_may_access()) so that only a calling process with the same permissions as the target process, or with the necessary capabilities, can manipulate the target process.
Finally, be aware that some non-x86 architectures have an easier time if system call parameters that are explicitly 64-bit fall on odd-numbered arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit registers. (This concern does not apply if the arguments are part of a structure that’s passed in by pointer.)
Proposing the API¶
To make new system calls easy to review, it’s best to divide up the patchset into separate chunks. These should include at least the following items as distinct commits (each of which is described further below):
The core implementation of the system call, together with prototypes, generic numbering, Kconfig changes and fallback stub implementation.
Wiring up of the new system call for one particular architecture, usually x86 (including all of x86_64, x86_32 and x32).
A demonstration of the use of the new system call in userspace via a selftest in tools/testing/selftests/.
A draft man-page for the new system call, either as plain text in the cover letter, or as a patch to the (separate) man-pages repository.
New system call proposals, like any change to the kernel’s API, should always be cc’ed to linux-api@vger.kernel.org.
Generic System Call Implementation¶
The main entry point for your new xyzzy(2) system call will be called sys_xyzzy(), but you add this entry point with the appropriate SYSCALL_DEFINEn() macro rather than explicitly. The ‘n’ indicates the number of arguments to the system call, and the macro takes the system call name followed by the (type, name) pairs for the parameters as arguments. Using this macro allows metadata about the new system call to be made available for other tools.
The new entry point also needs a corresponding function prototype, in include/linux/syscalls.h, marked as asmlinkage to match the way that system calls are invoked:
asmlinkage long sys_xyzzy(...);
Some architectures (e.g. x86) have their own architecture-specific syscall tables, but several other architectures share a generic syscall table. Add your new system call to the generic list by adding an entry to the list in include/uapi/asm-generic/unistd.h:
#define __NR_xyzzy 292
__SYSCALL(__NR_xyzzy, sys_xyzzy)
Also update the __NR_syscalls count to reflect the additional system call, and note that if multiple new system calls are added in the same merge window, your new syscall number may get adjusted to resolve conflicts.
The file kernel/sys_ni.c provides a fallback stub implementation of each system call, returning -ENOSYS. Add your new system call here too:
COND_SYSCALL(xyzzy);
Your new kernel functionality, and the system call that controls it, should normally be optional, so add a CONFIG option (typically to init/Kconfig) for it. As usual for new CONFIG options:
Include a description of the new functionality and system call controlled by the option.
Make the option depend on EXPERT if it should be hidden from normal users.
Make any new source files implementing the function dependent on the CONFIG option in the Makefile (e.g. obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o).
Double check that the kernel still builds with the new CONFIG option turned off.
To summarize, you need a commit that includes:
CONFIG option for the new function, normally in init/Kconfig
SYSCALL_DEFINEn(xyzzy, ...) for the entry point
corresponding prototype in include/linux/syscalls.h
generic table entry in include/uapi/asm-generic/unistd.h
fallback stub in kernel/sys_ni.c
Since 6.11¶
Starting with kernel version 6.11, general system call implementation for the following architectures no longer requires modifications to include/uapi/asm-generic/unistd.h:
arc
arm64
csky
hexagon
loongarch
nios2
openrisc
riscv
Instead, you need to update scripts/syscall.tbl and, if applicable, adjust arch/*/kernel/Makefile.syscalls.
As scripts/syscall.tbl serves as a common syscall table across multiple architectures, a new entry is required in this table:
468 common xyzzy sys_xyzzy
Note that adding an entry to scripts/syscall.tbl with the “common” ABI also affects all architectures that share this table. For more limited or architecture-specific changes, consider using an architecture-specific ABI or defining a new one.
If a new ABI, say xyz, is introduced, the corresponding updates should be made to arch/*/kernel/Makefile.syscalls as well:
syscall_abis_{32,64} += xyz (...)
To summarize, you need a commit that includes:
CONFIG option for the new function, normally in init/Kconfig
SYSCALL_DEFINEn(xyzzy, ...) for the entry point
corresponding prototype in include/linux/syscalls.h
new entry in scripts/syscall.tbl
(if needed) Makefile updates in arch/*/kernel/Makefile.syscalls
fallback stub in kernel/sys_ni.c
x86 System Call Implementation¶
To wire up your new system call for x86 platforms, you need to update the master syscall tables. Assuming your new system call isn’t special in some way (see below), this involves a “common” entry (for x86_64 and x32) in arch/x86/entry/syscalls/syscall_64.tbl:
333 common xyzzy sys_xyzzy
and an “i386” entry in arch/x86/entry/syscalls/syscall_32.tbl:
380 i386 xyzzy sys_xyzzy
Again, these numbers are liable to be changed if there are conflicts in the relevant merge window.
Compatibility System Calls (Generic)¶
For most system calls the same 64-bit implementation can be invoked even when the userspace program is itself 32-bit; even if the system call’s parameters include an explicit pointer, this is handled transparently.
However, there are a couple of situations where a compatibility layer is needed to cope with size differences between 32-bit and 64-bit.
The first is if the 64-bit kernel also supports 32-bit userspace programs, and so needs to parse areas of (__user) memory that could hold either 32-bit or 64-bit values. In particular, this is needed whenever a system call argument is:
a pointer to a pointer
a pointer to a struct containing a pointer (e.g. struct iovec __user *)
a pointer to a varying sized integral type (time_t, off_t, long, ...)
a pointer to a struct containing a varying sized integral type.
The second situation that requires a compatibility layer is if one of the system call’s arguments has a type that is explicitly 64-bit even on a 32-bit architecture, for example loff_t or __u64. In this case, a value that arrives at a 64-bit kernel from a 32-bit application will be split into two 32-bit values, which then need to be re-assembled in the compatibility layer.
(Note that a system call argument that’s a pointer to an explicit 64-bit type does not need a compatibility layer; for example, splice(2)’s arguments of type loff_t __user * do not trigger the need for a compat_ system call.)
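The splitting and re-assembly of an explicitly 64-bit argument is plain bit arithmetic. The sketch below shows it in portable C; split_u64() and join_u64() are hypothetical helper names, and which register carries the low or high half is architecture-specific and not modelled here.

```c
#include <stdint.h>

/* What a 32-bit caller effectively does: pass one 64-bit value as two halves. */
static void split_u64(uint64_t value, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)(value & 0xffffffffu);
	*hi = (uint32_t)(value >> 32);
}

/* What the compatibility layer does: rebuild the 64-bit value. */
static uint64_t join_u64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

The cast to uint64_t before the shift matters: shifting a 32-bit value left by 32 bits would be undefined behaviour in C.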
The compatibility version of the system call is called compat_sys_xyzzy(), and is added with the COMPAT_SYSCALL_DEFINEn() macro, analogously to SYSCALL_DEFINEn. This version of the implementation runs as part of a 64-bit kernel, but expects to receive 32-bit parameter values and does whatever is needed to deal with them. (Typically, the compat_sys_ version converts the values to 64-bit versions and either calls on to the sys_ version, or both of them call a common inner implementation function.)
The compat entry point also needs a corresponding function prototype, in include/linux/compat.h, marked as asmlinkage to match the way that system calls are invoked:
asmlinkage long compat_sys_xyzzy(...);
If the system call involves a structure that is laid out differently on 32-bit and 64-bit systems, say struct xyzzy_args, then the include/linux/compat.h header file should also include a compat version of the structure (struct compat_xyzzy_args) where each variable-size field has the appropriate compat_ type that corresponds to the type in struct xyzzy_args. The compat_sys_xyzzy() routine can then use this compat_ structure to parse the arguments from a 32-bit invocation.
For example, if there are fields:
struct xyzzy_args {
    const char __user *ptr;
    __kernel_long_t varying_val;
    u64 fixed_val;
    /* ... */
};
in struct xyzzy_args, then struct compat_xyzzy_args would have:
struct compat_xyzzy_args {
    compat_uptr_t ptr;
    compat_long_t varying_val;
    u64 fixed_val;
    /* ... */
};
The generic system call list also needs adjusting to allow for the compat version; the entry in include/uapi/asm-generic/unistd.h should use __SC_COMP rather than __SYSCALL:
#define __NR_xyzzy 292
__SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)
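To make the role of struct compat_xyzzy_args more concrete, here is a userspace simulation of the conversion a compat entry point might perform before calling a common implementation. It is a sketch only: the typedefs stand in for the kernel's compat types, long replaces __kernel_long_t, the __user annotation is dropped, and xyzzy_args_from_compat() is a hypothetical helper.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's compat types (assumed 32 bits wide). */
typedef uint32_t compat_uptr_t;
typedef int32_t compat_long_t;

/* Native layout, as seen by a 64-bit caller. */
struct xyzzy_args {
	const char *ptr;
	long varying_val;
	uint64_t fixed_val;
};

/* Layout as seen by a 32-bit caller. */
struct compat_xyzzy_args {
	compat_uptr_t ptr;
	compat_long_t varying_val;
	uint64_t fixed_val;
};

/* Widen each variable-size field; fixed-size fields are copied unchanged. */
static void xyzzy_args_from_compat(struct xyzzy_args *dst,
				   const struct compat_xyzzy_args *src)
{
	dst->ptr = (const char *)(uintptr_t)src->ptr;	/* zero-extended */
	dst->varying_val = src->varying_val;		/* sign-extended */
	dst->fixed_val = src->fixed_val;
}
```

Only the pointer and the long change width between the two layouts, which is why only those two fields need compat_ types.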
To summarize, you need:
a COMPAT_SYSCALL_DEFINEn(xyzzy, ...) for the compat entry point
corresponding prototype in include/linux/compat.h
(if needed) 32-bit mapping struct in include/linux/compat.h
instance of __SC_COMP not __SYSCALL in include/uapi/asm-generic/unistd.h
Since 6.11¶
This applies to all the architectures listed in Since 6.11 under “Generic System Call Implementation”, except arm64. See Compatibility System Calls (arm64) for more information.
You need to extend the entry in scripts/syscall.tbl with an extra column to indicate that a 32-bit userspace program running on a 64-bit kernel should hit the compat entry point:
468 common xyzzy sys_xyzzy compat_sys_xyzzy
To summarize, you need:
COMPAT_SYSCALL_DEFINEn(xyzzy, ...) for the compat entry point
corresponding prototype in include/linux/compat.h
modification of the entry in scripts/syscall.tbl to include an extra “compat” column
(if needed) 32-bit mapping struct in include/linux/compat.h
Compatibility System Calls (arm64)¶
On arm64, there is a dedicated syscall table for compatibility system calls targeting 32-bit (AArch32) userspace: arch/arm64/tools/syscall_32.tbl. You need to add an additional line to this table specifying the compat entry point:
468 common xyzzy sys_xyzzy compat_sys_xyzzy
Compatibility System Calls (x86)¶
To wire up the x86 architecture of a system call with a compatibility version, the entries in the syscall tables need to be adjusted.
First, the entry in arch/x86/entry/syscalls/syscall_32.tbl gets an extra column to indicate that a 32-bit userspace program running on a 64-bit kernel should hit the compat entry point:
380 i386 xyzzy sys_xyzzy __ia32_compat_sys_xyzzy
Second, you need to figure out what should happen for the x32 ABI version of the new system call. There’s a choice here: the layout of the arguments should either match the 64-bit version or the 32-bit version.
If there’s a pointer-to-a-pointer involved, the decision is easy: x32 is ILP32, so the layout should match the 32-bit version, and the entry in arch/x86/entry/syscalls/syscall_64.tbl is split so that x32 programs hit the compatibility wrapper:
333 64 xyzzy sys_xyzzy
...
555 x32 xyzzy __x32_compat_sys_xyzzy
If no pointers are involved, then it is preferable to re-use the 64-bit system call for the x32 ABI (and consequently the entry in arch/x86/entry/syscalls/syscall_64.tbl is unchanged).
In either case, you should check that the types involved in your argument layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or 64-bit (-m64) equivalents.
System Calls Returning Elsewhere¶
For most system calls, once the system call is complete the user program continues exactly where it left off -- at the next instruction, with the stack the same and most of the registers the same as before the system call, and with the same virtual memory space.
However, a few system calls do things differently. They might return to a different location (rt_sigreturn) or change the memory space (fork/vfork/clone) or even architecture (execve/execveat) of the program.
To allow for this, the kernel implementation of the system call may need to save and restore additional registers to the kernel stack, allowing complete control of where and how execution continues after the system call.
This is arch-specific, but typically involves defining assembly entry points that save/restore additional registers and invoke the real system call entry point.
For x86_64, this is implemented as a stub_xyzzy entry point in arch/x86/entry/entry_64.S, and the entry in the syscall table (arch/x86/entry/syscalls/syscall_64.tbl) is adjusted to match:
333 common xyzzy stub_xyzzy
The equivalent for 32-bit programs running on a 64-bit kernel is normally called stub32_xyzzy and implemented in arch/x86/entry/entry_64_compat.S, with the corresponding syscall table adjustment in arch/x86/entry/syscalls/syscall_32.tbl:
380 i386 xyzzy sys_xyzzy stub32_xyzzy
If the system call needs a compatibility layer (as in the previous section) then the stub32_ version needs to call on to the compat_sys_ version of the system call rather than the native 64-bit version. Also, if the x32 ABI implementation is not common with the x86_64 version, then its syscall table will also need to invoke a stub that calls on to the compat_sys_ version.
For completeness, it’s also nice to set up a mapping so that user-mode Linux still works -- its syscall table will reference stub_xyzzy, but the UML build doesn’t include the arch/x86/entry/entry_64.S implementation (because UML simulates registers etc). Fixing this is as simple as adding a #define to arch/x86/um/sys_call_table_64.c:
#define stub_xyzzy sys_xyzzy
Other Details¶
Most of the kernel treats system calls in a generic way, but there is the occasional exception that may need updating for your particular system call.
The audit subsystem is one such special case; it includes (arch-specific) functions that classify some special types of system call -- specifically file open (open/openat), program execution (execve/execveat) or socket multiplexor (socketcall) operations. If your new system call is analogous to one of these, then the audit system should be updated.
More generally, if there is an existing system call that is analogous to your new system call, it’s worth doing a kernel-wide grep for the existing system call to check there are no other special cases.
Testing¶
A new system call should obviously be tested; it is also useful to provide reviewers with a demonstration of how user space programs will use the system call. A good way to combine these aims is to include a simple self-test program in a new directory under tools/testing/selftests/.
For a new system call, there will obviously be no libc wrapper function and so the test will need to invoke it using syscall(); also, if the system call involves a new userspace-visible structure, the corresponding header will need to be installed to compile the test.
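Invoking a system call by number through syscall() looks like the sketch below. Because xyzzy(2) is hypothetical and has no number on any real kernel, getpid(2) is used as a stand-in; a selftest for a new call would pass its own __NR_xyzzy value and arguments instead. The raw_getpid() name is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* makes sure glibc declares syscall() */
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Call getpid(2) by number, bypassing the libc wrapper. */
static long raw_getpid(void)
{
	return syscall(SYS_getpid);
}
```

On failure, syscall() returns -1 and sets errno, so a selftest can check for ENOSYS to detect a kernel built without the new call.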
Make sure the selftest runs successfully on all supported architectures. For example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32) and x32 (-mx32) ABI program.
For more extensive and thorough testing of new functionality, you should also consider adding tests to the Linux Test Project, or to the xfstests project for filesystem-related changes.
git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
Man Page¶
All new system calls should come with a complete man page, ideally using groff markup, but plain text will do. If groff is used, it’s helpful to include a pre-rendered ASCII version of the man page in the cover email for the patchset, for the convenience of reviewers.
The man page should be cc’ed to linux-man@vger.kernel.org. For more details, see https://www.kernel.org/doc/man-pages/patches.html
Do not call System Calls in the Kernel¶
System calls are, as stated above, interaction points between userspace and the kernel. Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy() should only be called from userspace via the syscall table, but not from elsewhere in the kernel. If the syscall functionality is useful to be used within the kernel, needs to be shared between an old and a new syscall, or needs to be shared between a syscall and its compatibility variant, it should be implemented by means of a “helper” function (such as ksys_xyzzy()). This kernel function may then be called within the syscall stub (sys_xyzzy()), the compatibility syscall stub (compat_sys_xyzzy()), and/or other kernel code.
At least on 64-bit x86, it will be a hard requirement from v4.17 onwards to not call system call functions in the kernel. It uses a different calling convention for system calls where struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands processing over to the actual syscall function. This means that only those parameters which are actually needed for a specific syscall are passed on during syscall entry, instead of filling in six CPU registers with random user space content all the time (which may cause serious trouble down the call chain).
Moreover, rules on how data may be accessed may differ between kernel data and user data. This is another reason why calling sys_xyzzy() is generally a bad idea.
Exceptions to this rule are only allowed in architecture-specific overrides, architecture-specific compatibility wrappers, or other code in arch/.
References and Sources¶
LWN article from Michael Kerrisk on use of flags argument in system calls: https://lwn.net/Articles/585415/
LWN article from Michael Kerrisk on how to handle unknown flags in a system call: https://lwn.net/Articles/588444/
LWN article from Jake Edge describing constraints on 64-bit system call arguments: https://lwn.net/Articles/311630/
Pair of LWN articles from David Drysdale that describe the system call implementation paths in detail for v3.14:
Architecture-specific requirements for system calls are discussed in the syscall(2) man-page: http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
Collated emails from Linus Torvalds discussing the problems with ioctl(): https://yarchive.net/comp/linux/ioctl.html
“How to not invent kernel interfaces”, Arnd Bergmann, https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN: https://lwn.net/Articles/486306/
Recommendation from Andrew Morton that all related information for a new system call should come in the same email thread: https://lore.kernel.org/r/20140724144747.3041b208832bbdf9fbce5d96@linux-foundation.org
Recommendation from Michael Kerrisk that a new system call should come with a man page: https://lore.kernel.org/r/CAKgNAkgMA39AfoSoA5Pe1r9N+ZzfYQNvNPvcRN7tOvRb8+v06Q@mail.gmail.com
Suggestion from Thomas Gleixner that x86 wire-up should be in a separate commit: https://lore.kernel.org/r/alpine.DEB.2.11.1411191249560.3909@nanos
Suggestion from Greg Kroah-Hartman that it’s good for new system calls to come with a man-page & selftest: https://lore.kernel.org/r/20140320025530.GA25469@kroah.com
Discussion from Michael Kerrisk of new system call vs. prctl(2) extension: https://lore.kernel.org/r/CAHO5Pa3F2MjfTtfNxa8LbnkeeU8=YJ+9tDqxZpw7Gz59E-4AUg@mail.gmail.com
Suggestion from Ingo Molnar that system calls that involve multiple arguments should encapsulate those arguments in a struct, which includes a size field for future extensibility: https://lore.kernel.org/r/20150730083831.GA22182@gmail.com
Numbering oddities arising from (re-)use of O_* numbering space flags:
commit 75069f2b5bfb (“vfs: renumber FMODE_NONOTIFY and add to uniqueness check”)
commit 12ed2e36c98a (“fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict”)
commit bb458c644a59 (“Safer ABI for O_TMPFILE”)
Discussion from Matthew Wilcox about restrictions on 64-bit arguments: https://lore.kernel.org/r/20081212152929.GM26095@parisc-linux.org
Recommendation from Greg Kroah-Hartman that unknown flags should be policed: https://lore.kernel.org/r/20140717193330.GB4703@kroah.com
Recommendation from Linus Torvalds that x32 system calls should prefer compatibility with 64-bit versions rather than 32-bit versions: https://lore.kernel.org/r/CA+55aFxfmwfB7jbbrXxa=K7VBYPfAvmu3XOkGrLbB1UFjX1+Ew@mail.gmail.com
Patch series revising system call table infrastructure to use scripts/syscall.tbl across multiple architectures: https://lore.kernel.org/lkml/20240704143611.2979589-1-arnd@kernel.org