File management in the Linux kernel

This document describes how locking for files (struct file) and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table has been protected with a lock (files->file_lock) and reference count (files->count). ->file_lock protected accesses to all the file related fields of the table. ->count was used for sharing the file descriptor table between tasks cloned with CLONE_FILES flag. Typically this would be the case for POSIX threads. As with the common refcounting model in the kernel, the last task doing a put_files_struct() frees the file descriptor (fd) table. The files (struct file) themselves are protected using a reference count (->f_count).

In the new lock-free model of file descriptor management, the reference counting is similar, but the locking is based on RCU. The file descriptor table contains multiple elements - the fd sets (open_fds and close_on_exec), the array of file pointers, the sizes of the sets and the array, etc. In order for the updates to appear atomic to a lock-free reader, all the elements of the file descriptor table are in a separate structure - struct fdtable. files_struct contains a pointer to struct fdtable through which the actual fd table is accessed. Initially the fdtable is embedded in files_struct itself. On a subsequent expansion of fdtable, a new fdtable structure is allocated and files->fdt points to the new structure. The fdtable structure is freed with RCU and lock-free readers either see the old fdtable or the new fdtable, making the update appear atomic. Here are the locking rules for the fdtable structure -

  1. All references to the fdtable must be done through the files_fdtable() macro:

    struct fdtable *fdt;

    rcu_read_lock();

    fdt = files_fdtable(files);
    ....
    if (n <= fdt->max_fds)
            ....
    ...
    rcu_read_unlock();

    files_fdtable() uses the rcu_dereference() macro, which takes care of the memory barrier requirements for lock-free dereference. The fdtable pointer must be read within the read-side critical section.

  2. Reading of the fdtable as described above must be protected by rcu_read_lock()/rcu_read_unlock().

  3. For any update to the fd table, files->file_lock must be held.

  4. To look up the file structure given an fd, a reader must use either the lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These take care of barrier requirements due to lock-free lookup.

    An example:

    struct file *file;

    rcu_read_lock();
    file = lookup_fdget_rcu(fd);
    rcu_read_unlock();
    if (file) {
            ...
            fput(file);
    }
    ....

  5. Since both fdtable and file structures can be looked up lock-free, they must be installed using the rcu_assign_pointer() API. If they are looked up lock-free, rcu_dereference() must be used. However it is advisable to use files_fdtable() and lookup_fdget_rcu()/files_lookup_fdget_rcu(), which take care of these issues.

  6. While updating, the fdtable pointer must be looked up while holding files->file_lock. If ->file_lock is dropped, then another thread may expand the files, thereby creating a new fdtable and making the earlier fdtable pointer stale.

    For example:

    spin_lock(&files->file_lock);
    fd = locate_fd(files, file, start);
    if (fd >= 0) {
            /* locate_fd() may have expanded fdtable, load the ptr */
            fdt = files_fdtable(files);
            __set_open_fd(fd, fdt);
            __clear_close_on_exec(fd, fdt);
            spin_unlock(&files->file_lock);
    .....

    Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), the fdtable pointer (fdt) must be loaded after locate_fd().
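
The rules above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: every model_* name is hypothetical, a C11 acquire load stands in for rcu_dereference(), a release store stands in for rcu_assign_pointer(), and an atomic_flag spinlock plays the role of ->file_lock. There is no RCU grace period here, so the replaced table is handed back to the caller, who may free it only once no reader can still be using it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel's definitions. */
struct model_fdtable {
    unsigned int max_fds;
    _Atomic(void *) *fd;                    /* array of "file" pointers */
};

struct model_files {
    atomic_flag file_lock;                  /* plays the role of ->file_lock */
    _Atomic(struct model_fdtable *) fdt;    /* currently published table */
};

static void model_lock(struct model_files *files)
{
    while (atomic_flag_test_and_set_explicit(&files->file_lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void model_unlock(struct model_files *files)
{
    atomic_flag_clear_explicit(&files->file_lock, memory_order_release);
}

static struct model_fdtable *model_fdtable_alloc(unsigned int nr)
{
    struct model_fdtable *fdt = malloc(sizeof(*fdt));

    if (!fdt)
        return NULL;
    fdt->fd = malloc(nr * sizeof(*fdt->fd));
    if (!fdt->fd) {
        free(fdt);
        return NULL;
    }
    for (unsigned int i = 0; i < nr; i++)
        atomic_init(&fdt->fd[i], NULL);
    fdt->max_fds = nr;
    return fdt;
}

static void model_fdtable_free(struct model_fdtable *fdt)
{
    if (fdt) {
        free(fdt->fd);
        free(fdt);
    }
}

static int model_files_init(struct model_files *files, unsigned int nr)
{
    struct model_fdtable *fdt = model_fdtable_alloc(nr);

    if (!fdt)
        return -1;
    atomic_flag_clear(&files->file_lock);
    atomic_init(&files->fdt, fdt);
    return 0;
}

/* Rule 1: every access goes through one helper (models files_fdtable()). */
static struct model_fdtable *model_files_fdtable(struct model_files *files)
{
    return atomic_load_explicit(&files->fdt, memory_order_acquire);
}

/* Caller holds file_lock. Publishes a larger copy and returns the old table. */
static struct model_fdtable *model_expand(struct model_files *files, unsigned int nr)
{
    struct model_fdtable *old = model_files_fdtable(files);
    struct model_fdtable *nfdt = model_fdtable_alloc(nr);

    assert(nr > old->max_fds);
    if (!nfdt)
        return NULL;
    for (unsigned int i = 0; i < old->max_fds; i++)
        atomic_store_explicit(&nfdt->fd[i],
                              atomic_load_explicit(&old->fd[i], memory_order_relaxed),
                              memory_order_relaxed);
    atomic_store_explicit(&files->fdt, nfdt, memory_order_release);
    return old;
}

/* Rules 3 and 6: update under the lock, reload fdt after any expansion. */
static int model_install(struct model_files *files, unsigned int fd, void *file,
                         struct model_fdtable **retired)
{
    struct model_fdtable *fdt;

    *retired = NULL;
    model_lock(files);
    fdt = model_files_fdtable(files);
    if (fd >= fdt->max_fds) {
        *retired = model_expand(files, fd + 1);
        if (!*retired) {
            model_unlock(files);
            return -1;
        }
        fdt = model_files_fdtable(files);   /* the old pointer is stale now */
    }
    atomic_store_explicit(&fdt->fd[fd], file, memory_order_release);
    model_unlock(files);
    return 0;
}

/* Lock-free reader: sees either the old or the new table, never a mix. */
static void *model_lookup(struct model_files *files, unsigned int fd)
{
    struct model_fdtable *fdt = model_files_fdtable(files);

    return fd < fdt->max_fds ? atomic_load_explicit(&fdt->fd[fd], memory_order_acquire) : NULL;
}
```

A caller would run model_files_init() once, install entries with model_install(), and pass any table returned through retired to model_fdtable_free() only after all concurrent readers have finished with it. In the kernel that last step is what the RCU grace period provides.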

On newer kernels, RCU-based file lookup has been switched to rely on SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn’t sufficient anymore to just acquire a reference to the file in question under RCU using atomic_long_inc_not_zero(), since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu().
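
The "increment, then recheck the pointer" pattern can be sketched in userspace C11 as follows. This is a simplified model rather than the kernel's get_file_rcu(): the model_* names are hypothetical, the slot argument stands in for an fd table entry, and model_put() only decrements the count, whereas fput() may also release the file.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file; only the reference count matters. */
struct model_file {
    atomic_long f_count;    /* 0 means the object is free and may be recycled */
};

/* Models atomic_long_inc_not_zero(): take a reference unless the count is 0. */
static int model_inc_not_zero(struct model_file *f)
{
    long old = atomic_load_explicit(&f->f_count, memory_order_relaxed);

    while (old != 0) {
        /* On failure, old is refreshed with the current count and we retry. */
        if (atomic_compare_exchange_weak_explicit(&f->f_count, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return 1;
    }
    return 0;
}

/* Drop a reference (the model never frees anything). */
static void model_put(struct model_file *f)
{
    atomic_fetch_sub_explicit(&f->f_count, 1, memory_order_release);
}

/*
 * One attempt to pin f, where f was loaded from *slot earlier.
 * Returns 1 only if a reference was taken AND *slot still points at f.
 * With type-stable memory, a successful increment alone proves nothing:
 * the object may have been freed and reused for a different file.
 */
static int model_pin_if_unchanged(_Atomic(struct model_file *) *slot,
                                  struct model_file *f)
{
    if (!model_inc_not_zero(f))
        return 0;   /* object was already dead */
    if (f == atomic_load_explicit(slot, memory_order_acquire))
        return 1;   /* same pointer before and after the increment */
    model_put(f);   /* the reference landed on a recycled object: undo it */
    return 0;
}

/* Retry until the slot is empty or a verified reference is obtained. */
static struct model_file *model_get_file(_Atomic(struct model_file *) *slot)
{
    for (;;) {
        struct model_file *f = atomic_load_explicit(slot, memory_order_acquire);

        if (!f)
            return NULL;
        if (model_pin_if_unchanged(slot, f))
            return f;
    }
}
```

The retry loop assumes, as a simplification, that a slot never keeps pointing at an object whose count has reached zero for long; an updater is expected to clear or replace the slot before dropping the last reference.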

In addition, it isn’t possible to access or check fields in struct file without first acquiring a reference on it under RCU lookup. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers either first acquire a reference or hold the ->file_lock that protects the fdtable.