Overview of the Linux Virtual File System
Original author: Richard Gooch <rgooch@atnf.csiro.au>
- Copyright (C) 1999 Richard Gooch
- Copyright (C) 2005 Pekka Enberg
Introduction
The Virtual File System (also known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem implementations to coexist.
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on are called from a process context. Filesystem locking is described in the document Documentation/filesystems/locking.rst.
Directory Entry Cache (dcache)
The VFS implements the open(2), stat(2), chmod(2), and similar system calls. The pathname argument that is passed to them is used by the VFS to search through the directory entry cache (also known as the dentry cache or dcache). This provides a very fast look-up mechanism to translate a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.
The dentry cache is meant to be a view into your entire filespace. As most computers cannot fit all dentries in the RAM at the same time, some bits of the cache are missing. In order to resolve your pathname into a dentry, the VFS may have to resort to creating dentries along the way, and then loading the inode. This is done by looking up the inode.
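The cache-then-filesystem behaviour described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name in it (toy_dentry, toy_lookup, toy_fs_lookup and so on) is hypothetical, the hash function is arbitrary, and the real dcache is far more elaborate (RCU, reference counts, negative dentries, reclaim).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical types, much simpler than the kernel's. */
struct toy_inode {
    unsigned long ino;
};

struct toy_dentry {
    const struct toy_dentry *parent;
    char name[32];
    struct toy_inode *inode;
    struct toy_dentry *next;            /* hash-chain link */
};

#define TOY_BUCKETS 64
static struct toy_dentry *toy_hash[TOY_BUCKETS];
static unsigned toy_fs_lookups;         /* how often the "filesystem" was asked */

/* Hash on (parent, name), the key a path component is looked up by. */
static unsigned toy_bucket(const struct toy_dentry *parent, const char *name)
{
    unsigned h = (unsigned)((uintptr_t)parent >> 4);

    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % TOY_BUCKETS;
}

/* Stand-in for a filesystem's lookup(): invents an inode for any name. */
static struct toy_inode *toy_fs_lookup(const char *name)
{
    struct toy_inode *inode = malloc(sizeof(*inode));

    assert(inode != NULL);
    inode->ino = 100 + strlen(name);
    toy_fs_lookups++;
    return inode;
}

/* Resolve one path component: cache first, filesystem on a miss. */
static struct toy_dentry *toy_lookup(const struct toy_dentry *parent,
                                     const char *name)
{
    unsigned b = toy_bucket(parent, name);
    struct toy_dentry *d;

    assert(strlen(name) < sizeof(d->name));
    for (d = toy_hash[b]; d; d = d->next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;                   /* cache hit */

    d = calloc(1, sizeof(*d));          /* miss: create the dentry ... */
    assert(d != NULL);
    d->parent = parent;
    strcpy(d->name, name);
    d->inode = toy_fs_lookup(name);     /* ... and load the inode */
    d->next = toy_hash[b];
    toy_hash[b] = d;
    return d;
}
```

Calling toy_lookup(NULL, "etc") twice returns the same dentry and consults the stand-in filesystem only once; the second call is served entirely from the cache.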
The Inode Object
An individual dentry usually has a pointer to an inode. Inodes are filesystem objects such as regular files, directories, FIFOs and other beasts. They live either on the disc (for block device filesystems) or in the memory (for pseudo filesystems). Inodes that live on the disc are copied into the memory when required and changes to the inode are written back to disc. A single inode can be pointed to by multiple dentries (hard links, for example, do this).
To look up an inode requires that the VFS calls the lookup() method of the parent directory inode. This method is installed by the specific filesystem implementation that the inode lives in. Once the VFS has the required dentry (and hence the inode), we can do all those boring things like open(2) the file, or stat(2) it to peek at the inode data. The stat(2) operation is fairly simple: once the VFS has the dentry, it peeks at the inode data and passes some of it back to userspace.
The File Object
Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process.
Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required. For as long as the file is open, it keeps the dentry in use, which in turn means that the VFS inode is still in use.
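The "switch" performed on open and read can be modelled in userspace with a table of function pointers. The sketch below is an illustration only: toy_file, toy_file_operations, ramtoy_read and the fixed-size descriptor table are all hypothetical, and the dentry layer is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical, heavily reduced method table. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_operations *fops; /* installed by the filesystem */
    const char *data;
};

struct toy_file {
    struct toy_inode *inode;
    const struct toy_file_operations *f_op; /* taken from the inode at open */
    size_t pos;
};

#define TOY_MAX_FDS 8
static struct toy_file toy_files[TOY_MAX_FDS];
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* The read method of one particular (made-up) filesystem. */
static long ramtoy_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->inode->data) - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->inode->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

static const struct toy_file_operations ramtoy_fops = { .read = ramtoy_read };

/* Allocate a file structure and place it in the descriptor table. */
static int toy_open(struct toy_inode *inode)
{
    assert(inode != NULL && inode->fops != NULL);
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (!toy_fd_table[fd]) {
            toy_files[fd] = (struct toy_file){
                .inode = inode, .f_op = inode->fops, .pos = 0,
            };
            toy_fd_table[fd] = &toy_files[fd];
            return fd;
        }
    }
    return -EMFILE;
}

/* Descriptor -> file structure -> filesystem method. */
static long toy_read(int fd, char *buf, size_t len)
{
    struct toy_file *f;

    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -EBADF;
    f = toy_fd_table[fd];
    if (!f->f_op->read)
        return -EINVAL;
    return f->f_op->read(f, buf, len);
}

static int toy_close(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -EBADF;
    toy_fd_table[fd] = NULL;
    return 0;
}
```

The generic toy_read() never knows which filesystem it is talking to; it only follows the f_op pointer that was copied from the inode when the file was opened.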
Registering and Mounting a Filesystem
To register and unregister a filesystem, use the following API functions:
    #include <linux/fs.h>

    extern int register_filesystem(struct file_system_type *);
    extern int unregister_filesystem(struct file_system_type *);
The passed struct file_system_type describes your filesystem. When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. A new vfsmount referring to the tree returned by ->mount() will be attached to the mountpoint, so that when pathname resolution reaches the mountpoint it will jump into the root of that vfsmount.
You can see all filesystems that are registered to the kernel in the file /proc/filesystems.
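From userspace, one way to check whether a filesystem type is currently registered is to scan /proc/filesystems. The helper below is a minimal sketch; the function name fs_type_listed is made up for this example, and the path is a parameter so that the function can also be exercised against an ordinary file. It relies on the usual line layout of that file: an optional "nodev" marker, a tab, then the filesystem name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `name` is listed in the file at `path` (normally
 * "/proc/filesystems"), 0 if it is not, and -1 if the file cannot be
 * opened.  Hypothetical helper, not part of any kernel or libc API.
 */
static int fs_type_listed(const char *path, const char *name)
{
    char line[256];
    FILE *fp;
    int found = 0;

    assert(path != NULL && name != NULL);
    fp = fopen(path, "r");
    if (!fp)
        return -1;

    while (!found && fgets(line, sizeof(line), fp)) {
        char *fs = strchr(line, '\t');  /* the name follows the tab */

        if (!fs)
            continue;
        fs++;
        fs[strcspn(fs, "\n")] = '\0';   /* strip the trailing newline */
        found = (strcmp(fs, name) == 0);
    }
    fclose(fp);
    return found;
}
```

A typical call would be fs_type_listed("/proc/filesystems", "ext4"); the result depends on which filesystems are built in or loaded as modules on the running kernel.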
struct file_system_type
This describes the filesystem. As of kernel 2.6.39, the following members are defined:
    struct file_system_type {
            const char *name;
            int fs_flags;
            struct dentry *(*mount)(struct file_system_type *, int,
                                    const char *, void *);
            void (*kill_sb)(struct super_block *);
            struct module *owner;
            struct file_system_type *next;
            struct list_head fs_supers;
            struct lock_class_key s_lock_key;
            struct lock_class_key s_umount_key;
    };
name- the name of the filesystem type, such as “ext2”, “iso9660”, “msdos” and so on
fs_flags- various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
mount- the method to call when a new instance of this filesystem should be mounted
kill_sb- the method to call when an instance of this filesystem should be shut down
owner- for internal VFS use: you should initialize this to THIS_MODULE in most cases.
next- for internal VFS use: you should initialize this to NULL
s_lock_key, s_umount_key: lockdep-specific
The mount() method has the following arguments:
struct file_system_type *fs_type- describes the filesystem, partly initialized by the specific filesystem code
int flags- mount flags
const char *dev_name- the device name we are mounting.
void *data- arbitrary mount options, usually comes as an ASCII string (see “Mount Options” section)
The mount() method must return the root dentry of the tree requested by caller. An active reference to its superblock must be grabbed and the superblock must be locked. On failure it should return ERR_PTR(error).
The arguments match those of mount(2) and their interpretation depends on filesystem type. E.g. for block filesystems, dev_name is interpreted as block device name, that device is opened and if it contains a suitable filesystem image the method creates and initializes struct super_block accordingly, returning its root dentry to caller.
->mount() may choose to return a subtree of existing filesystem - it doesn’t have to create a new one. The main result from the caller’s point of view is a reference to dentry at the root of (sub)tree to be attached; creation of new superblock is a common side effect.
The most interesting member of the superblock structure that the mount() method fills in is the “s_op” field. This is a pointer to a “struct super_operations” which describes the next level of the filesystem implementation.
Usually, a filesystem uses one of the generic mount() implementations and provides a fill_super() callback instead. The generic variants are:
mount_bdev- mount a filesystem residing on a block device
mount_nodev- mount a filesystem that is not backed by a device
mount_single- mount a filesystem which shares the instance between all mounts
A fill_super() callback implementation has the following arguments:
struct super_block *sb- the superblock structure. The callback must initialize this properly.
void *data- arbitrary mount options, usually comes as an ASCII string (see “Mount Options” section)
int silent- whether or not to be silent on error
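Because the data argument usually arrives as an ASCII string, a fill_super() implementation typically has to split it into individual options. The userspace sketch below shows one plausible way to do that for a comma-separated string. The option names "quiet" and "size", the structure toy_mount_opts and the function toy_parse_options are all invented for illustration; in-kernel filesystems use the kernel's own parsing helpers rather than the C library.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-filesystem options; "quiet" and "size" are made up. */
struct toy_mount_opts {
    int quiet;
    unsigned long size;
};

/*
 * Parse a string such as "quiet,size=4096".  The buffer is modified in
 * place.  A NULL or empty string means "no options".  Returns 0 on
 * success and -EINVAL for an unknown or malformed option.
 */
static int toy_parse_options(char *data, struct toy_mount_opts *opts)
{
    char *p = data;

    assert(opts != NULL);
    while (p && *p) {
        char *next = strchr(p, ',');
        char *val;

        if (next)
            *next++ = '\0';             /* terminate this option */
        val = strchr(p, '=');
        if (val)
            *val++ = '\0';              /* split "key=value" */

        if (strcmp(p, "quiet") == 0 && !val) {
            opts->quiet = 1;
        } else if (strcmp(p, "size") == 0 && val && *val) {
            char *end;

            opts->size = strtoul(val, &end, 10);
            if (*end != '\0')
                return -EINVAL;         /* trailing junk after the number */
        } else if (*p != '\0' || val) {
            return -EINVAL;             /* unknown option */
        }
        p = next;
    }
    return 0;
}
```

Rejecting unknown options with -EINVAL, rather than ignoring them silently, keeps typos in a mount command from going unnoticed.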
The Superblock Object
A superblock object represents a mounted filesystem.
struct super_operations
This describes how the VFS can manipulate the superblock of your filesystem. As of kernel 2.6.22, the following members are defined:
    struct super_operations {
            struct inode *(*alloc_inode)(struct super_block *sb);
            void (*destroy_inode)(struct inode *);
            void (*dirty_inode)(struct inode *, int flags);
            int (*write_inode)(struct inode *, int);
            void (*drop_inode)(struct inode *);
            void (*delete_inode)(struct inode *);
            void (*put_super)(struct super_block *);
            int (*sync_fs)(struct super_block *sb, int wait);
            int (*freeze_fs)(struct super_block *);
            int (*unfreeze_fs)(struct super_block *);
            int (*statfs)(struct dentry *, struct kstatfs *);
            int (*remount_fs)(struct super_block *, int *, char *);
            void (*clear_inode)(struct inode *);
            void (*umount_begin)(struct super_block *);
            int (*show_options)(struct seq_file *, struct dentry *);
            ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
            ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
            int (*nr_cached_objects)(struct super_block *);
            void (*free_cached_objects)(struct super_block *, int);
    };
All methods are called without any locks being held, unless otherwise noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half).
alloc_inode- this method is called by alloc_inode() to allocate memory for struct inode and initialize it. If this function is not defined, a simple ‘struct inode’ is allocated. Normally alloc_inode will be used to allocate a larger structure which contains a ‘struct inode’ embedded within it.
destroy_inode- this method is called by destroy_inode() to release resources allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode.
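The "larger structure with an embedded inode" pattern can be tried out in userspace. The sketch below assumes a hypothetical filesystem called myfs; toy_inode stands in for the kernel's struct inode, and toy_container_of() is a userspace equivalent of the kernel's container_of() helper built on offsetof().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct inode. */
struct toy_inode {
    unsigned long i_ino;
};

/* Hypothetical filesystem-private inode with the generic inode embedded. */
struct myfs_inode_info {
    unsigned int first_block;       /* made-up filesystem-specific field */
    struct toy_inode vfs_inode;
};

/* Userspace equivalent of container_of(): member pointer -> container. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the private structure from the generic inode pointer. */
static struct myfs_inode_info *MYFS_I(struct toy_inode *inode)
{
    assert(inode != NULL);
    return toy_container_of(inode, struct myfs_inode_info, vfs_inode);
}

/* Counterpart of ->alloc_inode: allocate the big structure, hand out
 * a pointer to the embedded generic part. */
static struct toy_inode *myfs_alloc_inode(void)
{
    struct myfs_inode_info *ei = calloc(1, sizeof(*ei));

    if (!ei)
        return NULL;
    return &ei->vfs_inode;
}

/* Counterpart of ->destroy_inode: undo what myfs_alloc_inode() did. */
static void myfs_destroy_inode(struct toy_inode *inode)
{
    free(MYFS_I(inode));
}
```

The generic code only ever sees the struct toy_inode pointer, while the filesystem can reach its private fields through MYFS_I() without any extra allocation or lookup.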
dirty_inode- this method is called by the VFS to mark an inode dirty.
write_inode- this method is called when the VFS needs to write an inode to disc. The second parameter indicates whether the write should be synchronous or not; not all filesystems check this flag.
drop_inode- called when the last access to the inode is dropped, with the inode->i_lock spinlock held.
This method should be either NULL (normal UNIX filesystem semantics) or “generic_delete_inode” (for filesystems that do not want to cache inodes - causing “delete_inode” to always be called regardless of the value of i_nlink)
The “generic_delete_inode()” behavior is equivalent to the old practice of using “force_delete” in the put_inode() case, but does not have the races that the “force_delete()” approach had.
delete_inode- called when the VFS wants to delete an inode
put_super- called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held
sync_fs- called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional.
freeze_fs- called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM).
unfreeze_fs- called when VFS is unlocking a filesystem and making it writable again.
statfs- called when the VFS needs to get filesystem statistics.
remount_fs- called when the filesystem is remounted. This is called with the kernel lock held
clear_inode- called when the VFS clears the inode. Optional
umount_begin- called when the VFS is unmounting a filesystem.
show_options- called by the VFS to show mount options for /proc/<pid>/mounts. (see “Mount Options” section)
quota_read- called by the VFS to read from filesystem quota file.
quota_write- called by the VFS to write to filesystem quota file.
nr_cached_objects- called by the sb cache shrinking function for the filesystem to return the number of freeable cached objects it contains. Optional.
free_cached_objects- called by the sb cache shrinking function for the filesystem to scan the number of objects indicated to try to free them. Optional, but any filesystem implementing this method needs to also implement ->nr_cached_objects for it to be called correctly.
We can’t do anything with any errors that the filesystem might have encountered, hence the void return type. This will never be called if the VM is trying to reclaim under GFP_NOFS conditions, hence this method does not need to handle that situation itself.
Implementations must include conditional reschedule calls inside any scanning loop that is done. This allows the VFS to determine appropriate scan batch sizes without having to worry about whether implementations will cause holdoff problems due to large scan batch sizes.
Whoever sets up the inode is responsible for filling in the “i_op” field. This is a pointer to a “struct inode_operations” which describes the methods that can be performed on individual inodes.
struct xattr_handlers
On filesystems that support extended attributes (xattrs), the s_xattr superblock field points to a NULL-terminated array of xattr handlers. Extended attributes are name:value pairs.
name- Indicates that the handler matches attributes with the specified name (such as “system.posix_acl_access”); the prefix field must be NULL.
prefix- Indicates that the handler matches all attributes with the specified name prefix (such as “user.”); the name field must be NULL.
list- Determine if attributes matching this xattr handler should be listed for a particular dentry. Used by some listxattr implementations like generic_listxattr.
get- Called by the VFS to get the value of a particular extended attribute. This method is called by the getxattr(2) system call.
set- Called by the VFS to set the value of a particular extended attribute. When the new value is NULL, called to remove a particular extended attribute. This method is called by the setxattr(2) and removexattr(2) system calls.
When none of the xattr handlers of a filesystem match the specified attribute name or when a filesystem doesn’t support extended attributes, the various *xattr(2) system calls return -EOPNOTSUPP.
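The matching rules above (exact name or name prefix, NULL-terminated array, -EOPNOTSUPP when nothing matches) can be sketched in a few lines of userspace C. The structure toy_xattr_handler and both functions are hypothetical and reduced to the two matching fields; this is not the kernel's own resolution code, which performs additional checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical handler: exactly one of name/prefix is set. */
struct toy_xattr_handler {
    const char *name;
    const char *prefix;
};

/* Walk a NULL-terminated handler array; return the first match or NULL. */
static const struct toy_xattr_handler *
toy_xattr_resolve(const struct toy_xattr_handler *const *handlers,
                  const char *attr)
{
    assert(attr != NULL);
    if (!handlers)
        return NULL;                /* filesystem without xattr support */

    for (; *handlers; handlers++) {
        const struct toy_xattr_handler *h = *handlers;

        if (h->name && strcmp(attr, h->name) == 0)
            return h;               /* exact-name handler */
        if (h->prefix &&
            strncmp(attr, h->prefix, strlen(h->prefix)) == 0)
            return h;               /* prefix handler */
    }
    return NULL;
}

/* What a *xattr(2)-style entry point would report to the caller. */
static int toy_xattr_check(const struct toy_xattr_handler *const *handlers,
                           const char *attr)
{
    return toy_xattr_resolve(handlers, attr) ? 0 : -EOPNOTSUPP;
}
```

With a table holding a “system.posix_acl_access” name handler and a “user.” prefix handler, “user.comment” resolves to the prefix handler, while “trusted.foo” yields -EOPNOTSUPP.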
The Inode Object
An inode object represents an object within the filesystem.
struct inode_operations
This describes how the VFS can manipulate an inode in your filesystem. As of kernel 2.6.22, the following members are defined:
    struct inode_operations {
            int (*create)(struct inode *, struct dentry *, umode_t, bool);
            struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
            int (*link)(struct dentry *, struct inode *, struct dentry *);
            int (*unlink)(struct inode *, struct dentry *);
            int (*symlink)(struct inode *, struct dentry *, const char *);
            int (*mkdir)(struct inode *, struct dentry *, umode_t);
            int (*rmdir)(struct inode *, struct dentry *);
            int (*mknod)(struct inode *, struct dentry *, umode_t, dev_t);
            int (*rename)(struct inode *, struct dentry *,
                          struct inode *, struct dentry *, unsigned int);
            int (*readlink)(struct dentry *, char __user *, int);
            const char *(*get_link)(struct dentry *, struct inode *,
                                    struct delayed_call *);
            int (*permission)(struct inode *, int);
            int (*get_acl)(struct inode *, int);
            int (*setattr)(struct dentry *, struct iattr *);
            int (*getattr)(const struct path *, struct kstat *, u32, unsigned int);
            ssize_t (*listxattr)(struct dentry *, char *, size_t);
            void (*update_time)(struct inode *, struct timespec *, int);
            int (*atomic_open)(struct inode *, struct dentry *, struct file *,
                               unsigned open_flag, umode_t create_mode);
            int (*tmpfile)(struct inode *, struct dentry *, umode_t);
    };
Again, all methods are called without any locks being held, unless otherwise noted.
create- called by the open(2) and creat(2) system calls. Only required if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the newly created inode
lookup- called when the VFS needs to look up an inode in a parent directory. The name to look for is found in the dentry. This method must call d_add() to insert the found inode into the dentry. The “i_count” field in the inode structure should be incremented. If the named inode does not exist a NULL inode should be inserted into the dentry (this is called a negative dentry). Returning an error code from this routine must only be done on a real error, otherwise creating inodes with system calls like creat(2), mknod(2), mkdir(2) and so on will fail. If you wish to overload the dentry methods then you should initialise the “d_op” field in the dentry; this is a pointer to a struct “dentry_operations”. This method is called with the directory inode semaphore held
link- called by the link(2) system call. Only required if you want to support hard links. You will probably need to call d_instantiate() just as you would in the create() method
unlink- called by the unlink(2) system call. Only required if you want to support deleting inodes
symlink- called by the symlink(2) system call. Only required if you want to support symlinks. You will probably need to call d_instantiate() just as you would in the create() method
mkdir- called by the mkdir(2) system call. Only required if you want to support creating subdirectories. You will probably need to call d_instantiate() just as you would in the create() method
rmdir- called by the rmdir(2) system call. Only required if you want to support deleting subdirectories
mknod- called by the mknod(2) system call to create a device (char, block) inode or a named pipe (FIFO) or socket. Only required if you want to support creating these types of inodes. You will probably need to call d_instantiate() just as you would in the create() method
rename- called by the rename(2) system call to rename the object to have the parent and name given by the second inode and dentry.
The filesystem must return -EINVAL for any unsupported or unknown flags. Currently the following flags are implemented:
(1) RENAME_NOREPLACE: this flag indicates that if the target of the rename exists the rename should fail with -EEXIST instead of replacing the target. The VFS already checks for existence, so for local filesystems the RENAME_NOREPLACE implementation is equivalent to plain rename.
(2) RENAME_EXCHANGE: exchange source and target. Both must exist; this is checked by the VFS. Unlike plain rename, source and target may be of different type.
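The rule that unsupported or unknown rename flags must be rejected amounts to a single mask test. The sketch below models a hypothetical local filesystem, toyfs, that supports RENAME_NOREPLACE but not RENAME_EXCHANGE. The flag values mirror those in the Linux UAPI headers, but they are redefined here under a TOY_ prefix so that the example stays self-contained.

```c
#include <assert.h>
#include <errno.h>

/* Same bit values as RENAME_NOREPLACE and RENAME_EXCHANGE in the Linux
 * UAPI headers; TOY_ prefix avoids clashing with libc definitions. */
#define TOY_RENAME_NOREPLACE (1u << 0)
#define TOY_RENAME_EXCHANGE  (1u << 1)

/*
 * Flag check for a hypothetical filesystem that only supports
 * RENAME_NOREPLACE.  Any other bit, known or unknown, is refused.
 */
static int toyfs_rename_check_flags(unsigned int flags)
{
    if (flags & ~TOY_RENAME_NOREPLACE)
        return -EINVAL;
    /* The VFS has already verified that the target does not exist,
     * so from here on NOREPLACE behaves like a plain rename. */
    return 0;
}
```

Testing against the complement of the supported mask means a flag added to the kernel later is rejected automatically instead of being ignored.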
get_link- called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body to traverse (and possibly resets the current position with nd_jump_link()). If the body won’t go away until the inode is gone, nothing else is needed; if it needs to be otherwise pinned, arrange for its release by having get_link(…, …, done) do set_delayed_call(done, destructor, argument). In that case destructor(argument) will be called once VFS is done with the body you’ve returned. May be called in RCU mode; that is indicated by NULL dentry argument. If request can’t be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD).
If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a ‘release’ memory barrier.
readlink- this is now just an override for use by readlink(2) for the cases when ->get_link uses nd_jump_link() or object is not in fact a symlink. Normally filesystems should only implement ->get_link for symlinks and readlink(2) will automatically use that.
permission- called by the VFS to check for access rights on a POSIX-like filesystem.
May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk mode, the filesystem must check the permission without blocking or storing to the inode.
If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
setattr- called by the VFS to set attributes for a file. This method is called by chmod(2) and related system calls.
getattr- called by the VFS to get attributes of a file. This method is called by stat(2) and related system calls.
listxattr- called by the VFS to list all extended attributes for a given file. This method is called by the listxattr(2) system call.
update_time- called by the VFS to update a specific time or the i_version of an inode. If this is not defined the VFS will update the inode itself and call mark_inode_dirty_sync.
atomic_open- called on the last component of an open. Using this optional method the filesystem can look up, possibly create and open the file in one atomic operation. If it wants to leave actual opening to the caller (e.g. if the file turned out to be a symlink, device, or just something filesystem won’t do atomic open for), it may signal this by returning finish_no_open(file, dentry). This method is only called if the last component is negative or needs lookup. Cached positive dentries are still handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL the method must only succeed if the file didn’t exist and hence FMODE_CREATED shall always be set on success.
tmpfile- called at the end of O_TMPFILE open(). Optional, equivalent to atomically creating, opening and unlinking a file in given directory.
The Address Space Object
The address space object is used to group and manage pages in the page cache. It can be used to keep track of the pages in a file (or anything else) and also track the mapping of sections of the file into process address spaces.
There are a number of distinct yet related services that an address-space can provide. These include communicating memory pressure, page lookup by address, and keeping track of pages tagged as Dirty or Writeback.
The first can be used independently of the others. The VM can try to either write dirty pages in order to clean them, or release clean pages in order to reuse them. To do this it can call the ->writepage method on dirty pages, and ->releasepage on clean pages with PagePrivate set. Clean pages without PagePrivate and with no external references will be released without notice being given to the address_space.
To achieve this functionality, pages need to be placed on an LRU with lru_cache_add and mark_page_active needs to be called whenever the page is used.
Pages are normally kept in a radix tree indexed by ->index. This tree maintains information about the PG_Dirty and PG_Writeback status of each page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default ->writepages method. It uses the tag to find dirty pages to call ->writepage on. If mpage_writepages is not used (i.e. the address provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost unused. write_inode_now and sync_inode do use it (through __sync_single_inode) to check if ->writepages has been successful in writing out the whole address_space.
The Writeback tag is used by filemap*wait* and sync_page* functions, via filemap_fdatawait_range, to wait for all writeback to complete.
An address_space handler may attach extra information to a page, typically using the ‘private’ field in the ‘struct page’. If such information is attached, the PG_Private flag should be set. This will cause various VM routines to make extra calls into the address_space handler to deal with that data.
An address space acts as an intermediary between storage and application. Data is read into the address space a whole page at a time, and provided to the application either by copying of the page, or by memory-mapping the page. Data is written into the address space by the application, and then written-back to storage typically in whole pages, however the address_space has finer control of write sizes.
The read process essentially only requires ‘readpage’. The write process is more complicated and uses write_begin/write_end or set_page_dirty to write data into the address_space, and writepage and writepages to writeback data to storage.
Adding and removing pages to/from an address_space is protected by the inode’s i_mutex.
When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepage asks for it to be written. This should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared.
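The flag transitions in the previous paragraph form a small state machine, which can be written down as a userspace toy. The structure toy_page and the three functions are hypothetical; real page flags are manipulated atomically under the page lock, and none of that is modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page carrying only the two flags discussed in the text. */
struct toy_page {
    bool dirty;         /* models PG_Dirty */
    bool writeback;     /* models PG_Writeback */
};

/* The application wrote into the page. */
static void toy_write_to_page(struct toy_page *p)
{
    p->dirty = true;
}

/*
 * Start writeout: clear Dirty, set Writeback.  Returns true if I/O
 * was started.  This toy refuses to start a second writeout while
 * one is still in flight.
 */
static bool toy_writepage(struct toy_page *p)
{
    if (!p->dirty || p->writeback)
        return false;
    p->dirty = false;
    p->writeback = true;
    return true;
}

/* I/O completion: the data is known to be safe on storage. */
static void toy_end_writeback(struct toy_page *p)
{
    assert(p->writeback);
    p->writeback = false;
}
```

A page that is written to again while Writeback is still set ends up with both flags set, which is why the two states are tracked separately rather than as a single "clean/dirty" bit.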
Writeback makes use of a writeback_control structure to direct the operations. This gives the writepage and writepages operations some information about the nature of and reason for the writeback request, and the constraints under which it is being done. It is also used to return information back to the caller about the result of a writepage or writepages request.
Handling errors during writeback
Most applications that do buffered I/O will periodically call a file synchronization call (fsync, fdatasync, msync or sync_file_range) to ensure that data written has made it to the backing store. When there is an error during writeback, they expect that error to be reported when a file sync request is made. After an error has been reported on one request, subsequent requests on the same file descriptor should return 0, unless further writeback errors have occurred since the previous file synchronization.
Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.
Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).
Filesystems that wish to use this infrastructure should call mapping_set_error to record the error in the address_space when it occurs. Then, after writing back data from the pagecache in their file->fsync operation, they should call file_check_and_advance_wb_err to ensure that the struct file’s error cursor has advanced to the correct point in the stream of errors emitted by the backing device(s).
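The "error cursor" idea can be illustrated with a deliberately simplified userspace model: the mapping keeps the latest error plus a sequence counter, and every open file remembers the counter value it has already reported. All names below are hypothetical, and the kernel's real mechanism handles further corner cases that this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the per-address_space error state. */
struct toy_mapping {
    int err;                /* most recent writeback error (negative) */
    unsigned int seq;       /* bumped every time an error is recorded */
};

/* Toy stand-in for struct file with its error cursor. */
struct toy_file {
    struct toy_mapping *mapping;
    unsigned int seen;      /* value of seq already reported on this file */
};

/* Open: sample the current position in the stream of errors. */
static void toy_file_open(struct toy_file *f, struct toy_mapping *m)
{
    f->mapping = m;
    f->seen = m->seq;
}

/* Counterpart of recording an error when writeback fails. */
static void toy_mapping_set_error(struct toy_mapping *m, int err)
{
    assert(err < 0);
    m->err = err;
    m->seq++;
}

/* Counterpart of the fsync-time check: report once, then advance. */
static int toy_check_and_advance(struct toy_file *f)
{
    struct toy_mapping *m = f->mapping;

    if (f->seen == m->seq)
        return 0;           /* nothing new since the last report */
    f->seen = m->seq;
    return m->err;
}
```

In this model two files that were open when an -EIO was recorded each see the error exactly once on their next check, and both get 0 on the check after that, matching the reporting behaviour described above.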
struct address_space_operations
This describes how the VFS can manipulate mapping of a file to page cache in your filesystem. The following members are defined:
    struct address_space_operations {
            int (*writepage)(struct page *page, struct writeback_control *wbc);
            int (*readpage)(struct file *, struct page *);
            int (*writepages)(struct address_space *, struct writeback_control *);
            int (*set_page_dirty)(struct page *page);
            void (*readahead)(struct readahead_control *);
            int (*readpages)(struct file *filp, struct address_space *mapping,
                             struct list_head *pages, unsigned nr_pages);
            int (*write_begin)(struct file *, struct address_space *mapping,
                               loff_t pos, unsigned len, unsigned flags,
                               struct page **pagep, void **fsdata);
            int (*write_end)(struct file *, struct address_space *mapping,
                             loff_t pos, unsigned len, unsigned copied,
                             struct page *page, void *fsdata);
            sector_t (*bmap)(struct address_space *, sector_t);
            void (*invalidatepage)(struct page *, unsigned int, unsigned int);
            int (*releasepage)(struct page *, int);
            void (*freepage)(struct page *);
            ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
            /* isolate a page for migration */
            bool (*isolate_page)(struct page *, isolate_mode_t);
            /* migrate the contents of a page to the specified target */
            int (*migratepage)(struct page *, struct page *);
            /* put migration-failed page back to right list */
            void (*putback_page)(struct page *);
            int (*launder_page)(struct page *);
            int (*is_partially_uptodate)(struct page *, unsigned long,
                                         unsigned long);
            void (*is_dirty_writeback)(struct page *, bool *, bool *);
            int (*error_remove_page)(struct address_space *mapping,
                                     struct page *page);
            int (*swap_activate)(struct file *);
            int (*swap_deactivate)(struct file *);
    };
writepage- called by the VM to write a dirty page to backing store. This may happen for data integrity reasons (i.e. ‘sync’), or to free up memory (flush). The difference can be seen in wbc->sync_mode. The PG_Dirty flag has been cleared and PageLocked is true. writepage should start writeout, should set PG_Writeback, and should make sure the page is unlocked, either synchronously or asynchronously when the write operation completes.
If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn’t have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to internal dependencies). If it chooses not to start writeout, it should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep calling ->writepage on that page.
See the file “Locking” for more details.
readpage- called by the VM to read a page from backing store. The page will be Locked when readpage is called, and should be unlocked and marked uptodate once the read completes. If ->readpage discovers that it needs to unlock the page for some reason, it can do so, and then return AOP_TRUNCATED_PAGE. In this case, the page will be relocated, relocked and if that all succeeds, ->readpage will be called again.
writepages- called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WB_SYNC_ALL, then the writeback_control will specify a range of pages that must be written out. If it is WB_SYNC_NONE, then a nr_to_write is given and that many pages should be written if possible. If no ->writepages is given, then mpage_writepages is used instead. This will choose pages from the address space that are tagged as DIRTY and will pass them to ->writepage.
set_page_dirty- called by the VM to set a page dirty. This is particularly needed if an address space attaches private data to a page, and that data needs to be updated when a page is dirtied. This is called, for example, when a memory mapped page gets modified. If defined, it should set the PageDirty flag, and the PAGECACHE_TAG_DIRTY tag in the radix tree.
readahead- Called by the VM to read pages associated with the address_space object. The pages are consecutive in the page cache and are locked. The implementation should decrement the page refcount after starting I/O on each page. Usually the page will be unlocked by the I/O completion handler. If the filesystem decides to stop attempting I/O before reaching the end of the readahead window, it can simply return. The caller will decrement the page refcount and unlock the remaining pages for you. Set PageUptodate if the I/O completes successfully. Setting PageError on any page will be ignored; simply unlock the page if an I/O error occurs.
readpages- called by the VM to read pages associated with the address_space object. This is essentially just a vector version of readpage. Instead of just one page, several pages are requested. readpages is only used for read-ahead, so read errors are ignored. If anything goes wrong, feel free to give up. This interface is deprecated and will be removed by the end of 2020; implement readahead instead.
write_begin- Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven’t been read already) so that the updated blocks can be written out properly.
The filesystem must return the locked pagecache page for the specified offset, in *pagep, for the caller to write into.
It must be able to cope with short writes (where the length passed to write_begin is greater than the number of bytes copied into the page).
flags is a field for AOP_FLAG_xxx flags, described in include/linux/fs.h.
A void * may be returned in fsdata, which then gets passed into write_end.
Returns 0 on success; < 0 on failure (which is the error code), in which case write_end is not called.
write_end- After a successful write_begin, and data copy, write_end must be called. len is the original len passed to write_begin, and copied is the amount that was able to be copied.
The filesystem must take care of unlocking the page and releasing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<= ‘copied’) that were able to be copied into pagecache.
bmap- called by the VFS to map a logical block offset within object tophysical block number. This method is used by the FIBMAP ioctland for working with swap-files. To be able to swap to a file,the file must have a stable mapping to a block device. The swapsystem does not go through the filesystem but instead uses bmapto find out where the blocks in the file are and uses thoseaddresses directly.
invalidatepage- If a page has PagePrivate set, then invalidatepage will becalled when part or all of the page is to be removed from theaddress space. This generally corresponds to either atruncation, punch hole or a complete invalidation of the addressspace (in the latter case ‘offset’ will always be 0 and ‘length’will be PAGE_SIZE). Any private data associated with the pageshould be updated to reflect this truncation. If offset is 0and length is PAGE_SIZE, then the private data should bereleased, because the page must be able to be completelydiscarded. This may be done by calling the ->releasepagefunction, but in this case the release MUST succeed.
releasepage- releasepage is called on PagePrivate pages to indicate that the page should be freed if possible. ->releasepage should remove any private data from the page and clear the PagePrivate flag. If releasepage() fails for some reason, it must indicate failure with a 0 return value. releasepage() is used in two distinct though related cases. The first is when the VM finds a clean page with no active users and wants to make it a free page. If ->releasepage succeeds, the page will be removed from the address_space and become free.
The second case is when a request has been made to invalidate some or all pages in an address_space. This can happen through the fadvise(POSIX_FADV_DONTNEED) system call or by the filesystem explicitly requesting it as nfs and 9fs do (when they believe the cache may be out of date with storage) by calling invalidate_inode_pages2(). If the filesystem makes such a call, and needs to be certain that all pages are invalidated, then its releasepage will need to ensure this. Possibly it can clear the PageUptodate bit if it cannot free private data yet.
freepage- freepage is called once the page is no longer visible in the page cache in order to allow the cleanup of any private data. Since it may be called by the memory reclaimer, it should not assume that the original address_space mapping still exists, and it should not block.
direct_IO- called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the application’s address space.
isolate_page- Called by the VM when isolating a movable non-lru page. If page is successfully isolated, VM marks the page as PG_isolated via __SetPageIsolated.
migrate_page- This is used to compact the physical memory usage. If the VM wants to relocate a page (maybe off a memory card that is signalling imminent failure) it will pass a new page and an old page to this function. migrate_page should transfer any private data across and update any references that it has to the page.
putback_page- Called by the VM when isolated page’s migration fails.
launder_page- Called before freeing a page - it writes back the dirty page. To prevent redirtying the page, it is kept locked during the whole operation.
is_partially_uptodate- Called by the VM when reading a file through the pagecache when the underlying blocksize != pagesize. If the required block is up to date then the read can complete without needing the IO to bring the whole page up to date.
is_dirty_writeback- Called by the VM when attempting to reclaim a page. The VM uses dirty and writeback information to determine if it needs to stall to allow flushers a chance to complete some IO. Ordinarily it can use PageDirty and PageWriteback but some filesystems have more complex state (unstable pages in NFS prevent reclaim) or do not set those flags due to locking problems. This callback allows a filesystem to indicate to the VM if a page should be treated as dirty or writeback for the purposes of stalling.
error_remove_page- normally set to generic_error_remove_page if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased.
swap_activate- Called when swapon is used on a file to allocate space if necessary and pin the block lookup information in memory. A return value of zero indicates success, in which case this file can be used to back swapspace.
swap_deactivate- Called during swapoff on files where swap_activate was successful.
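The write_begin / write_end contract described above can be pictured with a small userspace model. Nothing in it is kernel API: toy_file, toy_write_begin, toy_write_end, toy_write and the 16-byte TOY_PAGE_SIZE are hypothetical names invented for this sketch. It only illustrates the ordering (begin, copy, end), the handling of a short copy, and the i_size update.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16   /* hypothetical, tiny on purpose */

struct toy_file {
    char   page[TOY_PAGE_SIZE]; /* stands in for one pagecache page */
    int    locked;              /* 1 while the "page" is locked */
    size_t i_size;              /* file size in bytes */
};

/* Prepare to write len bytes at pos and hand back the locked page. */
static int toy_write_begin(struct toy_file *f, size_t pos, size_t len,
                           char **pagep)
{
    if (pos > TOY_PAGE_SIZE || len > TOY_PAGE_SIZE - pos)
        return -ENOSPC;         /* failure: write_end is not called */
    f->locked = 1;
    *pagep = f->page;
    return 0;
}

/* Finish the write: unlock the page and update i_size. */
static int toy_write_end(struct toy_file *f, size_t pos, size_t len,
                         size_t copied)
{
    assert(f->locked && copied <= len);
    f->locked = 0;
    if (pos + copied > f->i_size)
        f->i_size = pos + copied;
    return (int)copied;
}

/*
 * Caller side. 'avail' models how many source bytes can really be copied,
 * so avail < len produces a short write.
 */
static int toy_write(struct toy_file *f, size_t pos, const char *src,
                     size_t len, size_t avail)
{
    char *page;
    size_t copied = avail < len ? avail : len;
    int ret = toy_write_begin(f, pos, len, &page);

    if (ret < 0)
        return ret;
    memcpy(page + pos, src, copied);
    return toy_write_end(f, pos, len, copied);
}
```

In this model a call such as toy_write(&f, 5, "world", 5, 2) asks for five bytes but copies only two, and toy_write_end advances i_size by the copied amount rather than by the requested length.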
The File Object¶
A file object represents a file opened by a process. This is also known as an “open file description” in POSIX parlance.
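The difference between a file descriptor and an open file description can be observed from userspace. The fragment below is a plain POSIX sketch; offsets_after_write is a hypothetical helper name and the /tmp path template is an assumption. It writes through one descriptor and then reads the file position through a dup(2) copy (same open file description) and through a second open(2) of the same path (a separate one).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Write 4 bytes through one descriptor, then report the current offset
 * seen via a dup()ed descriptor and via an independent open() of the
 * same file. Returns 0 on success, -1 on any failure.
 */
static int offsets_after_write(off_t *dup_off, off_t *reopen_off)
{
    char path[] = "/tmp/vfs-ofd-XXXXXX";
    int fd, fd_dup, fd_new, ret = -1;

    assert(dup_off != NULL && reopen_off != NULL);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    fd_dup = dup(fd);               /* shares the open file description */
    fd_new = open(path, O_RDONLY);  /* creates a new one */
    if (fd_dup >= 0 && fd_new >= 0 && write(fd, "data", 4) == 4) {
        *dup_off = lseek(fd_dup, 0, SEEK_CUR);
        *reopen_off = lseek(fd_new, 0, SEEK_CUR);
        ret = 0;
    }
    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_new >= 0)
        close(fd_new);
    close(fd);
    unlink(path);
    return ret;
}
```

The duplicated descriptor sees the offset advanced by the write because the position lives in the shared file object, while the independently opened descriptor still starts at offset 0.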
struct file_operations¶
This describes how the VFS can manipulate an open file. As of kernel 4.18, the following members are defined:

    struct file_operations {
        struct module *owner;
        loff_t (*llseek)(struct file *, loff_t, int);
        ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
        int (*iopoll)(struct kiocb *kiocb, bool spin);
        int (*iterate)(struct file *, struct dir_context *);
        int (*iterate_shared)(struct file *, struct dir_context *);
        __poll_t (*poll)(struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
        long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
        int (*mmap)(struct file *, struct vm_area_struct *);
        int (*open)(struct inode *, struct file *);
        int (*flush)(struct file *, fl_owner_t id);
        int (*release)(struct inode *, struct file *);
        int (*fsync)(struct file *, loff_t, loff_t, int datasync);
        int (*fasync)(int, struct file *, int);
        int (*lock)(struct file *, int, struct file_lock *);
        ssize_t (*sendpage)(struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock)(struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **, void **);
        long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
    #ifndef CONFIG_MMU
        unsigned (*mmap_capabilities)(struct file *);
    #endif
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                                   struct file *file_out, loff_t pos_out,
                                   loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);
    };

Again, all methods are called without any locks being held, unless otherwise noted.
llseek- called when the VFS needs to move the file position index
read- called by read(2) and related system calls
read_iter- possibly asynchronous read with iov_iter as destination
write- called by write(2) and related system calls
write_iter- possibly asynchronous write with iov_iter as source
iopoll- called when aio wants to poll for completions on HIPRI iocbs
iterate- called when the VFS needs to read the directory contents
iterate_shared- called when the VFS needs to read the directory contents when filesystem supports concurrent dir iterators
poll- called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls
unlocked_ioctl- called by the ioctl(2) system call.
compat_ioctl- called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels.
mmap- called by the mmap(2) system call
open- called by the VFS when an inode should be opened. When the VFS opens a file, it creates a new “struct file”. It then calls the open method for the newly allocated file structure. You might think that the open method really belongs in “struct inode_operations”, and you may be right. I think it’s done the way it is because it makes filesystems simpler to implement. The open() method is a good place to initialize the “private_data” member in the file structure if you want to point to a device structure
flush- called by the close(2) system call to flush a file
release- called when the last reference to an open file is closed
fsync- called by the fsync(2) system call. Also see the section above entitled “Handling errors during writeback”.
fasync- called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file
lock- called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands
get_unmapped_area- called by the mmap(2) system call
check_flags- called by the fcntl(2) system call for F_SETFL command
flock- called by the flock(2) system call
splice_write- called by the VFS to splice data from a pipe to a file. This method is used by the splice(2) system call
splice_read- called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call
setlease- called by the VFS to set or release a file lock lease. setlease implementations should call generic_setlease to record or remove the lease in the inode after setting it.
fallocate- called by the VFS to preallocate blocks or punch a hole.
copy_file_range- called by the copy_file_range(2) system call.
remap_file_range- called by the ioctl(2) system call for FICLONERANGE and FICLONE and FIDEDUPERANGE commands to remap file ranges. An implementation should remap len bytes at pos_in of the source file into the dest file at pos_out. Implementations must handle callers passing in len == 0; this means “remap to the end of the source file”. The return value should be the number of bytes remapped, or the usual negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the requested file ranges have identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is ok with the implementation shortening the request length to satisfy alignment or EOF requirements (or any other reason).
fadvise- possibly called by the fadvise64() system call.
Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special support routines in the VFS which will locate the required device driver information. These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open() method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open() method.
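The hand-over from filesystem file operations to driver file operations can be sketched as a userspace model. Every name below (toy_file, toy_file_operations, toy_chrdev_open, toy_def_chr_fops, toy_driver_fops, toy_vfs_open) is hypothetical; the driver lookup that a real kernel would perform is reduced here to a fixed pointer assignment.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

struct toy_file_operations {
    int (*open)(struct toy_file *file);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    int opened_by_driver;   /* set by the driver's open() */
};

/* The device driver's own open() method. */
static int toy_driver_open(struct toy_file *file)
{
    file->opened_by_driver = 1;
    return 0;
}

static const struct toy_file_operations toy_driver_fops = {
    .open = toy_driver_open,
};

/* Support routine: find the driver, swap f_op, then call the new open(). */
static int toy_chrdev_open(struct toy_file *file)
{
    file->f_op = &toy_driver_fops;  /* stands in for the driver lookup */
    return file->f_op->open ? file->f_op->open(file) : 0;
}

/* What the filesystem installs for device nodes in this model. */
static const struct toy_file_operations toy_def_chr_fops = {
    .open = toy_chrdev_open,
};

/* VFS side: set up the file object, then call its open method. */
static int toy_vfs_open(struct toy_file *file)
{
    file->f_op = &toy_def_chr_fops;
    file->opened_by_driver = 0;
    assert(file->f_op->open != NULL);
    return file->f_op->open(file);
}
```

After toy_vfs_open() returns, the file object points at the driver's operations, so every later call through f_op reaches the driver directly.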
Directory Entry Cache (dcache)¶
struct dentry_operations¶
This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or the VFS uses a default. As of kernel 2.6.22, the following members are defined:

    struct dentry_operations {
        int (*d_revalidate)(struct dentry *, unsigned int);
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *);
        int (*d_compare)(const struct dentry *, unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, const struct inode *);
    };
d_revalidate- called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it.
This function should return a positive value if the dentry is still valid, and zero or a negative error code if it isn’t.
d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). If in rcu-walk mode, the filesystem must revalidate the dentry without blocking or storing to the dentry. d_parent and d_inode should not be used without care (because they can change and, in d_inode case, even become NULL under us).
If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
d_weak_revalidate- called when the VFS needs to revalidate a “jumped” dentry. This is called when a path-walk ends at dentry that was not acquired by doing a lookup in the parent directory. This includes “/”, “.” and “..”, as well as procfs-style symlinks and mountpoint traversal.
In this case, we are less concerned with whether the dentry is still fully correct, but rather that the inode is still valid. As with d_revalidate, most local filesystems will set this to NULL since their dcache entries are always valid.
This function has the same return code semantics as d_revalidate.
d_weak_revalidate is only called after leaving rcu-walk mode.
d_hash- called when the VFS adds a dentry to the hash table. The first dentry passed to d_hash is the parent directory that the name is to be hashed into.
Same locking and synchronisation rules as d_compare regarding what is safe to dereference etc.
d_compare- called to compare a dentry name with a given name. The first dentry is the parent of the dentry to be compared, the second is the child dentry. len and name string are properties of the dentry to be compared. qstr is the name to compare it with.
Must be constant and idempotent, should not take locks if possible, and should not store into the dentry. Should not dereference pointers outside the dentry without lots of care (eg. d_parent, d_inode, d_name should not be used).
However, our vfsmount is pinned, and RCU held, so the dentries and inodes won’t disappear, neither will our sb or filesystem module. ->d_sb may be used.
It is a tricky calling convention because it needs to be called under “rcu-walk”, ie. without any locks or references on things.
d_delete- called when the last reference to a dentry is dropped and the dcache is deciding whether or not to cache it. Return 1 to delete immediately, or 0 to cache the dentry. Default is NULL which means to always cache a reachable dentry. d_delete must be constant and idempotent.
d_init- called when a dentry is allocated
d_release- called when a dentry is really deallocated
d_iput- called when a dentry loses its inode (just prior to its being deallocated). The default when this is NULL is that the VFS calls iput(). If you define this method, you must call iput() yourself.
d_dname- called when the pathname of a dentry should be generated. Useful for some pseudo filesystems (sockfs, pipefs, …) to delay pathname generation. (Instead of doing it when dentry is created, it’s done only when the path is needed.) Real filesystems probably don’t want to use it, because their dentries are present in global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless appropriate SMP safety is used.
CAUTION: d_path() logic is quite tricky. The correct way to return for example “Hello” is to put it at the end of the buffer, and return a pointer to the first char. The dynamic_dname() helper function is provided to take care of this.
Example:

    static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
    {
        return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
                             dentry->d_inode->i_ino);
    }
d_automount- called when an automount dentry is to be traversed (optional). This should create a new VFS mount record and return the record to the caller. The caller is supplied with a path parameter giving the automount directory to describe the automount target and the parent VFS mount record to provide inheritable mount parameters. NULL should be returned if someone else managed to make the automount first. If the vfsmount creation failed, then an error code should be returned. If -EISDIR is returned, then the directory will be treated as an ordinary directory and returned to pathwalk to continue walking.
If a vfsmount is returned, the caller will attempt to mount it on the mountpoint and will remove the vfsmount from its expiration list in the case of failure. The vfsmount should be returned with 2 refs on it to prevent automatic expiration - the caller will clean up the additional ref.
This function is only used if DCACHE_NEED_AUTOMOUNT is set on the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the inode being added.
d_manage- called to allow the filesystem to manage the transition from a dentry (optional). This allows autofs, for example, to hold up clients waiting to explore behind a ‘mountpoint’ while letting the daemon go past and construct the subtree there. 0 should be returned to let the calling process continue. -EISDIR can be returned to tell pathwalk to use this directory as an ordinary directory and to ignore anything mounted on it and not to check the automount flag. Any other error code will abort pathwalk completely.
If the ‘rcu_walk’ parameter is true, then the caller is doing a pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, and the caller can be asked to leave it and call again by returning -ECHILD. -EISDIR may also be returned to tell pathwalk to ignore d_automount or any mounts.
This function is only used if DCACHE_MANAGE_TRANSIT is set on the dentry being transited from.
d_real- overlay/union type filesystems implement this method to return one of the underlying dentries hidden by the overlay. It is used in two different modes:
Called from file_dentry() it returns the real dentry matching the inode argument. The real dentry may be from a lower layer already copied up, but still referenced from the file. This mode is selected with a non-NULL inode argument.
With NULL inode the topmost real underlying dentry is returned.
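The rcu-walk fallback described for d_revalidate above (return -ECHILD, then get called again in ref-walk mode) can be sketched as a userspace model. TOY_LOOKUP_RCU, toy_dentry, toy_d_revalidate and toy_revalidate are hypothetical names; the "server check" is a stand-in for any work that would have to block.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_LOOKUP_RCU 0x1   /* hypothetical stand-in for LOOKUP_RCU */

struct toy_dentry {
    int needs_server_check;  /* validation would have to block */
    int still_valid;         /* outcome of the (blocking) check */
};

/* Positive: valid. Zero: invalid. -ECHILD: retry in ref-walk mode. */
static int toy_d_revalidate(struct toy_dentry *d, unsigned int flags)
{
    if (d->needs_server_check) {
        if (flags & TOY_LOOKUP_RCU)
            return -ECHILD;         /* must not block or store in rcu-walk */
        /* ref-walk: blocking work and stores are allowed here */
        d->needs_server_check = 0;
    }
    return d->still_valid ? 1 : 0;
}

/* Caller side: try rcu-walk first, fall back to ref-walk on -ECHILD. */
static int toy_revalidate(struct toy_dentry *d, int *used_ref_walk)
{
    int ret = toy_d_revalidate(d, TOY_LOOKUP_RCU);

    assert(used_ref_walk != NULL);
    *used_ref_walk = 0;
    if (ret == -ECHILD) {
        *used_ref_walk = 1;
        ret = toy_d_revalidate(d, 0);
    }
    return ret;
}
```

The method itself never decides to retry; it only reports that rcu-walk cannot handle the case, and the caller repeats the call without the RCU flag.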
Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory.
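The buffer convention noted under d_dname (place the name at the end of the buffer and return a pointer to its first character) is easy to get wrong, so here is a userspace sketch of the idea. toy_dynamic_dname is a hypothetical function written for this illustration; it is not the kernel's dynamic_dname(), it takes no dentry argument, and it returns NULL on overflow purely as a simplification.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a name and place it at the END of buffer, returning a pointer
 * to its first character, or NULL if it does not fit.
 */
static char *toy_dynamic_dname(char *buffer, int buflen, const char *fmt, ...)
{
    char temp[64];
    va_list args;
    int sz;

    assert(buffer != NULL && buflen > 0);
    va_start(args, fmt);
    sz = vsnprintf(temp, sizeof(temp), fmt, args) + 1;  /* count the NUL */
    va_end(args);

    if (sz <= 0 || sz > (int)sizeof(temp) || sz > buflen)
        return NULL;

    buffer += buflen - sz;          /* start so that the NUL is the last byte */
    return memcpy(buffer, temp, (size_t)sz);
}
```

Formatting into a temporary first means the final length is known before anything is written into the caller's buffer, which is what allows the name to be right-aligned.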
Directory Entry Cache API¶
There are a number of functions defined which permit a filesystem to manipulate dentries:
dget- open a new handle for an existing dentry (this just increments the usage count)
dput- close a handle for a dentry (decrements the usage count). If the usage count drops to 0, and the dentry is still in its parent’s hash, the “d_delete” method is called to check whether it should be cached. If it should not be cached, or if the dentry is not hashed, it is deleted. Otherwise cached dentries are put into an LRU list to be reclaimed on memory shortage.
d_drop- this unhashes a dentry from its parent’s hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0
d_delete- delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead
d_add- add a dentry to its parent’s hash list and then calls d_instantiate()
d_instantiate- add a dentry to the alias hash list for the inode and updates the “d_inode” member. The “i_count” member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a “negative dentry”. This function is commonly called when an inode is created for an existing negative dentry
d_lookup- look up a dentry given its parent and path name component. It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it.
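The interplay of dget, dput, d_drop and the d_delete method listed above can be modelled in a few lines of userspace C. All names here (toy_dentry, toy_dget, toy_dput, toy_d_drop, toy_always_delete) are hypothetical, and the LRU list and deallocation are reduced to flags so the decision logic stays visible.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
    int count;      /* usage count */
    int hashed;     /* still in the parent's hash? */
    int on_lru;     /* parked for later reclaim */
    int freed;      /* "deallocated" marker */
    int (*d_delete)(const struct toy_dentry *);  /* optional, may be NULL */
};

static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
    d->count++;
    return d;
}

static void toy_d_drop(struct toy_dentry *d)
{
    d->hashed = 0;
}

static void toy_dput(struct toy_dentry *d)
{
    assert(d->count > 0);
    if (--d->count > 0)
        return;
    /* Last reference: keep it cached only if hashed and d_delete agrees. */
    if (d->hashed && !(d->d_delete && d->d_delete(d)))
        d->on_lru = 1;
    else
        d->freed = 1;
}

/* A d_delete method that asks for immediate deletion. */
static int toy_always_delete(const struct toy_dentry *d)
{
    (void)d;
    return 1;
}
```

A dentry with no d_delete method and still hashed ends up on the LRU after the last toy_dput(); one that was unhashed with toy_d_drop() first, or whose d_delete returns 1, is freed instead.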
Mount Options¶
Parsing options¶
On mount and remount the filesystem is passed a string containing a comma separated list of mount options. The options can have either of these forms:

    option
    option=value

The <linux/parser.h> header defines an API that helps parse these options. There are plenty of examples on how to use it in existing filesystems.
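To make the two option forms concrete, here is a userspace sketch that splits such a string with strsep(3). It deliberately does not use the <linux/parser.h> API; toy_opts, toy_parse_options and the "ro" and "uid" option names are assumptions chosen for illustration only.

```c
#define _DEFAULT_SOURCE   /* for strsep() on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_opts {
    int  ro;        /* flag option: "ro" */
    long uid;       /* valued option: "uid=<n>" */
    int  unknown;   /* count of unrecognised options */
};

/* Parse a comma separated option string in place (data is modified). */
static void toy_parse_options(char *data, struct toy_opts *opts)
{
    char *p;

    assert(opts != NULL);
    while ((p = strsep(&data, ",")) != NULL) {
        char *value;

        if (!*p)
            continue;               /* skip empty fields such as ",," */
        value = strchr(p, '=');
        if (value)
            *value++ = '\0';        /* split "option=value" */

        if (!strcmp(p, "ro") && !value)
            opts->ro = 1;
        else if (!strcmp(p, "uid") && value)
            opts->uid = strtol(value, NULL, 10);
        else
            opts->unknown++;
    }
}
```

A real filesystem would normally reject unknown options with an error rather than count them; the counter is only there to keep the sketch observable.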
Showing options¶
If a filesystem accepts mount options, it must define show_options() to show all the currently active options. The rules are:
- options MUST be shown which are not default or their values differ from the default
- options MAY be shown which are enabled by default or have their default value
Options used only internally between a mount helper and the kernel (such as file descriptors), or which only have an effect during the mounting (such as ones controlling the creation of a journal) are exempt from the above rules.
The underlying reason for the above rules is to make sure that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts.
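The "MUST show non-default options" rule can be illustrated with a userspace model that writes into a plain character buffer. toy_sb_opts, toy_show_options, TOY_DEFAULT_UID and the option names are hypothetical, and the leading comma on each option is an assumption of this sketch about how the output is appended to other options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_DEFAULT_UID 0L

struct toy_sb_opts {
    int  ro;    /* default: 0 (read-write) */
    long uid;   /* default: TOY_DEFAULT_UID */
};

/*
 * Append ",name[=value]" for every option that differs from its default.
 * Returns 0 on success, -1 if the output was truncated.
 */
static int toy_show_options(const struct toy_sb_opts *o, char *buf, size_t len)
{
    int n = 0;

    assert(o != NULL && buf != NULL && len > 0);
    buf[0] = '\0';
    if (o->ro)
        n += snprintf(buf + n, len - n, ",ro");
    if (o->uid != TOY_DEFAULT_UID && (size_t)n < len)
        n += snprintf(buf + n, len - n, ",uid=%ld", o->uid);
    return (size_t)n < len ? 0 : -1;
}
```

With every option at its default the output is empty, and feeding the emitted string back into a parser reproduces the same state, which is the property the rules are meant to guarantee.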
Resources¶
(Note some of these resources are not up-to-date with the latest kernel version.)
- Creating Linux virtual filesystems. 2002 <https://lwn.net/Articles/13325/>
- The Linux Virtual File-system Layer by Neil Brown. 1999 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
- A tour of the Linux VFS by Michael K. Johnson. 1996 <https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
- A small trail through the Linux kernel by Andries Brouwer. 2001 <https://www.win.tue.nl/~aeb/linux/vfs/trail.html>