Direct Access for files

Motivation

The page cache is usually used to buffer reads and writes to files. It is also used to provide the pages which are mapped into userspace by a call to mmap.

For block devices that are memory-like, the page cache pages would be unnecessary copies of the original storage. The DAX code removes the extra copy by performing reads and writes directly to the storage device. For file mappings, the storage device is mapped directly into userspace.

Usage

If you have a block device which supports DAX, you can make a filesystem on it as usual. The DAX code currently only supports files with a block size equal to your kernel’s PAGE_SIZE, so you may need to specify a block size when creating the filesystem.
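A quick way to check the block-size constraint from userspace is to compare the filesystem block size with the page size. The following is a minimal sketch in Python, not part of the DAX code. The helper name is hypothetical, and it assumes that statvfs reports the filesystem block size in f_bsize, which holds for common Linux filesystems but is not guaranteed everywhere.

```python
import os


def block_size_matches_page_size(mount_point):
    """Return True if the filesystem block size equals the kernel page size."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    # Assumption: f_bsize reflects the filesystem block size on this system.
    fs_block_size = os.statvfs(mount_point).f_bsize
    return fs_block_size == page_size


if __name__ == "__main__":
    # "/" is only a placeholder; pass the mount point of the DAX filesystem.
    print(block_size_matches_page_size("/"))
```

If the helper returns False, the filesystem would need to be recreated with a block size that matches the page size before DAX can be used on it.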

Currently 5 filesystems support DAX: ext2, ext4, xfs, virtiofs and erofs. The way DAX is enabled differs between them.

Enabling DAX on ext2 and erofs

When mounting the filesystem, use the -o dax option on the command line or add ‘dax’ to the options in /etc/fstab. This works to enable DAX on all files within the filesystem. It is equivalent to the -o dax=always behavior below.

Enabling DAX on xfs and ext4

Summary

  1. There exists an in-kernel file access mode flag S_DAX that corresponds to the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details about this access mode.

  2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular files and directories. This advisory flag can be set or cleared at any time, but doing so does not immediately affect the S_DAX state.

  3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will be inherited by all regular files and subdirectories that are subsequently created in this directory. Files and subdirectories that exist at the time this flag is set or cleared on the parent directory are not modified by this modification of the parent directory.

  4. There exist dax mount options which can override FS_XFLAG_DAX in the setting of the S_DAX flag. Given underlying storage which supports DAX, the following hold:

    -o dax=inode means “follow FS_XFLAG_DAX” and is the default.

    -o dax=never means “never set S_DAX, ignore FS_XFLAG_DAX.”

    -o dax=always means “always set S_DAX, ignore FS_XFLAG_DAX.”

    -o dax is a legacy option which is an alias for dax=always.

    Warning

    The option -o dax may be removed in the future, so -o dax=always is the preferred method for specifying this behavior.

    Note

    Modifications to, and the inheritance behavior of, FS_XFLAG_DAX remain the same even when the filesystem is mounted with a dax option. However, in-core inode state (S_DAX) will be overridden until the filesystem is remounted with dax=inode and the inode is evicted from kernel memory.

  5. The S_DAX policy can be changed via:

    1. Setting the parent directory FS_XFLAG_DAX as needed before files are created

    2. Setting the appropriate dax="foo" mount option

    3. Changing the FS_XFLAG_DAX flag on existing regular files and directories. This has runtime constraints and limitations that are described in 6) below.

  6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag, the change to existing regular files won’t take effect until the files are closed by all processes.
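The interaction between the mount option, the persistent flag and the media can be summarised as a small decision function. The following Python sketch is a userspace model of the rules in the summary, not kernel code. The names DaxMount, parse_dax_option and effective_s_dax are hypothetical, and the model ignores timing (a change only becomes visible once the inode is re-instantiated in memory).

```python
from enum import Enum


class DaxMount(Enum):
    """The three dax mount modes described in the summary."""
    INODE = "inode"
    NEVER = "never"
    ALWAYS = "always"


def parse_dax_option(option):
    """Map a mount option string such as 'dax=inode' to a DaxMount mode."""
    if option == "dax":
        # Legacy spelling, an alias for dax=always.
        return DaxMount.ALWAYS
    if option.startswith("dax="):
        return DaxMount(option[len("dax="):])
    raise ValueError(f"not a dax mount option: {option!r}")


def effective_s_dax(mode, fs_xflag_dax, media_supports_dax, is_regular_file=True):
    """Model whether S_DAX would be set when the inode is instantiated."""
    # S_DAX is only ever set on regular files backed by DAX-capable media.
    if not media_supports_dax or not is_regular_file:
        return False
    if mode is DaxMount.NEVER:
        return False
    if mode is DaxMount.ALWAYS:
        return True
    # dax=inode: follow the persistent FS_XFLAG_DAX flag.
    return fs_xflag_dax


if __name__ == "__main__":
    mode = parse_dax_option("dax=inode")
    print(effective_s_dax(mode, fs_xflag_dax=True, media_supports_dax=True))
```

In this model the persistent flag matters only under dax=inode; the other two modes override it without modifying it.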

Details

There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX) and the other is a volatile flag indicating the active state of the feature (S_DAX).

FS_XFLAG_DAX is preserved within the filesystem. This persistent config setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl (see ioctl_xfs_fsgetxattr(2)) or a utility such as ‘xfs_io’.
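As an illustration of the ioctl route, the Python sketch below reads the current flags with FS_IOC_FSGETXATTR, toggles FS_XFLAG_DAX and writes the result back with FS_IOC_FSSETXATTR. It rests on several assumptions that should be checked against linux/fs.h on the target system: the 28-byte struct fsxattr layout, the value 0x00008000 for FS_XFLAG_DAX, and the generic ioctl number encoding (some architectures encode the direction bits differently). The helper names are hypothetical.

```python
import fcntl
import os
import struct

# Assumed layout of struct fsxattr: five __u32 fields followed by 8 pad bytes.
FSXATTR_FORMAT = "=5I8s"
FSXATTR_SIZE = struct.calcsize(FSXATTR_FORMAT)
FS_XFLAG_DAX = 0x00008000


def _ioc(direction, type_char, number, size):
    """Build an ioctl request number using the generic Linux encoding."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | number


FS_IOC_FSGETXATTR = _ioc(2, "X", 31, FSXATTR_SIZE)  # _IOR('X', 31, struct fsxattr)
FS_IOC_FSSETXATTR = _ioc(1, "X", 32, FSXATTR_SIZE)  # _IOW('X', 32, struct fsxattr)


def with_dax_flag(xflags, enable):
    """Return xflags with FS_XFLAG_DAX set or cleared."""
    return xflags | FS_XFLAG_DAX if enable else xflags & ~FS_XFLAG_DAX


def set_dax_flag(path, enable):
    """Set or clear the persistent FS_XFLAG_DAX flag on a file or directory."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(FSXATTR_SIZE)
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)  # the kernel fills buf in place
        fields = list(struct.unpack(FSXATTR_FORMAT, buf))
        fields[0] = with_dax_flag(fields[0], enable)  # fsx_xflags is first
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, struct.pack(FSXATTR_FORMAT, *fields))
    finally:
        os.close(fd)
```

Calling set_dax_flag on a directory would correspond to the ‘chattr +x’ step performed with xfs_io; the ioctl is expected to fail on filesystems that do not implement it.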

New files and directories automatically inherit FS_XFLAG_DAX from their parent directory when created. Therefore, setting FS_XFLAG_DAX at directory creation time can be used to set a default behavior for an entire sub-tree.

To clarify inheritance, here are 3 examples:

Example A:

    mkdir -p a/b/c
    xfs_io -c 'chattr +x' a
    mkdir a/b/c/d
    mkdir a/e

    ------[outcome]------

    dax: a,e
    no dax: b,c,d

Example B:

    mkdir a
    xfs_io -c 'chattr +x' a
    mkdir -p a/b/c/d

    ------[outcome]------

    dax: a,b,c,d
    no dax:

Example C:

    mkdir -p a/b/c
    xfs_io -c 'chattr +x' c
    mkdir a/b/c/d

    ------[outcome]------

    dax: c,d
    no dax: a,b
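The three examples can be reproduced with a small in-memory model of the inheritance rule: a new directory copies the flag of its parent at creation time, and changing the flag later affects only the directory it is applied to. This Python sketch is purely illustrative; the DaxTree class and its methods are hypothetical, directories are addressed by their full path, and the top-level parent is assumed not to carry the flag.

```python
class DaxTree:
    """In-memory model of FS_XFLAG_DAX inheritance for directories."""

    def __init__(self):
        self.flags = {}  # directory path -> FS_XFLAG_DAX state

    def mkdir_p(self, path):
        """Create path and any missing parents, like 'mkdir -p'."""
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            current = "/".join(parts[:depth])
            if current in self.flags:
                continue  # existing directories are never modified
            parent = "/".join(parts[:depth - 1])
            # A new directory inherits the flag from its parent when created.
            self.flags[current] = self.flags.get(parent, False)

    def chattr_plus_x(self, path):
        """Set the flag on one existing directory only."""
        self.flags[path] = True

    def outcome(self):
        """Return (dax, no_dax) lists of directory names."""
        dax = sorted(p.split("/")[-1] for p, f in self.flags.items() if f)
        no_dax = sorted(p.split("/")[-1] for p, f in self.flags.items() if not f)
        return dax, no_dax


if __name__ == "__main__":
    tree = DaxTree()  # Example A
    tree.mkdir_p("a/b/c")
    tree.chattr_plus_x("a")
    tree.mkdir_p("a/b/c/d")
    tree.mkdir_p("a/e")
    print(tree.outcome())
```

Running the same model with the command sequences of Examples B and C yields the outcomes listed for them.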

The current enabled state (S_DAX) is set when a file inode is instantiated in memory by the kernel. It is set based on the underlying media support, the value of FS_XFLAG_DAX and the filesystem’s dax mount option.

statx can be used to query S_DAX.

Note

Only regular files will ever have S_DAX set; therefore statx will never indicate that S_DAX is set on directories.
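A userspace check of the active state might look like the Python sketch below, which calls the C library's statx wrapper through ctypes and tests STATX_ATTR_DAX in stx_attributes. It assumes a glibc that exports statx, the value 0x00200000 for STATX_ATTR_DAX, and that stx_attributes is a 64-bit field at byte offset 8 of struct statx; these should be verified against statx(2) and the system headers. The function names are hypothetical.

```python
import ctypes
import os
import struct

AT_FDCWD = -100               # resolve relative paths against the cwd
STATX_BASIC_STATS = 0x07FF    # request the basic stat fields
STATX_ATTR_DAX = 0x00200000   # assumed value, see linux/stat.h
STX_ATTRIBUTES_OFFSET = 8     # assumed offset of stx_attributes
STATX_BUF_SIZE = 256          # assumed size of struct statx


def attributes_indicate_dax(stx_attributes):
    """Return True if the STATX_ATTR_DAX bit is set."""
    return bool(stx_attributes & STATX_ATTR_DAX)


def file_is_dax(path):
    """Ask the kernel whether the file is currently in the DAX state."""
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(STATX_BUF_SIZE)
    rc = libc.statx(AT_FDCWD, os.fsencode(path), 0, STATX_BASIC_STATS, buf)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    (attrs,) = struct.unpack_from("=Q", buf.raw, STX_ATTRIBUTES_OFFSET)
    return attributes_indicate_dax(attrs)
```

For a directory, file_is_dax is expected to return False regardless of the persistent flag, in line with the note above.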

Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even if the underlying media does not support dax and/or the filesystem is overridden with a mount option.

Enabling DAX on virtiofs

The semantics of DAX on virtiofs are basically the same as on ext4 and xfs, except that when ‘-o dax=inode’ is specified, the virtiofs client derives the hint whether DAX shall be enabled from the virtiofs server through the FUSE protocol, rather than from the persistent FS_XFLAG_DAX flag. That is, whether DAX is enabled is completely determined by the virtiofs server, while the server itself may use various algorithms to make this decision, e.g. depending on the persistent FS_XFLAG_DAX flag on the host.

Setting or clearing the persistent FS_XFLAG_DAX flag inside the guest is still supported, but it is not guaranteed that DAX will then be enabled or disabled for the corresponding file. Users inside the guest still need to call statx(2) and check the statx flag STATX_ATTR_DAX to see if DAX is enabled for this file.

Implementation Tips for Block Driver Writers

To support DAX in your block driver, implement the ‘direct_access’ block device operation. It is used to translate the sector number (expressed in units of 512-byte sectors) to a page frame number (pfn) that identifies the physical page for the memory. It also returns a kernel virtual address that can be used to access the memory.

The direct_access method takes a ‘size’ parameter that indicates the number of bytes being requested. The function should return the number of bytes that can be contiguously accessed at that offset. It may also return a negative errno if an error occurs.

In order to support this method, the storage must be byte-accessible by the CPU at all times. If your device uses paging techniques to expose a large amount of memory through a smaller window, then you cannot implement direct_access. Equally, if your device can occasionally stall the CPU for an extended period, you should also not attempt to implement direct_access.
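The arithmetic behind such a translation can be modelled in a few lines. The Python sketch below is a userspace illustration for a device whose memory is one physically contiguous range; it is not driver code. The function name, the 4 KiB page size and the choice of -ERANGE for an out-of-range sector are assumptions made for the example.

```python
import errno

SECTOR_SIZE = 512   # sector numbers are in units of 512-byte sectors
PAGE_SHIFT = 12     # assumption: 4 KiB pages


def direct_access_model(phys_base, dev_bytes, sector, size):
    """Model the translation for one physically contiguous device.

    Returns (byte_count_or_negative_errno, pfn_or_None).
    """
    offset = sector * SECTOR_SIZE
    if offset >= dev_bytes:
        # Illustrative error choice; a real driver picks its own errno.
        return -errno.ERANGE, None
    pfn = (phys_base + offset) >> PAGE_SHIFT
    # Report how many of the requested bytes are contiguously accessible.
    available = min(size, dev_bytes - offset)
    return available, pfn


if __name__ == "__main__":
    # 1 MiB device at physical address 4 GiB, request 8 KiB at sector 8.
    print(direct_access_model(0x1_0000_0000, 1 << 20, 8, 8192))
```

A request that starts near the end of the device returns fewer bytes than were asked for, which is how the caller learns where the contiguous range ends.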

These block devices may be used for inspiration:

  • pmem: NVDIMM persistent memory driver

Implementation Tips for Filesystem Writers

Filesystem support consists of:

  • Adding support to mark inodes as being DAX by setting the S_DAX flag in i_flags

  • Implementing ->read_iter and ->write_iter operations which use dax_iomap_rw() when the inode has the S_DAX flag set

  • Implementing an mmap file operation for DAX files which sets the VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These handlers should probably call dax_iomap_fault() passing the appropriate fault size and iomap operations.

  • Calling iomap_zero_range() passing appropriate iomap operations instead of block_truncate_page() for DAX files

  • Ensuring that there is sufficient locking between reads, writes, truncates and page faults

The iomap handlers for allocating blocks must make sure that allocated blocks are zeroed out and converted to written extents before being returned to avoid exposure of uninitialized data through mmap.

These filesystems may be used for inspiration:

  • xfs: see The SGI XFS Filesystem

  • ext4: see Documentation/filesystems/ext4/

Handling Media Errors

The libnvdimm subsystem stores a record of known media error locations for each pmem block device (in gendisk->badblocks). If we fault at such a location, or one with a latent error not yet discovered, the application can expect to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply writing the affected sectors (through the pmem driver, and if the underlying NVDIMM supports the clear_poison DSM defined by ACPI).

Since DAX IO normally doesn’t go through the driver/bio path, applications or sysadmins have an option to restore the lost data from a prior backup/inbuilt redundancy in the following ways:

  1. Delete the affected file, and restore from a backup (sysadmin route): This will free the filesystem blocks that were being used by the file, and the next time they’re allocated, they will be zeroed first, which happens through the driver, and will clear bad sectors.

  2. Truncate or hole-punch the part of the file that has a bad-block (at least an entire aligned sector has to be hole-punched, but not necessarily an entire filesystem block).

These are the two basic paths that allow DAX filesystems to continue operating in the presence of media errors. More robust error recovery mechanisms can be built on top of this in the future, for example, involving redundancy/mirroring provided at the block layer through DM, or additionally, at the filesystem level. These would have to rely on the above two tenets, that error clearing can happen either by sending an IO through the driver, or zeroing (also through the driver).
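For the hole-punch route, the range passed to the kernel has to cover whole aligned sectors. The Python sketch below first widens a byte range to 512-byte sector boundaries and then calls fallocate(2) through ctypes with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. It is a minimal illustration under stated assumptions: a 64-bit Linux system whose C library exports fallocate with a 64-bit off_t, and a filesystem that supports hole punching. The helper names are hypothetical.

```python
import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
SECTOR_SIZE = 512


def aligned_sector_range(offset, length, sector_size=SECTOR_SIZE):
    """Widen [offset, offset + length) to sector boundaries.

    Returns (aligned_start, aligned_length).
    """
    start = (offset // sector_size) * sector_size
    end = -(-(offset + length) // sector_size) * sector_size  # round up
    return start, end - start


def punch_bad_range(path, offset, length):
    """Hole-punch the aligned sectors covering a bad byte range."""
    start, span = aligned_sector_range(offset, length)
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: off_t is 64 bits wide on this platform.
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    fd = os.open(path, os.O_RDWR)
    try:
        mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
        if libc.fallocate(fd, mode, start, span) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), path)
    finally:
        os.close(fd)
```

The data in the punched range is lost and reads back as zeroes afterwards, so the affected part of the file still has to be restored from a backup or from redundancy.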

Shortcomings

Even if the kernel or its modules are stored on a filesystem that supports DAX on a block device that supports DAX, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC.

Calling get_user_pages() on a range of user memory that has been mmapped from a DAX file will fail when there are no ‘struct page’ to describe those pages. This problem has been addressed in some device drivers by adding optional struct page support for pages under the control of the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of how to do this). In the non struct page cases, O_DIRECT reads/writes to those memory ranges from a non-DAX file will fail.

Note

O_DIRECT reads/writes of a DAX file do work; it is the memory that is being accessed that is key here. Other things that will not work in the non struct page case include RDMA, sendfile() and splice().