NFS LOCALIO

Overview
The LOCALIO auxiliary RPC protocol allows the Linux NFS client and server to reliably handshake to determine if they are on the same host. Select “NFS client and server support for LOCALIO auxiliary protocol” in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
Once an NFS client and server handshake as “local”, the client will bypass the network RPC protocol for read, write and commit operations. Due to this XDR and RPC bypass, these operations complete faster.
The LOCALIO auxiliary protocol’s implementation, which uses the same connection as NFS traffic, follows the pattern established by the NFS ACL protocol extension.
The LOCALIO auxiliary protocol is needed to allow robust discovery of clients local to their servers. In a private implementation that preceded use of this LOCALIO protocol, a fragile sockaddr network address based match against all local network interfaces was attempted. But unlike the LOCALIO protocol, the sockaddr-based matching didn’t handle use of iptables or containers.
The robust handshake between local client and server is just the beginning: the ultimate use case this locality makes possible is that the client is able to open files and issue reads, writes and commits directly to the server without having to go over the network. The requirement is to perform these loopback NFS operations as efficiently as possible; this is particularly useful for container use cases (e.g. kubernetes) where it is possible to run an IO job local to the server.
The performance advantage realized from LOCALIO’s ability to bypass using XDR and RPC for reads, writes and commits can be extreme, e.g.:
- fio for 20 secs with directio, qd of 8, 16 libaio threads:
With LOCALIO:
    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

Without LOCALIO:
    4K read:    IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
- fio for 20 secs with directio, qd of 8, 1 libaio thread:
With LOCALIO:
    4K read:    IOPS=230k,  BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

Without LOCALIO:
    4K read:    IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
FAQ
What are the use cases for LOCALIO?
Workloads where the NFS client and server are on the same host realize improved IO performance. In particular, it is common when running containerised workloads for jobs to find themselves running on the same host as the knfsd server being used for storage.
What are the requirements for LOCALIO?
Bypass use of the network RPC protocol as much as possible. This includes bypassing XDR and RPC for open, read, write and commit operations.
Allow client and server to autonomously discover if they are running local to each other without making any assumptions about the local network topology.
Support the use of containers by being compatible with relevant namespaces (e.g. network, user, mount).
Support all versions of NFS. NFSv3 is of particular importance because it has wide enterprise usage and pNFS flexfiles makes use of it for the data path.
Why doesn’t LOCALIO just compare IP addresses or hostnames when deciding if the NFS client and server are co-located on the same host?
Since one of the main use cases is containerised workloads, we cannot assume that IP addresses will be shared between the client and server. This sets up a requirement for a handshake protocol that needs to go over the same connection as the NFS traffic in order to identify that the client and the server really are running on the same host. The handshake uses a secret that is sent over the wire, and can be verified by both parties by comparing with a value stored in shared kernel memory if they are truly co-located.
Does LOCALIO improve pNFS flexfiles?
Yes, LOCALIO complements pNFS flexfiles by allowing it to take advantage of NFS client and server locality. Policy that initiates client IO as close as possible to the server where the data is stored naturally benefits from the data path optimization LOCALIO provides.
Why not develop a new pNFS layout to enable LOCALIO?
A new pNFS layout could be developed, but doing so would put the onus on the server to somehow discover that the client is co-located when deciding to hand out the layout. There is value in a simpler approach (as provided by LOCALIO) that allows the NFS client to negotiate and leverage locality without requiring more elaborate modeling and discovery of such locality in a more centralized manner.
Why is having the client perform a server-side file OPEN, without using RPC, beneficial? Is the benefit pNFS specific?
Avoiding the use of XDR and RPC for file opens is beneficial to performance regardless of whether pNFS is used. Especially when dealing with small files, it’s best to avoid going over the wire whenever possible; otherwise it could reduce or even negate the benefits of avoiding the wire for doing the small file I/O itself. Given LOCALIO’s requirements, the current approach of having the client perform a server-side file open, without using RPC, is ideal. If in the future requirements change then we can adapt accordingly.
Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
Strong authentication is usually tied to the connection itself. It works by establishing a context that is cached by the server, and that acts as the key for discovering the authorisation token, which can then be passed to rpc.mountd to complete the authentication process. On the other hand, in the case of AUTH_UNIX, the credential that was passed over the wire is used directly as the key in the upcall to rpc.mountd. This simplifies the authentication process, and so makes AUTH_UNIX easier to support.
How do export options that translate RPC user IDs behave for LOCALIO operations (e.g. root_squash, all_squash)?
Export options that translate user IDs are managed by nfsd_setuser(), which is called by nfsd_setuser_and_check_port(), which is called by __fh_verify(). So they get handled exactly the same way for LOCALIO as they do for non-LOCALIO.

How does LOCALIO make certain that object lifetimes are managed properly given NFSD and NFS operate in different contexts?
See the detailed “NFS Client and Server Interlock” section below.
RPC
The LOCALIO auxiliary RPC protocol consists of a single “UUID_IS_LOCAL” RPC method that allows the Linux NFS client to verify the local Linux NFS server can see the nonce (single-use UUID) the client generated and made available in nfs_common. This protocol isn’t part of an IETF standard, nor does it need to be, considering it is a Linux-to-Linux auxiliary RPC protocol that amounts to an implementation detail.
The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of the fixed UUID_SIZE (16 bytes). The fixed-size opaque encode and decode XDR methods are used instead of the less efficient variable-sized methods.
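For illustration, a minimal sketch of such fixed-size opaque handling, assuming the generic xdr_stream helpers from include/linux/sunrpc/xdr.h; the function names here are illustrative, not the exact symbols in the kernel’s LOCALIO code:

#include <linux/sunrpc/xdr.h>
#include <linux/uuid.h>         /* provides uuid_t and UUID_SIZE (16) */

/* Client: encode the uuid_t as a fixed-size opaque (no length word) */
static void localio_xdr_enc_uuidargs(struct rpc_rqst *req,
                                     struct xdr_stream *xdr,
                                     const void *data)
{
        const uuid_t *uuid = data;

        xdr_stream_encode_opaque_fixed(xdr, uuid, UUID_SIZE);
}

/* Server: decode the fixed-size opaque back into a uuid_t */
static bool localio_xdr_dec_uuidargs(struct xdr_stream *xdr, uuid_t *uuid)
{
        return xdr_stream_decode_opaque_fixed(xdr, uuid, UUID_SIZE) >= 0;
}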
The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned by IANA, see https://www.iana.org/assignments/rpc-program-numbers/):

    Linux Kernel Organization    400122    nfslocalio
The LOCALIO protocol spec in rpcgen syntax is:
/* raw RFC 9562 UUID */
#define UUID_SIZE 16
typedef u8 uuid_t<UUID_SIZE>;

program NFS_LOCALIO_PROGRAM {
        version LOCALIO_V1 {
                void
                        NULL(void) = 0;
                void
                        UUID_IS_LOCAL(uuid_t) = 1;
        } = 1;
} = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such, LOCALIO is not registered with rpcbind.
NFS Common and Client/Server Handshake
fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client to generate a nonce (single-use UUID) and associated short-lived nfs_uuid_t struct, register it with nfs_common for subsequent lookup and verification by the NFS server; if matched, the NFS server populates members in the nfs_uuid_t struct. The NFS client then uses nfs_common to transfer the nfs_uuid_t from its nfs_uuids to the nn->nfsd_serv clients_list from the nfs_common’s uuids_list. See: fs/nfs/localio.c:nfs_local_probe()
nfs_common’s nfs_uuids list is the basis for LOCALIO enablement; as such it has members that point to nfsd memory for direct use by the client (e.g. ‘net’ is the server’s network namespace; through it the client can access nn->nfsd_serv with proper RCU read access). It is this client and server synchronization that enables advanced usage and lifetime of objects to span from the host kernel’s nfsd to per-container knfsd instances that are connected to NFS clients running on the same local host.
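As a rough sketch of this handshake flow, using the objects described above; nfs_uuid_begin()/nfs_uuid_end() and the two helper functions are simplified stand-ins for the real fs/nfs_common/nfslocalio.c code (locking and RCU publication elided):

/* Client side (cf. fs/nfs/localio.c:nfs_local_probe()), schematically */
static void nfs_local_probe_sketch(struct nfs_client *clp)
{
        nfs_uuid_t nfs_uuid;

        /* generate the nonce and register it on nfs_common's uuids_list */
        nfs_uuid_begin(&nfs_uuid);
        /* UUID_IS_LOCAL goes over the same connection as NFS traffic */
        if (localio_rpc_uuid_is_local(clp, &nfs_uuid))  /* hypothetical helper */
                nfs_local_enable(clp);  /* server matched and filled in ->net */
        nfs_uuid_end(&nfs_uuid);        /* the nonce is single-use */
}

/* Server side: verify the nonce exists in shared kernel memory */
static bool nfs_uuid_is_local_sketch(const uuid_t *uuid, struct net *net)
{
        nfs_uuid_t *nfs_uuid;

        /* the real code holds a lock while walking nfs_common's uuids_list */
        list_for_each_entry(nfs_uuid, &nfs_uuids, list) {
                if (uuid_equal(&nfs_uuid->uuid, uuid)) {
                        /* matched: point the client at the server's net-ns */
                        nfs_uuid->net = net;
                        return true;
                }
        }
        return false;
}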
NFS Client and Server Interlock
LOCALIO provides the nfs_uuid_t object and associated interfaces to allow proper network namespace (net-ns) and NFSD object refcounting.
LOCALIO required the introduction and use of NFSD’s percpu nfsd_net_ref to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure each net-ns is not destroyed while in use by nfsd_open_local_fh(), and warrants a more detailed explanation:
nfsd_open_local_fh() uses nfsd_net_try_get() before opening its nfsd_file handle and then the caller (NFS client) must drop the reference for the nfsd_file and associated net-ns using nfsd_file_put_local() once it has completed its IO.

This interlock working relies heavily on nfsd_open_local_fh() being afforded the ability to safely deal with the possibility that the NFSD’s net-ns (and nfsd_net by association) may have been destroyed by nfsd_destroy_serv() via nfsd_shutdown_net().
This interlock of the NFS client and server has been verified to fix an easy-to-hit crash that would occur if an NFSD instance running in a container, with a LOCALIO client mounted, is shut down. Upon restart of the container and associated NFSD, the client would go on to crash due to a NULL pointer dereference caused by the LOCALIO client attempting to nfsd_open_local_fh() without having a proper reference on NFSD’s net-ns.
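Schematically, the acquire/use/release pattern looks like the following sketch; nfsd_net_try_get(), nfsd_open_local_fh() and nfsd_file_put_local() are the real entry points named above, while localio_open_fh(), localio_do_io() and the simplified argument lists are illustrative:

static int localio_open_and_do_io_sketch(struct net *net)
{
        struct nfsd_file *nf;

        /* refuse if nfsd_shutdown_net() is tearing this nfsd_net down */
        if (!nfsd_net_try_get(net))
                return -ENXIO;  /* client must re-probe via nfs_local_probe() */

        /* the real call takes the rpc_clnt, cred, filehandle, etc. */
        nf = localio_open_fh(net);      /* wrapper around nfsd_open_local_fh() */
        if (IS_ERR(nf))
                return PTR_ERR(nf);     /* assume open drops the net-ns ref on error */

        localio_do_io(nf);              /* hypothetical read/write/commit path */

        /* drops the nfsd_file and, with it, the net-ns reference */
        nfsd_file_put_local(nf);
        return 0;
}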
NFS Client issues IO instead of Server
Because LOCALIO is focused on protocol bypass to achieve improved IO performance, alternatives to the traditional NFS wire protocol (SUNRPC with XDR) must be provided to access the backing filesystem.
See fs/nfs/localio.c:nfs_local_open_fh() and fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes focused use of select nfs server objects to allow a client local to a server to open a file pointer without needing to go over the network.
The client’s fs/nfs/localio.c:nfs_local_open_fh() will call into the server’s fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access both the associated nfsd network namespace and nn->nfsd_serv in terms of RCU. If nfsd_open_local_fh() finds that the client no longer sees valid nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO to nfs_local_open_fh() and the client will try to reestablish the LOCALIO resources needed by calling nfs_local_probe() again. This recovery is needed if/when an nfsd instance running in a container were to reboot while a LOCALIO client is connected to it.
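A sketch of the client-side recovery path just described; the argument lists are simplified, and nfs_local_open_fh_sketch()/nfs_local_disable() are assumed helper names rather than verbatim kernel symbols:

static int nfs_local_try_open_sketch(struct nfs_client *clp, struct nfs_fh *fh)
{
        int status;

        status = nfs_local_open_fh_sketch(clp, fh);     /* simplified stand-in */
        if (status == -ENXIO) {
                /*
                 * The server's struct net or nn->nfsd_serv went away,
                 * e.g. a containerized nfsd restarted. Drop the stale
                 * LOCALIO state and handshake again.
                 */
                nfs_local_disable(clp);         /* assumed helper */
                nfs_local_probe(clp);           /* re-run the LOCALIO handshake */
        }
        return status;
}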
Once the client has an open nfsd_file pointer it will issue reads, writes and commits directly to the underlying local filesystem (normally done by the nfs server). As such, for these operations, the NFS client is issuing IO to the underlying local filesystem that it is sharing with the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and fs/nfs/localio.c:nfs_local_commit().
With normal NFS that makes use of RPC to issue IO to the server, if an application uses O_DIRECT the NFS client will bypass the pagecache but the NFS server will not. The NFS server’s use of buffered IO allows applications to be less precise with their alignment when issuing IO to the NFS client. But if all applications properly align their IO, LOCALIO can be configured to use end-to-end O_DIRECT semantics from the NFS client to the underlying local filesystem that it is sharing with the NFS server, by setting the ‘localio_O_DIRECT_semantics’ nfs module parameter to Y, e.g.:
echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics
Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics (but again, this may cause IO to fail if applications do not properly align their IO).
Security
LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka AUTH_SYS) is used.
Care is taken to ensure the same NFS security mechanisms are used (authentication, etc) regardless of whether LOCALIO or regular NFS access is used. The auth_domain established as part of the traditional NFS client access to the NFS server is also used for LOCALIO.
Relative to containers, LOCALIO gives the client access to the network namespace the server has. This is required to allow the client to access the server’s per-namespace nfsd_net struct. With traditional NFS, the client is afforded this same level of access (albeit in terms of the NFS protocol via SUNRPC). No other namespaces (user, mount, etc) have been altered or purposely extended from the server to the client.
Module Parameters
/sys/module/nfs/parameters/localio_enabled (bool)
    controls if LOCALIO is enabled; defaults to Y. If client and server are local but ‘localio_enabled’ is set to N then LOCALIO will not be used.
/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
    controls if O_DIRECT extends down to the underlying filesystem; defaults to N. Application IO must be logical-blocksize aligned, otherwise O_DIRECT will fail.
/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
    controls if NFSv3 read and write IOs will trigger (re)enabling of LOCALIO every N (nfs3_localio_probe_throttle) IOs; defaults to 0 (disabled). Must be a power of 2; the admin keeps all the pieces if they misconfigure it (too low a value or not a power of 2).
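The power-of-2 requirement lets the throttle be applied with a cheap mask instead of a division. A minimal sketch, with illustrative names rather than the exact nfsv3 code:

#include <linux/atomic.h>
#include <linux/log2.h>

static atomic_long_t nfs3_localio_io_count = ATOMIC_LONG_INIT(0);

/* Returns true every Nth IO, where N = nfs3_localio_probe_throttle */
static bool nfs3_localio_should_reprobe(unsigned int throttle)
{
        if (!throttle || !is_power_of_2(throttle))
                return false;   /* disabled or misconfigured */

        /* with a power-of-2 throttle, the modulo is a simple mask */
        return (atomic_long_inc_return(&nfs3_localio_io_count) &
                (throttle - 1)) == 0;
}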
Testing
The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write and commit access have proven stable against various test scenarios:
Client and server both on the same host.
All permutations of client and server support enablement for both local and remote client and server.
Testing against NFS storage products that don’t support the LOCALIO protocol was also performed.
Client on host, server within a container (for both v3 and v4.2). The container testing was in terms of podman-managed containers and includes a successful container stop/restart scenario.
Formalizing these test scenarios in terms of existing test infrastructure is ongoing. Initial regular coverage is provided in terms of ktest running xfstests against a LOCALIO-enabled NFS loopback mount configuration, and includes lockdep and KASAN coverage. See:

https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next

https://github.com/koverstreet/ktest
Various kdevops testing (in terms of “Chuck’s BuildBot”) has been performed to regularly verify the LOCALIO changes haven’t caused any regressions to non-LOCALIO NFS use cases.
All of Hammerspace’s various sanity tests pass with LOCALIO enabled (this includes numerous pNFS and flexfiles tests).