RFC 9561 | pNFS SCSI Layout for NVMe | April 2024
Hellwig, et al. | Standards Track
This document specifies how to use the Parallel Network File System (pNFS) Small Computer System Interface (SCSI) Layout Type to access storage devices using the Non-Volatile Memory Express (NVMe) protocol family.
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc9561.
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.
NFSv4.1 [RFC8881] includes a pNFS feature that allows reads and writes to be performed by means other than directing read and write operations to the server. Through use of this feature, the server, in the role of metadata server, is responsible for managing file and directory metadata while separate means are provided to execute reads and writes.
These other means of performing file reads and writes are defined by individual mapping types, which often have their own specifications.
The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is a layout type that allows NFS clients to directly perform I/O to block storage devices while bypassing the Metadata Server (MDS). It is specified by using concepts from the SCSI protocol family for the data path to the storage devices.
NVM Express (NVMe), or the Non-Volatile Memory Host Controller Interface Specification, is a set of specifications to talk to storage devices over a number of protocols such as PCI Express (PCIe), Fibre Channel (FC), TCP/IP, or Remote Direct Memory Access (RDMA) networking. NVMe is currently the predominantly used protocol to access PCIe Solid State Disks (SSDs), and it is increasingly being adopted for remote storage access to replace SCSI-based protocols such as iSCSI.
This document defines how NVMe Namespaces using the NVM Command Set [NVME-NVM] exported by NVMe Controllers implementing the NVMe Base specification [NVME-BASE] are to be used as storage devices using the SCSI Layout Type. The definition is independent of the underlying transport used by the NVMe Controller and thus supports Controllers implementing a wide variety of transports, including PCIe, RDMA, TCP, and FC.
This document does not amend the existing SCSI layout document. Rather, it defines how NVMe Namespaces can be used within the SCSI Layout by establishing a mapping of the SCSI constructs used in the SCSI layout document to corresponding NVMe constructs.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The following definitions are included to provide context for the reader.
The "client" is the entity that accesses the NFS server's resources. The client may be an application that contains the logic to access the NFS server directly, or it may be part of the operating system that provides remote file system services for a set of applications.
The Metadata Server (MDS) is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner.
Numerical values defined in the SCSI specifications (e.g., [SPC5]) and the NVMe specifications (e.g., [NVME-BASE]) are represented using the same conventions as those specifications, wherein a 'b' suffix denotes a binary (base 2) number (e.g., 110b = 6 decimal) and an 'h' suffix denotes a hexadecimal (base 16) number (e.g., 1ch = 28 decimal).
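As an informal illustration, the suffix conventions above can be checked mechanically. The helper below is hypothetical and not part of either specification; it simply parses numbers written in the 'b'/'h' style used throughout this document:

```python
def parse_spec_number(text: str) -> int:
    """Parse a number using the SCSI/NVMe specification conventions:
    a 'b' suffix denotes binary, an 'h' suffix denotes hexadecimal."""
    if text.endswith("b"):
        return int(text[:-1], 2)
    if text.endswith("h"):
        return int(text[:-1], 16)
    return int(text)

# The two examples given in the text:
assert parse_spec_number("110b") == 6
assert parse_spec_number("1ch") == 28
```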
The SCSI layout definition [RFC8154] references only a few SCSI-specific concepts directly. This document provides a mapping from these SCSI concepts to the NVM Express concepts that apply when the pNFS SCSI layout is used with NVMe namespaces.
The pNFS SCSI layout uses the Device Identification Vital Product Data (VPD) page (page code 83h) from [SPC5] to identify the devices used by a layout. Implementations that use NVMe namespaces as storage devices map NVMe namespace identifiers to a subset of the identifiers that the Device Identification VPD page supports for SCSI logical units.
To be used as storage devices for the pNFS SCSI layout, NVMe namespaces MUST support either the IEEE Extended Unique Identifier (EUI64) or Namespace Globally Unique Identifier (NGUID) value reported in a Namespace Identification Descriptor, the I/O Command Set Independent Identify Namespace data structure, and the Identify Namespace data structure, NVM Command Set [NVME-BASE]. If available, use of the NGUID value is preferred as it is the larger identifier.
Note: The PS_DESIGNATOR_T10 and PS_DESIGNATOR_NAME have no equivalent in NVMe and cannot be used to identify NVMe storage devices.
The pnfs_scsi_base_volume_info4 structure for an NVMe namespace SHALL be constructed as follows:
The "sbv_code_set" field SHALL be set to PS_CODE_SET_BINARY.
The "sbv_designator_type" field SHALL be set to PS_DESIGNATOR_EUI64.
The "sbv_designator" field SHALL contain either the NGUID or the EUI64 identifier for the namespace. If both NGUID and EUI64 identifiers are available, then the NGUID identifier SHOULD be used as it is the larger identifier.
RFC 8154 [RFC8154] specifies the "sbv_designator" field as an XDR variable-length opaque<> (refer to Section 4.10 of RFC 4506 [RFC4506]). The length of that XDR opaque<> value (part of its XDR representation) indicates which NVMe identifier is present. That length MUST be 16 octets for an NVMe NGUID identifier and MUST be 8 octets for an NVMe EUI64 identifier. All other lengths MUST NOT be used with an NVMe namespace.
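As a non-normative sketch, the construction above might look as follows. The helper function and class names are hypothetical; the two enum values mirror the XDR definitions in RFC 8154 (PS_CODE_SET_BINARY = 1, PS_DESIGNATOR_EUI64 = 2), and the opaque length alone distinguishes NGUID from EUI64:

```python
from dataclasses import dataclass
from typing import Optional

# Values mirroring the enums in RFC 8154's XDR definition.
PS_CODE_SET_BINARY = 1
PS_DESIGNATOR_EUI64 = 2

@dataclass
class ScsiBaseVolumeInfo:
    """Simplified stand-in for pnfs_scsi_base_volume_info4."""
    sbv_code_set: int
    sbv_designator_type: int
    sbv_designator: bytes  # XDR opaque<>; its length identifies the ID type

def volume_info_for_nvme(nguid: Optional[bytes],
                         eui64: Optional[bytes]) -> ScsiBaseVolumeInfo:
    """Build the volume info for an NVMe namespace, preferring the
    16-octet NGUID over the 8-octet EUI64 when both are available."""
    designator = nguid if nguid is not None else eui64
    if designator is None or len(designator) not in (8, 16):
        raise ValueError("need an 8-octet EUI64 or a 16-octet NGUID")
    return ScsiBaseVolumeInfo(PS_CODE_SET_BINARY, PS_DESIGNATOR_EUI64,
                              designator)

info = volume_info_for_nvme(nguid=bytes(16), eui64=None)
assert len(info.sbv_designator) == 16  # a 16-octet opaque<> means NGUID
```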
The SCSI layout uses Persistent Reservations (PRs) to provide client fencing. For this to be achieved, both the MDS and the Clients have to register a key with the storage device, and the MDS has to create a reservation on the storage device.
The following subsections provide a full mapping of the required PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT SCSI commands [SPC5] to NVMe commands that MUST be used when using NVMe namespaces as storage devices for the pNFS SCSI layout.
On NVMe namespaces, reservation keys are registered using the Reservation Register command (refer to Section 7.3 of [NVME-BASE]) with the Reservation Register Action (RREGA) field set to 000b (i.e., Register Reservation Key) and supplying the reservation key in the New Reservation Key (NRKEY) field.
Reservation keys are unregistered using the Reservation Register command with the Reservation Register Action (RREGA) field set to 001b (i.e., Unregister Reservation Key) and supplying the reservation key in the Current Reservation Key (CRKEY) field.
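A minimal sketch of how a host might encode these two operations follows. It is illustrative only: the field placement (RREGA in CDW10 bits 2:0, and a 16-byte data payload carrying CRKEY then NRKEY as little-endian values) reflects the author's reading of Section 7.3 of the NVMe Base specification, and the function name is hypothetical:

```python
import struct

RREGA_REGISTER = 0b000    # Register Reservation Key
RREGA_UNREGISTER = 0b001  # Unregister Reservation Key

def reservation_register(rrega: int, crkey: int, nrkey: int) -> tuple:
    """Build CDW10 and the 16-byte data payload for an NVMe Reservation
    Register command: RREGA goes in CDW10 bits 2:0, and the payload
    carries CRKEY followed by NRKEY, each as an 8-byte little-endian
    value."""
    cdw10 = rrega & 0x7
    data = struct.pack("<QQ", crkey, nrkey)
    return cdw10, data

# Register a key (CRKEY is not used for RREGA=000b):
reg_cdw10, reg_data = reservation_register(RREGA_REGISTER, 0, 0x1234)
# Unregister a key (the existing key goes in CRKEY):
unreg_cdw10, unreg_data = reservation_register(RREGA_UNREGISTER, 0x1234, 0)
```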
One important difference between SCSI Persistent Reservations and NVMe Reservations is that NVMe reservation keys always apply to all controllers used by a host (as indicated by the NVMe Host Identifier). This behavior is analogous to setting the ALL_TG_PT bit when registering a SCSI Reservation Key, and it is always supported by NVMe Reservations, unlike the ALL_TG_PT bit, for which SCSI support is inconsistent and cannot be relied upon.
Registering a reservation key with a namespace creates an association between a host and a namespace. A host that is a registrant of a namespace may use any controller with which that host is associated (i.e., that has the same Host Identifier; refer to Section 5.27.1.25 of [NVME-BASE]) to access that namespace as a registrant.
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the MDS needs to prepare the volume for fencing using PRs. This is done by registering the reservation key generated for the MDS with the device (see Section 2.2.1), followed by a Reservation Acquire command (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire Action (RACQA) field set to 000b (i.e., Acquire) and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants Only Reservation).
In case of a non-responding client, the MDS fences the client by executing a Reservation Acquire command (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire Action (RACQA) field set to 001b (i.e., Preempt) or 010b (i.e., Preempt and Abort), the Current Reservation Key (CRKEY) field set to the server's reservation key, the Preempt Reservation Key (PRKEY) field set to the reservation key associated with the non-responding client, and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants Only Reservation). The client can distinguish I/O errors due to fencing from other errors based on the Reservation Conflict NVMe status code.
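The MDS-side setup and fencing steps above can be sketched in the same illustrative style. The field placement (RACQA in CDW10 bits 2:0, RTYPE in bits 15:8, and a data payload of CRKEY followed by PRKEY as little-endian values) reflects the author's reading of Section 7.2 of the NVMe Base specification; the function name and key values are hypothetical:

```python
import struct

RACQA_ACQUIRE = 0b000        # Acquire
RACQA_PREEMPT = 0b001        # Preempt
RTYPE_EXCL_ACCESS_RO = 0x4   # Exclusive Access - Registrants Only

def reservation_acquire(racqa: int, rtype: int,
                        crkey: int, prkey: int = 0) -> tuple:
    """Build CDW10 and the data payload for an NVMe Reservation Acquire
    command: RACQA goes in CDW10 bits 2:0, RTYPE in bits 15:8; the
    payload carries CRKEY then PRKEY, each 8-byte little-endian."""
    cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
    return cdw10, struct.pack("<QQ", crkey, prkey)

# MDS setup: acquire the reservation with the server's key (RACQA=000b).
setup_cdw10, setup_data = reservation_acquire(
    RACQA_ACQUIRE, RTYPE_EXCL_ACCESS_RO, crkey=0xA11)
# Fencing: preempt the non-responding client's key (RACQA=001b).
fence_cdw10, fence_data = reservation_acquire(
    RACQA_PREEMPT, RTYPE_EXCL_ACCESS_RO, crkey=0xA11, prkey=0xBAD)
```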
If an NVMe command issued by the client to the storage device returns a non-retryable error (refer to the DNR bit defined in Figure 92 in [NVME-BASE]), the client MUST commit all layouts that use the storage device through the MDS, return all outstanding layouts for the device, forget the device ID, and unregister the reservation key.
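The client-side recovery sequence can be sketched as follows. This assumes the Do Not Retry (DNR) bit is the most significant bit of the 16-bit completion status field, per the author's reading of the NVMe Base specification; the `client` object and its method names are hypothetical placeholders, not a real NFS client API:

```python
DNR_BIT = 1 << 15  # Do Not Retry: top bit of the CQE status field (assumed)

def handle_io_error(status: int, client, device_id) -> None:
    """Sketch of the recovery the text mandates after a non-retryable
    NVMe error: commit and return layouts, forget the device ID, and
    unregister the reservation key."""
    if not (status & DNR_BIT):
        return  # retryable error; normal retry logic applies
    # Non-retryable: the client may have been fenced by the MDS.
    client.layoutcommit_all(device_id)           # commit layouts via the MDS
    client.layoutreturn_all(device_id)           # return outstanding layouts
    client.forget_device_id(device_id)           # forget the device ID
    client.unregister_reservation_key(device_id)
```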
For NVMe controllers, a volatile write cache is enabled if bit 0 of the Volatile Write Cache (VWC) field in the Identify Controller data structure, I/O Command Set Independent (refer to Figure 275 in [NVME-BASE]), is set and the Volatile Write Cache Enable (WCE) bit (i.e., bit 00) in the Volatile Write Cache Feature (Feature Identifier 06h) (refer to Section 5.27.1.4 of [NVME-BASE]) is set. If a volatile write cache is enabled on an NVMe namespace used as a storage device for the pNFS SCSI layout, the pNFS server (MDS) MUST use the NVMe Flush command (refer to Section 7.1 of [NVME-BASE]) to flush the volatile write cache to stable storage before the LAYOUTCOMMIT operation returns. The NVMe Flush command is the equivalent of the SCSI SYNCHRONIZE CACHE commands.
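The cache-detection and flush rule above can be sketched like this. The `ctrl` object and its methods are hypothetical placeholders for a controller interface; only the bit positions and the feature identifier (06h) come from the text:

```python
VWC_PRESENT = 1 << 0              # bit 0 of the Identify Controller VWC field
FEAT_VOLATILE_WRITE_CACHE = 0x06  # Volatile Write Cache feature identifier
WCE_BIT = 1 << 0                  # Write Cache Enable bit of that feature

def flush_if_write_cache_enabled(ctrl) -> bool:
    """Sketch of the MDS-side rule: if a volatile write cache is present
    and enabled, issue an NVMe Flush before LAYOUTCOMMIT returns.
    Returns True if a Flush was issued."""
    has_cache = ctrl.identify_vwc() & VWC_PRESENT
    enabled = ctrl.get_feature(FEAT_VOLATILE_WRITE_CACHE) & WCE_BIT
    if has_cache and enabled:
        ctrl.flush()  # NVMe equivalent of SCSI SYNCHRONIZE CACHE
        return True
    return False
```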
NFSv4 clients access NFSv4 metadata servers using the NFSv4 protocol. The security considerations generally described in [RFC8881] apply to a client's interactions with the metadata server. However, NFSv4 clients and servers access NVMe storage devices at a lower layer than NFSv4. NFSv4 and RPC security are not directly applicable to the I/Os to data servers using NVMe. Refer to Sections 2.4.6 (Extents Are Permissions) and 4 (Security Considerations) of [RFC8154] for the security considerations of direct access to block storage from NFS clients.
pNFS with an NVMe layout can be used with NVMe transports (e.g., NVMe over PCIe [NVME-PCIE]) that provide essentially no additional security functionality. Or, pNFS may be used with storage protocols such as NVMe over TCP [NVME-TCP] that can provide significant transport layer security.
It is the responsibility of those administering and deploying pNFS with an NVMe layout to ensure that appropriate protection is deployed for that protocol based on the deployment environment as well as the nature and sensitivity of the data and storage devices involved. When using IP-based storage protocols such as NVMe over TCP, data confidentiality and integrity SHOULD be provided for traffic between pNFS clients and NVMe storage devices by using a secure communication protocol such as Transport Layer Security (TLS) [RFC8446]. For NVMe over TCP, TLS SHOULD be used as described in [NVME-TCP] to protect traffic between pNFS clients and NVMe namespaces used as storage devices.
A secure communication protocol might not be needed for pNFS with NVMe layouts in environments where physical and/or logical security measures (e.g., air gaps, isolated VLANs) provide effective access control commensurate with the sensitivity and value of the storage devices and data involved (e.g., public website contents may be significantly less sensitive than a database containing personal identifying information, passwords, and other authentication credentials).
Physical security is a common means for protocols not based on IP. In environments where the security requirements for the storage protocol cannot be met, pNFS with an NVMe layout SHOULD NOT be deployed.
When security is available for the data server storage protocol, it is generally at a different granularity and with a different notion of identity than NFSv4 (e.g., NFSv4 controls user access to files, and NVMe controls initiator access to volumes). As with pNFS with the block layout type [RFC5663], the pNFS client is responsible for enforcing appropriate correspondences between these security layers. In environments where the security requirements are such that client-side protection from access to storage outside of the layout is not sufficient, pNFS with a SCSI layout on an NVMe namespace SHOULD NOT be deployed.
As with other block-oriented pNFS layout types, the metadata server is able to fence off a client's access to the data on an NVMe namespace used as a storage device. If a metadata server revokes a layout, the client's access MUST be terminated at the storage devices via fencing, as specified in Section 2.2. The client has a subsequent opportunity to acquire a new layout.
This document has no IANA actions.
Carsten Bormann converted an earlier RFCXML v2 source for this document to a markdown source format.
David Noveck provided ample feedback to various drafts of this document.