RFC 9561 | pNFS SCSI Layout for NVMe | April 2024
Hellwig, et al. | Standards Track
This document specifies how to use the Parallel Network File System (pNFS) Small Computer System Interface (SCSI) Layout Type to access storage devices using the Non-Volatile Memory Express (NVMe) protocol family.
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc9561.
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.
NFSv4.1 [RFC8881] includes a pNFS feature that allows reads and writes to be performed by means other than directing read and write operations to the server. Through use of this feature, the server, in the role of metadata server, is responsible for managing file and directory metadata while separate means are provided to execute reads and writes.
These other means of performing file reads and writes are defined by individual mapping types, which often have their own specifications.
The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is a layout type that allows NFS clients to directly perform I/O to block storage devices while bypassing the Metadata Server (MDS). It is specified by using concepts from the SCSI protocol family for the data path to the storage devices.
NVM Express (NVMe), or the Non-Volatile Memory Host Controller Interface Specification, is a set of specifications to talk to storage devices over a number of protocols such as PCI Express (PCIe), Fibre Channel (FC), TCP/IP, or Remote Direct Memory Access (RDMA) networking. NVMe is currently the predominantly used protocol to access PCIe Solid State Disks (SSDs), and it is increasingly being adopted for remote storage access to replace SCSI-based protocols such as iSCSI.
This document defines how NVMe Namespaces using the NVM Command Set [NVME-NVM] exported by NVMe Controllers implementing the NVMe Base specification [NVME-BASE] are to be used as storage devices using the SCSI Layout Type. The definition is independent of the underlying transport used by the NVMe Controller and thus supports Controllers implementing a wide variety of transports, including PCIe, RDMA, TCP, and FC.
This document does not amend the existing SCSI layout document. Rather, it defines how NVMe Namespaces can be used within the SCSI Layout by establishing a mapping of the SCSI constructs used in the SCSI layout document to corresponding NVMe constructs.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The following definitions are included to provide context for the reader.
The "client" is the entity that accesses the NFS server's resources. The client may be an application that contains the logic to access the NFS server directly, or it may be part of the operating system that provides remote file system services for a set of applications.
The Metadata Server (MDS) is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner.
Numerical values defined in the SCSI specifications (e.g., [SPC5]) and the NVMe specifications (e.g., [NVME-BASE]) are represented using the same conventions as those specifications, wherein a 'b' suffix denotes a binary (base 2) number (e.g., 110b = 6 decimal) and an 'h' suffix denotes a hexadecimal (base 16) number (e.g., 1ch = 28 decimal).
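As an informal illustration, the suffix conventions above can be checked mechanically. The helper below is hypothetical and not part of either specification; it simply parses numbers written in the 'b'/'h' style used throughout this document:

```python
def parse_spec_number(text: str) -> int:
    """Parse a number using the SCSI/NVMe specification conventions:
    a 'b' suffix denotes binary, an 'h' suffix denotes hexadecimal."""
    if text.endswith("b"):
        return int(text[:-1], 2)
    if text.endswith("h"):
        return int(text[:-1], 16)
    return int(text)

# The two examples given in the text:
assert parse_spec_number("110b") == 6
assert parse_spec_number("1ch") == 28
```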
The SCSI layout definition [RFC8154] references only a few SCSI-specific concepts directly. This document provides a mapping from these SCSI concepts to the NVM Express concepts that apply when the pNFS SCSI layout is used with NVMe namespaces.
The pNFS SCSI layout uses the Device Identification Vital Product Data (VPD) page (page code 83h) from [SPC5] to identify the devices used by a layout. Implementations that use NVMe namespaces as storage devices map NVMe namespace identifiers to a subset of the identifiers that the Device Identification VPD page supports for SCSI logical units.
To be used as storage devices for the pNFS SCSI layout, NVMe namespaces MUST support either the IEEE Extended Unique Identifier (EUI64) or Namespace Globally Unique Identifier (NGUID) value reported in a Namespace Identification Descriptor, the I/O Command Set Independent Identify Namespace data structure, and the Identify Namespace data structure, NVM Command Set [NVME-BASE]. If available, use of the NGUID value is preferred as it is the larger identifier.
Note: The PS_DESIGNATOR_T10 and PS_DESIGNATOR_NAME have no equivalent in NVMe and cannot be used to identify NVMe storage devices.
The pnfs_scsi_base_volume_info4 structure for an NVMe namespace SHALL be constructed as follows:
The "sbv_code_set" field SHALL be set to PS_CODE_SET_BINARY.
The "sbv_designator_type" field SHALL be set to PS_DESIGNATOR_EUI64.
The "sbv_designator" field SHALL contain either the NGUID or the EUI64 identifier for the namespace. If both NGUID and EUI64 identifiers are available, then the NGUID identifier SHOULD be used as it is the larger identifier.
RFC 8154 [RFC8154] specifies the "sbv_designator" field as an XDR variable-length opaque<> (refer to Section 4.10 of RFC 4506 [RFC4506]). The length of that XDR opaque<> value (part of its XDR representation) indicates which NVMe identifier is present. That length MUST be 16 octets for an NVMe NGUID identifier and MUST be 8 octets for an NVMe EUI64 identifier. All other lengths MUST NOT be used with an NVMe namespace.
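As a non-normative sketch, the construction above might look as follows. The helper function and class names are hypothetical; the two enum values mirror the XDR definitions in RFC 8154 (PS_CODE_SET_BINARY = 1, PS_DESIGNATOR_EUI64 = 2), and the opaque length alone distinguishes NGUID from EUI64:

```python
from dataclasses import dataclass
from typing import Optional

# Values mirroring the enums in RFC 8154's XDR definition.
PS_CODE_SET_BINARY = 1
PS_DESIGNATOR_EUI64 = 2

@dataclass
class ScsiBaseVolumeInfo:
    """Simplified stand-in for pnfs_scsi_base_volume_info4."""
    sbv_code_set: int
    sbv_designator_type: int
    sbv_designator: bytes  # XDR opaque<>; its length identifies the ID type

def volume_info_for_nvme(nguid: Optional[bytes],
                         eui64: Optional[bytes]) -> ScsiBaseVolumeInfo:
    """Build the volume info for an NVMe namespace, preferring the
    16-octet NGUID over the 8-octet EUI64 when both are available."""
    designator = nguid if nguid is not None else eui64
    if designator is None or len(designator) not in (8, 16):
        raise ValueError("need an 8-octet EUI64 or a 16-octet NGUID")
    return ScsiBaseVolumeInfo(PS_CODE_SET_BINARY, PS_DESIGNATOR_EUI64,
                              designator)

info = volume_info_for_nvme(nguid=bytes(16), eui64=None)
assert len(info.sbv_designator) == 16  # a 16-octet opaque<> means NGUID
```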
The SCSI layout uses Persistent Reservations (PRs) to provide client fencing. For this to be achieved, both the MDS and the Clients have to register a key with the storage device, and the MDS has to create a reservation on the storage device.
The following subsections provide a full mapping of the required PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT SCSI commands [SPC5] to NVMe commands that MUST be used when using NVMe namespaces as storage devices for the pNFS SCSI layout.
On NVMe namespaces, reservation keys are registered using the Reservation Register command (refer to Section 7.3 of [NVME-BASE]) with the Reservation Register Action (RREGA) field set to 000b (i.e., Register Reservation Key) and supplying the reservation key in the New Reservation Key (NRKEY) field.
Reservation keys are unregistered using the Reservation Register command with the Reservation Register Action (RREGA) field set to 001b (i.e., Unregister Reservation Key) and supplying the reservation key in the Current Reservation Key (CRKEY) field.
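A minimal sketch of how a host might encode these two operations follows. It is illustrative only: the field placement (RREGA in CDW10 bits 2:0, and a 16-byte data payload carrying CRKEY then NRKEY as little-endian values) reflects the author's reading of Section 7.3 of the NVMe Base specification, and the function name is hypothetical:

```python
import struct

RREGA_REGISTER = 0b000    # Register Reservation Key
RREGA_UNREGISTER = 0b001  # Unregister Reservation Key

def reservation_register(rrega: int, crkey: int, nrkey: int) -> tuple:
    """Build CDW10 and the 16-byte data payload for an NVMe Reservation
    Register command: RREGA goes in CDW10 bits 2:0, and the payload
    carries CRKEY followed by NRKEY, each as an 8-byte little-endian
    value."""
    cdw10 = rrega & 0x7
    data = struct.pack("<QQ", crkey, nrkey)
    return cdw10, data

# Register a key (CRKEY is not used for RREGA=000b):
reg_cdw10, reg_data = reservation_register(RREGA_REGISTER, 0, 0x1234)
# Unregister a key (the existing key goes in CRKEY):
unreg_cdw10, unreg_data = reservation_register(RREGA_UNREGISTER, 0x1234, 0)
```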
One important difference between SCSI Persistent Reservations and NVMe Reservations is that NVMe reservation keys always apply to all controllers used by a host (as indicated by the NVMe Host Identifier). This behavior is analogous to setting the ALL_TG_PT bit when registering a SCSI Reservation Key, and it is always supported by NVMe Reservations, unlike the ALL_TG_PT bit, for which SCSI support is inconsistent and cannot be relied upon.
Registering a reservation key with a namespace creates an association between a host and a namespace. A host that is a registrant of a namespace may use any controller with which that host is associated (i.e., that has the same Host Identifier; refer to Section 5.27.1.25 of [NVME-BASE]) to access that namespace as a registrant.
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the MDS needs to prepare the volume for fencing using PRs. This is done by registering the reservation key generated for the MDS with the device (see Section 2.2.1), followed by a Reservation Acquire command (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire Action (RACQA) field set to 000b (i.e., Acquire) and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants Only Reservation).
In case of a non-responding client, the MDS fences the client by executing a Reservation Acquire command (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire Action (RACQA) field set to 001b (i.e., Preempt) or 010b (i.e., Preempt and Abort), the Current Reservation Key (CRKEY) field set to the server's reservation key, the Preempt Reservation Key (PRKEY) field set to the reservation key associated with the non-responding client, and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants Only Reservation). The client can distinguish I/O errors due to fencing from other errors based on the Reservation Conflict NVMe status code.
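The MDS-side setup and fencing steps above can be sketched in the same illustrative style. The field placement (RACQA in CDW10 bits 2:0, RTYPE in bits 15:8, and a data payload of CRKEY followed by PRKEY as little-endian values) reflects the author's reading of Section 7.2 of the NVMe Base specification; the function name and key values are hypothetical:

```python
import struct

RACQA_ACQUIRE = 0b000        # Acquire
RACQA_PREEMPT = 0b001        # Preempt
RTYPE_EXCL_ACCESS_RO = 0x4   # Exclusive Access - Registrants Only

def reservation_acquire(racqa: int, rtype: int,
                        crkey: int, prkey: int = 0) -> tuple:
    """Build CDW10 and the data payload for an NVMe Reservation Acquire
    command: RACQA goes in CDW10 bits 2:0, RTYPE in bits 15:8; the
    payload carries CRKEY then PRKEY, each 8-byte little-endian."""
    cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
    return cdw10, struct.pack("<QQ", crkey, prkey)

# MDS setup: acquire the reservation with the server's key (RACQA=000b).
setup_cdw10, setup_data = reservation_acquire(
    RACQA_ACQUIRE, RTYPE_EXCL_ACCESS_RO, crkey=0xA11)
# Fencing: preempt the non-responding client's key (RACQA=001b).
fence_cdw10, fence_data = reservation_acquire(
    RACQA_PREEMPT, RTYPE_EXCL_ACCESS_RO, crkey=0xA11, prkey=0xBAD)
```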
If an NVMe command issued by the client to the storage device returns a non-retryable error (refer to the DNR bit defined in Figure 92 in [NVME-BASE]), the client MUST commit all layouts that use the storage device through the MDS, return all outstanding layouts for the device, forget the device ID, and unregister the reservation key.
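The client-side recovery sequence can be sketched as follows. This assumes the Do Not Retry (DNR) bit is the most significant bit of the 16-bit completion status field, per the author's reading of the NVMe Base specification; the `client` object and its method names are hypothetical placeholders, not a real NFS client API:

```python
DNR_BIT = 1 << 15  # Do Not Retry: top bit of the CQE status field (assumed)

def handle_io_error(status: int, client, device_id) -> None:
    """Sketch of the recovery the text mandates after a non-retryable
    NVMe error: commit and return layouts, forget the device ID, and
    unregister the reservation key."""
    if not (status & DNR_BIT):
        return  # retryable error; normal retry logic applies
    # Non-retryable: the client may have been fenced by the MDS.
    client.layoutcommit_all(device_id)           # commit layouts via the MDS
    client.layoutreturn_all(device_id)           # return outstanding layouts
    client.forget_device_id(device_id)           # forget the device ID
    client.unregister_reservation_key(device_id)
```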
For NVMe controllers, a volatile write cache is enabled if bit 0 of the Volatile Write Cache (VWC) field in the Identify Controller data structure, I/O Command Set Independent (refer to Figure 275 in [NVME-BASE]), is set and the Volatile Write Cache Enable (WCE) bit (i.e., bit 00) in the Volatile Write Cache Feature (Feature Identifier 06h) (refer to Section 5.27.1.4 of [NVME-BASE]) is set. If a volatile write cache is enabled on an NVMe namespace used as a storage device for the pNFS SCSI layout, the pNFS server (MDS) MUST use the NVMe Flush command (refer to Section 7.1 of [NVME-BASE]) to flush the volatile write cache to stable storage before the LAYOUTCOMMIT operation returns. The NVMe Flush command is the equivalent of the SCSI SYNCHRONIZE CACHE commands.
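The cache-detection and flush rule above can be sketched like this. The `ctrl` object and its methods are hypothetical placeholders for a controller interface; only the bit positions and the feature identifier (06h) come from the text:

```python
VWC_PRESENT = 1 << 0              # bit 0 of the Identify Controller VWC field
FEAT_VOLATILE_WRITE_CACHE = 0x06  # Volatile Write Cache feature identifier
WCE_BIT = 1 << 0                  # Write Cache Enable bit of that feature

def flush_if_write_cache_enabled(ctrl) -> bool:
    """Sketch of the MDS-side rule: if a volatile write cache is present
    and enabled, issue an NVMe Flush before LAYOUTCOMMIT returns.
    Returns True if a Flush was issued."""
    has_cache = ctrl.identify_vwc() & VWC_PRESENT
    enabled = ctrl.get_feature(FEAT_VOLATILE_WRITE_CACHE) & WCE_BIT
    if has_cache and enabled:
        ctrl.flush()  # NVMe equivalent of SCSI SYNCHRONIZE CACHE
        return True
    return False
```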
NFSv4 clients access NFSv4 metadata servers using the NFSv4 protocol. The security considerations generally described in [RFC8881] apply to a client's interactions with the metadata server. However, NFSv4 clients and servers access NVMe storage devices at a lower layer than NFSv4. NFSv4 and RPC security are not directly applicable to the I/Os to data servers using NVMe. Refer to Sections 2.4.6 (Extents Are Permissions) and 4 (Security Considerations) of [RFC8154] for the security considerations of direct access to block storage from NFS clients.
pNFS with an NVMe layout can be used with NVMe transports (e.g., NVMe over PCIe [NVME-PCIE]) that provide essentially no additional security functionality. Or, pNFS may be used with storage protocols such as NVMe over TCP [NVME-TCP] that can provide significant transport layer security.
It is the responsibility of those administering and deploying pNFS with an NVMe layout to ensure that appropriate protection is deployed for that protocol based on the deployment environment as well as the nature and sensitivity of the data and storage devices involved. When using IP-based storage protocols such as NVMe over TCP, data confidentiality and integrity SHOULD be provided for traffic between pNFS clients and NVMe storage devices by using a secure communication protocol such as Transport Layer Security (TLS) [RFC8446]. For NVMe over TCP, TLS SHOULD be used as described in [NVME-TCP] to protect traffic between pNFS clients and NVMe namespaces used as storage devices.
A secure communication protocol might not be needed for pNFS with NVMe layouts in environments where physical and/or logical security measures (e.g., air gaps, isolated VLANs) provide effective access control commensurate with the sensitivity and value of the storage devices and data involved (e.g., public website contents may be significantly less sensitive than a database containing personal identifying information, passwords, and other authentication credentials).
Physical security is a common means for protocols not based on IP. In environments where the security requirements for the storage protocol cannot be met, pNFS with an NVMe layout SHOULD NOT be deployed.
When security is available for the data server storage protocol, it is generally at a different granularity and with a different notion of identity than NFSv4 (e.g., NFSv4 controls user access to files, and NVMe controls initiator access to volumes). As with pNFS with the block layout type [RFC5663], the pNFS client is responsible for enforcing appropriate correspondences between these security layers. In environments where the security requirements are such that client-side protection from access to storage outside of the layout is not sufficient, pNFS with a SCSI layout on an NVMe namespace SHOULD NOT be deployed.
As with other block-oriented pNFS layout types, the metadata server is able to fence off a client's access to the data on an NVMe namespace used as a storage device. If a metadata server revokes a layout, the client's access MUST be terminated at the storage devices via fencing, as specified in Section 2.2. The client has a subsequent opportunity to acquire a new layout.
This document has no IANA actions.
Carsten Bormann converted an earlier RFCXML v2 source for this document to a markdown source format.
David Noveck provided ample feedback to various drafts of this document.