RFC 8797: RPC-over-RDMA CM Private Data
Stream: Internet Engineering Task Force (IETF)
RFC: 8797
Updates: 8166
Category: Standards Track
Published: June 2020
ISSN: 2070-1721
Author: C. Lever, Oracle

RFC 8797

Remote Direct Memory Access - Connection Manager (RDMA-CM) Private Data for RPC-over-RDMA Version 1

Abstract

This document specifies the format of Remote Direct Memory Access - Connection Manager (RDMA-CM) Private Data exchanged between RPC-over-RDMA version 1 peers as part of establishing a connection. The addition of the Private Data payload specified in this document is an optional extension that does not alter the RPC-over-RDMA version 1 protocol. This document updates RFC 8166.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8797.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

The RPC-over-RDMA version 1 transport protocol [RFC8166] enables payload data transfer using Remote Direct Memory Access (RDMA) for upper-layer protocols based on Remote Procedure Calls (RPCs) [RFC5531]. The terms "Remote Direct Memory Access" (RDMA) and "Direct Data Placement" (DDP) are introduced in [RFC5040].

The two most immediate shortcomings of RPC-over-RDMA version 1 are as follows:

  1. Setting up an RDMA data transfer (via RDMA Read or Write) can be costly. The small default size of messages transmitted using RDMA Send forces the use of RDMA Read or Write operations even for relatively small messages and data payloads.

    The original specification of RPC-over-RDMA version 1 provided an out-of-band protocol for passing inline threshold values between connected peers [RFC5666]. However, [RFC8166] eliminated support for this protocol, making it unavailable for this purpose.

  2. Unlike most other contemporary RDMA-enabled storage protocols, there is no facility in RPC-over-RDMA version 1 that enables the use of remote invalidation [RFC5042].

Each RPC-over-RDMA version 1 Transport Header follows the External Data Representation (XDR) definition [RFC4506] specified in [RFC8166]. However, RPC-over-RDMA version 1 has no means of extending this definition in such a way that interoperability with existing implementations is preserved. As a result, an out-of-band mechanism is needed to help relieve these constraints for existing RPC-over-RDMA version 1 implementations.

This document specifies a simple, non-XDR-based message format designed to be passed between RPC-over-RDMA version 1 peers at the time each RDMA transport connection is first established. The mechanism assumes that the underlying RDMA transport has a Private Data field that is passed between peers at connection time, such as is present in the Marker PDU Aligned Framing (MPA) protocol (described in Section 7.1 of [RFC5044] and extended in [RFC6581]) or the InfiniBand Connection Manager [IBA].

To enable current RPC-over-RDMA version 1 implementations to interoperate with implementations that support the message format described in this document, implementation of the Private Data exchange is OPTIONAL. When Private Data has been successfully exchanged, peers may choose to perform extended RDMA semantics. However, this exchange does not alter the XDR definition specified in [RFC8166].

The message format is intended to be further extensible within the normal scope of such IETF work (see Section 6 for further details). Section 8 of this document defines an IANA registry for this purpose. In addition, interoperation between implementations of RPC-over-RDMA version 1 that present this message format to peers and those that do not recognize this message format is guaranteed.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Advertised Transport Properties

3.1. Inline Threshold Size

Section 3.3.2 of [RFC8166] defines the term "inline threshold". An inline threshold is the maximum number of bytes that can be transmitted using one RDMA Send and one RDMA Receive. There are a pair of inline thresholds for a connection: a client-to-server threshold and a server-to-client threshold.

If an incoming RDMA message exceeds the size of a receiver's inline threshold, the Receive operation fails and the RDMA provider typically terminates the connection. To convey an RPC message larger than the receiver's inline threshold without risking receive failure, a sender must use explicit RDMA data transfer operations, which are more expensive than an RDMA Send. See Sections 3.3 and 3.5 of [RFC8166] for a complete discussion.

The default value of inline thresholds for RPC-over-RDMA version 1 connections is 1024 bytes (as defined in Section 3.3.3 of [RFC8166]). This value is adequate for nearly all NFS version 3 procedures.

NFS version 4 COMPOUND operations [RFC7530] are larger on average than NFS version 3 procedures [RFC1813], forcing clients to use explicit RDMA operations for frequently issued requests such as LOOKUP and GETATTR. The use of RPCSEC_GSS security also increases the average size of RPC messages, due to the larger size of RPCSEC_GSS credential material included in RPC headers [RFC7861].

If a sender and receiver could somehow agree on larger inline thresholds, frequently used RPC transactions could avoid the cost of explicit RDMA operations.

3.2. Remote Invalidation

After an RDMA data transfer operation completes, an RDMA consumer can request that its peer's RDMA Network Interface Card (RNIC) invalidate the Steering Tag (STag) associated with the data transfer [RFC5042].

An RDMA consumer requests remote invalidation by posting an RDMA Send with Invalidate operation in place of an RDMA Send operation. Each RDMA Send with Invalidate carries one STag to invalidate. The receiver of an RDMA Send with Invalidate performs the requested invalidation and then reports that invalidation as part of the completion of a waiting Receive operation.

If both peers support remote invalidation, an RPC-over-RDMA responder might use remote invalidation when replying to an RPC request that provided chunks. Because one of the chunks has already been invalidated, finalizing the results of the RPC is made simpler and faster.

However, there are some important caveats that contraindicate the blanket use of remote invalidation:

  • Remote invalidation is not supported by all RNICs.
  • Not all RPC-over-RDMA responder implementations can generate RDMA Send with Invalidate operations.
  • Not all RPC-over-RDMA requester implementations can recognize when remote invalidation has occurred.
  • On one connection in different RPC-over-RDMA transactions, or in a single RPC-over-RDMA transaction, an RPC-over-RDMA requester can expose a mixture of STags that may be invalidated remotely and some that must not be. No indication is provided at the RDMA layer as to which is which.

A responder therefore must not employ remote invalidation unless it is aware of support for it in its own RDMA stack, and on the requester. And, without altering the XDR structure of RPC-over-RDMA version 1 messages, it is not possible to support remote invalidation with requesters that include an STag that must not be invalidated remotely in an RPC with STags that may be invalidated. Likewise, it is not possible to support remote invalidation with requesters that mix RPCs with STags that may be invalidated with RPCs with STags that must not be invalidated on the same connection.

There are some NFS/RDMA client implementations whose STags are always safe to invalidate remotely. For such clients, indicating to the responder that remote invalidation is always safe can enable such invalidation without the need for additional protocol elements to be defined.

4. Private Data Message Format

With an InfiniBand lower layer, for example, RDMA connection setup uses a Connection Manager (CM) when establishing a Reliable Connection [IBA]. When an RPC-over-RDMA version 1 transport connection is established, the client (which actively establishes connections) and the server (which passively accepts connections) populate the CM Private Data field exchanged as part of CM connection establishment.

The transport properties exchanged via this mechanism are fixed for the life of the connection. Each new connection presents an opportunity for a fresh exchange. An implementation of the extension described in this document MUST be prepared for the settings to change upon a reconnection.

For RPC-over-RDMA version 1, the CM Private Data field is formatted as described below. RPC clients and servers use the same format. If the capacity of the Private Data field is too small to contain this message format or the underlying RDMA transport is not managed by a CM, the CM Private Data field cannot be used on behalf of RPC-over-RDMA version 1.

The first eight octets of the CM Private Data field are to be formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Format Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |  Reserved   |R|   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Format Identifier:
This field contains a fixed 32-bit value that identifies the content of the Private Data field as an RPC-over-RDMA version 1 CM Private Data message. In RPC-over-RDMA version 1 Private Data, the value of this field is always 0xf6ab0e18, in network byte order. The use of this field is further expanded upon in Section 5.2.
Version:
This 8-bit field contains a message format version number. The value "1" in this field indicates that exactly eight octets are present, that they appear in the order described in this section, and that each has the meaning defined in this section. Further considerations about the use of this field are discussed in Section 6.
Reserved:
This 7-bit field is unused. Senders MUST set these bits to zero, and receivers MUST ignore their value.
R:
This 1-bit field indicates that the sender supports remote invalidation. The field is set and interpreted as described in Section 4.1.
Send Size:
This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer is prepared to transmit in a single RDMA Send on this connection. The value is encoded as described in Section 4.2.
Receive Size:
This 8-bit field contains an encoded value corresponding to the maximum number of bytes this peer is prepared to receive with a single RDMA Receive on this connection. The value is encoded as described in Section 4.2.
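
As an illustration only, the eight-octet layout above can be sketched in Python with the standard struct module. The function names pack_private_data and unpack_private_data and the constant R_FLAG are hypothetical and are not defined by this document. The sketch treats the Reserved and R fields together as one octet, with R as its least significant bit (bit 15 in the diagram).

```python
import struct

FORMAT_IDENTIFIER = 0xF6AB0E18  # fixed value given in this section
VERSION_1 = 1
R_FLAG = 0x01  # assumption: R is the low bit of the octet after Version


def pack_private_data(send_size_code, recv_size_code, remote_inval):
    """Build the eight-octet version 1 message in network byte order."""
    for code in (send_size_code, recv_size_code):
        if not 0 <= code <= 255:
            raise ValueError("encoded size must fit in 8 bits")
    flags = R_FLAG if remote_inval else 0  # Reserved bits are sent as zero
    return struct.pack("!IBBBB", FORMAT_IDENTIFIER, VERSION_1, flags,
                       send_size_code, recv_size_code)


def unpack_private_data(octets):
    """Return (remote_inval, send_code, recv_code), or None if unrecognized."""
    if len(octets) < 8:
        return None
    ident, version, flags, send_code, recv_code = struct.unpack(
        "!IBBBB", octets[:8])
    if ident != FORMAT_IDENTIFIER or version != VERSION_1:
        return None
    # Reserved bits are ignored on receipt; only the R bit is examined.
    return bool(flags & R_FLAG), send_code, recv_code
```

For example, pack_private_data(3, 3, True) yields the octets f6 ab 0e 18 01 01 03 03.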

4.1. Using the R Field

The R field indicates limited support for remote invalidation as described in Section 3.2. When both connection peers have set this bit flag in their CM Private Data, the responder MAY use RDMA Send with Invalidate operations when transmitting RPC Replies. Each RDMA Send with Invalidate MUST invalidate an STag associated only with the Transaction ID (XID) in the rdma_xid field of the RPC-over-RDMA Transport Header it carries.

When either peer on a connection clears this flag, the responder MUST use only RDMA Send when transmitting RPC Replies.
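
The rule above can be sketched as a small decision function. This is a non-normative illustration: the function name choose_reply_send, the operation labels, and the idea of passing in a list of the STags that the request with the same XID provided are all assumptions made for the example. Picking the first STag is an arbitrary choice, since the responder may invalidate any one STag belonging to that XID.

```python
def choose_reply_send(local_r, peer_r, stags_for_xid):
    """Pick the Send operation a responder uses for one RPC Reply.

    local_r, peer_r: R flag values advertised by each peer at connect time.
    stags_for_xid:   STags provided by the request whose XID this Reply
                     carries (hypothetical bookkeeping, may be empty).
    """
    if local_r and peer_r and stags_for_xid:
        # Remote invalidation is permitted (MAY), and only for an STag
        # associated with this Reply's XID.
        return "SEND_WITH_INVALIDATE", stags_for_xid[0]
    # Either flag clear, or nothing to invalidate: plain RDMA Send.
    return "SEND", None
```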

4.2. Send and Receive Size Values

Inline threshold sizes from 1024 to 262144 octets can be represented in the Send Size and Receive Size fields. The inline threshold values provide a pair of 1024-octet-aligned maximum message lengths that guarantee that Send and Receive operations do not fail due to length errors.

The minimum inline threshold for RPC-over-RDMA version 1 is 1024 octets (see Section 3.3.3 of [RFC8166]). The values in the Send Size and Receive Size fields represent the unsigned number of additional kilo-octets of length beyond the first 1024 octets. Thus, a sender computes the encoded value by dividing its actual buffer size, in octets, by 1024 and subtracting one from the result. A receiver decodes an incoming Size value by performing the inverse set of operations: it adds one to the encoded value and then multiplies that result by 1024.
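
The arithmetic described above can be written out as a short sketch. The names encode_size and decode_size are hypothetical. Rounding a buffer size that is not a multiple of 1024 down to the next lower multiple is an assumption of this example, chosen because advertising less than the actual buffer size cannot cause a length error.

```python
MIN_THRESHOLD = 1024      # smallest representable inline threshold, in octets
MAX_THRESHOLD = 262144    # largest representable inline threshold, in octets


def encode_size(buffer_octets):
    """Divide the buffer size by 1024, then subtract one."""
    if not MIN_THRESHOLD <= buffer_octets <= MAX_THRESHOLD:
        raise ValueError("buffer size outside the representable range")
    return buffer_octets // 1024 - 1  # floor division rounds down (assumption)


def decode_size(encoded):
    """Add one to the encoded value, then multiply by 1024."""
    if not 0 <= encoded <= 255:
        raise ValueError("encoded size must fit in 8 bits")
    return (encoded + 1) * 1024
```

With these definitions, a 4096-octet buffer encodes as 3, and the encoded values 0 and 255 decode to 1024 and 262144 octets respectively.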

The client uses the smaller of its own send size and the server's reported receive size as the client-to-server inline threshold. The server uses the smaller of its own send size and the client's reported receive size as the server-to-client inline threshold.
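
A minimal sketch of this selection follows, assuming all four sizes have already been decoded to octets. The function name negotiate_inline_thresholds is hypothetical.

```python
def negotiate_inline_thresholds(client_send, client_recv,
                                server_send, server_recv):
    """Return (client_to_server, server_to_client) thresholds in octets."""
    # Each direction is bounded by what the sender can transmit and
    # what the receiver is prepared to receive.
    client_to_server = min(client_send, server_recv)
    server_to_client = min(server_send, client_recv)
    return client_to_server, server_to_client
```

For instance, a client advertising an 8192-octet send size and a 16384-octet receive size, connected to a server advertising a 4096-octet send size and a 2048-octet receive size, ends up with a 2048-octet client-to-server threshold and a 4096-octet server-to-client threshold.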

5. Interoperability Considerations

The extension described in this document is designed to allow RPC-over-RDMA version 1 implementations that use CM Private Data to interoperate fully with RPC-over-RDMA version 1 implementations that do not exchange this information. Implementations that use this extension must also interoperate fully with RDMA implementations that use CM Private Data for other purposes. Realizing these goals requires that implementations of this extension follow the practices described in the rest of this section.

5.1. Interoperability with RPC-over-RDMA Version 1 Implementations

When a peer does not receive a CM Private Data message that conforms to Section 4, it needs to act as if the remote peer supports only the default RPC-over-RDMA version 1 settings, as defined in [RFC8166]. In other words, the peer MUST behave as if a Private Data message was received in which (1) the R field (bit 15 of the second 32-bit word) is zero and (2) both Size fields contain the value zero.
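
The fallback can be illustrated with a small sketch. The PeerProperties type and the function effective_peer_properties are hypothetical names used only for this example. An encoded Size value of zero corresponds to 1024 octets, so the defaults below are expressed in decoded octets.

```python
from typing import NamedTuple, Optional


class PeerProperties(NamedTuple):
    remote_invalidation: bool  # the R field
    send_size: int             # decoded, in octets
    receive_size: int          # decoded, in octets


# Equivalent to a message with R clear and both Size fields set to zero.
DEFAULT_PEER = PeerProperties(remote_invalidation=False,
                              send_size=1024, receive_size=1024)


def effective_peer_properties(parsed: Optional[PeerProperties]) -> PeerProperties:
    """Use the parsed message if one conformed to Section 4, else defaults."""
    return parsed if parsed is not None else DEFAULT_PEER
```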

5.2. Interoperability amongst RDMA Transports

The Format Identifier field defined in Section 4 is provided to enable implementations to distinguish the Private Data defined in this document from Private Data inserted at other layers, such as the additional Private Data defined by the MPAv2 protocol described in [RFC6581], and others.

As part of connection establishment, the buffer containing the received Private Data is searched for the Format Identifier word. The offset of the Format Identifier is not restricted to any alignment. If the RPC-over-RDMA version 1 CM Private Data Format Identifier is not present, an RPC-over-RDMA version 1 receiver MUST behave as if no RPC-over-RDMA version 1 CM Private Data has been provided.

Once the RPC-over-RDMA version 1 CM Private Data Format Identifier is found, the receiver parses the subsequent octets as RPC-over-RDMA version 1 CM Private Data. As additional assurance that the content is valid RPC-over-RDMA version 1 CM Private Data, the receiver should check that the format version number field contains a valid and recognized version number and the size of the content does not overrun the length of the buffer.
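
One way to sketch the search and the two suggested checks is shown below. The function name find_private_data and the constants are hypothetical. Continuing the search after a candidate fails a check is a design choice of this example rather than a requirement stated here.

```python
FORMAT_IDENTIFIER = bytes.fromhex("f6ab0e18")  # network byte order
KNOWN_VERSIONS = {1}
MESSAGE_LENGTH = 8  # octets in a version 1 message


def find_private_data(buffer):
    """Return the eight-octet message found in buffer, or None."""
    # The identifier may appear at any octet offset, so no alignment
    # is assumed when searching.
    offset = buffer.find(FORMAT_IDENTIFIER)
    while offset != -1:
        end = offset + MESSAGE_LENGTH
        # Check that the content does not overrun the buffer and that
        # the version number is one this receiver recognizes.
        if end <= len(buffer) and buffer[offset + 4] in KNOWN_VERSIONS:
            return buffer[offset:end]
        offset = buffer.find(FORMAT_IDENTIFIER, offset + 1)
    return None
```

A return value of None corresponds to behaving as if no RPC-over-RDMA version 1 CM Private Data had been provided.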

6. Updating the Message Format

Although the message format described in this document provides the ability for the client and server to exchange particular information about the local RPC-over-RDMA implementation, it is possible that there will be a future need to exchange additional properties. This would make it necessary to extend or otherwise modify the format described in this document.

Any modification faces the problem of interoperating properly with implementations of RPC-over-RDMA version 1 that are unaware of the existence of the new format. These include implementations that do not recognize the exchange of CM Private Data as well as those that recognize only the format described in this document.

Given the message format described in this document, these interoperability constraints could be met by the following sorts of new message formats:

  • A format that uses a different value for the first four bytes of the format, as provided for in the registry described in Section 8.
  • A format that uses the same value for the Format Identifier field and a value other than one (1) in the Version field.

Although it is possible to reorganize the last three of the eight bytes in the existing format, extended formats are unlikely to do so. New formats would take the form of extensions of the format described in this document with added fields starting at byte eight of the format or changes to the definition of bits in the Reserved field.

7. Security Considerations

The reader is directed to the Security Considerations section of [RFC8166] for background and further discussion.

The RPC-over-RDMA version 1 protocol framework depends on the semantics of the Reliable Connected (RC) queue pair (QP) type, as defined in Section 9.7.7 of [IBA]. The integrity of CM Private Data and the authenticity of its source are ensured by the exclusive use of RC QPs. Any attempt to interfere with or hijack data in transit on an RC connection results in the RDMA provider terminating the connection.

The Security Considerations section of [RFC5042] refers the reader to further relevant discussion of generic RDMA transport security. That document recommends IPsec as the default transport-layer security solution. When deployed with the Remote Direct Memory Access Protocol (RDMAP) [RFC5040], DDP [RFC5041], and MPA [RFC5044], IPsec establishes a protected channel before any operations are exchanged; thus, it protects the exchange of Private Data. However, IPsec is not available for InfiniBand or RDMA over Converged Ethernet (RoCE) deployments. Those fabrics rely on physical security and cyclic redundancy checks to protect network traffic.

Exchanging the information contained in the message format defined in this document does not expose upper-layer payloads to an attacker. Furthermore, the behavior changes that occur as a result of exchanging the Private Data described in the current document do not introduce any new risk of exposure of upper-layer payload data.

Improperly setting one of the fields in version 1 Private Data can result in an increased risk of disconnection (i.e., self-imposed Denial of Service). A similar risk can arise if non-RPC-over-RDMA CM Private Data inadvertently contains the Format Identifier that identifies this protocol's data structure. Additional checking of incoming Private Data, as described in Section 5.2, can help reduce this risk.

In addition to describing the structure of a new format version, any document that extends the Private Data format described in the current document must discuss security considerations of new data items exchanged between connection peers. Such documents should also explore the risks of erroneously identifying non-RPC-over-RDMA CM Private Data as the new format.

8. IANA Considerations

IANA has created the "RDMA-CM Private Data Identifiers" subregistry within the "Remote Direct Data Placement" protocol category group. This is a subregistry of 32-bit numbers that identify the upper-layer protocol associated with data that appears in the application-specific RDMA-CM Private Data area. The fields in this subregistry include the following: Format Identifier, Length (format length, in octets), Description, and Reference.

The initial contents of this registry are a single entry:

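Table 1: New "RDMA-CM Private Data Identifiers" Registry

   +-------------------+--------+-----------------------------------------+-----------+
   | Format Identifier | Length | Description                             | Reference |
   +-------------------+--------+-----------------------------------------+-----------+
   | 0xf6ab0e18        | 8      | RPC-over-RDMA version 1 CM Private Data | RFC 8797  |
   +-------------------+--------+-----------------------------------------+-----------+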

IANA is to assign subsequent new entries in this registry using the Specification Required policy as defined in Section 4.6 of [RFC8126].

8.1. Guidance for Designated Experts

The Designated Expert (DE), appointed by the IESG, should ascertain the existence of suitable documentation that defines the semantics and format of the Private Data, and verify that the document is permanently and publicly available. Documentation produced outside the IETF must not conflict with work that is active or already published within the IETF. The new Reference field should contain a reference to that documentation.

The Description field should contain the name of the upper-layer protocol that generates and uses the Private Data.

The DE should assign a new Format Identifier so that it does not conflict with existing entries in this registry and so that it is not likely to be mistaken as part of the payload of other registered formats.

The DE shall post the request to the NFSV4 Working Group mailing list (or a successor to that list, if such a list exists) for comment and review. The DE shall approve or deny the request and publish notice of the decision within 30 days.

9. References

9.1. Normative References

[IBA]
InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1", Release 1.3, <https://www.infinibandta.org/>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC4506]
Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, <https://www.rfc-editor.org/info/rfc4506>.
[RFC5040]
Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, <https://www.rfc-editor.org/info/rfc5040>.
[RFC5042]
Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, <https://www.rfc-editor.org/info/rfc5042>.
[RFC8126]
Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, <https://www.rfc-editor.org/info/rfc8126>.
[RFC8166]
Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, <https://www.rfc-editor.org/info/rfc8166>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.

9.2. Informative References

[RFC1813]
Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, DOI 10.17487/RFC1813, <https://www.rfc-editor.org/info/rfc1813>.
[RFC5041]
Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, <https://www.rfc-editor.org/info/rfc5041>.
[RFC5044]
Culley, P., Elzur, U., Recio, R., Bailey, S., and J. Carrier, "Marker PDU Aligned Framing for TCP Specification", RFC 5044, DOI 10.17487/RFC5044, <https://www.rfc-editor.org/info/rfc5044>.
[RFC5531]
Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, <https://www.rfc-editor.org/info/rfc5531>.
[RFC5666]
Talpey, T. and B. Callaghan, "Remote Direct Memory Access Transport for Remote Procedure Call", RFC 5666, DOI 10.17487/RFC5666, <https://www.rfc-editor.org/info/rfc5666>.
[RFC6581]
Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S. Wise, "Enhanced Remote Direct Memory Access (RDMA) Connection Establishment", RFC 6581, DOI 10.17487/RFC6581, <https://www.rfc-editor.org/info/rfc6581>.
[RFC7530]
Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, <https://www.rfc-editor.org/info/rfc7530>.
[RFC7861]
Adamson, A. and N. Williams, "Remote Procedure Call (RPC) Security Version 3", RFC 7861, DOI 10.17487/RFC7861, <https://www.rfc-editor.org/info/rfc7861>.

Acknowledgments

Thanks to Christoph Hellwig and Devesh Sharma for suggesting this approach, and to Tom Talpey and David Noveck for their expert comments and review. The author also wishes to thank Bill Baker and Greg Marsden for their support of this work. Also, thanks to expert reviewers Sean Hefty and Dave Minturn.

Special thanks go to document shepherd Brian Pawlowski, Transport Area Director Magnus Westerlund, NFSV4 Working Group Chairs David Noveck and Spencer Shepler, and NFSV4 Working Group Secretary Thomas Haynes.

Author's Address

Charles Lever
Oracle Corporation
United States of America
Email: chuck.lever@oracle.com
