RDS
===

Overview

This readme tries to provide some background on the hows and whys of RDS, and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins: http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture

RDS provides reliable, ordered datagram delivery by using a single reliable connection between any two nodes in the cluster. This allows applications to use a single socket to talk to any other process in the cluster - so in a cluster with N processes you need N sockets, in contrast to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different transports. The current implementation used to support RDS over TCP as well as IB.

The high-level semantics of RDS from the application’s point of view are

  • Addressing

RDS uses IPv4 addresses and 16-bit port numbers to identify the end point of a connection. All socket operations that involve passing addresses between kernel and user space generally use a struct sockaddr_in.

The fact that IPv4 addresses are used does not mean the underlying transport has to be IP-based. In fact, RDS over IB uses a reliable IB connection; the IP address is used exclusively to locate the remote node's GID (by ARPing for the given IP).

The port space is entirely independent of UDP, TCP or any other protocol.

  • Socket interface

RDS sockets work mostly as you would expect from a BSD socket. The next section will cover the details. At any rate, all I/O is performed through the standard BSD socket API. Some additions like zerocopy support are implemented through control messages, while other extensions use the getsockopt/setsockopt calls.

Sockets must be bound before you can send or receive data. This is needed because binding also selects a transport and attaches it to the socket. Once bound, the transport assignment does not change. RDS will tolerate IPs moving around (eg in an active-active HA scenario), but only as long as the address doesn't move to a different transport.

  • sysctls

    RDS supports a number of sysctls in /proc/sys/net/rds

Socket Interface

AF_RDS, PF_RDS, SOL_RDS
AF_RDS and PF_RDS are the domain type to be used with socket(2) to create RDS sockets. SOL_RDS is the socket-level to be used with setsockopt(2) and getsockopt(2) for RDS-specific socket options.
fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
This creates a new, unbound RDS socket.
setsockopt(SOL_SOCKET): send and receive buffer size

RDS honors the send and receive buffer size socket options. You are not allowed to queue more than SO_SNDSIZE bytes to a socket. A message is queued when sendmsg is called, and it leaves the queue when the remote system acknowledges its arrival.

The SO_RCVSIZE option controls the maximum receive queue length. This is a soft limit rather than a hard limit - RDS will continue to accept and queue incoming messages, even if that takes the queue length over the limit. However, it will also mark the port as "congested" and send a congestion update to the source node. The source node is supposed to throttle any processes sending to this congested port.

bind(fd, &sockaddr_in, …)
This binds the socket to a local IP address and port, and a transport, if one has not already been selected via the SO_RDS_TRANSPORT socket option.
sendmsg(fd, …)

Sends a message to the indicated recipient. The kernel will transparently establish the underlying reliable connection if it isn't up yet.

An attempt to send a message that exceeds SO_SNDSIZE will return with -EMSGSIZE.

An attempt to send a message that would take the total number of queued bytes over the SO_SNDSIZE threshold will return EAGAIN.

An attempt to send a message to a destination that is markedas “congested” will return ENOBUFS.
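The three failure modes can be sketched in one helper (the function name is illustrative, not from the RDS sources; fd is assumed to be a bound PF_RDS socket):

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t rds_try_send(int fd, const struct sockaddr_in *dest,
			    void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name	= (void *)dest,
		.msg_namelen	= sizeof(*dest),
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
	};
	ssize_t ret = sendmsg(fd, &msg, 0);

	if (ret < 0) {
		if (errno == EMSGSIZE) {
			/* message itself is larger than SO_SNDSIZE */
		} else if (errno == EAGAIN) {
			/* send queue is full; wait for acks to drain it */
		} else if (errno == ENOBUFS) {
			/* destination port is marked congested */
		}
	}
	return ret;
}
```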

recvmsg(fd, …)

Receives a message that was queued to this socket. The socket's recv queue accounting is adjusted, and if the queue length drops below SO_SNDSIZE, the port is marked uncongested, and a congestion update is sent to all peers.

Applications can ask the RDS kernel module to receive notifications via control messages (for instance, there is a notification when a congestion update arrived, or when an RDMA operation completes). These notifications are received through the msg.msg_control buffer of struct msghdr. The format of the messages is described in manpages.
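A sketch of receiving a datagram while walking any attached notifications (the helper name and the 256-byte cmsg buffer are arbitrary choices for illustration):

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t rds_recv_with_cmsgs(int fd, void *buf, size_t len)
{
	char cbuf[256];		/* room for RDS control messages */
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
		.msg_control	= cbuf,
		.msg_controllen	= sizeof(cbuf),
	};
	struct cmsghdr *cm;
	ssize_t ret = recvmsg(fd, &msg, 0);

	if (ret < 0)
		return ret;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		/* RDS notifications arrive at cmsg_level SOL_RDS; the
		 * cmsg_type values are documented in rds-rdma(7) */
	}
	return ret;
}
```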

poll(fd)

RDS supports the poll interface to allow the application to implement async I/O.

POLLIN handling is pretty straightforward. When there's an incoming message queued to the socket, or a pending notification, we signal POLLIN.

POLLOUT is a little harder. Since you can essentially send to any destination, RDS will always signal POLLOUT as long as there's room on the send queue (ie the number of bytes queued is less than the sendbuf size).

However, the kernel will refuse to accept messages to a destination marked congested - in this case you will loop forever if you rely on poll to tell you what to do. This isn't a trivial problem, but applications can deal with this - by using congestion notifications, and by checking for ENOBUFS errors returned by sendmsg.
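One way to structure such a loop (names and control flow are illustrative, not from the RDS sources): POLLOUT only means the send queue has room, so ENOBUFS from sendmsg must still be handled and retried after the congestion clears.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_when_ready(int fd, struct msghdr *msg)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			return -1;
		if (pfd.revents & POLLIN) {
			/* drain incoming messages/notifications here so a
			 * congestion update for the destination is seen */
		}
		if (!(pfd.revents & POLLOUT))
			continue;
		if (sendmsg(fd, msg, 0) >= 0)
			return 0;
		if (errno != ENOBUFS && errno != EAGAIN)
			return -1;	/* real error */
		/* ENOBUFS: destination congested - poll again and retry */
	}
}
```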

setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)

This allows the application to discard all messages queued to a specific destination on this particular socket.

This allows the application to cancel outstanding messages if it detects a timeout. For instance, if it tried to send a message, and the remote host is unreachable, RDS will keep trying forever. The application may decide it's not worth it, and cancel the operation. In this case, it would use RDS_CANCEL_SENT_TO to nuke any pending messages.
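A sketch of that cancellation (the wrapper name is illustrative; RDS_CANCEL_SENT_TO and SOL_RDS come from <linux/rds.h>):

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/rds.h>

static int cancel_sent_to(int fd, struct sockaddr_in *dest)
{
	/* discards messages queued to *dest on this socket only */
	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
			  dest, sizeof(*dest));
}
```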

setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..),
getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
Set or read an integer defining the underlying encapsulating transport to be used for RDS packets on the socket. When setting the option, the integer argument may be one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the value, RDS_TRANS_NONE will be returned on an unbound socket. This socket option may only be set exactly once on the socket, prior to binding it via the bind(2) system call. Attempts to set SO_RDS_TRANSPORT on a socket for which the transport has been previously attached explicitly (by SO_RDS_TRANSPORT) or implicitly (via bind(2)) will return an error of EOPNOTSUPP. An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will always return EINVAL.
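The set/get pair can be sketched as follows (helper names are illustrative; SO_RDS_TRANSPORT and the RDS_TRANS_* values come from <linux/rds.h>):

```c
#include <sys/socket.h>
#include <linux/rds.h>

static int set_transport(int fd, int transport)
{
	/* must happen exactly once, before bind(2) */
	return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
			  &transport, sizeof(transport));
}

static int get_transport(int fd)
{
	int transport = RDS_TRANS_NONE;
	socklen_t len = sizeof(transport);

	if (getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &transport, &len) < 0)
		return -1;
	return transport;	/* RDS_TRANS_NONE on an unbound socket */
}
```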

RDMA for RDS

see rds-rdma(7) manpage (available in rds-tools)

Congestion Notifications

see rds(7) manpage

RDS Protocol

Message header

The message header is a ‘struct rds_header’ (see rds.h):

Fields:

h_sequence:
per-packet sequence number
h_ack:
piggybacked acknowledgment of last packet received
h_len:
length of data, not including header
h_sport:
source port
h_dport:
destination port
h_flags:

Can be:

CONG_BITMAP - this is a congestion update bitmap
ACK_REQUIRED - receiver must ack this packet
RETRANSMITTED - packet has previously been sent
h_credit:
indicate to other end of connection that it has more credits available (i.e. there is more send room)
h_padding[4]:
unused, for future use
h_csum:
header checksum
h_exthdr:
optional data can be passed here. This is currently used for passing RDMA-related information.
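The layout can be sketched in userspace types (field widths inferred from the list above; the authoritative definition is struct rds_header in rds.h, with big-endian multi-byte fields on the wire; the extension-header space of 16 bytes is an assumption):

```c
#include <stdint.h>

#define RDS_HEADER_EXT_SPACE 16	/* assumed value; see rds.h */

struct rds_header_sketch {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, not including header */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* CONG_BITMAP / ACK_REQUIRED / RETRANSMITTED */
	uint8_t  h_credit;	/* new send credits granted to the peer */
	uint8_t  h_padding[4];	/* unused, for future use */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE]; /* optional ext. headers */
};
```

With these widths every field is naturally aligned, so the struct occupies 48 bytes with no compiler padding.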

ACK and retransmit handling

One might think that with reliable IB connections you wouldn't need to ack messages that have been received. The problem is that IB hardware generates an ack message before it has DMAed the message into memory. This creates a potential message loss if the HCA is disabled for any reason between when it sends the ack and before the message is DMAed and processed. This is only a potential issue if another HCA is available for fail-over.

Sending an ack immediately would allow the sender to free the sent message from their send queue quickly, but could cause excessive traffic to be used for acks. RDS piggybacks acks on sent data packets. Ack-only packets are reduced by only allowing one to be in flight at a time, and by the sender only asking for acks when its send buffers start to fill up. All retransmissions are also acked.

Flow Control

RDS's IB transport uses a credit-based mechanism to verify that there is space in the peer's receive buffers for more data. This eliminates the need for hardware retries on the connection.

Congestion

Messages waiting in the receive queue on the receiving socket are accounted against the socket's SO_RCVBUF option value. Only the payload bytes in the message are accounted for. If the number of bytes queued equals or exceeds rcvbuf then the socket is congested. All sends attempted to this socket's address should block or return -EWOULDBLOCK.

Applications are expected to be reasonably tuned such that this situation very rarely occurs. An application encountering this "back-pressure" is considered a bug.

This is implemented by having each node maintain bitmaps which indicate which ports on bound addresses are congested. As the bitmap changes it is sent through all the connections which terminate in the local address of the bitmap which changed.

The bitmaps are allocated as connections are brought up. This avoids allocation in the interrupt handling path which queues messages on sockets. The dense bitmaps let transports send the entire bitmap on any bitmap change reasonably efficiently. This is much easier to implement than some finer-grained communication of per-port congestion. The sender does a very inexpensive bit test to test if the port it's about to send to is congested or not.
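The bit test above can be sketched as follows (one bit per 16-bit port, so 65536 bits = 8 KiB per remote address; helper names are illustrative, not the kernel's):

```c
#include <stdint.h>
#include <string.h>

#define RDS_CONG_MAP_BYTES (65536 / 8)	/* one bit per port */

/* cheap test the sender performs before queueing a message */
static int port_congested(const uint8_t *map, uint16_t port)
{
	return map[port >> 3] & (1 << (port & 7));
}

/* set when the receiver's queue goes over its soft limit */
static void set_port_congested(uint8_t *map, uint16_t port)
{
	map[port >> 3] |= 1 << (port & 7);
}
```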

RDS Transport Layer

As mentioned above, RDS is not IB-specific. Its code is divided into a general RDS layer and a transport layer.

The general layer handles the socket API, congestion handling, loopback, stats, usermem pinning, and the connection state machine.

The transport layer handles the details of the transport. The IB transport, for example, handles all the queue pairs, work requests, CM event handlers, and other Infiniband details.

RDS Kernel Structures

struct rds_message
aka possibly "rds_outgoing", the generic RDS layer copies data to be sent and sets header fields as needed, based on the socket API. This is then queued for the individual connection and sent by the connection's transport.
struct rds_incoming
a generic struct referring to incoming data that can be handed from the transport to the general code and queued by the general code while the socket is awoken. It is then passed back to the transport code to handle the actual copy-to-user.
struct rds_socket
per-socket information
struct rds_connection
per-connection information
struct rds_transport
pointers to transport-specific functions
struct rds_statistics
non-transport-specific statistics
struct rds_cong_map
wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management

Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and ERROR states.

The first time an attempt is made by an RDS socket to send data to a node, a connection is allocated and connected. That connection is then maintained forever - if there are transport errors, the connection will be dropped and re-established.

Dropping a connection while packets are queued will cause queued or partially-sent datagrams to be retransmitted when the connection is re-established.

The send path

rds_sendmsg()
  • struct rds_message built from incoming data
  • CMSGs parsed (e.g. RDMA ops)
  • transport connection alloced and connected if not already
  • rds_message placed on send queue
  • send worker awoken
rds_send_worker()
  • calls rds_send_xmit() until queue is empty
rds_send_xmit()
  • transmits congestion map if one is pending
  • may set ACK_REQUIRED
  • calls transport to send either non-RDMA or RDMA message (RDMA ops never retransmitted)
rds_ib_xmit()
  • allocs work requests from send ring
  • adds any new send credits available to peer (h_credits)
  • maps the rds_message’s sg list
  • piggybacks ack
  • populates work requests
  • post send to connection’s queue pair

The recv path

rds_ib_recv_cq_comp_handler()
  • looks at recv work completions
  • unmaps recv buffer from device
  • no errors, call rds_ib_process_recv()
  • refill recv ring
rds_ib_process_recv()
  • validate header checksum
  • copy header to rds_ib_incoming struct if start of a new datagram
  • add to ibinc’s fraglist
  • if completed datagram:
    • update cong map if datagram was cong update
    • call rds_recv_incoming() otherwise
    • note if ack is required
rds_recv_incoming()
  • drop duplicate packets
  • respond to pings
  • find the sock associated with this datagram
  • add to sock queue
  • wake up sock
  • do some congestion calculations
rds_recvmsg()
  • copy data into user iovec
  • handle CMSGs
  • return to application

Multipath RDS (mprds)

Mprds is multipathed-RDS, primarily intended for RDS-over-TCP (though the concept can be extended to other transports). The classical implementation of RDS-over-TCP is implemented by demultiplexing multiple PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, port]) over a single TCP socket between the 2 IP addresses involved. This has the limitation that it ends up funneling multiple RDS flows over a single TCP flow, thus it is (a) upper-bounded to the single-flow bandwidth, (b) suffers from head-of-line blocking for all the RDS sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp connection. RDS sockets will be attached to a path based on some hash (e.g., of local address and RDS port number) and packets for that RDS socket will be sent over the attached path using TCP to segment/reassemble RDS datagrams on that path.

Multipathed RDS is implemented by splitting the struct rds_connection into a common (to all paths) part, and a per-path struct rds_conn_path. All I/O workqs and reconnect threads are driven from the rds_conn_path. Transports such as TCP that are multipath capable may then set up a TCP socket per rds_conn_path, and this is managed by the transport via the transport's private cp_transport_data pointer.

Transports announce themselves as multipath capable by setting the t_mp_capable bit during registration with the rds core module. When the transport is multipath-capable, rds_sendmsg() hashes outgoing traffic across multiple paths. The outgoing hash is computed based on the local address and port that the PF_RDS socket is bound to.
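A toy version of that per-socket path selection (the mixing function and names are invented for illustration; the kernel's actual hash differs, but the property that matters is shown: all traffic from one bound socket lands on one path):

```c
#include <stdint.h>

/* pick a path index from the bound local address and port */
static uint32_t mprds_path_hash(uint32_t laddr, uint16_t lport,
				uint32_t npaths)
{
	uint32_t h = laddr ^ ((uint32_t)lport << 16 | lport);

	/* arbitrary integer mixing, invented for this sketch */
	h ^= h >> 16;
	h *= 0x45d9f3b;
	h ^= h >> 16;

	return h % npaths;	/* npaths agreed via the probe exchange */
}
```

Because the inputs are fixed for the lifetime of a bound socket, the hash is stable: datagrams from one RDS socket stay ordered on one path.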

Additionally, even if the transport is MP capable, we may be peering with some node that does not support mprds, or supports a different number of paths. As a result, the peering nodes need to agree on the number of paths to be used for the connection. This is done by sending out a control packet exchange before the first data packet. The control packet exchange must have completed prior to outgoing hash completion in rds_sendmsg() when the transport is multipath capable.

The control packet is an RDS ping packet (i.e., packet to rds dest port 0) with the ping packet having an RDS extension header option of type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the number of paths supported by the sender. The "probe" ping packet will get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>). The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately be able to compute the min(sender_paths, rcvr_paths). The pong sent in response to a probe-ping should contain the rcvr's npaths when the rcvr is mprds-capable.

If the rcvr is not mprds-capable, the exthdr in the ping will be ignored. In this case the pong will not have any exthdrs, so the sender of the probe-ping can default to single-path mprds.