Kernel Connection Multiplexor

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based interface over TCP for generic application protocols. With KCM an application can efficiently send and receive application protocol messages over TCP using datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
    |                 |               |                |
    +-----------+     |               |     +----------+
                |     |               |     |
            +----------------------------------+
            |           Multiplexor            |
            +----------------------------------+
                |   |           |           |  |
    +---------+   |           |           |  ------------+
    |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
    |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets bound to a multiplexor are considered to have equivalent function, and I/O operations in different sockets may be done in parallel without the need for synchronization between threads in userspace.

Multiplexor

The multiplexor provides the message steering. In the transmit path, messages written on a KCM socket are sent atomically on an appropriate TCP socket. Similarly, in the receive path, messages are constructed on each TCP socket (Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated for each bound TCP socket; this structure holds the state for constructing messages on receive as well as other connection specific information for KCM.

Connected mode semantics

Each multiplexor assumes that all attached TCP connections are to the same destination and can use the different connections for load balancing when transmitting. The normal send and recv calls (including sendmmsg and recvmmsg) can be used to send and receive messages from the KCM socket.

Socket types

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation

Messages are sent over a TCP stream with some application protocol message format that typically includes a header which frames the messages. The length of a received message can be deduced from the application protocol header (often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a BPF program must be specified. The program is called at the start of receiving a new message and is given an skbuff that contains the bytes received so far. It parses the message header and returns the length of the message. Given this information, KCM will construct the message of the stated length and deliver it to a KCM socket.
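To make the parser's job concrete, the following userspace sketch models a simple framing scheme. It assumes a hypothetical wire format in which every message starts with a 4-byte big-endian payload length; the names frame_encode, frame_total_len and FRAME_HDR_LEN are illustrative and not part of KCM. The second function computes the value a delineation program would need to report for this format: the total message length, header included.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed format: 4-byte big-endian payload length, then the payload. */
#define FRAME_HDR_LEN 4

/*
 * Write header + payload into out.
 * Returns the total number of bytes written, or 0 if out is too small.
 */
static size_t frame_encode(uint8_t *out, size_t out_cap,
                           const void *payload, uint32_t payload_len)
{
    if (out_cap < FRAME_HDR_LEN || out_cap - FRAME_HDR_LEN < payload_len)
        return 0;

    out[0] = (uint8_t)(payload_len >> 24);
    out[1] = (uint8_t)(payload_len >> 16);
    out[2] = (uint8_t)(payload_len >> 8);
    out[3] = (uint8_t)payload_len;
    memcpy(out + FRAME_HDR_LEN, payload, payload_len);

    return FRAME_HDR_LEN + (size_t)payload_len;
}

/*
 * Total message length (header + payload) as read from the header.
 * Returns 0 when fewer than FRAME_HDR_LEN bytes are available; that
 * return value is a convention of this sketch only.
 */
static size_t frame_total_len(const uint8_t *buf, size_t avail)
{
    uint32_t payload_len;

    if (avail < FRAME_HDR_LEN)
        return 0;

    payload_len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                  ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    return FRAME_HDR_LEN + (size_t)payload_len;
}
```

A sender would pass the output of frame_encode to send on a KCM socket, so that each send carries exactly one framed message.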

TCP socket management

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and write space available (POLLOUT) events are handled by the multiplexor. If there is a state change (disconnection) or other error on a TCP socket, an error is posted on the TCP socket so that a POLLERR event happens and KCM discontinues using the socket. When the application gets the error notification for a TCP socket, it should unattach the socket from KCM and then handle the error condition (the typical response is to close the socket and create a new connection if necessary).

KCM limits the maximum receive message size to be the size of the receive socket buffer on the attached TCP socket (the socket buffer size can be set by SO_RCVBUF). If the length of a new message reported by the BPF program is greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP socket. The BPF program may also enforce a maximum message size and report an error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout value is taken from the receive timeout of the attached TCP socket (this is set by SO_RCVTIMEO). If the timer expires before assembly is complete, an error (ETIMEDOUT) is posted on the socket.
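Both limits are ordinary socket options on the TCP socket. The helper below is a minimal sketch of setting them; the function name tcp_set_kcm_limits is hypothetical, and applying the options before the socket is attached is an assumption of this sketch rather than a stated requirement. Note that Linux may adjust the SO_RCVBUF value actually applied, so the effective limit can differ from the requested one.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Set the receive buffer size (which bounds the maximum KCM message
 * size) and the receive timeout (used as the message assembly timeout)
 * on a TCP socket.
 * Returns 0 on success, -1 on error with errno set by setsockopt.
 */
static int tcp_set_kcm_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        return -1;

    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;

    return 0;
}
```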

User interface

Creating a multiplexor

A new multiplexor and its initial KCM socket are created by a socket call:

socket(AF_KCM, type, protocol)
  • type is either SOCK_DGRAM or SOCK_SEQPACKET
  • protocol is KCMPROTO_CONNECTED
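As a minimal sketch, the call can be wrapped in a helper that rejects unsupported socket types up front. The name kcm_open is hypothetical. The fallback definition of AF_KCM is only for libc headers that predate KCM, and the call fails when the running kernel does not provide KCM.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/kcm.h>      /* KCMPROTO_CONNECTED */

#ifndef AF_KCM
#define AF_KCM 41           /* value from linux/socket.h; fallback only */
#endif

/*
 * Create a new multiplexor and return its first KCM socket.
 * type must be SOCK_DGRAM or SOCK_SEQPACKET.
 * Returns the file descriptor, or -1 with errno set.
 */
static int kcm_open(int type)
{
    if (type != SOCK_DGRAM && type != SOCK_SEQPACKET) {
        errno = EINVAL;
        return -1;
    }

    return socket(AF_KCM, type, KCMPROTO_CONNECTED);
}
```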

Cloning KCM sockets

After the first KCM socket is created using the socket call as described above, additional sockets for the multiplexor can be created by cloning a KCM socket. This is accomplished by an ioctl on a KCM socket:

    /* From linux/kcm.h */
    struct kcm_clone {
            int fd;
    };

    struct kcm_clone info;

    memset(&info, 0, sizeof(info));

    /* On success the new KCM socket is returned in info.fd */
    err = ioctl(kcmfd, SIOCKCMCLONE, &info);
    if (!err)
            newkcmfd = info.fd;

Attach transport sockets

Transport sockets are attached to a multiplexor by calling an ioctl on a KCM socket for the multiplexor. For example:

    /* From linux/kcm.h */
    struct kcm_attach {
            int fd;
            int bpf_fd;
    };

    struct kcm_attach info;

    memset(&info, 0, sizeof(info));

    info.fd = tcpfd;            /* TCP socket to attach */
    info.bpf_fd = bpf_prog_fd;  /* BPF program used for message delineation */

    ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  • fd: file descriptor for the TCP socket being attached
  • bpf_fd: file descriptor for the compiled BPF program that has been loaded into the kernel

Unattach transport sockets

Unattaching a transport socket from a multiplexor is straightforward. An “unattach” ioctl is done with the kcm_unattach structure as the argument:

    /* From linux/kcm.h */
    struct kcm_unattach {
            int fd;
    };

    struct kcm_unattach info;

    memset(&info, 0, sizeof(info));

    info.fd = tcpfd;            /* TCP socket to unattach */

    ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket

A setsockopt is used to disable or enable receiving on a KCM socket. When receive is disabled, any pending messages in the socket’s receive buffer are moved to other sockets. This feature is useful if an application thread knows that it will be doing a lot of work on a request and won’t be able to service new messages for a while. Example use:

    int val = 1;    /* 1 disables receive, 0 enables it again */

    setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation

BPF programs can be compiled using the BPF LLVM backend. For example, the BPF program for parsing Thrift is:

    #include "bpf.h" /* for __sk_buff */
    #include "bpf_helpers.h" /* for load_word intrinsic */

    SEC("socket_kcm")
    int bpf_prog1(struct __sk_buff *skb)
    {
            /* Length stored in the first 4 bytes, plus the 4-byte length field itself */
            return load_word(skb, 0) + 4;
    }

    char _license[] SEC("license") = "GPL";

Use in applications

KCM accelerates application layer protocols. Specifically, it allows applications to use a message based interface for sending and receiving messages. The kernel provides necessary assurances that messages are sent and received atomically. This relieves much of the burden applications have in mapping a message based protocol onto the TCP stream. KCM also makes application layer messages a unit of work in the kernel for the purposes of steering and scheduling, which in turn allows a simpler networking model in multithreaded applications.

Configurations

In an Nx1 configuration, KCM logically provides multiple socket handles to the same TCP connection. This allows parallelism between I/O operations on the TCP socket (for instance, copyin and copyout of data are parallelized). In an application, a KCM socket can be opened for each processing thread and inserted into that thread's epoll set (similar to how SO_REUSEPORT is used to allow multiple listener sockets on the same port).
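The per-thread pattern might be set up as in the sketch below, which clones an existing KCM socket and registers the clone in a fresh epoll instance for one worker. The name kcm_worker_setup is hypothetical, and giving each worker its own epoll instance is an assumption of this sketch; an application could equally share one.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sockios.h>  /* SIOCPROTOPRIVATE, used by SIOCKCMCLONE */
#include <linux/kcm.h>      /* struct kcm_clone, SIOCKCMCLONE */

/*
 * Clone kcmfd for one worker thread and watch the clone for readable
 * messages in a new epoll instance.
 * On success returns the epoll fd and stores the cloned KCM socket in
 * *worker_kcmfd. Returns -1 on error, leaving *worker_kcmfd untouched.
 */
static int kcm_worker_setup(int kcmfd, int *worker_kcmfd)
{
    struct kcm_clone info;
    struct epoll_event ev;
    int epfd;

    memset(&info, 0, sizeof(info));
    if (ioctl(kcmfd, SIOCKCMCLONE, &info) < 0)
        return -1;

    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(info.fd);
        return -1;
    }

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = info.fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, info.fd, &ev) < 0) {
        close(epfd);
        close(info.fd);
        return -1;
    }

    *worker_kcmfd = info.fd;
    return epfd;
}
```

Each worker would then call epoll_wait on its own epoll fd and recv on its own KCM socket, with no userspace locking between workers.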

In an MxN configuration, multiple connections are established to the same destination. These are used for simple load balancing.

Message batching

The primary purpose of KCM is load balancing between KCM sockets and hence threads in a nominal use case. Perfect load balancing, that is steering each received message to a different KCM socket or steering each sent message to a different TCP socket, can negatively impact performance since this doesn’t allow for affinities to be established. Balancing based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline) messages on a KCM socket.

  1. Send multiple messages in a single sendmmsg.
  2. Send a group of messages each with a sendmsg call, where all messages except the last have MSG_BATCH in the flags of the sendmsg call.
  3. Create a “super message” composed of multiple messages and send this with a single sendmsg.
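The second option can be sketched as a small loop. The helper name send_batch is hypothetical; it takes one iovec per message, sends each with its own sendmsg call, and sets MSG_BATCH on every call except the last. The fallback definition of MSG_BATCH is only for libc headers that do not expose it.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000   /* value from linux/socket.h; fallback only */
#endif

/*
 * Send count messages, one sendmsg call each. All but the last call
 * carry MSG_BATCH to signal that more messages follow.
 * Returns the number of messages sent, or -1 on the first failure
 * (errno is set by sendmsg).
 */
static int send_batch(int fd, const struct iovec *msgs, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        struct iovec iov = msgs[i];
        struct msghdr mh = { 0 };
        int flags = (i + 1 < count) ? MSG_BATCH : 0;

        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;

        if (sendmsg(fd, &mh, flags) < 0)
            return -1;
    }

    return (int)count;
}
```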

On receive, the KCM module attempts to queue messages received on the same KCM socket during each TCP ready callback. The targeted KCM socket changes at each receive ready callback on the KCM socket. The application does not need to configure this.

Error handling

An application should include a thread to monitor errors raised on the TCP connection. Normally, this will be done by placing each TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR event. If an error occurs on an attached TCP socket, KCM sets an EPIPE on the socket, thus waking up the application thread. When the application sees the error (which may just be a disconnect) it should unattach the socket from KCM and then close it. It is assumed that once an error is posted on the TCP socket the data stream is unrecoverable (i.e. an error may have occurred in the middle of receiving a message).
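A minimal sketch of that monitoring flow follows. The function names watch_tcp_socket and drop_tcp_socket are hypothetical. The first registers an attached TCP socket for error events only, since KCM itself handles data ready and write space events; the second is what the monitoring thread would call once epoll_wait reports an error on that socket.

```c
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sockios.h>  /* SIOCPROTOPRIVATE, used by SIOCKCMUNATTACH */
#include <linux/kcm.h>      /* struct kcm_unattach, SIOCKCMUNATTACH */

/* Watch an attached TCP socket for error events. Returns 0 or -1. */
static int watch_tcp_socket(int epfd, int tcpfd)
{
    struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Handle an error reported on tcpfd: unattach it from the multiplexor,
 * stop watching it, and close it.
 * Returns the result of the unattach ioctl (0 on success, -1 on error);
 * the socket is removed from epoll and closed in either case.
 */
static int drop_tcp_socket(int epfd, int kcmfd, int tcpfd)
{
    struct kcm_unattach info = { .fd = tcpfd };
    int rc = ioctl(kcmfd, SIOCKCMUNATTACH, &info);

    epoll_ctl(epfd, EPOLL_CTL_DEL, tcpfd, NULL);
    close(tcpfd);

    return rc;
}
```

After dropping a socket, the application can create a replacement connection and attach it if it still needs the capacity.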

TCP connection monitoring

In KCM there is no means to correlate a message to the TCP socket that was used to send or receive the message (except in the case there is only one attached TCP socket). However, the application does retain an open file descriptor to the socket so it will be able to get statistics from the socket which can be used in detecting issues (such as high retransmissions on the socket).