MSG_ZEROCOPY
Intro
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. The feature is currently implemented for TCP, UDP and VSOCK (with virtio transport) sockets.
Opportunity and Caveats
Copying large buffers between user process and kernel can be expensive. Linux supports various interfaces that eschew copying, such as sendfile and splice. The MSG_ZEROCOPY flag extends the underlying copy avoidance mechanism to common socket send calls.
Copy avoidance is not a free lunch. As implemented, with page pinning, it replaces per-byte copy cost with page accounting and completion notification overhead. As a result, MSG_ZEROCOPY is generally only effective for writes larger than roughly 10 KB.
Page pinning also changes system call semantics. It temporarily shares the buffer between process and network stack. Unlike with copying, the process cannot immediately overwrite the buffer after system call return without possibly modifying the data in flight. Kernel integrity is not affected, but a buggy program can possibly corrupt its own data stream.
The kernel returns a notification when it is safe to modify data. Converting an existing application to MSG_ZEROCOPY is therefore not always as trivial as just passing the flag.
More Info
Much of this document was derived from a longer paper presented at netdev 2.1. For more in-depth information see that paper and talk, the excellent reporting over at LWN.net or read the original code.
- paper, slides, video
- LWN article
- patchset
  [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
  https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy avoidance, but not the only one.
Socket Setup
The kernel is permissive when applications pass undefined flags to the send system call. By default it simply ignores these. To avoid enabling copy avoidance mode for legacy processes that accidentally already pass this flag, a process must first signal intent by setting a socket option:
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        error(1, errno, "setsockopt zerocopy");
Transmission
The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. Pass the new flag.
    ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
A zerocopy failure will return -1 with errno ENOBUFS. This happens if the socket exceeds its optmem limit or the user exceeds their ulimit on locked pages.
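One possible way to handle this failure is to retry the same buffer as an ordinary copying send. The sketch below assumes such a fallback is acceptable for the application. The helper name send_zc_or_copy and its zerocopy output parameter are hypothetical, and the fallback definition of MSG_ZEROCOPY is only there for older libc headers.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and, if the kernel refuses
 * with ENOBUFS, retry once as a regular copying send.
 *
 * Returns the send() result. *zerocopy is set to 1 only if the data
 * went out with MSG_ZEROCOPY, in which case a completion notification
 * is expected (assuming SO_ZEROCOPY was set on the socket).
 */
ssize_t send_zc_or_copy(int fd, const void *buf, size_t len, int *zerocopy)
{
    ssize_t ret;

    *zerocopy = 0;
    ret = send(fd, buf, len, MSG_ZEROCOPY);
    if (ret == -1 && errno == ENOBUFS)
        return send(fd, buf, len, 0);  /* copied: no notification */
    if (ret > 0)
        *zerocopy = 1;
    return ret;
}
```

After a fallback the buffer can be reused as soon as the call returns, because the kernel has taken its own copy.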
Mixing copy avoidance and copying
Many workloads have a mixture of large and small buffers. Because copy avoidance is more expensive than copying for small packets, the feature is implemented as a flag. It is safe to mix calls that pass the flag with calls that do not.
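A minimal sketch of such mixing is to choose the flag per call based on buffer size. The cut-off ZC_MIN_LEN is an assumption taken from the rough 10 KB figure given earlier, and zc_send_flags is a hypothetical helper name; the right threshold depends on the workload and hardware.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* fallback for older libc headers */
#endif

/* Assumed cut-off, roughly 10 KB. Tune per workload. */
#define ZC_MIN_LEN (10 * 1024)

/* Hypothetical helper: pick the send flags for a buffer of len bytes. */
int zc_send_flags(size_t len)
{
    return len >= ZC_MIN_LEN ? MSG_ZEROCOPY : 0;
}
```

A caller would then write ret = send(fd, buf, len, zc_send_flags(len)); and expect a notification only for the calls that passed the flag.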
Notifications
The kernel has to notify the process when it is safe to reuse a previously passed buffer. It queues completion notifications on the socket error queue, akin to the transmit timestamping interface.
The notification itself is a simple scalar value. Each socket maintains an internal unsigned 32-bit counter. Each send call with MSG_ZEROCOPY that successfully sends data increments the counter. The counter is not incremented on failure or if called with length zero. The counter counts system call invocations, not bytes. It wraps after UINT_MAX calls.
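Because the value only counts send calls, a process can mirror the counter in user space to know under which ID a given buffer will later be reported. The sketch below is one way to do that. The struct and function names are hypothetical, and it assumes the first counted send on a socket is reported as 0, which matches what the kernel selftest expects.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket zerocopy counter. */
struct zc_counter {
    uint32_t next_id;  /* ID the next counted send call will get */
};

/*
 * Record the outcome of one send call made with MSG_ZEROCOPY.
 * ret is the value returned by send(), len the requested length.
 *
 * Returns 1 and stores the call's notification ID in *id if the call
 * counts, or 0 if it does not (failure or zero length).
 */
int zc_counter_note(struct zc_counter *c, ssize_t ret, size_t len,
                    uint32_t *id)
{
    if (ret <= 0 || len == 0)
        return 0;
    *id = c->next_id++;  /* unsigned arithmetic wraps to 0 after the maximum */
    return 1;
}
```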
Notification Reception
The below snippet demonstrates the API. In the simplest case, each send syscall is followed by a poll and recvmsg on the error queue.
Reading from the error queue is always a non-blocking operation. The poll call is there to block until an error is outstanding. It will set POLLERR in its output flags. That flag does not have to be set in the events field. Errors are signaled unconditionally.
    pfd.fd = fd;
    pfd.events = 0;    /* POLLERR is reported even if not requested */
    if (poll(&pfd, 1, -1) != 1 || !(pfd.revents & POLLERR))
        error(1, errno, "poll");

    ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
    if (ret == -1)
        error(1, errno, "recvmsg");

    read_notification(&msg);
The example is for demonstration purposes only. In practice, it is more efficient to not wait for notifications, but read without blocking every couple of send calls.
Notifications can be processed out of order with other operations on the socket. A socket that has an error queued would normally block other operations until the error is read. Zerocopy notifications have a zero error code, however, to not block send and recv calls.
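The non-blocking variant recommended above could look roughly like the following sketch, which reads the error queue until it is empty. The helper name zc_drain, its callback parameter and the control buffer size are assumptions made for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read all queued notifications without blocking.
 * Calls handle() for each message read. Returns the number of messages
 * read, or -1 on an error other than an empty queue.
 */
int zc_drain(int fd, void (*handle)(struct msghdr *msg))
{
    /* Assumed large enough for one sock_extended_err control message. */
    union {
        char buf[128];
        struct cmsghdr align;
    } control;
    struct msghdr msg;
    int n = 0;

    for (;;) {
        memset(&msg, 0, sizeof(msg));
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
            if (errno == EAGAIN)
                return n;  /* queue is empty */
            return -1;
        }
        if (msg.msg_flags & MSG_CTRUNC)
            return -1;     /* control buffer was too small */
        handle(&msg);
        n++;
    }
}
```

A sender might call this once every few send calls, and fall back to poll only when it runs out of buffers to send from.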
Notification Batching
Multiple outstanding packets can be read at once using the recvmmsg call. This is often not needed. In each message the kernel returns not a single value, but a range. It coalesces consecutive notifications while one is outstanding for reception on the error queue.
When a new notification is about to be queued, it checks whether the new value extends the range of the notification at the tail of the queue. If so, it drops the new notification packet and instead increases the range upper value of the outstanding notification.
For protocols that acknowledge data in-order, like TCP, each notification can be squashed into the previous one, so that no more than one notification is outstanding at any one point.
Ordered delivery is the common case, but not guaranteed. Notifications may arrive out of order on retransmission and socket teardown.
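Since each message covers an inclusive range of IDs and ranges may arrive in any order, a process usually needs some table of in-flight buffers keyed by ID. The following is a minimal sketch of that bookkeeping, not a reference implementation. All names are hypothetical, and the table size is an assumed upper bound on unacknowledged sends, chosen as a power of two so that slot numbers stay consistent when the 32-bit counter wraps.

```c
#include <stddef.h>
#include <stdint.h>

#define ZC_MAX_INFLIGHT 64  /* assumed bound; must be a power of two */

/* Hypothetical table of in-flight buffers, indexed by ID modulo its size. */
struct zc_inflight {
    void *buf[ZC_MAX_INFLIGHT];
};

/* Number of send calls covered by the inclusive range [lo, hi]. */
uint32_t zc_range_len(uint32_t lo, uint32_t hi)
{
    return hi - lo + 1;  /* unsigned arithmetic handles counter wrap */
}

/*
 * Mark every buffer in [lo, hi] as reusable. Each slot is looked up by
 * its own ID, so the order in which ranges arrive does not matter.
 * release may be NULL. Returns the number of buffers released.
 */
uint32_t zc_complete_range(struct zc_inflight *t, uint32_t lo, uint32_t hi,
                           void (*release)(void *buf))
{
    uint32_t n = zc_range_len(lo, hi);
    uint32_t done = 0;
    uint32_t i;

    for (i = 0; i < n; i++) {
        uint32_t slot = (lo + i) % ZC_MAX_INFLIGHT;

        if (t->buf[slot]) {
            if (release)
                release(t->buf[slot]);
            t->buf[slot] = NULL;
            done++;
        }
    }
    return done;
}
```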
Notification Parsing
The below snippet demonstrates how to parse the control message: the read_notification() call in the previous snippet. A notification is encoded in the standard error format, sock_extended_err.
The level and type fields in the control data are protocol family specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR (for a TCP or UDP socket). For a VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be VSOCK_RECVERR.
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, as explained before, to avoid blocking read and write system calls on the socket.
The 32-bit notification range is encoded as [ee_info, ee_data]. This range is inclusive. Other fields in the struct must be treated as undefined, bar for ee_code, as discussed below.
    struct sock_extended_err *serr;
    struct cmsghdr *cm;

    cm = CMSG_FIRSTHDR(msg);
    /* Reject the message if either the level or the type does not match. */
    if (!cm || cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
        error(1, 0, "cmsg");

    serr = (void *) CMSG_DATA(cm);
    if (serr->ee_errno != 0 ||
        serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        error(1, 0, "serr");

    /* Inclusive range of completed send calls. */
    printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
Deferred copies
Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy avoidance, and a contract that the kernel will queue a completion notification. It is not a guarantee that the copy is elided.
Copy avoidance is not always feasible. Devices that do not support scatter-gather I/O cannot send packets made up of kernel generated protocol headers plus zerocopy user data. A packet may need to be converted to a private copy of data deep in the stack, say to compute a checksum.
In all these cases, the kernel returns a completion notification when it releases its hold on the shared pages. That notification may arrive before the (copied) data is fully transmitted. A zerocopy completion notification is not a transmit completion notification, therefore.
Deferred copies can be more expensive than a copy immediately in the system call, if the data is no longer warm in the cache. The process also incurs notification processing cost for no benefit. For this reason, the kernel signals if data was completed with a copy, by setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. A process may use this signal to stop passing flag MSG_ZEROCOPY on subsequent requests on the same socket.
Implementation
Loopback
For TCP and UDP: Data sent to local sockets can be queued indefinitely if the receive process does not read its socket. Unbounded notification latency is not acceptable. For this reason all packets generated with MSG_ZEROCOPY that are looped to a local socket will incur a deferred copy. This includes looping onto packet sockets (e.g., tcpdump) and tun devices.
For VSOCK: The data path for local sockets is the same as for non-local sockets.
Testing
More realistic example code can be found in the kernel source under tools/testing/selftests/net/msg_zerocopy.c.
Be cognizant of the loopback constraint. The test can be run between a pair of hosts. But if run between a local pair of processes, for instance when run with msg_zerocopy.sh between a veth pair across namespaces, the test will not show any improvement. For testing, the loopback restriction can be temporarily relaxed by making skb_orphan_frags_rx identical to skb_orphan_frags.
For VSOCK sockets, an example can be found in tools/testing/vsock/vsock_test_zerocopy.c.