BPF ring buffer
This document describes BPF ring buffer design, API, and implementation details.
Motivation
There are two distinct motivations for this work that are not satisfied by the existing perf buffer, and which prompted the creation of a new ring buffer implementation:
- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across multiple CPUs (e.g., fork/exec/exit events for a task).
These two problems are independent, but the perf buffer satisfies neither. Both are a result of the choice to have a per-CPU perf ring buffer, and both can be solved by an MPSC (multi-producer, single-consumer) ring buffer implementation. The ordering problem could technically be solved for the perf buffer with some in-kernel counting, but given that the first problem requires an MPSC buffer anyway, the same solution solves the ordering problem automatically.
Semantics and APIs
A single ring buffer is presented to BPF programs as an instance of a BPF map of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but ultimately rejected.
One way would be, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, to make BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but without enforcing a "same CPU only" rule. This would be a more familiar interface, compatible with existing perf buffer use in BPF, but it would fail if an application needed more advanced logic to look up a ring buffer by an arbitrary key. With the current approach, BPF_MAP_TYPE_HASH_OF_MAPS addresses this case. Additionally, given the performance of BPF ringbuf, many use cases would simply opt for a single ring buffer shared among all CPUs, for which the array approach would be overkill.
Another approach could introduce a new concept, alongside the BPF map, to represent a generic "container" object, which doesn't necessarily have a key/value interface with lookup/update/delete operations. This approach would require a lot of extra infrastructure to be built for observability and verifier support. It would also add another concept for BPF developers to familiarize themselves with, new syntax in libbpf, etc., while really providing no additional benefit over using a map. BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but neither do a few other map types (e.g., queue and stack; array doesn't support delete, etc.).
The approach chosen has the advantage of re-using existing BPF map infrastructure (introspection APIs in the kernel, libbpf support, etc.), being a familiar concept (no need to teach users a new type of object in a BPF program), and utilizing existing tooling (bpftool). For the common scenario of using a single ring buffer for all CPUs, it is as simple and straightforward as a dedicated "container" object would be. On the other hand, by being a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to implement a wide variety of topologies, from one ring buffer per CPU (e.g., as a replacement for perf buffer use cases), to complicated application-level hashing/sharding of ring buffers (e.g., a small pool of ring buffers with the hashed task's tgid as the lookup key, to preserve ordering while reducing contention).
Key and value sizes are enforced to be zero. max_entries is used to specify the size of the ring buffer and has to be a power of 2.
There are a number of similarities between the perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and the new BPF ring buffer semantics:
- variable-length records;
- if there is no more space left in the ring buffer, reservation fails; there is no blocking;
- memory-mappable data area for user-space applications, for ease of consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data to achieve the lowest latency, if necessary.
BPF ringbuf provides two sets of APIs to BPF programs:
- bpf_ringbuf_output() allows copying data from one place to the ring buffer, similarly to bpf_perf_event_output();
- bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs split the whole process into two steps. First, a fixed amount of space is reserved. If successful, a pointer to data inside the ring buffer data area is returned, which BPF programs can use similarly to data inside array/hash maps. Once ready, this piece of memory is either committed or discarded. Discard is similar to commit, but makes the consumer ignore the record.
bpf_ringbuf_output() has the disadvantage of incurring an extra memory copy, because the record has to be prepared in some other place first. But it allows submitting records of a length that is not known to the verifier beforehand. It also closely matches bpf_perf_event_output(), so it will simplify migration significantly.
bpf_ringbuf_reserve() avoids the extra copy by providing a pointer directly into ring buffer memory. In a lot of cases records are larger than BPF stack space allows, so many programs have to use an extra per-CPU array as a temporary heap for preparing a sample. bpf_ringbuf_reserve() avoids this need completely. In exchange, it only allows a known, constant size of memory to be reserved, so that the verifier can verify that the BPF program can't access memory outside its reserved record space. bpf_ringbuf_output(), while slightly slower due to the extra memory copy, covers some use cases that are not suitable for bpf_ringbuf_reserve().
The difference between commit and discard is very small. Discard just marks a record as discarded, and such records are supposed to be ignored by consumer code. Discard is useful for some advanced use cases, such as ensuring all-or-nothing multi-record submission, or emulating temporary malloc()/free() within a single BPF program invocation.
Each reserved record is tracked by the verifier through the existing reference-tracking logic, similar to socket ref-tracking. It is thus impossible to reserve a record but forget to submit (or discard) it.
The bpf_ringbuf_query() helper allows querying various properties of the ring buffer. Currently 4 are supported:
- BPF_RB_AVAIL_DATA returns the amount of unconsumed data in the ring buffer;
- BPF_RB_RING_SIZE returns the size of the ring buffer;
- BPF_RB_CONS_POS/BPF_RB_PROD_POS return the current logical position of the consumer/producer, respectively.
Returned values are momentary snapshots of ring buffer state and could be out of date by the time the helper returns, so this should be used only for debugging/reporting purposes or for implementing various heuristics that take into account the highly-changeable nature of some of these characteristics.
One such heuristic might involve more fine-grained control over poll/epoll notifications about new data availability in the ring buffer. Together with the BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for the output/commit/discard helpers, it allows a BPF program a high degree of control and, e.g., more efficient batched notifications. The default self-balancing strategy, though, should be adequate for most applications and will work reliably and efficiently already.
Design and Implementation
This reserve/commit schema allows a natural way for multiple producers, either on different CPUs or even on the same CPU/in the same BPF program, to reserve independent records and work with them without blocking other producers. This means that if a BPF program was interrupted by another BPF program sharing the same ring buffer, both will get a record reserved (provided there is enough space left) and can work with it and submit it independently. This applies to NMI context as well, except that, due to the use of a spinlock during reservation, in NMI context bpf_ringbuf_reserve() might fail to get the lock, in which case the reservation fails even if the ring buffer is not full.
The ring buffer itself is internally implemented as a power-of-2 sized circular buffer, with two logical, ever-increasing counters (which might wrap around on 32-bit architectures; that's not a problem):
- consumer counter shows up to which logical position the consumer has consumed the data;
- producer counter denotes the amount of data reserved by all producers.
Each time a record is reserved, the producer that "owns" the record will successfully advance the producer counter. At that point, though, the data is not yet ready to be consumed. Each record has an 8-byte header, which contains the length of the reserved record, as well as two extra bits: a busy bit denoting that the record is still being worked on, and a discard bit, which might be set at commit time if the record is discarded. In the latter case, the consumer is supposed to skip the record and move on to the next one. The record header also encodes the record's relative offset from the beginning of the ring buffer data area (in pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer to the record itself, without also requiring a pointer to the ring buffer; the ring buffer memory location is restored from the record metadata header. This significantly simplifies the verifier, as well as improving API usability.
Producer counter increments are serialized under a spinlock, so there is a strict ordering between reservations. Commits, on the other hand, are completely lockless and independent. All records become available to the consumer in the order of reservations, but only after all previous records have been committed. It is thus possible for slow producers to temporarily hold off submitted records that were reserved later.
One interesting implementation detail, which significantly simplifies (and thus speeds up) both producers and consumers, is that the data area is mapped twice, contiguously back-to-back, in virtual memory. This makes it possible to avoid any special handling of samples that wrap around at the end of the circular buffer data area, because the page after the last data page is the first data page again, so the sample still appears completely contiguous in virtual memory. See the comment and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc().
Another feature that distinguishes BPF ringbuf from the perf ring buffer is self-pacing notification of new data availability. The bpf_ringbuf_commit() implementation sends a notification of a new record being available after commit only if the consumer has already caught up right to the record being committed. If not, the consumer still has to catch up, and thus will see the new data anyway without needing an extra poll notification. Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that this allows achieving very high throughput without having to resort to tricks like "notify only every Nth sample", which are necessary with the perf buffer. For extreme cases, when a BPF program wants more manual control of notifications, the commit/discard/output helpers accept BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data availability, but require extra caution and diligence in using this API.