NAPI

NAPI is the event handling mechanism used by the Linux networking stack. The name NAPI no longer stands for anything in particular[1].

In basic operation the device notifies the host about new events via an interrupt. The host then schedules a NAPI instance to process the events. The device may also be polled for events via NAPI without receiving interrupts first (busy polling).

NAPI processing usually happens in the software interrupt context, but there is an option to use separate kernel threads for NAPI processing.

All in all NAPI abstracts away from the drivers the context and configuration of event (packet Rx and Tx) processing.

Driver API

The two most important elements of NAPI are the struct napi_struct and the associated poll method. struct napi_struct holds the state of the NAPI instance while the method is the driver-specific event handler. The method will typically free Tx packets that have been transmitted and process newly received packets.

Control API

netif_napi_add() and netif_napi_del() add/remove a NAPI instance from the system. The instances are attached to the netdevice passed as argument (and will be deleted automatically when the netdevice is unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state. A disabled NAPI can't be scheduled and its poll method is guaranteed to not be invoked. napi_disable() waits for ownership of the NAPI instance to be released.

The control APIs are not idempotent. Control API calls are safe against concurrent use of datapath APIs but an incorrect sequence of control API calls may result in crashes, deadlocks, or race conditions. For example, calling napi_disable() multiple times in a row will deadlock.

Datapath API

napi_schedule() is the basic method of scheduling a NAPI poll. Drivers should call this function in their interrupt handler (see Scheduling and IRQ masking for more info). A successful call to napi_schedule() will take ownership of the NAPI instance.
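
For illustration, a minimal sketch of such an interrupt handler is shown below. The mydrv_vector structure and mydrv_* names are hypothetical, and the sketch assumes the device auto-masks its IRQ on assertion (see Scheduling and IRQ masking below for the explicit-masking variant):

static irqreturn_t mydrv_irq_handler(int irq, void *data)
{
        struct mydrv_vector *v = data;  /* hypothetical per-queue vector */

        /* The device auto-masked its IRQ; hand the work off to NAPI.
         * The poll method will run later in softirq (or kthread) context.
         */
        napi_schedule(&v->napi);

        return IRQ_HANDLED;
}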

Later, after NAPI is scheduled, the driver's poll method will be called to process the events/packets. The method takes a budget argument - drivers can process completions for any number of Tx packets but should only process up to budget number of Rx packets. Rx processing is usually much more expensive.

In other words for Rx processing the budget argument limits how many packets the driver can process in a single poll. Rx specific APIs like page pool or XDP cannot be used at all when budget is 0. skb Tx processing should happen regardless of the budget, but if the argument is 0 the driver cannot call any XDP (or page pool) APIs.

Warning

The budget argument may be 0 if the core tries to only process skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still has outstanding work to do (e.g. budget was exhausted) the poll method should return exactly budget. In that case, the NAPI instance will be serviced/polled again (without the need to be scheduled).

If event processing has been completed (all outstanding packets processed) the poll method should call napi_complete_done() before returning. napi_complete_done() releases the ownership of the instance.

Warning

The case of finishing all events and using exactly budget must be handled carefully. There is no way to report this (rare) condition to the stack, so the driver must either not call napi_complete_done() and wait to be called again, or return budget-1.

If the budget is 0, napi_complete_done() should never be called.
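
Putting these rules together, a simplified sketch of a poll method might look as follows. The mydrv_* helpers are hypothetical and real drivers also account for Tx work, XDP, and so on:

static int mydrv_napi_poll(struct napi_struct *napi, int budget)
{
        struct mydrv_vector *v = container_of(napi, struct mydrv_vector, napi);
        int work_done = 0;

        /* skb Tx completions are processed regardless of the budget */
        mydrv_clean_tx_ring(v);

        /* Rx (and any XDP / page pool work) only when budget is non-zero */
        if (budget)
                work_done = mydrv_clean_rx_ring(v, budget);

        /* Budget exhausted (or budget == 0): do not call
         * napi_complete_done(), ask to be polled again.
         */
        if (work_done >= budget)
                return budget;

        /* All events processed: release ownership, then unmask the IRQ */
        if (napi_complete_done(napi, work_done))
                mydrv_unmask_rxtx_irq(v->idx);

        return work_done;
}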

Call sequence

Drivers should not make assumptions about the exact sequencing of calls. The poll method may be called without the driver scheduling the instance (unless the instance is disabled). Similarly, it's not guaranteed that the poll method will be called, even if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the Control API section - napi_disable() and subsequent calls to the poll method only wait for the ownership of the instance to be released, not for the poll method to exit. This means that drivers should avoid accessing any data structures after calling napi_complete_done().

Scheduling and IRQ masking

Drivers should keep the interrupts masked after scheduling the NAPI instance - until NAPI polling finishes any further interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed to IRQ being auto-masked by the device) should use the napi_schedule_prep() and __napi_schedule() calls:

if (napi_schedule_prep(&v->napi)) {
        mydrv_mask_rxtx_irq(v->idx);
        /* schedule after masking to avoid races */
        __napi_schedule(&v->napi);
}

IRQ should only be unmasked after a successful call to napi_complete_done():

if (budget && napi_complete_done(&v->napi, work_done)) {
        mydrv_unmask_rxtx_irq(v->idx);
        return min(work_done, budget - 1);
}

napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage of guarantees given by being invoked in IRQ context (no need to mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if IRQs are threaded (such as if PREEMPT_RT is enabled).

Instance to queue mapping

Modern devices have multiple NAPI instances (struct napi_struct) per interface. There is no strong requirement on how the instances are mapped to queues and interrupts. NAPI is primarily a polling/processing abstraction without specific user-facing semantics. That said, most networking devices end up using NAPI in fairly similar ways.

NAPI instances most often correspond 1:1:1 to interrupts and queue pairs (a queue pair is a set of a single Rx and single Tx queue).

In less common cases a NAPI instance may be used for multiple queues or Rx and Tx queues can be serviced by separate NAPI instances on a single core. Regardless of the queue assignment, however, there is usually still a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses a "channel" terminology where each channel can be either rx, tx or combined. It's not clear what constitutes a channel; the recommended interpretation is to understand a channel as an IRQ/NAPI which services queues of a given type. For example, a configuration of 1 rx, 1 tx and 1 combined channel is expected to utilize 3 interrupts, 2 Rx and 2 Tx queues.
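
For instance, such a channel configuration could be requested with ethtool (the interface name is illustrative):

$ ethtool -L eth0 rx 1 tx 1 combined 1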

Persistent NAPI config

Drivers often allocate and free NAPI instances dynamically. This leads to loss of NAPI-related user configuration each time NAPI instances are reallocated. The netif_napi_add_config() API prevents this loss of configuration by associating each NAPI instance with a persistent NAPI configuration based on a driver defined index value, like a queue number.

Using this API allows for persistent NAPI IDs (among other settings), which can be beneficial to userspace programs using SO_INCOMING_NAPI_ID. See the sections below for other NAPI configuration settings.

Drivers should try to use netif_napi_add_config() whenever possible.
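
As a sketch, a driver that registers one NAPI instance per queue pair might tie each instance to a persistent config slot keyed by the queue index (mydrv_* names and the vector structure are hypothetical):

/* replaces a plain netif_napi_add() call at queue/vector setup time;
 * v->idx selects the persistent per-queue config slot
 */
netif_napi_add_config(netdev, &v->napi, mydrv_napi_poll, v->idx);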

User API

User interactions with NAPI depend on the NAPI instance ID. The instance IDs are only visible to the user through the SO_INCOMING_NAPI_ID socket option.
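
For example, a minimal user space sketch of reading the NAPI ID of a connected socket (error handling omitted):

#include <sys/socket.h>

unsigned int napi_id = 0;
socklen_t len = sizeof(napi_id);

/* fd is a connected socket; reports the NAPI ID of the queue which
 * delivered its most recently received packets (0 if none yet).
 */
getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len);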

Users can query NAPI IDs for a device or device queue using netlink. This can be done programmatically in a user application or by using a script included in the kernel source tree: tools/net/ynl/pyynl/cli.py.

For example, using the script to dump all of the queues for a device (which will reveal each queue's NAPI ID):

$ kernel-source/tools/net/ynl/pyynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --dump queue-get \
      --json='{"ifindex": 2}'

See Documentation/netlink/specs/netdev.yaml for more details on available operations and attributes.

Software IRQ coalescing

NAPI does not perform any explicit event coalescing by default. In most scenarios batching happens due to IRQ coalescing which is done by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking the hardware interrupts as soon as all packets are processed. The gro_flush_timeout sysfs configuration of the netdevice is reused to control the delay of the timer, while napi_defer_hard_irqs controls the number of consecutive empty polls before NAPI gives up and goes back to using hardware IRQs.
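
For example, assuming an interface named eth0, the device-wide settings could be adjusted as follows (the values are illustrative; gro_flush_timeout is in nanoseconds):

$ echo 20000 > /sys/class/net/eth0/gro_flush_timeout
$ echo 2 > /sys/class/net/eth0/napi_defer_hard_irqs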

The above parameters can also be set on a per-NAPI basis using netlink via netdev-genl. When used with netlink and configured on a per-NAPI basis, the parameters mentioned above use hyphens instead of underscores: gro-flush-timeout and napi-defer-hard-irqs.

Per-NAPI configuration can be done programmatically in a user application or by using a script included in the kernel source tree: tools/net/ynl/pyynl/cli.py.

For example, using the script:

$ kernel-source/tools/net/ynl/pyynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set \
      --json='{"id": 345,
               "defer-hard-irqs": 111,
               "gro-flush-timeout": 11111}'

Similarly, the parameter irq-suspend-timeout can be set using netlink via netdev-genl. There is no global sysfs parameter for this value.

irq-suspend-timeout is used to determine how long an application can completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL, which can be set on a per-epoll context basis with the EPIOCSPARAMS ioctl.
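
For example, using the same script as above (the NAPI ID and the timeout value, in nanoseconds, are illustrative):

$ kernel-source/tools/net/ynl/pyynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set \
      --json='{"id": 345, "irq-suspend-timeout": 20000000}'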

Busy polling

Busy polling allows a user process to check for incoming packets before the device interrupt fires. As is the case with any busy polling it trades off CPU cycles for lower latency (production uses of NAPI busy polling are not well known).

Busy polling is enabled by either setting SO_BUSY_POLL on selected sockets or using the global net.core.busy_poll and net.core.busy_read sysctls. An io_uring API for NAPI busy polling also exists. Threaded polling of NAPI also has a mode to busy poll for packets (threaded busy polling) using the NAPI processing kthread.
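
As an illustration, per-socket busy polling could be enabled roughly like this (the 50 microsecond value is arbitrary):

#include <sys/socket.h>

int busy_poll_usecs = 50;

/* busy poll this socket for up to 50 us per blocking receive */
setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_poll_usecs,
           sizeof(busy_poll_usecs));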

epoll-based busy polling

It is possible to trigger packet processing directly from calls to epoll_wait. In order to use this feature, a user application must ensure all file descriptors which are added to an epoll context have the same NAPI ID.

If the application uses a dedicated acceptor thread, the application can obtain the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then distribute that file descriptor to a worker thread. The worker thread would add the file descriptor to its epoll context. This would ensure each worker thread has an epoll context with FDs that have the same NAPI ID.

Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can be inserted to distribute incoming connections to threads such that each thread is only given incoming connections with the same NAPI ID. Care must be taken to carefully handle cases where a system may have multiple NICs.

In order to enable busy polling, there are two choices:

  1. /proc/sys/net/core/busy_poll can be set with a time in microseconds to busy loop waiting for events. This is a system-wide setting and will cause all epoll-based applications to busy poll when they call epoll_wait. This may not be desirable as many applications may not have the need to busy poll.

  2. Applications using recent kernels can issue an ioctl on the epoll context file descriptor to set (EPIOCSPARAMS) or get (EPIOCGPARAMS) struct epoll_params, which user programs can define as follows:

struct epoll_params {
        uint32_t busy_poll_usecs;
        uint16_t busy_poll_budget;
        uint8_t prefer_busy_poll;

        /* pad the struct to a multiple of 64bits */
        uint8_t __pad;
};
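
A sketch of setting these parameters on an epoll context, assuming the kernel headers in use provide EPIOCSPARAMS (the values are illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/epoll.h>

struct epoll_params params;

memset(&params, 0, sizeof(params));
params.busy_poll_usecs = 100;   /* busy poll up to 100 us per epoll_wait */
params.busy_poll_budget = 16;   /* packets per busy poll attempt */
params.prefer_busy_poll = 1;

/* epfd was obtained from epoll_create1(); fails on kernels
 * that do not support EPIOCSPARAMS
 */
if (ioctl(epfd, EPIOCSPARAMS, &params) == -1)
        perror("EPIOCSPARAMS");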

IRQ mitigation

While busy polling is supposed to be used by low latency applications, a similar mechanism can be used for IRQ mitigation.

Very high request-per-second applications (especially routing/forwarding applications and especially applications using AF_XDP sockets) may not want to be interrupted until they finish processing a request or a batch of packets.

Such applications can pledge to the kernel that they will perform a busy polling operation periodically, and the driver should keep the device IRQs permanently masked. This mode is enabled by using the SO_PREFER_BUSY_POLL socket option. To avoid system misbehavior the pledge is revoked if gro_flush_timeout passes without any busy poll call. For epoll-based busy polling applications, the prefer_busy_poll field of struct epoll_params can be set to 1 and the EPIOCSPARAMS ioctl can be issued to enable this mode. See the above section for more details.

The NAPI budget for busy polling is lower than the default (which makes sense given the low latency intention of normal busy polling). This is not the case with IRQ mitigation, however, so the budget can be adjusted with the SO_BUSY_POLL_BUDGET socket option. For epoll-based busy polling applications, the busy_poll_budget field can be adjusted to the desired value in struct epoll_params and set on a specific epoll context using the EPIOCSPARAMS ioctl. See the above section for more details.
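
For socket-based (non-epoll) users, a sketch of the equivalent setup (the budget value is illustrative):

#include <sys/socket.h>

int one = 1;
int budget = 64;

setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one));
setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget));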

It is important to note that choosing a large value for gro_flush_timeout will defer IRQs to allow for better batch processing, but will induce latency when the system is not fully loaded. Choosing a small value for gro_flush_timeout can cause device IRQs and softirq processing to interfere with the user application which is attempting to busy poll. This value should be chosen carefully with these tradeoffs in mind. epoll-based busy polling applications may be able to mitigate how much user processing happens by choosing an appropriate value for maxevents.

Users may want to consider an alternate approach, IRQ suspension, to help deal with these tradeoffs.

IRQ suspension

IRQ suspension is a mechanism wherein device IRQs are masked while epoll triggers NAPI packet processing.

While application calls to epoll_wait successfully retrieve events, the kernel will defer the IRQ suspension timer. If the kernel does not retrieve any events while busy polling (for example, because network traffic levels subsided), IRQ suspension is disabled and the IRQ mitigation strategies described above are engaged.

This allows users to balance CPU consumption with network processing efficiency.

To use this mechanism:

  1. The per-NAPI config parameter irq-suspend-timeout should be set to the maximum time (in nanoseconds) the application can have its IRQs suspended. This is done using netlink, as described above. This timeout serves as a safety mechanism to restart IRQ driven interrupt processing if the application has stalled. This value should be chosen so that it covers the amount of time the user application needs to process data from its call to epoll_wait, noting that applications can control how much data they retrieve by setting max_events when calling epoll_wait.

  2. The sysfs parameter or per-NAPI config parameters gro_flush_timeout and napi_defer_hard_irqs can be set to low values. They will be used to defer IRQs after busy poll has found no data.

  3. The prefer_busy_poll flag must be set to true. This can be done using the EPIOCSPARAMS ioctl as described above.

  4. The application uses epoll as described above to trigger NAPI packet processing.

As mentioned above, as long as subsequent calls to epoll_wait return events to userland, the irq-suspend-timeout is deferred and IRQs are disabled. This allows the application to process data without interference.

Once a call to epoll_wait results in no events being found, IRQ suspension is automatically disabled and the gro_flush_timeout and napi_defer_hard_irqs mitigation mechanisms take over.

It is expected that irq-suspend-timeout will be set to a value much larger than gro_flush_timeout as irq-suspend-timeout should suspend IRQs for the duration of one userland processing cycle.

While it is not strictly necessary to use napi_defer_hard_irqs and gro_flush_timeout to use IRQ suspension, their use is strongly recommended.

IRQ suspension causes the system to alternate between polling mode and irq-driven packet delivery. During busy periods, irq-suspend-timeout overrides gro_flush_timeout and keeps the system busy polling, but when epoll finds no events, the setting of gro_flush_timeout and napi_defer_hard_irqs determine the next step.

There are essentially three possible loops for network processing and packet delivery:

  1. hardirq -> softirq -> napi poll; basic interrupt delivery

  2. timer -> softirq -> napi poll; deferred irq processing

  3. epoll -> busy-poll -> napi poll; busy looping

Loop 2 can take control from Loop 1, if gro_flush_timeout and napi_defer_hard_irqs are set.

If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and 3 "wrestle" with each other for control.

During busy periods, irq-suspend-timeout is used as the timer in Loop 2, which essentially tilts network processing in favour of Loop 3.

If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3 cannot take control from Loop 1.

Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the recommended usage, because otherwise setting irq-suspend-timeout might not have any discernible effect.

Threaded NAPI busy polling

Threaded NAPI busy polling extends threaded NAPI and adds support for continuous busy polling of the NAPI instance. This can be useful for forwarding or AF_XDP applications.

Threaded NAPI busy polling can be enabled on a per NIC queue basis using netlink.

For example, using the following script:

$ ynl --family netdev --do napi-set \
      --json='{"id": 66, "threaded": "busy-poll"}'

The kernel will create a kthread that busy polls on this NAPI.

The user may elect to set the CPU affinity of this kthread to an unused CPU core to improve how often the NAPI is polled at the expense of wasted CPU cycles. Note that this will keep the CPU core busy with 100% usage.

Once threaded busy polling is enabled for a NAPI, the PID of the kthread can be retrieved using netlink so the affinity of the kthread can be set up.

For example, the following script can be used to fetch the PID:

$ ynl --family netdev --do napi-get --json='{"id": 66}'

This will output something like the following, where pid 258 is the PID of the kthread that is polling this NAPI:

{'defer-hard-irqs': 0,
 'gro-flush-timeout': 0,
 'id': 66,
 'ifindex': 2,
 'irq-suspend-timeout': 0,
 'pid': 258,
 'threaded': 'busy-poll'}

Threaded NAPI

Threaded NAPI is an operating mode that uses dedicated kernel threads rather than software IRQ context for NAPI processing. Each threaded NAPI instance will spawn a separate thread (called napi/${ifc-name}-${napi-id}).

It is recommended to pin each kernel thread to a single CPU, the same CPU as the CPU which services the interrupt. Note that the mapping between IRQs and NAPI instances may not be trivial (and is driver dependent). The NAPI instance IDs will be assigned in the opposite order than the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the threaded file in netdev's sysfs directory. It can also be enabled for a specific NAPI using the netlink interface.

For example, using the script:

$ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}'
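
For comparison, enabling threaded NAPI for all NAPI instances of a device via sysfs might look like this (assuming an interface named eth0):

$ echo 1 > /sys/class/net/eth0/threaded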

Footnotes

[1]

NAPI was originally referred to as New API in 2.4 Linux.