Network Devices, the Kernel, and You!
Introduction
The following is a random collection of documentation regarding network devices. It is intended for driver developers.
struct net_device lifetime rules
Network device structures need to persist even after module is unloaded and must be allocated with alloc_netdev_mqs() and friends. If device has registered successfully, it will be freed on last use by free_netdev(). This is required to handle the pathological case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu)
alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver private data which gets freed when the network device is freed. If separately allocated data is attached to the network device (netdev_priv()) then it is up to the module exit handler to free that.
There are two groups of APIs for registering struct net_device. The first group can be used in normal contexts where rtnl_lock is not already held: register_netdev(), unregister_netdev(). The second group can be used when rtnl_lock is already held: register_netdevice(), unregister_netdevice(), free_netdev().
Simple drivers
Most drivers (especially device drivers) handle lifetime of struct net_device in context where rtnl_lock is not held (e.g. driver probe and remove paths).
In that case the struct net_device registration is done using the register_netdev(), and unregister_netdev() functions:
    int probe()
    {
        struct my_device_priv *priv;
        struct net_device *dev;
        int err;

        dev = alloc_netdev_mqs(...);
        if (!dev)
            return -ENOMEM;
        priv = netdev_priv(dev);

        /* ... do all device setup before calling register_netdev() ... */

        err = register_netdev(dev);
        if (err)
            goto err_undo;

        /* net_device is visible to the user! */
        return 0;

    err_undo:
        /* ... undo the device setup ... */
        free_netdev(dev);
        return err;
    }

    void remove()
    {
        unregister_netdev(dev);
        free_netdev(dev);
    }
Note that after calling register_netdev() the device is visible in the system. Users can open it and start sending / receiving traffic immediately, or run any other callback, so all initialization must be done prior to registration.
unregister_netdev() closes the device and waits for all users to be done with it. The memory of struct net_device itself may still be referenced by sysfs but all operations on that device will fail.
free_netdev() can be called after unregister_netdev() returns or when register_netdev() failed.
Device management under RTNL
Registering struct net_device while in context which already holds the rtnl_lock requires extra care. In those scenarios most drivers will want to make use of struct net_device’s needs_free_netdev and priv_destructor members for freeing of state.
Example flow of netdev handling under rtnl_lock:
    static void my_setup(struct net_device *dev)
    {
        dev->needs_free_netdev = true;
    }

    static void my_destructor(struct net_device *dev)
    {
        struct my_device_priv *priv = netdev_priv(dev);

        some_obj_destroy(priv->obj);
        some_uninit(priv);
    }

    int create_link()
    {
        struct my_device_priv *priv;
        struct net_device *dev;
        int err;

        ASSERT_RTNL();

        dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
        if (!dev)
            return -ENOMEM;
        priv = netdev_priv(dev);

        /* Implicit constructor */
        err = some_init(priv);
        if (err)
            goto err_free_dev;

        priv->obj = some_obj_create();
        if (!priv->obj) {
            err = -ENOMEM;
            goto err_some_uninit;
        }
        /* End of constructor, set the destructor: */
        dev->priv_destructor = my_destructor;

        err = register_netdevice(dev);
        if (err)
            /* register_netdevice() calls destructor on failure */
            goto err_free_dev;

        /* If anything fails now unregister_netdevice() (or unregister_netdev())
         * will take care of calling my_destructor and free_netdev().
         */
        return 0;

    err_some_uninit:
        some_uninit(priv);
    err_free_dev:
        free_netdev(dev);
        return err;
    }
If struct net_device.priv_destructor is set, it will be called by the core some time after unregister_netdevice(); it will also be called if register_netdevice() fails. The callback may be invoked with or without rtnl_lock held.
There is no explicit constructor callback; the driver “constructs” the private netdev state after allocating it and before registration.
Setting struct net_device.needs_free_netdev makes the core call free_netdev() automatically after unregister_netdevice() when all references to the device are gone. It only takes effect after a successful call to register_netdevice(), so if register_netdevice() fails the driver is responsible for calling free_netdev().
free_netdev() is safe to call on error paths right after unregister_netdevice() or when register_netdevice() fails. Parts of netdev (de)registration process happen after rtnl_lock is released, therefore in those cases free_netdev() will defer some of the processing until rtnl_lock is released.
Devices spawned from struct rtnl_link_ops should never free the struct net_device directly.
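The interplay of priv_destructor and needs_free_netdev described above can be summarized with a small user-space model. This is only a sketch under stated assumptions: every model_* name is hypothetical and nothing here is kernel API. The model also runs the destructor and the free synchronously inside the unregister call, whereas the core defers them until the last reference is gone.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device; not the kernel structure. */
struct model_netdev {
	bool needs_free_netdev;
	bool registered;
	void (*priv_destructor)(struct model_netdev *dev);
};

/* Counters so a caller can observe what the model did. */
static int model_destructor_calls;
static int model_free_calls;

static struct model_netdev *model_alloc_netdev(void)
{
	return calloc(1, sizeof(struct model_netdev));
}

static void model_free_netdev(struct model_netdev *dev)
{
	model_free_calls++;
	free(dev);
}

/* A non-zero 'fail' simulates a registration error. */
static int model_register_netdevice(struct model_netdev *dev, int fail)
{
	if (fail) {
		/* The destructor runs on failure ... */
		if (dev->priv_destructor)
			dev->priv_destructor(dev);
		/* ... but needs_free_netdev has no effect yet: caller frees. */
		return -1;
	}
	dev->registered = true;
	return 0;
}

static void model_unregister_netdevice(struct model_netdev *dev)
{
	dev->registered = false;
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	/* After a successful registration the model frees on the driver's behalf. */
	if (dev->needs_free_netdev)
		model_free_netdev(dev);
}

/* Example destructor: only counts invocations. */
static void model_destructor(struct model_netdev *dev)
{
	(void)dev;
	model_destructor_calls++;
}
```

In this model a failed model_register_netdevice() leaves the device allocated, so the caller still has to call model_free_netdev(), while a successful register followed by model_unregister_netdevice() frees the device without any further call from the driver.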
.ndo_init and .ndo_uninit
.ndo_init and .ndo_uninit callbacks are called during net_device registration and de-registration, under rtnl_lock. Drivers can use those e.g. when parts of their init process need to run under rtnl_lock.
.ndo_init runs before device is visible in the system, .ndo_uninit runs during de-registering after device is closed but other subsystems may still have outstanding references to the netdevice.
MTU
Each network device has a Maximum Transfer Unit (MTU). Upper layer protocols must not pass a socket buffer (skb) to a device to transmit with more data than the MTU. The MTU does not include any link layer protocol overhead, so for example on Ethernet, if the standard MTU of 1500 bytes is used, the actual skb will contain up to 1514 bytes because of the Ethernet header. Devices should allow for the 4 byte VLAN header as well.
Segmentation Offload (GSO, TSO) is an exception to this rule. The upper layer protocol may pass a large socket buffer to the device transmit routine, and the device will break that up into separate packets based on the current MTU.
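As a rough illustration of how many wire packets one large buffer turns into, the sketch below does the arithmetic for one specific case. It assumes TCP over IPv4 with no IP or TCP options (20 bytes of header each) and an MTU larger than 40 bytes; the model_* names are hypothetical and this is not how any particular device or the stack computes its segment size.

```c
#include <stddef.h>

#define MODEL_IPV4_HLEN 20u /* assumption: IPv4 header without options */
#define MODEL_TCP_HLEN  20u /* assumption: TCP header without options */

/* TCP payload bytes that fit in one packet at the given MTU (mtu > 40). */
static inline size_t model_tcp_mss(size_t mtu)
{
	return mtu - MODEL_IPV4_HLEN - MODEL_TCP_HLEN;
}

/* Number of packets needed for 'payload' bytes, rounding up. */
static inline size_t model_tso_segments(size_t payload, size_t mtu)
{
	size_t mss = model_tcp_mss(mtu);

	return (payload + mss - 1) / mss;
}
```

With an MTU of 1500 each packet carries 1460 payload bytes under these assumptions, so a 64000 byte buffer is split into 44 packets.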
MTU is symmetrical and applies both to receive and transmit. A device must be able to receive at least the maximum size packet allowed by the MTU. A network device may use the MTU as a mechanism to size receive buffers, but the device should allow packets with a VLAN header. With the standard Ethernet MTU of 1500 bytes, the device should allow up to 1518 byte packets (1500 + 14 header + 4 tag). The device may either drop, truncate, or pass up oversize packets, but dropping oversize packets is preferred.
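The frame-size arithmetic above can be written down as a pair of helpers. This is a minimal sketch: the MODEL_* constants and model_* functions are local to the example (the values 14 and 4 are the standard Ethernet header and 802.1Q tag sizes), and a real driver may need extra room for alignment or hardware-specific padding.

```c
#include <stddef.h>

#define MODEL_ETH_HLEN  14u /* destination MAC + source MAC + EtherType */
#define MODEL_VLAN_HLEN  4u /* one 802.1Q tag */

/* Largest frame handed to the device for transmit: MTU plus Ethernet header. */
static inline size_t model_max_tx_frame(size_t mtu)
{
	return mtu + MODEL_ETH_HLEN;
}

/* Largest frame the device should accept on receive: also allow a VLAN tag. */
static inline size_t model_max_rx_frame(size_t mtu)
{
	return mtu + MODEL_ETH_HLEN + MODEL_VLAN_HLEN;
}
```

For the standard MTU of 1500 these give 1514 and 1518 bytes, matching the numbers in the text.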
struct net_device synchronization rules
- ndo_open:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
- ndo_stop:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
  Note: netif_running() is guaranteed false
- ndo_do_ioctl:
  Synchronization: rtnl_lock() semaphore.
  This is only called by network subsystems internally, not by user space calling ioctl as it was before linux-5.14.
- ndo_siocbond:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
  Used by the bonding driver for the SIOCBOND family of ioctl commands.
- ndo_siocwandev:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
  Used by the drivers/net/wan framework to handle the SIOCWANDEV ioctl with the if_settings structure.
- ndo_siocdevprivate:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
  This is used to implement SIOCDEVPRIVATE ioctl helpers. These should not be added to new drivers, so don’t use.
- ndo_eth_ioctl:
  Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API.
  Context: process
- ndo_get_stats:
  Synchronization: RCU (can be called concurrently with the stats update path).
  Context: atomic (can’t sleep under RCU)
- ndo_start_xmit:
  Synchronization: __netif_tx_lock spinlock.
  When the driver sets dev->lltx this will be called without holding netif_tx_lock. In this case the driver has to lock by itself when needed. The locking there should also properly protect against set_rx_mode. WARNING: use of dev->lltx is deprecated. Don’t use it for new drivers.
  Context: Process with BHs disabled or BH (timer), will be called with interrupts disabled by netconsole.
  Return codes:
  - NETDEV_TX_OK: everything ok.
  - NETDEV_TX_BUSY: Cannot transmit packet, try later. Usually a bug, means queue start/stop flow control is broken in the driver. Note: the driver must NOT put the skb in its DMA ring.
- ndo_tx_timeout:
  Synchronization: netif_tx_lock spinlock; all TX queues frozen.
  Context: BHs disabled
  Notes: netif_queue_stopped() is guaranteed true
- ndo_set_rx_mode:
  Synchronization: netif_addr_lock spinlock.
  Context: BHs disabled
- ndo_setup_tc:
  TC_SETUP_BLOCK and TC_SETUP_FT are running under NFT locks (i.e. no rtnl_lock and no device instance lock). The rest of tc_setup_type types run under netdev instance lock if the driver implements queue management or shaper API.
Most ndo callbacks not specified in the list above are running under rtnl_lock. In addition, netdev instance lock is taken as well if the driver implements queue management or shaper API.
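The remark under ndo_start_xmit that NETDEV_TX_BUSY usually indicates broken queue start/stop flow control can be illustrated with a user-space model of a TX ring. This is a sketch only: the ring size, the model_* names and the 'stopped' flag are all hypothetical, standing in for a hardware descriptor ring and for the stack's per-queue stopped state.

```c
#include <stdbool.h>

#define MODEL_RING_SIZE 4u /* hypothetical number of TX descriptors */

enum model_tx_ret { MODEL_TX_OK, MODEL_TX_BUSY };

struct model_txq {
	unsigned int in_flight; /* descriptors currently owned by hardware */
	bool stopped;           /* models the queue being stopped for the stack */
};

/* Transmit path: queue one packet and stop the queue once the ring is full. */
static enum model_tx_ret model_start_xmit(struct model_txq *q)
{
	if (q->in_flight == MODEL_RING_SIZE)
		/* Only reachable if the caller ignored 'stopped';
		 * the packet is NOT placed in the ring.
		 */
		return MODEL_TX_BUSY;

	q->in_flight++;
	if (q->in_flight == MODEL_RING_SIZE)
		q->stopped = true; /* stop before the next packet would overflow */
	return MODEL_TX_OK;
}

/* Completion path: reclaim one descriptor and let the stack transmit again. */
static void model_tx_complete(struct model_txq *q)
{
	if (q->in_flight == 0)
		return;
	q->in_flight--;
	q->stopped = false;
}
```

The design point is that the queue is stopped in the same call that consumes the last free descriptor, so a caller that honours the stopped state never sees MODEL_TX_BUSY.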
struct napi_struct synchronization rules
- napi->poll:
  Synchronization: NAPI_STATE_SCHED bit in napi->state. Device driver’s ndo_stop method will invoke napi_disable() on all NAPI instances which will do a sleeping poll on the NAPI_STATE_SCHED napi->state bit, waiting for all pending NAPI activity to cease.
  Context: softirq, will be called with interrupts disabled by netconsole.
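The idea of a single state bit serializing poll against disable can be modelled in user space with C11 atomics. This is a simplified sketch, not the kernel implementation: the model_* names are hypothetical, the real napi_disable() sleeps and retries rather than returning false, and the assumption that disable keeps the bit set afterwards is part of the model.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_napi {
	atomic_bool sched; /* stands in for the NAPI_STATE_SCHED bit */
};

static void model_napi_init(struct model_napi *n)
{
	atomic_init(&n->sched, false);
}

/* Claim the instance for polling; only one caller wins until completion. */
static bool model_napi_schedule_prep(struct model_napi *n)
{
	return !atomic_exchange(&n->sched, true);
}

/* Poll finished: release ownership so the instance can be scheduled again. */
static void model_napi_complete(struct model_napi *n)
{
	atomic_store(&n->sched, false);
}

/* One attempt of the disable wait: succeeds only when no poll owns the bit,
 * and then keeps the bit set so the instance cannot be scheduled again.
 */
static bool model_napi_try_disable(struct model_napi *n)
{
	return model_napi_schedule_prep(n);
}
```

In this model, disabling is just another owner of the bit: while a poll is in flight the attempt fails, and once it succeeds every later schedule attempt is refused.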
netdev instance lock
Historically, all networking control operations were protected by a single global lock known as rtnl_lock. There is an ongoing effort to replace this global lock with separate locks for each network namespace. Additionally, properties of individual netdev are increasingly protected by per-netdev locks.
For device drivers that implement shaping or queue management APIs, all control operations will be performed under the netdev instance lock. Drivers can also explicitly request instance lock to be held during ops by setting request_ops_lock to true. Code comments and docs refer to drivers which have ops called under the instance lock as “ops locked”. See also the documentation of the lock member of struct net_device.
In the future, there will be an option for individual drivers to opt out of using rtnl_lock and instead perform their control operations directly under the netdev instance lock.
Device drivers are encouraged to rely on the instance lock where possible.
For the (mostly software) drivers that need to interact with the core stack, there are two sets of interfaces: dev_xxx/netdev_xxx and netif_xxx (e.g., dev_set_mtu and netif_set_mtu). The dev_xxx/netdev_xxx functions handle acquiring the instance lock themselves, while the netif_xxx functions assume that the driver has already acquired the instance lock.
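The split between the two interface families follows a common locked-wrapper pattern, sketched below as a user-space model. All model_* names are hypothetical, the instance lock is reduced to a boolean flag, and returning an error when the lock is not held is a choice made for this sketch rather than a description of what the kernel functions do.

```c
#include <stdbool.h>

struct model_netdev {
	bool lock_held;   /* stands in for the netdev instance lock */
	unsigned int mtu;
};

static void model_netdev_lock(struct model_netdev *dev)
{
	dev->lock_held = true;
}

static void model_netdev_unlock(struct model_netdev *dev)
{
	dev->lock_held = false;
}

/* netif_xxx style: the caller must already hold the instance lock. */
static int model_netif_set_mtu(struct model_netdev *dev, unsigned int new_mtu)
{
	if (!dev->lock_held)
		return -1; /* sketch only: report misuse instead of changing state */
	dev->mtu = new_mtu;
	return 0;
}

/* dev_xxx style: takes the instance lock around the netif_xxx variant. */
static int model_dev_set_mtu(struct model_netdev *dev, unsigned int new_mtu)
{
	int err;

	model_netdev_lock(dev);
	err = model_netif_set_mtu(dev, new_mtu);
	model_netdev_unlock(dev);
	return err;
}
```

Code that already runs with the lock held would call the model_netif_ variant directly; everything else goes through the model_dev_ wrapper.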
struct net_device_ops
ndos are called without holding the instance lock for most drivers.
“Ops locked” drivers will have most of the ndos invoked under the instance lock.
struct ethtool_ops
Similarly to ndos the instance lock is only held for select drivers. For “ops locked” drivers all ethtool ops without exceptions should be called under the instance lock.
struct netdev_stat_ops
“qstat” ops are invoked under the instance lock for “ops locked” drivers, and under rtnl_lock for all other drivers.
struct net_shaper_ops
All net shaper callbacks are invoked while holding the netdev instance lock. rtnl_lock may or may not be held.
Note that supporting net shapers automatically enables “ops locking”.
struct netdev_queue_mgmt_ops
All queue management callbacks are invoked while holding the netdev instance lock. rtnl_lock may or may not be held.
Note that supporting struct netdev_queue_mgmt_ops automatically enables “ops locking”.
Notifiers and netdev instance lock
For device drivers that implement shaping or queue management APIs, some of the notifiers (enum netdev_cmd) are running under the netdev instance lock.
The following netdev notifiers are always run under the instance lock:
- NETDEV_XDP_FEAT_CHANGE
For devices with locked ops, currently only the following notifiers are running under the lock:
- NETDEV_CHANGE
- NETDEV_REGISTER
- NETDEV_UP
The following notifiers are running without the lock:
- NETDEV_UNREGISTER
There are no clear expectations for the remaining notifiers. Notifiers not on the list may run with or without the instance lock, potentially even invoking the same notifier type with and without the lock from different code paths. The goal is to eventually ensure that all (or most, with a few documented exceptions) notifiers run under the instance lock. Please extend this documentation whenever you make an explicit assumption about the lock being held from a notifier.
NETDEV_INTERNAL symbol namespace
Symbols exported as NETDEV_INTERNAL can only be used in networking core and drivers which exclusively flow via the main networking list and trees. Note that the inverse is not true, most symbols outside of NETDEV_INTERNAL are not expected to be used by random code outside netdev either. Symbols may lack the designation because they predate the namespaces, or simply due to an oversight.