Network Function Representors
This document describes the semantics and usage of representor netdevices, as used to control internal switching on SmartNICs. For the closely-related port representors on physical (multi-port) switches, see Documentation/networking/switchdev.rst.
Motivation
Since the mid-2010s, network cards have started offering more complex virtualisation capabilities than the legacy SR-IOV approach (with its simple MAC/VLAN-based switching model) can support. This led to a desire to offload software-defined networks (such as Open vSwitch) to these NICs to specify the network connectivity of each function. The resulting designs are variously called SmartNICs or DPUs.
Network function representors bring the standard Linux networking stack to virtual switches and IOV devices. Just as each physical port of a Linux-controlled switch has a separate netdev, so does each virtual port of a virtual switch. When the system boots, and before any offload is configured, all packets from the virtual functions appear in the networking stack of the PF via the representors. The PF can thus always communicate freely with the virtual functions. The PF can configure standard Linux forwarding between representors, the uplink or any other netdev (routing, bridging, TC classifiers).
Thus, a representor is both a control plane object (representing the function in administrative commands) and a data plane object (one end of a virtual pipe). As a virtual link endpoint, the representor can be configured like any other netdevice; in some cases (e.g. link state) the representee will follow the representor’s configuration, while in others there are separate APIs to configure the representee.
Definitions
This document uses the term “switchdev function” to refer to the PCIe function which has administrative control over the virtual switch on the device. Typically, this will be a PF, but conceivably a NIC could be configured to grant these administrative privileges instead to a VF or SF (subfunction). Depending on NIC design, a multi-port NIC might have a single switchdev function for the whole device or might have a separate virtual switch, and hence switchdev function, for each physical network port. If the NIC supports nested switching, there might be separate switchdev functions for each nested switch, in which case each switchdev function should only create representors for the ports on the (sub-)switch it directly administers.
A “representee” is the object that a representor represents. So for example in the case of a VF representor, the representee is the corresponding VF.
What does a representor do?
A representor has three main roles.
1. It is used to configure the network connection the representee sees, e.g. link up/down, MTU, etc. For instance, bringing the representor administratively UP should cause the representee to see a link up / carrier on event.
2. It provides the slow path for traffic which does not hit any offloaded fast-path rules in the virtual switch. Packets transmitted on the representor netdevice should be delivered to the representee; packets transmitted by the representee which fail to match any switching rule should be received on the representor netdevice. (That is, there is a virtual pipe connecting the representor to the representee, similar in concept to a veth pair.) This allows software switch implementations (such as Open vSwitch or a Linux bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer to the representee, allowing these rules to be offloaded.
The combination of 2) and 3) means that the behaviour (apart from performance) should be the same whether a TC filter is offloaded or not. E.g. a TC rule on a VF representor applies in software to packets received on that representor netdevice, while in hardware offload it would apply to packets transmitted by the representee VF. Conversely, a mirred egress redirect to a VF representor corresponds in hardware to delivery directly to the representee VF.
What functions should have a representor?
Essentially, for each virtual port on the device’s internal switch, there should be a representor. Some vendors have chosen to omit representors for the uplink and the physical network port, which can simplify usage (the uplink netdev becomes in effect the physical port’s representor) but does not generalise to devices with multiple ports or uplinks.
Thus, the following should all have representors:
- VFs belonging to the switchdev function.
- Other PFs on the local PCIe controller, and any VFs belonging to them.
- PFs and VFs on external PCIe controllers on the device (e.g. for any embedded System-on-Chip within the SmartNIC).
- PFs and VFs with other personalities, including network block devices (such as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only if) their network access is implemented through a virtual switch port. [1] Note that such functions can require a representor despite the representee not having a netdev.
- Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have their own port on the switch (as opposed to using their parent PF’s port).
- Any accelerators or plugins on the device whose interface to the network is through a virtual switch port, even if they do not have a corresponding PCIe PF or VF.
This allows the entire switching behaviour of the NIC to be controlled through representor TC rules.
It is a common misunderstanding to conflate virtual ports with PCIe virtual functions or their netdevs. While in simple cases there will be a 1:1 correspondence between VF netdevices and VF representors, more advanced device configurations may not follow this. A PCIe function which does not have network access through the internal switch (not even indirectly through the hardware implementation of whatever services the function provides) should not have a representor (even if it has a netdev). Such a function has no switch virtual port for the representor to configure or to be the other end of the virtual pipe. The representor represents the virtual port, not the PCIe function nor the ‘end user’ netdevice.
[1] The concept here is that a hardware IP stack in the device performs the translation between block DMA requests and network packets, so that only network packets pass through the virtual port onto the switch. The network access that the IP stack “sees” would then be configurable through tc rules; e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However, any needed configuration of the block device qua block device, not being a networking entity, would not be appropriate for the representor and would thus use some other channel such as devlink. Contrast this with the case of a virtio-blk implementation which forwards the DMA requests unchanged to another PF whose driver then initiates and terminates IP traffic in software; in that case the DMA traffic would not run over the virtual switch and the virtio-blk PF should thus not have a representor.
How are representors created?
The driver instance attached to the switchdev function should, for each virtual port on the switch, create a pure-software netdevice which has some form of in-kernel reference to the switchdev function’s own netdevice or driver private data (netdev_priv()). This may be by enumerating ports at probe time, reacting dynamically to the creation and destruction of ports at run time, or a combination of the two.
The operations of the representor netdevice will generally involve acting through the switchdev function. For example, ndo_start_xmit() might send the packet through a hardware TX queue attached to the switchdev function, with either packet metadata or queue configuration marking it for delivery to the representee.
How are representors identified?
The representor netdevice should not directly refer to a PCIe device (e.g. through net_dev->dev.parent / SET_NETDEV_DEV()), either of the representee or of the switchdev function. Instead, the driver should use the SET_NETDEV_DEVLINK_PORT macro to assign a devlink port instance to the netdevice before registering the netdevice; the kernel uses the devlink port to provide the phys_switch_id and phys_port_name sysfs nodes. (Some legacy drivers implement ndo_get_port_parent_id() and ndo_get_phys_port_name() directly, but this is deprecated.) See Documentation/networking/devlink/devlink-port.rst for the details of this API.
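As an illustration of how userland might consume these sysfs nodes, the following sketch lists every netdevice that reports a given phys_switch_id, together with its phys_port_name. The function name list_switch_ports and its optional second argument (an alternative sysfs directory, useful for trying the function out against a mock tree) are assumptions made for this example, not part of any kernel or distribution tooling.

```shell
# Print "<netdev> <phys_port_name>" for each netdev whose phys_switch_id
# equals $1.  $2 optionally overrides the sysfs directory to scan
# (hypothetical convenience for this sketch; defaults to /sys/class/net).
list_switch_ports() {
    want_id=$1
    net_dir=${2:-/sys/class/net}
    for dev in "$net_dir"/*; do
        [ -e "$dev/phys_switch_id" ] || continue
        # The read fails on netdevs that have no switch parent; skip those.
        id=$(cat "$dev/phys_switch_id" 2>/dev/null) || continue
        [ "$id" = "$want_id" ] || continue
        name=$(cat "$dev/phys_port_name" 2>/dev/null) || name=
        # Print "-" when the netdev reports no port name.
        printf '%s %s\n' "${dev##*/}" "${name:--}"
    done
}
```

For example, list_switch_ports "$(cat /sys/class/net/eth4/phys_switch_id)" would be expected to list eth4 itself along with any representors on the same virtual switch, assuming the driver populates both nodes.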
It is expected that userland will use this information (e.g. through udev rules) to construct an appropriately informative name or alias for the netdevice. For instance if the switchdev function is eth4 then a representor with a phys_port_name of p0pf1vf2 might be renamed eth4pf1vf2rep.
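The renaming in that example can be sketched as a small helper that a udev rule or script might call. The helper name rep_name and the exact policy (drop the leading physical-port component, append "rep") are assumptions chosen to reproduce the eth4pf1vf2rep example; no naming convention is mandated.

```shell
# Build a representor name from the switchdev function's netdev name ($1)
# and the representor's phys_port_name ($2).
rep_name() {
    parent=$1
    port_name=$2
    case $port_name in
        p[0-9]*pf[0-9]*)
            # Drop the leading physical-port component (e.g. "p0"),
            # keeping the function part (e.g. "pf1vf2").
            suffix=pf${port_name#p*pf}
            ;;
        *)
            # Unrecognised format: keep the whole port name.
            suffix=$port_name
            ;;
    esac
    printf '%s%srep\n' "$parent" "$suffix"
}
```

With these assumptions, rep_name eth4 p0pf1vf2 prints eth4pf1vf2rep. A real policy would also need to keep the result within the kernel's interface name length limit.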
There are as yet no established conventions for naming representors which do not correspond to PCIe functions (e.g. accelerators and plugins).
How do representors interact with TC rules?
Any TC rule on a representor applies (in software TC) to packets received by that representor netdevice. Thus, if the delivery part of the rule corresponds to another port on the virtual switch, the driver may choose to offload it to hardware, applying it to packets transmitted by the representee.
Similarly, since a TC mirred egress action targeting the representor would (in software) send the packet through the representor (and thus indirectly deliver it to the representee), hardware offload should interpret this as delivery to the representee.
As a simple example, if PORT_DEV is the physical port representor and REP_DEV is a VF representor, the following rules:
    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV
would mean that all IPv4 packets from the VF are sent out the physical port, and all IPv4 packets received on the physical port are delivered to the VF in addition to PORT_DEV. (Note that without skip_sw on the second rule, the VF would get two copies, as the packet reception on PORT_DEV would trigger the TC rule again and mirror the packet to REP_DEV.)
On devices without separate port and uplink representors, PORT_DEV would instead be the switchdev function’s own uplink netdevice.
Of course the rules can (if supported by the NIC) include packet-modifying actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
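As a sketch of what such packet-modifying rules might look like, the helper below prints a pair of tc batch commands: one pushes a VLAN tag onto IPv4 packets from the VF before redirecting them to the physical port, the other pops the tag from matching packets arriving on the physical port before redirecting them to the VF. The function name vlan_rules and the choice of VLAN ID are illustrative assumptions; the sketch also assumes that an ingress qdisc (parent ffff:) already exists on both netdevices, and whether the rules are actually offloaded depends on the NIC and driver.

```shell
# Print tc batch commands that tag a VF's IPv4 traffic with a VLAN on the wire.
# $1: VF representor, $2: physical port representor, $3: VLAN ID.
vlan_rules() {
    rep=$1
    port=$2
    vid=$3
    # VF -> wire: push the tag, then send out of the physical port.
    printf 'filter add dev %s parent ffff: protocol ipv4 flower' "$rep"
    printf ' action vlan push id %s' "$vid"
    printf ' action mirred egress redirect dev %s\n' "$port"
    # Wire -> VF: match the tag, pop it, then deliver to the VF.
    printf 'filter add dev %s parent ffff: protocol 802.1Q flower' "$port"
    printf ' vlan_id %s vlan_ethtype ipv4' "$vid"
    printf ' action vlan pop'
    printf ' action mirred egress redirect dev %s\n' "$rep"
}
```

The output is meant to be reviewed first and then fed to tc in batch mode, for instance with vlan_rules "$REP_DEV" "$PORT_DEV" 100 | tc -batch -.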
Tunnel encapsulation and decapsulation are rather more complicated, as they involve a third netdevice (a tunnel netdev operating in metadata mode, such as a VxLAN device created with ip link add vxlan0 type vxlan external) and require an IP address to be bound to the underlay device (e.g. switchdev function uplink netdev or port representor). TC rules such as:
    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                              dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV
where LOCAL_IP is an IP address bound to PORT_DEV, and REMOTE_IP is another IP address on the same subnet, mean that packets sent by the VF should be VxLAN encapsulated and sent out the physical port (the driver has to deduce this by a route lookup of LOCAL_IP leading to PORT_DEV, and also perform an ARP/neighbour table lookup to find the MAC addresses to use in the outer Ethernet frame), while UDP packets received on the physical port with UDP port 4789 should be parsed as VxLAN and, if their VNI matches $VNI, decapsulated and forwarded to the VF.
If this all seems complicated, just remember the ‘golden rule’ of TC offload: the hardware should ensure the same final results as if the packets were processed through the slow path, traversed software TC (except ignoring any skip_hw rules and applying any skip_sw rules) and were transmitted or received through the representor netdevices.
Configuring the representee’s MAC
The representee’s link state is controlled through the representor. Setting the representor administratively UP or DOWN should cause carrier ON or OFF at the representee.
Setting an MTU on the representor should cause that same MTU to be reported to the representee. (On hardware that allows configuring separate and distinct MTU and MRU values, the representor MTU should correspond to the representee’s MRU and vice-versa.)
Currently there is no way to use the representor to set the station permanent MAC address of the representee; other methods available to do this include:
- legacy SR-IOV (ip link set DEVICE vf NUM mac LLADDR)
- devlink port function (see devlink-port(8) and Documentation/networking/devlink/devlink-port.rst)