Open vSwitch datapath developer documentation¶
The Open vSwitch kernel module allows flexible userspace control overflow-level packet processing on selected network devices. It can beused to implement a plain Ethernet switch, network device bonding,VLAN processing, network access control, flow-based network control,and so on.
The kernel module implements multiple “datapaths” (analogous tobridges), each of which can have multiple “vports” (analogous to portswithin a bridge). Each datapath also has associated with it a “flowtable” that userspace populates with “flows” that map from keys basedon packet headers and metadata to sets of actions. The most commonaction forwards the packet to another vport; other actions are alsoimplemented.
When a packet arrives on a vport, the kernel module processes it byextracting its flow key and looking it up in the flow table. If thereis a matching flow, it executes the associated actions. If there isno match, it queues the packet to userspace for processing (as part ofits processing, userspace will likely set up a flow to handle furtherpackets of the same type entirely in-kernel).
Flow key compatibility¶
Network protocols evolve over time. New protocols become importantand existing protocols lose their prominence. For the Open vSwitchkernel module to remain relevant, it must be possible for newerversions to parse additional protocols as part of the flow key. Itmight even be desirable, someday, to drop support for parsingprotocols that have become obsolete. Therefore, the Netlink interfaceto Open vSwitch is designed to allow carefully written userspaceapplications to work with any version of the flow key, past or future.
To support this forward and backward compatibility, whenever thekernel module passes a packet to userspace, it also passes along theflow key that it parsed from the packet. Userspace then extracts itsown notion of a flow key from the packet and compares it against thekernel-provided version:
- If userspace’s notion of the flow key for the packet matches thekernel’s, then nothing special is necessary.
- If the kernel’s flow key includes more fields than the userspaceversion of the flow key, for example if the kernel decoded IPv6headers but userspace stopped at the Ethernet type (because itdoes not understand IPv6), then again nothing special isnecessary. Userspace can still set up a flow in the usual way,as long as it uses the kernel-provided flow key to do it.
- If the userspace flow key includes more fields than thekernel’s, for example if userspace decoded an IPv6 header butthe kernel stopped at the Ethernet type, then userspace canforward the packet manually, without setting up a flow in thekernel. This case is bad for performance because every packetthat the kernel considers part of the flow must go to userspace,but the forwarding behavior is correct. (If userspace candetermine that the values of the extra fields would not affectforwarding behavior, then it could set up a flow anyway.)
How flow keys evolve over time is important to making this work, sothe following sections go into detail.
Flow key format¶
A flow key is passed over a Netlink socket as a sequence of Netlinkattributes. Some attributes represent packet metadata, defined as anyinformation about a packet that cannot be extracted from the packetitself, e.g. the vport on which the packet was received. Mostattributes, however, are extracted from headers within the packet,e.g. source and destination addresses from Ethernet, IP, or TCPheaders.
The <linux/openvswitch.h> header file defines the exact format of theflow key attributes. For informal explanatory purposes here, we writethem as comma-separated strings, with parentheses indicating argumentsand nesting. For example, the following could represent a flow keycorresponding to a TCP packet that arrived on vport 1:
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,frag=no), tcp(src=49163, dst=80)
Often we ellipsize arguments not important to the discussion, e.g.:
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
Wildcarded flow key format¶
A wildcarded flow is described with two sequences of Netlink attributespassed over the Netlink socket. A flow key, exactly as described above, and anoptional corresponding flow mask.
A wildcarded flow can represent a group of exact match flows. Each ‘1’ bitin the mask specifies a exact match with the corresponding bit in the flow key.A ‘0’ bit specifies a don’t care bit, which will match either a ‘1’ or ‘0’ bitof a incoming packet. Using wildcarded flow can improve the flow set up rateby reduce the number of new flows need to be processed by the user space program.
Support for the mask Netlink attribute is optional for both the kernel and userspace program. The kernel can ignore the mask attribute, installing an exactmatch flow, or reduce the number of don’t care bits in the kernel to less thanwhat was specified by the user space program. In this case, variations in bitsthat the kernel does not implement will simply result in additional flow setups.The kernel module will also work with user space programs that neither supportnor supply flow mask attributes.
Since the kernel may ignore or modify wildcard bits, it can be difficult forthe userspace program to know exactly what matches are installed. There aretwo possible approaches: reactively install flows as they miss the kernelflow table (and therefore not attempt to determine wildcard changes at all)or use the kernel’s response messages to determine the installed wildcards.
When interacting with userspace, the kernel should maintain the match portionof the key exactly as originally installed. This will provides a handle toidentify the flow for all future operations. However, when reporting themask of an installed flow, the mask should include any restrictions imposedby the kernel.
The behavior when using overlapping wildcarded flows is undefined. It is theresponsibility of the user space program to ensure that any incoming packetcan match at most one flow, wildcarded or not. The current implementationperforms best-effort detection of overlapping wildcarded flows and may rejectsome but not all of them. However, this behavior may change in future versions.
Unique flow identifiers¶
An alternative to using the original match portion of a key as the handle forflow identification is a unique flow identifier, or “UFID”. UFIDs are optionalfor both the kernel and user space program.
User space programs that support UFID are expected to provide it during flowsetup in addition to the flow, then refer to the flow using the UFID for allfuture operations. The kernel is not required to index flows by the originalflow key if a UFID is specified.
Basic rule for evolving flow keys¶
Some care is needed to really maintain forward and backwardcompatibility for applications that follow the rules listed under“Flow key compatibility” above.
The basic rule is obvious:
==================================================================New network protocol support must only supplement existing flowkey attributes. It must not change the meaning of already definedflow key attributes.==================================================================
This rule does have less-obvious consequences so it is worth workingthrough a few examples. Suppose, for example, that the kernel moduledid not already implement VLAN parsing. Instead, it just interpretedthe 802.1Q TPID (0x8100) as the Ethertype then stopped parsing thepacket. The flow key for any packet with an 802.1Q header would lookessentially like this, ignoring metadata:
eth(...), eth_type(0x8100)
Naively, to add VLAN support, it makes sense to add a new “vlan” flowkey attribute to contain the VLAN tag, then continue to decode theencapsulated headers beyond the VLAN tag using the existing fielddefinitions. With this change, a TCP packet in VLAN 10 would have aflow key much like this:
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
But this change would negatively affect a userspace application thathas not been updated to understand the new “vlan” flow key attribute.The application could, following the flow compatibility rules above,ignore the “vlan” attribute that it does not understand and thereforeassume that the flow contained IP packets. This is a bad assumption(the flow only contains IP packets if one parses and skips over the802.1Q header) and it could cause the application’s behavior to changeacross kernel versions even though it follows the compatibility rules.
The solution is to use a set of nested attributes. This is, forexample, why 802.1Q support uses nested attributes. A TCP packet inVLAN 10 is actually expressed as:
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),ip(proto=6, ...), tcp(...)))
Notice how the “eth_type”, “ip”, and “tcp” flow key attributes arenested inside the “encap” attribute. Thus, an application that doesnot understand the “vlan” key will not see either of those attributesand therefore will not misinterpret them. (Also, the outer eth_typeis still 0x8100, not changed to 0x0800.)
Handling malformed packets¶
Don’t drop packets in the kernel for malformed protocol headers, badchecksums, etc. This would prevent userspace from implementing asimple Ethernet switch that forwards every packet.
Instead, in such a case, include an attribute with “empty” content.It doesn’t matter if the empty content could be valid protocol values,as long as those values are rarely seen in practice, because userspacecan always forward all packets with those values to userspace andhandle them individually.
For example, consider a packet that contains an IP header thatindicates protocol 6 for TCP, but which is truncated just after the IPheader, so that the TCP header is missing. The flow key for thispacket would include a tcp attribute with all-zero src and dst, likethis:
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
As another example, consider a packet with an Ethernet type of 0x8100,indicating that a VLAN TCI should follow, but which is truncated justafter the Ethernet type. The flow key for this packet would includean all-zero-bits vlan and an empty encap attribute, like this:
eth(...), eth_type(0x8100), vlan(0), encap()
Unlike a TCP packet with source and destination ports 0, anall-zero-bits VLAN TCI is not that rare, so the CFI bit (akaVLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlanattribute expressly to allow this situation to be distinguished.Thus, the flow key in this second example unambiguously indicates amissing or malformed VLAN TCI.
Other rules¶
The other rules for flow keys are much less subtle:
- Duplicate attributes are not allowed at a given nesting level.
- Ordering of attributes is not significant.
- When the kernel sends a given flow key to userspace, it alwayscomposes it the same way. This allows userspace to hash andcompare entire flow keys that it may not be able to fullyinterpret.