struct sk_buff
sk_buff is the main networking structure representing a packet.
Basic sk_buff geometry
struct sk_buff itself is a metadata structure and does not hold any packet data. All the data is held in associated buffers.
sk_buff.head points to the main “head” buffer. The head buffer is divided into two parts:
data buffer, containing headers and sometimes payload; this is the part of the skb operated on by the common helpers such as skb_put() or skb_pull();
shared info (struct skb_shared_info) which holds an array of pointers to read-only data in the (page, offset, length) format.
Optionally skb_shared_info.frag_list may point to another skb.
Basic diagram may look like this:
                                ---------------
                               | sk_buff       |
                                ---------------
   ,---------------------------  + head
  /          ,-----------------  + data
 /          /      ,-----------  + tail
|          |      |            , + end
|          |      |           |
v          v      v           v
 -----------------------------------------------
| headroom | data |  tailroom | skb_shared_info |
 -----------------------------------------------
                              + [page frag]
                              + [page frag]
                              + [page frag]
                              + [page frag]       ---------
                              + frag_list    --> | sk_buff |
                                                  ---------
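The pointer arithmetic behind the diagram can be sketched in userspace C. This is a minimal model, not the kernel's struct sk_buff: the type skb_model and every model_* function are hypothetical names introduced for illustration, and they only mimic what skb_reserve(), skb_put() and skb_pull() do to the head, data, tail and end pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the four pointers in the diagram.
 * This is NOT the kernel's struct sk_buff. */
struct skb_model {
	unsigned char *head; /* start of the head buffer */
	unsigned char *data; /* start of the packet data */
	unsigned char *tail; /* end of the packet data */
	unsigned char *end;  /* end of the data buffer; shared info would follow */
};

static void model_init(struct skb_model *m, unsigned char *buf, size_t size)
{
	m->head = m->data = m->tail = buf;
	m->end = buf + size;
}

static size_t model_headroom(const struct skb_model *m)
{
	return (size_t)(m->data - m->head);
}

static size_t model_len(const struct skb_model *m)
{
	return (size_t)(m->tail - m->data);
}

static size_t model_tailroom(const struct skb_model *m)
{
	return (size_t)(m->end - m->tail);
}

/* Mimics skb_reserve(): grow the headroom of a still-empty buffer. */
static void model_reserve(struct skb_model *m, size_t len)
{
	assert(model_len(m) == 0 && len <= model_tailroom(m));
	m->data += len;
	m->tail += len;
}

/* Mimics skb_put(): extend the data area at the tail and
 * return the start of the newly added region. */
static unsigned char *model_put(struct skb_model *m, size_t len)
{
	unsigned char *old_tail = m->tail;

	assert(len <= model_tailroom(m));
	m->tail += len;
	return old_tail;
}

/* Mimics skb_pull(): drop bytes from the front of the data area. */
static unsigned char *model_pull(struct skb_model *m, size_t len)
{
	assert(len <= model_len(m));
	m->data += len;
	return m->data;
}
```

In this model, reserving 32 bytes in a 128-byte buffer, putting 60 bytes and then pulling 14 leaves 46 bytes of headroom, 46 bytes of data and 36 bytes of tailroom.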
Shared skbs and skb clones
sk_buff.users is a simple refcount allowing multiple entities to keep a struct sk_buff alive. skbs with a sk_buff.users != 1 are referred to as shared skbs (see skb_shared()).
skb_clone() allows for fast duplication of skbs. None of the data buffers get copied, but the caller gets a new metadata struct (struct sk_buff). skb_shared_info.dataref indicates the number of skbs pointing at the same packet data (i.e. clones).
dataref and headerless skbs
Transport layers send out clones of payload skbs they hold for retransmissions. To allow lower layers of the stack to prepend their headers we split skb_shared_info.dataref into two halves. The lower 16 bits count the overall number of references. The higher 16 bits indicate how many of the references are payload-only. skb_header_cloned() checks if skb is allowed to add / write the headers.
The creator of the skb (e.g. TCP) marks its skb as sk_buff.nohdr (via __skb_header_release()). Any clone created from a marked skb will get sk_buff.hdr_len populated with the available headroom. If it is the only clone in existence, it is able to modify the headroom at will. The sequence of calls inside the transport layer is:
    <alloc skb>
    skb_reserve()
    __skb_header_release()
    skb_clone()
    // send the clone down the stack
This is not a very generic construct and it depends on the transport layers doing the right thing. In practice there’s usually only one payload-only skb. Having multiple payload-only skbs with different lengths of hdr_len is not possible. The payload-only skbs should never leave their owner.
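The split of dataref into two 16-bit halves can be illustrated with a small userspace sketch. The constants DATAREF_SHIFT and DATAREF_MASK and all function names below are hypothetical and chosen for this example only. The code models just the counter arithmetic described above, and the real skb_header_cloned() may consult additional skb state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this sketch: the lower 16 bits count all
 * references, the upper 16 bits count payload-only references. */
#define DATAREF_SHIFT 16
#define DATAREF_MASK  ((1u << DATAREF_SHIFT) - 1)

/* A clone adds one overall reference (lower half). */
static uint32_t dataref_add_clone(uint32_t dataref)
{
	return dataref + 1;
}

/* Marking an existing holder as payload-only bumps the upper half. */
static uint32_t dataref_mark_payload_only(uint32_t dataref)
{
	return dataref + (1u << DATAREF_SHIFT);
}

/* Number of references that may still touch the headers. */
static uint32_t dataref_header_refs(uint32_t dataref)
{
	uint32_t total = dataref & DATAREF_MASK;
	uint32_t payload_only = dataref >> DATAREF_SHIFT;

	assert(payload_only <= total);
	return total - payload_only;
}

/* In this model a clone may write headers only when it is the single
 * reference that is not payload-only. */
static int clone_may_write_headers(uint32_t dataref)
{
	return dataref_header_refs(dataref) == 1;
}
```

Starting from a fresh buffer with one reference, marking the owner payload-only and creating one clone leaves exactly one header-capable reference, so that clone may prepend headers. A second clone raises the count to two and the check fails for both.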
Checksum information
The interface for checksum offload between the stack and networking drivers is as follows...
IP checksum related features
Drivers advertise checksum offload capabilities in the features of a device. From the stack’s point of view these are capabilities offered by the driver. A driver typically only advertises features that it is capable of offloading to its device.
| NETIF_F_HW_CSUM | The driver (or its device) is able to compute one IP (one’s complement) checksum for any combination of protocols or protocol layering. The checksum is computed and set in a packet per the CHECKSUM_PARTIAL interface (see below). |
| NETIF_F_IP_CSUM | Driver (device) is only able to checksum plain TCP or UDP packets over IPv4. These are specifically unencapsulated packets of the form IPv4|TCP or IPv4|UDP where the Protocol field in the IPv4 header is TCP or UDP. The IPv4 header may contain IP options. This feature cannot be set in features for a device with NETIF_F_HW_CSUM also set. This feature is being DEPRECATED (see below). |
| NETIF_F_IPV6_CSUM | Driver (device) is only able to checksum plain TCP or UDP packets over IPv6. These are specifically unencapsulated packets of the form IPv6|TCP or IPv6|UDP where the Next Header field in the IPv6 header is either TCP or UDP. IPv6 extension headers are not supported with this feature. This feature cannot be set in features for a device with NETIF_F_HW_CSUM also set. This feature is being DEPRECATED (see below). |
| NETIF_F_RXCSUM | Driver (device) performs receive checksum offload. This flag is only used to disable the RX checksum feature for a device. The stack will accept receive checksum indication in packets received on a device regardless of whether NETIF_F_RXCSUM is set. |
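The transmit-side rules in the table can be condensed into a small decision function. The sketch below is illustrative only: the F_* bit values, the pkt_desc structure and both function names are hypothetical and are not the kernel's definitions. It encodes only what the table states: the generic feature covers any packet, the two deprecated features cover plain TCP or UDP over one IP version, and they cannot be combined with the generic one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the feature flags. */
#define F_HW_CSUM   (1u << 0) /* any protocol combination */
#define F_IP_CSUM   (1u << 1) /* plain TCP/UDP over IPv4 only */
#define F_IPV6_CSUM (1u << 2) /* plain TCP/UDP over IPv6 only */

/* Hypothetical summary of a packet's layering. */
struct pkt_desc {
	int  ip_version;     /* 4 or 6 */
	bool l4_tcp_or_udp;  /* transport protocol is TCP or UDP */
	bool encapsulated;   /* packet is tunneled */
	bool ipv6_ext_hdrs;  /* IPv6 extension headers present */
};

/* The per-protocol features cannot be set together with F_HW_CSUM. */
static bool features_valid(unsigned int f)
{
	return !((f & F_HW_CSUM) && (f & (F_IP_CSUM | F_IPV6_CSUM)));
}

/* Can a device with feature set f checksum packet p by itself? */
static bool device_can_checksum(unsigned int f, const struct pkt_desc *p)
{
	assert(features_valid(f));

	if (f & F_HW_CSUM)
		return true;
	if (p->encapsulated || !p->l4_tcp_or_udp)
		return false;
	if (p->ip_version == 4)
		return (f & F_IP_CSUM) != 0; /* IP options are allowed */
	if (p->ip_version == 6)
		return (f & F_IPV6_CSUM) != 0 && !p->ipv6_ext_hdrs;
	return false;
}
```

When such a check fails for a packet, the checksum has to be computed in software before the packet is handed to the device.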
Checksumming of received packets by device
Indication of checksum verification is set in sk_buff.ip_summed. Possible values are:
CHECKSUM_NONE
Device did not checksum this packet e.g. due to lack of capabilities. The packet contains full (though not verified) checksum in packet but not in skb->csum. Thus, skb->csum is undefined in this case.
CHECKSUM_UNNECESSARY
The hardware you’re dealing with doesn’t calculate the full checksum (as in CHECKSUM_COMPLETE), but it does parse headers and verify checksums for specific protocols. For such packets it will set CHECKSUM_UNNECESSARY if their checksums are okay. sk_buff.csum is still undefined in this case though. A driver or device must never modify the checksum field in the packet even if checksum is verified.
CHECKSUM_UNNECESSARY is applicable to the following protocols:
TCP: IPv6 and IPv4.
UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a zero UDP checksum for either IPv4 or IPv6, the networking stack may perform further validation in this case.
GRE: only if the checksum is present in the header.
SCTP: indicates the CRC in SCTP header has been validated.
FCOE: indicates the CRC in FC frame has been validated.
sk_buff.csum_level indicates the number of consecutive checksums found in the packet minus one that have been verified as CHECKSUM_UNNECESSARY. For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet and a device is able to verify the checksums for UDP (possibly zero), GRE (checksum flag is set) and TCP, sk_buff.csum_level would be set to two. If the device were only able to verify the UDP checksum and not GRE, either because it doesn’t support GRE checksum or because GRE checksum is bad, skb->csum_level would be set to zero (TCP checksum is not considered in this case).

CHECKSUM_COMPLETE
This is the most generic way. The device supplies the checksum of the _whole_ packet as seen by netif_rx() and fills in sk_buff.csum. This means the hardware doesn’t need to parse L3/L4 headers to implement this.
Notes:
Even if device supports only some protocols, but is able to produce skb->csum, it MUST use CHECKSUM_COMPLETE, not CHECKSUM_UNNECESSARY.
CHECKSUM_COMPLETE is not applicable to SCTP and FCoE protocols.
CHECKSUM_PARTIAL
A checksum is set up to be offloaded to a device as described in the output description for CHECKSUM_PARTIAL. This may occur on a packet received directly from another Linux OS, e.g., a virtualized Linux kernel on the same host, or it may be set in the input path in GRO or remote checksum offload. For the purposes of checksum verification, the checksum referred to by skb->csum_start + skb->csum_offset and any preceding checksums in the packet are considered verified. Any checksums in the packet that are after the checksum being offloaded are not considered to be verified.
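The csum_level rule for CHECKSUM_UNNECESSARY (consecutive verified checksums, minus one) can be made concrete with a short sketch. The enum and both function names are hypothetical. The input is assumed to be the list of checksums present in the packet, outermost first, each marked with whether the device was able to verify it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-checksum outcome, listed outermost first. */
enum csum_result {
	CSUM_VERIFIED,
	CSUM_BAD_OR_UNSUPPORTED
};

/* Count the consecutive verified checksums starting at the outermost
 * one. Anything after the first failure is not considered. */
static size_t verified_run(const enum csum_result *r, size_t n)
{
	size_t i = 0;

	while (i < n && r[i] == CSUM_VERIFIED)
		i++;
	return i;
}

/* csum_level is the run length minus one. It is only meaningful when
 * at least one checksum was verified. */
static unsigned int csum_level_for_run(size_t run)
{
	assert(run >= 1);
	return (unsigned int)(run - 1);
}
```

For the IPv6->UDP->GRE->IPv4->TCP example, verifying UDP, GRE and TCP gives a run of three and a level of two. If GRE cannot be verified the run stops at UDP and the level is zero, regardless of the TCP result.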
Checksumming on transmit for non-GSO
The stack requests checksum offload in the sk_buff.ip_summed for a packet. Values are:
CHECKSUM_PARTIAL
The driver is required to checksum the packet as seen by hard_start_xmit() from sk_buff.csum_start up to the end, and to record/write the checksum at offset sk_buff.csum_start + sk_buff.csum_offset. A driver may verify that the csum_start and csum_offset values are valid values given the length and offset of the packet, but it should not attempt to validate that the checksum refers to a legitimate transport layer checksum -- it is the purview of the stack to validate that csum_start and csum_offset are set correctly.
When the stack requests checksum offload for a packet, the driver MUST ensure that the checksum is set correctly. A driver can either offload the checksum calculation to the device, or call skb_checksum_help() (in the case that the device does not support offload for a particular checksum).
NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate checksum offload capability. skb_csum_hwoffload_help() can be called to resolve CHECKSUM_PARTIAL based on network device checksumming capabilities: if a packet does not match them, skb_checksum_help() or skb_crc32c_help() (depending on the value of sk_buff.csum_not_inet, see Non-IP checksum (CRC) offloads) is called to resolve the checksum.

CHECKSUM_NONE
The skb was already checksummed by the protocol, or a checksum is not required.
CHECKSUM_UNNECESSARY
This has the same meaning as CHECKSUM_NONE for checksum offload on output.
CHECKSUM_COMPLETE
Not used in checksum output. If a driver observes a packet with this value set in skbuff, it should treat the packet as if CHECKSUM_NONE were set.
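What resolving CHECKSUM_PARTIAL in software amounts to can be sketched on a flat byte buffer. This is a userspace illustration, not the kernel's skb_checksum_help(): the function names are hypothetical, the packet is assumed to be linear, and csum_offset is assumed to be even. The sketch also sums whatever the checksum field already contains, on the assumption that the stack has pre-seeded it as needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of a byte range, folded to 16 bits. */
static uint16_t ones_sum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)p[0] << 8; /* pad the odd trailing byte */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Checksum from csum_start to the end of the packet and write the
 * result at csum_start + csum_offset. */
static void resolve_partial(uint8_t *pkt, size_t len,
			    size_t csum_start, size_t csum_offset)
{
	size_t field = csum_start + csum_offset;
	uint16_t csum;

	/* The kind of range check a driver may perform. The even-offset
	 * requirement is a simplification of this sketch. */
	assert(csum_start <= len && field + 2 <= len);
	assert((csum_offset & 1) == 0);

	csum = (uint16_t)~ones_sum(pkt + csum_start, len - csum_start);
	pkt[field] = (uint8_t)(csum >> 8);
	pkt[field + 1] = (uint8_t)(csum & 0xff);
}
```

After resolve_partial() runs, summing the same range again with ones_sum() yields 0xffff, which is the usual way a receiver confirms a one's complement checksum. Bytes before csum_start are left untouched.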
Non-IP checksum (CRC) offloads
| NETIF_F_SCTP_CRC | This feature indicates that a device is capable of offloading the SCTP CRC in a packet. To perform this offload the stack will set csum_start and csum_offset accordingly, set ip_summed to |
| NETIF_F_FCOE_CRC | This feature indicates that a device is capable of offloading the FCOE CRC in a packet. To perform this offload the stack will set ip_summed to |
Checksumming on output with GSO
In the case of a GSO packet (skb_is_gso() is true), checksum offload is implied by the SKB_GSO_* flags in gso_type. Most obviously, if the gso_type is SKB_GSO_TCPV4 or SKB_GSO_TCPV6, TCP checksum offload as part of the GSO operation is implied. If a checksum is being offloaded with GSO then ip_summed is CHECKSUM_PARTIAL, and both csum_start and csum_offset are set to refer to the outermost checksum being offloaded (two offloaded checksums are possible with UDP encapsulation).