Kernel TLS offload
Kernel TLS operation
Linux kernel provides TLS connection offload infrastructure. Once a TCP connection is in ESTABLISHED state user space can enable the TLS Upper Layer Protocol (ULP) and install the cryptographic connection state. For details regarding the user-facing interface refer to the TLS documentation in Documentation/networking/tls.rst.
ktls can operate in three modes:
- Software crypto mode (TLS_SW) - CPU handles the cryptography. In most basic cases only crypto operations synchronous with the CPU can be used, but depending on calling context CPU may utilize asynchronous crypto accelerators. The use of accelerators introduces extra latency on socket reads (decryption only starts when a read syscall is made) and additional I/O load on the system.
- Packet-based NIC offload mode (TLS_HW) - the NIC handles crypto on a packet by packet basis, provided the packets arrive in order. This mode integrates best with the kernel stack and is described in detail in the remaining part of this document (ethtool flags tls-hw-tx-offload and tls-hw-rx-offload).
- Full TCP NIC offload mode (TLS_HW_RECORD) - mode of operation where NIC driver and firmware replace the kernel networking stack with their own TCP handling. It is not usable in production environments that make use of the Linux networking stack, for example any firewalling abilities or QoS and packet scheduling (ethtool flag tls-hw-record).
The operation mode is selected automatically based on device configuration; offload opt-in or opt-out on a per-connection basis is not currently supported.
TX
At a high level user write requests are turned into a scatter list, the TLS ULP intercepts them, inserts record framing, performs encryption (in TLS_SW mode) and then hands the modified scatter list to the TCP layer. From this point on the TCP stack proceeds as normal.
In TLS_HW mode the encryption is not performed in the TLS ULP. Instead packets reach a device driver, the driver will mark the packets for crypto offload based on the socket the packet is attached to, and send them to the device for encryption and transmission.
RX
On the receive side if the device handled decryption and authentication successfully, the driver will set the decrypted bit in the associated struct sk_buff. The packets reach the TCP stack and are handled normally. ktls is informed when data is queued to the socket and the strparser mechanism is used to delineate the records. Upon read request, records are retrieved from the socket and passed to decryption routine. If device decrypted all the segments of the record the decryption is skipped, otherwise software path handles decryption.
Layers of Kernel TLS stack
Device configuration
During driver initialization device sets the NETIF_F_HW_TLS_RX and NETIF_F_HW_TLS_TX features and installs its struct tlsdev_ops pointer in the tlsdev_ops member of the struct net_device.
When TLS cryptographic connection state is installed on a ktls socket (note that it is done twice, once for RX and once for TX direction, and the two are completely independent), the kernel checks if the underlying network device is offload-capable and attempts the offload. In case offload fails the connection is handled entirely in software using the same mechanism as if the offload was never tried.
Offload request is performed via the tls_dev_add callback of struct tlsdev_ops:
int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                   enum tls_offload_ctx_dir direction,
                   struct tls_crypto_info *crypto_info,
                   u32 start_offload_tcp_sn);
direction indicates whether the cryptographic information is for the received or transmitted packets. Driver uses the sk parameter to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6). Cryptographic information in crypto_info includes the key, iv, salt as well as TLS record sequence number. start_offload_tcp_sn indicates which TCP sequence number corresponds to the beginning of the record with sequence number from crypto_info. The driver can add its state at the end of kernel structures (see driver_state members in include/net/tls.h) to avoid additional allocations and pointer dereferences.
TX
After TX state is installed, the stack guarantees that the first segment of the stream will start exactly at the start_offload_tcp_sn sequence number, simplifying TCP sequence number matching.
TX offload being fully initialized does not imply that all segments passing through the driver and which belong to the offloaded socket will be after the expected sequence number and will have kernel record information. In particular, already encrypted data may have been queued to the socket before installing the connection state in the kernel.
RX
In RX direction local networking stack has little control over the segmentation, so the initial record’s TCP sequence number may be anywhere inside the segment.
Normal operation
At the minimum the device maintains the following state for each connection, in each direction:
- crypto secrets (key, iv, salt)
- crypto processing state (partial blocks, partial authentication tag, etc.)
- record metadata (sequence number, processing offset and length)
- expected TCP sequence number
There are no guarantees on record length or record segmentation. In particular segments may start at any point of a record and contain any number of records. Assuming segments are received in order, the device should be able to perform crypto operations and authentication regardless of segmentation. For this to be possible the device has to keep a small amount of segment-to-segment state. This includes at least:
- partial headers (if a segment carried only a part of the TLS header)
- partial data block
- partial authentication tag (all data had been seen but part of the authentication tag has to be written or read from the subsequent segment)
Record reassembly is not necessary for TLS offload. If the packets arrive in order the device should be able to handle them separately and make forward progress.
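The per-connection, per-direction state listed above could be laid out roughly as follows. This is only an illustrative sketch: the struct, its field names and the helper hyp_tls_dir_state_init() are hypothetical, the sizes assume AES-GCM-128 with a 16-byte authentication tag, and a real device would keep this state in its own hardware-specific form.

```c
#include <stdint.h>
#include <string.h>

/* All names and sizes below are assumptions made for illustration. */
#define HYP_TLS_HDR_LEN  5   /* TLS record header: type, version, length */
#define HYP_KEY_LEN     16   /* assumes AES-GCM-128 */
#define HYP_IV_LEN       8
#define HYP_SALT_LEN     4
#define HYP_BLOCK_LEN   16
#define HYP_TAG_LEN     16

struct hyp_tls_dir_state {
	/* crypto secrets */
	uint8_t  key[HYP_KEY_LEN];
	uint8_t  iv[HYP_IV_LEN];
	uint8_t  salt[HYP_SALT_LEN];

	/* record metadata */
	uint64_t rec_seq;          /* TLS record sequence number */
	uint32_t rec_offset;       /* bytes of the current record processed */
	uint32_t rec_len;          /* length of the current record */

	/* expected TCP sequence number of the next in-order segment */
	uint32_t expected_tcp_seq;

	/* segment-to-segment state */
	uint8_t  partial_hdr[HYP_TLS_HDR_LEN];
	uint8_t  partial_hdr_len;
	uint8_t  partial_block[HYP_BLOCK_LEN];
	uint8_t  partial_block_len;
	uint8_t  partial_tag[HYP_TAG_LEN];
	uint8_t  partial_tag_len;
};

/* Reset all partial state and install the secrets for one direction. */
void hyp_tls_dir_state_init(struct hyp_tls_dir_state *st,
			    const uint8_t *key, const uint8_t *iv,
			    const uint8_t *salt, uint64_t rec_seq,
			    uint32_t start_tcp_seq)
{
	memset(st, 0, sizeof(*st));
	memcpy(st->key, key, HYP_KEY_LEN);
	memcpy(st->iv, iv, HYP_IV_LEN);
	memcpy(st->salt, salt, HYP_SALT_LEN);
	st->rec_seq = rec_seq;
	st->expected_tcp_seq = start_tcp_seq;
}
```

The arguments of the init helper mirror what tls_dev_add receives: the secrets and record sequence number from crypto_info, and start_offload_tcp_sn as the first expected TCP sequence number.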
TX
The kernel stack performs record framing reserving space for the authentication tag and populating all other TLS header and trailer fields.
Both the device and the driver maintain expected TCP sequence numbers due to the possibility of retransmissions and the lack of software fallback once the packet reaches the device. For segments passed in order, the driver marks the packets with a connection identifier (note that a 5-tuple lookup is insufficient to identify packets requiring HW offload, see the 5-tuple matching limitations section) and hands them to the device. The device identifies the packet as requiring TLS handling and confirms the sequence number matches its expectation. The device performs encryption and authentication of the record data. It replaces the authentication tag and TCP checksum with correct values.
RX
Before a packet is DMAed to the host (but after NIC’s embedded switching and packet transformation functions) the device validates the Layer 4 checksum and performs a 5-tuple lookup to find any TLS connection the packet may belong to (technically a 4-tuple lookup is sufficient - IP addresses and TCP port numbers, as the protocol is always TCP). If connection is matched device confirms if the TCP sequence number is the expected one and proceeds to TLS handling (record delineation, decryption, authentication for each record in the packet). The device leaves the record framing unmodified, the stack takes care of record decapsulation. Device indicates successful handling of TLS offload in the per-packet context (descriptor) passed to the host.
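The RX admission steps described above (checksum validation, 4-tuple match, expected sequence number check) can be sketched as follows. The types and function names are hypothetical, the key is IPv4-only for brevity, and the linear search stands in for whatever lookup structure a real device uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4-only connection key; the protocol is always TCP. */
struct hyp_flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Hypothetical entry of the device's RX connection table. */
struct hyp_rx_conn {
	struct hyp_flow_key key;
	uint32_t expected_tcp_seq;
	bool in_use;
};

bool hyp_flow_key_equal(const struct hyp_flow_key *a,
			const struct hyp_flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Linear search for illustration; returns NULL when nothing matches. */
struct hyp_rx_conn *hyp_rx_lookup(struct hyp_rx_conn *table, size_t n,
				  const struct hyp_flow_key *key)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].in_use && hyp_flow_key_equal(&table[i].key, key))
			return &table[i];
	return NULL;
}

/* True only if the packet may proceed to TLS handling on the device. */
bool hyp_rx_should_decrypt(struct hyp_rx_conn *table, size_t n,
			   const struct hyp_flow_key *key,
			   uint32_t tcp_seq, bool l4_csum_ok)
{
	struct hyp_rx_conn *conn;

	if (!l4_csum_ok)
		return false;	/* bad checksum: pass as received */
	conn = hyp_rx_lookup(table, n, key);
	if (!conn)
		return false;	/* not an offloaded connection */
	return tcp_seq == conn->expected_tcp_seq;
}
```

In every case where this returns false the packet would still be delivered to the host unmodified; only the TLS processing is skipped.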
Upon reception of a TLS offloaded packet, the driver sets the decrypted mark in struct sk_buff corresponding to the segment. Networking stack makes sure decrypted and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer) and takes care of partial decryption.
Resync handling
In presence of packet drops or network packet reordering, the device may lose synchronization with the TLS stream, and require a resync with the kernel’s TCP stack.
Note that resync is only attempted for connections which were successfully added to the device table and are in TLS_HW mode. For example, if the table was full when cryptographic state was installed in the kernel, such connection will never get offloaded. Therefore the resync request does not carry any cryptographic connection state.
TX
Segments transmitted from an offloaded socket can get out of sync in similar ways to the receive side: retransmissions and local drops are possible, though network reorders are not. There are currently two mechanisms for dealing with out of order segments.
Crypto state rebuilding
Whenever an out of order segment is transmitted the driver provides the device with enough information to perform cryptographic operations. This means most likely that the part of the record preceding the current segment has to be passed to the device as part of the packet context, together with its TCP sequence number and TLS record number. The device can then initialize its crypto state, process and discard the preceding data (to be able to insert the authentication tag) and move onto handling the actual packet.
In this mode depending on the implementation the driver can either ask for a continuation with the crypto state and the new sequence number (next expected segment is the one after the out of order one), or continue with the previous stream state - assuming that the out of order segment was just a retransmission. The former is simpler, and does not require retransmission detection therefore it is the recommended method until such time it is proven inefficient.
Next record sync
Whenever an out of order segment is detected the driver requests that the ktls software fallback code encrypt it. If the segment’s sequence number is lower than expected the driver assumes retransmission and doesn’t change device state. If the segment is in the future, it may imply a local drop, the driver asks the stack to sync the device to the next record state and falls back to software.
Resync request is indicated with:
void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)
Until resync is complete driver should not access its expected TCP sequence number (as it will be updated from a different context). Following helper should be used to test if resync is complete:
bool tls_offload_tx_resync_pending(struct sock *sk)
Next time ktls pushes a record it will first send its TCP sequence number and TLS record number to the driver. Stack will also make sure that the new record will start on a segment boundary (like it does when the connection is initially added).
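The decision the driver makes in next record sync mode can be sketched as a three-way classification of the segment's TCP sequence number against the expected one. The enum and function names are hypothetical; the comparison uses a signed 32-bit difference so that it keeps working across sequence number wraparound.

```c
#include <stdint.h>

/* Hypothetical classification of a TX segment against the expected seq. */
enum hyp_tx_seq_class {
	HYP_TX_IN_ORDER,	/* hand to the device for encryption */
	HYP_TX_RETRANSMIT,	/* software fallback, device state untouched */
	HYP_TX_FUTURE,		/* software fallback and request a resync */
};

enum hyp_tx_seq_class hyp_tx_classify(uint32_t seg_seq, uint32_t expected_seq)
{
	/*
	 * Signed difference modulo 2^32, relying on the usual two's
	 * complement conversion: negative means "before expected".
	 */
	int32_t delta = (int32_t)(seg_seq - expected_seq);

	if (delta == 0)
		return HYP_TX_IN_ORDER;
	if (delta < 0)
		return HYP_TX_RETRANSMIT;
	return HYP_TX_FUTURE;
}
```

In the HYP_TX_FUTURE case a driver would call tls_offload_tx_resync_request() with the observed and expected sequence numbers, and then avoid touching its expected sequence number while tls_offload_tx_resync_pending() reports that the resync is still in progress.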
RX
A small amount of RX reorder events may not require a full resynchronization. In particular the device should not lose synchronization when record boundary can be recovered:
Reorder of non-header segment
Green segments are successfully decrypted, blue ones are passed as received on wire, red stripes mark start of new records.
In above case segment 1 is received and decrypted successfully. Segment 2 was dropped so 3 arrives out of order. The device knows the next record starts inside 3, based on record length in segment 1. Segment 3 is passed untouched, because due to lack of data from segment 2 the remainder of the previous record inside segment 3 cannot be handled. The device can, however, collect the authentication algorithm’s state and partial block from the new record in segment 3 and when 4 and 5 arrive continue decryption. Finally when 2 arrives it’s completely outside of expected window of the device so it’s passed as is without special handling. ktls software fallback handles the decryption of record spanning segments 1, 2 and 3. The device did not get out of sync, even though two segments did not get decrypted.
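The reason the device stays in sync in this scenario is that the start of the next record follows from the header it has already seen. A minimal sketch of that arithmetic, with hypothetical helper names and assuming the standard 5-byte TLS record header:

```c
#include <stdint.h>

#define HYP_TLS_HDR_LEN 5	/* type (1) + version (2) + length (2) */

/* TCP sequence number at which the record after this one begins. */
uint32_t hyp_next_record_seq(uint32_t rec_start_seq, uint16_t rec_len_field)
{
	return rec_start_seq + HYP_TLS_HDR_LEN + rec_len_field;
}

/*
 * Offset of the next record header inside a segment, or -1 if the
 * header does not start within [seg_seq, seg_seq + seg_len).
 * Unsigned subtraction keeps the check valid across seq wraparound.
 */
int32_t hyp_record_start_in_segment(uint32_t next_rec_seq,
				    uint32_t seg_seq, uint32_t seg_len)
{
	uint32_t off = next_rec_seq - seg_seq;

	return off < seg_len ? (int32_t)off : -1;
}
```

For example, a record starting at sequence number 1000 with a length field of 3000 ends at 4005; if segments carry 1448 bytes each, the third segment (starting at 3896) contains the next header at offset 109, so the device can restart processing there even though the second segment never arrived.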
Kernel synchronization may be necessary if the lost segment contained a record header and arrived after the next record header has already passed:
Reorder of segment with a TLS header
In this example segment 2 gets dropped, and it contains a record header. Device can only detect that segment 4 also contains a TLS header if it knows the length of the previous record from segment 2. In this case the device will lose synchronization with the stream.
Stream scan resynchronization
When the device gets out of sync and the stream reaches TCP sequence numbers more than a max size record past the expected TCP sequence number, the device starts scanning for a known header pattern. For example for TLS 1.2 and TLS 1.3 subsequent bytes of value 0x03 0x03 occur in the SSL/TLS version field of the header. Once pattern is matched the device continues attempting parsing headers at expected locations (based on the length fields at guessed locations). Whenever the expected location does not contain a valid header the scan is restarted.
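The scan described above can be sketched in software as follows. This is an illustration only, not how any particular device implements it: the function names are hypothetical, the content type range 20 to 23 and the length bound (the TLS 1.2 ciphertext limit of 2^14 + 2048 bytes, which also covers TLS 1.3) are assumptions chosen to make the plausibility check concrete, and a real device works on a stream of segments rather than one flat buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define HYP_TLS_HDR_LEN     5
#define HYP_TLS_MAX_REC_LEN (16384 + 2048)	/* assumed upper bound */

/* Returns 1 if the 5 bytes at p look like a TLS 1.2/1.3 record header. */
int hyp_tls_hdr_plausible(const uint8_t *p)
{
	uint16_t len = (uint16_t)((p[3] << 8) | p[4]);

	if (p[0] < 20 || p[0] > 23)		/* record content type */
		return 0;
	if (p[1] != 0x03 || p[2] != 0x03)	/* version field */
		return 0;
	return len > 0 && len <= HYP_TLS_MAX_REC_LEN;
}

/* Offset of the first plausible header at or after 'from', or -1. */
long hyp_tls_scan(const uint8_t *buf, size_t len, size_t from)
{
	for (size_t i = from; i + HYP_TLS_HDR_LEN <= len; i++)
		if (hyp_tls_hdr_plausible(buf + i))
			return (long)i;
	return -1;
}

/*
 * Starting at a guessed header, follow the length fields and count how
 * many consecutive plausible headers are found within the buffer.
 */
size_t hyp_tls_follow(const uint8_t *buf, size_t len, size_t guess)
{
	size_t n = 0, off = guess;

	while (off + HYP_TLS_HDR_LEN <= len && hyp_tls_hdr_plausible(buf + off)) {
		n++;
		off += HYP_TLS_HDR_LEN +
		       (((size_t)buf[off + 3] << 8) | buf[off + 4]);
	}
	return n;
}
```

If following the length fields lands on bytes that do not form a plausible header, the guess was wrong and the scan would be restarted from the byte after the failed guess.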
When the header is matched the device sends a confirmation request to the kernel, asking if the guessed location is correct (if a TLS record really starts there), and which record sequence number the given header had. The kernel confirms the guessed location was correct and tells the device the record sequence number. Meanwhile, the device had been parsing and counting all records since the just-confirmed one, it adds the number of records it had seen to the record number provided by the kernel. At this point the device is in sync and can resume decryption at next segment boundary.
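The record number bookkeeping after a confirmation is plain addition. The sketch below uses a hypothetical tracking structure and assumes that the device's counter includes the confirmed record itself, so the sum is the number of the next record the device expects; an implementation that counts differently would need to adjust by one.

```c
#include <stdint.h>

/* Hypothetical state kept while waiting for the kernel's confirmation. */
struct hyp_resync_track {
	uint32_t guess_tcp_seq;	/* TCP seq of the header asked about */
	uint64_t recs_seen;	/* records parsed from that header onward,
				 * including the one being confirmed */
};

/* Record number the device should use for the next record it processes. */
uint64_t hyp_resync_next_rec_seq(const struct hyp_resync_track *t,
				 uint64_t confirmed_rec_seq)
{
	return confirmed_rec_seq + t->recs_seen;
}
```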
In a pathological case the device may latch onto a sequence of matching headers and never hear back from the kernel (there is no negative confirmation from the kernel). The implementation may choose to periodically restart scan. Given how unlikely falsely-matching stream is, however, periodic restart is not deemed necessary.
Special care has to be taken if the confirmation request is passed asynchronously to the packet stream and record may get processed by the kernel before the confirmation request.
Stack-driven resynchronization
The driver may also request the stack to perform resynchronization whenever it sees the records are no longer getting decrypted. If the connection is configured in this mode the stack automatically schedules resynchronization after it has received two completely encrypted records.
The stack waits for the socket to drain and informs the device about the next expected record number and its TCP sequence number. If the records continue to be received fully encrypted stack retries the synchronization with an exponential back off (first after 2 encrypted records, then after 4 records, after 8, after 16… up until every 128 records).
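The back off schedule can be sketched as a small counter. This is not the kernel's implementation: the names are hypothetical, and the sketch assumes that each threshold counts fully encrypted records since the previous attempt and that any record the device did decrypt resets the schedule.

```c
#include <stdint.h>

#define HYP_RESYNC_FIRST   2	/* first attempt after 2 encrypted records */
#define HYP_RESYNC_MAX   128	/* attempts at most every 128 records */

struct hyp_resync_backoff {
	uint32_t encrypted_recs;	/* since the previous attempt */
	uint32_t threshold;		/* records needed for next attempt */
};

void hyp_backoff_reset(struct hyp_resync_backoff *b)
{
	b->encrypted_recs = 0;
	b->threshold = HYP_RESYNC_FIRST;
}

/*
 * Call once per received record. Returns 1 when a resync attempt should
 * be scheduled, 0 otherwise.
 */
int hyp_backoff_on_record(struct hyp_resync_backoff *b, int fully_encrypted)
{
	if (!fully_encrypted) {
		hyp_backoff_reset(b);	/* assumed: device is back in sync */
		return 0;
	}
	if (++b->encrypted_recs < b->threshold)
		return 0;

	b->encrypted_recs = 0;
	if (b->threshold < HYP_RESYNC_MAX)
		b->threshold *= 2;	/* 2, 4, 8, ... capped at 128 */
	return 1;
}
```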
Error handling
TX
Packets may be redirected or rerouted by the stack to a different device than the selected TLS offload device. The stack will handle such condition using the sk_validate_xmit_skb() helper (TLS offload code installs tls_validate_xmit_skb() at this hook). Offload maintains information about all records until the data is fully acknowledged, so if skbs reach the wrong device they can be handled by software fallback.
Any device TLS offload handling error on the transmission side must result in the packet being dropped. For example if a packet got out of order due to a bug in the stack or the device, reached the device and can’t be encrypted such packet must be dropped.
RX
If the device encounters any problems with TLS offload on the receive side it should pass the packet to the host’s networking stack as it was received on the wire.
For example authentication failure for any record in the segment should result in passing the unmodified packet to the software fallback. This means packets should not be modified “in place”. Splitting segments to handle partial decryption is not advised. In other words either all records in the packet had been handled successfully and authenticated or the packet has to be passed to the host’s stack as it was on the wire (recovering original packet in the driver if device provides precise error is sufficient).
The Linux networking stack does not provide a way of reporting per-packet decryption and authentication errors, packets with errors must simply not have the decrypted mark set.
A packet should also not be handled by the TLS offload if it contains incorrect checksums.
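The all-or-nothing rule for the receive side can be summarized as a single predicate. The structure and its fields are hypothetical stand-ins for whatever a device reports in its per-packet descriptor.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-packet result reported by the device. */
struct hyp_rx_pkt_result {
	bool   csum_ok;		/* checksums of the packet were valid */
	size_t n_records;	/* records (or parts) carried by the packet */
	size_t n_records_ok;	/* of those, handled without any error */
};

/*
 * The decrypted mark may be set only if every record touched by the
 * packet was handled successfully; otherwise the packet must reach the
 * stack exactly as it was on the wire, with the mark left unset.
 */
bool hyp_rx_set_decrypted(const struct hyp_rx_pkt_result *r)
{
	return r->csum_ok && r->n_records > 0 &&
	       r->n_records_ok == r->n_records;
}
```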
Performance metrics
TLS offload can be characterized by the following basic metrics:
- max connection count
- connection installation rate
- connection installation latency
- total cryptographic performance
Note that each TCP connection requires a TLS session in both directions, so the performance may be reported treating each direction separately.
Max connection count
The number of connections the device can support can be exposed via the devlink resource API.
Total cryptographic performance
Offload performance may depend on segment and record size.
Overload of the cryptographic subsystem of the device should not have significant performance impact on non-offloaded streams.
Statistics
Following minimum set of TLS-related statistics should be reported by the driver:
- rx_tls_decrypted_packets - number of successfully decrypted RX packets which were part of a TLS stream.
- rx_tls_decrypted_bytes - number of TLS payload bytes in RX packets which were successfully decrypted.
- rx_tls_ctx - number of TLS RX HW offload contexts added to device for decryption.
- rx_tls_del - number of TLS RX HW offload contexts deleted from device (connection has finished).
- rx_tls_resync_req_pkt - number of received TLS packets with a resync request.
- rx_tls_resync_req_start - number of times the TLS async resync request was started.
- rx_tls_resync_req_end - number of times the TLS async resync request properly ended with providing the HW tracked tcp-seq.
- rx_tls_resync_req_skip - number of times the TLS async resync request procedure was started but not properly ended.
- rx_tls_resync_res_ok - number of times the TLS resync response call to the driver was successfully handled.
- rx_tls_resync_res_skip - number of times the TLS resync response call to the driver was terminated unsuccessfully.
- rx_tls_err - number of RX packets which were part of a TLS stream but were not decrypted due to unexpected error in the state machine.
- tx_tls_encrypted_packets - number of TX packets passed to the device for encryption of their TLS payload.
- tx_tls_encrypted_bytes - number of TLS payload bytes in TX packets passed to the device for encryption.
- tx_tls_ctx - number of TLS TX HW offload contexts added to device for encryption.
- tx_tls_ooo - number of TX packets which were part of a TLS stream but did not arrive in the expected order.
- tx_tls_skip_no_sync_data - number of TX packets which were part of a TLS stream and arrived out-of-order, but skipped the HW offload routine and went to the regular transmit flow as they were retransmissions of the connection handshake.
- tx_tls_drop_no_sync_data - number of TX packets which were part of a TLS stream dropped, because they arrived out of order and associated record could not be found.
- tx_tls_drop_bypass_req - number of TX packets which were part of a TLS stream dropped, because they contain both data that has been encrypted by software and data that expects hardware crypto offload.
Notable corner cases, exceptions and additional requirements
5-tuple matching limitations
The device can only recognize received packets based on the 5-tuple of the socket. Current ktls implementation will not offload sockets routed through software interfaces such as those used for tunneling or virtual networking. However, many packet transformations performed by the networking stack (most notably any BPF logic) do not require any intermediate software device, therefore a 5-tuple match may consistently miss at the device level. In such cases the device should still be able to perform TX offload (encryption) and should fall back cleanly to software decryption (RX).
Out of order
Introducing extra processing in NICs should not cause packets to be transmitted or received out of order, for example pure ACK packets should not be reordered with respect to data segments.
Ingress reorder
A device is permitted to perform packet reordering for consecutive TCP segments (i.e. placing packets in the correct order) but any form of additional buffering is disallowed.
Coexistence with standard networking offload features
Offloaded ktls sockets should support standard TCP stack features transparently. Enabling device TLS offload should not cause any difference in packets as seen on the wire.
Transport layer transparency
The device should not modify any packet headers for the purpose of simplifying TLS offload.
The device should not depend on any packet headers beyond what is strictly necessary for TLS offload.
Segment drops
Dropping packets is acceptable only in the event of catastrophic system errors and should never be used as an error handling mechanism in cases arising from normal operation. In other words, reliance on TCP retransmissions to handle corner cases is not acceptable.
TLS device features
Drivers should ignore changes to the TLS device feature flags. These flags will be acted upon accordingly by the core ktls code. TLS device feature flags only control adding of new TLS connection offloads, old connections will remain active after flags are cleared.