RFC 9002 | QUIC Loss Detection | May 2021 |
Iyengar & Swett | Standards Track | [Page] |
This document describes loss detection and congestion control mechanisms forQUIC.¶
This is an Internet Standards Track document.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.¶
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained athttps://www.rfc-editor.org/info/rfc9002.¶
Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.¶
QUIC is a secure, general-purpose transport protocol, described in[QUIC-TRANSPORT]. This document describes loss detection and congestioncontrol mechanisms for QUIC.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD","SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in thisdocument are to be interpreted as described in BCP 14[RFC2119][RFC8174]when, and only when, they appear in all capitals, as shown here.¶
Definitions of terms that are used in this document:¶
All frames other than ACK, PADDING, and CONNECTION_CLOSE are consideredack-eliciting.¶
Packets that contain ack-eliciting frames elicit an ACK from the receiverwithin the maximum acknowledgment delay and are called ack-eliciting packets.¶
Packets are considered in flight when they are ack-eliciting or contain aPADDING frame, and they have been sent but are not acknowledged, declaredlost, or discarded along with old keys.¶
All transmissions in QUIC are sent with a packet-level header, which indicatesthe encryption level and includes a packet sequence number (referred to below asa packet number). The encryption level indicates the packet number space, asdescribed inSection 12.3 of [QUIC-TRANSPORT]. Packet numbers never repeatwithin a packet number space for the lifetime of a connection. Packet numbersare sent in monotonically increasing order within a space, preventing ambiguity.It is permitted for some packet numbers to never be used, leaving intentionalgaps.¶
This design obviates the need for disambiguating between transmissions andretransmissions; this eliminates significant complexity from QUIC'sinterpretation of TCP loss detection mechanisms.¶
QUIC packets can contain multiple frames of different types. The recoverymechanisms ensure that data and frames that need reliable delivery areacknowledged or declared lost and sent in new packets as necessary. The typesof frames contained in a packet affect recovery and congestion control logic:¶
Readers familiar with TCP's loss detection and congestion control will findalgorithms here that parallel well-known TCP ones. However, protocol differencesbetween QUIC and TCP contribute to algorithmic differences. These protocoldifferences are briefly described below.¶
QUIC uses separate packet number spaces for each encryption level,except 0-RTT and all generations of 1-RTT keys use the same packetnumber space. Separate packet number spaces ensures that theacknowledgment of packets sent with one level of encryption will notcause spurious retransmission of packets sent with a differentencryption level. Congestion control and round-trip time (RTT)measurement are unified across packet number spaces.¶
TCP conflates transmission order at the sender with delivery order at thereceiver, resulting in the retransmission ambiguity problem[RETRANSMISSION]. QUIC separates transmission order from delivery order:packet numbers indicate transmission order, and delivery order is determined bythe stream offsets in STREAM frames.¶
QUIC's packet number is strictly increasing within a packet number spaceand directly encodes transmission order. A higher packet number signifiesthat the packet was sent later, and a lower packet number signifies thatthe packet was sent earlier. When a packet containing ack-elicitingframes is detected lost, QUIC includes necessary frames in a new packetwith a new packet number, removing ambiguity about which packet isacknowledged when an ACK is received. Consequently, more accurate RTTmeasurements can be made, spurious retransmissions are trivially detected, andmechanisms such as Fast Retransmit can be applied universally, based only onpacket number.¶
This design point significantly simplifies loss detection mechanisms for QUIC.Most TCP mechanisms implicitly attempt to infer transmission ordering based onTCP sequence numbers -- a nontrivial task, especially when TCP timestamps arenot available.¶
QUIC starts a loss epoch when a packet is lost. The loss epoch ends when anypacket sent after the start of the epoch is acknowledged. TCP waits for the gapin the sequence number space to be filled, and so if a segment is lost multipletimes in a row, the loss epoch may not end for several round trips. Because bothshould reduce their congestion windows only once per epoch, QUIC will do it oncefor every round trip that experiences loss, while TCP may only do it once acrossmultiple round trips.¶
QUIC ACK frames contain information similar to that in TCP SelectiveAcknowledgments (SACKs)[RFC2018]. However, QUIC does not allow a packetacknowledgment to be reneged, greatly simplifying implementations on both sidesand reducing memory pressure on the sender.¶
QUIC supports many ACK ranges, as opposed to TCP's three SACK ranges. Inhigh-loss environments, this speeds recovery, reduces spurious retransmits, andensures forward progress without relying on timeouts.¶
QUIC endpoints measure the delay incurred between when a packet is received andwhen the corresponding acknowledgment is sent, allowing a peer to maintain amore accurate RTT estimate; seeSection 13.2 of [QUIC-TRANSPORT].¶
QUIC uses a probe timeout (PTO; seeSection 6.2), with a timer based on TCP'sretransmission timeout (RTO) computation; see[RFC6298]. QUIC's PTO includesthe peer's maximum expected acknowledgment delay instead of using a fixedminimum timeout.¶
Similar to the RACK-TLP loss detection algorithm for TCP[RFC8985], QUIC doesnot collapse the congestion window when the PTO expires, since a single packetloss at the tail does not indicate persistent congestion. Instead, QUICcollapses the congestion window when persistent congestion is declared; seeSection 7.6. In doing this, QUIC avoids unnecessary congestionwindow reductions, obviating the need for correcting mechanisms such as ForwardRTO-Recovery (F-RTO)[RFC5682]. Since QUIC does not collapse the congestionwindow on a PTO expiration, a QUIC sender is not limited from sending morein-flight packets after a PTO expiration if it still has available congestionwindow. This occurs when a sender is application limited and the PTO timerexpires. This is more aggressive than TCP's RTO mechanism when applicationlimited, but identical when not application limited.¶
QUIC allows probe packets to temporarily exceed the congestion window wheneverthe timer expires.¶
TCP uses a minimum congestion window of one packet. However, loss of that singlepacket means that the sender needs to wait for a PTO to recover (Section 6.2), whichcan be much longer than an RTT. Sending a single ack-eliciting packet alsoincreases the chances of incurring additional latency when a receiver delays itsacknowledgment.¶
QUIC therefore recommends that the minimum congestion window be twopackets. While this increases network load, it is considered safe since thesender will still reduce its sending rate exponentially under persistentcongestion (Section 6.2).¶
TCP treats the loss of SYN or SYN-ACK packet as persistent congestion andreduces the congestion window to one packet; see[RFC5681]. QUIC treats lossof a packet containing handshake data the same as other losses.¶
At a high level, an endpoint measures the time from when a packet was sent towhen it is acknowledged as an RTT sample. The endpoint uses RTT samples andpeer-reported host delays (seeSection 13.2 of [QUIC-TRANSPORT]) to generate astatistical description of the network path's RTT. An endpoint computes thefollowing three values for each path: the minimum value over a period of time(min_rtt), an exponentially weighted moving average (smoothed_rtt), and the meandeviation (referred to as "variation" in the rest of this document) in theobserved RTT samples (rttvar).¶
An endpoint generates an RTT sample on receiving an ACK frame that meets thefollowing two conditions:¶
The RTT sample, latest_rtt, is generated as the time elapsed since the largestacknowledged packet was sent:¶
latest_rtt = ack_time - send_time_of_largest_acked¶
An RTT sample is generated using only the largest acknowledged packet in thereceived ACK frame. This is because a peer reports acknowledgment delays foronly the largest acknowledged packet in an ACK frame. While the reportedacknowledgment delay is not used by the RTT sample measurement, it is used toadjust the RTT sample in subsequent computations of smoothed_rtt and rttvar(Section 5.3).¶
To avoid generating multiple RTT samples for a single packet, an ACK frameSHOULD NOT be used to update RTT estimates if it does not newly acknowledge thelargest acknowledged packet.¶
An RTT sampleMUST NOT be generated on receiving an ACK frame that does notnewly acknowledge at least one ack-eliciting packet. A peer usually does notsend an ACK frame when only non-ack-eliciting packets are received. Therefore,an ACK frame that contains acknowledgments for only non-ack-eliciting packetscould include an arbitrarily large ACK Delay value. Ignoringsuch ACK frames avoids complications in subsequent smoothed_rtt and rttvarcomputations.¶
A sender might generate multiple RTT samples per RTT when multiple ACK framesare received within an RTT. As suggested in[RFC6298], doing so might resultin inadequate history in smoothed_rtt and rttvar. Ensuring that RTT estimatesretain sufficient history is an open research question.¶
min_rtt is the sender's estimate of the minimum RTT observed for a given networkpath over a period of time. In this document, min_rtt is used by loss detectionto reject implausibly small RTT samples.¶
min_rttMUST be set to the latest_rtt on the first RTT sample. min_rttMUST beset to the lesser of min_rtt and latest_rtt (Section 5.1) on all othersamples.¶
An endpoint uses only locally observed times in computing the min_rtt and doesnot adjust for acknowledgment delays reported by the peer. Doing so allows theendpoint to set a lower bound for the smoothed_rtt based entirely on what itobserves (seeSection 5.3) and limits potential underestimation due toerroneously reported delays by the peer.¶
The RTT for a network path may change over time. If a path's actual RTTdecreases, the min_rtt will adapt immediately on the first low sample. If thepath's actual RTT increases, however, the min_rtt will not adapt to it, allowingfuture RTT samples that are smaller than the new RTT to be included insmoothed_rtt.¶
EndpointsSHOULD set the min_rtt to the newest RTT sample after persistentcongestion is established. This avoids repeatedly declaring persistentcongestion when the RTT increases. This also allows a connection to resetits estimate of min_rtt and smoothed_rtt after a disruptive network event;seeSection 5.3.¶
EndpointsMAY reestablish the min_rtt at other times in the connection, such aswhen traffic volume is low and an acknowledgment is received with a lowacknowledgment delay. ImplementationsSHOULD NOT refresh the min_rttvalue too often since the actual minimum RTT of the path is notfrequently observable.¶
smoothed_rtt is an exponentially weighted moving average of an endpoint's RTTsamples, and rttvar estimates the variation in the RTT samples using a meanvariation.¶
The calculation of smoothed_rtt uses RTT samples after adjusting them foracknowledgment delays. These delays are decoded from the ACK Delay field ofACK frames as described inSection 19.3 of [QUIC-TRANSPORT].¶
The peer might report acknowledgment delays that are larger than the peer'smax_ack_delay during the handshake (Section 13.2.1 of [QUIC-TRANSPORT]). Toaccount for this, the endpointSHOULD ignore max_ack_delay until the handshakeis confirmed, as defined inSection 4.1.2 of [QUIC-TLS]. When they occur,these large acknowledgment delays are likely to be non-repeating and limited tothe handshake. The endpoint can therefore use them without limiting them to themax_ack_delay, avoiding unnecessary inflation of the RTT estimate.¶
Note that a large acknowledgment delay can result in a substantially inflatedsmoothed_rtt if there is an error either in the peer's reporting of theacknowledgment delay or in the endpoint's min_rtt estimate. Therefore, priorto handshake confirmation, an endpointMAY ignore RTT samples if adjustingthe RTT sample for acknowledgment delay causes the sample to be less than themin_rtt.¶
After the handshake is confirmed, any acknowledgment delays reported by thepeer that are greater than the peer's max_ack_delay are attributed tounintentional but potentially repeating delays, such as scheduler latency at thepeer or loss of previous acknowledgments. Excess delays could also be due toa noncompliant receiver. Therefore, these extra delays are consideredeffectively part of path delay and incorporated into the RTT estimate.¶
Therefore, when adjusting an RTT sample using peer-reported acknowledgmentdelays, an endpoint:¶
Additionally, an endpoint might postpone the processing of acknowledgments whenthe corresponding decryption keys are not immediately available. For example, aclient might receive an acknowledgment for a 0-RTT packet that it cannotdecrypt because 1-RTT packet protection keys are not yet available to it. Insuch cases, an endpointSHOULD subtract such local delays from its RTT sampleuntil the handshake is confirmed.¶
Similar to[RFC6298], smoothed_rtt and rttvar are computed as follows.¶
An endpoint initializes the RTT estimator during connection establishment andwhen the estimator is reset during connection migration; seeSection 9.4 of [QUIC-TRANSPORT]. Before any RTT samples are available for a new path or whenthe estimator is reset, the estimator is initialized using the initial RTT; seeSection 6.2.2.¶
smoothed_rtt and rttvar are initialized as follows, where kInitialRtt containsthe initial RTT value:¶
smoothed_rtt = kInitialRttrttvar = kInitialRtt / 2¶
RTT samples for the network path are recorded in latest_rtt; seeSection 5.1. On the first RTT sample after initialization, the estimator isreset using that sample. This ensures that the estimator retains no history ofpast samples. Packets sent on other paths do not contribute RTT samples to thecurrent path, as described inSection 9.4 of [QUIC-TRANSPORT].¶
On the first RTT sample after initialization, smoothed_rtt and rttvar are set asfollows:¶
smoothed_rtt = latest_rttrttvar = latest_rtt / 2¶
On subsequent RTT samples, smoothed_rtt and rttvar evolve as follows:¶
ack_delay = decoded acknowledgment delay from ACK frameif (handshake confirmed): ack_delay = min(ack_delay, max_ack_delay)adjusted_rtt = latest_rttif (latest_rtt >= min_rtt + ack_delay): adjusted_rtt = latest_rtt - ack_delaysmoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rttrttvar_sample = abs(smoothed_rtt - adjusted_rtt)rttvar = 3/4 * rttvar + 1/4 * rttvar_sample¶
QUIC senders use acknowledgments to detect lost packets and a PTO to ensureacknowledgments are received; seeSection 6.2. This section provides a descriptionof these algorithms.¶
If a packet is lost, the QUIC transport needs to recover from that loss, suchas by retransmitting the data, sending an updated frame, or discarding theframe. For more information, seeSection 13.3 of [QUIC-TRANSPORT].¶
Loss detection is separate per packet number space, unlike RTT measurement andcongestion control, because RTT and congestion control are properties of thepath, whereas loss detection also relies upon key availability.¶
Acknowledgment-based loss detection implements the spirit of TCP's FastRetransmit[RFC5681], Early Retransmit[RFC5827], Forward Acknowledgment[FACK], SACK loss recovery[RFC6675], and RACK-TLP[RFC8985]. Thissection provides an overview of how these algorithms are implemented in QUIC.¶
A packet is declared lost if it meets all of the following conditions:¶
The acknowledgment indicates that a packet sent later was delivered, and thepacket and time thresholds provide some tolerance for packet reordering.¶
Spuriously declaring packets as lost leads to unnecessary retransmissions andmay result in degraded performance due to the actions of the congestioncontroller upon detecting loss. Implementations can detect spuriousretransmissions and increase the packet or time reordering threshold toreduce future spurious retransmissions and loss events. Implementations withadaptive time thresholdsMAY choose to start with smaller initial reorderingthresholds to minimize recovery latency.¶
TheRECOMMENDED initial value for the packet reordering threshold(kPacketThreshold) is 3, based on best practices for TCP loss detection[RFC5681][RFC6675]. In order to remain similar to TCP,implementationsSHOULD NOT use a packet threshold less than 3; see[RFC5681].¶
Some networks may exhibit higher degrees of packet reordering, causing a senderto detect spurious losses. Additionally, packet reordering could be more commonwith QUIC than TCP because network elements that could observe and reorder TCPpackets cannot do that for QUIC and also because QUIC packet numbers areencrypted. Algorithms that increase the reordering threshold after spuriouslydetecting losses, such as RACK[RFC8985], have proven to be useful in TCP andare expected to be at least as useful in QUIC.¶
Once a later packet within the same packet number space has been acknowledged,an endpointSHOULD declare an earlier packet lost if it was sent a thresholdamount of time in the past. To avoid declaring packets as lost too early, thistime thresholdMUST be set to at least the local timer granularity, asindicated by the kGranularity constant. The time threshold is:¶
max(kTimeThreshold * max(smoothed_rtt, latest_rtt), kGranularity)¶
If packets sent prior to the largest acknowledged packet cannot yet be declaredlost, then a timerSHOULD be set for the remaining time.¶
Using max(smoothed_rtt, latest_rtt) protects from the two following cases:¶
TheRECOMMENDED time threshold (kTimeThreshold), expressed as an RTT multiplier,is 9/8. TheRECOMMENDED value of the timer granularity (kGranularity) is 1millisecond.¶
Note: TCP's RACK[RFC8985] specifies a slightly larger threshold, equivalentto 5/4, for a similar purpose. Experience with QUIC shows that 9/8 works well.¶
ImplementationsMAY experiment with absolute thresholds, thresholds fromprevious connections, adaptive thresholds, or the including of RTT variation.Smaller thresholds reduce reordering resilience and increase spuriousretransmissions, and larger thresholds increase loss detection delay.¶
A Probe Timeout (PTO) triggers the sending of one or two probe datagrams whenack-eliciting packets are not acknowledged within the expected period oftime or the server may not have validated the client's address. A PTO enablesa connection to recover from loss of tail packets or acknowledgments.¶
As with loss detection, the PTO is per packet number space. That is, aPTO value is computed per packet number space.¶
A PTO timer expiration event does not indicate packet loss andMUST NOT causeprior unacknowledged packets to be marked as lost. When an acknowledgment isreceived that newly acknowledges packets, loss detection proceeds as dictatedby the packet and time threshold mechanisms; seeSection 6.1.¶
The PTO algorithm used in QUIC implements the reliability functions of Tail LossProbe[RFC8985], RTO[RFC5681], and F-RTO algorithms for TCP[RFC5682]. The timeout computation is based on TCP's RTO period[RFC6298].¶
When an ack-eliciting packet is transmitted, the sender schedules a timer forthe PTO period as follows:¶
PTO = smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay¶
The PTO period is the amount of time that a sender ought to wait for anacknowledgment of a sent packet. This time period includes the estimatednetwork RTT (smoothed_rtt), the variation in the estimate (4*rttvar),and max_ack_delay, to account for the maximum time by which a receiver mightdelay sending an acknowledgment.¶
When the PTO is armed for Initial or Handshake packet number spaces, themax_ack_delay in the PTO period computation is set to 0, since the peer isexpected to not delay these packets intentionally; seeSection 13.2.1 of [QUIC-TRANSPORT].¶
The PTO periodMUST be at least kGranularity to avoid the timer expiringimmediately.¶
When ack-eliciting packets in multiple packet number spaces are in flight, thetimerMUST be set to the earlier value of the Initial and Handshake packetnumber spaces.¶
An endpointMUST NOT set its PTO timer for the Application Data packet numberspace until the handshake is confirmed. Doing so prevents the endpoint fromretransmitting information in packets when either the peer does not yet have thekeys to process them or the endpoint does not yet have the keys to process theiracknowledgments. For example, this can happen when a client sends 0-RTT packetsto the server; it does so without knowing whether the server will be able todecrypt them. Similarly, this can happen when a server sends 1-RTT packetsbefore confirming that the client has verified the server's certificate and cantherefore read these 1-RTT packets.¶
A senderSHOULD restart its PTO timer every time an ack-eliciting packet issent or acknowledged, or when Initial or Handshake keys are discarded(Section 4.9 of [QUIC-TLS]). This ensures the PTO is always set based on thelatest estimate of the RTT and for the correct packet across packetnumber spaces.¶
When a PTO timer expires, the PTO backoffMUST be increased, resulting in thePTO period being set to twice its current value. The PTO backoff factor is resetwhen an acknowledgment is received, except in the following case. A servermight take longer to respond to packets during the handshake than otherwise. Toprotect such a server from repeated client probes, the PTO backoff is not resetat a client that is not yet certain that the server has finished validating theclient's address. That is, a client does not reset the PTO backoff factor onreceiving acknowledgments in Initial packets.¶
This exponential reduction in the sender's rate is important because consecutivePTOs might be caused by loss of packets or acknowledgments due to severecongestion. Even when there are ack-eliciting packets in flight in multiplepacket number spaces, the exponential increase in PTO occurs across all spacesto prevent excess load on the network. For example, a timeout in the Initialpacket number space doubles the length of the timeout in the Handshake packetnumber space.¶
The total length of time over which consecutive PTOs expire is limited by theidle timeout.¶
The PTO timerMUST NOT be set if a timer is set for time thresholdloss detection; seeSection 6.1.2. A timer that is set for timethreshold loss detection will expire earlier than the PTO timerin most cases and is less likely to spuriously retransmit data.¶
Resumed connections over the same networkMAY use the previous connection'sfinal smoothed RTT value as the resumed connection's initial RTT. When noprevious RTT is available, the initial RTTSHOULD be set to 333 milliseconds.This results in handshakes starting with a PTO of 1 second, as recommended forTCP's initial RTO; seeSection 2 of [RFC6298].¶
A connectionMAY use the delay between sending a PATH_CHALLENGE and receiving aPATH_RESPONSE to set the initial RTT (see kInitialRtt inAppendix A.2) for a new path, but the delaySHOULD NOT beconsidered an RTT sample.¶
When the Initial keys and Handshake keys are discarded (seeSection 6.4), any Initial packets and Handshake packets canno longer be acknowledged, so they are removed from bytes inflight. When Initial or Handshake keys are discarded, the PTO and lossdetection timersMUST be reset, because discarding keys indicatesforward progress and the loss detection timer might have been set fora now-discarded packet number space.¶
Until the server has validated the client's address on the path, the amount ofdata it can send is limited to three times the amount of data received,as specified inSection 8.1 of [QUIC-TRANSPORT]. If no additional data can besent, the server's PTO timerMUST NOT be armed until datagrams have beenreceived from the client because packets sent on PTO count against theanti-amplification limit.¶
When the server receives a datagram from the client, the amplification limit isincreased and the server resets the PTO timer. If the PTO timer is then set toa time in the past, it is executed immediately. Doing so avoids sending new1-RTT packets prior to packets critical to the completion of the handshake.In particular, this can happen when 0-RTT is accepted but the server fails tovalidate the client's address.¶
Since the server could be blocked until more datagrams are received from theclient, it is the client's responsibility to send packets to unblock the serveruntil it is certain that the server has finished its address validation (seeSection 8 of [QUIC-TRANSPORT]). That is, the clientMUST set the PTO timerif the client has not received an acknowledgment for any of its Handshakepackets and the handshake is not confirmed (seeSection 4.1.2 of [QUIC-TLS]),even if there are no packets in flight. When the PTO fires, the clientMUSTsend a Handshake packet if it has Handshake keys, otherwise itMUST send anInitial packet in a UDP datagram with a payload of at least 1200 bytes.¶
When a server receives an Initial packet containing duplicate CRYPTO data,it can assume the client did not receive all of the server's CRYPTO data sentin Initial packets, or the client's estimated RTT is too small. When aclient receives Handshake or 1-RTT packets prior to obtaining Handshake keys,it may assume some or all of the server's Initial packets were lost.¶
To speed up handshake completion under these conditions, an endpointMAY, for alimited number of times per connection, send a packet containingunacknowledged CRYPTO data earlier than the PTO expiry, subject to the addressvalidation limits inSection 8.1 of [QUIC-TRANSPORT]. Doing so at most oncefor each connection is adequate to quickly recover from a single packet loss.An endpoint that always retransmits packets in response to receiving packetsthat it cannot process risks creating an infinite exchange of packets.¶
Endpoints can also use coalesced packets (seeSection 12.2 of [QUIC-TRANSPORT]) to ensure that each datagram elicits at least oneacknowledgment. For example, a client can coalesce an Initial packet containingPING and PADDING frames with a 0-RTT data packet, and a server can coalesce anInitial packet containing a PING frame with one or more packets in its firstflight.¶
When a PTO timer expires, a senderMUST send at least one ack-eliciting packetin the packet number space as a probe. An endpointMAY send up to twofull-sized datagrams containing ack-eliciting packets to avoid an expensiveconsecutive PTO expiration due to a single lost datagram or to transmit datafrom multiple packet number spaces. All probe packets sent on a PTOMUST beack-eliciting.¶
In addition to sending data in the packet number space for which the timerexpired, the senderSHOULD send ack-eliciting packets from other packet numberspaces with in-flight data, coalescing packets if possible. This isparticularly valuable when the server has both Initial and Handshake data inflight or when the client has both Handshake and Application Data in flightbecause the peer might only have receive keys for one of the two packet numberspaces.¶
If the sender wants to elicit a faster acknowledgment on PTO, it can skip apacket number to eliminate the acknowledgment delay.¶
An endpointSHOULD include new data in packets that are sent on PTO expiration.Previously sent dataMAY be sent if no new data can be sent. ImplementationsMAY use alternative strategies for determining the content of probe packets,including sending new or retransmitted data based on the application'spriorities.¶
It is possible the sender has no new or previously sent data to send.As an example, consider the following sequence of events: new application datais sent in a STREAM frame, deemed lost, then retransmitted in a new packet,and then the original transmission is acknowledged. When there is no data tosend, the senderSHOULD send a PING or other ack-eliciting frame in a singlepacket, rearming the PTO timer.¶
Alternatively, instead of sending an ack-eliciting packet, the senderMAY markany packets still in flight as lost. Doing so avoids sending an additionalpacket but increases the risk that loss is declared too aggressively, resultingin an unnecessary rate reduction by the congestion controller.¶
Consecutive PTO periods increase exponentially, and as a result, connectionrecovery latency increases exponentially as packets continue to be dropped inthe network. Sending two packets on PTO expiration increases resilience topacket drops, thus reducing the probability of consecutive PTO events.¶
When the PTO timer expires multiple times and new data cannot be sent,implementations must choose between sending the same payload every timeor sending different payloads. Sending the same payload may be simplerand ensures the highest priority frames arrive first. Sending differentpayloads each time reduces the chances of spurious retransmission.¶
A Retry packet causes a client to send another Initial packet, effectivelyrestarting the connection process. A Retry packet indicates that the Initialpacket was received but not processed. A Retry packet cannot be treated as anacknowledgment because it does not indicate that a packet was processed orspecify the packet number.¶
Clients that receive a Retry packet reset congestion control and loss recoverystate, including resetting any pending timers. Other connection state, inparticular cryptographic handshake messages, is retained; seeSection 17.2.5 of [QUIC-TRANSPORT].¶
The clientMAY compute an RTT estimate to the server as the time period fromwhen the first Initial packet was sent to when a Retry or a Version Negotiationpacket is received. The clientMAY use this value in place of its default forthe initial RTT estimate.¶
When Initial and Handshake packet protection keys are discarded(seeSection 4.9 of [QUIC-TLS]), all packets that were sent with those keyscan no longer be acknowledged because their acknowledgments cannot be processed.The senderMUST discard all recovery state associated with those packetsandMUST remove them from the count of bytes in flight.¶
Endpoints stop sending and receiving Initial packets once they start exchangingHandshake packets; seeSection 17.2.2.1 of [QUIC-TRANSPORT]. At this point,recovery state for all in-flight Initial packets is discarded.¶
When 0-RTT is rejected, recovery state for all in-flight 0-RTT packets isdiscarded.¶
If a server accepts 0-RTT, but does not buffer 0-RTT packets that arrivebefore Initial packets, early 0-RTT packets will be declared lost, but thatis expected to be infrequent.¶
It is expected that keys are discarded at some time after the packetsencrypted with them are either acknowledged or declared lost. However,Initial and Handshake secrets are discarded as soon as Handshake and1-RTT keys are proven to be available to both client and server; seeSection 4.9.1 of [QUIC-TLS].¶
This document specifies a sender-side congestion controller for QUIC similar toTCP NewReno[RFC6582].¶
The signals QUIC provides for congestion control are generic and are designed tosupport different sender-side algorithms. A sender can unilaterally choose adifferent algorithm to use, such as CUBIC[RFC8312].¶
If a sender uses a different controller than that specified in this document,the chosen controllerMUST conform to the congestion control guidelinesspecified inSection 3.1 of [RFC8085].¶
Similar to TCP, packets containing only ACK frames do not count toward bytesin flight and are not congestion controlled. Unlike TCP, QUIC can detect theloss of these packets andMAY use that information to adjust the congestioncontroller or the rate of ACK-only packets being sent, but this document doesnot describe a mechanism for doing so.¶
The congestion controller is per path, so packets sent on other paths do notalter the current path's congestion controller, as described inSection 9.4 of [QUIC-TRANSPORT].¶
The algorithm in this document specifies and uses the controller's congestionwindow in bytes.¶
An endpointMUST NOT send a packet if it would cause bytes_in_flight (seeAppendix B.2) to be larger than the congestion window, unless the packetis sent on a PTO timer expiration (seeSection 6.2) or when entering recovery(seeSection 7.3.2).¶
If a path has been validated to support Explicit Congestion Notification (ECN)[RFC3168][RFC8311], QUIC treats a Congestion Experienced (CE) codepointin the IP header as a signal of congestion. This document specifies anendpoint's response when the peer-reported ECN-CE count increases; seeSection 13.4.2 of [QUIC-TRANSPORT].¶
QUIC begins every connection in slow start with the congestion window set to aninitial value. EndpointsSHOULD use an initial congestion window of ten timesthe maximum datagram size (max_datagram_size), while limiting the window to thelarger of 14,720 bytes or twice the maximum datagram size. This follows theanalysis and recommendations in[RFC6928], increasing the byte limit toaccount for the smaller 8-byte overhead of UDP compared to the 20-byte overheadfor TCP.¶
If the maximum datagram size changes during the connection, the initialcongestion windowSHOULD be recalculated with the new size. If the maximumdatagram size is decreased in order to complete the handshake, thecongestion windowSHOULD be set to the new initial congestion window.¶
Prior to validating the client's address, the server can be further limited bythe anti-amplification limit as specified inSection 8.1 of [QUIC-TRANSPORT].Though the anti-amplification limit can prevent the congestion window frombeing fully utilized and therefore slow down the increase in congestion window,it does not directly affect the congestion window.¶
The minimum congestion window is the smallest value the congestion window canattain in response to loss, an increase in the peer-reported ECN-CE count,or persistent congestion. TheRECOMMENDED value is 2 * max_datagram_size.¶
The NewReno congestion controller described in this document has threedistinct states, as shown inFigure 1.¶
New path or +------------+ persistent congestion | Slow | (O)---------------------->| Start | +------------+ | Loss or | ECN-CE increase | v +------------+ Loss or +------------+ | Congestion | ECN-CE increase | Recovery | | Avoidance |------------------>| Period | +------------+ +------------+ ^ | | | +----------------------------+ Acknowledgment of packet sent during recovery
These states and the transitions between them are described in subsequentsections.¶
A NewReno sender is in slow start any time the congestion window is below theslow start threshold. A sender begins in slow start because the slow startthreshold is initialized to an infinite value.¶
While a sender is in slow start, the congestion window increases by the numberof bytes acknowledged when each acknowledgment is processed. This results inexponential growth of the congestion window.¶
The senderMUST exit slow start and enter a recovery period when a packet islost or when the ECN-CE count reported by its peer increases.¶
A sender reenters slow start any time the congestion window is less than theslow start threshold, which only occurs after persistent congestion isdeclared.¶
A NewReno sender enters a recovery period when it detects the loss of a packetor when the ECN-CE count reported by its peer increases. A sender that isalready in a recovery period stays in it and does not reenter it.¶
On entering a recovery period, a senderMUST set the slow start threshold tohalf the value of the congestion window when loss is detected. The congestionwindowMUST be set to the reduced value of the slow start threshold beforeexiting the recovery period.¶
ImplementationsMAY reduce the congestion window immediately upon entering arecovery period or use other mechanisms, such as Proportional Rate Reduction[PRR], to reduce the congestion window more gradually. If thecongestion window is reduced immediately, a single packet can be sent prior toreduction. This speeds up loss recovery if the data in the lost packet isretransmitted and is similar to TCP as described inSection 5 of [RFC6675].¶
The recovery period aims to limit congestion window reduction to once per roundtrip. Therefore, during a recovery period, the congestion window does not changein response to new losses or increases in the ECN-CE count.¶
A recovery period ends and the sender enters congestion avoidance when a packetsent during the recovery period is acknowledged. This is slightly differentfrom TCP's definition of recovery, which ends when the lost segment thatstarted recovery is acknowledged[RFC5681].¶
A NewReno sender is in congestion avoidance any time the congestion window isat or above the slow start threshold and not in a recovery period.¶
A sender in congestion avoidance uses an Additive Increase MultiplicativeDecrease (AIMD) approach thatMUST limit the increase to the congestion windowto at most one maximum datagram size for each congestion window that isacknowledged.¶
The sender exits congestion avoidance and enters a recovery period when apacket is lost or when the ECN-CE count reported by its peer increases.¶
During the handshake, some packet protection keys might not be available whena packet arrives, and the receiver can choose to drop the packet. In particular,Handshake and 0-RTT packets cannot be processed until the Initial packetsarrive, and 1-RTT packets cannot be processed until the handshake completes.EndpointsMAY ignore the loss of Handshake, 0-RTT, and 1-RTT packets that mighthave arrived before the peer had packet protection keys to process thosepackets. EndpointsMUST NOT ignore the loss of packets that were sent afterthe earliest acknowledged packet in a given packet number space.¶
Probe packetsMUST NOT be blocked by the congestion controller. A senderMUSThowever count these packets as being additionally in flight, since these packetsadd network load without establishing packet loss. Note that sending probepackets might cause the sender's bytes in flight to exceed the congestion windowuntil an acknowledgment is received that establishes loss or delivery ofpackets.¶
When a sender establishes loss of all packets sent over a long enough duration,the network is considered to be experiencing persistent congestion.¶
The persistent congestion duration is computed as follows:¶
(smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay) * kPersistentCongestionThreshold¶
Unlike the PTO computation inSection 6.2, this duration includes the max_ack_delayirrespective of the packet number spaces in which losses are established.¶
This duration allows a sender to send as many packets before establishingpersistent congestion, including some in response to PTO expiration, as TCP doeswith Tail Loss Probes[RFC8985] and an RTO[RFC5681].¶
Larger values of kPersistentCongestionThreshold cause the sender to become lessresponsive to persistent congestion in the network, which can result inaggressive sending into a congested network. Too small a value can result in asender declaring persistent congestion unnecessarily, resulting in reducedthroughput for the sender.¶
TheRECOMMENDED value for kPersistentCongestionThreshold is 3, which results inbehavior that is approximately equivalent to a TCP sender declaring an RTO aftertwo TLPs.¶
This design does not use consecutive PTO events to establish persistentcongestion, since application patterns impact PTO expiration. For example, asender that sends small amounts of data with silence periods between themrestarts the PTO timer every time it sends, potentially preventing the PTO timerfrom expiring for a long period of time, even when no acknowledgments are beingreceived. The use of a duration enables a sender to establish persistentcongestion without depending on PTO expiration.¶
A sender establishes persistent congestion after the receipt of anacknowledgment if two packets that are ack-eliciting are declared lost, and:¶
These two packetsMUST be ack-eliciting, since a receiver is required toacknowledge only ack-eliciting packets within its maximum acknowledgment delay;seeSection 13.2 of [QUIC-TRANSPORT].¶
The persistent congestion periodSHOULD NOT start until there is at least oneRTT sample. Before the first RTT sample, a sender arms its PTO timer based onthe initial RTT (Section 6.2.2), which could be substantially larger thanthe actual RTT. Requiring a prior RTT sample prevents a sender from establishingpersistent congestion with potentially too few probes.¶
Since network congestion is not affected by packet number spaces, persistentcongestionSHOULD consider packets sent across packet number spaces. A senderthat does not have state for all packet number spaces or an implementation thatcannot compare send times across packet number spacesMAY use state for just thepacket number space that was acknowledged. This might result in erroneouslydeclaring persistent congestion, but it will not lead to a failure to detectpersistent congestion.¶
When persistent congestion is declared, the sender's congestion windowMUST bereduced to the minimum congestion window (kMinimumWindow), similar to a TCPsender's response on an RTO[RFC5681].¶
The following example illustrates how a sender might establish persistentcongestion. Assume:¶
smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay = 2kPersistentCongestionThreshold = 3¶
Consider the following sequence of events:¶
Time | Action |
---|---|
t=0 | Send packet #1 (application data) |
t=1 | Send packet #2 (application data) |
t=1.2 | Receive acknowledgment of #1 |
t=2 | Send packet #3 (application data) |
t=3 | Send packet #4 (application data) |
t=4 | Send packet #5 (application data) |
t=5 | Send packet #6 (application data) |
t=6 | Send packet #7 (application data) |
t=8 | Send packet #8 (PTO 1) |
t=12 | Send packet #9 (PTO 2) |
t=12.2 | Receive acknowledgment of #9 |
Packets 2 through 8 are declared lost when the acknowledgment for packet 9 isreceived att = 12.2
.¶
The congestion period is calculated as the time between the oldest and newestlost packets:8 - 1 = 7
. The persistent congestion duration is2 * 3 = 6
.Because the threshold was reached and because none of the packets between theoldest and the newest lost packets were acknowledged, the network is consideredto have experienced persistent congestion.¶
While this example shows PTO expiration, they are not required for persistentcongestion to be established.¶
A senderSHOULD pace sending of all in-flight packets based on input from thecongestion controller.¶
Sending multiple packets into the network without any delay between them createsa packet burst that might cause short-term congestion and losses. SendersMUSTeither use pacing or limit such bursts. SendersSHOULD limit bursts to theinitial congestion window; seeSection 7.2. A sender with knowledge thatthe network path to the receiver can absorb larger burstsMAY use a higherlimit.¶
An implementation should take care to architect its congestion controller towork well with a pacer. For instance, a pacer might wrap the congestioncontroller and control the availability of the congestion window, or a pacermight pace out packets handed to it by the congestion controller.¶
Timely delivery of ACK frames is important for efficient loss recovery. To avoiddelaying their delivery to the peer, packets containing only ACK framesSHOULDtherefore not be paced.¶
Endpoints can implement pacing as they choose. A perfectly paced sender spreadspackets exactly evenly over time. For a window-based congestion controller, suchas the one in this document, that rate can be computed by averaging thecongestion window over the RTT. Expressed as a rate in units ofbytes per time, where congestion_window is in bytes:¶
rate = N * congestion_window / smoothed_rtt¶
Or expressed as an inter-packet interval in units of time:¶
interval = ( smoothed_rtt * packet_size / congestion_window ) / N¶
Using a value forN
that is small, but at least 1 (for example, 1.25) ensuresthat variations in RTT do not result in underutilization of thecongestion window.¶
Practical considerations, such as packetization, scheduling delays, andcomputational efficiency, can cause a sender to deviate from this rate over timeperiods that are much shorter than an RTT.¶
One possible implementation strategy for pacing uses a leaky bucket algorithm,where the capacity of the "bucket" is limited to the maximum burst size and therate the "bucket" fills is determined by the above function.¶
When bytes in flight is smaller than the congestion window and sending is notpacing limited, the congestion window is underutilized. This can happen due toinsufficient application data or flow control limits. When this occurs,the congestion windowSHOULD NOT be increased in either slow start orcongestion avoidance.¶
A sender that paces packets (seeSection 7.7) might delay sending packetsand not fully utilize the congestion window due to this delay. A senderSHOULD NOT consider itself application limited if it would have fullyutilized the congestion window without pacing delay.¶
A senderMAY implement alternative mechanisms to update its congestion windowafter periods of underutilization, such as those proposed for TCP in[RFC7661].¶
Loss detection and congestion control fundamentally involve the consumption ofsignals, such as delay, loss, and ECN markings, from unauthenticatedentities. An attacker can cause endpoints to reduce their sending rate bymanipulating these signals: by dropping packets, by altering path delaystrategically, or by changing ECN codepoints.¶
Packets that carry only ACK frames can be heuristically identified by observingpacket size. Acknowledgment patterns may expose information about linkcharacteristics or application behavior. To reduce leaked information,endpoints can bundle acknowledgments with other frames, or they can use PADDINGframes at a potential cost to performance.¶
A receiver can misreport ECN markings to alter the congestion response of asender. Suppressing reports of ECN-CE markings could cause a sender toincrease their send rate. This increase could result in congestion and loss.¶
A sender can detect suppression of reports by marking occasional packets that itsends with an ECN-CE marking. If a packet sent with an ECN-CE marking is notreported as having been CE marked when the packet is acknowledged, then thesender can disable ECN for that path by not setting ECN-Capable Transport (ECT)codepoints in subsequent packets sent on that path[RFC3168].¶
Reporting additional ECN-CE markings will cause a sender to reduce their sendingrate, which is similar in effect to advertising reduced connection flow controllimits and so no advantage is gained by doing so.¶
Endpoints choose the congestion controller that they use. Congestion controllersrespond to reports of ECN-CE by reducing their rate, but the response may vary.Markings can be treated as equivalent to loss[RFC3168], but otherresponses can be specified, such as[RFC8511] or[RFC8311].¶
We now describe an example implementation of the loss detection mechanismsdescribed inSection 6.¶
The pseudocode segments in this section are licensed as Code Components; see thecopyright notice.¶
To correctly implement congestion control, a QUIC sender tracks everyack-eliciting packet until the packet is acknowledged or lost.It is expected that implementations will be able to access this information bypacket number and crypto context and store the per-packet fields(Appendix A.1.1) for loss recovery and congestion control.¶
After a packet is declared lost, the endpoint can still maintain state for itfor an amount of time to allow for packet reordering; seeSection 13.3 of [QUIC-TRANSPORT]. This enables a sender to detect spurious retransmissions.¶
Sent packets are tracked for each packet number space, and ACKprocessing only applies to a single space.¶
The packet number of the sent packet.¶
A Boolean that indicates whether a packet is ack-eliciting.If true, it is expected that an acknowledgment will be received,though the peer could delay sending the ACK frame containing itby up to the max_ack_delay.¶
A Boolean that indicates whether the packet counts toward bytes inflight.¶
The number of bytes sent in the packet, not including UDP or IPoverhead, but including QUIC framing overhead.¶
The time the packet was sent.¶
Constants used in loss recovery are based on a combination of RFCs, papers, andcommon practice.¶
Maximum reordering in packets before packet threshold loss detectionconsiders a packet lost. The value recommended inSection 6.1.1 is 3.¶
Maximum reordering in time before time threshold loss detectionconsiders a packet lost. Specified as an RTT multiplier. The valuerecommended inSection 6.1.2 is 9/8.¶
Timer granularity. This is a system-dependent value, andSection 6.1.2recommends a value of 1 ms.¶
The RTT used before an RTT sample is taken. The value recommended inSection 6.2.2 is 333 ms.¶
An enum to enumerate the three packet number spaces:¶
enum kPacketNumberSpace { Initial, Handshake, ApplicationData,}¶
Variables required to implement the congestion control mechanismsare described in this section.¶
The most recent RTT measurement made when receiving an acknowledgment fora previously unacknowledged packet.¶
The smoothed RTT of the connection, computed as described inSection 5.3.¶
The RTT variation, computed as described inSection 5.3.¶
The minimum RTT seen over a period of time, ignoring acknowledgment delay, asdescribed inSection 5.2.¶
The time that the first RTT sample was obtained.¶
The maximum amount of time by which the receiver intends to delayacknowledgments for packets in the Application Data packet numberspace, as defined by the eponymous transport parameter (Section 18.2 of [QUIC-TRANSPORT]). Note that the actual ack_delay in a receivedACK frame may be larger due to late timers, reordering, or loss.¶
Multi-modal timer used for loss detection.¶
The number of times a PTO has been sent without receiving an acknowledgment.¶
The time the most recent ack-eliciting packet was sent.¶
The largest packet number acknowledged in the packet number space so far.¶
The time at which the next packet in that packet number space can beconsidered lost based on exceeding the reordering window in time.¶
An association of packet numbers in a packet number space to informationabout them. Described in detail above inAppendix A.1.¶
At the beginning of the connection, initialize the loss detection variables asfollows:¶
loss_detection_timer.reset()pto_count = 0latest_rtt = 0smoothed_rtt = kInitialRttrttvar = kInitialRtt / 2min_rtt = 0first_rtt_sample = 0for pn_space in [ Initial, Handshake, ApplicationData ]: largest_acked_packet[pn_space] = infinite time_of_last_ack_eliciting_packet[pn_space] = 0 loss_time[pn_space] = 0¶
After a packet is sent, information about the packet is stored. The parametersto OnPacketSent are described in detail above inAppendix A.1.1.¶
Pseudocode for OnPacketSent follows:¶
OnPacketSent(packet_number, pn_space, ack_eliciting, in_flight, sent_bytes): sent_packets[pn_space][packet_number].packet_number = packet_number sent_packets[pn_space][packet_number].time_sent = now() sent_packets[pn_space][packet_number].ack_eliciting = ack_eliciting sent_packets[pn_space][packet_number].in_flight = in_flight sent_packets[pn_space][packet_number].sent_bytes = sent_bytes if (in_flight): if (ack_eliciting): time_of_last_ack_eliciting_packet[pn_space] = now() OnPacketSentCC(sent_bytes) SetLossDetectionTimer()¶
When a server is blocked by anti-amplification limits, receivinga datagram unblocks it, even if none of the packets in thedatagram are successfully processed. In such a case, the PTOtimer will need to be rearmed.¶
Pseudocode for OnDatagramReceived follows:¶
OnDatagramReceived(datagram): // If this datagram unblocks the server, arm the // PTO timer to avoid deadlock. if (server was at anti-amplification limit): SetLossDetectionTimer() if loss_detection_timer.timeout < now(): // Execute PTO if it would have expired // while the amplification limit applied. OnLossDetectionTimeout()¶
When an ACK frame is received, it may newly acknowledge any number of packets.¶
Pseudocode for OnAckReceived and UpdateRtt follow:¶
IncludesAckEliciting(packets): for packet in packets: if (packet.ack_eliciting): return true return falseOnAckReceived(ack, pn_space): if (largest_acked_packet[pn_space] == infinite): largest_acked_packet[pn_space] = ack.largest_acked else: largest_acked_packet[pn_space] = max(largest_acked_packet[pn_space], ack.largest_acked) // DetectAndRemoveAckedPackets finds packets that are newly // acknowledged and removes them from sent_packets. newly_acked_packets = DetectAndRemoveAckedPackets(ack, pn_space) // Nothing to do if there are no newly acked packets. if (newly_acked_packets.empty()): return // Update the RTT if the largest acknowledged is newly acked // and at least one ack-eliciting was newly acked. if (newly_acked_packets.largest().packet_number == ack.largest_acked && IncludesAckEliciting(newly_acked_packets)): latest_rtt = now() - newly_acked_packets.largest().time_sent UpdateRtt(ack.ack_delay) // Process ECN information if present. if (ACK frame contains ECN information): ProcessECN(ack, pn_space) lost_packets = DetectAndRemoveLostPackets(pn_space) if (!lost_packets.empty()): OnPacketsLost(lost_packets) OnPacketsAcked(newly_acked_packets) // Reset pto_count unless the client is unsure if // the server has validated the client's address. if (PeerCompletedAddressValidation()): pto_count = 0 SetLossDetectionTimer()UpdateRtt(ack_delay): if (first_rtt_sample == 0): min_rtt = latest_rtt smoothed_rtt = latest_rtt rttvar = latest_rtt / 2 first_rtt_sample = now() return // min_rtt ignores acknowledgment delay. min_rtt = min(min_rtt, latest_rtt) // Limit ack_delay by max_ack_delay after handshake // confirmation. if (handshake confirmed): ack_delay = min(ack_delay, max_ack_delay) // Adjust for acknowledgment delay if plausible. adjusted_rtt = latest_rtt if (latest_rtt >= min_rtt + ack_delay): adjusted_rtt = latest_rtt - ack_delay rttvar = 3/4 * rttvar + 1/4 * abs(smoothed_rtt - adjusted_rtt) smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt¶
QUIC loss detection uses a single timer for all timeout loss detection. Theduration of the timer is based on the timer's mode, which is set in the packetand timer events further below. The function SetLossDetectionTimer definedbelow shows how the single timer is set.¶
This algorithm may result in the timer being set in the past, particularly iftimers wake up late. Timers set in the past fire immediately.¶
Pseudocode for SetLossDetectionTimer follows (where the "^" operator representsexponentiation):¶
GetLossTimeAndSpace(): time = loss_time[Initial] space = Initial for pn_space in [ Handshake, ApplicationData ]: if (time == 0 || loss_time[pn_space] < time): time = loss_time[pn_space]; space = pn_space return time, spaceGetPtoTimeAndSpace(): duration = (smoothed_rtt + max(4 * rttvar, kGranularity)) * (2 ^ pto_count) // Anti-deadlock PTO starts from the current time if (no ack-eliciting packets in flight): assert(!PeerCompletedAddressValidation()) if (has handshake keys): return (now() + duration), Handshake else: return (now() + duration), Initial pto_timeout = infinite pto_space = Initial for space in [ Initial, Handshake, ApplicationData ]: if (no ack-eliciting packets in flight in space): continue; if (space == ApplicationData): // Skip Application Data until handshake confirmed. if (handshake is not confirmed): return pto_timeout, pto_space // Include max_ack_delay and backoff for Application Data. duration += max_ack_delay * (2 ^ pto_count) t = time_of_last_ack_eliciting_packet[space] + duration if (t < pto_timeout): pto_timeout = t pto_space = space return pto_timeout, pto_spacePeerCompletedAddressValidation(): // Assume clients validate the server's address implicitly. if (endpoint is server): return true // Servers complete address validation when a // protected packet is received. return has received Handshake ACK || handshake confirmedSetLossDetectionTimer(): earliest_loss_time, _ = GetLossTimeAndSpace() if (earliest_loss_time != 0): // Time threshold loss detection. loss_detection_timer.update(earliest_loss_time) return if (server is at anti-amplification limit): // The server's timer is not set if nothing can be sent. loss_detection_timer.cancel() return if (no ack-eliciting packets in flight && PeerCompletedAddressValidation()): // There is nothing to detect lost, so no timer is set. // However, the client needs to arm the timer if the // server might be blocked by the anti-amplification limit. loss_detection_timer.cancel() return timeout, _ = GetPtoTimeAndSpace() loss_detection_timer.update(timeout)¶
When the loss detection timer expires, the timer's mode determines the actionto be performed.¶
Pseudocode for OnLossDetectionTimeout follows:¶
OnLossDetectionTimeout(): earliest_loss_time, pn_space = GetLossTimeAndSpace() if (earliest_loss_time != 0): // Time threshold loss Detection lost_packets = DetectAndRemoveLostPackets(pn_space) assert(!lost_packets.empty()) OnPacketsLost(lost_packets) SetLossDetectionTimer() return if (no ack-eliciting packets in flight): assert(!PeerCompletedAddressValidation()) // Client sends an anti-deadlock packet: Initial is padded // to earn more anti-amplification credit, // a Handshake packet proves address ownership. if (has Handshake keys): SendOneAckElicitingHandshakePacket() else: SendOneAckElicitingPaddedInitialPacket() else: // PTO. Send new data if available, else retransmit old data. // If neither is available, send a single PING frame. _, pn_space = GetPtoTimeAndSpace() SendOneOrTwoAckElicitingPackets(pn_space) pto_count++ SetLossDetectionTimer()¶
DetectAndRemoveLostPackets is called every time an ACK is received or the timethreshold loss detection timer expires. This function operates on thesent_packets for that packet number space and returns a list of packets newlydetected as lost.¶
Pseudocode for DetectAndRemoveLostPackets follows:¶
DetectAndRemoveLostPackets(pn_space): assert(largest_acked_packet[pn_space] != infinite) loss_time[pn_space] = 0 lost_packets = [] loss_delay = kTimeThreshold * max(latest_rtt, smoothed_rtt) // Minimum time of kGranularity before packets are deemed lost. loss_delay = max(loss_delay, kGranularity) // Packets sent before this time are deemed lost. lost_send_time = now() - loss_delay foreach unacked in sent_packets[pn_space]: if (unacked.packet_number > largest_acked_packet[pn_space]): continue // Mark packet as lost, or set time when it should be marked. // Note: The use of kPacketThreshold here assumes that there // were no sender-induced gaps in the packet number space. if (unacked.time_sent <= lost_send_time || largest_acked_packet[pn_space] >= unacked.packet_number + kPacketThreshold): sent_packets[pn_space].remove(unacked.packet_number) lost_packets.insert(unacked) else: if (loss_time[pn_space] == 0): loss_time[pn_space] = unacked.time_sent + loss_delay else: loss_time[pn_space] = min(loss_time[pn_space], unacked.time_sent + loss_delay) return lost_packets¶
When Initial or Handshake keys are discarded, packets from the spaceare discarded and loss detection state is updated.¶
Pseudocode for OnPacketNumberSpaceDiscarded follows:¶
OnPacketNumberSpaceDiscarded(pn_space): assert(pn_space != ApplicationData) RemoveFromBytesInFlight(sent_packets[pn_space]) sent_packets[pn_space].clear() // Reset the loss detection and PTO timer time_of_last_ack_eliciting_packet[pn_space] = 0 loss_time[pn_space] = 0 pto_count = 0 SetLossDetectionTimer()¶
We now describe an example implementation of the congestion controller describedinSection 7.¶
The pseudocode segments in this section are licensed as Code Components; see thecopyright notice.¶
Constants used in congestion control are based on a combination of RFCs, papers,and common practice.¶
Default limit on the initial bytes in flight as described inSection 7.2.¶
Minimum congestion window in bytes as described inSection 7.2.¶
Scaling factor applied to reduce the congestion window when a new loss eventis detected.Section 7 recommends a value of 0.5.¶
Period of time for persistent congestion to be established, specified as a PTOmultiplier.Section 7.6 recommends a value of 3.¶
Variables required to implement the congestion control mechanismsare described in this section.¶
The sender's current maximum payload size. This does not include UDP or IPoverhead. The max datagram size is used for congestion windowcomputations. An endpoint sets the value of this variable based on its PathMaximum Transmission Unit (PMTU; seeSection 14.2 of [QUIC-TRANSPORT]), witha minimum value of 1200 bytes.¶
The highest value reported for the ECN-CE counter in the packet number spaceby the peer in an ACK frame. This value is used to detect increases in thereported ECN-CE counter.¶
The sum of the size in bytes of all sent packets that contain at least oneack-eliciting or PADDING frame and have not been acknowledged or declaredlost. The size does not include IP or UDP overhead, but does include the QUICheader and Authenticated Encryption with Associated Data (AEAD) overhead.Packets only containing ACK frames do not count toward bytes_in_flight toensure congestion control does not impede congestion feedback.¶
Maximum number of bytes allowed to be in flight.¶
The time the current recovery period started due to the detection of lossor ECN. When a packet sent after this time is acknowledged, QUIC exitscongestion recovery.¶
Slow start threshold in bytes. When the congestion window is below ssthresh,the mode is slow start and the window grows by the number of bytesacknowledged.¶
The congestion control pseudocode also accesses some of the variables from theloss recovery pseudocode.¶
At the beginning of the connection, initialize the congestion controlvariables as follows:¶
congestion_window = kInitialWindowbytes_in_flight = 0congestion_recovery_start_time = 0ssthresh = infinitefor pn_space in [ Initial, Handshake, ApplicationData ]: ecn_ce_counters[pn_space] = 0¶
Whenever a packet is sent and it contains non-ACK frames, the packetincreases bytes_in_flight.¶
OnPacketSentCC(sent_bytes): bytes_in_flight += sent_bytes¶
This is invoked from loss detection's OnAckReceived and is supplied with thenewly acked_packets from sent_packets.¶
In congestion avoidance, implementers that use an integer representationfor congestion_window should be careful with division and can usethe alternative approach suggested inSection 2.1 of [RFC3465].¶
InCongestionRecovery(sent_time): return sent_time <= congestion_recovery_start_timeOnPacketsAcked(acked_packets): for acked_packet in acked_packets: OnPacketAcked(acked_packet)OnPacketAcked(acked_packet): if (!acked_packet.in_flight): return; // Remove from bytes_in_flight. bytes_in_flight -= acked_packet.sent_bytes // Do not increase congestion_window if application // limited or flow control limited. if (IsAppOrFlowControlLimited()) return // Do not increase congestion window in recovery period. if (InCongestionRecovery(acked_packet.time_sent)): return if (congestion_window < ssthresh): // Slow start. congestion_window += acked_packet.sent_bytes else: // Congestion avoidance. congestion_window += max_datagram_size * acked_packet.sent_bytes / congestion_window¶
This is invoked from ProcessECN and OnPacketsLost when a new congestion event isdetected. If not already in recovery, this starts a recovery period andreduces the slow start threshold and congestion window immediately.¶
OnCongestionEvent(sent_time): // No reaction if already in a recovery period. if (InCongestionRecovery(sent_time)): return // Enter recovery period. congestion_recovery_start_time = now() ssthresh = congestion_window * kLossReductionFactor congestion_window = max(ssthresh, kMinimumWindow) // A packet can be sent to speed up loss recovery. MaybeSendOnePacket()¶
This is invoked when an ACK frame with an ECN section is received from the peer.¶
ProcessECN(ack, pn_space): // If the ECN-CE counter reported by the peer has increased, // this could be a new congestion event. if (ack.ce_counter > ecn_ce_counters[pn_space]): ecn_ce_counters[pn_space] = ack.ce_counter sent_time = sent_packets[ack.largest_acked].time_sent OnCongestionEvent(sent_time)¶
This is invoked when DetectAndRemoveLostPackets deems packets lost.¶
OnPacketsLost(lost_packets): sent_time_of_last_loss = 0 // Remove lost packets from bytes_in_flight. for lost_packet in lost_packets: if lost_packet.in_flight: bytes_in_flight -= lost_packet.sent_bytes sent_time_of_last_loss = max(sent_time_of_last_loss, lost_packet.time_sent) // Congestion event if in-flight packets were lost if (sent_time_of_last_loss != 0): OnCongestionEvent(sent_time_of_last_loss) // Reset the congestion window if the loss of these // packets indicates persistent congestion. // Only consider packets sent after getting an RTT sample. if (first_rtt_sample == 0): return pc_lost = [] for lost in lost_packets: if lost.time_sent > first_rtt_sample: pc_lost.insert(lost) if (InPersistentCongestion(pc_lost)): congestion_window = kMinimumWindow congestion_recovery_start_time = 0¶
When Initial or Handshake keys are discarded, packets sent in that space nolonger count toward bytes in flight.¶
Pseudocode for RemoveFromBytesInFlight follows:¶
RemoveFromBytesInFlight(discarded_packets): // Remove any unacknowledged packets from flight. foreach packet in discarded_packets: if packet.in_flight bytes_in_flight -= size¶
The IETF QUIC Working Group received an enormous amount of support from manypeople. The following people provided substantive contributions to thisdocument:¶