Measuring and (slightly) Improving Post-Quantum Handshake Performance

2024-12-17

To defend against the potential advent of "Cryptographically Relevant Quantum Computers", there is a move to using "hybrid" key exchange algorithms. These glue together a widely deployed classical algorithm (like X25519) and a new post-quantum-secure algorithm (like ML-KEM) and treat the result as one TLS-level key exchange algorithm (like X25519MLKEM768).

In this report, first we'll measure the additional cost of post-quantum-secure key exchange. Then we'll describe and measure an optimization we have implemented.

Headline measurements

All these measurements are taken on our amd64 benchmarking machine, which has a Xeon E-2386G CPU. We'll compare:

rustls using post-quantum-insecure X25519 key exchange,
rustls using post-quantum-secure X25519MLKEM768 key exchange, and
OpenSSL 3.3.2 using post-quantum-insecure X25519 key exchange.

All three are taken on the same hardware, and the latter measurements are from our previous report -- which also contains reproduction instructions and describes what the benchmarks measure.

One important thing to note is that post-quantum key exchange involves sending and receiving much larger messages than classical key exchange. Our benchmark design only covers CPU costs -- and does not include networking -- so real-world performance will be worse than these measurements.
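To give a rough sense of "much larger", the key share sizes can be worked out from the public sizes of the two algorithms. This is a sketch, not rustls code: it assumes the hybrid share is a plain concatenation of the ML-KEM-768 and X25519 values, and the function names are made up for illustration.

```rust
// Sizes in bytes of the public values each algorithm puts on the wire.
const X25519_PUBLIC_KEY_LEN: usize = 32;
const MLKEM768_ENCAPSULATION_KEY_LEN: usize = 1184;
const MLKEM768_CIPHERTEXT_LEN: usize = 1088;

/// Key exchange bytes the client sends for one key share.
/// Assumption: a hybrid share is the two public values concatenated.
fn client_share_len(hybrid: bool) -> usize {
    if hybrid {
        // ML-KEM-768 encapsulation key plus an X25519 public key.
        MLKEM768_ENCAPSULATION_KEY_LEN + X25519_PUBLIC_KEY_LEN
    } else {
        X25519_PUBLIC_KEY_LEN
    }
}

/// Key exchange bytes the server sends back for the selected share.
fn server_share_len(hybrid: bool) -> usize {
    if hybrid {
        // ML-KEM-768 ciphertext plus an X25519 public key.
        MLKEM768_CIPHERTEXT_LEN + X25519_PUBLIC_KEY_LEN
    } else {
        X25519_PUBLIC_KEY_LEN
    }
}
```

Under these assumptions the client's share grows from 32 to 1216 bytes and the server's from 32 to 1120 bytes, before any TLS framing is added.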

client handshake performance results on amd64 architecture

server handshake performance results on amd64 architecture

The cost of X25519MLKEM768 post-quantum key exchange is clearly visible for both clients and servers.

We can see that the performance headroom that rustls has attained means we can almost completely absorb the extra cost of post-quantum key exchange, while still performing better than (post-quantum-insecure) OpenSSL -- with the exception of client resumption.

We will do further comparative benchmarking in this area when OpenSSL gains post-quantum key exchange support.

Sharing X25519 setup costs

Background

In TLS1.3, the client starts the key exchange in its first message (the ClientHello). The ClientHello includes both a description of which algorithms the client supports, and zero or more presumptive "key shares".

The server then evaluates which algorithms it is willing to use, and either uses one of the presumptive key shares, or replies with a HelloRetryRequest, which instructs the client to send a new ClientHello with a specific, mutually-acceptable key share.

A HelloRetryRequest can be expensive, because it introduces an additional round trip into the handshake. It also means any work the client did for its presumptive key shares is wasted.
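The server's side of this decision can be sketched as follows. This is one possible policy (strict server preference order), not a description of how rustls or any other server chooses; the `Group`, `ServerChoice` and `choose` names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Group {
    X25519,
    X25519MLKEM768,
    Secp256r1,
}

#[derive(Debug, PartialEq, Eq)]
enum ServerChoice {
    /// The client already sent a key share for the selected group.
    UseKeyShare(Group),
    /// The client supports the group but sent no share for it: one more round trip.
    HelloRetryRequest(Group),
    /// No mutually supported group, so the handshake fails.
    NoCommonGroup,
}

fn choose(
    server_preferences: &[Group],
    client_supported: &[Group],
    client_key_shares: &[Group],
) -> ServerChoice {
    // Pick the first group, in server preference order, that the client supports.
    let selected = server_preferences
        .iter()
        .copied()
        .find(|group| client_supported.contains(group));

    match selected {
        Some(group) if client_key_shares.contains(&group) => ServerChoice::UseKeyShare(group),
        Some(group) => ServerChoice::HelloRetryRequest(group),
        None => ServerChoice::NoCommonGroup,
    }
}
```

For example, a server that only supports X25519 returns `HelloRetryRequest(Group::X25519)` to a client that supports both groups but only sent an X25519MLKEM768 share, and `UseKeyShare(Group::X25519)` if the client sent shares for both.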

It's therefore advantageous for a client to avoid HelloRetryRequests, by:

Having prior knowledge of the server's preferences. draft-ietf-tls-key-share-prediction is an effort to standardize a mechanism for a client to learn this out-of-band.
Remembering a server's preferences from a previous connection. rustls has done this since adding support for TLS1.3 in 2017. This generally means a client making many connections to one server may avoid repeated HelloRetryRequests.
Sending many presumptive key shares, though there's an obvious trade-off in terms of wasted computation and message size.
Following ecosystem preferences. X25519 key exchange is overwhelmingly preferred in TLS1.3 implementations, due to its performance and implementation quality.
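The second strategy, remembering a server's preference, amounts to a small per-server lookup. The sketch below is illustrative only and is not how rustls stores this state; `GroupHints` and its methods are hypothetical names, and groups are represented by their TLS NamedGroup code points.

```rust
use std::collections::HashMap;

// TLS NamedGroup code points.
const X25519: u16 = 0x001d;
const X25519MLKEM768: u16 = 0x11ec;

/// Hypothetical per-server memory of the group each server last selected.
struct GroupHints {
    by_server: HashMap<String, u16>,
}

impl GroupHints {
    fn new() -> Self {
        Self { by_server: HashMap::new() }
    }

    /// Record the group a server selected on a previous connection.
    fn remember(&mut self, server_name: &str, group: u16) {
        self.by_server.insert(server_name.to_string(), group);
    }

    /// Group to send a presumptive key share for on the next connection,
    /// falling back to `default_group` for servers we have not seen before.
    fn presumptive_group(&self, server_name: &str, default_group: u16) -> u16 {
        self.by_server
            .get(server_name)
            .copied()
            .unwrap_or(default_group)
    }
}
```

A client would call `remember` after each handshake completes and `presumptive_group` before building the next ClientHello for the same server name.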

The key shares in a ClientHello would look like:

diagram of TLS1.3 client key exchange with X25519MLKEM768

At least for a transitional period, we want to avoid a HelloRetryRequest round trip when connecting to a server that hasn't been upgraded to support X25519MLKEM768. That means also offering a separate X25519 key share:

diagram of TLS1.3 client key exchange with X25519MLKEM768 and X25519

However, this arrangement is not optimal. While X25519 setup is very fast, we are doing it twice, and we are guaranteed to throw away half of that work, because the server can only ever select one key share to use.

Instead, we can do:

diagram of TLS1.3 optimized client key exchange with X25519MLKEM768 and X25519

This report measures the benefit of that optimization.

This optimization is described further in draft-ietf-tls-hybrid-design section 3.2.
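The difference between the two arrangements can be sketched as below. This is not rustls code: `KeyGen` is a hypothetical stand-in that only counts key generations and returns placeholder bytes rather than real key material, and the sketch assumes the hybrid share is the ML-KEM-768 encapsulation key followed by the X25519 public key.

```rust
/// Stand-in key generator: counts calls and returns placeholder bytes,
/// not real keys. A real client would also keep the matching private keys.
struct KeyGen {
    x25519_calls: u32,
    mlkem768_calls: u32,
}

impl KeyGen {
    fn new() -> Self {
        Self { x25519_calls: 0, mlkem768_calls: 0 }
    }

    fn x25519_public_key(&mut self) -> [u8; 32] {
        self.x25519_calls += 1;
        // Each call yields a distinct placeholder value.
        [self.x25519_calls as u8; 32]
    }

    fn mlkem768_encapsulation_key(&mut self) -> Vec<u8> {
        self.mlkem768_calls += 1;
        vec![0xaa; 1184]
    }
}

struct Shares {
    hybrid: Vec<u8>,    // X25519MLKEM768 key share
    classical: Vec<u8>, // standalone X25519 key share
}

/// Two independent X25519 key generations; one is always wasted.
fn shares_unoptimized(keygen: &mut KeyGen) -> Shares {
    let mut hybrid = keygen.mlkem768_encapsulation_key();
    hybrid.extend_from_slice(&keygen.x25519_public_key());
    let classical = keygen.x25519_public_key().to_vec();
    Shares { hybrid, classical }
}

/// One X25519 key generation, reused in both key shares.
fn shares_optimized(keygen: &mut KeyGen) -> Shares {
    let x25519 = keygen.x25519_public_key();
    let mut hybrid = keygen.mlkem768_encapsulation_key();
    hybrid.extend_from_slice(&x25519);
    Shares { hybrid, classical: x25519.to_vec() }
}
```

In the optimized variant the last 32 bytes of the hybrid share are identical to the standalone X25519 share, so whichever share the server selects, the single X25519 private key is the one the client needs.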

Micro benchmarking

First, we can micro-benchmark the time to construct and serialize a ClientHello, in a variety of situations:

X25519 key share included only.
X25519MLKEM768 and X25519 key shares, with the optimization.
X25519MLKEM768 and X25519 key shares, without the optimization.

We run this on two machines that cover both amd64 (Xeon E-2386G) and aarch64 (Ampere Altra Q80-30) architectures.

micro benchmark results on amd64 architecture

micro benchmark results on aarch64 architecture

From this we can see:

There is a small but measurable benefit, as expected.
ML-KEM-768 key generation is significantly more expensive than X25519 key generation.

Whole handshakes

Next, let's measure the same scenarios in the context of whole client handshakes. The remaining measurements are only done on our amd64 benchmark machine.

The above optimization only affects the client's first message, so now we'll see whether the effect of the optimization is meaningful when compared to the rest of the computation a client must do.