Cryptographic libraries comparison framework
Contents:
- Purpose
- Software support
- Building and prerequisites
- Command line usage
- Test results
- Bugs and limitations
- Miscellaneous observations
This project presents a comparison of cryptographic libraries from two perspectives:
- Comparing libraries: on each system, the time of each operation is given in microseconds.
- Comparing processor efficiency: for each processor, a "standard" performance test is run first. Then, for each cryptographic operation, a "score" is reported, based on the standard test of the processor.
The "score" compares the efficiency of an algorithm between processors. It is the ratioof the processing time of a cryptographic operation over the "standard" performance test.
If a given cryptographic operation had the same efficiency on all processors, its score would be the same on all processors, whether slow low-end processors or fast high-end processors.
If a score on a processor A is higher than on a processor B, this means that the corresponding cryptographic operation is less efficient on A (it takes more time, relative to the "standard" performance of the processor).
For a given cryptographic library, if an operation is implemented with the same source code on distinct processors, the scores for that operation are direct indications of the relative efficiencies of the two processors. However, if the two scores are extremely different, this may be an indication that the operation was compiled with accelerated instructions on the processor with the lower score and with portable source code on the processor with the higher score.
The "scores" are relative values within a given cryptographic operation.The absolute values have no significance. Comparing the scores of differentcryptographic operations is pointless.
The "standard" test uses integer arithmetics and bitwise operations. It is supposed tobe representative of the basic operations in symmetric and asymmetric cryptography.There is no floating point tests since this type of data is not used in cryptography.Currently, this test is very basic (see filesrc/bench_reference.cpp
). Ideas toimprove it are welcome.
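For illustration only, a loop of this kind could look like the sketch below. This is not the content of `src/bench_reference.cpp`, just a minimal example of a measurement based on 64-bit integer arithmetic and bitwise operations, with arbitrary constants.

```cpp
#include <cstdint>
#include <cstdio>

// Hedged sketch of a "standard" reference loop: integer arithmetic and bitwise
// operations only, no floating point. The constants are arbitrary.
uint64_t reference_loop(uint64_t iterations)
{
    uint64_t x = 0x0123456789ABCDEFULL;
    for (uint64_t i = 0; i < iterations; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; // multiply-add
        x ^= x >> 29;                                            // shift and xor
        x = (x << 7) | (x >> 57);                                // rotate
    }
    return x; // return the result so the compiler cannot remove the loop
}

int main()
{
    std::printf("%016llx\n", (unsigned long long)reference_loop(1000000));
    return 0;
}
```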
The results are presented in the file RESULTS.md.
The `cryptobench` program can be built on:
- Linux (tested on Ubuntu, Debian, Fedora, Redhat)
- macOS
Tested architectures:
- Intel x86_64 (aka amd64)
- Arm64 (aka aarch64)
Tested cryptographic libraries:
- OpenSSL
- GnuTLS
- MbedTLS
- Nettle (also used through GnuTLS)
- Libtomcrypt (with libtommath and/or GMP as math backends)
Tested algorithms:
- RSA (2048 and 4096 bits, encryption and signature)
- AES (128 and 256 bits, CBC mode on large data)
- 2048-bit modular arithmetic using OpenSSL
The modular arithmetic tests give performance indicators on the various elementary operations which are used by RSA. These tests may help to understand the differences in RSA performance between systems.
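As a hedged illustration of what such an elementary operation looks like (this is not the project's actual test code), the following sketch times one 2048-bit modular exponentiation through the public OpenSSL BIGNUM API (link with `-lcrypto`):

```cpp
#include <openssl/bn.h>
#include <chrono>
#include <cstdio>

int main()
{
    BN_CTX* ctx = BN_CTX_new();
    BIGNUM* a = BN_new();
    BIGNUM* e = BN_new();
    BIGNUM* m = BN_new();
    BIGNUM* r = BN_new();

    // 2048-bit random operands; the modulus is forced to be odd, like an RSA modulus.
    BN_rand(a, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ANY);
    BN_rand(e, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ANY);
    BN_rand(m, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ODD);

    // Time one modular exponentiation, the operation at the heart of RSA.
    const auto start = std::chrono::high_resolution_clock::now();
    BN_mod_exp(r, a, e, m, ctx);
    const auto end = std::chrono::high_resolution_clock::now();

    std::printf("BN_mod_exp: %.1f microseconds\n",
                std::chrono::duration<double, std::micro>(end - start).count());

    BN_free(a); BN_free(e); BN_free(m); BN_free(r);
    BN_CTX_free(ctx);
    return 0;
}
```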
Software prerequisites to build `cryptobench`:
- Ubuntu/Debian: `sudo apt install libssl-dev libmbedtls-dev libgnutls28-dev nettle-dev libtomcrypt-dev libtommath-dev libgmp-dev`
- Fedora/Redhat: `sudo dnf install openssl-devel mbedtls-devel gnutls-devel nettle-devel libtomcrypt-devel libtommath-devel gmp-devel`
- macOS: `brew install openssl mbedtls gnutls nettle libtomcrypt libtommath gmp`
Note: On Ubuntu/Debian, the package `libgnutls28-dev` is probably misnamed. It is the GnuTLS development environment for the latest versions of GnuTLS, 3.x.x as of this writing. It is not (or no longer) GnuTLS version 2.8.
Just run `make` to build the `cryptobench` executable. All objects are produced in the subdirectory `build`.
Without parameters, the `cryptobench` program runs all tests for all libraries. Each cryptographic operation is repeatedly run for at least 2 seconds.
Using command line options, it is possible to run only selected tests on selected libraries. Any combination is possible.
```
$ cryptobench --help
Command line options:
  -h, --help              : display this help text
  -m sec, --min-time sec  : minimum run time of each test in seconds
  --min-iterations count  : minimum number of iteration for each test
  --aes-data-size count   : data size for AES-CBC tests
  -r, --reference         : run reference benchmark only
Algorithms selection (all by default):
  --aes, --rsa, --math
Crytographic libraries selection (all by default):
  --gnutls, --mbedtls, --openssl, --nettle, --tomcrypt, --tomcrypt-gmp, --arm64
```
The `--math` option selects the individual modular arithmetic tests which characterize RSA operations.
The `arm64` pseudo-library is an implementation of AES using specialized Arm64 accelerated instructions, through C/C++ intrinsics. It is run only on Arm64 processors supporting FEAT_AES.
Test reports are stored in the subdirectory `reports` and saved in the git repository for reference.
The script `tools/run-test.sh` produces a benchmark report on the current system. All tests are run for 5 seconds each, so the execution may take some time, 12 minutes or more. It also produces a description of the current system for reference.
Typical usage:
```
$ tools/run-test.sh >reports/system-cpu-os.txt
```
On some systems, the CPU is not correctly identified by the script. Be sure to verify the line containing `system: cpu:` in the report file and manually fix it when necessary before committing to git.
To regenerate the file RESULTS.md, use the following script:

```
$ tools/analyze-reports.py reports/*.txt >RESULTS.md
```
GnuTLS: It is unclear which type of padding is used for RSA encrypt/decrypt. How can we do OAEP with GnuTLS? What kind of padding does `gnutls_pubkey_encrypt_data()` use? With sign/verify, using PSS is explicit, but there is no equivalent parameter in the RSA encryption functions.
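The asymmetry is visible in the API itself. In the following hedged sketch (not the project's code, assuming GnuTLS 3.6 or later; link with `-lgnutls`), the signature call takes an explicit PSS algorithm identifier while the encryption call has no padding parameter at all:

```cpp
#include <gnutls/gnutls.h>
#include <gnutls/abstract.h>
#include <cstdio>

int main()
{
    gnutls_global_init();

    gnutls_privkey_t priv;
    gnutls_pubkey_t pub;
    gnutls_privkey_init(&priv);
    gnutls_pubkey_init(&pub);
    gnutls_privkey_generate(priv, GNUTLS_PK_RSA, 2048, 0);
    gnutls_pubkey_import_privkey(pub, priv, 0, 0);

    gnutls_datum_t msg;
    msg.data = (unsigned char*)"hello";
    msg.size = 5;
    gnutls_datum_t sig = {nullptr, 0};
    gnutls_datum_t cipher = {nullptr, 0};

    // Signature: the PSS padding is explicitly requested through the algorithm id.
    int rc = gnutls_privkey_sign_data2(priv, GNUTLS_SIGN_RSA_PSS_SHA256, 0, &msg, &sig);
    std::printf("sign (PSS): %s\n", gnutls_strerror(rc));

    // Encryption: no parameter selects the padding scheme (OAEP vs. PKCS#1 v1.5).
    rc = gnutls_pubkey_encrypt_data(pub, 0, &msg, &cipher);
    std::printf("encrypt: %s\n", gnutls_strerror(rc));

    gnutls_free(sig.data);
    gnutls_free(cipher.data);
    gnutls_pubkey_deinit(pub);
    gnutls_privkey_deinit(priv);
    gnutls_global_deinit();
    return 0;
}
```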
MbedTLS: PKCS#1 v1.5 padding is used on RSA signatures instead of PSS. With MbedTLS v3, trying to sign with PSS returns the error code `MBEDTLS_ERR_PK_BAD_INPUT_DATA`. With MbedTLS v2, the function `mbedtls_pk_sign_ext()`, which includes a padding argument, does not exist.
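For reference, here is a hedged sketch of the kind of call involved (not the project's code), assuming a default Mbed TLS 3.x build where `mbedtls_pk_sign_ext()` is available; link with `-lmbedcrypto`:

```cpp
#include <mbedtls/pk.h>
#include <mbedtls/rsa.h>
#include <mbedtls/entropy.h>
#include <mbedtls/ctr_drbg.h>
#include <cstdio>

int main()
{
    mbedtls_entropy_context entropy;
    mbedtls_ctr_drbg_context drbg;
    mbedtls_pk_context pk;

    mbedtls_entropy_init(&entropy);
    mbedtls_ctr_drbg_init(&drbg);
    mbedtls_pk_init(&pk);

    mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy, nullptr, 0);
    mbedtls_pk_setup(&pk, mbedtls_pk_info_from_type(MBEDTLS_PK_RSA));
    mbedtls_rsa_gen_key(mbedtls_pk_rsa(pk), mbedtls_ctr_drbg_random, &drbg, 2048, 65537);

    unsigned char hash[32] = {0};   // stand-in for a SHA-256 digest, illustration only
    unsigned char sig[MBEDTLS_PK_SIGNATURE_MAX_SIZE];
    size_t sig_len = 0;

    // Attempt an RSA-PSS signature; according to the observation above, this
    // call returns MBEDTLS_ERR_PK_BAD_INPUT_DATA on the tested MbedTLS v3 builds.
    int rc = mbedtls_pk_sign_ext(MBEDTLS_PK_RSASSA_PSS, &pk, MBEDTLS_MD_SHA256,
                                 hash, sizeof(hash), sig, sizeof(sig), &sig_len,
                                 mbedtls_ctr_drbg_random, &drbg);
    std::printf("mbedtls_pk_sign_ext: %d\n", rc);

    mbedtls_pk_free(&pk);
    mbedtls_ctr_drbg_free(&drbg);
    mbedtls_entropy_free(&entropy);
    return 0;
}
```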
Nettle: RSA tests are currently disabled. There is an issue loading public and private key files, as created by OpenSSL. The parsing of the DER sequence fails. See the source file `src/lib_nettle.cpp` for a fix. To be investigated...
Looking at RESULTS.md, we see that the RSA performances are surprisingly low on Ampere Altra CPU's. By "low", we mean that the RSA operations (encryption, decryption, signature, verification) are slow, relative to any other cryptographic operation on the same CPU. On other CPU's, the RSA operations run much faster, again relative to the performance of their respective CPU. This is where the "relative performance score" indicators are useful.
No need to blame the Arm ISA: other Arm CPU's such as the AWS Graviton 3 or the Apple M1 do not have this performance bias on RSA. They execute RSA operations with the same relative efficiency as any other cryptographic operation.
Looking at the OpenSSL modular arithmetic operations benchmarks at the end of RESULTS.md, we see the exact same bias, on the Ampere Altra only, on the modular multiplication (Montgomery algorithm only), the modular square root and the modular exponentiation.
This means that the root cause is the implementation of the Montgomery algorithm for the multiplication. The four basic RSA operations are based on the modular exponentiation, which in turn uses the Montgomery multiplication. And the modular square root is based on the modular exponentiation too.
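To make this dependency explicit, here is a hedged, simplified sketch (toy 64-bit values, not real RSA sizes, using the GCC/Clang `__uint128_t` extension): modular exponentiation is just a loop of modular multiplications, so the cost of the multiplication routine, the Montgomery code in OpenSSL, dominates everything built on top of it.

```cpp
#include <cstdint>
#include <cstdio>

// Portable stand-in for the modular multiplication. In OpenSSL, this is where
// the Montgomery algorithm (hand-written assembly on Arm64) does the work.
static uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t m)
{
    return static_cast<uint64_t>((__uint128_t)a * b % m);
}

// Square-and-multiply modular exponentiation: every RSA operation reduces to
// this loop, so its cost is roughly "number of exponent bits x cost of mod_mul".
static uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp != 0) {
        if (exp & 1) {
            result = mod_mul(result, base, m);  // multiply step
        }
        base = mod_mul(base, base, m);          // square step
        exp >>= 1;
    }
    return result;
}

int main()
{
    // Tiny illustrative values only.
    std::printf("%llu\n", (unsigned long long)mod_exp(7, 65537, 1000000007ULL));
    return 0;
}
```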
So, what is different in the Ampere Altra CPU to impact the Montgomery algorithm and nothing else?
The answer is the difference in CPU core, and especially the implementation of the 64-bit multiplication, in the Ampere Altra compared to the AWS Graviton 3 or the Apple M1.
Ampere Altra is based on an Arm Neoverse N1 core. AWS Graviton 3 is based on a more recent Arm Neoverse V1 core. Apple M1 is based on Apple Firestorm/Icestorm cores. In the tested systems, the Ampere Altra is the only one with a Neoverse N1 core (note: the older Graviton 2 is also based on the Neoverse N1 but I did not have access to any).
To justify this conclusion, let's investigate how OpenSSL implements the Montgomery algorithm. Because it is highly critical for asymmetric cryptography performance, this algorithm is implemented in assembly code (source code for Armv8 here).
This implementation makes heavy use of the MUL and UMULH instructions. Based on the public documentation from Arm, we see that the performance of these instructions is significantly lower on the Neoverse N1 and much better on the Neoverse V1.
Reference public documents:
- Arm Neoverse N1 Software Optimization Guide
- Arm Neoverse V1 Software Optimization Guide
These documents precisely describe the performance of each instruction. Let's check the latency of the MUL and UMULH instructions. The following table gives the number of cycles per instruction. The additional number in parentheses, when present, is described as follows in the N1 documentation: "Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M uop can be issued to that pipeline, with N shown in parentheses."
| Latency | Neoverse N1 | Neoverse V1 |
|---|---|---|
| MADD, MUL, W-form (32 bits) | 2 (1) | 2 (1) |
| MADD, MUL, X-form (64 bits) | 4 (3) | 2 (1) |
| UMULH | 5 (3) | 3 |
Surprisingly, on the N1, the 64-bit multiplication (MUL) is much more expensive than the 32-bit one. On the V1, they have the same cost. In short, 64-bit MUL and UMULH are twice as expensive on the N1, compared to the V1.
Looking at the OpenSSL assembly source code armv8-mont for the Montgomery algorithm, we see sequences of adjacent instructions such as "UMULH, MUL, UMULH, MUL, MOV, UMULH, MUL, SUBS, UMULH, ADC". All these instructions - except MOV - use the M pipeline, which consequently becomes a bottleneck. On the N1, each pair of adjacent MUL, UMULH costs 15 cycles (4 + 3 cycles for the 64-bit MUL plus 5 + 3 cycles for UMULH), mostly because of the extra stall.
This bottleneck does not exist on the Arm Neoverse V1. Each pair of adjacent MUL, UMULH only costs 6 cycles (2 + 1 for MUL plus 3 for UMULH), instead of 15.
There is no equivalent public document about the Apple Firestorm/Icestorm cores but we can assume that there is no such issue either.
To confirm this diagnosis, we tweaked the OpenSSL assembly code for the Montgomery algorithm. First, we replaced all UMULH instructions with MUL. Then, we replaced all MUL and UMULH instructions with NOP. The result is of course no longer functional, but we focus only on the execution time.
The table below evaluates the execution time in microseconds of the Montgomery algorithm on the various CPU cores, with and without MUL or UMULH instructions. Thus, we evaluate the marginal cost of these specific instructions.
| Execution time (microseconds) | Neoverse N1 | Neoverse V1 | Apple M1 |
|---|---|---|---|
| OpenSSL original source code | 4.82 | 1.45 | 1.04 |
| Replace all UMULH with MUL | 4.13 | 1.17 | 1.04 |
| Replace all UMULH and MUL with NOP | 1.06 | 1.01 | 1.00 |
We can see that the MUL and UMULH instructions have a huge cost on the Neoverse N1, much higher than on the other CPU cores.
This explains why RSA is so slow on CPU's which are based on the Arm Neoverse N1 core. This problem no longer exists on the Arm Neoverse V1 and more recent Arm cores.
The investigation in the previous section started from a visible effect: poor RSA performance on Ampere Altra (Neoverse N1) compared to Graviton 3 (Neoverse V1), as well as compared to Intel CPU's, relative to the standard performance of each CPU. The identified reason is the sequence of interspersed MUL and UMULH.
Out of curiosity, let's look at the OpenSSL assembly code for the Montgomery multiplication again and let's try to find other characteristic sequences of instructions involving multiplications.
We easily spot another long sequence of ADCS MUL ADCS UMULH instructions. So, let's try to characterize it.
This time, we try to evaluate the mean time of an instruction inside a given sequence. We start with a sequence of NOP to have a reference. Then we test multiple combinations of instructions. Most sequences are 8 instructions long and are repeated a large number of times.
In each 8-instruction sequence, the output registers of the instructions are all distinct, to avoid execution dependencies. The input registers are the same in all instructions. The same sequence (and consequently the same output registers) is reused after 8 instructions.
Last, we test the full long sequence from OpenSSL, with many MUL ADCS UMULH ADCS (there are also some ADC and ADDS). Finally, we test the same sequence with ADD instead of ADCS, ADC, ADDS.
See the test source code for the details.
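As a hedged sketch of the measurement technique (not the actual test source code), the loop below times a sequence of 8 MUL instructions with distinct output registers and identical input registers, assuming GCC or Clang on Arm64:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
    constexpr uint64_t ITERATIONS = 100000000;
    uint64_t a = 0x123456789ABCDEF0ULL;
    uint64_t b = 0x0FEDCBA987654321ULL;
    uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

    const auto start = std::chrono::high_resolution_clock::now();
    for (uint64_t i = 0; i < ITERATIONS; ++i) {
        // 8 multiplications with distinct output registers and the same inputs,
        // so that there is no dependency between the instructions.
        asm volatile(
            "mul %0, %8, %9 \n\t"
            "mul %1, %8, %9 \n\t"
            "mul %2, %8, %9 \n\t"
            "mul %3, %8, %9 \n\t"
            "mul %4, %8, %9 \n\t"
            "mul %5, %8, %9 \n\t"
            "mul %6, %8, %9 \n\t"
            "mul %7, %8, %9 \n\t"
            : "=&r"(r0), "=&r"(r1), "=&r"(r2), "=&r"(r3),
              "=&r"(r4), "=&r"(r5), "=&r"(r6), "=&r"(r7)
            : "r"(a), "r"(b));
    }
    const auto end = std::chrono::high_resolution_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(end - start).count();
    std::printf("mean time per instruction: %.3f ns\n", ns / (ITERATIONS * 8.0));
    return 0;
}
```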
The results are summarized below:
| Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
|---|---|---|---|---|---|---|
| NOP | 0.419 | 0.043 | 0.000 | 0.000 | 0.020 | 0.011 |
| ADD | 0.209 | 0.042 | 0.079 | 0.037 | 0.041 | 0.014 |
| ADC | 0.209 | 0.042 | 0.079 | 0.058 | 0.078 | 0.042 |
| ADDS | 0.209 | 0.042 | 0.085 | 0.087 | 0.078 | 0.041 |
| ADCS | 0.418 | 0.250 | 0.337 | 0.267 | 0.274 | 0.219 |
| MUL | 1.532 | 0.918 | 0.144 | 0.115 | 0.117 | 0.093 |
| MUL UMULH | 1.810 | 1.085 | 0.144 | 0.115 | 0.117 | 0.093 |
| MUL ADCS UMULH ADCS | 0.835 | 0.501 | 0.248 | 0.204 | 0.117 | 0.095 |
| MUL ADCS | 0.696 | 0.417 | 0.144 | 0.134 | 0.117 | 0.093 |
| MUL ADD UMULH ADD | 0.835 | 0.501 | 0.092 | 0.064 | 0.064 | 0.035 |
| Full OpenSSL sequence (MUL & ADCS) | 0.875 | 0.524 | 0.130 | 0.098 | 0.065 | 0.050 |
| Same sequence with ADD only | 0.875 | 0.524 | 0.106 | 0.068 | 0.062 | 0.052 |
The results are quite surprising when comparing the Neoverse N1 and V1.
For a start, NOP is not measurable on the Neoverse V1 while its execution gives perfectly measurable execution times on other cores (even twice as long as the addition on the Cortex A72). But let's not focus on NOP.
On the Neoverse N1, interleaving MUL and UMULH is not good (we already knew that). Having said that, the 64-bit multiplication alone is not that good either. Replacing the interspersed UMULH with MUL is slightly faster but not that much (0.918 nanoseconds per instruction instead of 1.085).
On the Neoverse V1, MUL and UMULH are much faster, for the reasons which were explained before. Additionally, MUL and UMULH have the same execution time (0.144 ns). The situation was clearly improved in the V1.
Since our characteristic sequence in OpenSSL interleaves multiplications and additions, let's focus on the addition for a start.
For the record, ADCS is an addition with input carry (C) and setting the carry on output (S). The variants ADC, ADDS and ADD exist with carry on input only (C), output only (S), or no carry at all.
Using the carry on input, output, or both, creates a potential dependency between instructions which may slow down the execution.
We see that sequences of ADD, ADC, or ADDS execute at the same speed (0.042 ns on the N1). However, the sequence of ADCS uses 0.250 ns per instruction. Unlike the 3 previous sequences, an ADCS needs to wait for the output carry of the previous instruction to use it as input carry. This explains the lower speed.
However, when the carry is not used (ADD), always read but never written (ADC), or always written but never read (ADDS), there is no dependency between instructions and they all execute at the same faster speed.
The important lesson is: when the carry is not used by an instruction A, accessing it from an adjacent instruction B should have no influence on the execution time of A, provided that A and B are otherwise independent (using independent pipelines, etc.).
Let's add two observations:
- The execution times of ADD, ADC, ADDS are strictly identical on the Cortex A72 and Neoverse N1 only. On the Neoverse V1, ADDS is slightly more expensive. On the Apple M1, ADD is cheaper.
- While the multiplication was improved between the Neoverse N1 and V1 (0.918 to 0.144 ns), the addition was degraded (0.042 to 0.079 ns for ADD and 0.250 to 0.337 ns for ADCS). The improvement of the multiplication (-84%) hardly compensates for the degradation of the addition (+88% for ADD, +35% for ADCS). At application level, the improvement is perceptible because the multiplication is much more expensive in absolute time than the addition. However, in relative execution time, one just compensates the other.
When we intersperse ADCS between MUL and/or UMULH, as used in the OpenSSL sequence, the time per instruction drops by half on the Neoverse N1 (0.501 ns instead of 1.085), compared to MUL/UMULH alone.
According to the Arm Neoverse N1 and V1 Software Optimization Guides, MUL and UMULH use the pipeline M or M0. All forms of ADxx use the pipeline I. The two pipelines are independent in the two cores. According to the instruction pseudo-code in the Arm Architecture Reference Manual, MUL and UMULH do not use the carry or any other PSTATE flag.
Therefore, MUL/UMULH and ADCS are completely independent.
Interleaving independent instructions explains the performance boost on the Neoverse N1, from 1.085 nanoseconds per instruction down to 0.501.
However... on the Neoverse V1, when we intersperse ADCS between MUL and UMULH, the mean time per instruction almost doubles (from 0.144 to 0.248 ns), instead of being divided by two as on the Neoverse N1.
Interestingly, if we replace ADCS with ADD (no use of carry), the mean time per instruction drops to 0.092 ns (compared to 0.248 with ADCS and 0.144 with MUL alone). Without using the carry, interleaving independent instructions reduces the mean instruction time by a third, an expected result.
We already observed that the addition was degraded from the Neoverse N1 to the V1. However, the degradation of ADD without carry (+88%) was worse than the degradation of ADCS (+35%). This was evaluated for these instructions alone. When we interleave ADD and then ADCS with MUL/UMULH, the relative improvements are reversed: -82% with ADD and -50% with ADCS. This is still an improvement because of the improvement of the very costly multiplication. However, the improvement is much better with ADD, compared to ADCS, while the degradation of ADD alone is worse than the degradation of ADCS alone.
Consequently, on the Neoverse V1, it is safe to assume that the usage of the carry on addition (and maybe other instructions, still to be tested) has an unexpected impact on the instruction throughput, even though the other instructions execute on a distinct pipeline and do not use the carry. This is new and did not exist in the Neoverse N1.
The impact on the real code sequence in OpenSSL is slightly less important but still around 30%. This sequence is likely executed less often than the sequence of MUL UMULH. Thus, the macroscopic effect on RSA is less noticeable.