Cryptographic libraries comparison framework
Contents:
- Purpose
- Software support
- Building and prerequisites
- Command line usage
- Test results
- Bugs and limitations
- Miscellaneous observations
This project presents a comparison of cryptographic libraries from two perspectives:
- Comparing libraries: on each system, the time of each operation is given in microseconds.
- Comparing processor efficiency: for each processor, a "standard" performance test is run first. Then, for each cryptographic operation, a "score" is reported, based on the standard test of the processor.
The "score" compares the efficiency of an algorithm between processors. It is the ratioof the processing time of a cryptographic operation over the "standard" performance test.
If a given cryptographic operation had the same efficiency on all processors, its score would be the same on all processors, whether slow low-end processors or fast high-end processors.
If a score on a processor A is higher than on a processor B, this means that the corresponding cryptographic operation is less efficient on A (it takes more time, relative to the "standard" performance of the processor).
For a given cryptographic library, if an operation is implemented with the same source code on distinct processors, the scores for that operation are direct indications of the relative efficiencies of the two processors. However, if the two scores are extremely different, this may be an indication that the operation was compiled with accelerated instructions on the processor with the lower score and with portable source code on the processor with the higher score.
The "scores" are relative values within a given cryptographic operation.The absolute values have no significance. Comparing the scores of differentcryptographic operations is pointless.
The "standard" test uses integer arithmetics and bitwise operations. It is supposed tobe representative of the basic operations in symmetric and asymmetric cryptography.There is no floating point tests since this type of data is not used in cryptography.Currently, this test is very basic (see filesrc/bench_reference.cpp
). Ideas toimprove it are welcome.
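For illustration only, a loop of this kind could look like the sketch below. This is not the content of `src/bench_reference.cpp`, just a minimal example of a measurement based on 64-bit integer arithmetic and bitwise operations, with arbitrary constants.

```cpp
#include <cstdint>
#include <cstdio>

// Hedged sketch of a "standard" reference loop: integer arithmetic and bitwise
// operations only, no floating point. The constants are arbitrary.
uint64_t reference_loop(uint64_t iterations)
{
    uint64_t x = 0x0123456789ABCDEFULL;
    for (uint64_t i = 0; i < iterations; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; // multiply-add
        x ^= x >> 29;                                            // shift and xor
        x = (x << 7) | (x >> 57);                                // rotate
    }
    return x; // return the result so the compiler cannot remove the loop
}

int main()
{
    std::printf("%016llx\n", (unsigned long long)reference_loop(1000000));
    return 0;
}
```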
The results are presented in the file RESULTS.md.
The `cryptobench` program can be built on:
- Linux (tested on Ubuntu, Debian, Fedora, Redhat)
- macOS
Tested architectures:
- Intel x86_64 (aka amd64)
- Arm64 (aka aarch64)
Tested cryptographic libraries:
- OpenSSL
- GnuTLS
- MbedTLS
- Nettle (also used through GnuTLS)
- Libtomcrypt (with libtommath and/or GMP as math backends)
Tested algorithms:
- RSA (2048 and 4096 bits, encryption and signature)
- AES (128 and 256 bits, CBC mode on large data)
- 2048-bit modular arithmetic using OpenSSL
The modular arithmetic tests give performance indicators on the various elementary operations which are used by RSA. These tests may help to understand the differences in RSA performance between systems.
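As a hedged illustration of what such an elementary operation looks like (this is not the project's actual test code), the following sketch times one 2048-bit modular exponentiation through the public OpenSSL BIGNUM API (link with `-lcrypto`):

```cpp
#include <openssl/bn.h>
#include <chrono>
#include <cstdio>

int main()
{
    BN_CTX* ctx = BN_CTX_new();
    BIGNUM* a = BN_new();
    BIGNUM* e = BN_new();
    BIGNUM* m = BN_new();
    BIGNUM* r = BN_new();

    // 2048-bit random operands; the modulus is forced to be odd, like an RSA modulus.
    BN_rand(a, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ANY);
    BN_rand(e, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ANY);
    BN_rand(m, 2048, BN_RAND_TOP_ANY, BN_RAND_BOTTOM_ODD);

    // Time one modular exponentiation, the operation at the heart of RSA.
    const auto start = std::chrono::high_resolution_clock::now();
    BN_mod_exp(r, a, e, m, ctx);
    const auto end = std::chrono::high_resolution_clock::now();

    std::printf("BN_mod_exp: %.1f microseconds\n",
                std::chrono::duration<double, std::micro>(end - start).count());

    BN_free(a); BN_free(e); BN_free(m); BN_free(r);
    BN_CTX_free(ctx);
    return 0;
}
```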
Software prerequisites to build `cryptobench`:
- Ubuntu/Debian: `sudo apt install libssl-dev libmbedtls-dev libgnutls28-dev nettle-dev libtomcrypt-dev libtommath-dev libgmp-dev`
- Fedora/Redhat: `sudo dnf install openssl-devel mbedtls-devel gnutls-devel nettle-devel libtomcrypt-devel libtommath-devel gmp-devel`
- macOS: `brew install openssl mbedtls gnutls nettle libtomcrypt libtommath gmp`
Note: On Ubuntu/Debian, the package `libgnutls28-dev` is probably misnamed. It is the GnuTLS development environment for the latest versions of GnuTLS, 3.x.x as of this writing. It is not (or no longer) GnuTLS version 2.8.
Just run `make` to build the `cryptobench` executable. All objects are produced in the subdirectory `build`.
Without parameters, the `cryptobench` program runs all tests for all libraries. Each cryptographic operation is repeatedly run for at least 2 seconds.
Using command line options, it is possible to run only selected tests on selected libraries. Any combination is possible.
```
$ cryptobench --help
Command line options:
  -h, --help              : display this help text
  -m sec, --min-time sec  : minimum run time of each test in seconds
  --min-iterations count  : minimum number of iteration for each test
  --aes-data-size count   : data size for AES-CBC tests
  -r, --reference         : run reference benchmark only
Algorithms selection (all by default):
  --aes, --rsa, --math
Crytographic libraries selection (all by default):
  --gnutls, --mbedtls, --openssl, --nettle, --tomcrypt, --tomcrypt-gmp, --arm64
```
The `--math` option selects the individual modular arithmetic tests which characterize RSA operations.
The `arm64` pseudo-library is an implementation of AES using specialized Arm64 accelerated instructions, through C/C++ intrinsics. It is run only on Arm64 processors supporting FEAT_AES.
Test reports are stored in the subdirectory `reports` and saved in the git repository for reference.
The script `tools/run-test.sh` produces a benchmark report on the current system. All tests are run for 5 seconds each, so the execution may take some time, 12 minutes or more. It also produces a description of the current system for reference.
Typical usage:
```
$ tools/run-test.sh >reports/system-cpu-os.txt
```
On some systems, the CPU is not correctly identified by the script. Be sure to verify the line containing `system: cpu:` in the report file and manually fix it when necessary before committing to git.
To regenerate the file RESULTS.md, use the following script:

```
$ tools/analyze-reports.py reports/*.txt >RESULTS.md
```
GnuTLS: It is unclear which type of padding is used for RSA encrypt/decrypt. How can we do OAEP with GnuTLS? What kind of padding does `gnutls_pubkey_encrypt_data()` use? With sign/verify, using PSS is explicit, but there is no equivalent parameter in the RSA encryption functions.
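The asymmetry is visible in the API itself. In the following hedged sketch (not the project's code, assuming GnuTLS 3.6 or later; link with `-lgnutls`), the signature call takes an explicit PSS algorithm identifier while the encryption call has no padding parameter at all:

```cpp
#include <gnutls/gnutls.h>
#include <gnutls/abstract.h>
#include <cstdio>

int main()
{
    gnutls_global_init();

    gnutls_privkey_t priv;
    gnutls_pubkey_t pub;
    gnutls_privkey_init(&priv);
    gnutls_pubkey_init(&pub);
    gnutls_privkey_generate(priv, GNUTLS_PK_RSA, 2048, 0);
    gnutls_pubkey_import_privkey(pub, priv, 0, 0);

    gnutls_datum_t msg;
    msg.data = (unsigned char*)"hello";
    msg.size = 5;
    gnutls_datum_t sig = {nullptr, 0};
    gnutls_datum_t cipher = {nullptr, 0};

    // Signature: the PSS padding is explicitly requested through the algorithm id.
    int rc = gnutls_privkey_sign_data2(priv, GNUTLS_SIGN_RSA_PSS_SHA256, 0, &msg, &sig);
    std::printf("sign (PSS): %s\n", gnutls_strerror(rc));

    // Encryption: no parameter selects the padding scheme (OAEP vs. PKCS#1 v1.5).
    rc = gnutls_pubkey_encrypt_data(pub, 0, &msg, &cipher);
    std::printf("encrypt: %s\n", gnutls_strerror(rc));

    gnutls_free(sig.data);
    gnutls_free(cipher.data);
    gnutls_pubkey_deinit(pub);
    gnutls_privkey_deinit(priv);
    gnutls_global_deinit();
    return 0;
}
```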
MbedTLS: PKCS#1 v1.5 padding is used on RSA signatures instead of PSS. With MbedTLS v3, trying to sign with PSS returns the error code `MBEDTLS_ERR_PK_BAD_INPUT_DATA`. With MbedTLS v2, the function `mbedtls_pk_sign_ext()`, which includes a padding argument, does not exist.
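For reference, here is a hedged sketch of the kind of call involved (not the project's code), assuming a default Mbed TLS 3.x build where `mbedtls_pk_sign_ext()` is available; link with `-lmbedcrypto`:

```cpp
#include <mbedtls/pk.h>
#include <mbedtls/rsa.h>
#include <mbedtls/entropy.h>
#include <mbedtls/ctr_drbg.h>
#include <cstdio>

int main()
{
    mbedtls_entropy_context entropy;
    mbedtls_ctr_drbg_context drbg;
    mbedtls_pk_context pk;

    mbedtls_entropy_init(&entropy);
    mbedtls_ctr_drbg_init(&drbg);
    mbedtls_pk_init(&pk);

    mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy, nullptr, 0);
    mbedtls_pk_setup(&pk, mbedtls_pk_info_from_type(MBEDTLS_PK_RSA));
    mbedtls_rsa_gen_key(mbedtls_pk_rsa(pk), mbedtls_ctr_drbg_random, &drbg, 2048, 65537);

    unsigned char hash[32] = {0};   // stand-in for a SHA-256 digest, illustration only
    unsigned char sig[MBEDTLS_PK_SIGNATURE_MAX_SIZE];
    size_t sig_len = 0;

    // Attempt an RSA-PSS signature; according to the observation above, this
    // call returns MBEDTLS_ERR_PK_BAD_INPUT_DATA on the tested MbedTLS v3 builds.
    int rc = mbedtls_pk_sign_ext(MBEDTLS_PK_RSASSA_PSS, &pk, MBEDTLS_MD_SHA256,
                                 hash, sizeof(hash), sig, sizeof(sig), &sig_len,
                                 mbedtls_ctr_drbg_random, &drbg);
    std::printf("mbedtls_pk_sign_ext: %d\n", rc);

    mbedtls_pk_free(&pk);
    mbedtls_ctr_drbg_free(&drbg);
    mbedtls_entropy_free(&entropy);
    return 0;
}
```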
Nettle: RSA tests are currently disabled. There is an issue loading public and private key files, as created by OpenSSL. The parsing of the DER sequence fails. See the source file `src/lib_nettle.cpp` for a fix. To be investigated...
Looking at RESULTS.md, we see that the RSA performances are surprisingly low on Ampere Altra CPU's. By "low", we mean that the RSA operations (encryption, decryption, signature, verification) are slow, relative to any other cryptographic operation on the same CPU. On other CPU's, the RSA operations run much faster, again relative to the performance of their respective CPU. This is where the "relative performance score" indicators are useful.
No need to blame the Arm ISA: other Arm CPU's such as the AWS Graviton 3 or the Apple M1 do not have this performance bias on RSA. They execute RSA operations with the same relative efficiency as any other cryptographic operation.
Looking at the OpenSSL modular arithmetic operations benchmarks at the end of RESULTS.md, we see the exact same bias, on the Ampere Altra only, on the modular multiplication (Montgomery algorithm only), the modular square root and the modular exponentiation.
This means that the root cause is the implementation of the Montgomery algorithm for the multiplication. The four basic RSA operations are based on the modular exponentiation, which in turn uses the Montgomery multiplication. And the modular square root is based on the modular exponentiation too.
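To make this dependency explicit, here is a hedged, simplified sketch (toy 64-bit values, not real RSA sizes, using the GCC/Clang `__uint128_t` extension): modular exponentiation is just a loop of modular multiplications, so the cost of the multiplication routine, the Montgomery code in OpenSSL, dominates everything built on top of it.

```cpp
#include <cstdint>
#include <cstdio>

// Portable stand-in for the modular multiplication. In OpenSSL, this is where
// the Montgomery algorithm (hand-written assembly on Arm64) does the work.
static uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t m)
{
    return static_cast<uint64_t>((__uint128_t)a * b % m);
}

// Square-and-multiply modular exponentiation: every RSA operation reduces to
// this loop, so its cost is roughly "number of exponent bits x cost of mod_mul".
static uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp != 0) {
        if (exp & 1) {
            result = mod_mul(result, base, m);  // multiply step
        }
        base = mod_mul(base, base, m);          // square step
        exp >>= 1;
    }
    return result;
}

int main()
{
    // Tiny illustrative values only.
    std::printf("%llu\n", (unsigned long long)mod_exp(7, 65537, 1000000007ULL));
    return 0;
}
```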
So, what is different in the Ampere Altra CPU to impact the Montgomery algorithm and nothing else?
The answer is the difference in CPU core, and especially the implementation of the 64-bit multiplication, in the Ampere Altra compared to the AWS Graviton 3 or the Apple M1.
Ampere Altra is based on an Arm Neoverse N1 core. AWS Graviton 3 is based on a more recent Arm Neoverse V1 core. Apple M1 is based on Apple Firestorm/Icestorm cores. In the tested systems, the Ampere Altra is the only one with a Neoverse N1 core (note: the older Graviton 2 is also based on the Neoverse N1 but I did not have access to any).
To justify this conclusion, let's investigate how OpenSSL implements the Montgomery algorithm. Because it is highly critical for asymmetric cryptography performance, this algorithm is implemented in assembly code (source code for Armv8 here).
This implementation makes heavy use of the MUL and UMULH instructions. Based on the public documentation from Arm, we see that the performance of these instructions is significantly lower on the Neoverse N1 and much better on the Neoverse V1.
Reference public documents:
- Arm Neoverse N1 Software Optimization Guide
- Arm Neoverse V1 Software Optimization Guide
These documents precisely describe the performance of each instruction. Let's check the latency of the MUL and UMULH instructions. The following table gives the number of cycles per instruction. The additional number in parentheses, when present, is described as follows in the N1 documentation: "Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M uop can be issued to that pipeline, with N shown in parentheses."
| Latency | Neoverse N1 | Neoverse V1 |
|---|---|---|
| MADD, MUL, W-form (32 bits) | 2 (1) | 2 (1) |
| MADD, MUL, X-form (64 bits) | 4 (3) | 2 (1) |
| UMULH | 5 (3) | 3 |
Surprisingly, on the N1, the 64-bit multiplication (MUL) is much more expensive than the 32-bit one. On the V1, they have the same cost. In short, 64-bit MUL and UMULH are twice as expensive on the N1, compared to the V1.
Looking at the OpenSSL assembly source code armv8-mont for the Montgomery algorithm, we see sequences of adjacent instructions such as "UMULH, MUL, UMULH, MUL, MOV, UMULH, MUL, SUBS, UMULH, ADC". All these instructions - except MOV - use the M pipeline, which consequently becomes a bottleneck. On the N1, each pair of adjacent MUL, UMULH costs 15 cycles (4 + 3 cycles for the 64-bit MUL plus 5 + 3 cycles for UMULH), mostly because of the extra stall.
This bottleneck does not exist on the Arm Neoverse V1. Each pair of adjacent MUL, UMULH only costs 6 cycles (2 + 1 for MUL plus 3 for UMULH), instead of 15.
There is no equivalent public document about the Apple Firestorm/Icestorm cores but we can assume that there is no such issue either.
To confirm this diagnosis, we tweaked the OpenSSL assembly code for the Montgomery algorithm. First, we replaced all UMULH instructions with MUL. Then, we replaced all MUL and UMULH instructions with NOP. The result is of course no longer functional, but we focus only on the execution time.
The table below evaluates the execution time in microseconds of the Montgomery algorithm on the various CPU cores, with and without MUL or UMULH instructions. Thus, we evaluate the marginal cost of these specific instructions.
| Execution time (microseconds) | Neoverse N1 | Neoverse V1 | Apple M1 |
|---|---|---|---|
| OpenSSL original source code | 4.82 | 1.45 | 1.04 |
| Replace all UMULH with MUL | 4.13 | 1.17 | 1.04 |
| Replace all UMULH and MUL with NOP | 1.06 | 1.01 | 1.00 |
We can see that the MUL and UMULH instructions have a huge cost on the Neoverse N1, much higher than on the other CPU cores.
This explains why RSA is so slow on CPU's which are based on the Arm Neoverse N1 core. This problem no longer exists on the Arm Neoverse V1 and more recent Arm cores.
The investigation in the previous section started from a visible effect: poor RSA performance on Ampere Altra (Neoverse N1) compared to Graviton 3 (Neoverse V1), as well as compared to Intel CPU's, relative to the standard performance of each CPU. The identified reason is the sequence of interspersed MUL and UMULH.
Out of curiosity, let's look at the OpenSSL assembly code for the Montgomery multiplication again and let's try to find other characteristic sequences of instructions involving multiplications.
We easily spot another long sequence of ADCS MUL ADCS UMULH instructions. So, let's try to characterize it.
This time, we try to evaluate the mean time of an instruction inside a given sequence. We start with a sequence of NOP to have a reference. Then we test multiple combinations of instructions. Most sequences are 8 instructions long and are repeated a large number of times.
In each 8-instruction sequence, the output registers of the instructions are all distinct, to avoid execution dependencies. The input registers are the same in all instructions. The same sequence (and consequently the same output registers) is reused after 8 instructions.
Last, we test the full long sequence from OpenSSL, with many MUL ADCS UMULH ADCS (there are also some ADC and ADDS). Finally, we test the same sequence with ADD instead of ADCS, ADC, ADDS.
See the test source code for the details.
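As a hedged sketch of the measurement technique (not the actual test source code), the loop below times a sequence of 8 MUL instructions with distinct output registers and identical input registers, assuming GCC or Clang on Arm64:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
    constexpr uint64_t ITERATIONS = 100000000;
    uint64_t a = 0x123456789ABCDEF0ULL;
    uint64_t b = 0x0FEDCBA987654321ULL;
    uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

    const auto start = std::chrono::high_resolution_clock::now();
    for (uint64_t i = 0; i < ITERATIONS; ++i) {
        // 8 multiplications with distinct output registers and the same inputs,
        // so that there is no dependency between the instructions.
        asm volatile(
            "mul %0, %8, %9 \n\t"
            "mul %1, %8, %9 \n\t"
            "mul %2, %8, %9 \n\t"
            "mul %3, %8, %9 \n\t"
            "mul %4, %8, %9 \n\t"
            "mul %5, %8, %9 \n\t"
            "mul %6, %8, %9 \n\t"
            "mul %7, %8, %9 \n\t"
            : "=&r"(r0), "=&r"(r1), "=&r"(r2), "=&r"(r3),
              "=&r"(r4), "=&r"(r5), "=&r"(r6), "=&r"(r7)
            : "r"(a), "r"(b));
    }
    const auto end = std::chrono::high_resolution_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(end - start).count();
    std::printf("mean time per instruction: %.3f ns\n", ns / (ITERATIONS * 8.0));
    return 0;
}
```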
The results are summarized below:
| Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
|---|---|---|---|---|---|---|
| NOP | 0.419 | 0.043 | 0.000 | 0.000 | 0.020 | 0.011 |
| ADD | 0.209 | 0.042 | 0.079 | 0.037 | 0.041 | 0.014 |
| ADC | 0.209 | 0.042 | 0.079 | 0.058 | 0.078 | 0.042 |
| ADDS | 0.209 | 0.042 | 0.085 | 0.087 | 0.078 | 0.041 |
| ADCS | 0.418 | 0.250 | 0.337 | 0.267 | 0.274 | 0.219 |
| MUL | 1.532 | 0.918 | 0.144 | 0.115 | 0.117 | 0.093 |
| MUL UMULH | 1.810 | 1.085 | 0.144 | 0.115 | 0.117 | 0.093 |
| MUL ADCS UMULH ADCS | 0.835 | 0.501 | 0.248 | 0.204 | 0.117 | 0.095 |
| MUL ADCS | 0.696 | 0.417 | 0.144 | 0.134 | 0.117 | 0.093 |
| MUL ADD UMULH ADD | 0.835 | 0.501 | 0.092 | 0.064 | 0.064 | 0.035 |
| Full OpenSSL sequence (MUL & ADCS) | 0.875 | 0.524 | 0.130 | 0.098 | 0.065 | 0.050 |
| Same sequence with ADD only | 0.875 | 0.524 | 0.106 | 0.068 | 0.062 | 0.052 |
The results are quite surprising when comparing the Neoverse N1 and V1.
For a start, NOP is not measurable on the Neoverse V1 while its execution gives perfectly measurable execution times on other cores (even twice as long as the addition on the Cortex A72). But let's not focus on NOP.
On the Neoverse N1, interleaving MUL and UMULH is not good (we already knew that). Having said that, the 64-bit multiplication alone is not that good either. Replacing the interspersed UMULH with MUL is slightly faster but not that much (0.918 nanoseconds per instruction instead of 1.085).
On the Neoverse V1, MUL and UMULH are much faster, for the reasons which were explained before. Additionally, MUL and UMULH have the same execution time (0.144 ns). The situation was clearly improved in the V1.
Since our characteristic sequence in OpenSSL interleaves multiplications and additions, let's focus on the addition for a start.
For the record, ADCS is an addition with input carry (C) and setting the carry on output (S). The variants ADC, ADDS and ADD exist with carry on input only (C), output only (S), or no carry at all.
Using the carry on input, output, or both, creates a potential dependency between instructions which may slow down the execution.
We see that sequences of ADD, ADC, or ADDS execute at the same speed (0.042 ns on the N1). However, the sequence of ADCS uses 0.250 ns per instruction. Unlike the 3 previous sequences, an ADCS needs to wait for the output carry of the previous instruction to use it as input carry. This explains the lower speed.
However, when the carry is not used (ADD), always read but never written (ADC), or always written but never read (ADDS), there is no dependency between instructions and they all execute at the same faster speed.
The important lesson is: when the carry is not used by an instruction A, accessing it from an adjacent instruction B should have no influence on the execution time of A, provided that A and B are otherwise independent (using independent pipelines, etc.).
Let's add two observations:
- The execution times of ADD, ADC, ADDS are strictly identical on the Cortex A72 and Neoverse N1 only. On the Neoverse V1, ADDS is slightly more expensive. On the Apple M1, ADD is cheaper.
- While the multiplication was improved between the Neoverse N1 and V1 (0.918 to 0.144 ns), the addition was degraded (0.042 to 0.079 ns for ADD and 0.250 to 0.337 ns for ADCS). The improvement of the multiplication (-84%) hardly compensates for the degradation of the addition (+88% for ADD, +35% for ADCS). At application level, the improvement is perceptible because the multiplication is much more expensive in absolute time than the addition. However, in relative execution time, one just compensates the other.
When we intersperse ADCS between MUL and/or UMULH, as used in the OpenSSL sequence, the time per instruction drops by half on the Neoverse N1 (0.501 ns instead of 1.085), compared to MUL/UMULH alone.
According to the Arm Neoverse N1 and V1 Software Optimization Guides, MUL and UMULH use the pipeline M or M0. All forms of ADxx use the pipeline I. The two pipelines are independent in the two cores. According to the instruction pseudo-code in the Arm Architecture Reference Manual, MUL and UMULH do not use the carry or any other PSTATE flag.
Therefore, MUL/UMULH and ADCS are completely independent.
Interleaving independent instructions explains the performance boost on the Neoverse N1, from 1.085 nanoseconds per instruction down to 0.501.
However... on the Neoverse V1, when we intersperse ADCS between MUL and UMULH, the mean time per instruction almost doubles (from 0.144 to 0.248 ns), instead of being divided by two as on the Neoverse N1.
Interestingly, if we replace ADCS with ADD (no use of carry), the mean time per instruction drops to 0.092 ns (compared to 0.248 with ADCS and 0.144 with MUL alone). Without using the carry, interleaving independent instructions reduces the mean instruction time by a third, an expected result.
We already observed that the addition was degraded from the Neoverse N1 to the V1. However, the degradation of ADD without carry (+88%) was worse than the degradation of ADCS (+35%). This was evaluated for these instructions alone. When we interleave ADD and then ADCS with MUL/UMULH, the relative improvements are reversed: -82% with ADD and -50% with ADCS. This is still an improvement because of the improvement of the very costly multiplication. However, the improvement is much better with ADD, compared to ADCS, while the degradation of ADD alone is worse than the degradation of ADCS alone.
Consequently, on the Neoverse V1, it is safe to assume that the usage of the carry on addition (and maybe other instructions, still to be tested) has an unexpected impact on the instruction throughput, even though the other instructions execute on a distinct pipeline and do not use the carry. This is new and did not exist in the Neoverse N1.
The impact on the real code sequence in OpenSSL is slightly less important but still around 30%. This sequence is likely executed less often than the sequence of MUL UMULH. Thus, the macroscopic effect on RSA is less noticeable.