Movatterモバイル変換

Learning fast elliptic-curve cryptography

It’s great to create tools that serve as a foundation for developer projects. Huge impact. Last year, we’ve released Chokidar 3 file watcher and started saving terabytes of NPM traffic daily while improving overall security of JS ecosystem. You may have Chokidar on your machine if you’re coding inVSCode.

After this, i’ve started diving more into cryptography. It’s very useful to know how those algorithms work, to be able to verify them & to create safe systems.

It was immediately clear that Elliptic Curve Cryptography (ECC) libraries must be improved. In this post i’ll describe how to makeone of the fastest JS implementations ofsecp256k1, that can beaudited by non-cryptographers. The initial working code will fit in898 bytes.

State of cryptography in JS

The state of JavaScript cryptography can be summed up in one word: “sad”:

Dependency Hell: Many libraries are bloated with dependencies. As i’ve mentioned before,every dependency is a potential security vulnerability — any underlying package could get hacked & malwared. This is unacceptable for cryptographic primitives, which are made to defend user secrets.
Bad WebCrypto: Thewindow.crypto standard is complicated and lacks support for popular curves like secp256k1. Those APIs which do exist are terrible and make it easy to shoot oneself in foot
Unverifiable WASM: While WASM offers performance benefits, it lacks transparency and auditability.There is no infrastructure for reproducible builds and code signing. Everybody just downloads opaque binaries,which could easily be infested
Modular Arithmetic Issues: JavaScript’s % operator is a remainder operator, not a true modulo, which necessitates re-implementing basic operations.

Naïve first take

Taking all of this into consideration, i’ve decided to create TypeScript libraries that don’t use dependencies & are simple to audit for everyone. Having no “deep” math background, it wasn’t that simple.Cloudflare blog post,Nakov’s book andAndrea’s blog post have all been very useful.

And you can’t just go ahead and read through the code. Almost every implementation — even in Haskell — work with unreadable and highly optimized code. In a real world, we’ll seesomething more like:

for(i=254;i>=0;--i){r=(z[i>>>3]>>>(i&7))&1;sel25519(a,b,r);sel25519(c,d,r);A(e,a,c);Z(a,a,c);A(c,b,d);Z(b,b,d);S(d,e);}

The only lib i’ve found useful wasfastecdsa.py. Unfortunately,it’s rare for libraries to be built on top of big integers. It’s much easier to reason with numbers, instead of byte arrays, which everyone prefers.

Public keys

We will start with a function that takes private key and generates public key from it.

To generate a public key from a private key using elliptic curve cryptography (ECC), you need to performelliptic curve point multiplication. This involves multiplying the base (generator) pointG by the private keyp to obtain the public keyP:

P = G × p

Multiplication can be thought of as repeated addition of the base pointG+G+G... -p times.How do we add(x1, y1) + (x2, y2) to get(x3, y3)?

Imagine drawing a straight line between the two points P and Q on the elliptic curve
This line will generally intersect the curve at a third point.
Reflect this third point vertically (i.e., flip the y-coordinate) to get the resulting point Q, resulting in(x3, -y3)
- This reflection is necessary to ensure that point subtraction works correctly; for example, when computing R - Q, you actually add R to the negation of Q. By flipping Q’s y-coordinate and drawing the line between R and -Q, you obtain the expected result P.
There is a special case for doubling. When point is added to itself, we can’t draw a straight line.Instead, we calculate the slope (λ, lambda) of the tangent line at the point.
There is a special case for adding point to same flipped point:P + -P should equal to0 (special point-at-infinity)

In math terms(source):

ifx1 != x2:
- λ = (y2-y1)/(x2-x1)
- x3 = λ^2 - x1 - x2
- y3 = λ(x1 - x3) - y1
ifx1 = x2 andy1 = y2 (point doubling):
- λ = (3*x1^2+a)/(2y1)
- x3 = λ^2 - 2*x1
- y3 = λ(x1 - x3) - y1

Simple, but not quite. Keep in mind: we’re working in a finite field over some big primeP. This basically means all operations — additions, multiplications, subtractions — would be donemodulo P. And, there is no classic division in finite fields. Instead, amodular multiplicative inverse is used. It is most efficiently calculated by iterative version of Euclid’s GCD algorithm.

Let’s code this:

// PART 1: unsafe pub// secp256k1 curve parameters. Verify using https://www.secg.org/sec2-v2.pdfconst_256=2n**256n;constP=_256-0x1000003d1n;// curve's field primeconstN=_256-0x14551231950b75fc4402da1732fc9bebfn;// curve (group) orderconsterr=(m=''):never=>{thrownewError(m);};// error helperconstM=(a:bigint,b:bigint=P)=>{// mod divisionconstr=a%b;returnr>=0n?r:b+r;// a % b = r};// prettier-ignoreconstinv=(num:bigint,md:bigint):bigint=>{// modular inversionif(num===0n||md<=0n)err('no inverse n='+num+' mod='+md);// no neg exponent for nowleta=M(num,md),b=md,x=0n,y=1n,u=1n,v=0n;while(a!==0n){// uses euclidean gcd algorithmconstq=b/a,r=b%a;// not constant-timeconstm=x-u*q,n=y-v*q;b=a,a=r,x=u,y=v,u=m,v=n;}returnb===1n?M(x,md):err('no inverse');// b is gcd at this point};// Point in 2d affine (x, y) coordinatesinterfaceAffinePoint{x:bigint;y:bigint;}constaffine=(x:bigint,y:bigint):AffinePoint=>({x:x,y:y});constPoint_ZERO={x:0n,y:0n};// Point at infinity aka identity point aka zero// Adds point to itself. https://hyperelliptic.org/EFD/g1p/auto-shortw.htmlconstpdouble=(a:AffinePoint)=>{constX1=a.x,Y1=a.y;// Calculates slope of the tangent lineconstlam=M(3n*X1**2n*inv(2n*Y1,P));// λ = (3x₁² + a) / (2y₁)constX3=M(lam*lam-2n*X1);// x₃ = λ² - 2x₁constY3=M(lam*(X1-X3)-Y1);// y₃ = λ * (x₁ - x₃) - y₁returnaffine(X3,Y3);};// Adds point to other point. https://hyperelliptic.org/EFD/g1p/auto-shortw.htmlconstpadd=(a:AffinePoint,b:AffinePoint):AffinePoint=>{constX1=a.x,Y1=a.y,X2=b.x,Y2=b.y;if(X1===0n||Y1===0n)returnb;if(X2===0n||Y2===0n)returna;if(X1===X2&&Y1===Y2)returnpdouble(a);// special caseif(X1===X2&&Y1===M(-Y2))returnPoint_ZERO;// special caseconstlam=M((Y2-Y1)*inv(X2-X1,P));constX3=M(lam*lam-X1-X2);constY3=M(lam*(X1-X3)-Y1);returnaffine(X3,Y3);};/** Generator / base point */// G x, y values taken from official secp256k1 document: https://www.secg.org/sec2-v2.pdfconstGx=0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798n;constGy=0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8n;constPoint_BASE=affine(Gx,Gy);constmul_unsafe=(q:AffinePoint,n:bigint)=>{letcurr=affine(q.x,q.y);letp=Point_ZERO;while(n>0n){if(n&1n)p=padd(p,curr);curr=pdouble(curr);n>>=1n;}returnp;};const_getPublicKey=(privateKey:bigint):AffinePoint=>{returnmul_unsafe(Point_BASE,privateKey);};

Yay, it works.

Signatures

A good ECC library should also be able to produce & verify signatures.Formulas for ECDSA sigs are simple.

Let:

m be the hash of the message, converted to a number.
d be the private key, converted to a number.
k be the secret (random) nonce number.
G be the generator point.
n be the order of the curve.

Then the signing process is defined as:

(x_1, y_1) = G × k

r = x_1 mod n

s = k^-1 ⋅ (m + d⋅r) mod n

For verification, given:

r, s as the signature outputs
Q as the public key corresponding to the private keyd.

u_1 = s^-1 ⋅ m mod n

u_2 = s^-1 ⋅ r mod n

(x_2, y_2) = G × u_1 + Q × u_2

x_2 mod n == r

Let’s code them:

// PART 2: unsafe sigsinterfaceSignature{r:bigint;s:bigint;recovery?:number;}// Convert Uint8Array to bigint, big endian// prettier-ignoreconstbytesToNumBE=(bytes:Uint8Array):bigint=>{consthex=Array.from(bytes).map((e)=>e.toString(16).padStart(2,"0")).join("");returnhex?BigInt("0x"+hex):0n;};// Get random k from CSPRNG.// (rand32 mod N) is naive version// (rand48 mod N-1)+1 is ours, proper version.// +16 more bytes are fetched to make modulo bias small.// (mod N-1)+1 is done to eliminate 0 from output.constrandK=():bigint=>{// @ts-ignoreconstb=crypto.getRandomValues(newUint8Array(48));constnum=M(bytesToNumBE(b),N-1n);returnnum+1n;};const_sign=(msgh:Uint8Array,priv:bigint):Signature=>{constm=bytesToNumBE(msgh);constd=priv;letr=0n;lets=0n;letq;do{constk=randK();constik=inv(k,N);q=mul_unsafe(Point_BASE,k);r=M(q.x,N);s=M(ik*M(m+d*r,N),N);}while(r===0n||s===0n);if(s>N>>1n){s=M(-s,N);}return{r,s};};const_verify=(sig:Signature,msgh:Uint8Array,pub:AffinePoint):boolean=>{consth=M(bytesToNumBE(msgh),N);constis=inv(sig.s,N);constu1=M(h*is,N);constu2=M(sig.r*is,N);constG_u1=mul_unsafe(Point_BASE,u1);constP_u2=mul_unsafe(pub,u2);constR=padd(G_u1,P_u2);if(R.x===0n&&R.y===0n)returnfalse;constv=M(R.x,N);returnv===sig.r;};

Randomness and PlayStation

You may have noticed we’re using this weird algorithm for generating a randomk.Why is that important?Turns out, if you reusek to sign two different messages under the same private key,your key would get exposed and leaked. It’s really bad.

There are two common ways of solving this situation:

Fetchk from a secure source of randomness (CSPRNG)
- It was popular, untilPlayStation 3 private keys got leaked, because PS3 CSPRNG randomness was not really random
- This is a big problem on some devices which are not able to generate enough entropy
- Entropy generator could also be malicious (backdoored)
Calculatek frompriv andmsgh deterministically.RFC6979 doesexactly that, by using HMAC-DRBG
- Got massive adoption these days
- The DRBG seems a bit complicated: perhaps, if the RFC was created later, they could have just used something like HKDF
- While the method protects against bad randomness, it does not protect against so-called “fault attacks”.Suppose in the code above, an error happens ininv for some values ofinv(k, N).It would get broken and produce garbage, leaking private keys.

In the article we implement 1 for simplicity, while in thelibrary we implement 2. Article’s version has some important tricks, like using FIPS 186 4.1 technique to fetch more randomness than necessary and eliminate zeros from produced values.

Combining 1 and 2 is the best approach and described in RFC6979 3.6 (additional k). It’s called “hedged signatures” / “deterministic signatures with noise” / “extra entropy”.

Fighting timing attacks

Timing attack are a type of side-channel attack where an adversary measures how long your algorithms take to execute, leveraging variations in execution time to infer secret information.

Consider a web app that signs user messages with a private key — if the processing times for a high volume of requests (say, 100,000 calls to a/sign endpoint) differ based on the input, an attacker could potentially deduce the private key. More advanced side-channel methods even analyze power consumption to extract secrets.

To mitigate these risks, cryptographic software must operate in constant time, making sure that operations like character-by-character string comparisons always take the same duration regardless of input. Fortunately, such timing attacks are relatively uncommon, and breaches are more likely to result from vulnerabilities in third-party dependencies.

Let’s see if we’re protected against timing attacks right now:

constmeasure=(label:string,fn:any)=>{console.time(label);fn();console.timeEnd(label);};measure("small DA",()=>mul_unsafe(Point_BASE,2n));measure("large DA",()=>mul_unsafe(Point_BASE,2n**255n-19n));// small DA: 0.053ms// large DA: 10.377ms// getPublicKey_unsafe 1: 9.218ms// getPublicKey_unsafe 2: 0.076ms// sign_slow_unsafe: 7.132ms// verify_slow: 13.986ms

Depending on a private key, it could take up to 190x more time to compute the public key.Seems like we are not protected. How could this be solved?

Private key is 256 bits: 256 zeros or ones
Fast keys have many zeros and a few ones e.g.0000001000000
Slow keys have less zeros e.g.1111011101111111
Right now we only add points with each other when1 is present
We need to ensure that we also do it when bit is0

Very simple idea: when1 is present, do business as usual. When0 is present, add something to a fake point:

// PART 3: safe mul// Constant Time multiplicationconstgetPowers=(q:AffinePoint)=>{letpoints:AffinePoint[]=[];for(letbit=0,dbl=q;bit<=256;bit++,dbl=pdouble(dbl)){points.push(dbl);}returnpoints;};constmul_CT_slow=(q:AffinePoint,n:bigint)=>{constpows=getPowers(q);letp=Point_ZERO;letf=Point_ZERO;// fake pointfor(letbit=0;bit<=256;bit++){constat_bit_i=pows[bit];if(n&1n)p=padd(p,at_bit_i);elsef=padd(f,at_bit_i);n>>=1n;}returnp;};

Let’s measuregetPublicKey now.

measure("small CT",()=>mul_CT_slow(Point_BASE,2n));measure("large CT",()=>mul_CT_slow(Point_BASE,2n**255n-19n));// small CT: 9.409ms// large CT: 9.381ms

Now it is slower on average, but at least it’s “constant-time”.

BigInts, JIT, GC and other scary words

Native JS BigInts are often labeled “unsuitable for cryptography” because their operations aren’t guaranteed to be constant-time. However, my tests indicate that this constant-timeness holds up to approximately 1024-bit numbers. Since most of our operations occur modulo P and the operands involved in addition or multiplication remain below 1024 bits, the native BigInt implementation handles these values efficiently without noticeable timing differences. In fact, executing a timing attack on an algorithmically correct Point#multiply would require exponentially more effort, which suggests that creating an alternative library focused solely on achieving constant-time performance is unnecessary.

It’s also important to acknowledge that JavaScript is a dynamic language, and there are inherent factors that can disrupt any constant-time guarantees, such as:

Garbage collection: A function that normally executes in 5ms might take 7ms when the JS engine performs object reclamation.
Just-in-time compilation: The multiple optimizations and deoptimizations by modern JIT compilers can lead to variation in execution times based on context.

Thus, whether we use native BigInts or a custom implementation like bn.js (which is over 6,000 lines of code), obtaining true constant-time behavior in JavaScript is unattainable.

“Constant-time” is impossible in javascript. That’s it. The best we can do is an algorithmic constant-time & removing dependencies which make us more likely to get malwared.

Projecting…coordinates?

Right now, our call structure consists of:

Initialization: 256 doubles
Multiplication: 256 adds

Let’s dive into what actually happens inside theadd anddouble operations. Both of these functions perform several additions and multiplications, and crucially, each includes oneinvert (i.e., finite field division). Profiling reveals that invert is extremely slow — typically20 to 100 times slower than a multiplication—which means each multiplication might involve up to 512 inverts.

Projective (or homogeneous) coordinates provide an elegant solution by eliminating these costly inversions from both functions. Note that our default x, y coordinates are in Affine form. In projective coordinates, a point is represented by the triple (X, Y, Z); conversion between representations is straightforward. To convert from Projective to Affine, we use:

Ax = X / Z;Ay = Y / Z;

Recall thatm/n = m * inv(n).Converting a point to Projective form is even simpler: we set Z to 1, and copy X and Y directly.

We’re also planning to implement dedicatedPoint#double &add methods using theRenes-Costello-Batina 2015 exception-free addition / doubling formulas. “Exception-free” means those formulas avoid conditional branches, such as checking whether two points are identical (or zero), as seen in our initial mul implementation. Using such formulas is important for constant-timeness.

Finally, we will modify ourgetPrecomputes function to operate exclusively onPoint objects, converting them back to Affine form viaPoint#toAffine() just before returning results.

We’ll changegetPrecomputes to work withPoint everywhere. And just before return we’ll executePoint#toAffine() to getAffinePoint.

// PART 4: proj mulconstCURVE={P:P,n:N,a:0n,b:7n};constequals=(a:AffinePoint,b:AffinePoint)=>a.x===b.x&&a.y===b.y;// Point in 3d projective (x, y, z) coordinatesclassPoint{px:bigint;py:bigint;pz:bigint;constructor(x:bigint,y:bigint,z:bigint){this.px=x;this.py=y;this.pz=z;}/** Create 3d xyz point from 2d xy. Edge case: (0, 0) => (0, 1, 0), not (0, 0, 1) */staticfromAffine(p:AffinePoint):Point{returnp.x===0n&&p.y===0n?newPoint(0n,1n,0n):newPoint(p.x,p.y,1n);}toAffine(iz:bigint=inv(this.pz,P)):AffinePoint{constx=M(this.px*iz);consty=M(this.py*iz);return{x,y};}// prettier-ignoreadd(other:Point){const{px:X1,py:Y1,pz:Z1}=this;const{px:X2,py:Y2,pz:Z2}=other;const{a,b}=CURVE;letX3=0n,Y3=0n,Z3=0n;constb3=M(b*3n);// step 1lett0=M(X1*X2),t1=M(Y1*Y2),t2=M(Z1*Z2),t3=M(X1+Y1);lett4=M(X2+Y2);// step 5t3=M(t3*t4);t4=M(t0+t1);t3=M(t3-t4);t4=M(X1+Z1);lett5=M(X2+Z2);// step 10t4=M(t4*t5);t5=M(t0+t2);t4=M(t4-t5);t5=M(Y1+Z1);X3=M(Y2+Z2);// step 15t5=M(t5*X3);X3=M(t1+t2);t5=M(t5-X3);Z3=M(a*t4);X3=M(b3*t2);// step 20Z3=M(X3+Z3);X3=M(t1-Z3);Z3=M(t1+Z3);Y3=M(X3*Z3);t1=M(t0+t0);// step 25t1=M(t1+t0);t2=M(a*t2);t4=M(b3*t4);t1=M(t1+t2);t2=M(t0-t2);// step 30t2=M(a*t2);t4=M(t4+t2);t0=M(t1*t4);Y3=M(Y3+t0);t0=M(t5*t4);// step 35X3=M(t3*X3);X3=M(X3-t0);t0=M(t3*t1);Z3=M(t5*Z3);Z3=M(Z3+t0);// step 40returnnewPoint(X3,Y3,Z3);}double(){returnthis.add(this);}negate(){returnnewPoint(this.px,M(-this.py),this.pz);}}constProj_ZERO=Point.fromAffine(Point_ZERO);constProj_BASE=Point.fromAffine(Point_BASE);constgetPowersProj=(qxy:AffinePoint):Point[]=>{constq=Point.fromAffine(qxy);letpoints:Point[]=[];for(letbit=0,dbl=q;bit<=256;bit++,dbl=dbl.double()){points.push(dbl);}returnpoints;};constmul_CT=(q:AffinePoint,n:bigint)=>{constpows=getPowersProj(q);letp=Proj_ZERO;letf=Proj_ZERO;// fake pointfor(letbit=0;bit<=256;bit++){constat_bit_i=pows[bit];if(n&1n)p=p.add(at_bit_i);elsef=f.add(at_bit_i);n>>=1n;}returnp.toAffine();};

Let’s see how this improves things for us:

CT proj: 1.499ms

Woah! That’s 6x improvement formultiply.

Precomputes and wNAF

Currently, we do 256 point additions, for each of 256 bits of a scalar, by whichPoint is multiplied.Simplest optimization would be to cachegetPowers method.This way, powers of G would get precomputed once and for all.

Instead of doing that, let’s get more efficient and split scalar into 4/8/16-bit chunks,pre-compute powers AND additions within those chunks.

w-ary non-adjacent form method (wNAF) allows to do exactly that. Instead of caching 256 points like we did before, we would cache 520 (W=4), 4224 (W=8) or 557056 (W=16) points.

We could also do this with alternative, “windowed method”, but wNAF saves us ½ RAM and is 2x faster in init time. The reason for this is that wNAF does addition and subtraction — and for subtraction it simply negates point, which can be done in constant time.

// PART 5: wnaf mulconstW=8;// Precomputes-related code. W = window size// prettier-ignoreconstprecompute=()=>{// They give 12x faster getPublicKey(),constpoints:Point[]=[];// 10x sign(), 2x verify(). To achieve this,constwindows=256/W+1;// app needs to spend 40ms+ to calculateletp=Proj_BASE,b=p;// a lot of points related to base point G.for(letw=0;w<windows;w++){// Points are stored in array and usedb=p;// any time Gx multiplication is done.points.push(b);// They consume 16-32 MiB of RAM.for(leti=1;i<2**(W-1);i++){b=b.add(p);points.push(b);}p=b.double();// Precomputes don't speed-up getSharedKey,}// which multiplies user point by scalar,returnpoints;// when precomputes are using base point}letGpows:Point[]|undefined=undefined;// precomputes for base point G// prettier-ignoreconstwNAF=(n:bigint):{p:Point;f:Point}=>{// w-ary non-adjacent form (wNAF) method.// Compared to other point mult methods,constcomp=Gpows||(Gpows=precompute());// stores 2x less points using subtractionconstneg=(cnd:boolean,p:Point)=>{letn=p.negate();returncnd?n:p;}// negateletp=Proj_ZERO,f=Proj_BASE;// f must be G, or could become I in the endconstwindows=1+256/W;// W=8 17 windowsconstwsize=2**(W-1);// W=8 128 window sizeconstmask=BigInt(2**W-1);// W=8 will create mask 0b11111111constmaxNum=2**W;// W=8 256constshiftBy=BigInt(W);// W=8 8for(letw=0;w<windows;w++){constoff=w*wsize;letwbits=Number(n&mask);// extract W bits.n>>=shiftBy;// shift number by W bits.if(wbits>wsize){wbits-=maxNum;n+=1n;}// split if bits > max: +224 => 256-32constoff1=off,off2=off+Math.abs(wbits)-1;// offsets, evaluate bothconstcnd1=w%2!==0,cnd2=wbits<0;// conditions, evaluate bothif(wbits===0){f=f.add(neg(cnd1,comp[off1]));// bits are 0: add garbage to fake point}else{//          ^ can't add off2, off2 = Ip=p.add(neg(cnd2,comp[off2]));// bits are 1: add to result point}}return{p,f}// return both real and fake points for JIT};// !! you can disable precomputes by commenting-out call of the wNAF() inside Point#mul()constmul_G_wnaf=(n:bigint)=>{const{p,f}=wNAF(n);f.toAffine();// result ignoredreturnp.toAffine();};

Let’s see where this lands us:

mul_G_wnaf: 0.196ms

So, 10x faster; even thoughinit got slightly slower. You may have noticed that it’s possible to adjustwNAFW parameter: it specifies how many precomputed points we’ll calculate.

Endomorphism

Let’s get hardcore. But not too much — otherwise, the code would be unauditable.To improve performance even more, we must look at the properties at our underlying elliptic curve.secp256k1 is Koblitz curve (e.g. Short Weierstrass curve witha=0).

Koblitz curves allow usingefficiently-computable GLV endomorphism ψ:

GLV endomorphism ψ transforms a point:P = (x, y) ↦ ψ(P) = (β·x mod p, y)
GLV scalar decomposition transforms a scalar:k ≡ k₁ + k₂·λ (mod n)
Then these are combined:k·P = k₁·P + k₂·ψ(P)
Two 128-bit point-by-scalar multiplications + one point addition is faster than one 256-bit multiplication.

Endomorphism uses 2x less RAM, speeds up precomputation by 2x and ECDH / key recovery by 20%.For precomputed wNAF it trades off 1/2 init time & 1/3 ram for 20% perf hit.

Calculating endomorphism is simple(check out code):

beta: β ∈ Fₚ with β³ = 1, β ≠ 1
lambda: λ ∈ Fₙ with λ³ = 1, λ ≠ 1
k ↦ k₁, k₂ is decomposed using splitScalar function, which uses reduced basis vectors.Gauss lattice reduction calculates them from initial basis vectors(n, 0), (-λ, 0)

Inmultiply, we will need to split scalar into two, and then calculate sum of two resulting points.We can also adjustmul_CT andmul_G_wnaf (to reduce init time / memory by 2x),but we won’t do that here.

// PART 6: endoconstCURVE_beta=0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501een;// const CURVE_lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72n;constCURVE_basis=[[0x3086d221a7d46bcde86c90e49284eb15n,-0xe4437ed6010e88286f547fa90abfe4c3n],[0x114ca50f7a8e2f3f657c1108d9d44cfd8n,0x3086d221a7d46bcde86c90e49284eb15n]];constdivNearest=(a:bigint,b:bigint)=>(a+b/2n)/b;constsplitScalar=(k:bigint)=>{const[[a1,b1],[a2,b2]]=CURVE_basis;// Use Babai's round-off algorithm:// Calculate continuous coordinates in the basisconstc1=divNearest(b2*k,N);constc2=divNearest(-b1*k,N);// Calculate the closest lattice point to (k, 0)// Calculate k1 = k - b1 (mod n) and k2 = -b2 (mod n)letk1=M(k-c1*a1-c2*a2,N);letk2=M(-c1*b1-c2*b2,N);constPOW_2_128=0x100000000000000000000000000000000n;// (2n**128n).toString(16)constk1neg=k1>POW_2_128;constk2neg=k2>POW_2_128;if(k1neg)k1=N-k1;if(k2neg)k2=N-k2;if(k1>POW_2_128||k2>POW_2_128)err("endomorphism failed, k="+k);return{k1neg,k1,k2neg,k2};};constmul_endo=(q:AffinePoint,n:bigint)=>{let{k1neg,k1,k2neg,k2}=splitScalar(n);letk1p=Proj_ZERO;letk2p=Proj_ZERO;letd:Point=Point.fromAffine(q);while(k1>0n||k2>0n){if(k1&1n)k1p=k1p.add(d);if(k2&1n)k2p=k2p.add(d);d=d.double();k1>>=1n;k2>>=1n;}if(k1neg)k1p=k1p.negate();if(k2neg)k2p=k2p.negate();k2p=newPoint(M(k2p.px*CURVE_beta),k2p.py,k2p.pz);returnk1p.add(k2p).toAffine();};

This idea was popularized for the curveby Hal Finney.and is described on pages 125-129 of the Guide to ECC, by Hankerson, Menezes and Vanstone.

Endgame: combining everything

Let’s combine 3 different multiplication algorithms to get the fastest of all worlds:

wNAF can only be used for base / generator point, because we don’t calculate it for others
Endomorphism is fast, but slower than wNAF. However, we didn’t write endomorphism forfake points in the article. So, it can only be used for cases when timing attacks are not relevant.One such case is verification, which does not operate on private inputs.
For everything else there ismul_CT, which is slower than endomorphism, but safe

Additionally, let’s implementgetSharedSecret (elliptic curve diffie-hellman), and addrecovery bit to sign() output; to ensure a way to recover public keys from signatures.

// PART 7, final pub + sign + verify + sharedconstmul=(q:AffinePoint,n:bigint,safe=true)=>{if(equals(q,Point_BASE))returnmul_G_wnaf(n);returnsafe?mul_CT(q,n):mul_endo(q,n);};functiongetPublicKey(privateKey:bigint):AffinePoint{returnmul(Point_BASE,privateKey);}functionsign(msgh:Uint8Array,priv:bigint):Signature{constm=bytesToNumBE(msgh);constd=priv;letr=0n;lets=0n;letq;do{constk=randK();constik=inv(k,N);q=mul(Point_BASE,k);r=M(q.x,N);s=M(ik*M(m+d*r,N),N);}while(r===0n||s===0n);letrecovery=(q.x===r?0:2)|Number(q.y&1n);if(s>N>>1n){recovery^=1;s=M(-s,N);}return{r,s,recovery};}functionverify(sig:Signature,msgh:Uint8Array,pub:AffinePoint):boolean{consth=M(bytesToNumBE(msgh),N);constis=inv(sig.s,N);constu1=M(h*is,N);constu2=M(sig.r*is,N);constG_u1=mul(Point_BASE,u1);constP_u2=mul(pub,u2,false);constR=padd(G_u1,P_u2);constv=M(R.x,N);returnv===sig.r;}functiongetSharedSecret(privA:bigint,pubB:AffinePoint):AffinePoint{returnmul(pubB,privA);}

The benchmarks look like that now:

getPublicKey: 0.234mssign: 0.258msverify: 1.264ms

While we started with:

getPublicKey_unsafe 1: 12.468msgetPublicKey_unsafe 2: 0.11mssign_slow_unsafe: 9.442msverify_slow: 19.073ms

Awesome!

Extra tricks

One thing worth mentioning is Montgomery Batch Inversion.

We get a bunch a numbers, do one inverse, and then we apply it to all numbers.At this point, we don’t really useinvert anywhere except for a few places.

functioninvBatch(nums:bigint[]):bigint[]{consttmp=newArray(nums.length);// Walk from first to last, multiply them by each other MOD pconstlastMultiplied=nums.reduce((acc,num,i)=>{if(num===0n)returnacc;tmp[i]=acc;returnM(acc*num);},1n);// Invert last elementconstinverted=inv(lastMultiplied,P);// Walk from last to first, multiply them by inverted each other MOD pnums.reduceRight((acc,num,i)=>{if(num===0n)returnacc;tmp[i]=M(acc*tmp[i]);returnM(acc*num);},inverted);returntmp;}constinvPointsBatch=(points:Point[])=>{returnpoints;// const points = [p, f];constinverted=invBatch(points.map((p)=>p.pz));returnpoints.map((p,i)=>Point.fromAffine(p.toAffine(inverted[i])));};// REPLACE in `mul_G_wnaf`:// return invPointsBatch([p, f])[0].toAffine();// REPLACE IN `wNAF`:// const comp = Gpows || (Gpows = invPointsBatch(precompute()));

The gains are small, but could be used.One unexpected place iswNAF: we can normalize all precomputed points to haveZ=1,which would slightly improve speed ofgetPublicKey andsign.

Another useful trick is Multi-Scalar Multiplication (MSM).It is commonly implemented usingPippenger algorithm.MSM could be used for calculating addition of many points at once:aP + bQ + cR + .... It only makes sense to use it with bigger inputs.

Final thoughts and source code

We’ve gotsign from unsafe 7.028ms to safe 0.177ms, a 39x speed-up.All without re-implementing bigints, esoteric math (well, besides endomorphism) and low-level languages. At this point it’s the fastest secp256k1 ECDSA lib in pure JavaScript.Something likeBIP340 Schnorr signatures can easily be implemented on top of these primitives.

The code from the article is available:
- secp256k1-1kb.min.js — JS, parts 1+2, unsafe, fits in898 bytes gzipped
- secp256k1.ts — TS, all methods
noble-secp256k1 —4kb safe, full-featured library is ready to for use in all kinds of projects.
noble-curves — larger, generic and safe standard for JS ECC cryptography, with ed25519, BLS12-381, and other algorithms

Join the journey to auditable cryptography viaX.com &GitHub.