NotificationsYou must be signed in to change notification settings
Fork740
Star1.2k

Fix #10513 - powmod is slow for ulong#10688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

thewilsonator merged 30 commits intodlang:masterfromAditya-132:issue-#10513

Mar 27, 2025

Merged

Fix #10513 - powmod is slow for ulong#10688

thewilsonator merged 30 commits intodlang:masterfromAditya-132:issue-#10513

Mar 27, 2025

Conversation

Copy link

Contributor

Aditya-132 commentedMar 18, 2025

The key optimizations I've made are:

128-bit multiplication via inline assembly:

For 64-bit types on x86_64 platforms, I've added direct use of hardware multiplication that produces a 128-bit result (stored in high:low registers)
This avoids the repeated addition loop in the original algorithm

Efficient modular reduction:

Implemented a reduction algorithm that handles the 128-bit result efficiently
Uses a simplified approach for when the high bits are zero

Early reduction:

The base value is reduced modulo m at the beginning, which improves performance when the base is large and modulus is small

Special case handling:

Added explicit handling for modulus=1 and exponent=0 cases up front

Version checks:

The code falls back to the original algorithm if inline assembly is not available
This ensures portability while providing optimizations where possible

Aditya-132 added2 commits

March 18, 2025 18:55

The current implementation of powmod is very slow for the ulong typed…

b889996

…lang#10513

some imports changes

f860160

Aditya-132 requested a review fromibuclaw as acode owner

March 18, 2025 13:38

Copy link

Contributor

dlang-bot commentedMar 18, 2025

Thanks for your pull request and interest in making D better,@Aditya-132! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting thedetails link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (withParams: andReturns:)

Please seeCONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, pleasereference a Bugzilla issue or create amanual changelog.

Testing this PR locally

If you don't have alocal development environment setup, you can useDigger to test this PR:

dub run digger -- build"master + phobos#10688"

Copy link

Member

ibuclaw commentedMar 18, 2025

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.
Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

sure

Copy link

Contributor

thewilsonator commentedMar 18, 2025

Please give the PR and commit titles a description based on the issues this solves e.g.Fix #10513 - powmod is slow for ulong

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.
Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

can you provide any example which is already present in code

Aditya-132 changed the title~~Fixing Issue-#10513~~Fix #10513 - powmod is slow for ulong

Mar 18, 2025

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.
Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

instead of modifying powmod we can diredctly modify mulmod

Copy link

Member

ibuclaw commentedMar 18, 2025•
edited
Loading

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.
Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?
can you provide any example which is already present in code

So, unlike dmd iasm which~~is an uninlineable black box~~ forces you manually put in all load and stores, gcc iasm instead has the notion of input and output operands (first outputs, then inputs)

As all you're doing is mul, and taking the result from two registers, the equivalent should be

low = a;asm @trusted nothrow @nogc {    "mull %0"    : "=d" (high), "+a"(low)    : "0" (b);}

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025•
edited by ibuclaw
Loading

@ibuclaw

TESTING RESULT

BENCHMARK 1: Testing mulmod implementationsTesting with 32-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  9 ms, 270 μs, and 5 hnsecsSecond implementation (D):   13 ms, 119 μs, and 8 hnsecsPerformance ratio: 141.52% (higher means ASM is faster)Verification values: 312463588 312463588Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  45 ms, 296 μs, and 4 hnsecsSecond implementation (D):   556 ms and 119 μsPerformance ratio: 1227.74% (higher means ASM is faster)Verification values: 311557887622942605 311557887622942605BENCHMARK 2: Testing powmod implementationsTesting with 32-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  1 ms, 862 μs, and 9 hnsecsSecond implementation (D):   1 ms, 904 μs, and 1 hnsecPerformance ratio: 102.26% (higher means ASM is faster)Verification values: 8312587 8312587Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  2 ms and 932 μsSecond implementation (D):   42 ms, 178 μs, and 2 hnsecsPerformance ratio: 1438.54% (higher means ASM is faster)Verification values: 12615239 12615239(dmd-2.110.0)➜  testing

Script which i use for testing

importstd.stdio;importstd.datetime.stopwatch;importstd.random;importstd.traits;importstd.meta;importstd.algorithm;importcore.time;// First implementation from the pasted functionstaticTmulmod1(T)(T a, T b, T c) {staticif (T.sizeof==8) {version (D_InlineAsm_X86_64) {// Fast path for small valuesulong low =void;ulong high =void;// Perform 128-bit multiplication: a * b = [high:low]asmpure@trustednothrow@nogc            {                movRAX, a;// Load a into RAX                mul b;// Multiply by b (RDX:RAX = a * b)                mov low,RAX;// Store low 64 bits                mov high,RDX;// Store high 64 bits            }// Handle overflow for large valuesif (high>= c) {                high%= c;            }if (high==0) {return low% c;            }// Reduce the high 64 bits modulo modasmpure@trustednothrow@nogc            {                movRAX, high;// Load high part                xorRDX,RDX;// Clear RDX for division                div c;// RAX = high / mod, RDX = high % mod                mov high,RDX;// Store remainder (high % mod)                movRAX, low;// Load low part                movRDX, high;// Load reduced high part                div c;// Divide full 128-bit number by mod                mov low,RDX;// Store remainder (final result)            }return low;// Return (a * b) % mod        }else {// Fallback for non-x86_64 platforms            T result =0;            a%= c;            b%= c;while (b>0) {if (b &1) {                    result = (result+ a)% c;                }                a = (a*2)% c;                b>>=1;            }return result;        }    }elsestaticif (T.sizeof==4) {version (D_InlineAsm_X86) {// Fast path for small valuesuint low =void;// Changed to uint for 32-bituint high =void;// Changed to uint for 32-bit// Perform 64-bit multiplication: a * b = [high:low]asmpure@trustednothrow@nogc            {                movEAX, a;// Load a into EAX                mul b;// Multiply by b                mov low,EAX;// Store low bits                mov high,EDX;// Store high bits            }// Handle overflow for large valuesif (high>= c) {                high%= c;            }if (high==0) {return low% c;            }// Reduce the high bits modulo modasmpure@trustednothrow@nogc            {                movEAX, high;// Load high part                xorEDX,EDX;// Clear EDX for division                div c;// EAX = high / mod, EDX = high % mod                mov high,EDX;// Store remainder (high % mod)                movEAX, low;// Load low part                movEDX, high;// Load reduced high part                div c;// Divide full number by mod                mov low,EDX;// Store remainder (final result)            }return low;// Return (a * b) % mod        }else {// Use 64-bit type for the calculationulong result = (cast(ulong)a*cast(ulong)b)% c;returncast(T)result;        }    }else {// Fallback for other sizesimportstd.bigint;returncast(T)((BigInt(a)* BigInt(b))% BigInt(c));    }}// Second implementation from function provided in the querytemplateLargest(T...) {staticif (T.length==1)alias Largest = T[0];elsestaticif (T[0].sizeof> Largest!(T[1..$]).sizeof)alias Largest = T[0];elsealias Largest = Largest!(T[1..$]);}staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =0;        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);            a>>=1;            b = addmod(b, b, c);        }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}// FIX: Explicitly specify the return typeFpowmod1(F, G, H)(F x, G n, H m)if (isUnsigned!F&& isUnsigned!G&& isUnsigned!H){alias ReturnT =std.traits.Unqual!(Largest!(F, H));    ReturnT base =cast(ReturnT)(x% m);    ReturnT result =1;    ReturnT modulus =cast(ReturnT)m;std.traits.Unqual!G exponent = n;while (exponent>0) {if (exponent &1)            result = mulmod1!ReturnT(result, base, modulus);        base = mulmod1!ReturnT(base, base, modulus);        exponent>>=1;    }returncast(F)result;}// FIX: Explicitly specify the return typeFpowmod2(F, G, H)(F x, G n, H m)if (isUnsigned!F&& isUnsigned!G&& isUnsigned!H){alias ReturnT =std.traits.Unqual!(Largest!(F, H));    ReturnT base =cast(ReturnT)(x% m);    ReturnT result =1;    ReturnT modulus =cast(ReturnT)m;std.traits.Unqual!G exponent = n;while (exponent>0) {if (exponent &1)            result = mulmod2!ReturnT(result, base, modulus);        base = mulmod2!ReturnT(base, base, modulus);        exponent>>=1;    }returncast(F)result;}// DELETED conflicting Unqual templatevoidmain() {// Run tests first to ensure correctness    writeln("VERIFICATION TESTS:");    runTests();// Test for 32-bit integers    writeln("\nBENCHMARK 1: Testing mulmod implementations");    writeln("Testing with 32-bit integers:");    benchmarkMulmod!(uint)();// Test for 64-bit integers    writeln("\nTesting with 64-bit integers:");    benchmarkMulmod!(ulong)();// Test powmod implementations    writeln("\n\nBENCHMARK 2: Testing powmod implementations");    writeln("Testing with 32-bit integers:");    benchmarkPowmod!(uint)();        writeln("\nTesting with 64-bit integers:");    benchmarkPowmod!(ulong)();}voidrunTests() {    writeln("Running unit tests...");// Test cases provided in the query// First set of testsassert(powmod1(1U,10U,3U)==1);assert(powmod1(3U,2U,6U)==3);assert(powmod1(5U,5U,15U)==5);assert(powmod1(2U,3U,5U)==3);assert(powmod1(2U,4U,5U)==1);assert(powmod1(2U,5U,5U)==2);// Second implementationassert(powmod2(1U,10U,3U)==1);assert(powmod2(3U,2U,6U)==3);assert(powmod2(5U,5U,15U)==5);assert(powmod2(2U,3U,5U)==3);assert(powmod2(2U,4U,5U)==1);assert(powmod2(2U,5U,5U)==2);// Second set of tests - ulongulong a =18446744073709551615u, b =20u, c =18446744073709551610u;assert(powmod1(a, b, c)==95367431640625u);    a =100; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==18223853583554725198u);    a =117; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==11493139548346411394u);    a =134; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==10979163786734356774u);    a =151; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==7023018419737782840u);    a =168; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==58082701842386811u);    a =185; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==17423478386299876798u);    a =202; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==5522733478579799075u);    a =219; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==15230218982491623487u);    a =236; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==5198328724976436000u);        a =0; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==0);    a =123; b =0; c =18446744073709551557u;assert(powmod1(a, b, c)==1);immutableulong a1 =253, b1 =7919, c1 =18446744073709551557u;assert(powmod1(a1, b1, c1)==3883707345459248860u);// Second implementation    a =18446744073709551615u; b =20u; c =18446744073709551610u;assert(powmod2(a, b, c)==95367431640625u);    a =100; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==18223853583554725198u);    a =117; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==11493139548346411394u);    a =134; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==10979163786734356774u);    a =151; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==7023018419737782840u);    a =168; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==58082701842386811u);    a =185; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==17423478386299876798u);    a =202; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==5522733478579799075u);    a =219; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==15230218982491623487u);    a =236; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==5198328724976436000u);        a =0; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==0);    a =123; b =0; c =18446744073709551557u;assert(powmod2(a, b, c)==1);assert(powmod2(a1, b1, c1)==3883707345459248860u);// Third set of tests - uintuint x =100, y =7919, z =1844674407u;assert(powmod1(x, y, z)==1613100340u);    x =134; y =7919; z =1844674407u;assert(powmod1(x, y, z)==734956622u);    x =151; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1738696945u);    x =168; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1247580927u);    x =185; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1293855176u);    x =202; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1566963682u);    x =219; y =7919; z =1844674407u;assert(powmod1(x, y, z)==181227807u);    x =236; y =7919; z =1844674407u;assert(powmod1(x, y, z)==217988321u);    x =253; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1588843243u);        x =0; y =7919; z =184467u;assert(powmod1(x, y, z)==0);    x =123; y =0; z =1844674u;assert(powmod1(x, y, z)==1);// Second implementation    x =100; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1613100340u);    x =134; y =7919; z =1844674407u;assert(powmod2(x, y, z)==734956622u);    x =151; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1738696945u);    x =168; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1247580927u);    x =185; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1293855176u);    x =202; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1566963682u);    x =219; y =7919; z =1844674407u;assert(powmod2(x, y, z)==181227807u);    x =236; y =7919; z =1844674407u;assert(powmod2(x, y, z)==217988321u);    x =253; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1588843243u);        x =0; y =7919; z =184467u;assert(powmod2(x, y, z)==0);    x =123; y =0; z =1844674u;assert(powmod2(x, y, z)==1);        writeln("All tests passed!");}voidbenchmarkMulmod(T)() {// Number of iterations for benchmarkingimmutable iterations = 1_000_000;// Generate random test cases    T[] a_values =new T[iterations];    T[] b_values =new T[iterations];    T[] m_values =new T[iterations];auto rnd =Random(unpredictableSeed);// Fill arrays with random valuesfor (size_t i =0; i< iterations; i++) {staticif (is(T==ulong)) {            a_values[i] = uniform!("[]")(T.min, T.max/2, rnd);            b_values[i] = uniform!("[]")(T.min, T.max/2, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(T.max/4, T.max-1, rnd);        }else {            a_values[i] = uniform!("[]")(T.min, T.max, rnd);            b_values[i] = uniform!("[]")(T.min, T.max, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(T.max/2, T.max-1, rnd);        }    }// Variables to store results to prevent compiler optimizations    T result1 =0;    T result2 =0;// Test first implementation    writeln("Testing first implementation (ASM-based):");auto sw1 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result1^= mulmod1!T(a_values[i], b_values[i], m_values[i]);    }        sw1.stop();// Test second implementation    writeln("Testing second implementation (Standard D):");auto sw2 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result2^= mulmod2!T(a_values[i], b_values[i], m_values[i]);    }        sw2.stop();// Print results    writefln("First implementation (ASM):  %s", sw1.peek());    writefln("Second implementation (D):   %s", sw2.peek());double ratio = (sw2.peek().total!"usecs"*100.0/ sw1.peek().total!"usecs");    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);// Print dummy value to prevent optimization    writefln("Verification values: %s %s", result1, result2);}voidbenchmarkPowmod(T)() {// Number of iterations for benchmarkingimmutable iterations = 10_000;// Generate random test cases    T[] x_values =new T[iterations];    T[] n_values =new T[iterations];    T[] m_values =new T[iterations];auto rnd =Random(unpredictableSeed);// Fill arrays with random values - using realistic rangesfor (size_t i =0; i< iterations; i++) {staticif (is(T==ulong)) {            x_values[i] = uniform!("[]")(2,1000000, rnd);            n_values[i] = uniform!("[]")(10,10000, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(1000,10000000, rnd);        }else {            x_values[i] = uniform!("[]")(2,1000000, rnd);            n_values[i] = uniform!("[]")(10,10000, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(1000,10000000, rnd);        }    }// Variables to store results to prevent compiler optimizations    T result1 =0;    T result2 =0;// Test first implementation    writeln("Testing first powmod implementation (with ASM mulmod):");auto sw1 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result1^= powmod1!T(x_values[i], n_values[i], m_values[i]);    }        sw1.stop();// Test second implementation    writeln("Testing second powmod implementation (with standard D mulmod):");auto sw2 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result2^= powmod2!T(x_values[i], n_values[i], m_values[i]);    }        sw2.stop();// Print results    writefln("First implementation (ASM):  %s", sw1.peek());    writefln("Second implementation (D):   %s", sw2.peek());double ratio = (sw2.peek().total!"usecs"*100.0/ sw1.peek().total!"usecs");    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);// Print dummy value to prevent optimization    writefln("Verification values: %s %s", result1, result2);}

made asm functions for mulmod and

7fc002d

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025

@ibuclaw can you explain why a = 100; b = 7919; c = 18446744073709551557u;
assert(powmod(a, b, c) == 18223853583554725198u); it exactly failed on x86 systems

changes made for x86 32 bits system

43c35de

Copy link

ContributorAuthor

Aditya-132 commentedMar 18, 2025

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems

how much i figureout is that this occured due to 32 bits x86 system the value overflowed

Aditya-132 force-pushed theissue-#10513 branch 2 times, most recently from8c7d8c0 to43c35deCompare

March 19, 2025 10:50

Merge branch 'dlang:master' into issue-dlang#10513

394af25

Copy link

Member

ibuclaw commentedMar 19, 2025

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems
how much i figureout is that this occured due to 32 bits x86 system the value overflowed

Yes. The new fallback code is prone to overflow. The original implementation can't be made simpler by removing theaddmod function.

Copy link

ContributorAuthor

Aditya-132 commentedMar 19, 2025•
edited
Loading

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems
how much i figureout is that this occured due to 32 bits x86 system the value overflowed
Yes. The new fallback code is prone to overflow. The original implementation can't be made simpler by removing theaddmod function.

i have a solution for that like we do in assembly we treat 16bit number in two register(in 8085) instaed of one and then perform operations on it . we can do same here by implementing 128 bits multipication using more registers in 32 bits (its complex)

Copy link

Member

ibuclaw commentedMar 19, 2025

I don't understand why you need inline assembler for 32bit powmod, when the original D code should be sufficient?

alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);return result % c;

Copy link

ContributorAuthor

Aditya-132 commentedMar 19, 2025

I don't understand why you need inline assembler for 32bit powmod, when the original D code should be sufficient?
alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);return result % c;

ok then i will make changes for it

Copy link

Member

ibuclaw commentedMar 19, 2025•
edited
Loading

For the slow path, theaddmod() function that protects against overflow is only required for modulus values greater thanuint.max+1 IIUC - (4294967296 * 4294967296 = 18446744073709551616 which is 0 with ulong, so no overflow gets lost when modulus is also <= 4294967296).

Running the benchmark locally, the D code measured faster than the inline asm when the modulus was within this bounds.

staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =void;// <---- start new snipif (c<=0x100000000)        {            result = a* b;return result% c;        }        result =0;// <---- end new snip        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);             a>>=1;            b = addmod(b, b, c);         }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}

Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  72 ms, 230 μs, and 8 hnsecsSecond implementation (D):   31 ms, 106 μs, and 6 hnsecsPerformance ratio: 43.07% (higher means ASM is faster)Verification values: 270309499 270309499Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  699 ms, 729 μs, and 5 hnsecsSecond implementation (D):   601 ms, 519 μs, and 3 hnsecsPerformance ratio: 85.96% (higher means ASM is faster)Verification values: 1636129629 1636129629

Copy link

ContributorAuthor

Aditya-132 commentedMar 19, 2025

For the slow path, theaddmod() function that protects against overflow is only required for modulus values greater thanuint.max+1 IIUC - (4294967296 * 4294967296 = 18446744073709551616 (which is 0 with ulong, so no overflow gets lost when modulus is also <= 4294967296).

Running the benchmark locally, the D code measured faster than the inline asm when the modulus was within this bounds.

staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =void;// <---- start new snipif (c<=0x100000000)        {            result = a* b;return result% c;        }        result =0;// <---- end new snip        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);             a>>=1;            b = addmod(b, b, c);         }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}

Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  72 ms, 230 μs, and 8 hnsecsSecond implementation (D):   31 ms, 106 μs, and 6 hnsecsPerformance ratio: 43.07% (higher means ASM is faster)Verification values: 270309499 270309499Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  699 ms, 729 μs, and 5 hnsecsSecond implementation (D):   601 ms, 519 μs, and 3 hnsecsPerformance ratio: 85.96% (higher means ASM is faster)Verification values: 1636129629 1636129629

ok i agree then what is better way of optimizing it

Copy link

Member

ibuclaw commentedMar 19, 2025•
edited
Loading

So I think it should be sufficient to optimize mulmod in the following way:

staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }        T result =0;version (D_InlineAsm_X86_64)        {// 128-bit mul with asm        }else version (GNU_OR_LDC_X86_64)        {// 128-bit mul with gcc extended asm        }else        {// slow addmod() method (does pragma(inline, true) help?)        }return result;    }else    {// DoubleT method    }}

Does this make sense?

Copy link

Member

ibuclaw commentedMar 19, 2025•
edited
Loading

So I think it should be sufficient to optimize mulmod in the following way:
staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }

It might even help whenT.sizeof == 4 andc <= 0x10000 as well to elide using DoubleT (64-bit) mul/div.

Copy link

ContributorAuthor

Aditya-132 commentedMar 19, 2025

So I think it should be sufficient to optimize mulmod in the following way:

staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }        T result =0;version (D_InlineAsm_X86_64)        {// 128-bit mul with asm        }else version (GNU_OR_LDC_X86_64)        {// 128-bit mul with gcc extended asm        }else        {// slow addmod() method (does pragma(inline, true) help?)        }return result;    }else    {// DoubleT method    }}

Does this make sense?

i think you are correct here this looks similiar to other open source language iplementation of function

Copy link

ContributorAuthor

Aditya-132 commentedMar 19, 2025•
edited
Loading

Sorry for inconvenience
Will need 1 Day time to implement it due to exams
hopefully you will consider it.

implmented the given changes

e1cb9a9

Merge branch 'dlang:master' into issue-dlang#10513

23dffa7

kinke mentioned this pull request

Mar 24, 2025

[druntime] core.int128: Optimize a bit and prepare for more LDC/GDC optimizationsdlang/dmd#21071

Merged

added method with core.int128

933d72f

Copy link

ContributorAuthor

Aditya-132 commentedMar 24, 2025

sorry for wasting your time with inLine ASM should have thought of using core.int128 earlier

Copy link

ContributorAuthor

Aditya-132 commentedMar 24, 2025

@kinke @ibuclaw you can review this implmentation with core.int128

ibuclaw reviewed

Mar 24, 2025

View reviewed changes

std/math/exponential.d OutdatedShow resolvedHide resolved

Aditya-132 requested review fromibuclaw andkinke

March 24, 2025 01:07

Aditya-132and others added2 commits

March 24, 2025 07:41

removed line 61 to 65

2bf559a

Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>

styling changees

ef8a738

Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>

ibuclaw reviewed

Mar 24, 2025

View reviewed changes

std/math/exponential.d OutdatedShow resolvedHide resolved

Aditya-132 force-pushed theissue-#10513 branch from6edc6ab toef8a738Compare

March 25, 2025 03:21

styling changes

dfcfb4c

ibuclaw reviewed

Mar 25, 2025

View reviewed changes

std/math/exponential.d Outdated

Comment on lines 841 to 844

		const product128 = mul(Cent(a), Cent(b));
		Cent centRemainder;
		udivmod(product128, Cent(c), centRemainder);
		return cast(T)(centRemainder.lo);

Copy link

Member

ibuclawMar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sorry@Aditya-132, looks like@kinke wasn't done with changes to core.int128.

dlang/dmd#21081

With that PR merged, the correct overloads to call are:

Suggested change

	const product128 = mul(Cent(a), Cent(b));
	Cent centRemainder;
	udivmod(product128,Cent(c), centRemainder);
	returncast(T)(centRemainder.lo);
	const product128 = mul(a, b);
	T remainder =void;
	udivmod(product128,c, remainder);
	returnremainder;

Thanks for the patience. 🙇

Copy link

Member

ibuclawMar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UFCS should also be possible

T remainder = void;mul(a, b).udivmod(c, remainder);return remainder;

Copy link

ContributorAuthor

Aditya-132Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UFCS should also be possible

T remainder = void;mul(a, b).udivmod(c, remainder);return remainder;

its giving error

Copy link

ContributorAuthor

Aditya-132Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change are also giving error

Copy link

Member

ibuclawMar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

DMD PR is merged now, please rebase on top of phobos/master.

Copy link

ContributorAuthor

Aditya-132Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done

Copy link

ContributorAuthor

Aditya-132Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@kinke done with changes

Copy link

Member

ibuclawMar 26, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So the error is floating point exception?

There was overflow code between the mul and div once, but that disappeared for some reason?

auto product = mul(a, b);if (product.hi >= c)    product.hi %= c;T remainder = void;udivmod(product, c, remainder);return remainder;

Copy link

ContributorAuthor

Aditya-132Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

should i make above changes in code

Copy link

Member

ibuclawMar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, adding the overflow condition should allow you to use 64bit udivmod.

Aditya-132and others added3 commits

March 26, 2025 02:24

changes udivmod implementaion

8f643bb

Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>

removed Cent from mul operation

71a1e9f

changed to Cent(lo: c)

4f79940

Aditya-132 requested a review fromibuclaw

March 26, 2025 02:43

changes as per review

6e7531c

ibuclaw reviewed

Mar 26, 2025

View reviewed changes

std/math/exponential.d Outdated

Comment on lines 838 to 847

		else
		{
		if (a & 1)
		result = addmod(result, b, c);

		a >>= 1;
		b = addmod(b, b, c);
		import core.int128 : Cent, mul, udivmod;
		auto product = mul(a, b);
		if (product.hi >= c)
		product.hi %= c;
		T remainder = void;
		udivmod(product, c, remainder);
		return remainder;
		}

Copy link

Member

ibuclawMar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is this being rendered wrong or is indentation missing?

Copy link

ContributorAuthor

Aditya-132Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done with correction

Aditya-132 added2 commits

March 27, 2025 05:53

indentation changes

c8a2400

indentation changes

6d6fad6

Aditya-132 requested a review fromibuclaw

March 27, 2025 00:47

Merge branch 'dlang:master' into issue-dlang#10513

bced460

Copy link

ContributorAuthor

Aditya-132 commentedMar 27, 2025

Is there any specific reason why it fails for [Main / FreeBSD 13.2 x64 (pull_request)

Copy link

Member

ibuclaw commentedMar 27, 2025

Is there any specific reason why it fails for [Main / FreeBSD 13.2 x64 (pull_request)

The action 'Run in VM' has timed out after 19 minutes.

Unrelated.

ibuclaw approved these changes

Mar 27, 2025

View reviewed changes

thewilsonator merged commitdfc3d02 intodlang:master

Mar 27, 2025

9 of 10 checks passed

Aditya-132 deleted the issue-#10513 branch

March 27, 2025 10:54

VPanteleev-S7 pushed a commit to VPanteleev-S7/phobos that referenced this pull request

Jun 5, 2025

Fixdlang#10513- powmod is slow for ulong (dlang#10688)

96fc0f3

Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>

Labels

None yet

5 participants

Movatterモバイル変換

Uh oh!

Fix #10513 - powmod is slow for ulong#10688

Fix #10513 - powmod is slow for ulong#10688

Uh oh!

Conversation

Aditya-132 commentedMar 18, 2025

Uh oh!

dlang-bot commentedMar 18, 2025

Bugzilla references

Testing this PR locally

Uh oh!

ibuclaw commentedMar 18, 2025

Uh oh!

Aditya-132 commentedMar 18, 2025

Uh oh!

thewilsonator commentedMar 18, 2025

Uh oh!

Aditya-132 commentedMar 18, 2025

Uh oh!

Aditya-132 commentedMar 18, 2025

Uh oh!

ibuclaw commentedMar 18, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Aditya-132 commentedMar 18, 2025• edited by ibuclawLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

TESTING RESULT

Script which i use for testing

Uh oh!

Aditya-132 commentedMar 18, 2025

Uh oh!

Aditya-132 commentedMar 18, 2025

Uh oh!

ibuclaw commentedMar 19, 2025

Uh oh!

Aditya-132 commentedMar 19, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ibuclaw commentedMar 19, 2025

Uh oh!

Aditya-132 commentedMar 19, 2025

Uh oh!

ibuclaw commentedMar 19, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Aditya-132 commentedMar 19, 2025

Uh oh!

ibuclaw commentedMar 19, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ibuclaw commentedMar 19, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Aditya-132 commentedMar 19, 2025

Uh oh!

Aditya-132 commentedMar 19, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Aditya-132 commentedMar 24, 2025

Uh oh!

Aditya-132 commentedMar 24, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ibuclaw commentedMar 18, 2025•
edited
Loading

Aditya-132 commentedMar 18, 2025•
edited by ibuclaw
Loading

Aditya-132 commentedMar 19, 2025•
edited
Loading

ibuclaw commentedMar 19, 2025•
edited
Loading

ibuclaw commentedMar 19, 2025•
edited
Loading

ibuclaw commentedMar 19, 2025•
edited
Loading

Aditya-132 commentedMar 19, 2025•
edited
Loading

ibuclawMar 26, 2025•
edited
Loading