Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Fix #10513 - powmod is slow for ulong#10688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
thewilsonator merged 30 commits intodlang:masterfromAditya-132:issue-#10513
Mar 27, 2025

Conversation

Aditya-132
Copy link
Contributor

The key optimizations I've made are:

128-bit multiplication via inline assembly:

For 64-bit types on x86_64 platforms, I've added direct use of hardware multiplication that produces a 128-bit result (stored in high:low registers)
This avoids the repeated addition loop in the original algorithm

Efficient modular reduction:

Implemented a reduction algorithm that handles the 128-bit result efficiently
Uses a simplified approach for when the high bits are zero

Early reduction:

The base value is reduced modulo m at the beginning, which improves performance when the base is large and modulus is small

Special case handling:

Added explicit handling for modulus=1 and exponent=0 cases up front

Version checks:

The code falls back to the original algorithm if inline assembly is not available
This ensures portability while providing optimizations where possible

@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better,@Aditya-132! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting thedetails link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (withParams: andReturns:)

Please seeCONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, pleasereference a Bugzilla issue or create amanual changelog.

Testing this PR locally

If you don't have alocal development environment setup, you can useDigger to test this PR:

dub run digger -- build"master + phobos#10688"

@ibuclaw
Copy link
Member

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

@Aditya-132
Copy link
ContributorAuthor

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

sure

@thewilsonator
Copy link
Contributor

Please give the PR and commit titles a description based on the issues this solves e.g.Fix #10513 - powmod is slow for ulong

@Aditya-132
Copy link
ContributorAuthor

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

can you provide any example which is already present in code

@Aditya-132Aditya-132 changed the titleFixing Issue-#10513Fix #10513 - powmod is slow for ulongMar 18, 2025
@Aditya-132
Copy link
ContributorAuthor

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

instead of modifying powmod we can diredctly modify mulmod

@ibuclaw
Copy link
Member

ibuclaw commentedMar 18, 2025
edited
Loading

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to.do gcc-style asm for gdc and ldc?

can you provide any example which is already present in code

So, unlike dmd iasm whichis an uninlineable black box forces you manually put in all load and stores, gcc iasm instead has the notion of input and output operands (first outputs, then inputs)

As all you're doing is mul, and taking the result from two registers, the equivalent should be

low = a;asm @trusted nothrow @nogc {    "mull %0"    : "=d" (high), "+a"(low)    : "0" (b);}

@Aditya-132
Copy link
ContributorAuthor

Aditya-132 commentedMar 18, 2025
edited by ibuclaw
Loading

@ibuclaw

TESTING RESULT

BENCHMARK 1: Testing mulmod implementationsTesting with 32-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  9 ms, 270 μs, and 5 hnsecsSecond implementation (D):   13 ms, 119 μs, and 8 hnsecsPerformance ratio: 141.52% (higher means ASM is faster)Verification values: 312463588 312463588Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  45 ms, 296 μs, and 4 hnsecsSecond implementation (D):   556 ms and 119 μsPerformance ratio: 1227.74% (higher means ASM is faster)Verification values: 311557887622942605 311557887622942605BENCHMARK 2: Testing powmod implementationsTesting with 32-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  1 ms, 862 μs, and 9 hnsecsSecond implementation (D):   1 ms, 904 μs, and 1 hnsecPerformance ratio: 102.26% (higher means ASM is faster)Verification values: 8312587 8312587Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  2 ms and 932 μsSecond implementation (D):   42 ms, 178 μs, and 2 hnsecsPerformance ratio: 1438.54% (higher means ASM is faster)Verification values: 12615239 12615239(dmd-2.110.0)➜  testing

Script which i use for testing

importstd.stdio;importstd.datetime.stopwatch;importstd.random;importstd.traits;importstd.meta;importstd.algorithm;importcore.time;// First implementation from the pasted functionstaticTmulmod1(T)(T a, T b, T c) {staticif (T.sizeof==8) {version (D_InlineAsm_X86_64) {// Fast path for small valuesulong low =void;ulong high =void;// Perform 128-bit multiplication: a * b = [high:low]asmpure@trustednothrow@nogc            {                movRAX, a;// Load a into RAX                mul b;// Multiply by b (RDX:RAX = a * b)                mov low,RAX;// Store low 64 bits                mov high,RDX;// Store high 64 bits            }// Handle overflow for large valuesif (high>= c) {                high%= c;            }if (high==0) {return low% c;            }// Reduce the high 64 bits modulo modasmpure@trustednothrow@nogc            {                movRAX, high;// Load high part                xorRDX,RDX;// Clear RDX for division                div c;// RAX = high / mod, RDX = high % mod                mov high,RDX;// Store remainder (high % mod)                movRAX, low;// Load low part                movRDX, high;// Load reduced high part                div c;// Divide full 128-bit number by mod                mov low,RDX;// Store remainder (final result)            }return low;// Return (a * b) % mod        }else {// Fallback for non-x86_64 platforms            T result =0;            a%= c;            b%= c;while (b>0) {if (b &1) {                    result = (result+ a)% c;                }                a = (a*2)% c;                b>>=1;            }return result;        }    }elsestaticif (T.sizeof==4) {version (D_InlineAsm_X86) {// Fast path for small valuesuint low =void;// Changed to uint for 32-bituint high =void;// Changed to uint for 32-bit// Perform 64-bit multiplication: a * b = [high:low]asmpure@trustednothrow@nogc            {                movEAX, a;// Load a into EAX                mul b;// Multiply by b                mov low,EAX;// Store low bits                mov high,EDX;// Store high bits            }// Handle overflow for large valuesif (high>= c) {                high%= c;            }if (high==0) {return low% c;            }// Reduce the high bits modulo modasmpure@trustednothrow@nogc            {                movEAX, high;// Load high part                xorEDX,EDX;// Clear EDX for division                div c;// EAX = high / mod, EDX = high % mod                mov high,EDX;// Store remainder (high % mod)                movEAX, low;// Load low part                movEDX, high;// Load reduced high part                div c;// Divide full number by mod                mov low,EDX;// Store remainder (final result)            }return low;// Return (a * b) % mod        }else {// Use 64-bit type for the calculationulong result = (cast(ulong)a*cast(ulong)b)% c;returncast(T)result;        }    }else {// Fallback for other sizesimportstd.bigint;returncast(T)((BigInt(a)* BigInt(b))% BigInt(c));    }}// Second implementation from function provided in the querytemplateLargest(T...) {staticif (T.length==1)alias Largest = T[0];elsestaticif (T[0].sizeof> Largest!(T[1..$]).sizeof)alias Largest = T[0];elsealias Largest = Largest!(T[1..$]);}staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =0;        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);            a>>=1;            b = addmod(b, b, c);        }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}// FIX: Explicitly specify the return typeFpowmod1(F, G, H)(F x, G n, H m)if (isUnsigned!F&& isUnsigned!G&& isUnsigned!H){alias ReturnT =std.traits.Unqual!(Largest!(F, H));    ReturnT base =cast(ReturnT)(x% m);    ReturnT result =1;    ReturnT modulus =cast(ReturnT)m;std.traits.Unqual!G exponent = n;while (exponent>0) {if (exponent &1)            result = mulmod1!ReturnT(result, base, modulus);        base = mulmod1!ReturnT(base, base, modulus);        exponent>>=1;    }returncast(F)result;}// FIX: Explicitly specify the return typeFpowmod2(F, G, H)(F x, G n, H m)if (isUnsigned!F&& isUnsigned!G&& isUnsigned!H){alias ReturnT =std.traits.Unqual!(Largest!(F, H));    ReturnT base =cast(ReturnT)(x% m);    ReturnT result =1;    ReturnT modulus =cast(ReturnT)m;std.traits.Unqual!G exponent = n;while (exponent>0) {if (exponent &1)            result = mulmod2!ReturnT(result, base, modulus);        base = mulmod2!ReturnT(base, base, modulus);        exponent>>=1;    }returncast(F)result;}// DELETED conflicting Unqual templatevoidmain() {// Run tests first to ensure correctness    writeln("VERIFICATION TESTS:");    runTests();// Test for 32-bit integers    writeln("\nBENCHMARK 1: Testing mulmod implementations");    writeln("Testing with 32-bit integers:");    benchmarkMulmod!(uint)();// Test for 64-bit integers    writeln("\nTesting with 64-bit integers:");    benchmarkMulmod!(ulong)();// Test powmod implementations    writeln("\n\nBENCHMARK 2: Testing powmod implementations");    writeln("Testing with 32-bit integers:");    benchmarkPowmod!(uint)();        writeln("\nTesting with 64-bit integers:");    benchmarkPowmod!(ulong)();}voidrunTests() {    writeln("Running unit tests...");// Test cases provided in the query// First set of testsassert(powmod1(1U,10U,3U)==1);assert(powmod1(3U,2U,6U)==3);assert(powmod1(5U,5U,15U)==5);assert(powmod1(2U,3U,5U)==3);assert(powmod1(2U,4U,5U)==1);assert(powmod1(2U,5U,5U)==2);// Second implementationassert(powmod2(1U,10U,3U)==1);assert(powmod2(3U,2U,6U)==3);assert(powmod2(5U,5U,15U)==5);assert(powmod2(2U,3U,5U)==3);assert(powmod2(2U,4U,5U)==1);assert(powmod2(2U,5U,5U)==2);// Second set of tests - ulongulong a =18446744073709551615u, b =20u, c =18446744073709551610u;assert(powmod1(a, b, c)==95367431640625u);    a =100; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==18223853583554725198u);    a =117; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==11493139548346411394u);    a =134; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==10979163786734356774u);    a =151; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==7023018419737782840u);    a =168; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==58082701842386811u);    a =185; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==17423478386299876798u);    a =202; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==5522733478579799075u);    a =219; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==15230218982491623487u);    a =236; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==5198328724976436000u);        a =0; b =7919; c =18446744073709551557u;assert(powmod1(a, b, c)==0);    a =123; b =0; c =18446744073709551557u;assert(powmod1(a, b, c)==1);immutableulong a1 =253, b1 =7919, c1 =18446744073709551557u;assert(powmod1(a1, b1, c1)==3883707345459248860u);// Second implementation    a =18446744073709551615u; b =20u; c =18446744073709551610u;assert(powmod2(a, b, c)==95367431640625u);    a =100; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==18223853583554725198u);    a =117; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==11493139548346411394u);    a =134; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==10979163786734356774u);    a =151; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==7023018419737782840u);    a =168; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==58082701842386811u);    a =185; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==17423478386299876798u);    a =202; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==5522733478579799075u);    a =219; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==15230218982491623487u);    a =236; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==5198328724976436000u);        a =0; b =7919; c =18446744073709551557u;assert(powmod2(a, b, c)==0);    a =123; b =0; c =18446744073709551557u;assert(powmod2(a, b, c)==1);assert(powmod2(a1, b1, c1)==3883707345459248860u);// Third set of tests - uintuint x =100, y =7919, z =1844674407u;assert(powmod1(x, y, z)==1613100340u);    x =134; y =7919; z =1844674407u;assert(powmod1(x, y, z)==734956622u);    x =151; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1738696945u);    x =168; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1247580927u);    x =185; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1293855176u);    x =202; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1566963682u);    x =219; y =7919; z =1844674407u;assert(powmod1(x, y, z)==181227807u);    x =236; y =7919; z =1844674407u;assert(powmod1(x, y, z)==217988321u);    x =253; y =7919; z =1844674407u;assert(powmod1(x, y, z)==1588843243u);        x =0; y =7919; z =184467u;assert(powmod1(x, y, z)==0);    x =123; y =0; z =1844674u;assert(powmod1(x, y, z)==1);// Second implementation    x =100; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1613100340u);    x =134; y =7919; z =1844674407u;assert(powmod2(x, y, z)==734956622u);    x =151; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1738696945u);    x =168; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1247580927u);    x =185; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1293855176u);    x =202; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1566963682u);    x =219; y =7919; z =1844674407u;assert(powmod2(x, y, z)==181227807u);    x =236; y =7919; z =1844674407u;assert(powmod2(x, y, z)==217988321u);    x =253; y =7919; z =1844674407u;assert(powmod2(x, y, z)==1588843243u);        x =0; y =7919; z =184467u;assert(powmod2(x, y, z)==0);    x =123; y =0; z =1844674u;assert(powmod2(x, y, z)==1);        writeln("All tests passed!");}voidbenchmarkMulmod(T)() {// Number of iterations for benchmarkingimmutable iterations = 1_000_000;// Generate random test cases    T[] a_values =new T[iterations];    T[] b_values =new T[iterations];    T[] m_values =new T[iterations];auto rnd =Random(unpredictableSeed);// Fill arrays with random valuesfor (size_t i =0; i< iterations; i++) {staticif (is(T==ulong)) {            a_values[i] = uniform!("[]")(T.min, T.max/2, rnd);            b_values[i] = uniform!("[]")(T.min, T.max/2, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(T.max/4, T.max-1, rnd);        }else {            a_values[i] = uniform!("[]")(T.min, T.max, rnd);            b_values[i] = uniform!("[]")(T.min, T.max, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(T.max/2, T.max-1, rnd);        }    }// Variables to store results to prevent compiler optimizations    T result1 =0;    T result2 =0;// Test first implementation    writeln("Testing first implementation (ASM-based):");auto sw1 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result1^= mulmod1!T(a_values[i], b_values[i], m_values[i]);    }        sw1.stop();// Test second implementation    writeln("Testing second implementation (Standard D):");auto sw2 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result2^= mulmod2!T(a_values[i], b_values[i], m_values[i]);    }        sw2.stop();// Print results    writefln("First implementation (ASM):  %s", sw1.peek());    writefln("Second implementation (D):   %s", sw2.peek());double ratio = (sw2.peek().total!"usecs"*100.0/ sw1.peek().total!"usecs");    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);// Print dummy value to prevent optimization    writefln("Verification values: %s %s", result1, result2);}voidbenchmarkPowmod(T)() {// Number of iterations for benchmarkingimmutable iterations = 10_000;// Generate random test cases    T[] x_values =new T[iterations];    T[] n_values =new T[iterations];    T[] m_values =new T[iterations];auto rnd =Random(unpredictableSeed);// Fill arrays with random values - using realistic rangesfor (size_t i =0; i< iterations; i++) {staticif (is(T==ulong)) {            x_values[i] = uniform!("[]")(2,1000000, rnd);            n_values[i] = uniform!("[]")(10,10000, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(1000,10000000, rnd);        }else {            x_values[i] = uniform!("[]")(2,1000000, rnd);            n_values[i] = uniform!("[]")(10,10000, rnd);// Avoid modulus being too small or zero            m_values[i] = uniform!("[]")(1000,10000000, rnd);        }    }// Variables to store results to prevent compiler optimizations    T result1 =0;    T result2 =0;// Test first implementation    writeln("Testing first powmod implementation (with ASM mulmod):");auto sw1 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result1^= powmod1!T(x_values[i], n_values[i], m_values[i]);    }        sw1.stop();// Test second implementation    writeln("Testing second powmod implementation (with standard D mulmod):");auto sw2 = StopWatch(AutoStart.yes);for (size_t i =0; i< iterations; i++) {        result2^= powmod2!T(x_values[i], n_values[i], m_values[i]);    }        sw2.stop();// Print results    writefln("First implementation (ASM):  %s", sw1.peek());    writefln("Second implementation (D):   %s", sw2.peek());double ratio = (sw2.peek().total!"usecs"*100.0/ sw1.peek().total!"usecs");    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);// Print dummy value to prevent optimization    writefln("Verification values: %s %s", result1, result2);}

@Aditya-132
Copy link
ContributorAuthor

@ibuclaw can you explain why a = 100; b = 7919; c = 18446744073709551557u;
assert(powmod(a, b, c) == 18223853583554725198u); it exactly failed on x86 systems

@Aditya-132
Copy link
ContributorAuthor

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems

how much i figureout is that this occured due to 32 bits x86 system the value overflowed

@Aditya-132Aditya-132force-pushed theissue-#10513 branch 2 times, most recently from8c7d8c0 to43c35deCompareMarch 19, 2025 10:50
@ibuclaw
Copy link
Member

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems

how much i figureout is that this occured due to 32 bits x86 system the value overflowed

Yes. The new fallback code is prone to overflow. The original implementation can't be made simpler by removing theaddmod function.

@Aditya-132
Copy link
ContributorAuthor

Aditya-132 commentedMar 19, 2025
edited
Loading

Anyone has solution of it please tell me i cant unterstand this errors on why they are coming on x86 systems
how much i figureout is that this occured due to 32 bits x86 system the value overflowed

Yes. The new fallback code is prone to overflow. The original implementation can't be made simpler by removing theaddmod function.

i have a solution for that like we do in assembly we treat 16bit number in two register(in 8085) instaed of one and then perform operations on it . we can do same here by implementing 128 bits multipication using more registers in 32 bits (its complex)

@ibuclaw
Copy link
Member

I don't understand why you need inline assembler for 32bit powmod, when the original D code should be sufficient?

alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);return result % c;

@Aditya-132
Copy link
ContributorAuthor

I don't understand why you need inline assembler for 32bit powmod, when the original D code should be sufficient?

alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);return result % c;

ok then i will make changes for it

@ibuclaw
Copy link
Member

ibuclaw commentedMar 19, 2025
edited
Loading

For the slow path, theaddmod() function that protects against overflow is only required for modulus values greater thanuint.max+1 IIUC - (4294967296 * 4294967296 = 18446744073709551616 which is 0 with ulong, so no overflow gets lost when modulus is also <= 4294967296).

Running the benchmark locally, the D code measured faster than the inline asm when the modulus was within this bounds.

staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =void;// <---- start new snipif (c<=0x100000000)        {            result = a* b;return result% c;        }        result =0;// <---- end new snip        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);             a>>=1;            b = addmod(b, b, c);         }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}
Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  72 ms, 230 μs, and 8 hnsecsSecond implementation (D):   31 ms, 106 μs, and 6 hnsecsPerformance ratio: 43.07% (higher means ASM is faster)Verification values: 270309499 270309499Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  699 ms, 729 μs, and 5 hnsecsSecond implementation (D):   601 ms, 519 μs, and 3 hnsecsPerformance ratio: 85.96% (higher means ASM is faster)Verification values: 1636129629 1636129629

@Aditya-132
Copy link
ContributorAuthor

For the slow path, theaddmod() function that protects against overflow is only required for modulus values greater thanuint.max+1 IIUC - (4294967296 * 4294967296 = 18446744073709551616 (which is 0 with ulong, so no overflow gets lost when modulus is also <= 4294967296).

Running the benchmark locally, the D code measured faster than the inline asm when the modulus was within this bounds.

staticTmulmod2(T)(T a, T b, T c) {staticif (T.sizeof==8) {staticTaddmod(T a, T b, T c) {            b = c- b;if (a>= b)return a- b;elsereturn c- b+ a;        }        T result =void;// <---- start new snipif (c<=0x100000000)        {            result = a* b;return result% c;        }        result =0;// <---- end new snip        b%= c;while (a>0) {if (a &1)                result = addmod(result, b, c);             a>>=1;            b = addmod(b, b, c);         }return result;    }else {alias DoubleT = AliasSeq!(void,ushort,uint,void,ulong)[T.sizeof];        DoubleT result =cast(DoubleT) (cast(DoubleT) a*cast(DoubleT) b);return result% c;    }}
Testing with 64-bit integers:Testing first implementation (ASM-based):Testing second implementation (Standard D):First implementation (ASM):  72 ms, 230 μs, and 8 hnsecsSecond implementation (D):   31 ms, 106 μs, and 6 hnsecsPerformance ratio: 43.07% (higher means ASM is faster)Verification values: 270309499 270309499Testing with 64-bit integers:Testing first powmod implementation (with ASM mulmod):Testing second powmod implementation (with standard D mulmod):First implementation (ASM):  699 ms, 729 μs, and 5 hnsecsSecond implementation (D):   601 ms, 519 μs, and 3 hnsecsPerformance ratio: 85.96% (higher means ASM is faster)Verification values: 1636129629 1636129629

ok i agree then what is better way of optimizing it

@ibuclaw
Copy link
Member

ibuclaw commentedMar 19, 2025
edited
Loading

So I think it should be sufficient to optimize mulmod in the following way:

staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }        T result =0;version (D_InlineAsm_X86_64)        {// 128-bit mul with asm        }else version (GNU_OR_LDC_X86_64)        {// 128-bit mul with gcc extended asm        }else        {// slow addmod() method (does pragma(inline, true) help?)        }return result;    }else    {// DoubleT method    }}

Does this make sense?

@ibuclaw
Copy link
Member

ibuclaw commentedMar 19, 2025
edited
Loading

So I think it should be sufficient to optimize mulmod in the following way:

staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }

It might even help whenT.sizeof == 4 andc <= 0x10000 as well to elide using DoubleT (64-bit) mul/div.

Aditya-132 reacted with thumbs up emoji

@Aditya-132
Copy link
ContributorAuthor

So I think it should be sufficient to optimize mulmod in the following way:

staticTmulmod(T a, T b, T c){staticif (T.sizeof==8)    {if (c<=0x100000000)        {            T result = a* b;return result% c;        }        T result =0;version (D_InlineAsm_X86_64)        {// 128-bit mul with asm        }else version (GNU_OR_LDC_X86_64)        {// 128-bit mul with gcc extended asm        }else        {// slow addmod() method (does pragma(inline, true) help?)        }return result;    }else    {// DoubleT method    }}

Does this make sense?

i think you are correct here this looks similiar to other open source language iplementation of function

@Aditya-132
Copy link
ContributorAuthor

Aditya-132 commentedMar 19, 2025
edited
Loading

Sorry for inconvenience
Will need 1 Day time to implement it due to exams
hopefully you will consider it.

@Aditya-132
Copy link
ContributorAuthor

sorry for wasting your time with inLine ASM should have thought of using core.int128 earlier

@Aditya-132
Copy link
ContributorAuthor

@kinke@ibuclaw you can review this implmentation with core.int128

Aditya-132and others added2 commitsMarch 24, 2025 07:41
Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>
Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>
Comment on lines 841 to 844
const product128 = mul(Cent(a), Cent(b));
Cent centRemainder;
udivmod(product128, Cent(c), centRemainder);
return cast(T)(centRemainder.lo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sorry@Aditya-132, looks like@kinke wasn't done with changes to core.int128.

dlang/dmd#21081

With that PR merged, the correct overloads to call are:

Suggested change
const product128 = mul(Cent(a), Cent(b));
Cent centRemainder;
udivmod(product128,Cent(c), centRemainder);
returncast(T)(centRemainder.lo);
const product128 = mul(a, b);
T remainder =void;
udivmod(product128,c, remainder);
returnremainder;

Thanks for the patience. 🙇

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UFCS should also be possible

T remainder = void;mul(a, b).udivmod(c, remainder);return remainder;

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UFCS should also be possible

T remainder = void;mul(a, b).udivmod(c, remainder);return remainder;

its giving error

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change are also giving error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

DMD PR is merged now, please rebase on top of phobos/master.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@kinke done with changes

Copy link
Member

@ibuclawibuclawMar 26, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So the error is floating point exception?

There was overflow code between the mul and div once, but that disappeared for some reason?

auto product = mul(a, b);if (product.hi >= c)    product.hi %= c;T remainder = void;udivmod(product, c, remainder);return remainder;

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

should i make above changes in code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, adding the overflow condition should allow you to use 64bit udivmod.

Comment on lines 838 to 847
else
{
if (a & 1)
result = addmod(result, b, c);

a >>= 1;
b = addmod(b, b, c);
import core.int128 : Cent, mul, udivmod;
auto product = mul(a, b);
if (product.hi >= c)
product.hi %= c;
T remainder = void;
udivmod(product, c, remainder);
return remainder;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is this being rendered wrong or is indentation missing?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done with correction

@Aditya-132
Copy link
ContributorAuthor

Is there any specific reason why it fails for [Main / FreeBSD 13.2 x64 (pull_request)

@ibuclaw
Copy link
Member

Is there any specific reason why it fails for [Main / FreeBSD 13.2 x64 (pull_request)

The action 'Run in VM' has timed out after 19 minutes.

Unrelated.

@thewilsonatorthewilsonator merged commitdfc3d02 intodlang:masterMar 27, 2025
9 of 10 checks passed
@Aditya-132Aditya-132 deleted the issue-#10513 branchMarch 27, 2025 10:54
VPanteleev-S7 pushed a commit to VPanteleev-S7/phobos that referenced this pull requestJun 5, 2025
Co-authored-by: Iain Buclaw <ibuclaw@gdcproject.org>
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ibuclawibuclawibuclaw approved these changes

@kinkekinkeAwaiting requested review from kinke

Assignees

@kinkekinke

Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

5 participants
@Aditya-132@dlang-bot@ibuclaw@thewilsonator@kinke

[8]ページ先頭

©2009-2025 Movatter.jp