In the previous implementation, the cost of evaluatingfmod(x, y) was linear inlog(|x/y|) (or, the difference in exponents). Especially for the larger float types, this made some inputs very slow to evaluate. E.g.fmod(f64::MAX, 13.0) would run a loop with over 1000 iterations.

This implementation changes that scaling tolog(log(|x/y|)), giving a significant improvement for the worst cases.

Todos:

For some more typical inputs this can be a small regression, so I still need to look into that closer.
Currently not available forf128, even though that's the one that would benefit the most

Implement Barrett reduction for modular multiplication

bf85502

tgross35 reviewed

May 1, 2025

View reviewed changes

libm/src/math/generic/fmod.rs

Comment on lines +63 to +66

		// FIXME: This is a temporary hack to get around the lack of `u256 / u256`.
		// Actually, the algorithm only needs the operation `(x << I::BITS) / y`
		// where `x < y`. That is, a division `u256 / u128` where the quotient must
		// not overflow `u128` would be sufficient for `f128`.

Copy link

Contributor

tgross35May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would this be easier with u256? We have an implementation here

compiler-builtins/libm/src/math/support/big.rs

Line 15 inf456aa8

pubstructu256{

but only shifts and widening multiplication are supported currently.

Copy link

ContributorAuthor

quaternicMay 1, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifu256 already implementedDiv, then the generic code here could just use that, so it would be easy, yes. But implementing that sounds like extra work just to achieve a suboptimal solution.

a divisionu2N / uN where the quotient must not overflowuN

For context, x86'sdiv-instruction works just like this. Take the double-wide dividend in a pair of registers, divide by a value in a third register, and replace the low and high halves of the dividend with the quotient and remainder respectively. If the quotient would overflow (which is exactly whenx.hi() >= y), signal divide error.

So that's the abstraction I'd like to use; something like

unsafefnunchecked_wide_div_rem<U:HInt>(U::D,U) ->(U,U);

But of course, that would be even more work to implement since it doesn't exist yet, and I don't think other arches have a native operation for it.

Another idea would be to get rid of the integer division altogether, and compute the reciprocal from the original floating point value. I expect that this could have better performance, but it needs more careful analysis.

Optimize performance of fmod with Barrett multiplication

c811592

quaternic force-pushed thefmod-opt branch from77429e2 toc811592Compare

May 2, 2025 19:48

Copy link

ContributorAuthor

quaternic commentedMay 2, 2025

It looks like CI was previously failing only because there were a fewdebug_assert_eq which include formatting code for the arguments. Switching those todebug_assert!(x == y) seems to have been sufficient.

Copy link

ContributorAuthor

quaternic commentedMay 2, 2025

icount::icount_bench_fmod_group::icount_bench_fmod logspace:setup_fmod()  Baselines:                      hardfloat|hardfloat  Instructions:                       74379|1019076              (-92.7013%) [-13.7011x]  L1 Hits:                            84569|1021276              (-91.7193%) [-12.0762x]  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]  RAM Hits:                              18|11                   (+63.6364%) [+1.63636x]  Total read+write:                   84593|1021288              (-91.7170%) [-12.0730x]  Estimated Cycles:                   85229|1021666              (-91.6578%) [-11.9873x]icount::icount_bench_fmodf128_group::icount_bench_fmodf128 logspace:setup_fmodf128()  Baselines:                      hardfloat|hardfloat  Instructions:                    30701786|30701786             (No change)  L1 Hits:                         30717674|30717679             (-0.00002%) [-1.00000x]  L2 Hits:                               10|5                    (+100.000%) [+2.00000x]  RAM Hits:                              37|37                   (No change)  Total read+write:                30717721|30717721             (No change)  Estimated Cycles:                30719019|30718999             (+0.00007%) [+1.00000x]icount::icount_bench_fmodf16_group::icount_bench_fmodf16 logspace:setup_fmodf16()  Baselines:                      hardfloat|hardfloat  Instructions:                       50481|50124                (+0.71223%) [+1.00712x]  L1 Hits:                            62816|56769                (+10.6519%) [+1.10652x]  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]  RAM Hits:                              23|15                   (+53.3333%) [+1.53333x]  Total read+write:                   62845|56785                (+10.6718%) [+1.10672x]  Estimated Cycles:                   63651|57299                (+11.0857%) [+1.11086x]icount::icount_bench_fmodf_group::icount_bench_fmodf logspace:setup_fmodf()  Baselines:                      hardfloat|hardfloat  Instructions:                       54988|151547               (-63.7155%) [-2.75600x]  L1 Hits:                            63383|153749               (-58.7750%) [-2.42571x]  L2 Hits:                                4|1                    (+300.000%) [+4.00000x]  RAM Hits:                              15|9                    (+66.6667%) [+1.66667x]  Total read+write:                   63402|153759               (-58.7653%) [-2.42514x]  Estimated Cycles:                   63928|154069               (-58.5069%) [-2.41004x]

If I understand correctly, the inputs for these is a sampling withlog(|x|) andlog(|y|) each independently evenly spaced, so it should give a reasonable estimate of overall performance for the general case.

Labels

None yet

2 participants

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance scaling of`fmod` using modular exponentiation#898

Are you sure you want to change the base?

Improve performance scaling of`fmod` using modular exponentiation#898

Uh oh!