Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Improve performance scaling offmod using modular exponentiation#898

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
quaternic wants to merge2 commits intorust-lang:master
base:master
Choose a base branch
Loading
fromquaternic:fmod-opt

Conversation

quaternic
Copy link
Contributor

In the previous implementation, the cost of evaluatingfmod(x, y) was linear inlog(|x/y|) (or, the difference in exponents). Especially for the larger float types, this made some inputs very slow to evaluate. E.g.fmod(f64::MAX, 13.0) would run a loop with over 1000 iterations.

This implementation changes that scaling tolog(log(|x/y|)), giving a significant improvement for the worst cases.

Todos:

  • For some more typical inputs this can be a small regression, so I still need to look into that closer.
  • Currently not available forf128, even though that's the one that would benefit the most

Comment on lines +63 to +66
// FIXME: This is a temporary hack to get around the lack of `u256 / u256`.
// Actually, the algorithm only needs the operation `(x << I::BITS) / y`
// where `x < y`. That is, a division `u256 / u128` where the quotient must
// not overflow `u128` would be sufficient for `f128`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would this be easier with u256? We have an implementation here

but only shifts and widening multiplication are supported currently.

Copy link
ContributorAuthor

@quaternicquaternicMay 1, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifu256 already implementedDiv, then the generic code here could just use that, so it would be easy, yes. But implementing that sounds like extra work just to achieve a suboptimal solution.

a divisionu2N / uN where the quotient must not overflowuN

For context, x86'sdiv-instruction works just like this. Take the double-wide dividend in a pair of registers, divide by a value in a third register, and replace the low and high halves of the dividend with the quotient and remainder respectively. If the quotient would overflow (which is exactly whenx.hi() >= y), signal divide error.

So that's the abstraction I'd like to use; something like

unsafefnunchecked_wide_div_rem<U:HInt>(U::D,U) ->(U,U);

But of course, that would be even more work to implement since it doesn't exist yet, and I don't think other arches have a native operation for it.

Another idea would be to get rid of the integer division altogether, and compute the reciprocal from the original floating point value. I expect that this could have better performance, but it needs more careful analysis.

tgross35 reacted with thumbs up emoji
@quaternic
Copy link
ContributorAuthor

It looks like CI was previously failing only because there were a fewdebug_assert_eq which include formatting code for the arguments. Switching those todebug_assert!(x == y) seems to have been sufficient.

@quaternic
Copy link
ContributorAuthor

icount::icount_bench_fmod_group::icount_bench_fmod logspace:setup_fmod()  Baselines:                      hardfloat|hardfloat  Instructions:                       74379|1019076              (-92.7013%) [-13.7011x]  L1 Hits:                            84569|1021276              (-91.7193%) [-12.0762x]  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]  RAM Hits:                              18|11                   (+63.6364%) [+1.63636x]  Total read+write:                   84593|1021288              (-91.7170%) [-12.0730x]  Estimated Cycles:                   85229|1021666              (-91.6578%) [-11.9873x]icount::icount_bench_fmodf128_group::icount_bench_fmodf128 logspace:setup_fmodf128()  Baselines:                      hardfloat|hardfloat  Instructions:                    30701786|30701786             (No change)  L1 Hits:                         30717674|30717679             (-0.00002%) [-1.00000x]  L2 Hits:                               10|5                    (+100.000%) [+2.00000x]  RAM Hits:                              37|37                   (No change)  Total read+write:                30717721|30717721             (No change)  Estimated Cycles:                30719019|30718999             (+0.00007%) [+1.00000x]icount::icount_bench_fmodf16_group::icount_bench_fmodf16 logspace:setup_fmodf16()  Baselines:                      hardfloat|hardfloat  Instructions:                       50481|50124                (+0.71223%) [+1.00712x]  L1 Hits:                            62816|56769                (+10.6519%) [+1.10652x]  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]  RAM Hits:                              23|15                   (+53.3333%) [+1.53333x]  Total read+write:                   62845|56785                (+10.6718%) [+1.10672x]  Estimated Cycles:                   63651|57299                (+11.0857%) [+1.11086x]icount::icount_bench_fmodf_group::icount_bench_fmodf logspace:setup_fmodf()  Baselines:                      hardfloat|hardfloat  Instructions:                       54988|151547               (-63.7155%) [-2.75600x]  L1 Hits:                            63383|153749               (-58.7750%) [-2.42571x]  L2 Hits:                                4|1                    (+300.000%) [+4.00000x]  RAM Hits:                              15|9                    (+66.6667%) [+1.66667x]  Total read+write:                   63402|153759               (-58.7653%) [-2.42514x]  Estimated Cycles:                   63928|154069               (-58.5069%) [-2.41004x]

If I understand correctly, the inputs for these is a sampling withlog(|x|) andlog(|y|) each independently evenly spaced, so it should give a reasonable estimate of overall performance for the general case.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@tgross35tgross35tgross35 left review comments

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@quaternic@tgross35

[8]ページ先頭

©2009-2025 Movatter.jp