Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

<random>: Implement Lemire's fast integer generation#3012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

@MattStephanson
Copy link
Contributor

@MattStephansonMattStephanson commentedAug 9, 2022
edited by StephanTLavavej
Loading

Implements@lemire's "Fast Random Integer Generation in an Interval",https://dl.acm.org/doi/10.1145/3230636 andhttps://arxiv.org/abs/1805.10941.Fixes#178.

I'm not happy with the x86 or LCG performance, but I've been tinkering with it for weeks and haven't been able to improve it further. I'm using a Surface Pro 8, i5-1135G7. It's plugged in and set to "Best Performance", but I'm otherwise not very knowledgeable about how to run good microbenchmarks. If anyone has any thoughts, I'd love to hear them.

Benchmark code
#include<random>#include<benchmark/benchmark.h>/// Test URBGs alonestaticvoidBM_mt19937(benchmark::State& state) {    std::mt19937 gen;for (auto _ : state) {benchmark::DoNotOptimize(gen());    }}BENCHMARK(BM_mt19937);staticvoidBM_mt19937_64(benchmark::State& state) {    std::mt19937_64 gen;for (auto _ : state) {benchmark::DoNotOptimize(gen());    }}BENCHMARK(BM_mt19937_64);staticvoidBM_lcg(benchmark::State& state) {    std::minstd_rand gen;for (auto _ : state) {benchmark::DoNotOptimize(gen());    }}BENCHMARK(BM_lcg);uint32_tGetMax() {    std::random_device gen;    std::uniform_int_distribution<uint32_t>dist(10'000'000,20'000'000);returndist(gen);}staticconstuint32_t max = GetMax();// random divisor to prevent strength reduction/// Test mt19937staticvoidBM_raw_mt19937_old(benchmark::State& state) {    std::mt19937 gen;    std::_Rng_from_urng<uint32_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_mt19937_old);staticvoidBM_raw_mt19937_new(benchmark::State& state) {    std::mt19937 gen;    std::_Rng_from_urng_v2<uint32_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_mt19937_new);/// Test mt19937_64staticvoidBM_raw_mt19937_64_old(benchmark::State& state) {    std::mt19937_64 gen;    std::_Rng_from_urng<uint64_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_mt19937_64_old);staticvoidBM_raw_mt19937_64_new(benchmark::State& state) {    std::mt19937_64 gen;    std::_Rng_from_urng_v2<uint64_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_mt19937_64_new);/// Test minstd_randstaticvoidBM_raw_lcg_old(benchmark::State& state) {    std::minstd_rand gen;    std::_Rng_from_urng<uint32_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_lcg_old);staticvoidBM_raw_lcg_new(benchmark::State& state) {    std::minstd_rand gen;    std::_Rng_from_urng_v2<uint32_t,decltype(gen)>rng(gen);for (auto _ : state) {benchmark::DoNotOptimize(rng(max));    }}BENCHMARK(BM_raw_lcg_new);BENCHMARK_MAIN();
Benchmark results

x86

2022-08-08T19:53:31-07:00Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exeRun on (8 X 2424.25 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x4)  L1 Instruction 32 KiB (x4)  L2 Unified 1280 KiB (x4)  L3 Unified 8192 KiB (x1)----------------------------------------------------------------Benchmark                      Time             CPU   Iterations----------------------------------------------------------------BM_mt19937                  4.38 ns         4.39 ns    160000000BM_mt19937_64               9.79 ns         9.77 ns     64000000BM_lcg                      9.39 ns         8.54 ns     64000000BM_raw_mt19937_old          7.75 ns         7.67 ns    112000000BM_raw_mt19937_new          5.18 ns         5.16 ns    100000000BM_raw_mt19937_64_old       21.2 ns         21.0 ns     32000000BM_raw_mt19937_64_new       19.0 ns         18.8 ns     37333333BM_raw_lcg_old              25.9 ns         26.1 ns     26352941BM_raw_lcg_new              28.2 ns         28.3 ns     24888889

x64

2022-08-08T19:54:41-07:00Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exeRun on (8 X 2444.76 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x4)  L1 Instruction 32 KiB (x4)  L2 Unified 1280 KiB (x4)  L3 Unified 8192 KiB (x1)----------------------------------------------------------------Benchmark                      Time             CPU   Iterations----------------------------------------------------------------BM_mt19937                  3.77 ns         3.75 ns    179200000BM_mt19937_64               3.87 ns         3.84 ns    179200000BM_lcg                      3.96 ns         4.01 ns    179200000BM_raw_mt19937_old          5.70 ns         5.72 ns    112000000BM_raw_mt19937_new          4.20 ns         4.24 ns    165925926BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333BM_raw_lcg_old              15.2 ns         15.4 ns     49777778BM_raw_lcg_new              17.3 ns         17.3 ns     40727273

cpplearner, frederick-vs-ja, and celonymire reacted with eyes emoji
@MattStephansonMattStephanson changed the titleLemire's fast integer generation<random>: Implement Lemire's fast integer generationAug 9, 2022
@CaseyCarterCaseyCarter added the performanceMust go faster labelAug 9, 2022
@lemire
Copy link

These results are interesting...

BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333

Copy link
Contributor

@frederick-vs-jafrederick-vs-ja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I suppose that this is a great step towardsDevCom-879048.

@statementreply
Copy link
Contributor

I added xoshiro256** and xoshiro128** for comparison, which are fast and have small states. Benchmark results onMattStephanson@0156b8d (reformatted):

AMD Ryzen 7 5700X, x86
2022-08-15T09:43:59+08:00Running benchmark_x86.exeRun on (16 X 3394 MHz CPU s)CPU Caches:  L1 Data 32 KiB (x8)  L1 Instruction 32 KiB (x8)  L2 Unified 512 KiB (x8)  L3 Unified 32768 KiB (x1)------------------------------------------------------------------Benchmark                        Time             CPU   Iterations------------------------------------------------------------------BM_mt19937                    3.42 ns         3.38 ns    203636364BM_mt19937_64                 9.46 ns         9.59 ns     89600000BM_lcg                        5.52 ns         5.58 ns    112000000BM_xoshiro256xx               5.85 ns         5.86 ns    112000000BM_xoshiro128xx               1.35 ns         1.35 ns    497777778BM_raw_mt19937_old            4.74 ns         4.71 ns    149333333BM_raw_mt19937_new            3.67 ns         3.57 ns    179200000BM_raw_mt19937_64_old        13.6  ns        13.8  ns     49777778BM_raw_mt19937_64_new        17.8  ns        18.0  ns     40727273BM_raw_lcg_old               17.5  ns        17.6  ns     40727273BM_raw_lcg_new               20.5  ns        20.4  ns     34461538BM_raw_xoshiro256xx_old      11.4  ns        11.7  ns     64000000BM_raw_xoshiro256xx_new      13.8  ns        14.0  ns     56000000BM_raw_xoshiro128xx_old       3.88 ns         3.92 ns    179200000BM_raw_xoshiro128xx_new       2.29 ns         2.29 ns    320000000
AMD Ryzen 7 5700X, x64
2022-08-15T09:41:08+08:00Running benchmark_x64Run on (16 X 3394 MHz CPU s)CPU Caches:  L1 Data 32 KiB (x8)  L1 Instruction 32 KiB (x8)  L2 Unified 512 KiB (x8)  L3 Unified 32768 KiB (x1)------------------------------------------------------------------Benchmark                        Time             CPU   Iterations------------------------------------------------------------------BM_mt19937                    2.58 ns         2.62 ns    280000000BM_mt19937_64                 2.86 ns         2.83 ns    248888889BM_lcg                        3.23 ns         3.22 ns    213333333BM_xoshiro256xx               1.29 ns         1.28 ns    560000000BM_xoshiro128xx               1.08 ns         1.07 ns    640000000BM_raw_mt19937_old            4.99 ns         5.00 ns    100000000BM_raw_mt19937_new            3.90 ns         3.92 ns    179200000BM_raw_mt19937_64_old         5.56 ns         5.58 ns    112000000BM_raw_mt19937_64_new         4.12 ns         4.17 ns    172307692BM_raw_lcg_old               11.4  ns        11.2  ns     56000000BM_raw_lcg_new               11.8  ns        11.7  ns     56000000BM_raw_xoshiro256xx_old       4.53 ns         4.55 ns    154482759BM_raw_xoshiro256xx_new       1.73 ns         1.73 ns    407272727BM_raw_xoshiro128xx_old       3.88 ns         3.84 ns    179200000BM_raw_xoshiro128xx_new       1.51 ns         1.50 ns    407272727
Intel Core i5-8400, x86
2022-08-13T21:48:46+08:00Running benchmark_x86Run on (6 X 2808 MHz CPU s)CPU Caches:  L1 Data 32 KiB (x6)  L1 Instruction 32 KiB (x6)  L2 Unified 256 KiB (x6)  L3 Unified 9216 KiB (x1)------------------------------------------------------------------Benchmark                        Time             CPU   Iterations------------------------------------------------------------------BM_mt19937                    4.61 ns         4.55 ns    154482759BM_mt19937_64                11.8  ns        11.7  ns     56000000BM_lcg                        8.50 ns         8.58 ns     74666667BM_xoshiro256xx               6.01 ns         6.00 ns    112000000BM_xoshiro128xx               2.12 ns         2.13 ns    344615385BM_raw_mt19937_old           10.4  ns        10.3  ns     74666667BM_raw_mt19937_new            5.13 ns         5.02 ns    112000000BM_raw_mt19937_64_old        25.8  ns        25.5  ns     26352941BM_raw_mt19937_64_new        21.6  ns        21.5  ns     32000000BM_raw_lcg_old               31.8  ns        30.0  ns     21333333BM_raw_lcg_new               34.0  ns        33.7  ns     21333333BM_raw_xoshiro256xx_old      21.4  ns        21.5  ns     32000000BM_raw_xoshiro256xx_new      18.5  ns        18.8  ns     40727273BM_raw_xoshiro128xx_old       7.99 ns         8.02 ns     89600000BM_raw_xoshiro128xx_new       3.65 ns         3.69 ns    194782609
Intel Core i5-8400, x64
2022-08-13T21:49:07+08:00Running benchmark_x64Run on (6 X 2808 MHz CPU s)CPU Caches:  L1 Data 32 KiB (x6)  L1 Instruction 32 KiB (x6)  L2 Unified 256 KiB (x6)  L3 Unified 9216 KiB (x1)------------------------------------------------------------------Benchmark                        Time             CPU   Iterations------------------------------------------------------------------BM_mt19937                    3.50 ns         3.53 ns    203636364BM_mt19937_64                 3.53 ns         3.53 ns    194782609BM_lcg                        4.03 ns         3.99 ns    172307692BM_xoshiro256xx               1.60 ns         1.53 ns    407272727BM_xoshiro128xx               1.46 ns         1.46 ns    448000000BM_raw_mt19937_old           11.0  ns        11.0  ns     64000000BM_raw_mt19937_new            4.90 ns         4.87 ns    144516129BM_raw_mt19937_64_old        24.9  ns        25.1  ns     28000000BM_raw_mt19937_64_new         5.15 ns         5.16 ns    100000000BM_raw_lcg_old               20.1  ns        19.9  ns     34461538BM_raw_lcg_new               15.2  ns        15.0  ns     44800000BM_raw_xoshiro256xx_old      21.2  ns        21.3  ns     34461538BM_raw_xoshiro256xx_new       2.54 ns         2.57 ns    280000000BM_raw_xoshiro128xx_old       7.75 ns         7.67 ns     89600000BM_raw_xoshiro128xx_new       2.00 ns         2.01 ns    373333333
MattStephanson and frederick-vs-ja reacted with heart emoji

@frederick-vs-ja

This comment was marked as resolved.

Copy link
Member

@StephanTLavavejStephanTLavavej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks, this approach looks good to me! (I skipped the int128 changes as I assume we'll want to land@frederick-vs-ja's#3036 first.)

The benchmark results look convincing enough to me, especially@statementreply's cases. 😻

@MattStephanson
Copy link
ContributorAuthor

Some of@StephanTLavavej's feedback affects codegen, so here are my updated benchmark results. I think they're pretty similar to what I originally posted.

x86

2022-08-18T20:20:48-07:00Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exeRun on (8 X 2433.76 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x4)  L1 Instruction 32 KiB (x4)  L2 Unified 1280 KiB (x4)  L3 Unified 8192 KiB (x1)----------------------------------------------------------------Benchmark                      Time             CPU   Iterations----------------------------------------------------------------BM_mt19937                  3.80 ns         3.85 ns    186666667BM_mt19937_64               9.52 ns         9.42 ns     74666667BM_lcg                      8.44 ns         8.54 ns     89600000BM_raw_mt19937_old          5.90 ns         5.86 ns    112000000BM_raw_mt19937_new          4.45 ns         4.52 ns    165925926BM_raw_mt19937_64_old       17.7 ns         17.3 ns     40727273BM_raw_mt19937_64_new       17.1 ns         17.3 ns     40727273BM_raw_lcg_old              26.4 ns         26.7 ns     26352941BM_raw_lcg_new              27.9 ns         27.9 ns     26352941

x64

2022-08-18T20:22:16-07:00Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exeRun on (8 X 2437.08 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x4)  L1 Instruction 32 KiB (x4)  L2 Unified 1280 KiB (x4)  L3 Unified 8192 KiB (x1)----------------------------------------------------------------Benchmark                      Time             CPU   Iterations----------------------------------------------------------------BM_mt19937                  3.76 ns         3.77 ns    186666667BM_mt19937_64               3.82 ns         3.85 ns    186666667BM_lcg                      3.91 ns         3.92 ns    179200000BM_raw_mt19937_old          5.60 ns         5.58 ns    112000000BM_raw_mt19937_new          4.18 ns         4.17 ns    172307692BM_raw_mt19937_64_old       8.35 ns         8.37 ns     89600000BM_raw_mt19937_64_new       4.55 ns         4.50 ns    149333333BM_raw_lcg_old              15.0 ns         15.0 ns     44800000BM_raw_lcg_new              17.0 ns         16.9 ns     40727273
StephanTLavavej and frederick-vs-ja reacted with heart emoji

@MattStephansonMattStephanson marked this pull request as ready for reviewAugust 19, 2022 05:29
@MattStephansonMattStephanson requested a review froma team as acode ownerAugust 19, 2022 05:29
@strega-nil-msstrega-nil-ms added the blockedSomething is preventing work on this labelAug 23, 2022
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavejStephanTLavavej removed the blockedSomething is preventing work on this labelSep 1, 2022
@StephanTLavavej

This comment was marked as resolved.

@MattStephanson

This comment was marked as resolved.

@StephanTLavavejStephanTLavavej self-assigned thisSep 3, 2022
@StephanTLavavej

This comment was marked as outdated.

@StephanTLavavejStephanTLavavej removed their assignmentSep 3, 2022
@StephanTLavavej
Copy link
Member

StephanTLavavej commentedSep 3, 2022
edited
Loading

This is a rich source of compiler bugs 😹, caused by dragging<__msvc_int128.hpp> into many more TUs by default. I've reported:

  • Can work around by disabling affected tests:
    • VSO-1611408 "Standard Library Modules: Compiler assertion failed:inserted, file template.cpp, line 10275"
    • VSO-1611411 "Standard Library Modules: Compiler assertion failed:IsInClassDefn() && (region->owner == SU_TAG), file reader.cpp, line 2592"
  • Fixed byMSVC-PR-422542:
    • VSO-1611417 "__msvc_int128.hpp's operators cause compiler assertion:(count == 1) && (pFunction != nullptr), file esumem.c, line 8752"
    • VSO-1612442 "__msvc_int128.hpp operator overloads cause bogus error C3861:'/': identifier not found"

No changes requested for this PR, but I'm going to pull it out of the current merge batch (where I speculatively/greedily tried to add it).

@StephanTLavavej

This comment was marked as resolved.

@frederick-vs-ja

This comment was marked as resolved.

@StephanTLavavej
Copy link
Member

@JonCavesMSFT just fixed the non-modules bugs (yay! 😻) and the affected modules tests can be skipped. (I don't believe that Standard Library Header Units or the named Standard Library Modules will be affected, as we're already dragging<__msvc_int128.hpp> into them, so this is purely a concern for the old internal tests.) Therefore, nothing should be preventing this PR from being merged.

@strega-nil-ms
Copy link
Contributor

I've added the benchmark code to the repo; thanks!

@StephanTLavavej
Copy link
Member

@strega-nil-ms Was the force-push here unintentional?

@StephanTLavavej
Copy link
Member

I see, a damaged commit was amended, but the history wasn't further rewritten.

@StephanTLavavej
Copy link
Member

@strega-nil-ms I pushed minor changes to the benchmark after validating that it still builds and runs properly. Thanks for adding it!

Copy link
Contributor

@strega-nil-msstrega-nil-ms left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks so much!!!
akko having a face like 'ahhhh thanks so much'

Copy link
Contributor

@barcharcrazbarcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This seems correct to me, as far as my understanding of goes. I think this is ready.

@StephanTLavavej
Copy link
Member

StephanTLavavej commentedSep 20, 2022
edited
Loading

I've pushed a merge withmain to resolve a trivial merge conflict -_Rng_from_urng_v2 was being added immediately above declarations where I addedextern "C++" for modules.

✅ I double-checked that this PR should have no impact on modules.

lemire and TheFanatr reacted with rocket emoji

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavejStephanTLavavej merged commit8334fca intomicrosoft:mainSep 22, 2022
@StephanTLavavej
Copy link
Member

Thank you for improving the performance of<random>'s most popular distribution! Also thanks to@lemire for inventing the algorithm,@chris0x44 for the original PR, and@strega-nil-ms for adding your benchmark to the growing collection! 😻 🎉 🚀

This will be available in either VS 2022 17.5 Preview 1 or Preview 2 (depending on internal merge logistics; the Changelog will record our current expectation).

lemire and TheFanatr reacted with thumbs up emoji

CaseyCarter pushed a commit to CaseyCarter/STL that referenced this pull requestOct 6, 2022
Co-authored-by: Nicole Mazzuca <mazzucan@outlook.com>Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>
@MattStephansonMattStephanson deleted the gh_178_uniform_int branchMay 10, 2023 01:33
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@StephanTLavavejStephanTLavavejStephanTLavavej approved these changes

+3 more reviewers

@frederick-vs-jafrederick-vs-jafrederick-vs-ja left review comments

@barcharcrazbarcharcrazbarcharcraz approved these changes

@strega-nil-msstrega-nil-msstrega-nil-ms approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

@StephanTLavavejStephanTLavavej

Labels

performanceMust go faster

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

<random>: Optimizeuniform_int_distribution with multiply-shift

9 participants

@MattStephanson@lemire@statementreply@frederick-vs-ja@StephanTLavavej@strega-nil-ms@barcharcraz@CaseyCarter@strega-nil

[8]ページ先頭

©2009-2025 Movatter.jp