Vectorize search_n for small values of n #5352


Merged
StephanTLavavej merged 32 commits into microsoft:main from AlexGuteniev:search_n
Apr 22, 2025

AlexGuteniev (Contributor) commented on Mar 22, 2025 (edited by StephanTLavavej):

⚙️ The optimization

Like I mentioned in #5346, both std::search_n and ranges::search_n step by n elements and avoid going back on a good input (one with few potential matches), so for large n values vectorization would not be an improvement.
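To illustrate the "step by n" behavior, here is a minimal scalar sketch. It is not the header implementation: `search_n_stepping` is a hypothetical name, it works on a `std::vector<int>` rather than on iterators, and unlike the real code it re-reads a few matching elements after a mismatch. The point is only that a mismatch at the last element of the candidate window lets the search jump a full n elements ahead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first run of n consecutive elements equal to value,
// or data.size() if there is no such run. Hypothetical helper for illustration.
std::size_t search_n_stepping(const std::vector<int>& data, std::size_t n, int value) {
    assert(n >= 1);
    const std::size_t size = data.size();
    std::size_t start = 0; // invariant: no matching run begins before start
    while (size - start >= n) {
        // Walk backward from the last element of the window [start, start + n).
        std::size_t probe = start + n;
        while (probe > start && data[probe - 1] == value) {
            --probe;
        }
        if (probe == start) {
            return start; // the whole window matched
        }
        // data[probe - 1] is a mismatch, so no run can contain it.
        // If the last element of the window mismatched, this advances by n.
        start = probe;
    }
    return size;
}
```

On an input with few matches, almost every window fails on its last element, so only about one element in n is inspected.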

Still, for small n, such that the vector register holds more than n elements and the vector step is therefore bigger, it is possible to vectorize in a way that is faster even for an input with few matches. With more matches, such vectorization has a bigger advantage, as it does not need to go back.

The approach is to compare elements, get a bit mask, and look for a contiguous run of ones of the proper length. @Alcaro suggested:

you can do things like

match &= match>>1
match &= match>>2
match &= match>>3

to find 7 consecutive 1s, but that probably does something ruinous to instruction parallelism and may cost more than it saves

It turns out this is efficient enough for AVX2 with values of n up to half the AVX register size in elements. Although the cost of ruined parallelism does indeed seem high, I cannot find anything faster.

The shift values are computed based on n. To save one variable (a general-purpose register), we rely on n=1 being handled separately and assume that at least one shift happens.
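As a rough illustration of how the shifts can be derived from n, here is a scalar sketch. `run_starts` is a hypothetical name, and the loop below is one possible way to pick the shifts, not necessarily the one used in the PR. For n=7 it produces the shifts 1, 2, 3 from the quote above.

```cpp
#include <cassert>
#include <cstdint>

// Returns a mask in which bit i is set only if bits i .. i + n - 1 of the
// input mask are all set. Hypothetical helper; n = 1 is assumed to be handled
// elsewhere, so at least one shift always happens.
std::uint64_t run_starts(std::uint64_t mask, int n) {
    assert(n >= 2 && n <= 64);
    int covered = 1; // each set bit currently certifies a run of this many ones
    while (covered < n) {
        // A shift of at most `covered` keeps the two certified runs adjacent
        // or overlapping, so the combined run has no holes.
        const int shift = covered < n - covered ? covered : n - covered;
        mask &= mask >> shift;
        covered += shift;
    }
    return mask;
}
```

The lowest set bit of the result is the position of the first run, and a zero result means there is no run of n ones in the mask.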

To deal with matches on a vector register boundary, the bitmask is concatenated with the previous one. The AVX bitmask is 32 bits for the 32 bytes of an AVX value; doubled, it is 64 bits, which still fits an x64 register perfectly. The alternative to concatenation would be handling the boundary case with lzcnt/tzcnt, but this turned out not to be faster.
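The concatenation idea can be sketched without intrinsics by working on precomputed 32-bit masks, one per 32-element block. Everything here is an assumption made for illustration: the names `first_run_across_blocks` and `npos_index`, the bit order (bit j of a block mask stands for element j of that block), and the placement of the previous mask in the low half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t npos_index = static_cast<std::size_t>(-1);

// block_masks[k] has bit j set when element 32 * k + j matched.
// Returns the index of the first run of n matches, or npos_index.
std::size_t first_run_across_blocks(const std::vector<std::uint32_t>& block_masks, int n) {
    assert(n >= 2 && n <= 16);
    std::uint64_t prev = 0; // mask of the previous block, zero before the first
    for (std::size_t k = 0; k < block_masks.size(); ++k) {
        // Previous block in the low half, current block in the high half.
        std::uint64_t m = (std::uint64_t{block_masks[k]} << 32) | prev;
        // Shift-and reduction: leave only bits that start a run of n ones.
        for (int covered = 1; covered < n;) {
            const int shift = covered < n - covered ? covered : n - covered;
            m &= m >> shift;
            covered += shift;
        }
        if (m != 0) {
            std::size_t bit = 0;
            while ((m & 1) == 0) {
                m >>= 1;
                ++bit;
            }
            // Bit 32 of the concatenated mask is element 32 * k.
            return 32 * k + bit - 32;
        }
        prev = block_masks[k];
    }
    return npos_index;
}
```

A run that lies entirely in the previous block was already reported one iteration earlier, so any hit in the low half here necessarily crosses the boundary.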

The fallback is used for tails and for n values that are too large. For tails, it uses lzcnt on the inverted carry value to get a smooth transition from a potential partial match in the vectors to the scalar part. The fallback recreates ranges::search_n from <algorithm>, with a slight variation.
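One way to read the tail transition is sketched below, with a portable loop standing in for lzcnt. This is an interpretation, not the PR's code: `leading_ones` and `search_tail` are hypothetical names, and the bit order (the highest bit of the mask is the last element of the last vector block) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of consecutive set bits at the top of the mask. With lzcnt this
// would be the leading zero count of the inverted mask.
int leading_ones(std::uint32_t mask) {
    int count = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (mask & bit) != 0; bit >>= 1) {
        ++count;
    }
    return count;
}

// Scalar search over the tail that starts with the partial run carried over
// from the last vector block. Returns the run start relative to the tail
// (negative when the run began in the vector part), or tail.size() if none.
std::ptrdiff_t search_tail(const std::vector<unsigned char>& tail, std::ptrdiff_t n,
                           unsigned char value, std::uint32_t last_mask) {
    std::ptrdiff_t run = leading_ones(last_mask);
    assert(run < n); // a complete run would have been found by the vector part
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(tail.size()); ++i) {
        if (tail[static_cast<std::size_t>(i)] == value) {
            if (++run == n) {
                return i + 1 - n;
            }
        } else {
            run = 0;
        }
    }
    return static_cast<std::ptrdiff_t>(tail.size());
}
```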

🥔 Down-level architectures support

An SSE4.2 version is implementable both by backporting the current approach to SSE and by using pcmpestri. I'd expect either to be an advantage for n values up to half the SSE register size. I just feel it is not worth trying.

The x86 version works the same way as x64. However, unlike many other vectorized algorithms, this one relies a lot on general-purpose 64-bit integer ops. To mitigate the impact, __ull_rshift is used instead of the plain shift. Using this intrinsic doesn't affect 64-bit code, but it makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are of int type to match the intrinsic's parameter type.

Still, the efficiency on x86 is questionable (see the benchmark results below). Apart from shifts taking multiple instructions, this is apparently due to a shortage of general-purpose registers. The compiler isn't helpful here either; some register spills look superfluous.

For 32-bit and 64-bit elements, it is possible to use the floating-point bit mask instead of the integer bit mask, like in #4987/#5092. This would save bit width. But apart from the mysterious "bypass delay" (a potential penalty for mixing integer and floating-point instructions), it would also make the bit magic more complicated and more dependent on element width, and it still wouldn't reduce the bit width for 8-bit and 16-bit elements, so this doesn't seem to be worth doing.

We could just skip x86. But we have no precedent of vectorizing for x64 without vectorizing for x86, so I didn't want to introduce one.

1️⃣ Special n=1 case

We need to handle this case as plain find vectorization. find vectorization is more efficient than this one, plus the assumption that the shift happens at least once saves a variable/register.
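For illustration only, routing n=1 to find ahead of the general path could look like the sketch below. `search_n_with_find_shortcut` is a hypothetical name, and std::search_n stands in for the general path; the actual header and separately compiled code are organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first run of n elements equal to value,
// or data.size() if there is none.
std::size_t search_n_with_find_shortcut(const std::vector<int>& data, std::ptrdiff_t n, int value) {
    assert(n >= 0); // negative counts are treated as a caller bug in this sketch
    if (n == 0) {
        return 0; // an empty run is found at the beginning
    }
    const auto it = (n == 1)
        ? std::find(data.begin(), data.end(), value)         // plain find path
        : std::search_n(data.begin(), data.end(), n, value); // general path, n >= 2
    return static_cast<std::size_t>(it - data.begin());
}
```

With this shape, the general path may assume n >= 2, which is the assumption that lets the vectorized code always perform at least one shift.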

The question is where we should handle this:

  1. Only in separately compiled code
  2. Only in headers
  3. Both in headers and in separately compiled code

The latter two are indistinguishable in practice, so the real question is whether we should:

  1. Handle it in separately compiled code, effectively reverting the "Use find for search_n when n=1" (#5346) optimization
  2. Keep handling it in header

With removal of the n=1 case from headers we get:

  • Better throughput
  • Simpler header implementation

With keeping the n=1 case in headers we get:

  • Some non-vectorization optimization for non-vector element types (I believe it is noticeable, but not a severalfold difference)
  • Some auto-vectorization from Clang, and probably from MSVC in the future (Clang recognizes the find pattern)
  • memchr for the corresponding types in the disabled-vectorization mode

✅ Test coverage

To cover the variety of possibilities, the randomized test should try different input lengths, different n, and different actual match lengths (including matches that are too long or too short, and different gaps between matches). This makes for a long run time, so it deserves a dedicated test.

The test coverage is useful not only for vectorization; it also compensates for the missing non-vectorization coverage requested in #933.

This PR still doesn't fully address #933 as requested, because:
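The shape of such a randomized test can be sketched as follows. This is not the test added by the PR: the helper names, the length and n ranges, and the run and gap distributions are all assumptions. The idea is to compare std::search_n against a deliberately naive reference over many generated inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

constexpr unsigned char kValue = 1; // the element being searched for
constexpr unsigned char kOther = 0; // any element that does not match

// Straightforward reference: try every start position.
std::size_t naive_search_n(const std::vector<unsigned char>& v, std::size_t n, unsigned char value) {
    for (std::size_t start = 0; start + n <= v.size(); ++start) {
        std::size_t len = 0;
        while (len < n && v[start + len] == value) {
            ++len;
        }
        if (len == n) {
            return start;
        }
    }
    return v.size();
}

// Runs of kValue with lengths from 0 to 2n (too short, exact, too long),
// separated by gaps of 1 to 4 mismatching elements.
std::vector<unsigned char> make_input(std::mt19937& gen, std::size_t length, std::size_t n) {
    assert(n >= 1);
    std::uniform_int_distribution<std::size_t> run_len(0, 2 * n);
    std::uniform_int_distribution<std::size_t> gap_len(1, 4);
    std::vector<unsigned char> v;
    while (v.size() < length) {
        v.insert(v.end(), run_len(gen), kValue);
        v.insert(v.end(), gap_len(gen), kOther);
    }
    v.resize(length);
    return v;
}

// Returns the number of inputs on which std::search_n and the reference disagree.
int count_mismatches(unsigned seed) {
    std::mt19937 gen(seed);
    int mismatches = 0;
    for (std::size_t length = 0; length < 200; ++length) {
        for (std::size_t n = 1; n <= 20; ++n) {
            const auto v = make_input(gen, length, n);
            const auto it = std::search_n(v.begin(), v.end(), n, kValue);
            const auto found = static_cast<std::size_t>(it - v.begin());
            if (found != naive_search_n(v, n, kValue)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Taking the seed as a parameter, as `count_mismatches` does here, is one simple way to make a failing run reproducible.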

  • It does not cover the forward-only iterator branches
  • It does not have features like nice printing of error cases or accepting a seed parameter

I'm not sure how much these features are required, though. If they are required, further work to complete#933 would certainly need a different PR.

🏁 Benchmarks

In addition to the TwoZones case inherited from #5346, it has DenseSmallSequences.

These two are close to the normal case and the worst case, respectively.

TwoZones (Zones in the table below) has half of the range filled with the mismatch character and half of the range with the match character. So the search should quickly proceed to the match part and then check the first match, which is successful.

DenseSmallSequences (Dense in the table below) has too-short matches of random width from 0 to n-1, each interrupted by a single mismatch character.

The vectorization improvement is greater for DenseSmallSequences, but we should probably care about TwoZones somewhat more. If the worst case is a priority, we can double the vectorization threshold.
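The two input patterns can be sketched as generators like the ones below. The function names, the character choices, and the seeding are assumptions made for illustration; the benchmark source in the PR may build its inputs differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

constexpr unsigned char kMatch = '1';    // element that search_n looks for
constexpr unsigned char kMismatch = '0'; // any other element

// First half mismatches, second half matches.
std::vector<unsigned char> two_zones(std::size_t size) {
    std::vector<unsigned char> v(size, kMismatch);
    std::fill(v.begin() + static_cast<std::ptrdiff_t>(size / 2), v.end(), kMatch);
    return v;
}

// Runs of matches of random width 0 .. n - 1, each followed by one mismatch,
// so no run is ever long enough to satisfy a search for n.
std::vector<unsigned char> dense_small_sequences(std::size_t size, std::size_t n, unsigned seed) {
    assert(n >= 1);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> width(0, n - 1);
    std::vector<unsigned char> v;
    v.reserve(size + n);
    while (v.size() < size) {
        v.insert(v.end(), width(gen), kMatch);
        v.push_back(kMismatch);
    }
    v.resize(size);
    return v;
}
```

With these inputs, a search for n matches succeeds at the midpoint of the TwoZones input and fails on the DenseSmallSequences input only after examining many near misses.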

⏱️ Benchmark results

| Benchmark | V. alg? | x64 Before | x64 After | x64 🏎️ | x86 Before | x86 After | x86 🏎️ |
|---|---|---|---|---|---|---|---|
| u8/Std/Zones/3000/40 | no | 45.4 ns | 46.7 ns | 0.97 | 42.4 ns | 66.8 ns | 0.63 |
| u8/Std/Zones/3000/18 | no | 63.0 ns | 61.8 ns | 1.02 | 62.1 ns | 83.2 ns | 0.75 |
| u8/Std/Zones/3000/16 | yes | 66.0 ns | 69.3 ns | 0.95 | 77.8 ns | 128 ns | 0.61 |
| u8/Std/Zones/3000/14 | yes | 68.7 ns | 69.0 ns | 1.00 | 71.7 ns | 129 ns | 0.56 |
| u8/Std/Zones/3000/10 | yes | 85.8 ns | 72.4 ns | 1.19 | 93.9 ns | 130 ns | 0.72 |
| u8/Std/Zones/3000/8 | yes | 103 ns | 69.4 ns | 1.48 | 113 ns | 128 ns | 0.88 |
| u8/Std/Zones/3000/5 | yes | 157 ns | 74.2 ns | 2.12 | 171 ns | 128 ns | 1.34 |
| u8/Std/Zones/3000/4 | yes | 189 ns | 72.4 ns | 2.61 | 210 ns | 125 ns | 1.68 |
| u8/Std/Zones/3000/3 | yes | 250 ns | 71.6 ns | 3.49 | 272 ns | 132 ns | 2.06 |
| u8/Std/Zones/3000/2 | yes | 368 ns | 72.5 ns | 5.08 | 402 ns | 130 ns | 3.09 |
| u8/Std/Zones/3000/1 | find | 18.0 ns | 18.2 ns | 0.99 | 18.3 ns | 21.8 ns | 0.84 |
| u8/Rng/Zones/3000/40 | no | 47.7 ns | 45.8 ns | 1.04 | 52.4 ns | 66.9 ns | 0.78 |
| u8/Rng/Zones/3000/18 | no | 78.2 ns | 60.8 ns | 1.29 | 79.7 ns | 83.7 ns | 0.95 |
| u8/Rng/Zones/3000/16 | yes | 84.9 ns | 71.1 ns | 1.19 | 85.6 ns | 129 ns | 0.66 |
| u8/Rng/Zones/3000/14 | yes | 90.3 ns | 71.4 ns | 1.26 | 93.7 ns | 128 ns | 0.73 |
| u8/Rng/Zones/3000/10 | yes | 118 ns | 72.3 ns | 1.63 | 118 ns | 128 ns | 0.92 |
| u8/Rng/Zones/3000/8 | yes | 141 ns | 71.7 ns | 1.97 | 144 ns | 128 ns | 1.13 |
| u8/Rng/Zones/3000/5 | yes | 215 ns | 75.5 ns | 2.85 | 212 ns | 125 ns | 1.70 |
| u8/Rng/Zones/3000/4 | yes | 303 ns | 72.9 ns | 4.16 | 265 ns | 129 ns | 2.05 |
| u8/Rng/Zones/3000/3 | yes | 346 ns | 73.8 ns | 4.69 | 344 ns | 130 ns | 2.65 |
| u8/Rng/Zones/3000/2 | yes | 509 ns | 74.8 ns | 6.80 | 506 ns | 129 ns | 3.92 |
| u8/Rng/Zones/3000/1 | find | 18.2 ns | 18.4 ns | 0.99 | 18.5 ns | 18.7 ns | 0.99 |
| u8/Std/Dense/3000/40 | no | 818 ns | 381 ns | 2.15 | 823 ns | 654 ns | 1.26 |
| u8/Std/Dense/3000/18 | no | 1006 ns | 501 ns | 2.01 | 1036 ns | 774 ns | 1.34 |
| u8/Std/Dense/3000/16 | yes | 985 ns | 135 ns | 7.30 | 1022 ns | 236 ns | 4.33 |
| u8/Std/Dense/3000/14 | yes | 987 ns | 136 ns | 7.26 | 1004 ns | 244 ns | 4.11 |
| u8/Std/Dense/3000/10 | yes | 1071 ns | 144 ns | 7.44 | 1094 ns | 245 ns | 4.47 |
| u8/Std/Dense/3000/8 | yes | 1140 ns | 138 ns | 8.26 | 1239 ns | 246 ns | 5.04 |
| u8/Std/Dense/3000/5 | yes | 1301 ns | 147 ns | 8.85 | 1356 ns | 279 ns | 4.86 |
| u8/Std/Dense/3000/4 | yes | 1303 ns | 147 ns | 8.86 | 1418 ns | 243 ns | 5.84 |
| u8/Std/Dense/3000/3 | yes | 1300 ns | 147 ns | 8.84 | 1460 ns | 248 ns | 5.89 |
| u8/Std/Dense/3000/2 | yes | 1191 ns | 149 ns | 7.99 | 1363 ns | 244 ns | 5.59 |
| u8/Std/Dense/3000/1 | find | 49.4 ns | 47.0 ns | 1.05 | 48.3 ns | 49.6 ns | 0.97 |
| u8/Rng/Dense/3000/40 | no | 830 ns | 382 ns | 2.17 | 584 ns | 653 ns | 0.89 |
| u8/Rng/Dense/3000/18 | no | 813 ns | 506 ns | 1.61 | 622 ns | 768 ns | 0.81 |
| u8/Rng/Dense/3000/16 | yes | 853 ns | 143 ns | 5.97 | 660 ns | 237 ns | 2.78 |
| u8/Rng/Dense/3000/14 | yes | 843 ns | 137 ns | 6.15 | 665 ns | 241 ns | 2.76 |
| u8/Rng/Dense/3000/10 | yes | 875 ns | 138 ns | 6.34 | 707 ns | 243 ns | 2.91 |
| u8/Rng/Dense/3000/8 | yes | 936 ns | 139 ns | 6.73 | 771 ns | 240 ns | 3.21 |
| u8/Rng/Dense/3000/5 | yes | 1057 ns | 148 ns | 7.14 | 858 ns | 240 ns | 3.58 |
| u8/Rng/Dense/3000/4 | yes | 1155 ns | 148 ns | 7.80 | 876 ns | 248 ns | 3.53 |
| u8/Rng/Dense/3000/3 | yes | 1240 ns | 147 ns | 8.44 | 889 ns | 252 ns | 3.53 |
| u8/Rng/Dense/3000/2 | yes | 1096 ns | 149 ns | 7.36 | 1074 ns | 251 ns | 4.28 |
| u8/Rng/Dense/3000/1 | find | 51.6 ns | 49.4 ns | 1.04 | 48.9 ns | 48.6 ns | 1.01 |
| u16/Std/Zones/3000/40 | no | 41.2 ns | 50.2 ns | 0.82 | 46.1 ns | 55.3 ns | 0.83 |
| u16/Std/Zones/3000/18 | no | 66.3 ns | 69.2 ns | 0.96 | 68.1 ns | 76.4 ns | 0.89 |
| u16/Std/Zones/3000/16 | no | 71.0 ns | 75.8 ns | 0.94 | 75.3 ns | 83.0 ns | 0.91 |
| u16/Std/Zones/3000/14 | no | 77.3 ns | 83.5 ns | 0.93 | 79.9 ns | 92.0 ns | 0.87 |
| u16/Std/Zones/3000/10 | no | 97.1 ns | 105 ns | 0.92 | 103 ns | 116 ns | 0.89 |
| u16/Std/Zones/3000/8 | yes | 117 ns | 107 ns | 1.09 | 126 ns | 175 ns | 0.72 |
| u16/Std/Zones/3000/5 | yes | 166 ns | 107 ns | 1.55 | 195 ns | 174 ns | 1.12 |
| u16/Std/Zones/3000/4 | yes | 194 ns | 107 ns | 1.81 | 231 ns | 173 ns | 1.34 |
| u16/Std/Zones/3000/3 | yes | 270 ns | 117 ns | 2.31 | 309 ns | 177 ns | 1.75 |
| u16/Std/Zones/3000/2 | yes | 385 ns | 118 ns | 3.26 | 438 ns | 172 ns | 2.55 |
| u16/Std/Zones/3000/1 | find | 48.2 ns | 48.9 ns | 0.99 | 37.5 ns | 52.1 ns | 0.72 |
| u16/Rng/Zones/3000/40 | no | 49.1 ns | 49.7 ns | 0.99 | 50.1 ns | 55.0 ns | 0.91 |
| u16/Rng/Zones/3000/18 | no | 85.7 ns | 70.1 ns | 1.22 | 107 ns | 76.6 ns | 1.40 |
| u16/Rng/Zones/3000/16 | no | 95.8 ns | 81.1 ns | 1.18 | 117 ns | 83.9 ns | 1.39 |
| u16/Rng/Zones/3000/14 | no | 108 ns | 84.1 ns | 1.28 | 128 ns | 91.5 ns | 1.40 |
| u16/Rng/Zones/3000/10 | no | 156 ns | 103 ns | 1.51 | 168 ns | 115 ns | 1.46 |
| u16/Rng/Zones/3000/8 | yes | 185 ns | 108 ns | 1.71 | 202 ns | 172 ns | 1.17 |
| u16/Rng/Zones/3000/5 | yes | 304 ns | 108 ns | 2.81 | 313 ns | 171 ns | 1.83 |
| u16/Rng/Zones/3000/4 | yes | 377 ns | 106 ns | 3.56 | 394 ns | 174 ns | 2.26 |
| u16/Rng/Zones/3000/3 | yes | 500 ns | 118 ns | 4.24 | 518 ns | 172 ns | 3.01 |
| u16/Rng/Zones/3000/2 | yes | 734 ns | 118 ns | 6.22 | 747 ns | 173 ns | 4.32 |
| u16/Rng/Zones/3000/1 | find | 47.3 ns | 48.9 ns | 0.97 | 37.8 ns | 51.1 ns | 0.74 |
| u16/Std/Dense/3000/40 | no | 827 ns | 385 ns | 2.15 | 854 ns | 422 ns | 2.02 |
| u16/Std/Dense/3000/18 | no | 964 ns | 499 ns | 1.93 | 994 ns | 554 ns | 1.79 |
| u16/Std/Dense/3000/16 | no | 1019 ns | 528 ns | 1.93 | 1010 ns | 568 ns | 1.78 |
| u16/Std/Dense/3000/14 | no | 1043 ns | 577 ns | 1.81 | 1064 ns | 585 ns | 1.82 |
| u16/Std/Dense/3000/10 | no | 1202 ns | 695 ns | 1.73 | 1186 ns | 708 ns | 1.68 |
| u16/Std/Dense/3000/8 | yes | 1308 ns | 211 ns | 6.20 | 1268 ns | 337 ns | 3.76 |
| u16/Std/Dense/3000/5 | yes | 1514 ns | 210 ns | 7.21 | 1490 ns | 340 ns | 4.38 |
| u16/Std/Dense/3000/4 | yes | 1494 ns | 211 ns | 7.08 | 1458 ns | 355 ns | 4.11 |
| u16/Std/Dense/3000/3 | yes | 1438 ns | 232 ns | 6.20 | 1398 ns | 365 ns | 3.83 |
| u16/Std/Dense/3000/2 | yes | 1136 ns | 232 ns | 4.90 | 1271 ns | 346 ns | 3.67 |
| u16/Std/Dense/3000/1 | find | 74.3 ns | 76.0 ns | 0.98 | 87.2 ns | 89.2 ns | 0.98 |
| u16/Rng/Dense/3000/40 | no | 423 ns | 388 ns | 1.09 | 526 ns | 524 ns | 1.00 |
| u16/Rng/Dense/3000/18 | no | 549 ns | 506 ns | 1.08 | 643 ns | 702 ns | 0.92 |
| u16/Rng/Dense/3000/16 | no | 576 ns | 528 ns | 1.09 | 702 ns | 619 ns | 1.13 |
| u16/Rng/Dense/3000/14 | no | 599 ns | 576 ns | 1.04 | 701 ns | 634 ns | 1.11 |
| u16/Rng/Dense/3000/10 | no | 699 ns | 702 ns | 1.00 | 779 ns | 764 ns | 1.02 |
| u16/Rng/Dense/3000/8 | yes | 769 ns | 216 ns | 3.56 | 894 ns | 347 ns | 2.58 |
| u16/Rng/Dense/3000/5 | yes | 874 ns | 211 ns | 4.14 | 1002 ns | 341 ns | 2.94 |
| u16/Rng/Dense/3000/4 | yes | 1023 ns | 210 ns | 4.87 | 1110 ns | 339 ns | 3.27 |
| u16/Rng/Dense/3000/3 | yes | 1320 ns | 232 ns | 5.69 | 1260 ns | 344 ns | 3.66 |
| u16/Rng/Dense/3000/2 | yes | 1823 ns | 233 ns | 7.82 | 1769 ns | 344 ns | 5.14 |
| u16/Rng/Dense/3000/1 | find | 75.7 ns | 73.7 ns | 1.03 | 83.0 ns | 90.3 ns | 0.92 |
| u32/Std/Zones/3000/40 | no | 44.4 ns | 43.7 ns | 1.02 | 45.3 ns | 58.1 ns | 0.78 |
| u32/Std/Zones/3000/18 | no | 61.9 ns | 64.2 ns | 0.96 | 71.3 ns | 79.4 ns | 0.90 |
| u32/Std/Zones/3000/16 | no | 64.6 ns | 69.5 ns | 0.93 | 74.8 ns | 84.0 ns | 0.89 |
| u32/Std/Zones/3000/14 | no | 72.6 ns | 76.1 ns | 0.95 | 81.1 ns | 103 ns | 0.79 |
| u32/Std/Zones/3000/10 | no | 90.9 ns | 96.3 ns | 0.94 | 103 ns | 129 ns | 0.80 |
| u32/Std/Zones/3000/8 | no | 113 ns | 116 ns | 0.97 | 126 ns | 153 ns | 0.82 |
| u32/Std/Zones/3000/5 | no | 167 ns | 176 ns | 0.95 | 186 ns | 237 ns | 0.78 |
| u32/Std/Zones/3000/4 | yes | 196 ns | 162 ns | 1.21 | 228 ns | 230 ns | 0.99 |
| u32/Std/Zones/3000/3 | yes | 262 ns | 162 ns | 1.62 | 302 ns | 232 ns | 1.30 |
| u32/Std/Zones/3000/2 | yes | 393 ns | 163 ns | 2.41 | 440 ns | 229 ns | 1.92 |
| u32/Std/Zones/3000/1 | find | 80.1 ns | 80.3 ns | 1.00 | 70.5 ns | 75.8 ns | 0.93 |
| u32/Rng/Zones/3000/40 | no | 49.2 ns | 42.4 ns | 1.16 | 52.4 ns | 53.4 ns | 0.98 |
| u32/Rng/Zones/3000/18 | no | 101 ns | 59.0 ns | 1.71 | 100 ns | 77.2 ns | 1.30 |
| u32/Rng/Zones/3000/16 | no | 110 ns | 68.7 ns | 1.60 | 110 ns | 82.0 ns | 1.34 |
| u32/Rng/Zones/3000/14 | no | 125 ns | 75.9 ns | 1.65 | 122 ns | 101 ns | 1.21 |
| u32/Rng/Zones/3000/10 | no | 159 ns | 95.9 ns | 1.66 | 162 ns | 127 ns | 1.28 |
| u32/Rng/Zones/3000/8 | no | 194 ns | 118 ns | 1.64 | 198 ns | 154 ns | 1.29 |
| u32/Rng/Zones/3000/5 | no | 302 ns | 175 ns | 1.73 | 313 ns | 232 ns | 1.35 |
| u32/Rng/Zones/3000/4 | yes | 374 ns | 163 ns | 2.29 | 381 ns | 231 ns | 1.65 |
| u32/Rng/Zones/3000/3 | yes | 494 ns | 163 ns | 3.03 | 511 ns | 232 ns | 2.20 |
| u32/Rng/Zones/3000/2 | yes | 732 ns | 162 ns | 4.52 | 756 ns | 233 ns | 3.24 |
| u32/Rng/Zones/3000/1 | find | 80.4 ns | 80.3 ns | 1.00 | 70.6 ns | 74.3 ns | 0.95 |
| u32/Std/Dense/3000/40 | no | 921 ns | 360 ns | 2.56 | 821 ns | 534 ns | 1.54 |
| u32/Std/Dense/3000/18 | no | 1171 ns | 455 ns | 2.57 | 995 ns | 593 ns | 1.68 |
| u32/Std/Dense/3000/16 | no | 1187 ns | 475 ns | 2.50 | 978 ns | 624 ns | 1.57 |
| u32/Std/Dense/3000/14 | no | 1212 ns | 509 ns | 2.38 | 1000 ns | 659 ns | 1.52 |
| u32/Std/Dense/3000/10 | no | 1337 ns | 605 ns | 2.21 | 1059 ns | 865 ns | 1.22 |
| u32/Std/Dense/3000/8 | no | 1463 ns | 689 ns | 2.12 | 1119 ns | 952 ns | 1.18 |
| u32/Std/Dense/3000/5 | no | 1547 ns | 849 ns | 1.82 | 1268 ns | 1131 ns | 1.12 |
| u32/Std/Dense/3000/4 | yes | 1460 ns | 334 ns | 4.37 | 1332 ns | 470 ns | 2.83 |
| u32/Std/Dense/3000/3 | yes | 1442 ns | 333 ns | 4.33 | 1406 ns | 475 ns | 2.96 |
| u32/Std/Dense/3000/2 | yes | 1149 ns | 333 ns | 3.45 | 1211 ns | 477 ns | 2.54 |
| u32/Std/Dense/3000/1 | find | 163 ns | 158 ns | 1.03 | 151 ns | 153 ns | 0.99 |
| u32/Rng/Dense/3000/40 | no | 496 ns | 357 ns | 1.39 | 553 ns | 537 ns | 1.03 |
| u32/Rng/Dense/3000/18 | no | 600 ns | 458 ns | 1.31 | 672 ns | 585 ns | 1.15 |
| u32/Rng/Dense/3000/16 | no | 638 ns | 473 ns | 1.35 | 665 ns | 623 ns | 1.07 |
| u32/Rng/Dense/3000/14 | no | 665 ns | 511 ns | 1.30 | 687 ns | 664 ns | 1.03 |
| u32/Rng/Dense/3000/10 | no | 777 ns | 613 ns | 1.27 | 784 ns | 856 ns | 0.92 |
| u32/Rng/Dense/3000/8 | no | 857 ns | 688 ns | 1.25 | 873 ns | 961 ns | 0.91 |
| u32/Rng/Dense/3000/5 | no | 991 ns | 852 ns | 1.16 | 1013 ns | 1127 ns | 0.90 |
| u32/Rng/Dense/3000/4 | yes | 1105 ns | 343 ns | 3.22 | 1050 ns | 470 ns | 2.23 |
| u32/Rng/Dense/3000/3 | yes | 1258 ns | 337 ns | 3.73 | 1275 ns | 474 ns | 2.69 |
| u32/Rng/Dense/3000/2 | yes | 1751 ns | 337 ns | 5.20 | 1863 ns | 470 ns | 3.96 |
| u32/Rng/Dense/3000/1 | find | 160 ns | 159 ns | 1.01 | 157 ns | 152 ns | 1.03 |
| u64/Std/Zones/3000/40 | no | 40.9 ns | 50.5 ns | 0.81 | 48.3 ns | 54.5 ns | 0.89 |
| u64/Std/Zones/3000/18 | no | 58.2 ns | 74.7 ns | 0.78 | 68.8 ns | 76.3 ns | 0.90 |
| u64/Std/Zones/3000/16 | no | 63.3 ns | 82.8 ns | 0.76 | 73.9 ns | 83.7 ns | 0.88 |
| u64/Std/Zones/3000/14 | no | 68.0 ns | 90.9 ns | 0.75 | 79.9 ns | 92.9 ns | 0.86 |
| u64/Std/Zones/3000/10 | no | 87.0 ns | 114 ns | 0.76 | 102 ns | 118 ns | 0.86 |
| u64/Std/Zones/3000/8 | no | 106 ns | 140 ns | 0.76 | 124 ns | 143 ns | 0.87 |
| u64/Std/Zones/3000/5 | no | 168 ns | 209 ns | 0.80 | 187 ns | 223 ns | 0.84 |
| u64/Std/Zones/3000/4 | no | 192 ns | 259 ns | 0.74 | 230 ns | 292 ns | 0.79 |
| u64/Std/Zones/3000/3 | no | 266 ns | 332 ns | 0.80 | 295 ns | 354 ns | 0.83 |
| u64/Std/Zones/3000/2 | yes | 372 ns | 248 ns | 1.50 | 434 ns | 395 ns | 1.10 |
| u64/Std/Zones/3000/1 | find | 152 ns | 151 ns | 1.01 | 157 ns | 158 ns | 0.99 |
| u64/Rng/Zones/3000/40 | no | 66.2 ns | 49.5 ns | 1.34 | 59.0 ns | 55.9 ns | 1.06 |
| u64/Rng/Zones/3000/18 | no | 105 ns | 73.8 ns | 1.42 | 101 ns | 74.9 ns | 1.35 |
| u64/Rng/Zones/3000/16 | no | 117 ns | 81.2 ns | 1.44 | 111 ns | 83.4 ns | 1.33 |
| u64/Rng/Zones/3000/14 | no | 130 ns | 90.4 ns | 1.44 | 126 ns | 91.9 ns | 1.37 |
| u64/Rng/Zones/3000/10 | no | 171 ns | 112 ns | 1.53 | 161 ns | 118 ns | 1.36 |
| u64/Rng/Zones/3000/8 | no | 209 ns | 137 ns | 1.53 | 201 ns | 141 ns | 1.43 |
| u64/Rng/Zones/3000/5 | no | 325 ns | 204 ns | 1.59 | 312 ns | 211 ns | 1.48 |
| u64/Rng/Zones/3000/4 | no | 402 ns | 251 ns | 1.60 | 381 ns | 262 ns | 1.45 |
| u64/Rng/Zones/3000/3 | no | 531 ns | 333 ns | 1.59 | 501 ns | 346 ns | 1.45 |
| u64/Rng/Zones/3000/2 | yes | 796 ns | 242 ns | 3.29 | 746 ns | 357 ns | 2.09 |
| u64/Rng/Zones/3000/1 | find | 149 ns | 150 ns | 0.99 | 152 ns | 150 ns | 1.01 |
| u64/Std/Dense/3000/40 | no | 936 ns | 384 ns | 2.44 | 1172 ns | 578 ns | 2.03 |
| u64/Std/Dense/3000/18 | no | 1122 ns | 545 ns | 2.06 | 1244 ns | 601 ns | 2.07 |
| u64/Std/Dense/3000/16 | no | 1177 ns | 545 ns | 2.16 | 1239 ns | 622 ns | 1.99 |
| u64/Std/Dense/3000/14 | no | 1208 ns | 597 ns | 2.02 | 1225 ns | 632 ns | 1.94 |
| u64/Std/Dense/3000/10 | no | 1320 ns | 766 ns | 1.72 | 1293 ns | 666 ns | 1.94 |
| u64/Std/Dense/3000/8 | no | 1426 ns | 910 ns | 1.57 | 1395 ns | 735 ns | 1.90 |
| u64/Std/Dense/3000/5 | no | 1582 ns | 1075 ns | 1.47 | 1606 ns | 966 ns | 1.66 |
| u64/Std/Dense/3000/4 | no | 1488 ns | 1219 ns | 1.22 | 1673 ns | 1095 ns | 1.53 |
| u64/Std/Dense/3000/3 | no | 1506 ns | 1296 ns | 1.16 | 1702 ns | 1126 ns | 1.51 |
| u64/Std/Dense/3000/2 | yes | 1464 ns | 470 ns | 3.11 | 1457 ns | 689 ns | 2.11 |
| u64/Std/Dense/3000/1 | find | 285 ns | 303 ns | 0.94 | 288 ns | 291 ns | 0.99 |
| u64/Rng/Dense/3000/40 | no | 576 ns | 384 ns | 1.50 | 483 ns | 578 ns | 0.84 |
| u64/Rng/Dense/3000/18 | no | 876 ns | 546 ns | 1.60 | 521 ns | 590 ns | 0.88 |
| u64/Rng/Dense/3000/16 | no | 866 ns | 540 ns | 1.60 | 546 ns | 624 ns | 0.88 |
| u64/Rng/Dense/3000/14 | no | 883 ns | 593 ns | 1.49 | 559 ns | 625 ns | 0.89 |
| u64/Rng/Dense/3000/10 | no | 944 ns | 773 ns | 1.22 | 632 ns | 674 ns | 0.94 |
| u64/Rng/Dense/3000/8 | no | 1052 ns | 914 ns | 1.15 | 743 ns | 731 ns | 1.02 |
| u64/Rng/Dense/3000/5 | no | 1071 ns | 1079 ns | 0.99 | 970 ns | 950 ns | 1.02 |
| u64/Rng/Dense/3000/4 | no | 1136 ns | 1224 ns | 0.93 | 1123 ns | 1096 ns | 1.02 |
| u64/Rng/Dense/3000/3 | no | 1303 ns | 1282 ns | 1.02 | 1317 ns | 1130 ns | 1.17 |
| u64/Rng/Dense/3000/2 | yes | 1761 ns | 474 ns | 3.72 | 1801 ns | 691 ns | 2.61 |
| u64/Rng/Dense/3000/1 | find | 286 ns | 302 ns | 0.95 | 290 ns | 296 ns | 0.98 |

🥈 Results interpretation

For x64 and the vectorized values of n, there is a clear improvement for Zones. For Dense the improvement is even greater.

The non-vectorized cases vary a lot. The fallback often happens to be faster than the header implementation, but not always. Of the header implementations, surprisingly, the ranges one is slower in the Zones case.

The x86 results are not very good, but not too bad either.

The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.

StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews on Mar 22, 2025
StephanTLavavej added the performance (Must go faster) label on Mar 22, 2025

AlexGuteniev: This comment was marked as resolved.

AlexGuteniev marked this pull request as ready for review on March 25, 2025 05:29
AlexGuteniev requested a review from a team as a code owner on March 25, 2025 05:29
# Conflicts:
#   benchmarks/src/search_n.cpp
#   stl/inc/algorithm
#   stl/src/vector_algorithms.cpp
StephanTLavavej self-assigned this on Mar 25, 2025
StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews on Mar 25, 2025

StephanTLavavej (Member) commented:

Thanks! 😻 I pushed moderate changes - please double-check.

5950X results:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 37.4 ns | 38.8 ns | 0.96 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 65.9 ns | 47.9 ns | 1.38 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 72.1 ns | 51.0 ns | 1.41 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.4 ns | 51.3 ns | 1.55 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 105 ns | 51.3 ns | 2.05 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 51.1 ns | 2.50 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 199 ns | 51.9 ns | 3.83 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 246 ns | 50.9 ns | 4.83 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 323 ns | 53.8 ns | 6.00 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 484 ns | 52.5 ns | 9.22 |
| bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 23.0 ns | 19.7 ns | 1.17 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.6 ns | 38.9 ns | 0.94 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 60.1 ns | 47.7 ns | 1.26 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 66.1 ns | 51.2 ns | 1.29 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 74.1 ns | 51.4 ns | 1.44 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 104 ns | 51.4 ns | 2.02 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 127 ns | 51.7 ns | 2.46 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 197 ns | 52.1 ns | 3.78 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 244 ns | 51.1 ns | 4.77 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 331 ns | 59.5 ns | 5.56 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 479 ns | 54.4 ns | 8.81 |
| bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 23.2 ns | 19.6 ns | 1.18 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 992 ns | 569 ns | 1.74 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 925 ns | 608 ns | 1.52 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 924 ns | 98.4 ns | 9.39 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 952 ns | 102 ns | 9.33 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1042 ns | 101 ns | 10.32 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1108 ns | 102 ns | 10.86 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1187 ns | 102 ns | 11.64 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1241 ns | 105 ns | 11.82 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1179 ns | 103 ns | 11.45 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1125 ns | 104 ns | 10.82 |
| bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 42.6 ns | 38.5 ns | 1.11 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 364 ns | 566 ns | 0.64 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 451 ns | 612 ns | 0.74 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 478 ns | 98.0 ns | 4.88 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 490 ns | 100 ns | 4.90 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 559 ns | 100 ns | 5.59 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 641 ns | 102 ns | 6.28 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 803 ns | 102 ns | 7.87 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 900 ns | 105 ns | 8.57 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 977 ns | 104 ns | 9.39 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1176 ns | 104 ns | 11.31 |
| bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 46.1 ns | 41.3 ns | 1.12 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 37.3 ns | 46.1 ns | 0.81 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.4 ns | 78.5 ns | 0.85 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 72.6 ns | 86.4 ns | 0.84 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.7 ns | 97.8 ns | 0.81 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 105 ns | 136 ns | 0.77 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 93.0 ns | 1.38 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 198 ns | 93.0 ns | 2.13 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 245 ns | 94.3 ns | 2.60 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 324 ns | 96.9 ns | 3.34 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 479 ns | 95.8 ns | 5.00 |
| bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 46.8 ns | 48.3 ns | 0.97 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.7 ns | 46.0 ns | 0.80 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 63.4 ns | 78.5 ns | 0.81 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.8 ns | 86.4 ns | 0.81 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 79.0 ns | 97.7 ns | 0.81 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 118 ns | 136 ns | 0.87 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 144 ns | 93.3 ns | 1.54 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 224 ns | 93.0 ns | 2.41 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 266 ns | 94.5 ns | 2.81 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 353 ns | 97.0 ns | 3.64 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 523 ns | 95.8 ns | 5.46 |
| bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 47.2 ns | 48.7 ns | 0.97 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1069 ns | 405 ns | 2.64 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 959 ns | 525 ns | 1.83 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 1002 ns | 573 ns | 1.75 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1036 ns | 594 ns | 1.74 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1117 ns | 721 ns | 1.55 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1221 ns | 172 ns | 7.10 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1368 ns | 172 ns | 7.95 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1377 ns | 172 ns | 8.01 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1419 ns | 176 ns | 8.06 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1428 ns | 176 ns | 8.11 |
| bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 81.8 ns | 85.2 ns | 0.96 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 556 ns | 414 ns | 1.34 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 612 ns | 528 ns | 1.16 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 647 ns | 573 ns | 1.13 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 662 ns | 598 ns | 1.11 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 723 ns | 729 ns | 0.99 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 810 ns | 173 ns | 4.68 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 918 ns | 172 ns | 5.34 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1005 ns | 172 ns | 5.84 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1048 ns | 176 ns | 5.95 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1264 ns | 177 ns | 7.14 |
| bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 82.3 ns | 85.4 ns | 0.96 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 40.5 ns | 30.5 ns | 1.33 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.1 ns | 47.7 ns | 1.39 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 71.4 ns | 52.5 ns | 1.36 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 80.4 ns | 58.7 ns | 1.37 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 104 ns | 84.2 ns | 1.24 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 103 ns | 1.24 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 197 ns | 157 ns | 1.25 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 246 ns | 151 ns | 1.63 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 324 ns | 152 ns | 2.13 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 481 ns | 152 ns | 3.16 |
| bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 84.7 ns | 87.7 ns | 0.97 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.5 ns | 30.5 ns | 1.20 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 62.3 ns | 47.8 ns | 1.30 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.0 ns | 52.5 ns | 1.31 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 77.9 ns | 58.7 ns | 1.33 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 118 ns | 84.8 ns | 1.39 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 146 ns | 103 ns | 1.42 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 224 ns | 157 ns | 1.43 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 280 ns | 152 ns | 1.84 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 358 ns | 152 ns | 2.36 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 535 ns | 153 ns | 3.50 |
| bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 84.8 ns | 87.7 ns | 0.97 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1013 ns | 385 ns | 2.63 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 960 ns | 441 ns | 2.18 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 994 ns | 467 ns | 2.13 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1010 ns | 481 ns | 2.10 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1102 ns | 545 ns | 2.02 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1186 ns | 633 ns | 1.87 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1304 ns | 792 ns | 1.65 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1381 ns | 289 ns | 4.78 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1411 ns | 288 ns | 4.90 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1420 ns | 287 ns | 4.95 |
| bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 157 ns | 164 ns | 0.96 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 389 ns | 387 ns | 1.01 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 489 ns | 441 ns | 1.11 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 530 ns | 469 ns | 1.13 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 556 ns | 483 ns | 1.15 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 652 ns | 547 ns | 1.19 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 756 ns | 636 ns | 1.19 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 957 ns | 795 ns | 1.20 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1086 ns | 288 ns | 3.77 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1227 ns | 287 ns | 4.28 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1351 ns | 287 ns | 4.71 |
| bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 157 ns | 164 ns | 0.96 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 40.4 ns | 40.4 ns | 1.00 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.0 ns | 55.5 ns | 1.19 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 71.2 ns | 60.2 ns | 1.18 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.9 ns | 67.4 ns | 1.19 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 104 ns | 95.6 ns | 1.09 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 116 ns | 1.10 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 197 ns | 177 ns | 1.11 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 243 ns | 219 ns | 1.11 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 322 ns | 288 ns | 1.12 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 479 ns | 241 ns | 1.99 |
| bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 160 ns | 167 ns | 0.96 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.3 ns | 41.0 ns | 0.89 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 62.7 ns | 56.4 ns | 1.11 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.7 ns | 60.8 ns | 1.15 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 77.9 ns | 67.6 ns | 1.15 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 121 ns | 96.2 ns | 1.26 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 149 ns | 117 ns | 1.27 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 230 ns | 178 ns | 1.29 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 286 ns | 219 ns | 1.31 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 375 ns | 288 ns | 1.30 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 561 ns | 239 ns | 2.35 |
| bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 160 ns | 167 ns | 0.96 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1559 ns | 565 ns | 2.76 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 1426 ns | 609 ns | 2.34 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 1435 ns | 643 ns | 2.23 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1427 ns | 654 ns | 2.18 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1437 ns | 714 ns | 2.01 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1497 ns | 812 ns | 1.84 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1547 ns | 919 ns | 1.68 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1522 ns | 1013 ns | 1.50 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1455 ns | 1044 ns | 1.39 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1257 ns | 461 ns | 2.73 |
| bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 307 ns | 322 ns | 0.95 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 386 ns | 565 ns | 0.68 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 480 ns | 607 ns | 0.79 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 517 ns | 642 ns | 0.81 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 549 ns | 657 ns | 0.84 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 644 ns | 716 ns | 0.90 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 742 ns | 813 ns | 0.91 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 944 ns | 925 ns | 1.02 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1075 ns | 1017 ns | 1.06 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1192 ns | 1045 ns | 1.14 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1476 ns | 460 ns | 3.21 |
| bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 308 ns | 321 ns | 0.96 |

StephanTLavavej removed their assignment on Apr 21, 2025
StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews on Apr 21, 2025

StephanTLavavej: This comment was marked as resolved.

azure-pipelines: This comment was marked as resolved.

AlexGuteniev (Contributor, Author) commented:

please double-check.

All good.

I'd also like you to review the PR description and explicitly answer about the n=1 case.

StephanTLavavej (Member) commented:

Thanks! I think we should keep handling n=1 in the headers. Having a separate check in the separately compiled code is fine. If you want to have the check only in the headers, then a comment in the separately compiled code, saying that we're assuming the check has already been done, would be a good idea.

AlexGuteniev reacted with thumbs up emoji

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews on Apr 22, 2025

StephanTLavavej (Member) commented:

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej merged commit b0bd6a7 into microsoft:main on Apr 22, 2025
39 checks passed
github-project-automation (bot) moved this from Merging to Done in STL Code Reviews on Apr 22, 2025

StephanTLavavej (Member) commented:

🕵️ 🔍 🔢


Reviewers: StephanTLavavej approved these changes
Assignees: No one assigned
Labels: performance (Must go faster)
Projects: Archived in project
Milestone: No milestone
Participants (2): AlexGuteniev, StephanTLavavej
