Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Vectorizesearch_n for CPUs with SSE4.2 but not AVX2 support; handle AVX2 tail#5544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
StephanTLavavej merged 8 commits intomicrosoft:mainfromAlexGuteniev:search-n-see
Oct 4, 2025

Conversation

@AlexGuteniev
Copy link
Contributor

@AlexGutenievAlexGuteniev commentedMay 25, 2025
edited
Loading

Copy AVX2 path to SSE4.2, adapt it as follows:

  • Twice less than vector size threshold is 8
    • As a result, not engaged for 64-bit elements (disabled at compile time)
    • And only two variable shifts
  • Sincemovemask output fits 16 bit (despite using 32-bit register), the mask with carry fits 32 bit
    • As a result, no tricky shift intrinsic needed, it will be good for x86 already
  • Use BSF/BSR instead of TZCNT/LZCNT
    • Nonzero input always, albeit somewhat tricky to prove for BSR

Copy AVX2 path to AVX2 tail path, adapt it as follows:

  • AND with tail mask before movemask
  • Update_Carry with shifted_Msk_with_carry to take into account previous_Carry for tails smaller thann.

SSE4.2 benchmark results

These results are actually fake. AVX2 path is artificially disabled on an AVX2-capable CPU.
Also they suffer much from random variation.
Still they are useful enough to see that SSE4.2 path is indeed helpful.

I've highlighted the relevant results.

BenchmarkBeforeAfterSpeedup
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/4042.0 ns41.2 ns1.02
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1857.1 ns57.2 ns1.00
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1662.5 ns63.2 ns0.99
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1469.2 ns68.7 ns1.01
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1084.7 ns84.8 ns1.00
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/8103 ns118 ns0.87 📈
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/5161 ns117 ns1.38 📉
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/4191 ns134 ns1.43 📉
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/3248 ns117 ns2.12 📉
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/2373 ns116 ns3.22 📉
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/118.1 ns28.7 ns0.63
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/4042.1 ns41.4 ns1.02
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1857.7 ns57.5 ns1.00
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1663.0 ns62.9 ns1.00
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1470.1 ns69.8 ns1.00
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1086.6 ns85.7 ns1.01
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/8104 ns120 ns0.87 📈
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/5157 ns120 ns1.31 📉
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/4189 ns118 ns1.60 📉
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/3244 ns118 ns2.07 📉
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/2369 ns117 ns3.15 📉
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/117.7 ns28.4 ns0.62
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40353 ns355 ns0.99
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18461 ns459 ns1.00
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16474 ns478 ns0.99
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14513 ns530 ns0.97
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10636 ns633 ns1.00
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8739 ns216 ns3.42 📉
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5930 ns213 ns4.37 📉
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4972 ns218 ns4.46 📉
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/31014 ns222 ns4.57 📉
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/21053 ns220 ns4.79 📉
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/150.2 ns38.7 ns1.30
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40356 ns355 ns1.00
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18450 ns463 ns0.97
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16480 ns483 ns0.99
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14516 ns521 ns0.99
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10618 ns645 ns0.96
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8736 ns216 ns3.41 📉
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5951 ns217 ns4.38 📉
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/41039 ns216 ns4.81 📉
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/31086 ns214 ns5.07 📉
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/21855 ns218 ns8.51 📉
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/151.8 ns38.6 ns1.34
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/4058.8 ns45.4 ns1.30
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1874.8 ns62.7 ns1.19
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1679.1 ns68.3 ns1.16
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1482.4 ns84.6 ns0.97
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/10102 ns108 ns0.94
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/8131 ns135 ns0.97
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/5197 ns200 ns0.99
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/4233 ns162 ns1.44 📉
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/3311 ns161 ns1.93 📉
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/2450 ns161 ns2.80 📉
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/136.1 ns35.1 ns1.03
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/4053.4 ns46.6 ns1.15
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1868.9 ns69.3 ns0.99
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1674.6 ns75.7 ns0.99
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1481.4 ns84.2 ns0.97
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/10104 ns112 ns0.93
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/8126 ns131 ns0.96
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/5191 ns195 ns0.98
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/4228 ns161 ns1.42 📉
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/3298 ns161 ns1.85 📉
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/2454 ns160 ns2.84 📉
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/138.0 ns36.0 ns1.06
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40613 ns372 ns1.65
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18651 ns449 ns1.45
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16710 ns476 ns1.49
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14732 ns517 ns1.42
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10812 ns633 ns1.28
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8873 ns706 ns1.24
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5986 ns868 ns1.14
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/41077 ns331 ns3.25 📉
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/31152 ns327 ns3.52 📉
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/21138 ns333 ns3.42 📉
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/181.8 ns79.8 ns1.03
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40610 ns370 ns1.65
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18660 ns454 ns1.45
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16722 ns484 ns1.49
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14724 ns507 ns1.43
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10816 ns625 ns1.31
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8894 ns715 ns1.25
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5988 ns880 ns1.12
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/41061 ns329 ns3.22 📉
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/31131 ns330 ns3.43 📉
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/21124 ns327 ns3.44 📉
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/181.1 ns81.2 ns1.00
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/4045.9 ns51.4 ns0.89
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1867.8 ns70.1 ns0.97
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1674.2 ns77.3 ns0.96
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1482.2 ns84.8 ns0.97
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/10104 ns105 ns0.99
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/8125 ns128 ns0.98
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/5184 ns187 ns0.98
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/4231 ns227 ns1.02
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/3294 ns292 ns1.01
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/2430 ns244 ns1.76 📉
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/169.9 ns68.4 ns1.02
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/4046.4 ns51.2 ns0.91
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1868.5 ns71.5 ns0.96
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1674.5 ns77.3 ns0.96
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1484.3 ns85.0 ns0.99
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/10107 ns104 ns1.03
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/8131 ns125 ns1.05
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/5193 ns191 ns1.01
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/4220 ns226 ns0.97
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/3293 ns294 ns1.00
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/2441 ns243 ns1.81 📉
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/170.0 ns70.1 ns1.00
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40373 ns369 ns1.01
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18501 ns521 ns0.96
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16531 ns551 ns0.96
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14565 ns591 ns0.96
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10699 ns771 ns0.91
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8831 ns938 ns0.89
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/51033 ns1069 ns0.97
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/41154 ns1221 ns0.95
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/31231 ns1280 ns0.96
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/21454 ns462 ns3.15 📉
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1155 ns146 ns1.06
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40377 ns408 ns0.92
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18496 ns555 ns0.89
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16529 ns541 ns0.98
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14562 ns601 ns0.94
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10687 ns772 ns0.89
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8799 ns1005 ns0.80
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/51031 ns1276 ns0.81
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/41157 ns1316 ns0.88
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/31246 ns1371 ns0.91
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/21424 ns514 ns2.77 📉
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1148 ns164 ns0.90

StephanTLavavej and Oxicid reacted with heart emoji
@StephanTLavavejStephanTLavavej self-assigned thisMay 27, 2025
@StephanTLavavejStephanTLavavej added the performanceMust go faster labelMay 27, 2025
@StephanTLavavejStephanTLavavej removed their assignmentOct 1, 2025
@StephanTLavavejStephanTLavavej moved this fromInitial Review toReady To Merge inSTL Code ReviewsOct 1, 2025
@StephanTLavavejStephanTLavavej moved this fromReady To Merge toMerging inSTL Code ReviewsOct 3, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavejStephanTLavavej merged commitc95e938 intomicrosoft:mainOct 4, 2025
39 checks passed
@github-project-automationgithub-project-automationbot moved this fromMerging toDone inSTL Code ReviewsOct 4, 2025
@StephanTLavavej
Copy link
Member

🔍 🕵️ 🔎

@AlexGutenievAlexGuteniev deleted the search-n-see branchOctober 4, 2025 04:54
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@StephanTLavavejStephanTLavavejStephanTLavavej approved these changes

Assignees

No one assigned

Labels

performanceMust go faster

Projects

Archived in project

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@AlexGuteniev@StephanTLavavej

[8]ページ先頭

©2009-2025 Movatter.jp