`<algorithm>`: Implement worst-case linear-time `nth_element` #5100

Merged

StephanTLavavej merged 28 commits into microsoft:main from muellerj2:nth_element-worstcase-linear

Apr 23, 2025

muellerj2 commented Nov 19, 2024

Fixes #856 for `std::nth_element` and `std::ranges::nth_element`. This implements a fallback to the median-of-medians-of-five algorithm when the quickselect algorithm seems to be making too little progress.

The median-of-medians algorithm is mostly the textbook version, with two minor tweaks:

- If the processed sequence doesn't cleanly divide into groups of five elements, the remainder group with fewer than five elements isn't considered for the median computation. (This reduces the amount of code and doesn't make any difference in the asymptotics. I couldn't observe any practical difference in running time, either.)
- When the pivot (= median of medians) has been computed, all (greater) medians located after the pivot are moved to the very end of the processed sequence and the pivot is swapped into the middle of the sequence. All of these elements are guaranteed to be moved by the pivot partitioning algorithm anyway, so this step immediately moves the medians into an appropriate position (and the pivot probably closer to one). This way, the medians can also be excluded from the sequence on which the partitioning algorithm is applied, avoiding some unnecessary comparisons. (In practice, the benchmarks suggested that this makes the algorithm a few percent faster, but the difference is minor.)
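For readers unfamiliar with the technique, here is a minimal sketch of textbook median-of-medians-of-five selection that includes the first tweak (a trailing group of fewer than five elements is ignored). It is not the code from this PR: the name `mom_select` is hypothetical, it sorts each group with `std::sort` for brevity, it copies the pivot value, and it does not attempt the second tweak of moving the larger medians to the end.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical sketch, not the STL's implementation.
// Rearranges [first, last) so that *nth is the element that would be there
// if the range were sorted, with smaller-or-equal elements before it.
template <class It>
void mom_select(It first, It nth, It last) {
    using diff_t = typename std::iterator_traits<It>::difference_type;
    for (;;) {
        const diff_t length = last - first;
        if (length < 5) {
            std::sort(first, last); // tiny range: sorting places nth correctly
            return;
        }

        // Form groups of five from the front; the remainder group is ignored.
        const diff_t groups = length / 5;
        for (diff_t g = 0; g < groups; ++g) {
            const It group = first + 5 * g;
            std::sort(group, group + 5);          // median is now at group + 2
            std::iter_swap(first + g, group + 2); // gather medians at the front
        }

        // Recursively select the median of the gathered medians as the pivot.
        const It mid = first + groups / 2;
        mom_select(first, mid, first + groups);
        const auto pivot = *mid; // copied so partitioning cannot invalidate it

        // Three-way partition: [first, lt) < pivot, [lt, gt) == pivot, [gt, last) > pivot.
        const It lt = std::partition(first, last, [&](const auto& x) { return x < pivot; });
        const It gt = std::partition(lt, last, [&](const auto& x) { return !(pivot < x); });

        if (nth < lt) {
            last = lt;  // continue in the "less" part
        } else if (nth < gt) {
            return;     // nth lies in the block of elements equal to the pivot
        } else {
            first = gt; // continue in the "greater" part
        }
    }
}
```

Because the pivot is a median of medians, each partition discards a constant fraction of the range, which is what gives the worst-case linear bound.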

Benchmark results

`bm_uniform` just applies `nth_element` to an integer array of the given length. The integer array is uniformly sampled from a fixed seed. This is to check that the worst-case fallback does not noticeably worsen the processing time on such a sequence.

`bm_tunkey_adversary` applies `nth_element` to a sequence on which the implemented quickselect algorithm performs terribly.

Before:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| `bm_uniform<alg_type::std_fn>/1024` | 1845 ns | 1803 ns | 407273 |
| `bm_uniform<alg_type::std_fn>/2048` | 3966 ns | 3990 ns | 172308 |
| `bm_uniform<alg_type::std_fn>/4096` | 7702 ns | 7673 ns | 89600 |
| `bm_uniform<alg_type::std_fn>/8192` | 18090 ns | 18032 ns | 40727 |
| `bm_uniform<alg_type::rng>/1024` | 1759 ns | 1758 ns | 373333 |
| `bm_uniform<alg_type::rng>/2048` | 3985 ns | 4011 ns | 179200 |
| `bm_uniform<alg_type::rng>/4096` | 7694 ns | 7847 ns | 89600 |
| `bm_uniform<alg_type::rng>/8192` | 18015 ns | 17997 ns | 37333 |
| `bm_tunkey_adversary<alg_type::std_fn>` | 12995 ns | 13393 ns | 56000 |
| `bm_tunkey_adversary<alg_type::rng>` | 12714 ns | 12835 ns | 56000 |

After:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| `bm_uniform<alg_type::std_fn>/1024` | 1599 ns | 1604 ns | 448000 |
| `bm_uniform<alg_type::std_fn>/2048` | 3626 ns | 3610 ns | 194783 |
| `bm_uniform<alg_type::std_fn>/4096` | 7068 ns | 7150 ns | 89600 |
| `bm_uniform<alg_type::std_fn>/8192` | 16469 ns | 16044 ns | 44800 |
| `bm_uniform<alg_type::rng>/1024` | 1701 ns | 1709 ns | 448000 |
| `bm_uniform<alg_type::rng>/2048` | 3841 ns | 3931 ns | 194783 |
| `bm_uniform<alg_type::rng>/4096` | 7447 ns | 7324 ns | 74667 |
| `bm_uniform<alg_type::rng>/8192` | 17024 ns | 16741 ns | 37333 |
| `bm_tunkey_adversary<alg_type::std_fn>` | 6075 ns | 5929 ns | 89600 |
| `bm_tunkey_adversary<alg_type::rng>` | 6270 ns | 6278 ns | 112000 |

As expected, the fallback greatly improves the running time for `bm_tunkey_adversary`. The timings for `bm_uniform` are about on par, or more precisely even a bit better with this PR on my machine.

The fallback heuristic

`std::sort` switches to its fallback when the recursion depth exceeds some logarithmic threshold. We could use the same heuristic here, but it would not guarantee linear time in the worst case, "only" an $O(n \log n)$ bound. Alternatively, we could limit the recursion depth to some constant, but that's likely a pessimization for large sequences.

So I opted for an adaptive depth limit: Like the heuristic for `std::sort`, it assumes that each iteration should reduce the range of inspected elements by 25%. But while `std::sort` derives a maximum recursion depth from this assumption, this heuristic falls back to the median-of-medians algorithm when the actual size of the processed sequence exceeds the desired size by some constant tolerance factor (currently about 2) during some iteration. Thus, the total number of processed elements over all quickselect iterations is bounded by a multiple of the sequence length times a geometric sum, ensuring worst-case linear time overall. At the same time, the tolerance factor introduces some leeway so that one or two bad iterations (especially at the beginning) don't trigger the fallback immediately.
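The heuristic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `adaptive_select`, its parameters, the cutoff of 32 elements, and the exact tolerance of 2 are all hypothetical, the pivot is deliberately the first element so that sorted input behaves adversarially, and the linear-time routine is supplied by the caller as `fallback`.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical sketch of an adaptive fallback trigger for quickselect.
// Returns true if the caller-supplied fallback was invoked.
template <class It, class Fallback>
bool adaptive_select(It first, It nth, It last, Fallback fallback) {
    using diff_t = typename std::iterator_traits<It>::difference_type;
    constexpr diff_t tolerance = 2; // assumed value; the PR says "about 2"
    diff_t desired = last - first;  // size the range "should" have by now

    for (;;) {
        const diff_t length = last - first;
        if (length <= 32) { // assumed small-range cutoff
            std::sort(first, last);
            return false;
        }
        if (length > tolerance * desired) {
            // Too little progress: hand over to the worst-case linear algorithm.
            fallback(first, nth, last);
            return true;
        }
        desired -= desired / 4; // each iteration should shrink the range by 25%

        // Naive quickselect step: pivot on the first element (copied).
        const auto pivot = *first;
        const It lt = std::partition(first, last, [&](const auto& x) { return x < pivot; });
        const It gt = std::partition(lt, last, [&](const auto& x) { return !(pivot < x); });

        if (nth < lt) {
            last = lt;
        } else if (nth < gt) {
            return false; // nth is inside the block equal to the pivot
        } else {
            first = gt;
        }
    }
}
```

A call might pass any linear-time selection as the fallback, for example `adaptive_select(v.begin(), v.begin() + k, v.end(), [](auto f, auto n, auto l) { std::nth_element(f, n, l); })`. Because `desired` starts at the full length, the first couple of bad iterations stay within the tolerance, which matches the leeway described above.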

Obviously, there are many possible choices for the desired percentage reduction per iteration and the tolerance factor. But the benchmarks seem to suggest that the chosen values aren't too bad; a smaller percentage reduction or a larger tolerance factor noticeably worsens the `bm_tunkey_adversary` benchmark, but results in little difference for `bm_uniform`. Besides, the implementation of `std::sort` already sets a precedent for a desired reduction of 25% per iteration.
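To make the geometric-sum argument concrete, here is a worked bound under simplifying assumptions that are mine rather than the PR's: the tolerance factor is exactly $c = 2$, the desired size after $k$ iterations is $(3/4)^k n$ for a sequence of length $n$, and a quickselect iteration on a range of length $n_k$ costs at most $a \cdot n_k$ for some constant $a$. As long as the fallback has not been triggered, $n_k \le c \left(\tfrac{3}{4}\right)^k n$, so the quickselect iterations together cost at most

$$\sum_{k \ge 0} a\, n_k \;\le\; a\, c\, n \sum_{k \ge 0} \left(\tfrac{3}{4}\right)^k \;=\; a\, c\, n \cdot \frac{1}{1 - \tfrac{3}{4}} \;=\; 4\, a\, c\, n \;=\; 8\, a\, n.$$

If the fallback triggers, it runs once on a range of at most $n$ elements and adds another $O(n)$ term, so the total stays linear. The same formula shows why the parameters matter for the adversarial case: a desired reduction of only 10% per iteration would replace the factor $4$ by $1 / (1 - 0.9) = 10$, and a larger $c$ scales the bound proportionally.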

Test

The newly added test applies `nth_element` to the same worst-case sequence as the benchmark. This makes sure that the fallback is actually exercised by the test. (I think it's also the first test that exercises the quickselect algorithm and not just the insertion sort fallback.)
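As an illustration of what such a test needs to verify, the sketch below checks the `nth_element` postcondition on a result. It is not the test from this PR: the helper name `is_valid_nth_element_result` is hypothetical, it is restricted to `std::vector<int>`, and it assumes `nth` is a valid index into the result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical checker: returns true if `result` is a rearrangement of
// `original` that satisfies the nth_element postcondition at index `nth`.
// Assumes nth < result.size().
inline bool is_valid_nth_element_result(
    const std::vector<int>& original, const std::vector<int>& result, std::size_t nth) {
    // The algorithm may only permute the input (4-argument is_permutation
    // also rejects ranges of different lengths).
    if (!std::is_permutation(result.begin(), result.end(), original.begin(), original.end())) {
        return false;
    }
    const auto mid = result.begin() + static_cast<std::ptrdiff_t>(nth);
    // Nothing before mid may be greater than *mid ...
    const bool left_ok = std::all_of(result.begin(), mid, [&](int x) { return !(*mid < x); });
    // ... and nothing from mid onward may be less than *mid.
    const bool right_ok = std::all_of(mid, result.end(), [&](int x) { return !(x < *mid); });
    return left_ok && right_ok;
}
```

A test would copy the adversarial input, call `std::nth_element` (or `std::ranges::nth_element`) on the copy, and assert that this check holds.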

muellerj2 added 3 commits

November 19, 2024 12:50

implement worst-case linear-time `std::nth_element`

a4eb67a

implement worst-case linear time `std::ranges::nth_element`

3a03983

add benchmark

c03a11e

muellerj2 requested a review from a team as a code owner

November 19, 2024 12:43

fix benchmark compilation on x86

6e7c382

CaseyCarter added the `bug` (Something isn't working) label

Nov 19, 2024

StephanTLavavej added the `performance` (Must go faster) label

Nov 19, 2024

StephanTLavavej self-assigned this

Nov 19, 2024

tunkey -> tukey

11026a3

case

53c51c7

StephanTLavavej assigned davidmrdavid

Jan 29, 2025

davidmrdavid reviewed

Feb 25, 2025

davidmrdavid left a comment

Thanks a lot for this PR ⭐.
Left some comments, mostly nits that I think would improve readability, and some questions.

davidmrdavid mentioned this pull request

Feb 25, 2025

Maintainer priorities #4700

Open

StephanTLavavej unassigned davidmrdavid

Feb 26, 2025

muellerj2 added 3 commits

March 2, 2025 12:36

Merge branch 'main' into nth_element-worstcase-linear

8a96664

address some review comments

757e5f9

extend one more comment

5babd6c

muellerj2 commented Mar 9, 2025 (edited)

New benchmarks with new adversary2 and minor changes to benchmark code (AMD Ryzen 7 7840HS):

Before:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| `bm_uniform<alg_type::std_fn>/1024` | 1550 ns | 1500 ns | 448000 |
| `bm_uniform<alg_type::std_fn>/2048` | 3413 ns | 3376 ns | 203636 |
| `bm_uniform<alg_type::std_fn>/4096` | 6583 ns | 6627 ns | 89600 |
| `bm_uniform<alg_type::std_fn>/8192` | 15467 ns | 15346 ns | 44800 |
| `bm_uniform<alg_type::rng>/1024` | 1552 ns | 1538 ns | 497778 |
| `bm_uniform<alg_type::rng>/2048` | 3494 ns | 3299 ns | 203636 |
| `bm_uniform<alg_type::rng>/4096` | 6502 ns | 6557 ns | 112000 |
| `bm_uniform<alg_type::rng>/8192` | 15444 ns | 15067 ns | 49778 |
| `bm_tukey_adversary<alg_type::std_fn>/adversary1` | 11353 ns | 11440 ns | 56000 |
| `bm_tukey_adversary<alg_type::rng>/adversary1` | 11505 ns | 11475 ns | 64000 |
| `bm_tukey_adversary<alg_type::std_fn>/adversary2` | 59287 ns | 58594 ns | 11200 |
| `bm_tukey_adversary<alg_type::rng>/adversary2` | 55013 ns | 54688 ns | 10000 |

After:

| Benchmark | Time | CPU | Iterations |
| --- | ---: | ---: | ---: |
| `bm_uniform<alg_type::std_fn>/1024` | 1552 ns | 1538 ns | 497778 |
| `bm_uniform<alg_type::std_fn>/2048` | 3412 ns | 3376 ns | 203636 |
| `bm_uniform<alg_type::std_fn>/4096` | 6622 ns | 6417 ns | 112000 |
| `bm_uniform<alg_type::std_fn>/8192` | 15345 ns | 15346 ns | 44800 |
| `bm_uniform<alg_type::rng>/1024` | 1458 ns | 1444 ns | 497778 |
| `bm_uniform<alg_type::rng>/2048` | 3296 ns | 3296 ns | 213333 |
| `bm_uniform<alg_type::rng>/4096` | 6513 ns | 6557 ns | 112000 |
| `bm_uniform<alg_type::rng>/8192` | 14922 ns | 14753 ns | 49778 |
| `bm_tukey_adversary<alg_type::std_fn>/adversary1` | 6043 ns | 5999 ns | 112000 |
| `bm_tukey_adversary<alg_type::rng>/adversary1` | 5324 ns | 5441 ns | 112000 |
| `bm_tukey_adversary<alg_type::std_fn>/adversary2` | 4069 ns | 4098 ns | 179200 |
| `bm_tukey_adversary<alg_type::rng>/adversary2` | 3649 ns | 3683 ns | 203636 |

Merge branch 'main' into nth_element-worstcase-linear

e346593

StephanTLavavej added 7 commits

April 23, 2025 02:42

Merge branch 'main' into nth_element-worstcase-linear

0b432d4

Enable clang-format and remove trailing commas for The Adversaries.

1b857f9

Move using-directives right after header inclusions.

fdc49d6

`benchmark_common()` can take `const vector<int>& src`.

4b71fc6

Pass arrays by const reference, avoid temporary vectors.

3ba677c

`bm_tukey_adversary` is now redundant with `benchmark_common`.

8c8a87c

Adjust newlines.

1c74cab

StephanTLavavej added 11 commits

April 23, 2025 04:19

Drop unnecessary `static_cast<vector<int>::difference_type>` around integer literals.

1c1ddcd

Extract `src_ssize`.

bfc1bf3

Use 4-arg `is_permutation()`.

f36b303

We don't need `<ranges>`; `<algorithm>` provides `ranges::nth_element` and `ranges::generate`.

0340aef

Extract `computed.begin() + nth` as `mid`.

124292a

Drop unnecessary `reserve()` call.

2312ec8

Adjust comments.

5b30c5c

Add const.

7fea073

`(_Last - _First)` => `_Length` (still valid here)

e37ab68

Add "intentional ADL" comment.

e439320

Add "by pivot _Pfirst" to comments.

65b4f56

Also restore a comment to `_Partition_by_median_guess_unchecked`.

StephanTLavavej reviewed

Apr 23, 2025

StephanTLavavej approved these changes

Apr 23, 2025

StephanTLavavej commented Apr 23, 2025

Thanks! 💚 And apologies for the significant delay in getting around to reviewing this!

I pushed minor cleanups to the product code and moderate refactorings to the test/benchmark code. Please meow if I messed anything up.

I reviewed @davidmrdavid's comments and I believe we've addressed them all.

5950X benchmark results:

| Benchmark | Before | After | Speedup |
| --- | ---: | ---: | ---: |
| `bm_uniform<alg_type::std_fn>/1024` | 1877 ns | 1801 ns | 1.04 |
| `bm_uniform<alg_type::std_fn>/2048` | 4190 ns | 3968 ns | 1.06 |
| `bm_uniform<alg_type::std_fn>/4096` | 8131 ns | 7701 ns | 1.06 |
| `bm_uniform<alg_type::std_fn>/8192` | 20427 ns | 19078 ns | 1.07 |
| `bm_uniform<alg_type::rng>/1024` | 1951 ns | 1736 ns | 1.12 |
| `bm_uniform<alg_type::rng>/2048` | 4197 ns | 3855 ns | 1.09 |
| `bm_uniform<alg_type::rng>/4096` | 8120 ns | 7487 ns | 1.08 |
| `bm_uniform<alg_type::rng>/8192` | 25961 ns | 20594 ns | 1.26 |
| `benchmark_common<alg_type::std_fn>/adversary1` | 14572 ns | 6835 ns | 2.13 |
| `benchmark_common<alg_type::rng>/adversary1` | 15813 ns | 6323 ns | 2.50 |
| `benchmark_common<alg_type::std_fn>/adversary2` | 69962 ns | 4226 ns | 16.56 |
| `benchmark_common<alg_type::rng>/adversary2` | 69475 ns | 4430 ns | 15.68 |