bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 37.4 ns | 38.8 ns | 0.96 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 65.9 ns | 47.9 ns | 1.38 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 72.1 ns | 51.0 ns | 1.41 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.4 ns | 51.3 ns | 1.55 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 105 ns | 51.3 ns | 2.05 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 51.1 ns | 2.50 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 199 ns | 51.9 ns | 3.83 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 246 ns | 50.9 ns | 4.83 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 323 ns | 53.8 ns | 6.00 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 484 ns | 52.5 ns | 9.22 |
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 23.0 ns | 19.7 ns | 1.17 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.6 ns | 38.9 ns | 0.94 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 60.1 ns | 47.7 ns | 1.26 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 66.1 ns | 51.2 ns | 1.29 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 74.1 ns | 51.4 ns | 1.44 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 104 ns | 51.4 ns | 2.02 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 127 ns | 51.7 ns | 2.46 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 197 ns | 52.1 ns | 3.78 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 244 ns | 51.1 ns | 4.77 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 331 ns | 59.5 ns | 5.56 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 479 ns | 54.4 ns | 8.81 |
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 23.2 ns | 19.6 ns | 1.18 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 992 ns | 569 ns | 1.74 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 925 ns | 608 ns | 1.52 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 924 ns | 98.4 ns | 9.39 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 952 ns | 102 ns | 9.33 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1042 ns | 101 ns | 10.32 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1108 ns | 102 ns | 10.86 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1187 ns | 102 ns | 11.64 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1241 ns | 105 ns | 11.82 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1179 ns | 103 ns | 11.45 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1125 ns | 104 ns | 10.82 |
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 42.6 ns | 38.5 ns | 1.11 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 364 ns | 566 ns | 0.64 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 451 ns | 612 ns | 0.74 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 478 ns | 98.0 ns | 4.88 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 490 ns | 100 ns | 4.90 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 559 ns | 100 ns | 5.59 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 641 ns | 102 ns | 6.28 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 803 ns | 102 ns | 7.87 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 900 ns | 105 ns | 8.57 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 977 ns | 104 ns | 9.39 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1176 ns | 104 ns | 11.31 |
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 46.1 ns | 41.3 ns | 1.12 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 37.3 ns | 46.1 ns | 0.81 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.4 ns | 78.5 ns | 0.85 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 72.6 ns | 86.4 ns | 0.84 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.7 ns | 97.8 ns | 0.81 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 105 ns | 136 ns | 0.77 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 93.0 ns | 1.38 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 198 ns | 93.0 ns | 2.13 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 245 ns | 94.3 ns | 2.60 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 324 ns | 96.9 ns | 3.34 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 479 ns | 95.8 ns | 5.00 |
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 46.8 ns | 48.3 ns | 0.97 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.7 ns | 46.0 ns | 0.80 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 63.4 ns | 78.5 ns | 0.81 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.8 ns | 86.4 ns | 0.81 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 79.0 ns | 97.7 ns | 0.81 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 118 ns | 136 ns | 0.87 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 144 ns | 93.3 ns | 1.54 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 224 ns | 93.0 ns | 2.41 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 266 ns | 94.5 ns | 2.81 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 353 ns | 97.0 ns | 3.64 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 523 ns | 95.8 ns | 5.46 |
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 47.2 ns | 48.7 ns | 0.97 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1069 ns | 405 ns | 2.64 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 959 ns | 525 ns | 1.83 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 1002 ns | 573 ns | 1.75 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1036 ns | 594 ns | 1.74 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1117 ns | 721 ns | 1.55 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1221 ns | 172 ns | 7.10 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1368 ns | 172 ns | 7.95 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1377 ns | 172 ns | 8.01 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1419 ns | 176 ns | 8.06 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1428 ns | 176 ns | 8.11 |
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 81.8 ns | 85.2 ns | 0.96 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 556 ns | 414 ns | 1.34 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 612 ns | 528 ns | 1.16 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 647 ns | 573 ns | 1.13 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 662 ns | 598 ns | 1.11 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 723 ns | 729 ns | 0.99 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 810 ns | 173 ns | 4.68 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 918 ns | 172 ns | 5.34 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1005 ns | 172 ns | 5.84 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1048 ns | 176 ns | 5.95 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1264 ns | 177 ns | 7.14 |
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 82.3 ns | 85.4 ns | 0.96 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 40.5 ns | 30.5 ns | 1.33 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.1 ns | 47.7 ns | 1.39 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 71.4 ns | 52.5 ns | 1.36 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 80.4 ns | 58.7 ns | 1.37 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 104 ns | 84.2 ns | 1.24 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 103 ns | 1.24 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 197 ns | 157 ns | 1.25 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 246 ns | 151 ns | 1.63 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 324 ns | 152 ns | 2.13 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 481 ns | 152 ns | 3.16 |
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 84.7 ns | 87.7 ns | 0.97 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.5 ns | 30.5 ns | 1.20 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 62.3 ns | 47.8 ns | 1.30 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.0 ns | 52.5 ns | 1.31 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 77.9 ns | 58.7 ns | 1.33 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 118 ns | 84.8 ns | 1.39 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 146 ns | 103 ns | 1.42 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 224 ns | 157 ns | 1.43 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 280 ns | 152 ns | 1.84 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 358 ns | 152 ns | 2.36 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 535 ns | 153 ns | 3.50 |
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 84.8 ns | 87.7 ns | 0.97 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1013 ns | 385 ns | 2.63 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 960 ns | 441 ns | 2.18 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 994 ns | 467 ns | 2.13 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1010 ns | 481 ns | 2.10 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1102 ns | 545 ns | 2.02 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1186 ns | 633 ns | 1.87 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1304 ns | 792 ns | 1.65 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1381 ns | 289 ns | 4.78 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1411 ns | 288 ns | 4.90 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1420 ns | 287 ns | 4.95 |
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 157 ns | 164 ns | 0.96 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 389 ns | 387 ns | 1.01 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 489 ns | 441 ns | 1.11 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 530 ns | 469 ns | 1.13 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 556 ns | 483 ns | 1.15 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 652 ns | 547 ns | 1.19 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 756 ns | 636 ns | 1.19 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 957 ns | 795 ns | 1.20 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1086 ns | 288 ns | 3.77 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1227 ns | 287 ns | 4.28 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1351 ns | 287 ns | 4.71 |
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 157 ns | 164 ns | 0.96 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/40 | 40.4 ns | 40.4 ns | 1.00 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/18 | 66.0 ns | 55.5 ns | 1.19 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/16 | 71.2 ns | 60.2 ns | 1.18 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/14 | 79.9 ns | 67.4 ns | 1.19 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/10 | 104 ns | 95.6 ns | 1.09 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/8 | 128 ns | 116 ns | 1.10 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/5 | 197 ns | 177 ns | 1.11 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/4 | 243 ns | 219 ns | 1.11 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/3 | 322 ns | 288 ns | 1.12 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/2 | 479 ns | 241 ns | 1.99 |
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/1 | 160 ns | 167 ns | 0.96 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/40 | 36.3 ns | 41.0 ns | 0.89 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/18 | 62.7 ns | 56.4 ns | 1.11 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/16 | 69.7 ns | 60.8 ns | 1.15 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/14 | 77.9 ns | 67.6 ns | 1.15 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/10 | 121 ns | 96.2 ns | 1.26 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/8 | 149 ns | 117 ns | 1.27 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/5 | 230 ns | 178 ns | 1.29 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/4 | 286 ns | 219 ns | 1.31 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/3 | 375 ns | 288 ns | 1.30 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/2 | 561 ns | 239 ns | 2.35 |
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/1 | 160 ns | 167 ns | 0.96 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 | 1559 ns | 565 ns | 2.76 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 | 1426 ns | 609 ns | 2.34 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 | 1435 ns | 643 ns | 2.23 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 | 1427 ns | 654 ns | 2.18 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 | 1437 ns | 714 ns | 2.01 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 | 1497 ns | 812 ns | 1.84 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 | 1547 ns | 919 ns | 1.68 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 | 1522 ns | 1013 ns | 1.50 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 | 1455 ns | 1044 ns | 1.39 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 | 1257 ns | 461 ns | 2.73 |
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 | 307 ns | 322 ns | 0.95 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 | 386 ns | 565 ns | 0.68 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 | 480 ns | 607 ns | 0.79 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 | 517 ns | 642 ns | 0.81 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 | 549 ns | 657 ns | 0.84 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 | 644 ns | 716 ns | 0.90 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 | 742 ns | 813 ns | 0.91 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 | 944 ns | 925 ns | 1.02 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 | 1075 ns | 1017 ns | 1.06 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 | 1192 ns | 1045 ns | 1.14 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 | 1476 ns | 460 ns | 3.21 |
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 | 308 ns | 321 ns | 0.96 |
Uh oh!
There was an error while loading.Please reload this page.
⚙️ The optimization
Like I mentioned in#5346, both
std::search_nandranges::search_nmake steps byn elements, and avoid going back for a good input (where there are few potential matches), so for largen values vectorization wouldn't be an improvement.Still for smalln, such that vector register width is larger thann, and therefore, the vector step is bigger, it is possible to vectorize in a way that would be faster even for an input with few matches. For more matches, such vectorization will have more advantage, as it would not need to go back.
The approach is to compare elements, get bit mask, and look for contiguous set of ones of proper length.@Alcaro suggested:
Turns out this is efficient enough for AVX2 with the values ofn up twice smaller than AVX register size in elements. Despite there seems to be indeed high cost of ruined parallelism, I cannot find anything faster.
The shift values are computed based onn. To save one variable (general purpose register), we rely onn=1 to be handled separately, and assume at least one shift to happen.
To deal with matches on vector register boundary, the bitmask is concatenated with the previous one. AVX bitmask is 32 bits for 32 bytes of AVX value, doubled it is 64 bit, still fits x64 register perfectly. The alternative to concatenation could be handling the boundary case with
lzcnt/tzcnt, this turned out to be not faster.The fallback is used for tails and too largen values. For tails it uses
lzcntwith inverted carry value to have smooth transition from potential partial match in vectors to the scalar part. The fallback recreatesranges::search_nin<algorithm>, with slight variation.🥔 Down-level architectures support
SSE4.2 version is implementable in both senses of backporting the current approach to SSE and using
pcmpestri. I'd expect either to be of advantage forn values twice smaller than SSE register. Just feel like should not bother trying that.x86 version works the same way as x64. However, unlike many other vectorization algorithms, this one relies a lot on general-purpose 64 bit integer ops. To mitigate the impact
__ull_rshiftis used instead of the plain shift. This intrinsic usage doesn't impact 64-bit code, but makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are ofinttype to match the intrinsic parameter type.Still, the efficiency on x86 is questionable (see benchmark results below). Apart from having shifts in multiple instructions, it is apparently due to general purpose registers deficit. The compiler isn't being helpful here too, some register spills look superfluous.
For 32-bit and 64-bit elements, it is possible to use the floating point bit mask, instead of integer bit mask, like in#4987/#5092. This will save bit width. But apart from the mysterious "bypass delay" (integers and floats instructions mix potential penalty), it will also make the bit magic more complicated, more dependent on element width, and still won't reduce the bit width for 8-bit and 16-bit elements, so this doesn't seem to be worth doing.
We could just skip x86. But we don't have precedent of having vectorization for x64, but not having it for x86, so I didn't want to introduce one.
1️⃣ Specialn=1 case
We need to handle this case as just
findvectorization.findvectorization is more efficient than this one, plus the assumption that the shift happens at least once saves a variable/register.The question is where we should handle this:
The latter two are indistinguishable in practice, so the real question is, if we should:
findforsearch_nwhen n=1 #5346 optimizationWith removaln=1 case from headers we get:
With keepingn=1 case in headers we get:
findpattern)memchrfor corresponding type and disabled vectorization mode✅ Test coverage
To cover the variety of possibilities, the randomized test should try different input lengths, differentn, and different actual matches lengths (including too long matches, too short matches, and different gap between matches). This has to have long run time, so it deserves a dedicated test.
The test coverage is not only useful for vectorization, it also compensates missing non-vectorization coverage, asked in#933.
This PR still doesn't fully address#933 as it is asked because:
I'm not sure how much these features are required, though. If they are required, further work to complete#933 would certainly need a different PR.
🏁 Benchmarks
In addition to the
TwoZonescase inherited from ##5346 , it hasDenseSmallSequences.These two are close to normal case and worst case respectively.
TwoZones(Zones in the table below) has half of range with mismatch character and half of rangers with match character. So the search should quickly proceed to the match part then check the first match which is successful.DenseSmallSequences(Dense in the table below) has too short matches of random with from 0 ton-1 interrupted by a single mismatch character.The vectorization improvement is more for
DenseSmallSequences, but we should probably care aboutTwoZonessomewhat more. If worst case is a priority, we can lift threshold for the vectorization twice.⏱️ Benchmark results
Click to expand:
🥈 Results interpretation
For x64 and for the vectorized n there is a certain improvement for Zones. For Dense the improvement is even greater.
The non-vectorized cases vary a lot, The fallback happen to be faster than header implementation often, but not always. Out of the header implementations, surprisingly, the ranges one is slower for Zones case.
The x86 results are not very good, but not too bad either.
The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.