On an x64 processor (Ice Lake) with GCC 12, I find that this PR is maybe slightly worse on the bench_ondemand results, but the difference is so small that I would not call it conclusive. More likely, the difference is too small for me to measure accurately. I also tried on macOS, and the results were similarly inconclusive.

Your results appear to indicate better performance for this PR in the DOM tests, but your clock speed is not the same. You do report fewer instructions, which is a good thing.

Let me run the DOM tests on this system (x64/Ice Lake, GCC 12).

This PR (repeated 3 times)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.8499 ns per block ( 87.92%) - 0.2945 ns per byte - 3.3660 ns per structural - 3.3950 GB/s
| Cycles : 58.3621 per block ( 98.17%) - 0.9120 per byte - 10.4216 per structural - 3.096 GHz est. frequency
| Instructions : 206.4364 per block (100.00%) - 3.2258 per byte - 36.8628 per structural - 3.537 per cycle
| Misses : 484 branch misses ( 97.91%) - 1 cache misses ( 50.53%) - 26604.00 cache references
|- Stage 1
| Speed : 6.4073 ns per block ( 29.88%) - 0.1001 ns per byte - 1.1441 ns per structural - 9.9881 GB/s
| Cycles : 19.8528 per block ( 33.39%) - 0.3102 per byte - 3.5451 per structural - 3.098 GHz est. frequency
| Instructions : 69.8881 per block ( 33.85%) - 1.0921 per byte - 12.4798 per structural - 3.520 per cycle
| Misses : 93 branch misses ( 18.81%) - 0 cache misses ( 0.00%) - 11215.00 cache references
|- Stage 2
| Speed : 12.4110 ns per block ( 57.89%) - 0.1939 ns per byte - 2.2162 ns per structural - 5.1564 GB/s
| Cycles : 38.4054 per block ( 64.60%) - 0.6001 per byte - 6.8579 per structural - 3.094 GHz est. frequency
| Instructions : 136.5482 per block ( 66.15%) - 2.1337 per byte - 24.3831 per structural - 3.555 per cycle
| Misses : 385 branch misses ( 77.88%) - 1 cache misses ( 50.53%) - 15451.00 cache references
5376.0 documents parsed per second (best)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.7682 ns per block ( 88.68%) - 0.2933 ns per byte - 3.3514 ns per structural - 3.4098 GB/s
| Cycles : 58.1223 per block ( 98.25%) - 0.9082 per byte - 10.3788 per structural - 3.097 GHz est. frequency
| Instructions : 206.4364 per block (100.00%) - 3.2258 per byte - 36.8628 per structural - 3.552 per cycle
| Misses : 468 branch misses ( 95.76%) - 0 cache misses ( 0.00%) - 26693.00 cache references
|- Stage 1
| Speed : 6.4103 ns per block ( 30.29%) - 0.1002 ns per byte - 1.1447 ns per structural - 9.9833 GB/s
| Cycles : 19.8643 per block ( 33.58%) - 0.3104 per byte - 3.5471 per structural - 3.099 GHz est. frequency
| Instructions : 69.8881 per block ( 33.85%) - 1.0921 per byte - 12.4798 per structural - 3.518 per cycle
| Misses : 96 branch misses ( 19.64%) - 0 cache misses ( 0.00%) - 11551.00 cache references
|- Stage 2
| Speed : 12.3142 ns per block ( 58.19%) - 0.1924 ns per byte - 2.1989 ns per structural - 5.1969 GB/s
| Cycles : 38.1207 per block ( 64.44%) - 0.5957 per byte - 6.8071 per structural - 3.096 GHz est. frequency
| Instructions : 136.5482 per block ( 66.15%) - 2.1337 per byte - 24.3831 per structural - 3.582 per cycle
| Misses : 374 branch misses ( 76.53%) - 0 cache misses ( 0.00%) - 15159.00 cache references
5399.4 documents parsed per second (best)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.7989 ns per block ( 87.33%) - 0.2938 ns per byte - 3.3569 ns per structural - 3.4042 GB/s
| Cycles : 58.2127 per block ( 98.10%) - 0.9096 per byte - 10.3949 per structural - 3.097 GHz est. frequency
| Instructions : 206.4364 per block (100.00%) - 3.2258 per byte - 36.8628 per structural - 3.546 per cycle
| Misses : 468 branch misses ( 94.60%) - 0 cache misses ( 0.00%) - 27888.00 cache references
|- Stage 1
| Speed : 6.4219 ns per block ( 29.83%) - 0.1003 ns per byte - 1.1467 ns per structural - 9.9654 GB/s
| Cycles : 19.8962 per block ( 33.53%) - 0.3109 per byte - 3.5528 per structural - 3.098 GHz est. frequency
| Instructions : 69.8881 per block ( 33.85%) - 1.0921 per byte - 12.4798 per structural - 3.513 per cycle
| Misses : 98 branch misses ( 19.81%) - 0 cache misses ( 0.00%) - 11224.00 cache references
|- Stage 2
| Speed : 12.3586 ns per block ( 57.41%) - 0.1931 ns per byte - 2.2069 ns per structural - 5.1783 GB/s
| Cycles : 38.2584 per block ( 64.47%) - 0.5978 per byte - 6.8317 per structural - 3.096 GHz est. frequency
| Instructions : 136.5482 per block ( 66.15%) - 2.1337 per byte - 24.3831 per structural - 3.569 per cycle
| Misses : 371 branch misses ( 74.99%) - 0 cache misses ( 0.00%) - 17065.00 cache references
5390.6 documents parsed per second (best)
Main branch (repeated 3 times)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.6728 ns per block ( 89.77%) - 0.2918 ns per byte - 3.3344 ns per structural - 3.4272 GB/s
| Cycles : 57.8292 per block ( 98.45%) - 0.9036 per byte - 10.3264 per structural - 3.097 GHz est. frequency
| Instructions : 207.5327 per block (100.00%) - 3.2429 per byte - 37.0586 per structural - 3.589 per cycle
| Misses : 462 branch misses ( 97.47%) - 8 cache misses ( 70.45%) - 26667.00 cache references
|- Stage 1
| Speed : 6.3749 ns per block ( 30.65%) - 0.0996 ns per byte - 1.1384 ns per structural - 10.0387 GB/s
| Cycles : 19.7597 per block ( 33.64%) - 0.3088 per byte - 3.5284 per structural - 3.100 GHz est. frequency
| Instructions : 69.8881 per block ( 33.68%) - 1.0921 per byte - 12.4798 per structural - 3.537 per cycle
| Misses : 84 branch misses ( 17.72%) - 5 cache misses ( 44.03%) - 11294.00 cache references
|- Stage 2
| Speed : 12.2561 ns per block ( 58.92%) - 0.1915 ns per byte - 2.1885 ns per structural - 5.2216 GB/s
| Cycles : 37.9380 per block ( 64.59%) - 0.5928 per byte - 6.7745 per structural - 3.095 GHz est. frequency
| Instructions : 137.6446 per block ( 66.32%) - 2.1508 per byte - 24.5789 per structural - 3.628 per cycle
| Misses : 368 branch misses ( 77.64%) - 4 cache misses ( 35.22%) - 15199.00 cache references
5427.0 documents parsed per second (best)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.9860 ns per block ( 87.36%) - 0.2967 ns per byte - 3.3903 ns per structural - 3.3707 GB/s
| Cycles : 58.7981 per block ( 98.28%) - 0.9188 per byte - 10.4994 per structural - 3.097 GHz est. frequency
| Instructions : 207.5327 per block (100.00%) - 3.2429 per byte - 37.0586 per structural - 3.530 per cycle
| Misses : 459 branch misses ( 96.12%) - 4 cache misses ( 17.35%) - 27450.00 cache references
|- Stage 1
| Speed : 6.4357 ns per block ( 29.61%) - 0.1006 ns per byte - 1.1492 ns per structural - 9.9440 GB/s
| Cycles : 19.9421 per block ( 33.33%) - 0.3116 per byte - 3.5610 per structural - 3.099 GHz est. frequency
| Instructions : 69.8881 per block ( 33.68%) - 1.0921 per byte - 12.4798 per structural - 3.505 per cycle
| Misses : 93 branch misses ( 19.48%) - 0 cache misses ( 0.00%) - 11112.00 cache references
|- Stage 2
| Speed : 12.5173 ns per block ( 57.60%) - 0.1956 ns per byte - 2.2352 ns per structural - 5.1126 GB/s
| Cycles : 38.7468 per block ( 64.77%) - 0.6055 per byte - 6.9189 per structural - 3.095 GHz est. frequency
| Instructions : 137.6446 per block ( 66.32%) - 2.1508 per byte - 24.5789 per structural - 3.552 per cycle
| Misses : 361 branch misses ( 75.60%) - 4 cache misses ( 17.35%) - 16219.00 cache references
5337.5 documents parsed per second (best)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json -n 4000
number of iterations 4000
jsonexamples/twitter.json
=========================
9867 blocks - 631515 bytes - 55262 structurals ( 8.8 %)
special blocks with: utf8 2284 ( 23.1 %) - escape 598 ( 6.1 %) - 0 structurals 1287 ( 13.0 %) - 1+ structurals 8581 ( 87.0 %) - 8+ structurals 3272 ( 33.2 %) - 16+ structurals 0 ( 0.0 %)
special block flips: utf8 1104 ( 11.2 %) - escape 642 ( 6.5 %) - 0 structurals 940 ( 9.5 %) - 1+ structurals 940 ( 9.5 %) - 8+ structurals 2593 ( 26.3 %) - 16+ structurals 0 ( 0.0 %)
All Stages (excluding allocation)
| Speed : 18.6361 ns per block ( 88.83%) - 0.2912 ns per byte - 3.3278 ns per structural - 3.4340 GB/s
| Cycles : 57.7138 per block ( 98.46%) - 0.9018 per byte - 10.3058 per structural - 3.097 GHz est. frequency
| Instructions : 207.5327 per block (100.00%) - 3.2429 per byte - 37.0586 per structural - 3.596 per cycle
| Misses : 475 branch misses ( 95.89%) - 0 cache misses ( 0.00%) - 26636.00 cache references
|- Stage 1
| Speed : 6.3687 ns per block ( 30.36%) - 0.0995 ns per byte - 1.1372 ns per structural - 10.0486 GB/s
| Cycles : 19.7399 per block ( 33.68%) - 0.3085 per byte - 3.5249 per structural - 3.100 GHz est. frequency
| Instructions : 69.8881 per block ( 33.68%) - 1.0921 per byte - 12.4798 per structural - 3.540 per cycle
| Misses : 99 branch misses ( 19.99%) - 0 cache misses ( 0.00%) - 11500.00 cache references
|- Stage 2
| Speed : 12.2440 ns per block ( 58.36%) - 0.1913 ns per byte - 2.1864 ns per structural - 5.2267 GB/s
| Cycles : 37.9011 per block ( 64.66%) - 0.5922 per byte - 6.7679 per structural - 3.095 GHz est. frequency
| Instructions : 137.6446 per block ( 66.32%) - 2.1508 per byte - 24.5789 per structural - 3.632 per cycle
| Misses : 368 branch misses ( 74.29%) - 0 cache misses ( 0.00%) - 15271.00 cache references
5437.7 documents parsed per second (best)
Analysis (DOM)

Looking just at stage 2 (DOM), the main branch speed is in the interval 5.11 GB/s to 5.23 GB/s, and the PR is in the interval 5.16 GB/s to 5.20 GB/s, using three trials each. So this PR is anywhere between 2% better and 1% worse. When in doubt, I tend to go with the instruction count, and the PR seems to reduce it by 0.8% in stage 2 (136.5 vs. 137.6 instructions per block). So I tend to believe that this PR is a win, although a small one.

kostya ondemand (this PR)

kostya<simdjson_ondemand>/manual_time 50436659 ns 61317425 ns 14 best_branch_miss=320.223k best_bytes_per_sec=2.73101G best_cache_miss=1.47762M best_cache_ref=6.24866M best_cycles=158.335M best_cycles_per_byte=1.15316 best_docs_per_sec=19.8901 best_frequency=3.1493G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.24766 best_items_per_sec=10.4281M branch_miss=321.055k bytes=137.305M bytes_per_second=2.53536G/s cache_miss=1.48088M cache_ref=6.24934M cycles=158.806M cycles_per_byte=1.15659 docs_per_sec=19.8268/s frequency=3.14863G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.23803 items=524.288k items_per_second=10.395M/s
[BEST: throughput= 2.73 GB/s doc_throughput= 19 docs/s instructions= 514219519 cycles= 158335209 branch_miss= 320223 cache_miss= 1477619 cache_ref= 6248662 items= 524288 avg_time= 50436659 ns]

kostya<simdjson_ondemand>/manual_time 50520943 ns 61601632 ns 14 best_branch_miss=317k best_bytes_per_sec=2.72754G best_cache_miss=1.50372M best_cache_ref=6.33852M best_cycles=158.467M best_cycles_per_byte=1.15412 best_docs_per_sec=19.8649 best_frequency=3.14792G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.24496 best_items_per_sec=10.4149M branch_miss=318.11k bytes=137.305M bytes_per_second=2.53113G/s cache_miss=1.50425M cache_ref=6.3387M cycles=159.002M cycles_per_byte=1.15802 docs_per_sec=19.7938/s frequency=3.14725G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.23404 items=524.288k items_per_second=10.3776M/s
[BEST: throughput= 2.73 GB/s doc_throughput= 19 docs/s instructions= 514219518 cycles= 158466991 branch_miss= 317000 cache_miss= 1503725 cache_ref= 6338515 items= 524288 avg_time= 50520942 ns]

kostya<simdjson_ondemand>/manual_time 50732221 ns 61651891 ns 14 best_branch_miss=317.674k best_bytes_per_sec=2.71335G best_cache_miss=1.48926M best_cache_ref=6.25722M best_cycles=159.359M best_cycles_per_byte=1.16062 best_docs_per_sec=19.7615 best_frequency=3.14916G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.22681 best_items_per_sec=10.3607M branch_miss=317.764k bytes=137.305M bytes_per_second=2.52059G/s cache_miss=1.48845M cache_ref=6.25745M cycles=159.726M cycles_per_byte=1.16329 docs_per_sec=19.7113/s frequency=3.14841G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.21938 items=524.288k items_per_second=10.3344M/s
[BEST: throughput= 2.71 GB/s doc_throughput= 19 docs/s instructions= 514219519 cycles= 159358691 branch_miss= 317674 cache_miss= 1489255 cache_ref= 6257220 items= 524288 avg_time= 50732221 ns]
kostya ondemand (main branch)

kostya<simdjson_ondemand>/manual_time 50533850 ns 61401809 ns 14 best_branch_miss=315.181k best_bytes_per_sec=2.72971G best_cache_miss=1.47801M best_cache_ref=6.25266M best_cycles=158.413M best_cycles_per_byte=1.15373 best_docs_per_sec=19.8806 best_frequency=3.14935G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.2229 best_items_per_sec=10.4232M branch_miss=314.899k bytes=137.305M bytes_per_second=2.53049G/s cache_miss=1.47842M cache_ref=6.25278M cycles=159.101M cycles_per_byte=1.15874 docs_per_sec=19.7887/s frequency=3.14841G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.20896 items=524.288k items_per_second=10.375M/s
[BEST: throughput= 2.73 GB/s doc_throughput= 19 docs/s instructions= 510549503 cycles= 158413320 branch_miss= 315181 cache_miss= 1478014 cache_ref= 6252663 items= 524288 avg_time= 50533850 ns]

kostya<simdjson_ondemand>/manual_time 49928598 ns 60663634 ns 14 best_branch_miss=313.717k best_bytes_per_sec=2.76004G best_cache_miss=1.47957M best_cache_ref=6.21029M best_cycles=156.674M best_cycles_per_byte=1.14107 best_docs_per_sec=20.1015 best_frequency=3.14939G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.25867 best_items_per_sec=10.539M branch_miss=314.029k bytes=137.305M bytes_per_second=2.56116G/s cache_miss=1.48521M cache_ref=6.21166M cycles=157.202M cycles_per_byte=1.14491 docs_per_sec=20.0286/s frequency=3.14853G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.24773 items=524.288k items_per_second=10.5008M/s
[BEST: throughput= 2.76 GB/s doc_throughput= 20 docs/s instructions= 510549502 cycles= 156674234 branch_miss= 313717 cache_miss= 1479566 cache_ref= 6210292 items= 524288 avg_time= 49928598 ns]

kostya<simdjson_ondemand>/manual_time 50223191 ns 60923453 ns 14 best_branch_miss=313.433k best_bytes_per_sec=2.74307G best_cache_miss=1.48189M best_cache_ref=6.19918M best_cycles=157.645M best_cycles_per_byte=1.14814 best_docs_per_sec=19.978 best_frequency=3.14943G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.23859 best_items_per_sec=10.4742M branch_miss=313.453k bytes=137.305M bytes_per_second=2.54614G/s cache_miss=1.48057M cache_ref=6.20006M cycles=158.128M cycles_per_byte=1.15165 docs_per_sec=19.9111/s frequency=3.1485G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.22871 items=524.288k items_per_second=10.4392M/s
[BEST: throughput= 2.74 GB/s doc_throughput= 19 docs/s instructions= 510549502 cycles= 157645430 branch_miss= 313433 cache_miss= 1481890 cache_ref= 6199176 items= 524288 avg_time= 50223190 ns]
Analysis (ondemand)

Focusing solely on one On Demand benchmark (kostya), this PR seems an overall negative: it executes about 0.7% more instructions (514.22M vs. 510.55M), and the best throughput over three trials is 2.71 GB/s to 2.73 GB/s versus 2.73 GB/s to 2.76 GB/s on the main branch.

Recommendation (tentative)

I think that if this PR focused solely on the DOM code, it would be easier to consider. As things stand, I don't have the data to convince myself that there are gains on the On Demand benchmarks. It is possible that this PR creates a slight performance regression on On Demand. Of course, maybe there is a methodological issue.

Risks: I should stress that results may vary depending on the exact compiler and processor. We cannot rule out that this PR could be a net negative on different systems, even with the DOM parser. So some caution is required.
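The interval reasoning in the DOM analysis above (main at 5.11 to 5.23 GB/s, PR at 5.16 to 5.20 GB/s over three trials) can be sketched as a small helper. This is only an illustration: `Range`, `Change`, `range_of` and `relative_change` are hypothetical names, not part of simdjson or its benchmark tools, and comparing min/max over three trials is a rough heuristic rather than a statistical test.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper types for this sketch (not part of simdjson).
struct Range { double lo; double hi; };
struct Change { double worst_pct; double best_pct; };

// Smallest and largest value observed across repeated trials.
inline Range range_of(const std::vector<double>& trials) {
  assert(!trials.empty());
  auto mm = std::minmax_element(trials.begin(), trials.end());
  return {*mm.first, *mm.second};
}

// Relative change of the candidate versus the baseline, in percent.
// worst_pct pairs the slowest candidate trial with the fastest baseline trial;
// best_pct pairs the fastest candidate trial with the slowest baseline trial.
inline Change relative_change(const Range& baseline, const Range& candidate) {
  return {100.0 * (candidate.lo / baseline.hi - 1.0),
          100.0 * (candidate.hi / baseline.lo - 1.0)};
}
```

Feeding it the stage 2 speeds reported above (main: 5.2216, 5.1126, 5.2267 GB/s; PR: 5.1564, 5.1969, 5.1783 GB/s) gives a worst case a little over 1% slower and a best case a little under 2% faster, which matches the "between 2% better and 1% worse" estimate.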
This PR adds some unlikely compiler hints to improve branch prediction. For some reason, in cases where SIMDJSON_TRY is used, the compiler complained about uninitialized variables, so explicit initialization was added.
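As a rough illustration of the two changes described above, here is a minimal sketch of a branch hint on an error path combined with explicit initialization before an early-return macro. Everything named `EXAMPLE_*`, `example_error`, `parse_digit` and `parse_two_digits` is hypothetical and written for this sketch; it is not the code from this PR, and the macros are stand-ins rather than simdjson's own definitions. It assumes a GCC- or Clang-compatible compiler for `__builtin_expect` and falls back to a plain expression otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in branch hint (assumption: GCC/Clang builtin; no-op elsewhere).
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define EXAMPLE_UNLIKELY(x) (x)
#endif

enum example_error { EXAMPLE_SUCCESS = 0, EXAMPLE_NUMBER_ERROR = 1 };

// Stand-in early-return macro: propagate any error, hinted as the rare path.
#define EXAMPLE_TRY(expr)                                   \
  do {                                                      \
    example_error _err = (expr);                            \
    if (EXAMPLE_UNLIKELY(_err)) { return _err; }            \
  } while (0)

// Parses one ASCII digit; the failure branch is marked unlikely.
inline example_error parse_digit(char c, uint64_t& out) {
  if (EXAMPLE_UNLIKELY(c < '0' || c > '9')) { return EXAMPLE_NUMBER_ERROR; }
  out = uint64_t(c - '0');
  return EXAMPLE_SUCCESS;
}

// Parses exactly two ASCII digits from s.
inline example_error parse_two_digits(const char* s, uint64_t& out) {
  assert(s != nullptr);
  // Explicit initialization: with early returns in between, some compilers
  // may warn that these could be used uninitialized.
  uint64_t hi = 0;
  uint64_t lo = 0;
  EXAMPLE_TRY(parse_digit(s[0], hi));
  EXAMPLE_TRY(parse_digit(s[1], lo));
  out = hi * 10 + lo;
  return EXAMPLE_SUCCESS;
}
```

The hint only tells the compiler which way a branch usually goes so that it can lay out the hot path first; whether that helps in practice is exactly what the benchmarks in this thread are trying to measure.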
The performance results:
master branch: