more extractor #2274

Open
lemire wants to merge 20 commits into master from lemire/extractor

Conversation

@lemire
Member

This is a small variation on PR #2247.

I am observing a very significant performance regression compared to manually provided code. This is with GCC 12 on a recent x64 processor, but I see the same effect with LLVM and on ARM.

partial_tweets<simdjson_ondemand>/manual_time             178942 ns       248711 ns         3701 best_branch_miss=251 best_bytes_per_sec=5.30351G best_cache_miss=0 best_cache_ref=3.92k best_cycles=368.815k best_cycles_per_byte=0.584016 best_docs_per_sec=8.39807k best_frequency=3.09733G best_instructions=1.42694M best_instructions_per_byte=2.25956 best_instructions_per_cycle=3.869 best_items_per_sec=839.807k branch_miss=251.463 bytes=631.515k bytes_per_second=3.28678G/s cache_miss=0.035666 cache_ref=4.00135k cycles=375.234k cycles_per_byte=0.59418 docs_per_sec=5.58839k/s frequency=2.09695G/s instructions=1.42694M instructions_per_byte=2.25956 instructions_per_cycle=3.80281 items=100 items_per_second=558.839k/s [BEST: throughput=  5.30 GB/s doc_throughput=  8398 docs/s instructions=     1426944 cycles=      368815 branch_miss=     251 cache_miss=       0 cache_ref=      3920 items=       100 avg_time=    178942 ns]partial_tweets<simdjson_ondemand_extract>/manual_time     147342 ns       188064 ns         4724 best_branch_miss=271 best_bytes_per_sec=4.32746G best_cache_miss=0 best_cache_ref=5.65k best_cycles=451.717k best_cycles_per_byte=0.715291 best_docs_per_sec=6.85251k best_frequency=3.09539G best_instructions=1.7719M best_instructions_per_byte=2.80579 best_instructions_per_cycle=3.92258 best_items_per_sec=685.251k branch_miss=276.884 bytes=631.515k bytes_per_second=3.9917G/s cache_miss=1.03747 cache_ref=5.7469k cycles=455.218k cycles_per_byte=0.720835 docs_per_sec=6.78694k/s frequency=3.08954G/s instructions=1.7719M instructions_per_byte=2.80579 instructions_per_cycle=3.89241 items=100 items_per_second=678.694k/s [BEST: throughput=  4.33 GB/s doc_throughput=  6852 docs/s instructions=     1771896 cycles=      451717 branch_miss=     271 cache_miss=       0 cache_ref=      5650 items=       100 avg_time=    147341 ns]

the-moisrex and others added 20 commits September 12, 2024 08:37
This is the bare minimum implementation of this idea.
@the-moisrex
Member

:100644 100644 33973a51 00000000 M include/simdjson/generic/ondemand/object-inl.h
diff --git a/include/simdjson/generic/ondemand/object-inl.h b/include/simdjson/generic/ondemand/object-inl.h
index 33973a51..69863b9a 100644
--- a/include/simdjson/generic/ondemand/object-inl.h
+++ b/include/simdjson/generic/ondemand/object-inl.h
@@ -35,7 +35,7 @@ simdjson_inline error_code object::extract(Funcs&&... endpoints) {
 #else
 template <endpoint ...Funcs>
 simdjson_inline error_code object::extract(Funcs&&... endpoints) noexcept((nothrow_endpoint<Funcs> && ...)) {
-  return iter.on_field_raw([&](auto field_key, error_code& error) noexcept((nothrow_endpoint<Funcs> && ...)) {
+  return iter.on_field_raw([&](auto field_key, error_code& error) __attribute__((always_inline)) {
     std::ignore = ((field_key.unsafe_is_equal(endpoints.key()) ? (error = endpoints(value(iter.child()))) == SUCCESS : true) && ...);
     if (error) {
       return true;
@@ -66,7 +66,7 @@ public:
     return m_key;
   }
-  [[nodiscard]] constexpr error_code operator()(simdjson_result<value> val) noexcept(
+  [[nodiscard]] simdjson_inline constexpr error_code operator()(simdjson_result<value> val) noexcept(
       std::is_nothrow_assignable_v<T, simdjson_result<value>>) {
     return val.get<T>(*pointer);
   }
@@ -95,7 +95,7 @@ public:
     return m_key;
   }
-  [[nodiscard]] constexpr error_code operator()(simdjson_result<value> val) noexcept(
+  [[nodiscard]] simdjson_inline constexpr error_code operator()(simdjson_result<value> val) noexcept(
       std::is_nothrow_invocable_v<Func, simdjson_result<value>>) {
     if constexpr (std::is_invocable_r_v<error_code, Func, simdjson_result<value>>) {
       return func(val);
@@ -133,12 +133,12 @@ public:
   constexpr sub &operator=(sub &&) = default;
   constexpr ~sub() = default;
-  [[nodiscard]] constexpr error_code operator()(simdjson_result<value> val) noexcept((nothrow_endpoint<Tos> && ...)) {
+  [[nodiscard]] simdjson_inline constexpr error_code operator()(simdjson_result<value> val) noexcept((nothrow_endpoint<Tos> && ...)) {
     object obj;
     if (auto const err = val.get_object().get(obj); err) {
       return err;
     }
-    return std::apply([&obj]<typename... T>(T &&...app_tos) {
+    return std::apply([&obj]<typename... T>(T &&...app_tos) __attribute__((always_inline)) {
       return obj.extract(std::forward<T>(app_tos)...);
     }, tos);
   }

GCC before change:

ninja; ./benchmark/bench_ondemand --benchmark_filter=\(partial_tweets\|kostya\)\<simdjson_ondemand      lemire/extractor ✭ ✱ ◼[156/156] Linking CXX executable benchmark/bench_ondemand2024-10-09T18:03:15-10:00Running ./benchmark/bench_ondemandRun on (32 X 5500 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x16)  L1 Instruction 32 KiB (x16)  L2 Unified 2048 KiB (x16)  L3 Unified 36864 KiB (x1)Load Average: 6.63, 2.77, 3.14simdjson::dom implementation:      haswellsimdjson::ondemand implementation (stage 1): haswellsimdjson::ondemand implementation (stage 2): fallback----------------------------------------------------------------------------------------------------------------Benchmark                                                      Time             CPU   Iterations UserCounters...----------------------------------------------------------------------------------------------------------------partial_tweets<simdjson_ondemand>/manual_time              84621 ns        98891 ns         8098 best_branch_miss=247 best_bytes_per_sec=7.90273G best_cache_miss=37 best_cache_ref=438 best_cycles=440.163k best_cycles_per_byte=0.696995 best_docs_per_sec=12.5139k best_frequency=5.50817G best_instructions=2.11722M best_instructions_per_byte=3.35261 best_instructions_per_cycle=4.81008 best_items_per_sec=1.25139M branch_miss=255.653 bytes=631.515k bytes_per_second=6.95034G/s cache_miss=8.82946 cache_ref=311.636 cycles=463.062k cycles_per_byte=0.733256 docs_per_sec=11.8174k/s frequency=5.47219G/s instructions=2.11385M instructions_per_byte=3.34727 instructions_per_cycle=4.56494 items=100 items_per_second=1.18174M/s [BEST: throughput=  7.90 GB/s doc_throughput= 12513 docs/s instructions=     2117221 cycles=      440163 branch_miss=     247 cache_miss=      37 cache_ref=       438 items=       100 avg_time=     84620 ns]partial_tweets<simdjson_ondemand_extract>/manual_time      96506 ns       111659 ns         6723 best_branch_miss=379 best_bytes_per_sec=6.79005G best_cache_miss=0 best_cache_ref=752 best_cycles=511.955k best_cycles_per_byte=0.810677 best_docs_per_sec=10.752k best_frequency=5.50454G best_instructions=2.51314M best_instructions_per_byte=3.97954 best_instructions_per_cycle=4.90891 best_items_per_sec=1075.2k branch_miss=414.145 bytes=631.515k bytes_per_second=6.09439G/s cache_miss=4.73598 cache_ref=822.361 cycles=529.507k cycles_per_byte=0.838471 docs_per_sec=10.3621k/s frequency=5.48679G/s instructions=2.51164M instructions_per_byte=3.97717 instructions_per_cycle=4.74336 items=100 items_per_second=1036.21k/s [BEST: throughput=  6.79 GB/s doc_throughput= 10751 docs/s instructions=     2513139 cycles=      511955 branch_miss=     379 cache_miss=       0 cache_ref=       752 items=       100 avg_time=     96505 ns]Creating a source file spanning 134087 KBkostya<simdjson_ondemand>/manual_time                   30070896 ns     37145386 ns           24 best_branch_miss=439.181k best_bytes_per_sec=4.72361G best_cache_miss=4.03406M best_cache_ref=6.45656M best_cycles=159.629M best_cycles_per_byte=1.16259 best_docs_per_sec=34.4024 best_frequency=5.49163G best_instructions=655.611M best_instructions_per_byte=4.77485 best_instructions_per_cycle=4.10708 best_items_per_sec=18.0367M branch_miss=438.196k bytes=137.305M bytes_per_second=4.25246G/s cache_miss=4.07028M cache_ref=6.42101M cycles=163.374M cycles_per_byte=1.18986 docs_per_sec=33.2547/s frequency=5.43297G/s instructions=652.297M instructions_per_byte=4.75072 instructions_per_cycle=3.99265 items=524.288k items_per_second=17.4351M/s [BEST: throughput=  4.72 
GB/s doc_throughput=    34 docs/s instructions=   655611428 cycles=   159629433 branch_miss=  439181 cache_miss= 4034056 cache_ref=   6456561 items=    524288 avg_time=  30070895 ns]kostya<simdjson_ondemand_extract>/manual_time           32457782 ns     39038382 ns           22 best_branch_miss=442.625k best_bytes_per_sec=4.32497G best_cache_miss=3.96603M best_cache_ref=6.45726M best_cycles=174.157M best_cycles_per_byte=1.26839 best_docs_per_sec=31.499 best_frequency=5.48577G best_instructions=718.002M best_instructions_per_byte=5.22925 best_instructions_per_cycle=4.12273 best_items_per_sec=16.5145M branch_miss=443.999k bytes=137.305M bytes_per_second=3.93974G/s cache_miss=4.02753M cache_ref=6.45742M cycles=177.943M cycles_per_byte=1.29597 docs_per_sec=30.8093/s frequency=5.4823G/s instructions=718.002M instructions_per_byte=5.22925 instructions_per_cycle=4.03501 items=524.288k items_per_second=16.1529M/s [BEST: throughput=  4.32 GB/s doc_throughput=    31 docs/s instructions=   718001855 cycles=   174156908 branch_miss=  442625 cache_miss= 3966035 cache_ref=   6457262 items=    524288 avg_time=  32457781 ns]

Clang before change:

ninja; ./benchmark/bench_ondemand --benchmark_filter=\(partial_tweets\|kostya\)\<simdjson_ondemand                                                                                                                                     lemire/extractor ✭ ◼ninja: no work to do.2024-10-09T18:04:21-10:00Running ./benchmark/bench_ondemandRun on (32 X 5502.94 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x16)  L1 Instruction 32 KiB (x16)  L2 Unified 2048 KiB (x16)  L3 Unified 36864 KiB (x1)Load Average: 8.84, 4.10, 3.56simdjson::dom implementation:      haswellsimdjson::ondemand implementation (stage 1): haswellsimdjson::ondemand implementation (stage 2): fallback----------------------------------------------------------------------------------------------------------------Benchmark                                                      Time             CPU   Iterations UserCounters...----------------------------------------------------------------------------------------------------------------partial_tweets<simdjson_ondemand>/manual_time              79715 ns        94275 ns         8696 best_branch_miss=187 best_bytes_per_sec=8.32562G best_cache_miss=0 best_cache_ref=1.2k best_cycles=417.74k best_cycles_per_byte=0.661489 best_docs_per_sec=13.1836k best_frequency=5.5073G best_instructions=2.0093M best_instructions_per_byte=3.18171 best_instructions_per_cycle=4.80992 best_items_per_sec=1.31836M branch_miss=192.302 bytes=631.515k bytes_per_second=7.37808G/s cache_miss=7.43054 cache_ref=1.2669k cycles=437.07k cycles_per_byte=0.692097 docs_per_sec=12.5447k/s frequency=5.4829G/s instructions=2.00829M instructions_per_byte=3.18012 instructions_per_cycle=4.5949 items=100 items_per_second=1.25447M/s [BEST: throughput=  8.33 GB/s doc_throughput= 13183 docs/s instructions=     2009296 cycles=      417740 branch_miss=     187 cache_miss=       0 cache_ref=      1200 items=       100 avg_time=     79715 ns]partial_tweets<simdjson_ondemand_extract>/manual_time      95804 ns       110762 ns         7200 best_branch_miss=422 best_bytes_per_sec=6.87176G best_cache_miss=0 best_cache_ref=419 best_cycles=505.955k best_cycles_per_byte=0.801177 best_docs_per_sec=10.8814k best_frequency=5.5055G best_instructions=2.38369M best_instructions_per_byte=3.77456 best_instructions_per_cycle=4.71127 best_items_per_sec=1088.14k branch_miss=430.691 bytes=631.515k bytes_per_second=6.13903G/s cache_miss=16.2228 cache_ref=522.979 cycles=526.056k cycles_per_byte=0.833006 docs_per_sec=10.438k/s frequency=5.49096G/s instructions=2.38369M instructions_per_byte=3.77456 instructions_per_cycle=4.53125 items=100 items_per_second=1043.8k/s [BEST: throughput=  6.87 GB/s doc_throughput= 10881 docs/s instructions=     2383689 cycles=      505955 branch_miss=     422 cache_miss=       0 cache_ref=       419 items=       100 avg_time=     95804 ns]Creating a source file spanning 134087 KBkostya<simdjson_ondemand>/manual_time                   30800649 ns     38328335 ns           22 best_branch_miss=440.359k best_bytes_per_sec=4.55785G best_cache_miss=4.05904M best_cache_ref=6.46117M best_cycles=165.345M best_cycles_per_byte=1.20422 best_docs_per_sec=33.195 best_frequency=5.48865G best_instructions=667.118M best_instructions_per_byte=4.85866 best_instructions_per_cycle=4.03469 best_items_per_sec=17.4038M branch_miss=440.151k bytes=137.305M bytes_per_second=4.15171G/s cache_miss=4.11497M cache_ref=6.46181M cycles=168.903M cycles_per_byte=1.23013 docs_per_sec=32.4668/s frequency=5.48374G/s instructions=667.118M instructions_per_byte=4.85866 
instructions_per_cycle=3.94971 items=524.288k items_per_second=17.022M/s [BEST: throughput=  4.56 GB/s doc_throughput=    33 docs/s instructions=   667117628 cycles=   165345454 branch_miss=  440359 cache_miss= 4059041 cache_ref=   6461170 items=    524288 avg_time=  30800648 ns]kostya<simdjson_ondemand_extract>/manual_time           58341063 ns     65736654 ns           11 best_branch_miss=432.964k best_bytes_per_sec=2.39187G best_cache_miss=3.27313M best_cache_ref=6.46281M best_cycles=315.163M best_cycles_per_byte=2.29535 best_docs_per_sec=17.4201 best_frequency=5.49017G best_instructions=937.126M best_instructions_per_byte=6.82514 best_instructions_per_cycle=2.97347 best_items_per_sec=9.13315M branch_miss=433.432k bytes=137.305M bytes_per_second=2.19186G/s cache_miss=3.32811M cache_ref=6.46303M cycles=320.041M cycles_per_byte=2.33088 docs_per_sec=17.1406/s frequency=5.48569G/s instructions=937.126M instructions_per_byte=6.82514 instructions_per_cycle=2.92814 items=524.288k items_per_second=8.9866M/s [BEST: throughput=  2.39 GB/s doc_throughput=    17 docs/s instructions=   937125588 cycles=   315162785 branch_miss=  432964 cache_miss= 3273127 cache_ref=   6462807 items=    524288 avg_time=  58341063 ns]

GCC, after change:

inja; ./benchmark/bench_ondemand --benchmark_filter=\(partial_tweets\|kostya\)\<simdjson_ondemand[2/2] Linking CXX executable benchmark/bench_ondemand2024-10-09T17:54:43-10:00Running ./benchmark/bench_ondemandRun on (32 X 5500 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x16)  L1 Instruction 32 KiB (x16)  L2 Unified 2048 KiB (x16)  L3 Unified 36864 KiB (x1)Load Average: 2.01, 3.69, 4.00simdjson::dom implementation:      haswellsimdjson::ondemand implementation (stage 1): haswellsimdjson::ondemand implementation (stage 2): fallback----------------------------------------------------------------------------------------------------------------Benchmark                                                      Time             CPU   Iterations UserCounters...----------------------------------------------------------------------------------------------------------------partial_tweets<simdjson_ondemand>/manual_time              86199 ns       100683 ns         7871 best_branch_miss=175 best_bytes_per_sec=7.64176G best_cache_miss=0 best_cache_ref=219 best_cycles=454.924k best_cycles_per_byte=0.720369 best_docs_per_sec=12.1007k best_frequency=5.50489G best_instructions=2.11722M best_instructions_per_byte=3.35261 best_instructions_per_cycle=4.65401 best_items_per_sec=1.21007M branch_miss=192.552 bytes=631.515k bytes_per_second=6.8231G/s cache_miss=6.19108 cache_ref=268.285 cycles=464.359k cycles_per_byte=0.735309 docs_per_sec=11.6011k/s frequency=5.38706G/s instructions=2.09654M instructions_per_byte=3.31986 instructions_per_cycle=4.51492 items=100 items_per_second=1.16011M/s [BEST: throughput=  7.64 GB/s doc_throughput= 12100 docs/s instructions=     2117221 cycles=      454924 branch_miss=     175 cache_miss=       0 cache_ref=       219 items=       100 avg_time=     86198 ns]partial_tweets<simdjson_ondemand_extract>/manual_time      85781 ns       101114 ns         8069 best_branch_miss=291 best_bytes_per_sec=7.50178G best_cache_miss=0 best_cache_ref=979 best_cycles=463.547k best_cycles_per_byte=0.734024 best_docs_per_sec=11.879k best_frequency=5.50649G best_instructions=2.20218M best_instructions_per_byte=3.48714 best_instructions_per_cycle=4.75071 best_items_per_sec=1.1879M branch_miss=307.502 bytes=631.515k bytes_per_second=6.85638G/s cache_miss=4.61718 cache_ref=983.787 cycles=471.129k cycles_per_byte=0.74603 docs_per_sec=11.6577k/s frequency=5.49226G/s instructions=2.20218M instructions_per_byte=3.48714 instructions_per_cycle=4.67426 items=100 items_per_second=1.16577M/s [BEST: throughput=  7.50 GB/s doc_throughput= 11879 docs/s instructions=     2202179 cycles=      463547 branch_miss=     291 cache_miss=       0 cache_ref=       979 items=       100 avg_time=     85780 ns]Creating a source file spanning 134087 KBkostya<simdjson_ondemand>/manual_time                   29947991 ns     37601157 ns           24 best_branch_miss=444.523k best_bytes_per_sec=4.80232G best_cache_miss=4.14271M best_cache_ref=6.45769M best_cycles=156.906M best_cycles_per_byte=1.14275 best_docs_per_sec=34.9756 best_frequency=5.48788G best_instructions=655.611M best_instructions_per_byte=4.77485 best_instructions_per_cycle=4.17837 best_items_per_sec=18.3373M branch_miss=456.155k bytes=137.305M bytes_per_second=4.26991G/s cache_miss=4.25865M cache_ref=6.45826M cycles=164.153M cycles_per_byte=1.19553 docs_per_sec=33.3912/s frequency=5.48126G/s instructions=655.611M instructions_per_byte=4.77486 instructions_per_cycle=3.99391 items=524.288k items_per_second=17.5066M/s [BEST: throughput=  4.80 GB/s doc_throughput=    34 docs/s 
instructions=   655611437 cycles=   156905935 branch_miss=  444523 cache_miss= 4142705 cache_ref=   6457695 items=    524288 avg_time=  29947990 ns]kostya<simdjson_ondemand_extract>/manual_time           30245373 ns     37590245 ns           23 best_branch_miss=438.628k best_bytes_per_sec=4.74102G best_cache_miss=4.04582M best_cache_ref=6.457M best_cycles=159.063M best_cycles_per_byte=1.15847 best_docs_per_sec=34.5291 best_frequency=5.49231G best_instructions=646.174M best_instructions_per_byte=4.70612 best_instructions_per_cycle=4.06237 best_items_per_sec=18.1032M branch_miss=434.832k bytes=137.305M bytes_per_second=4.22793G/s cache_miss=4.07704M cache_ref=6.40983M cycles=163.663M cycles_per_byte=1.19197 docs_per_sec=33.0629/s frequency=5.41117G/s instructions=641.059M instructions_per_byte=4.66887 instructions_per_cycle=3.91695 items=524.288k items_per_second=17.3345M/s [BEST: throughput=  4.74 GB/s doc_throughput=    34 docs/s instructions=   646174193 cycles=   159063247 branch_miss=  438628 cache_miss= 4045819 cache_ref=   6456997 items=    524288 avg_time=  30245373 ns]

Clang after patch:

./benchmark/bench_ondemand --benchmark_filter=\(partial_tweets\|kostya\)\<simdjson_ondemand                                                                                                                                          lemire/extractor ✭ ✱ ◼2024-10-09T17:53:59-10:00Running ./benchmark/bench_ondemandRun on (32 X 5500 MHz CPU s)CPU Caches:  L1 Data 48 KiB (x16)  L1 Instruction 32 KiB (x16)  L2 Unified 2048 KiB (x16)  L3 Unified 36864 KiB (x1)Load Average: 3.15, 4.13, 4.15simdjson::dom implementation:      haswellsimdjson::ondemand implementation (stage 1): haswellsimdjson::ondemand implementation (stage 2): fallback----------------------------------------------------------------------------------------------------------------Benchmark                                                      Time             CPU   Iterations UserCounters...----------------------------------------------------------------------------------------------------------------partial_tweets<simdjson_ondemand>/manual_time              79180 ns        93633 ns         8643 best_branch_miss=197 best_bytes_per_sec=8.11122G best_cache_miss=0 best_cache_ref=450 best_cycles=428.764k best_cycles_per_byte=0.678945 best_docs_per_sec=12.8441k best_frequency=5.50707G best_instructions=2.0093M best_instructions_per_byte=3.18171 best_instructions_per_cycle=4.68625 best_items_per_sec=1.28441M branch_miss=198.682 bytes=631.515k bytes_per_second=7.4279G/s cache_miss=5.06456 cache_ref=512.613 cycles=435.028k cycles_per_byte=0.688865 docs_per_sec=12.6294k/s frequency=5.49414G/s instructions=2.0093M instructions_per_byte=3.18171 instructions_per_cycle=4.61877 items=100 items_per_second=1.26294M/s [BEST: throughput=  8.11 GB/s doc_throughput= 12844 docs/s instructions=     2009296 cycles=      428764 branch_miss=     197 cache_miss=       0 cache_ref=       450 items=       100 avg_time=     79180 ns]partial_tweets<simdjson_ondemand_extract>/manual_time      89673 ns       105032 ns         7980 best_branch_miss=497 best_bytes_per_sec=7.35346G best_cache_miss=0 best_cache_ref=495 best_cycles=472.794k best_cycles_per_byte=0.748666 best_docs_per_sec=11.6442k best_frequency=5.50529G best_instructions=2.09835M best_instructions_per_byte=3.32272 best_instructions_per_cycle=4.43819 best_items_per_sec=1.16442M branch_miss=538.983 bytes=631.515k bytes_per_second=6.55876G/s cache_miss=9.3515 cache_ref=570.613 cycles=472.374k cycles_per_byte=0.748001 docs_per_sec=11.1516k/s frequency=5.26774G/s instructions=2.06032M instructions_per_byte=3.2625 instructions_per_cycle=4.36163 items=100 items_per_second=1.11516M/s [BEST: throughput=  7.35 GB/s doc_throughput= 11644 docs/s instructions=     2098348 cycles=      472794 branch_miss=     497 cache_miss=       0 cache_ref=       495 items=       100 avg_time=     89673 ns]Creating a source file spanning 134087 KBkostya<simdjson_ondemand>/manual_time                   30833970 ns     38675556 ns           22 best_branch_miss=440.382k best_bytes_per_sec=4.5328G best_cache_miss=4.03759M best_cache_ref=6.45941M best_cycles=166.235M best_cycles_per_byte=1.2107 best_docs_per_sec=33.0127 best_frequency=5.48785G best_instructions=667.118M best_instructions_per_byte=4.85866 best_instructions_per_cycle=4.0131 best_items_per_sec=17.3081M branch_miss=439.208k bytes=137.305M bytes_per_second=4.14722G/s cache_miss=4.09272M cache_ref=6.45905M cycles=169.122M cycles_per_byte=1.23173 docs_per_sec=32.4318/s frequency=5.48494G/s instructions=667.118M instructions_per_byte=4.85866 instructions_per_cycle=3.94459 
items=524.288k items_per_second=17.0036M/s [BEST: throughput=  4.53 GB/s doc_throughput=    33 docs/s instructions=   667117635 cycles=   166234871 branch_miss=  440382 cache_miss= 4037591 cache_ref=   6459412 items=    524288 avg_time=  30833969 ns]kostya<simdjson_ondemand_extract>/manual_time           30911169 ns     38223328 ns           22 best_branch_miss=442.157k best_bytes_per_sec=4.55517G best_cache_miss=4.02945M best_cache_ref=6.46039M best_cycles=165.476M best_cycles_per_byte=1.20517 best_docs_per_sec=33.1756 best_frequency=5.48977G best_instructions=655.583M best_instructions_per_byte=4.77465 best_instructions_per_cycle=3.9618 best_items_per_sec=17.3936M branch_miss=441.434k bytes=137.305M bytes_per_second=4.13686G/s cache_miss=4.09106M cache_ref=6.4605M cycles=169.507M cycles_per_byte=1.23453 docs_per_sec=32.3508/s frequency=5.48368G/s instructions=655.583M instructions_per_byte=4.77465 instructions_per_cycle=3.86758 items=524.288k items_per_second=16.9611M/s [BEST: throughput=  4.56 GB/s doc_throughput=    33 docs/s instructions=   655582794 cycles=   165476087 branch_miss=  442157 cache_miss= 4029447 cache_ref=   6460389 items=    524288 avg_time=  30911168 ns]

It seems like the inlining does help, especially in Clang. Also,

// this:
auto& t = result.emplace_back();
// instead of:
results.push_back(t);

is more apples to apples, but it seems the compilers are already smart enough to do the right thing anyway.

Also, the cost of constructing to and sub (which has a tuple inside) was always clear to me; I don't see any good way of hoisting those constructions out of the loop.

@the-moisrex
Member

@lemire Things we could try:

  • Add a specialization for to<sub<...>> to eliminate an indirection; I had this before, but when I benchmarked it back then I didn't see any difference, so I removed it.
  • Add some inline magic.
  • Remove the use of std::tuple in sub.
  • Add another specialization for to<sub<to<...>, to<...>>> to eliminate indirection.
  • Instead of object::extract(Funcs&&... endpoints), try object::extract(something&& sth) so the compiler can move the construction out of the loop if possible; but this probably won't help, since we're referencing different things in to's constructions.
  • Eliminate value_iterator::on_field_raw(Func&& func), since the compiler may not be able to inline the function passed to it.
  • Implement a trie or something for checking the keys (a rough sketch of this idea follows below).

But these (except possibly the last one) may not bring much to the table.
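To make the last idea concrete, here is a minimal, hypothetical sketch (not simdjson code) of what pre-processing the wanted keys could look like; a real trie would extend the same idea to longer shared prefixes instead of stopping at the first byte.

#include <array>
#include <initializer_list>
#include <string_view>
#include <vector>

// Hypothetical sketch: pre-process the wanted keys once so that matching a
// field key against N wanted keys is no longer N full string comparisons.
// Keys are bucketed by their first byte; a real trie would keep branching on
// later bytes as well.
struct key_set {
  struct entry { std::string_view key; int index; };
  std::array<std::vector<entry>, 256> buckets{};

  explicit key_set(std::initializer_list<std::string_view> keys) {
    int i = 0;
    for (std::string_view k : keys) {  // keys are assumed to be non-empty
      buckets[static_cast<unsigned char>(k.front())].push_back({k, i++});
    }
  }

  // Returns the position of the matching wanted key, or -1 if none matches.
  int match(std::string_view field_key) const {
    if (field_key.empty()) { return -1; }
    for (const entry& e : buckets[static_cast<unsigned char>(field_key.front())]) {
      if (e.key == field_key) { return e.index; }
    }
    return -1;
  }
};

// Usage: build once, reuse for every field of every document.
//   key_set wanted{"id", "text", "created_at"};
//   int which = wanted.match(field_key);  // -1 means "not a wanted key"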

@lemire
Member (Author)

@the-moisrex Thanks for the analysis.

@lemire
Member (Author)

lemire commented Sep 17, 2025 (edited)

What you can do easily with C++20 is this:

object.extract<"myfirstkey","mysecondkey","mythirdkey">(lambda1, lambda2, lambda3);

In this case, the keys become compile-time constants. This could be beneficial.
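For illustration, here is a minimal, self-contained sketch (not the PR's API) of how string keys can become compile-time constants in C++20 via a structural fixed_string non-type template parameter; the extract() below is hypothetical and only shows that the key bytes are available at compile time.

#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: a structural string type usable as a non-type template
// parameter in C++20, so each key is a distinct compile-time value.
template <std::size_t N>
struct fixed_string {
  char data[N]{};
  constexpr fixed_string(const char (&str)[N]) { std::copy_n(str, N, data); }
  constexpr std::string_view view() const { return {data, N - 1}; }
};

// Hypothetical extract(): keys are template arguments, callbacks are ordinary
// arguments; the key bytes and lengths are known at compile time, so an
// implementation could specialize the matching code per key set.
template <fixed_string... Keys, typename... Callbacks>
void extract(Callbacks&&... callbacks) {
  static_assert(sizeof...(Keys) == sizeof...(Callbacks), "one callback per key");
  constexpr std::size_t total_key_bytes = (Keys.view().size() + ... + 0);
  static_assert(total_key_bytes > 0, "keys are compile-time constants");
  // A real implementation would walk the JSON object once and compare each
  // field key against Keys..., invoking the matching callback.
  ((void)callbacks, ...);
}

int main() {
  extract<"myfirstkey", "mysecondkey", "mythirdkey">([] {}, [] {}, [] {});
}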

@the-moisrex
Member

the-moisrex commented Sep 18, 2025 (edited)

I don't wanna propose a compile-time query language, because that's a lot of work and adds a learning curve for users, but it feels like we're trying to do that here.

The keys becoming compile-time constants may not be as beneficial as we think: we'd still do zero pre-processing on them in the case of <"one", "two", ...>. Unless we're trying to build a trie or something at compile time from those keys; then yes, this can be beneficial.

The problem with to{sub... is that we're mixing in user-given references and expecting the compiler to do magic. It can't.

We need to separate the keys from where we want them to be put. We also need to make sure we're not storing temporary structures, so we don't make the compiler's job harder.

<key, key, key>(ref, ref, ref) does both of those things, but we wouldn't benefit from having access to the keys at compile time without the last idea I explained above, which is getting smarter about the search with a trie or something like that.

I still don't like the <key, key, key> syntax though; I don't like mixing types and values. It feels strange, even though I've done it many times.

What I'd really like to be able to do is this, but I'm not sure whether we can do it with reflection; even if we could, the compiler support is not there.

object.extract(car.make); // it would know to check for "make" auto-magically
// heck, if we could do that, we could do this:
object >> car.make;

Some brain-storming:

object.extract("one","two","key")(one, two, value);object["one","two","three"](one, two, three);auto [one, two, three] = object["one","two","three"]auto res = object["one","two","three"]res.get(one);res.get(two);res.get(three);// I suspect these will have performance penalties:object.extract("one","two","key") >> one >> two >> value;object >>"one" >>"two" >>"key" >> one >> two >> value;// this is cool too, but the keys are not gonna be available at compile time unless the references are too, which they will not.object.extract("one","two","key", one, two, value);

But to be honest, none of them are as powerful as the to/sub idea, unless we're actually thinking of creating a query language for it, in a way where the keys would be queries.

And we already have a bunch of JSON query languages like jq and whatnot.

@lemire
Member (Author)

@the-moisrex Reflection will be around soon.

The prototypes we have right now were implemented relatively quickly.

Right now, with simdjson 4.0, you can just automatically deserialize a Car instance. It works now.

You could even have it generate a structure on demand:

auto z = object.extract<"key1", int, "key2", std::string>()

(or some nicer variant)

This would return a structure with two attributes.
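As a rough illustration only (reusing the hypothetical fixed_string helper from the earlier sketch, and fixed at two fields), such a call could return a small generated record type; the actual reading of the JSON values is elided.

#include <string>

// Hypothetical: each compile-time key is paired with the C++ type it should
// decode into, and extract_struct() returns a plain record with one member
// per pair. A real version would be variadic and would fill the members by
// walking the JSON object once.
template <fixed_string Key1, typename T1, fixed_string Key2, typename T2>
struct extracted {
  T1 first{};
  T2 second{};
};

template <fixed_string Key1, typename T1, fixed_string Key2, typename T2>
extracted<Key1, T1, Key2, T2> extract_struct(/* const object& obj */) {
  extracted<Key1, T1, Key2, T2> out;
  // Real code would assign out.first from the field matching Key1.view()
  // and out.second from the field matching Key2.view().
  return out;
}

// Usage mirroring the proposal:
//   auto z = extract_struct<"key1", int, "key2", std::string>();
//   z.first is the int, z.second is the std::string.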

Unless we're trying to implement a trie or something at compile time from those keys, then yes, this can be beneficial.

Your implementation is nice, but it has a significant cost, and I believe that cost is algorithmic: it is too expensive to check the keys in a loop. So having compile-time strings might allow more clever implementations, which might be needed to make your approach highly efficient.

@the-moisrex
Member

With utilities like boost::pfr::structure_tie we can do this:

struct Car {
  std::string make;
  std::string model;
} car;

object.extract_into<"make","model">(car);

Or we could do it like this:

structure_tie(car) = object["make", "model"];
tie(car.make, car.model) = object["make", "model"];
tie(car.make, car.model) = object.extract<"make", "model">();
// or with structured_tie

Of course, structure_tie has limitations.

I'm sure I could even pull this off, but the syntax is even worse (imagine that this is possible):

object.extract<"make", &Car::make,"model", &Car::model>(car);

If we wanted to extract sub-objects, maybe we could do these, but I'm not 100% sure yet:

struct Car {
  std::string make;
  struct Driver {
    std::string name;
  } driver;
} car;

// with a query language for example:
tie(car.make, car.driver.name) = object["make", "driver.name"];
// without a query language, maybe we could pull this off, but it would be ugly:
tie(car.make, car.driver.name) = object["make", "driver"][self, "name"];
// Maybe this?
tie(car.make, car.driver.name) = object["make", sub("driver","name")];
tie(car.make, car.driver.name) = object.extract<"make", sub("driver","name")>();

I'm not sure, but maybe we could even pull off sub-objects of an array, like these:

// get the first driver:
tie(car.make, car.driver.name) = object.extract<"make", sub("drivers",0,"name")>();
tie(car.make, car.driver.name) = object["make", sub("drivers",0,"name")];
tie(car.make, car.driver.name) = object["make", "drivers"][selfobj, 0][selfobj, "name"];

But error handling in these is gonna be something to figure out.

Or maybe we could remove the references from what we have in this PR, and do it like this:

tie(car.make, car.driver.name, error) = obj.extract(
      to{"make"},
      to{"drivers", sub{
        to{0, sub{"name"}},
      }},
);
// or this:
tie(car.make, car.driver.name, error) = obj.extract("make",
      sub{"drivers", sub{0, sub{"name"}});
// maybe:
tie(car.make, car.driver.name, error) = obj.extract("make", sub("drivers",0,"name"));
// if we could do that, we definitely can do it at compile time:
tie(car.make, car.driver.name, error) = obj.extract<"make", sub("drivers",0,"name")>();

Do you think we could make a utility to run a lambda in the tie? Because in this PR's implementation, sub can do that.

auto set_it = [](string& name) { car.driver.name = name; };
tie(car.make, invoke_on_equal(set_it), error) = obj.extract<
      "make",
      sub("drivers",0,"name")>();

Oh wait, if we can pull that one off, we can do this as well for sub-objects:

auto set_it = [](auto& drivers) { car.driver.name = drivers[0].extract<"name">(); };
tie(car.make, invoke_on_equal(set_it), error) = obj.extract<"make", "drivers">();
// and we can create a utility for that as well, which we could:
tie(car.make, sub_to<0,"name">(car.driver.name), error) = obj.extract<"make", "drivers">();

For error handling with tie we might be able to do this to clean things up, but I'm not sure:

if (tie_up(car.make, sub_to<0,"name">(car.driver.name)) = obj.extract<"make","drivers">()) {
  // failure?
}
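Purely as a sketch of the "run a lambda in the tie" idea (hypothetical names, not part of this PR): an assignable proxy that forwards whatever is assigned to it into a callback can sit next to ordinary references in a tie-style tuple.

#include <tuple>
#include <utility>

// Hypothetical: a proxy whose operator= runs a callback, so it can occupy one
// slot of a tuple of references that is assigned to element-wise.
template <typename F>
struct invoke_on_assign {
  F f;
  template <typename T>
  invoke_on_assign& operator=(T&& value) {
    f(std::forward<T>(value));
    return *this;
  }
};

template <typename F>
invoke_on_assign<F> invoke_on_equal(F f) { return {std::move(f)}; }

// Illustrative use: plain lvalues and a callback slot in the same assignment.
//   std::string make;
//   auto slot = invoke_on_equal([](const std::string& name) { /* store it */ });
//   std::tie(make, slot) = std::make_tuple(std::string("Ford"), std::string("Alice"));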

@lemire, what do you think? which syntax would be the best?

@lemire
Member (Author)

Don't forget that you can run clang with static reflection. Please see
https://github.com/simdjson/simdjson/tree/master/p2996

With utilities like boost::pfr::structure_tie we can do this:

I don't think we want to bundle boost in simdjson.

what do you think? which syntax would be the best?

I think that what can provide high performance and efficiency should be the driving force.

This being said...

Something like this...

object.extract_into<"make","model">(car);

is very nice, and it should be easy to make it work in simdjson 4.0 with reflection.

It is better than having to annotate the Car struct.

@the-moisrex
Member

I don't think we want to bundle boost in simdjson.

We don't have to; I don't think it would be that hard to implement. It's a two-liner for C++26, but a bit more for older versions.
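For what it's worth, a rough sketch of that idea (not Boost code; the C++26 variant assumes P1061 structured-binding packs):

#include <tuple>

// With C++26 structured-binding packs (P1061), the generic version is roughly:
//
//   template <class T>
//   auto structure_tie(T& t) { auto& [...members] = t; return std::tie(members...); }
//
// Pre-C++26, you enumerate arities instead; the two-member case looks like:
template <class T>
auto structure_tie_2(T& t) {
  auto& [a, b] = t;  // T must be an aggregate with exactly two members
  return std::tie(a, b);
}

// Illustrative use with a hypothetical Car:
//   struct Car { std::string make; std::string model; };
//   Car car;
//   structure_tie_2(car) = std::make_tuple(std::string("Ford"), std::string("Model T"));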
