Description
Even though JEP 438 states that both x64 and AArch64 architectures should benefit from the new Vector API, the performance of simdjson-java on an M1 Mac is currently far worse than that of other parsers:
    Benchmark                                                                    Mode  Cnt     Score    Error  Units
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  1229.991 ± 39.538  ops/s
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  1099.877 ±  9.560  ops/s
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter        thrpt    5   607.902 ± 10.469  ops/s
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  1930.694 ± 41.766  ops/s
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5    26.287 ±  0.295  ops/s
    ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5    26.516 ±  0.686  ops/s

This may be due to the use of 256-bit vectors. I have found a thread which states that:
on AArch64 NEON, the max hardware vector size is 128 bits. So for 256-bits, we are not able to intrinsify to use SIMD directly, which will fall back to Java implementation of those APIs
When running the benchmark with '-XX:+UnlockDiagnosticVMOptions', '-XX:+PrintIntrinsics', the following output can be observed, supporting this theory:
    ** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
    ** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
    ** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
    ** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
    ** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
    ** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
    ** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
    ** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
    ** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
    ** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
    ** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
    ** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
    ** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
    ** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
    ** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
    ** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte

Every rejected operation has vlen=32 with etype=byte, i.e. a 256-bit vector. Obviously AArch64 support is not as important as x64, but it may be interesting to make the implementation flexible enough to support both architectures. Perhaps the C++ implementation can be used as a reference.
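To illustrate what "flexible" could mean here, below is a minimal sketch that picks the vector width at startup instead of hard-coding 256 bits. It is not how simdjson-java is structured: `SpeciesSelection`, `chooseSpecies` and `countByte` are hypothetical names I made up for the example, and capping the width at 256 bits is my own assumption. It only uses `VectorShape.preferredShape()`, `ByteVector.SPECIES_128` and `ByteVector.SPECIES_256` from the incubating Vector API, so it has to be compiled and run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorShape;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesSelection {

    // Hypothetical helper: use 256-bit vectors only when the hardware prefers
    // at least that width, otherwise fall back to 128-bit (e.g. AArch64 NEON).
    static VectorSpecies<Byte> chooseSpecies() {
        int preferredBits = VectorShape.preferredShape().vectorBitSize();
        return preferredBits >= 256 ? ByteVector.SPECIES_256 : ByteVector.SPECIES_128;
    }

    // Counts occurrences of `target` in `data`, written against an arbitrary
    // species so the same loop works for both 128-bit and 256-bit vectors.
    static int countByte(byte[] data, byte target, VectorSpecies<Byte> species) {
        int count = 0;
        int i = 0;
        int bound = species.loopBound(data.length);
        for (; i < bound; i += species.length()) {
            ByteVector chunk = ByteVector.fromArray(species, data, i);
            count += chunk.eq(target).trueCount();
        }
        // Scalar tail for the bytes that do not fill a whole vector.
        for (; i < data.length; i++) {
            if (data[i] == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        VectorSpecies<Byte> species = chooseSpecies();
        byte[] json = "{\"user\":{\"id\":1},\"tags\":[\"a\",\"b\"]}".getBytes();
        System.out.println("species bits: " + species.vectorBitSize());
        System.out.println("quotes: " + countByte(json, (byte) '"', species));
    }
}
```

On an M1 this should select the 128-bit species, which can be checked again with '-XX:+PrintIntrinsics'. The real parser logic is of course tied to the chunk size in more places than a simple loop like this, so this only shows the selection step.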
Anyway, great work so far on the Java port, the results on x64 are very impressive!