use simd masking for amd64&arm64 #326


Merged
nhooyr merged 26 commits into coder:dev from wdvxdr1123:patch-simd-mask
Feb 22, 2024


wdvxdr1123
Contributor

goos: windows
goarch: amd64
pkg: nhooyr.io/websocket
cpu: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
Benchmark_mask/2/basic-8 425339004 2.795 ns/op 715.66 MB/s
Benchmark_mask/2/nhooyr-8 379937766 3.186 ns/op 627.78 MB/s
Benchmark_mask/2/gorilla-8 392164167 3.071 ns/op 651.24 MB/s
Benchmark_mask/2/gobwas-8 310037222 3.880 ns/op 515.46 MB/s
Benchmark_mask/3/basic-8 321408024 3.806 ns/op 788.32 MB/s
Benchmark_mask/3/nhooyr-8 350726338 3.478 ns/op 862.58 MB/s
Benchmark_mask/3/gorilla-8 332217727 3.634 ns/op 825.43 MB/s
Benchmark_mask/3/gobwas-8 247376214 4.886 ns/op 614.01 MB/s
Benchmark_mask/4/basic-8 261182472 4.582 ns/op 872.91 MB/s
Benchmark_mask/4/nhooyr-8 381830712 3.262 ns/op 1226.05 MB/s
Benchmark_mask/4/gorilla-8 272616304 4.395 ns/op 910.04 MB/s
Benchmark_mask/4/gobwas-8 204574558 5.855 ns/op 683.19 MB/s
Benchmark_mask/8/basic-8 191330037 6.162 ns/op 1298.24 MB/s
Benchmark_mask/8/nhooyr-8 369694992 3.285 ns/op 2435.65 MB/s
Benchmark_mask/8/gorilla-8 175388466 6.743 ns/op 1186.48 MB/s
Benchmark_mask/8/gobwas-8 241719933 4.886 ns/op 1637.45 MB/s
Benchmark_mask/16/basic-8 100000000 10.92 ns/op 1464.83 MB/s
Benchmark_mask/16/nhooyr-8 272565096 4.436 ns/op 3606.98 MB/s
Benchmark_mask/16/gorilla-8 100000000 11.20 ns/op 1428.53 MB/s
Benchmark_mask/16/gobwas-8 221356798 5.405 ns/op 2960.45 MB/s
Benchmark_mask/32/basic-8 61476984 20.40 ns/op 1568.80 MB/s
Benchmark_mask/32/nhooyr-8 238665572 5.050 ns/op 6337.22 MB/s
Benchmark_mask/32/gorilla-8 100000000 12.09 ns/op 2647.28 MB/s
Benchmark_mask/32/gobwas-8 186077235 6.477 ns/op 4940.36 MB/s
Benchmark_mask/128/basic-8 14629720 80.90 ns/op 1582.19 MB/s
Benchmark_mask/128/nhooyr-8 181241968 6.565 ns/op 19497.98 MB/s
Benchmark_mask/128/gorilla-8 68308342 16.76 ns/op 7639.37 MB/s
Benchmark_mask/128/gobwas-8 94582026 12.97 ns/op 9872.11 MB/s
Benchmark_mask/512/basic-8 3921001 305.6 ns/op 1675.55 MB/s
Benchmark_mask/512/nhooyr-8 123102199 9.721 ns/op 52669.11 MB/s
Benchmark_mask/512/gorilla-8 32355914 38.18 ns/op 13411.43 MB/s
Benchmark_mask/512/gobwas-8 31528501 37.80 ns/op 13544.37 MB/s
Benchmark_mask/4096/basic-8 491804 2381 ns/op 1720.39 MB/s
Benchmark_mask/4096/nhooyr-8 26159691 46.98 ns/op 87187.73 MB/s
Benchmark_mask/4096/gorilla-8 4898440 243.6 ns/op 16817.89 MB/s
Benchmark_mask/4096/gobwas-8 4336398 277.2 ns/op 14776.40 MB/s
Benchmark_mask/16384/basic-8 113842 9623 ns/op 1702.66 MB/s
Benchmark_mask/16384/nhooyr-8 8088847 154.5 ns/op 106058.18 MB/s
Benchmark_mask/16384/gorilla-8 1282993 933.6 ns/op 17549.90 MB/s
Benchmark_mask/16384/gobwas-8 997347 1086 ns/op 15093.49 MB/s

nhooyr-ts, Ink-33, aca, nhooyr, idy, mrkmrtns, marekmartins, Jisin0, Jibbscript, and sc0Vu reacted with heart emoji
@wdvxdr1123 changed the title from "use simd mask for amd64&arm64" to "use simd masking for amd64&arm64" on Jan 24, 2022
@nhooyr changed the base branch from master to dev on October 13, 2023 09:12
@nhooyr added this to the v1.9.0 milestone on Oct 13, 2023
@nhooyr force-pushed the dev branch 8 times, most recently from e6fb843 to 0caa997 on October 19, 2023 11:01
@nhooyr
Contributor

Finally got around to reviewing this. I'm not very familiar with writing assembly of any kind. Why use AVX2 instead of AVX-512?

@nhooyr
Contributor

Also don't worry about the merge conflicts, I'll fix them myself.

@nhooyr
Contributor

Benchmark_mask/2/basic-12 631384161 1.883 ns/op 1061.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/nhooyr-12 591894866 2.061 ns/op 970.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/gorilla-12 657205106 1.923 ns/op 1040.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/2/gobwas-12 496567813 2.496 ns/op 801.34 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/basic-12 592897168 1.992 ns/op 1506.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 507159836 2.197 ns/op 1365.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/gorilla-12 553840022 2.304 ns/op 1302.28 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/gobwas-12 397366413 2.800 ns/op 1071.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/basic-12 634193241 1.807 ns/op 2213.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 569515338 2.002 ns/op 1998.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/gorilla-12 451382727 2.599 ns/op 1538.81 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/gobwas-12 356507592 3.312 ns/op 1207.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/basic-12 405458120 2.981 ns/op 2683.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 586096395 2.124 ns/op 3765.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/gorilla-12 296482132 4.003 ns/op 1998.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/gobwas-12 358996738 3.317 ns/op 2411.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/basic-12 199646600 5.828 ns/op 2745.57 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 482739769 2.494 ns/op 6416.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/gorilla-12 166567765 7.225 ns/op 2214.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/gobwas-12 297547316 3.989 ns/op 4011.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/basic-12 66204484 18.72 ns/op 1709.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 444971588 2.557 ns/op 12516.90 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/gorilla-12 153725197 7.672 ns/op 4171.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/gobwas-12 221328512 5.407 ns/op 5918.17 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/basic-12 21106347 58.03 ns/op 2205.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 329196819 3.777 ns/op 33893.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/gorilla-12 100000000 11.08 ns/op 11552.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/gobwas-12 82296996 14.98 ns/op 8546.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/basic-12 5925668 208.8 ns/op 2451.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 11774136 101.9 ns/op 5023.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/gorilla-12 43038144 26.93 ns/op 19014.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/gobwas-12 23169214 55.74 ns/op 9184.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/basic-12 795450 1445 ns/op 2835.39 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 9641613 124.3 ns/op 32940.03 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/gorilla-12 8906532 139.6 ns/op 29346.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/gobwas-12 2789071 424.5 ns/op 9648.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/basic-12 219685 5795 ns/op 2827.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6135582 196.3 ns/op 83454.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/gorilla-12 2377486 516.0 ns/op 31752.39 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/gobwas-12 723357 1557 ns/op 10523.07 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket/internal/thirdparty 58.195s

For some reason it slows down at the 512 byte benchmark. Not sure what's going on there.
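
To make the size of that dip concrete, here is a small sketch of how the MB/s column follows from ns/op when a benchmark reports bytes per operation (1 MB taken as 1e6 bytes). The helper name mbPerSec is hypothetical, and the inputs are the nhooyr rows for 128 and 512 bytes from the run above. Results differ slightly from the printed column because ns/op is rounded in the output.

```go
package main

import "fmt"

// mbPerSec (hypothetical helper) converts a payload size and a per-operation
// time in nanoseconds into throughput in MB/s, with 1 MB = 1e6 bytes.
func mbPerSec(payloadBytes int, nsPerOp float64) float64 {
	return float64(payloadBytes) / nsPerOp * 1e9 / 1e6
}

func main() {
	// nhooyr rows from the run above: 128 bytes at 3.777 ns/op,
	// 512 bytes at 101.9 ns/op.
	fmt.Printf("128 B: %.0f MB/s\n", mbPerSec(128, 3.777))
	fmt.Printf("512 B: %.0f MB/s\n", mbPerSec(512, 101.9))
	fmt.Printf("time ratio 512 B / 128 B: %.1fx\n", 101.9/3.777)
}
```

Four times the data takes roughly 27 times as long, so the 512-byte case is far off the trend set by the smaller and larger sizes.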

@nhooyr
Contributor

More clearly:

Benchmark_mask/2/nhooyr-12 590403414 2.028 ns/op 986.19 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 584087539 2.063 ns/op 1453.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 655971961 1.839 ns/op 2175.33 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 642215430 1.905 ns/op 4199.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 485812323 2.301 ns/op 6954.78 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 501743362 2.351 ns/op 13608.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 334930033 3.648 ns/op 35090.20 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 51036463 99.33 ns/op 5154.74 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 11011562 121.7 ns/op 33663.04 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6010369 197.6 ns/op 82904.02 MB/s 0 B/op 0 allocs/op

Super weird.

@nhooyr
Contributor

Disabling AVX2 seems to have fixed it.

Benchmark_mask/2/nhooyr-12 542097008 2.197 ns/op 910.42 MB/s 0 B/op 0 allocs/op
Benchmark_mask/3/nhooyr-12 537046092 2.258 ns/op 1328.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4/nhooyr-12 516057957 1.957 ns/op 2044.01 MB/s 0 B/op 0 allocs/op
Benchmark_mask/8/nhooyr-12 566813392 2.027 ns/op 3946.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16/nhooyr-12 456252357 2.465 ns/op 6491.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/32/nhooyr-12 477971746 2.697 ns/op 11862.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/128/nhooyr-12 323935191 3.760 ns/op 34040.58 MB/s 0 B/op 0 allocs/op
Benchmark_mask/512/nhooyr-12 131543775 8.955 ns/op 57174.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/4096/nhooyr-12 23514272 46.50 ns/op 88092.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/16384/nhooyr-12 6336271 181.9 ns/op 90069.97 MB/s 0 B/op 0 allocs/op

@nhooyr force-pushed the patch-simd-mask branch 6 times, most recently from 1e8bf28 to 32d0aa1 on October 19, 2023 23:40
@nhooyr
Contributor

The amd64 code looks good to me so far but the arm64 code doesn't seem to produce any speedup at least through qemu.

goos: linux
goarch: amd64
pkg: nhooyr.io/websocket
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkFlateWriter-12 3722 326920 ns/op 1200024 B/op 16 allocs/op
BenchmarkFlateReader-12 169479 6926 ns/op 41047 B/op 6 allocs/op
BenchmarkConn/disabledCompress-12 84481 12720 ns/op 40.25 MB/s 518.0 read/op 520.0 written/op 1 B/op 0 allocs/op
BenchmarkConn/compressContextTakeover-12 32448 33822 ns/op 15.14 MB/s 24.00 read/op 36.00 written/op 42 B/op 0 allocs/op
BenchmarkConn/compressNoContext-12 38430 29966 ns/op 17.09 MB/s 41.00 read/op 29.00 written/op 96 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket 6.819s
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
Benchmark_mask/amd64/basic/8-12 425723130 2.780 ns/op 2877.27 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16-12 224227551 5.293 ns/op 3022.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/32-12 100000000 10.19 ns/op 3139.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/128-12 24135116 46.41 ns/op 2757.74 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/256-12 12339093 85.20 ns/op 3004.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/512-12 7325516 163.8 ns/op 3125.51 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/1024-12 3657289 320.5 ns/op 3194.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/2048-12 1887517 638.8 ns/op 3206.18 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/4096-12 934762 1264 ns/op 3241.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/8192-12 395722 2598 ns/op 3153.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/basic/16384-12 236943 5162 ns/op 3173.86 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8-12 505864449 2.316 ns/op 3454.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16-12 500031924 2.375 ns/op 6737.54 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/32-12 451944298 2.574 ns/op 12429.91 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/128-12 306800580 3.938 ns/op 32506.67 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/256-12 197035516 6.612 ns/op 38717.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/512-12 114783908 10.59 ns/op 48332.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/1024-12 59498761 19.20 ns/op 53328.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/2048-12 31537369 36.59 ns/op 55970.07 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/4096-12 15516426 77.49 ns/op 52861.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8192-12 8057901 150.7 ns/op 54358.50 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16384-12 4023576 294.3 ns/op 55666.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8-12 498550161 2.298 ns/op 3481.43 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16-12 508013607 2.505 ns/op 6387.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/32-12 475446944 2.687 ns/op 11909.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/128-12 347085175 3.462 ns/op 36969.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/256-12 239742297 5.094 ns/op 50253.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/512-12 132367032 9.429 ns/op 54300.89 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/1024-12 59876775 17.24 ns/op 59387.88 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/2048-12 43464296 28.10 ns/op 72877.63 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/4096-12 25988770 51.22 ns/op 79973.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8192-12 11870416 97.20 ns/op 84279.05 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16384-12 6374655 196.1 ns/op 83555.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8-12 307082148 4.199 ns/op 1905.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16-12 166534495 7.258 ns/op 2204.54 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/32-12 157286900 7.638 ns/op 4189.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/128-12 121178448 10.14 ns/op 12620.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/256-12 88366356 13.62 ns/op 18791.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/512-12 40303383 26.69 ns/op 19181.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/1024-12 28564507 41.38 ns/op 24744.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/2048-12 14325160 72.32 ns/op 28317.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/4096-12 8834644 130.5 ns/op 31378.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/8192-12 4661844 249.3 ns/op 32856.93 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gorilla/16384-12 2452156 491.8 ns/op 33317.08 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8-12 372520472 3.229 ns/op 2477.79 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16-12 303515722 3.914 ns/op 4088.10 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/32-12 215681712 5.353 ns/op 5977.97 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/128-12 82971432 15.39 ns/op 8319.67 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/256-12 43254800 30.40 ns/op 8420.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/512-12 20618145 58.86 ns/op 8698.44 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/1024-12 11872770 108.3 ns/op 9453.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/2048-12 6433407 207.7 ns/op 9860.23 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/4096-12 3156878 403.0 ns/op 10162.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/8192-12 1622864 745.8 ns/op 10984.28 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/gobwas/16384-12 820447 1490 ns/op 10997.96 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8-12 585874134 2.147 ns/op 3726.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16-12 475160053 2.394 ns/op 6684.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/32-12 356494118 3.316 ns/op 9650.56 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/128-12 269125159 4.106 ns/op 31177.06 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/256-12 150355809 7.474 ns/op 34249.82 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/512-12 72345751 14.25 ns/op 35929.65 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/1024-12 41781184 24.17 ns/op 42371.22 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/2048-12 26343178 45.28 ns/op 45225.24 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/4096-12 13897591 94.29 ns/op 43440.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/8192-12 6824702 185.3 ns/op 44204.41 MB/s 0 B/op 0 allocs/op
Benchmark_mask/amd64/nbio/16384-12 3472126 368.5 ns/op 44459.74 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket/internal/thirdparty 102.835s
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U @ 1364.583MHz
Benchmark_mask/arm64/basic/8-12 47771958 26.59 ns/op 300.86 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16-12 24547660 52.69 ns/op 303.64 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/32-12 12533614 92.10 ns/op 347.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/128-12 3555813 346.9 ns/op 368.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/256-12 1811830 673.4 ns/op 380.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/512-12 938022 1335 ns/op 383.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/1024-12 484177 2479 ns/op 413.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/2048-12 211894 5014 ns/op 408.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/4096-12 112736 10130 ns/op 404.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/8192-12 61010 21183 ns/op 386.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/basic/16384-12 31218 39141 ns/op 418.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-12 39843982 28.80 ns/op 277.80 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-12 34930447 29.61 ns/op 540.32 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-12 32931360 33.07 ns/op 967.69 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-12 32877277 42.30 ns/op 3025.92 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-12 21600469 60.31 ns/op 4244.99 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-12 14673056 94.28 ns/op 5430.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-12 8250734 163.7 ns/op 6256.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-12 3977023 301.1 ns/op 6802.66 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-12 2260831 578.0 ns/op 7086.45 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-12 1121847 1079 ns/op 7594.77 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-12 508933 2095 ns/op 7819.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-12 34301584 36.89 ns/op 216.87 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-12 33929019 37.52 ns/op 426.46 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-12 31671778 41.70 ns/op 767.31 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-12 25115096 53.61 ns/op 2387.78 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-12 17948512 63.43 ns/op 4036.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-12 12472801 104.4 ns/op 4902.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-12 7425166 161.7 ns/op 6334.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-12 3981708 292.6 ns/op 6998.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-12 2086530 563.9 ns/op 7264.25 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-12 1070166 1114 ns/op 7355.53 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-12 504093 2159 ns/op 7588.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8-12 27462318 46.20 ns/op 173.14 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16-12 23176634 49.10 ns/op 325.85 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/32-12 22810416 58.54 ns/op 546.62 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/128-12 12784365 87.69 ns/op 1459.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/256-12 8819766 142.4 ns/op 1797.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/512-12 5834811 225.9 ns/op 2266.71 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/1024-12 3309975 369.7 ns/op 2769.72 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/2048-12 1758891 763.6 ns/op 2682.02 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/4096-12 742028 1404 ns/op 2917.37 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/8192-12 489636 2739 ns/op 2990.70 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gorilla/16384-12 236709 5086 ns/op 3221.13 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8-12 31763971 34.14 ns/op 234.35 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16-12 28280493 41.83 ns/op 382.47 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/32-12 23041581 52.92 ns/op 604.73 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/128-12 10903680 115.3 ns/op 1110.58 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/256-12 6139404 202.1 ns/op 1266.59 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/512-12 3639919 339.2 ns/op 1509.60 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/1024-12 1897648 680.3 ns/op 1505.26 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/2048-12 958771 1223 ns/op 1674.76 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/4096-12 520082 2581 ns/op 1586.94 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/8192-12 243410 4994 ns/op 1640.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/gobwas/16384-12 129097 9468 ns/op 1730.40 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8-12 41615394 27.52 ns/op 290.75 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16-12 38795175 31.95 ns/op 500.84 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/32-12 35392299 36.75 ns/op 870.68 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/128-12 31278990 39.73 ns/op 3221.36 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/256-12 20779035 59.31 ns/op 4316.11 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/512-12 12213514 99.53 ns/op 5144.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/1024-12 7523419 161.6 ns/op 6335.00 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/2048-12 3721555 330.7 ns/op 6192.52 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/4096-12 1884742 612.8 ns/op 6683.56 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/8192-12 1000591 1199 ns/op 6834.55 MB/s 0 B/op 0 allocs/op
Benchmark_mask/arm64/nbio/16384-12 512989 2263 ns/op 7238.41 MB/s 0 B/op 0 allocs/op
PASS

In fact it's slower. Not sure what's going on.

@nhooyr
Contributor

Will test on a proper VM too.

@nhooyr force-pushed the patch-simd-mask branch 2 times, most recently from 7d0c6f4 to 9f298ec on October 20, 2023 14:29
json.Encoder is 42% faster than json.Marshal thanks to the memory reuse.
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder-12 3517579 340.2 ns/op 24 B/op 1 allocs/op
BenchmarkJSON/json.Marshal-12 2374086 484.3 ns/op 728 B/op 2 allocs/op
Closes coder#409

[qrvnl@dios ~/src/websocket] 130$ go test -bench=. ./wsjson/
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder/8-12 14041426 72.59 ns/op 110.21 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16-12 13936426 86.99 ns/op 183.92 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/32-12 11416401 115.3 ns/op 277.59 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/128-12 4600574 264.7 ns/op 483.55 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/256-12 2710398 433.9 ns/op 590.06 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/512-12 1588930 717.3 ns/op 713.82 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/1024-12 823138 1484 ns/op 689.80 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/2048-12 402823 2875 ns/op 712.32 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/4096-12 213926 5602 ns/op 731.14 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/8192-12 92864 11281 ns/op 726.19 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16384-12 39318 29203 ns/op 561.04 MB/s 19 B/op 1 allocs/op
BenchmarkJSON/json.Marshal/8-12 10768671 114.5 ns/op 69.89 MB/s 48 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16-12 10140996 113.9 ns/op 140.51 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/32-12 9211780 121.6 ns/op 263.06 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/128-12 4632796 264.2 ns/op 484.53 MB/s 224 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/256-12 2441511 473.5 ns/op 540.65 MB/s 432 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/512-12 1298788 896.2 ns/op 571.27 MB/s 912 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/1024-12 602084 1866 ns/op 548.83 MB/s 1808 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/2048-12 341151 3817 ns/op 536.61 MB/s 3474 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/4096-12 175594 7034 ns/op 582.32 MB/s 6548 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/8192-12 83222 15023 ns/op 545.30 MB/s 13591 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16384-12 33087 39348 ns/op 416.39 MB/s 27304 B/op 2 allocs/op
PASS
ok nhooyr.io/websocket/wsjson 32.934s
@dixyes

I guess QEMU's SIMD emulation hurts performance.

On an Aliyun (Alibaba Cloud) Yitian 710 (arm64, ARMv8) 2-core, 4 GB machine:

root@iZbp1heu8m4uq7gguvddwaZ:~/websocket# cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd49
CPU revision    : 0

processor       : 1
BogoMIPS        : 100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd49
CPU revision    : 0

root@iZbp1heu8m4uq7gguvddwaZ:~/websocket# uname -a
Linux iZbp1heu8m4uq7gguvddwaZ 5.10.0-19-arm64 #1 SMP Debian 5.10.149-2 (2022-10-21) aarch64 GNU/Linux
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
Benchmark_mask/arm64/basic/8-2  206792809  5.802 ns/op  1378.89 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/16-2  100000000  10.02 ns/op  1596.73 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/32-2  58691935  20.34 ns/op  1573.17 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/128-2  14648796  81.91 ns/op  1562.64 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/256-2  7302968  164.3 ns/op  1558.27 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/512-2  3585920  334.4 ns/op  1530.96 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/1024-2  1807688  663.8 ns/op  1542.68 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/2048-2  901452  1322 ns/op  1548.69 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/4096-2  453880  2641 ns/op  1550.79 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/8192-2  227306  5273 ns/op  1553.59 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/16384-2  113630  10536 ns/op  1555.07 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-2  372385791  3.200 ns/op  2499.82 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-2  326266168  3.677 ns/op  4351.15 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-2  326263063  3.675 ns/op  8706.64 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-2  193277991  6.178 ns/op  20717.82 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-2  120835178  9.939 ns/op  25757.71 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-2  67891269  17.58 ns/op  29120.25 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-2  36238434  33.05 ns/op  30981.53 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-2  18876517  63.51 ns/op  32244.43 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-2  9632865  124.4 ns/op  32913.56 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-2  4862270  246.5 ns/op  33239.77 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-2  2449879  490.7 ns/op  33386.93 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-2  320587507  3.747 ns/op  2134.84 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-2  298397137  4.016 ns/op  3984.46 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-2  295286755  4.051 ns/op  7899.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-2  198758401  6.010 ns/op  21299.05 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-2  148294503  8.101 ns/op  31599.58 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-2  99287224  12.21 ns/op  41941.45 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-2  59101357  20.24 ns/op  50591.08 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-2  32870538  36.43 ns/op  56215.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-2  17392502  68.75 ns/op  59578.86 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-2  8991554  133.3 ns/op  61432.88 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-2  4537192  264.3 ns/op  61990.60 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/8-2  166697532  7.199 ns/op  1111.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/16-2  95416378  12.50 ns/op  1280.35 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/32-2  99859288  12.03 ns/op  2659.82 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/128-2  74788264  15.98 ns/op  8008.48 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/256-2  49521510  24.10 ns/op  10620.54 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/512-2  30854259  38.75 ns/op  13213.30 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/1024-2  17709324  67.75 ns/op  15114.36 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/2048-2  9540504  125.6 ns/op  16301.06 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/4096-2  4887254  245.4 ns/op  16689.60 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/8192-2  2506159  477.0 ns/op  17173.59 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/16384-2  1276844  939.9 ns/op  17431.75 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/8-2  239466345  5.011 ns/op  1596.61 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/16-2  198722446  6.030 ns/op  2653.50 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/32-2  149454994  8.028 ns/op  3986.12 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/128-2  58453107  20.45 ns/op  6259.12 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/256-2  32118558  37.26 ns/op  6870.96 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/512-2  16886425  70.98 ns/op  7213.33 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/1024-2  8660222  138.4 ns/op  7396.91 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/2048-2  4389014  273.8 ns/op  7478.89 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/4096-2  2220012  540.4 ns/op  7579.69 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/8192-2  1000000  1070 ns/op  7654.83 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/16384-2  561620  2130 ns/op  7691.23 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/8-2  359732443  3.339 ns/op  2395.91 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/16-2  295799040  4.060 ns/op  3941.20 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/32-2  222655515  5.406 ns/op  5918.87 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/128-2  175895174  6.824 ns/op  18757.64 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/256-2  100000000  11.33 ns/op  22586.09 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/512-2  59968189  19.72 ns/op  25968.88 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/1024-2  33116636  36.16 ns/op  28320.44 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/2048-2  17286394  69.43 ns/op  29496.87 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/4096-2  8810706  136.0 ns/op  30118.04 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/8192-2  4461346  268.9 ns/op  30466.70 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/16384-2  2242198  534.8 ns/op  30633.09 MB/s  0 B/op  0 allocs/op
PASS

On an Aliyun (Alibaba Cloud) Ampere Altra (arm64, ARMv8) 2-core, 4 GB machine:

root@iZbp19nzrw6iywyjtl52srZ:~# cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 50.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

processor       : 1
BogoMIPS        : 50.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

root@iZbp19nzrw6iywyjtl52srZ:~# uname -a
Linux iZbp19nzrw6iywyjtl52srZ 5.10.0-19-arm64 #1 SMP Debian 5.10.149-2 (2022-10-21) aarch64 GNU/Linux
goos: linux
goarch: arm64
pkg: nhooyr.io/websocket/internal/thirdparty
Benchmark_mask/arm64/basic/8-2  156192206  7.680 ns/op  1041.61 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/16-2  87099630  13.69 ns/op  1168.31 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/32-2  43625746  27.15 ns/op  1178.65 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/128-2  11600862  103.4 ns/op  1237.93 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/256-2  5790669  207.2 ns/op  1235.57 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/512-2  2849724  421.0 ns/op  1216.19 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/1024-2  1443289  830.9 ns/op  1232.42 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/2048-2  723596  1652 ns/op  1239.84 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/4096-2  364108  3289 ns/op  1245.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/8192-2  182422  6565 ns/op  1247.79 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/basic/16384-2  91266  13126 ns/op  1248.20 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8-2  179696448  6.678 ns/op  1198.02 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16-2  171135552  7.011 ns/op  2282.01 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/32-2  163356070  7.345 ns/op  4356.99 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/128-2  100000000  10.21 ns/op  12531.93 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/256-2  73170615  16.29 ns/op  15715.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/512-2  42342985  28.30 ns/op  18091.30 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/1024-2  22871635  52.36 ns/op  19557.29 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/2048-2  11953033  100.4 ns/op  20390.69 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/4096-2  6098042  196.7 ns/op  20824.14 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/8192-2  3083127  389.3 ns/op  21045.51 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nhooyr-go/16384-2  1549681  773.8 ns/op  21172.69 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8-2  239631874  5.007 ns/op  1597.77 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16-2  246623696  4.874 ns/op  3282.49 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/32-2  224660503  5.343 ns/op  5989.62 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/128-2  146873190  8.150 ns/op  15705.46 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/256-2  100000000  11.35 ns/op  22548.91 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/512-2  66308772  18.03 ns/op  28401.24 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/1024-2  38059369  31.39 ns/op  32624.65 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/2048-2  20609492  58.09 ns/op  35258.25 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/4096-2  10760130  111.5 ns/op  36737.95 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/8192-2  5494204  218.4 ns/op  37501.18 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/wdvxdr1123-asm/16384-2  2776998  432.0 ns/op  37923.91 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/8-2  126019189  9.511 ns/op  841.13 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/16-2  87176002  13.69 ns/op  1168.45 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/32-2  79482931  15.03 ns/op  2129.34 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/128-2  51963406  23.05 ns/op  5552.79 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/256-2  35389480  33.79 ns/op  7576.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/512-2  21703480  55.26 ns/op  9265.94 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/1024-2  12215022  98.20 ns/op  10427.73 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/2048-2  6514315  184.3 ns/op  11113.19 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/4096-2  3286785  365.1 ns/op  11219.00 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/8192-2  1691893  709.5 ns/op  11545.83 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gorilla/16384-2  855566  1397 ns/op  11726.46 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/8-2  170618257  7.011 ns/op  1141.12 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/16-2  138242857  8.621 ns/op  1855.95 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/32-2  100000000  11.73 ns/op  2729.19 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/128-2  39364594  30.25 ns/op  4230.99 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/256-2  22373491  54.43 ns/op  4703.68 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/512-2  11536062  105.1 ns/op  4873.04 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/1024-2  5862846  201.1 ns/op  5091.32 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/2048-2  2996881  397.3 ns/op  5154.43 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/4096-2  1488253  820.2 ns/op  4993.60 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/8192-2  770410  1599 ns/op  5123.95 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/gobwas/16384-2  373005  3122 ns/op  5248.40 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/8-2  224579071  5.341 ns/op  1497.82 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/16-2  189027944  6.344 ns/op  2521.98 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/32-2  143714523  8.347 ns/op  3833.80 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/128-2  100000000  10.35 ns/op  12369.09 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/256-2  69961245  17.03 ns/op  15031.81 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/512-2  39476787  30.39 ns/op  16845.45 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/1024-2  20975644  57.11 ns/op  17930.66 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/2048-2  10845069  110.5 ns/op  18530.47 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/4096-2  5520278  217.4 ns/op  18844.48 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/8192-2  2782627  431.4 ns/op  18989.59 MB/s  0 B/op  0 allocs/op
Benchmark_mask/arm64/nbio/16384-2  1397580  858.8 ns/op  19077.29 MB/s  0 B/op  0 allocs/op
PASS

@nhooyr
Contributor

Right on, thanks for testing, @dixyes.

@nightwolfz

Finally gotten around to reviewing this. I'm not very familiar with writing assembly of any kind. Why use AVX2 instead of AVX-512?

AVX-512 is not widely supported, while AVX2 is everywhere.


I'm just not good enough at assembly. I added tests to confirm that @wdvxdr's implementation works correctly and matches the output of the basic masking loop.
nhooyr added a commit to wdvxdr1123/websocket that referenced this pull request Feb 22, 2024
The standard library does this too. Unfortunate; I wish they just exposed it in the standard library. Perhaps we can isolate the specific code we need later.
@nhooyr
Contributor

nhooyr commented Feb 22, 2024

Final results:

goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/internal/thirdparty
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
Benchmark_mask/amd64/basic/8-12  423375534  2.786 ns/op  2871.05 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/16-12  226554633  5.359 ns/op  2985.68 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/32-12  117482640  10.19 ns/op  3140.90 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/128-12  26246637  45.81 ns/op  2794.00 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/256-12  14100849  84.95 ns/op  3013.68 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/512-12  7287253  165.2 ns/op  3098.76 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/1024-12  3688262  320.3 ns/op  3197.24 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/2048-12  1888688  638.6 ns/op  3207.04 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/4096-12  939709  1275 ns/op  3212.55 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/8192-12  416410  2533 ns/op  3233.74 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/basic/16384-12  237880  5075 ns/op  3228.53 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8-12  516842565  2.323 ns/op  3443.66 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16-12  512148457  2.321 ns/op  6895.02 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/32-12  463799696  2.554 ns/op  12531.05 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/128-12  305272117  3.889 ns/op  32909.16 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/256-12  186344584  6.533 ns/op  39186.37 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/512-12  98735030  10.37 ns/op  49364.30 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/1024-12  60532092  20.18 ns/op  50735.99 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/2048-12  31890501  36.09 ns/op  56745.07 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/4096-12  15045230  79.13 ns/op  51760.10 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/8192-12  7874872  152.5 ns/op  53720.47 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nhooyr-go/16384-12  3976707  300.0 ns/op  54621.87 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8-12  565721422  2.087 ns/op  3833.34 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16-12  490515590  2.396 ns/op  6678.41 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/32-12  499705630  2.309 ns/op  13859.26 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/128-12  349259366  3.673 ns/op  34851.70 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/256-12  121710386  10.07 ns/op  25427.13 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/512-12  100000000  12.00 ns/op  42654.69 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/1024-12  68401042  17.57 ns/op  58296.87 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/2048-12  38861618  28.96 ns/op  70716.39 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/4096-12  22134694  53.55 ns/op  76483.15 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/8192-12  12523645  91.32 ns/op  89702.20 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/wdvxdr1123-asm/16384-12  6966129  167.6 ns/op  97768.91 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/8-12  306537969  3.908 ns/op  2047.33 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/16-12  167440917  7.127 ns/op  2245.06 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/32-12  157346451  7.623 ns/op  4197.75 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/128-12  100000000  10.17 ns/op  12590.73 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/256-12  91401891  13.36 ns/op  19161.41 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/512-12  43890088  26.60 ns/op  19246.01 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/1024-12  26414316  41.59 ns/op  24621.32 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/2048-12  16049217  71.19 ns/op  28766.12 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/4096-12  9171207  129.4 ns/op  31658.05 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/8192-12  4856886  250.7 ns/op  32674.27 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gorilla/16384-12  2488569  485.2 ns/op  33764.34 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/8-12  366741759  3.282 ns/op  2437.84 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/16-12  303639134  3.906 ns/op  4095.90 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/32-12  223418820  5.406 ns/op  5919.31 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/128-12  89532153  13.94 ns/op  9180.17 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/256-12  39774794  32.82 ns/op  7799.75 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/512-12  21657115  53.08 ns/op  9646.12 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/1024-12  11203101  97.40 ns/op  10513.88 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/2048-12  6175005  200.9 ns/op  10194.80 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/4096-12  3083400  390.6 ns/op  10487.27 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/8192-12  1551018  714.0 ns/op  11473.42 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/gobwas/16384-12  847084  1428 ns/op  11476.19 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/8-12  640919714  1.895 ns/op  4220.73 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/16-12  523854591  2.453 ns/op  6522.16 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/32-12  344619900  3.268 ns/op  9793.04 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/128-12  281670219  4.072 ns/op  31433.68 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/256-12  164968168  7.219 ns/op  35463.76 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/512-12  82934056  13.82 ns/op  37060.27 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/1024-12  48002257  22.96 ns/op  44599.52 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/2048-12  29191290  41.93 ns/op  48845.44 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/4096-12  14418003  84.95 ns/op  48215.55 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/8192-12  7101901  161.0 ns/op  50892.32 MB/s  0 B/op  0 allocs/op
Benchmark_mask/amd64/nbio/16384-12  3655984  353.4 ns/op  46365.54 MB/s  0 B/op  0 allocs/op
PASS
ok  nhooyr.io/websocket/internal/thirdparty  94.759s

Thanks again, @wdvxdr1123, and sorry for the long delay.
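A note on reading the MB/s column in the results above: Go's benchmark output derives it from the bytes declared per operation (via `b.SetBytes`) and the measured time, with 1 MB taken as 10^6 bytes. The sketch below is a minimal illustration under stated assumptions: `throughputMBps` is a hypothetical helper, and the plain XOR loop is a stand-in workload, not the benchmark code in this repository.

```go
package main

import (
	"fmt"
	"testing"
)

// throughputMBps converts a payload size and a per-operation time into
// megabytes per second (1 MB = 1e6 bytes). Bytes per nanosecond equals
// gigabytes per second, hence the factor of 1000.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp * 1e3
}

func main() {
	// Recompute one row from the final results: 16384 bytes at 167.6 ns/op.
	fmt.Printf("%.0f MB/s\n", throughputMBps(16384, 167.6))

	// Measure a simple byte-at-a-time XOR loop outside of `go test`.
	buf := make([]byte, 16384)
	key := [4]byte{0x12, 0x34, 0x56, 0x78}
	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(buf))) // enables the MB/s column
		for i := 0; i < b.N; i++ {
			for j := range buf {
				buf[j] ^= key[j&3]
			}
		}
	})
	fmt.Println(res)
}
```

The recomputed figure for the `wdvxdr1123-asm/16384-12` row comes out at about 97757 MB/s against the reported 97768.91 MB/s; the small gap is expected because the printed ns/op value is rounded.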

nhooyr merged commit 8a54c1b into coder:dev Feb 22, 2024
nhooyr added a commit to alixander/websocket that referenced this pull request Apr 7, 2024
Milestone
v1.9.0
4 participants
@wdvxdr1123 @nhooyr @dixyes @nightwolfz
