DEV Community
Assel Meher

Posted on • Originally published at segflow.github.io

My journey optimizing the Go Compiler

Original post: https://segflow.github.io/post/go-compiler-optimization/

At EDGE we write a lot of Go, and we love it for various reasons, one of them being speed. One day I got into a situation where I needed to assign an `int` to a variable based on another string value.

Sounds easy, right? Well, yes, but this particular use case awakened the beast in me and made me think about the *best* way to do it.

The journey ended with me contributing to the language compiler and making `map` lookups faster.

Situation

Our binaries come in 3 flavors: `amd64`, `arm64`, and `arm`. Sometimes a running binary needs to know its own architecture, for example when pulling images or other binaries: if the currently running binary is an `amd64` binary, then we should use the `amd64` repository or registry.

In Go that's easy. The constant `runtime.GOARCH` gives us the running program's architecture.

In one case, I needed to assign an `int` to a variable based on the value of `runtime.GOARCH`. The code below does exactly that:

```go
var archIndex int
switch runtime.GOARCH {
case "amd64":
	archIndex = 0
case "arm64":
	archIndex = 1
case "arm":
	archIndex = 2
}
```

But I didn't want it to be that way, because the day we support another architecture I would need to add another `case` clause, and that didn't feel right to me.

It's a simple value mapping, and I thought using a `map` followed by a lookup would be better. Below is the `map` based solution:

```go
archIndex := map[string]int{
	"amd64": 0,
	"arm":   1,
	"arm64": 2,
}[runtime.GOARCH]
```

Problem

The `map` based solution felt more readable and maintainable to me, but I was curious: which solution is faster?

The code is not in a hot path, and micro-optimizing it is not needed. But I still wanted to know which is faster.

To satisfy my curiosity, I benchmarked both approaches.

```
goos: darwin
goarch: amd64
BenchmarkMapImpl-4     19503195       58.0 ns/op
BenchmarkSwitchImpl-4  1000000000     0.648 ns/op
```

It turns out the `map` based solution is 96 times slower than the `switch` based one. To understand why, I started analyzing the generated code for both approaches.

Compiler generated code

Like any other language compiler, to generate the final output the Go compiler passes through various phases:

  • Scanning: scans the source code and splits it into *tokens*
  • Parsing: parses those *tokens* and builds the Abstract Syntax Tree (AST). It also checks that the code is valid Go code (type checking, etc.)
  • Code generation: converts the *AST* to a lower-level representation of the program, specifically into a Static Single Assignment (SSA) form

At the end of the *parsing* phase, we are certain the program is valid Go code. The interesting phase for our case is the last one.

The code generation phase takes the *AST*, applies some optimizations to the *AST* itself by rewriting it, and then converts it into an *SSA* form. After the initial version of the *SSA* has been generated, several optimization passes are applied, such as "dead code elimination", "constant propagation", and "bounds check elimination".

We can see the work of each optimizer and the final *SSA* for our function by running this command:

```shell
GOSSAFUNC=switchImplementation go tool compile benchmark_test.go
```

The command generates an HTML file, `ssa.html`, showing the generated *SSA* for the function `switchImplementation`.

`switch` based implementation

The final SSA form for our `switchImplementation` function looks like this:

```
00000 (8) TEXT "".switchImplementation(SB), ABIInternal
00001 (8) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
00002 (8) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
00003 (8) FUNCDATA $2, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
00004 (+11) PCDATA $0, $0
00005 (+11) PCDATA $1, $0
00006 (+11) MOVQ $0, "".~r0(SP)
00007 (11) RET
00008 (?) END
```

The first block is the function prologue, where mainly a stack frame needs to be allocated. The second one is the body, and the final block is the function epilogue, where the function returns to its caller.

The function body in our case is a single move instruction, which moves 0 into the return slot `~r0`. So the function only returns 0 immediately; there is nothing else. To confirm this, I generated the SSA for the following function:

```go
func return0() int {
	return 0
}
```

And the final generated code is exactly the same, as you can see here. That's why it's so fast.

`map` based implementation

As for the SSA form of the `mapImplementation` function, it's longer. I annotated it so it's easier to understand what's happening.

```
00000 (31) TEXT "".mapImplementation(SB), ABIInternal
00001 (31) FUNCDATA $0, gclocals·7d2d5fca80364273fb07d5820a76fef4(SB)
00002 (31) FUNCDATA $1, gclocals·b9237f7ca55cc8bf6e05646631ad00ce(SB)
00003 (31) FUNCDATA $2, gclocals·a5ed3e65458aadaa1d48863859d2a323(SB)
00004 (31) FUNCDATA $3, "".mapImplementation.stkobj(SB)
00005 (+32) PCDATA $0, $0
00006 (+32) PCDATA $1, $1
00007 (+32) XORPS X0, X0
00008 (32) MOVUPS X0, ""..autotmp_2-256(SP)
00009 (32) MOVUPS X0, ""..autotmp_2-240(SP)
00010 (32) MOVUPS X0, ""..autotmp_2-224(SP)
00011 (32) PCDATA $0, $1
00012 (32) PCDATA $1, $2
00013 (32) LEAQ ""..autotmp_3-208(SP), DI
00014 (32) PCDATA $0, $0
00015 (32) LEAQ -48(DI), DI
00016 (32) DUFFZERO $239
00017 (32) PCDATA $0, $2
00018 (32) PCDATA $1, $1
00019 (32) LEAQ ""..autotmp_3-208(SP), AX
00020 (32) PCDATA $0, $0
00021 (32) MOVQ AX, ""..autotmp_2-240(SP)
00022 (32) CALL runtime.fastrand(SB)
00023 (32) MOVL (SP), AX
00024 (32) MOVL AX, ""..autotmp_2-244(SP)
00025 (33) PCDATA $0, $2
00026 (+33) LEAQ type.map[string]int(SB), AX
00027 (33) PCDATA $0, $0
00028 (33) MOVQ AX, (SP)
00029 (33) PCDATA $0, $3
00030 (33) LEAQ ""..autotmp_2-256(SP), CX
00031 (33) PCDATA $0, $0
00032 (33) MOVQ CX, 8(SP)
00033 (33) PCDATA $0, $4
00034 (33) LEAQ go.string."amd64"(SB), DX
00035 (33) PCDATA $0, $0
00036 (33) MOVQ DX, 16(SP)
00037 (33) MOVQ $5, 24(SP)
00038 (+33) CALL runtime.mapassign_faststr(SB)    // assign "amd64" key
00039 (33) PCDATA $0, $2
00040 (33) MOVQ 32(SP), AX
00041 (33) PCDATA $0, $0
00042 (33) MOVQ $0, (AX)                          // assign "0" value
00043 (34) PCDATA $0, $2
00044 (+34) LEAQ type.map[string]int(SB), AX
00045 (34) PCDATA $0, $0
00046 (34) MOVQ AX, (SP)
00047 (34) PCDATA $0, $3
00048 (34) LEAQ ""..autotmp_2-256(SP), CX
00049 (34) PCDATA $0, $0
00050 (34) MOVQ CX, 8(SP)
00051 (34) PCDATA $0, $4
00052 (34) LEAQ go.string."arm"(SB), DX
00053 (34) PCDATA $0, $0
00054 (34) MOVQ DX, 16(SP)
00055 (34) MOVQ $3, 24(SP)
00056 (+34) CALL runtime.mapassign_faststr(SB)    // assign "arm" key
00057 (34) PCDATA $0, $2
00058 (34) MOVQ 32(SP), AX
00059 (34) PCDATA $0, $0
00060 (34) MOVQ $1, (AX)                          // assign "1" value
00061 (35) PCDATA $0, $2
00062 (+35) LEAQ type.map[string]int(SB), AX
00063 (35) PCDATA $0, $0
00064 (35) MOVQ AX, (SP)
00065 (35) PCDATA $0, $3
00066 (35) LEAQ ""..autotmp_2-256(SP), CX
00067 (35) PCDATA $0, $0
00068 (35) MOVQ CX, 8(SP)
00069 (35) PCDATA $0, $4
00070 (35) LEAQ go.string."arm64"(SB), DX
00071 (35) PCDATA $0, $0
00072 (35) MOVQ DX, 16(SP)
00073 (35) MOVQ $5, 24(SP)
00074 (+35) CALL runtime.mapassign_faststr(SB)    // assign "arm64" key
00075 (35) PCDATA $0, $2
00076 (35) MOVQ 32(SP), AX
00077 (35) PCDATA $0, $0
00078 (35) MOVQ $2, (AX)                          // assign "2" value
00079 (36) PCDATA $0, $2
00080 (+36) LEAQ type.map[string]int(SB), AX
00081 (36) PCDATA $0, $0
00082 (36) MOVQ AX, (SP)
00083 (36) PCDATA $0, $2
00084 (36) PCDATA $1, $0
00085 (36) LEAQ ""..autotmp_2-256(SP), AX
00086 (36) PCDATA $0, $0
00087 (36) MOVQ AX, 8(SP)
00088 (36) PCDATA $0, $2
00089 (36) LEAQ go.string."amd64"(SB), AX
00090 (36) PCDATA $0, $0
00091 (36) MOVQ AX, 16(SP)
00092 (36) MOVQ $5, 24(SP)
00093 (+36) CALL runtime.mapaccess1_faststr(SB)  // perform the map lookup
00094 (36) PCDATA $0, $2
00095 (36) MOVQ 32(SP), AX
00096 (36) PCDATA $0, $0
00097 (36) MOVQ (AX), AX
00098 (+32) MOVQ AX, "".~r0(SP)
00099 (+36) RET
00100 (?) END
```

The reason is that the generated code builds the map at runtime, which requires allocating it and assigning each value, before finally doing the lookup.

Constant folding

The reason the `switch` implementation is equivalent to a `return 0` is something called *constant folding*.

> Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime. — Wikipedia

We know that `runtime.GOARCH` is a constant, so not only can its value not change, it is also known at compile time. The compiler can use these two properties to evaluate constant expressions at compile time instead of at runtime. In our case, the compiler knew which of the `case` clauses was true, so it deleted the conditional structure and replaced it with a bare `return 0`.

This was not the case for the `map` based implementation.

Implementing the optimization

Our map lookup looks like this:

```go
map[string]int{
	"amd64": 0,
	"arm":   1,
	"arm64": 2,
}[runtime.GOARCH]
```

This is represented in the AST by an `INDEXMAP` node. The `INDEXMAP` node has two children, `left` and `right` (remember, it's a tree).

The `left` child is the map we look up from, and the `right` child is the key we are looking for. Both children are also nodes; for example, the `right` node can be a `FUNCCALL` node for a lookup like this:

```go
map[string]int{
	"amd64": 0,
	"arm":   1,
	"arm64": 2,
}[aRandomFunc()]
```

At compile time, we can check whether both the `right` and `left` nodes are constant. If they are, we check whether the key we are looking up is defined in the constant map, and if so, we replace the `INDEXMAP` node in the AST with the value of that key. This replaces every lookup where the map is an `OMAPLIT` and the key is a constant with a constant, whenever possible.

This optimization is applied directly to the AST, not to the SSA form. This type of AST optimization is implemented inside the `walk` function.

The PR with this optimization can be seen here: https://go-review.googlesource.com/c/go/+/208323

The new generated SSA with that optimization can be found here.

Now if we benchmark both implementations using the Go compiler from that branch, we see that they perform the same. Both are on par with our `return 0` function.

```
BenchmarkSwitchImpl-4   1000000000   0.599 ns/op   0 B/op   0 allocs/op
BenchmarkMapImpl-4      1000000000   0.612 ns/op   0 B/op   0 allocs/op
```

Conclusion

The PR is not merged yet, hopefully soon; it was added to the Go 1.15 milestone, which should be released in a month.

Huge thanks to everyone in the #compilers channel in the Gophers Slack.

Top comments

**Craig McIlwrath:**
I was looking at the comments on that pull request and the discussion about side effects in the map caught my eye. If I understood it right, it wasn't an issue because this optimization doesn't happen in that case.

Did you give any thought to applying a similar optimization when only the index is constant? Something like evaluating all of the values in the map and then returning the proper value based on the constant index?

**Assel Meher:**
That's actually an idea I thought of when I read the reviewer's comment, and it looks like a good one. I also believe that only function calls (like the example given in the PR) can have side effects; there are probably others, but I can't think of them right now.

**xzf:**
```go
func mapImplementation() int {
	return map[string]int{
		"amd64": 0,
		"arm":   1,
		"arm64": 2,
	}[runtime.GOARCH]
}
```

What do you think?

**Assel Meher:**
That's the function used to benchmark the single lookup with a constant key on a map literal, and compared to the `switch` one, yes, it's 96 times slower (on my Mac, at least).
As I already said, optimizing that is somewhat unnecessary. But with that PR, this function becomes equivalent to `func mapImplementation() int { return 0 }`. This means there is no map allocation at all.

**xzf:**
So it allocates a map each time?

96 times slower!!!! lol

**Freedom:**
Looking at the current status, there haven't been any replies since Matthew Dempsky's comment?
