
I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python, that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.

The "Go is going to be faster" idea was supported by a trivial test:

```python
# test.py
print("Hello world")
```

```
$ time python dummy.py
Hello world

real    0m0.029s
user    0m0.019s
sys     0m0.010s
```

```go
// test.go
package main

import "fmt"

func main() {
    fmt.Println("hello world")
}
```

```
$ time ./test
hello world

real    0m0.001s
user    0m0.001s
sys     0m0.000s
```

Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:

```
$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174
```

However, my next contrived test was how fast is Go when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:

```python
# test2.py
s = ""
for i in range(1000000):
    s += "a"
```

```
$ time python test2.py

real    0m0.179s
user    0m0.145s
sys     0m0.013s
```

```go
// test2.go
package main

func main() {
    s := ""
    for i := 0; i < 1000000; i++ {
        s += "a"
    }
}
```

```
$ time ./test2

real    0m56.840s
user    1m50.836s
sys     0m17.653s
```

So Go is hundreds of times slower than Python.

Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in i (making i 10 times bigger leads to a 100 times slowdown).

However, the Python implementation seems much faster: 10 times more loops only slows it down by a factor of two. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 1000000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.

Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, y'know, think about things and do my homework)?

Or have I run into a case where Python happens to work well, but falls apart under more complex use?

Versions used: Python 3.8.2, Go 1.14.2

asked Apr 30, 2020 at 8:51 by diwhyyyyy
  • Take a look at this question for details on more efficient string concatenation in Go. This type of benchmark tends to rely mostly on the allocation strategy for the underlying array(s). – Apr 30, 2020 at 9:00
  • Also, Python's way should be: ''.join('a' for _ in range(n)) – Apr 30, 2020 at 9:03
  • Go has counted strings (strings are implemented as a two-element header with length and data). The real heart of the problem is an implementation detail: Go never expands a string "in place", as it does not have a ref-counting garbage collector; Python over-allocates on purpose (to handle this kind of dumb thing), does have a reference-counting garbage collector, and can see that the refcount on the underlying string is 1 and expand the string in place. – Apr 30, 2020 at 10:11
  • In essence, what you've demonstrated is that if you invoke the Python memory allocator roughly 20 times (log2(1000000)), it's slower than invoking the Go memory allocator roughly 1000000 times. – Apr 30, 2020 at 10:14
  • @torek that's the perfect answer to the question and reveals an important implementation detail that could be critical if a Go port is to be of any benefit over a "naive" Python program. – Apr 30, 2020 at 10:31

2 Answers


TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).

To expand my comments into a real answer:

  • Both Go and Python have counted strings, i.e., strings are implemented as a two-element header containing a length (byte count or, for Python 3 strings, Unicode character count) and a data pointer.

  • Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.

  • But the underlying implementations differ in one particular, important way: the version of Python you are using has a reference-counting GC. The Go system you are using does not.

With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go), although the actual Python implementation is in C and I have not made all the details line up properly:

```go
// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need = s.len + t.len
    if s.refcount == 1 { // can modify string in place
        data = s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need
    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string
    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}
```

By over-allocating (rounding up the `need` value so that `cap(new_s)` is large), we get about log2(n) calls to the allocator, where n is the number of times you do `s += "a"`. With n being 1000000 (one million), that's only about 20 times that we actually have to invoke the `make_new_string` function and release (for GC purposes, because the collector uses refcounts as a first pass) the old strings.
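The log2(n) claim is easy to check with a toy model. This small simulation (my own sketch, not CPython's actual growth policy) counts allocator calls when the capacity roughly doubles on each reallocation, versus when every reallocation is an exact fit:

```go
package main

import "fmt"

// allocCalls counts how many times an allocator must be invoked to grow
// a string to n bytes one byte at a time, if each reallocation sets the
// new capacity to length*growNum/growDen.
func allocCalls(n, growNum, growDen int) int {
	calls, capacity, length := 0, 0, 0
	for i := 0; i < n; i++ {
		length++
		if length > capacity {
			capacity = length * growNum / growDen
			if capacity < length {
				capacity = length
			}
			calls++
		}
	}
	return calls
}

func main() {
	fmt.Println("doubling: ", allocCalls(1000000, 2, 1)) // about 20 allocator calls
	fmt.Println("exact fit:", allocCalls(1000000, 1, 1)) // 1000000 allocator calls
}
```

The doubling strategy is what makes the amortized cost of each append constant; exact-fit allocation is what makes the naive loop quadratic.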

[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling, but still a multiplicative increase. To other readers, see the comment below.]

The current Go implementation allocates strings without a separate capacity header field (see `reflect.StringHeader`, and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount (we can't tell, in the runtime routine that adds two strings, that the target has only one reference) and the inability to observe the equivalent of `cap(s)` (or `cap(s.data)`), the Go runtime has to create a new string every time. That's one million memory allocations.

To show that the Python code really does use the refcount, take your original Python:

```python
s = ""
for i in range(1000000):
    s += "a"
```

and add a second variablet like this:

```python
s = ""
t = s
for i in range(1000000):
    s += "a"
    t = s
```

The difference in execution time is impressive:

```
$ time python test2.py
        0.68 real         0.65 user         0.03 sys
$ time python test3.py
       34.60 real        34.08 user         0.51 sys
```

The modified Python program still beats Go (1.13.5) on this same system:

```
$ time ./test2
       67.32 real       103.27 user        13.60 sys
```

and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The reference-counting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on `t` drops to 1, and then the next assignment to `t` drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.

(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])

answered Apr 30, 2020 at 11:32 by torek

2 Comments

That's a really helpful insight. I was inspired to do some digging into CPython: it looks like PyByteArray_Resize will over-allocate up to 12.5% of the length at a time; otherwise it allocates the exact amount. This looks to have come in in 2008 (2c9c7a5f33d, in Python 2.7.4rc1) when stringobject.c was renamed to bytearrayobject.c. Assuming that's how it works, log1.125(1e6) is about 117 allocations.
Ah, so deliberate over-allocation was a relatively late 2.7 addition, then. I was pretty sure it wasn't always this way.
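The 12.5% figure above is easy to sanity-check with a simulation (my own sketch of the growth rule the comment describes, not the actual PyByteArray_Resize code):

```go
package main

import "fmt"

// calls125 counts allocator invocations when growing a buffer to n bytes
// one byte at a time, over-allocating by ~12.5% on each reallocation.
func calls125(n int) int {
	calls, capacity := 0, 0
	for length := 1; length <= n; length++ {
		if length > capacity {
			capacity = length + length/8 // ~12.5% extra slack
			calls++
		}
	}
	return calls
}

func main() {
	// On the order of 100 allocator calls, versus 1000000 without any
	// over-allocation: far more than with doubling, but still O(log n)-ish.
	fmt.Println(calls125(1000000))
}
```

The simulated count comes out close to the back-of-the-envelope log1.125(1e6) estimate; integer truncation for small lengths adds a handful of extra calls.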

Well. You should never, ever use string concatenation in this way :-)

In Go, try `strings.Builder`:

```go
package main

import (
    "strings"
)

func main() {
    var b1 strings.Builder
    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}
```
answered Apr 30, 2020 at 9:05 by Herbert Poul

2 Comments

Well, this is nice, but an explanation (and a link) for why "you should never use this" would make it much better!
`strings.Builder` should be used because it is much faster for the use case of appending strings repeatedly.
