The Go Blog
Strings, bytes, runes and characters in Go
Rob Pike
23 October 2013
Introduction
The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.
One way to approach this topic is to think of it as an answer to the frequently asked question, “When I index a Go string at position n, why don’t I get the nth character?” As you’ll see, this question leads us to many details about how text works in the modern world.
An excellent introduction to some of these issues, independent of Go, is Joel Spolsky’s famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Many of the points he raises will be echoed here.
What is a string?
Let’s start with some basics.
In Go, a string is in effect a read-only slice of bytes. If you’re at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we’ll assume here that you have.
It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.
Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
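To make the “read-only slice of bytes” description concrete, here is a minimal sketch. The helper name mutableCopy is an assumption made for illustration; it only wraps the standard conversion to []byte, which copies the data.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// mutableCopy is a hypothetical helper: converting a string to []byte
// copies the data, so the result can be modified freely.
func mutableCopy(s string) []byte {
	return []byte(s)
}

func main() {
	fmt.Println(len(sample))      // length in bytes: 8
	fmt.Printf("%x\n", sample[0]) // first byte: bd

	b := mutableCopy(sample)
	b[0] = 'A'
	fmt.Printf("%x %x\n", b[0], sample[0]) // 41 bd: the string is unchanged
}
```

An assignment such as s[0] = 'A' on a string variable s does not compile, which is the sense in which a string is read-only.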
Printing strings
Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce ugly output. The simple print statement
fmt.Println(sample)
produces this mess (whose exact appearance varies with the environment):
��=� ⌘
To find out what that string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and pull out the bytes individually, as in this for loop:
for i := 0; i < len(sample); i++ {
	fmt.Printf("%x ", sample[i])
}
As implied up front, indexing a string accesses individual bytes, not characters. We’ll return to that topic in detail below. For now, let’s stick with just the bytes. This is the output from the byte-by-byte loop:
bd b2 3d bc 20 e2 8c 98
Notice how the individual bytes match the hexadecimal escapes that defined the string.
A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte.
fmt.Printf("%x\n", sample)
Compare its output to that above:
bdb23dbc20e28c98
A nice trick is to use the “space” flag in that format, putting a space between the % and the x. Compare the format string used here to the one above,
fmt.Printf("% x\n", sample)
and notice how the bytes come out with spaces between, making the result a little less imposing:
bd b2 3d bc 20 e2 8c 98
There’s more. The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous.
fmt.Printf("%q\n", sample)
This technique is handy when much of the string is intelligible as text but there are peculiarities to root out; it produces:
"\xbd\xb2=\xbc ⌘"
If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish “Place of Interest” symbol. That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hex value 20): e2 8c 98.
If we are unfamiliar with or confused by strange values in the string, we can use the “plus” flag to the %q verb. This flag causes the output to escape not only non-printable sequences, but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:
fmt.Printf("%+q\n", sample)
With that format, the Unicode value of the Swedish symbol shows up as a \u escape:
"\xbd\xb2=\xbc \u2318"
These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It’s worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.
Here’s the full set of printing options we’ve listed, presented as a complete program you can run (and edit) right in the browser:
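As a small check of the claim about byte slices, the following sketch compares the output of each Printf verb for a string and for its []byte conversion. The helper sameOutput is a hypothetical name introduced here. The check covers the Printf verbs only: a plain fmt.Println of a byte slice prints its elements as numbers rather than as text.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// sameOutput is a hypothetical helper. It reports whether formatting with
// the given verb produces identical text for a string and for a byte slice
// holding the same bytes.
func sameOutput(format, s string) bool {
	return fmt.Sprintf(format, s) == fmt.Sprintf(format, []byte(s))
}

func main() {
	for _, format := range []string{"%x", "% x", "%q", "%+q"} {
		fmt.Printf("%-4s identical for string and []byte: %t\n",
			format, sameOutput(format, sample))
	}
}
```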
// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fmt.Println("Println:")
	fmt.Println(sample)

	fmt.Println("Byte loop:")
	for i := 0; i < len(sample); i++ {
		fmt.Printf("%x ", sample[i])
	}
	fmt.Printf("\n")

	fmt.Println("Printf with %x:")
	fmt.Printf("%x\n", sample)

	fmt.Println("Printf with % x:")
	fmt.Printf("% x\n", sample)

	fmt.Println("Printf with %q:")
	fmt.Printf("%q\n", sample)

	fmt.Println("Printf with %+q:")
	fmt.Printf("%+q\n", sample)
}
[Exercise: Modify the examples above to use a slice of bytes instead of a string. Hint: Use a conversion to create the slice.]
[Exercise: Loop over the string using the %q format on each byte. What does the output tell you?]
UTF-8 and string literals
As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let’s look at a more controlled example to see how that happens.
Here’s a simple program that prints a string constant with a single character three different ways, once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a “raw string”, enclosed by back quotes, so it can contain only literal text. (Regular strings, enclosed by double quotes, can contain escape sequences as we showed above.)
// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
	const placeOfInterest = `⌘`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", placeOfInterest)
	fmt.Printf("\n")

	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", placeOfInterest)
	fmt.Printf("\n")

	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}
	fmt.Printf("\n")
}
The output is:
plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
which reminds us that the Unicode character value U+2318, the “Place of Interest” symbol ⌘, is represented by the bytes e2 8c 98, and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.
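For readers curious how the value 2318 turns into those three bytes, here is a minimal sketch of the three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). The helper encode3 is a hypothetical name introduced for illustration and handles only code points in the range U+0800 through U+FFFF; the result is cross-checked against utf8.EncodeRune from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 is a hypothetical helper that hand-encodes a code point in the
// range U+0800..U+FFFF as three UTF-8 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
// It does not handle other ranges or reject surrogates.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),     // marker 1110 plus the top 4 bits
		0x80 | byte(r>>6)&0x3F, // marker 10 plus the middle 6 bits
		0x80 | byte(r)&0x3F,    // marker 10 plus the low 6 bits
	}
}

func main() {
	manual := encode3(0x2318)
	fmt.Printf("% x\n", manual[:]) // e2 8c 98

	// Cross-check against the standard library encoder.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x2318)
	fmt.Printf("% x\n", buf[:n]) // e2 8c 98
}
```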
It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it’s worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.
Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text
`⌘`
the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.
In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.
Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.
To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
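The distinction can be checked with utf8.ValidString from the standard library. The sketch below is illustrative only; describe is a hypothetical helper, and the three strings are examples chosen here: a raw literal, a literal with byte-level escapes, and a string assembled from bytes at run time.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that labels a string by whether its
// bytes form valid UTF-8.
func describe(s string) string {
	if utf8.ValidString(s) {
		return "valid UTF-8"
	}
	return "arbitrary bytes (not valid UTF-8)"
}

func main() {
	raw := `⌘`                          // raw literal: holds the UTF-8 source text
	escaped := "\xbd\xb2"               // byte-level escapes break UTF-8
	built := string([]byte{0xe2, 0x8c}) // truncated encoding assembled at run time

	for _, s := range []string{raw, escaped, built} {
		fmt.Printf("%+q: %s\n", s, describe(s))
	}
}
```

Only the first string is reported as valid UTF-8; the other two hold bytes that do not decode to complete code points.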
Code points, characters, and runes
We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)
To pick a more prosaic example, the Unicode code point U+0061 is the lowercase Latin letter ‘A’: a.
But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
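The two representations of à can be compared directly. This is a minimal sketch; the constant names and the summary helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	precomposed = "\u00e0"  // the single code point U+00E0
	combining   = "a\u0300" // U+0061 followed by the combining grave accent U+0300
)

// summary is a hypothetical helper returning the length of s in bytes and
// in code points.
func summary(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{precomposed, combining} {
		b, r := summary(s)
		fmt.Printf("%+q: %d bytes (% x), %d code points\n", s, b, s, r)
	}
	// The strings render as the same character but are not equal,
	// because string comparison works on bytes.
	fmt.Println(precomposed == combining) // false
}
```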
The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.
“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.
The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression
'⌘'
is rune with integer value 0x2318.
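A short sketch shows the alias in action. The function codePoint is a hypothetical helper; it returns its rune argument as an int32 without any conversion, which compiles only because the two names denote the same type.

```go
package main

import "fmt"

// codePoint is a hypothetical helper. Because rune is an alias for int32,
// the value can be returned as int32 with no conversion.
func codePoint(r rune) int32 {
	return r
}

func main() {
	r := '⌘' // a rune constant; r gets the type rune

	fmt.Printf("%T\n", r)             // int32
	fmt.Printf("%#x\n", codePoint(r)) // 0x2318
	fmt.Printf("%c %U\n", r, r)       // ⌘ U+2318
}
```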
To summarize, here are the salient points:
- Go source code is always UTF-8.
- A string holds arbitrary bytes.
- A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
- Those sequences represent Unicode code points, called runes.
- No guarantee is made in Go that characters in strings are normalized.
Range loops
Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.
We’ve seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. Here’s an example using yet another handy Printf format, %#U, which shows the code point’s Unicode value and its printed representation:
// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
	const nihongo = "日本語"
	for index, runeValue := range nihongo {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}
The output shows how each code point occupies multiple bytes:
U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) What happens to the iterations of the loop?]
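The range loop also suggests one way to answer the question from the introduction. The sketch below contrasts byte indexing with a lookup by code point; nthRune is a hypothetical helper, not a standard library function, and it counts code points rather than user-perceived characters.

```go
package main

import "fmt"

// nthRune is a hypothetical helper. It returns the n-th code point of s
// (counting from zero) and the byte offset where it starts. ok is false
// if s holds fewer than n+1 code points.
func nthRune(s string, n int) (r rune, offset int, ok bool) {
	count := 0
	for i, c := range s {
		if count == n {
			return c, i, true
		}
		count++
	}
	return 0, 0, false
}

func main() {
	const nihongo = "日本語"

	// Indexing yields a byte from the middle of the first code point.
	fmt.Printf("%x\n", nihongo[1]) // 97

	// Walking the string by code point finds the second one.
	if r, off, ok := nthRune(nihongo, 1); ok {
		fmt.Printf("%c at byte %d\n", r, off) // 本 at byte 3
	}
}
```

The lookup takes time proportional to the position, since each preceding code point has to be decoded to find where the next one starts.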
Libraries
Go’s standard library provides strong support for interpreting UTF-8 text. If a for range loop isn’t sufficient for your purposes, chances are the facility you need is provided by a package in the library.
The most important such package is unicode/utf8, which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. Here is a program equivalent to the for range example above, but using the DecodeRuneInString function from that package to do the work. The return values from the function are the rune and its width in UTF-8-encoded bytes.
// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	const nihongo = "日本語"
	for i, w := 0, 0; i < len(nihongo); i += w {
		runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
		fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
		w = width
	}
}
Run it to see that it performs the same. The for range loop and DecodeRuneInString are defined to produce exactly the same iteration sequence.
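The equivalence can be checked mechanically. The following sketch records the (index, rune) pairs produced by each approach and compares them for a few inputs, including the sample string with invalid bytes from earlier in the post. The type step and the functions viaRange, viaDecode, and sameSequence are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// step records one iteration: the byte index and the decoded rune.
type step struct {
	index int
	r     rune
}

// viaRange collects the iteration sequence of a for range loop.
func viaRange(s string) []step {
	var steps []step
	for i, r := range s {
		steps = append(steps, step{i, r})
	}
	return steps
}

// viaDecode collects the sequence produced by repeated calls to
// utf8.DecodeRuneInString.
func viaDecode(s string) []step {
	var steps []step
	for i := 0; i < len(s); {
		r, width := utf8.DecodeRuneInString(s[i:])
		steps = append(steps, step{i, r})
		i += width
	}
	return steps
}

// sameSequence reports whether both approaches agree on s.
func sameSequence(s string) bool {
	a, b := viaRange(s), viaDecode(s)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	inputs := []string{"日本語", "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98", ""}
	for _, s := range inputs {
		fmt.Printf("%+q: same sequence: %t\n", s, sameSequence(s))
	}
}
```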
Look at the documentation for the unicode/utf8 package to see what other facilities it provides.
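As a small taste of those facilities, the sketch below uses ValidString, RuneCountInString, RuneLen, and DecodeLastRuneInString from unicode/utf8. The function reverseRunes is a hypothetical helper written for this example; it assumes valid UTF-8 input and reverses code points, not user-perceived characters, so combining sequences would end up with their accents misplaced.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// reverseRunes is a hypothetical helper that returns s with its code
// points in reverse order, decoding from the end of the string.
// It assumes s is valid UTF-8.
func reverseRunes(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		b.WriteRune(r)
		s = s[:len(s)-size]
	}
	return b.String()
}

func main() {
	const nihongo = "日本語"

	fmt.Println(utf8.ValidString(nihongo))       // true
	fmt.Println(len(nihongo))                    // 9 bytes
	fmt.Println(utf8.RuneCountInString(nihongo)) // 3 code points
	fmt.Println(utf8.RuneLen('語'))               // 3 bytes to encode this rune
	fmt.Println(reverseRunes(nihongo))           // 語本日
}
```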
Conclusion
To answer the question posed at the beginning: Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.
There’s much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.