Strings 2 - Unicode
Transcript
Be aware that a byte does not necessarily correspond to a character. If strings in Go contain human-readable text, then it is supposed to be Unicode text, encoded as UTF-8.
I won’t go into the details of Unicode here, but there is a minimum of knowledge about Unicode that you need in order to understand what’s going on when operating on strings.
Let’s start with a byte. A byte can hold a value from 0 to 255. This is more than enough to include all English letters, all punctuation marks, all Arabic digits, and there is even space for more. The ASCII character set, which uses only the values from 0 to 127, was an early standard for encoding English text.
However, if you add all German umlauts, all French accents, and all of the special characters and character decorations found in many European languages from Irish to Romanian, from Finnish to Hungarian, then all these would not fit into one byte any more.
Then consider all Asian scripts - Chinese, Japanese, Korean, and so on - and you see where this is going. To manage all this, the Unicode standard was developed. The goal was to assign a unique number to each character of each of the written languages on Earth. Currently, the Unicode standard contains more than 128,000 characters.
So to store all possible characters of the Unicode standard, we would need three bytes per character, or better four bytes, as this fits our processor architectures better. However, this would be an enormous waste of memory and storage space for English text, which is, especially when it comes to programming, still quite commonly used.
A good compromise is a variable-length encoding. UTF-8 is such an encoding. ASCII characters between 0 and 127 are still represented by one single byte in UTF-8, whereas other characters may consume up to four bytes.
fmt.Println(len("a")) // 1
fmt.Println(len("ä")) // 2
fmt.Println(len("走")) // 3
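To make the variable-length encoding visible, here is a small sketch that prints the actual bytes behind each of these characters. The helper name utf8Bytes is made up for this illustration, and the sketch assumes that "ä" is stored as the single code point U+00E4.

```go
package main

import "fmt"

// utf8Bytes returns the bytes of s as space-separated hex, e.g. "c3 a4".
// (Hypothetical helper, only for this illustration.)
func utf8Bytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	for _, s := range []string{"a", "ä", "走"} {
		fmt.Printf("%s: %d byte(s): %s\n", s, len(s), utf8Bytes(s))
	}
	// Output:
	// a: 1 byte(s): 61
	// ä: 2 byte(s): c3 a4
	// 走: 3 byte(s): e8 b5 b0
}
```

The ASCII letter keeps its one-byte ASCII value, while the other two characters are spread over two and three bytes.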
To distinguish between pure ASCII characters and Unicode characters, a single Unicode character (more precisely, a single Unicode code point) is called a rune in Go.
Back to Go: How can we operate on character level rather than on byte level?
First, there is the range operator, which, as we have seen in the lesson about loops, iterates over runes rather than over bytes.
This loop shows the behavior of the range operator again:
for i, v := range "aä走." {
	fmt.Println(i, v, string(v))
}
v is of type rune, which is an alias for int32. So even if the character within the string consumes only one byte, the range operator returns it as a four-byte value.
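For comparison, the following sketch walks over the same string once byte by byte and once with range. The helper runeStarts is a hypothetical name introduced here; it only collects the index values that range reports.

```go
package main

import "fmt"

// runeStarts returns the byte offsets at which range reports a new rune in s.
// (Hypothetical helper, only for this illustration.)
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "aä走."

	// Indexing with s[i] yields single bytes, so this loop runs len(s) = 7 times.
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %#x\n", i, s[i])
	}

	// range advances rune by rune, so the index skips over multi-byte characters.
	fmt.Println(runeStarts(s)) // [0 1 3 6]
}
```

Note that the index delivered by range is still a byte offset; it simply jumps ahead by the encoded length of each rune.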
Many standard operations also work fine with Unicode. For example, testing if a string is a substring of another string requires no special handling of Unicode characters. If the string is a substring at byte level, then it also is a substring at UTF-8 level.
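A minimal sketch of this, assuming both strings are valid UTF-8: the hypothetical helper findSub does a plain byte-level search with strings.Index and then checks, using utf8.RuneStart from the unicode/utf8 package, that the match begins at the start of a character.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// findSub returns the byte offset of sub in s (-1 if absent) and reports
// whether that offset is the first byte of a rune.
// (Hypothetical helper, only for this illustration.)
func findSub(s, sub string) (offset int, onBoundary bool) {
	offset = strings.Index(s, sub)
	if offset < 0 || offset >= len(s) {
		return offset, false
	}
	return offset, utf8.RuneStart(s[offset])
}

func main() {
	s := "aä走."
	fmt.Println(strings.Contains(s, "走.")) // true
	fmt.Println(findSub(s, "走."))          // 3 true
	fmt.Println(findSub(s, "x"))           // -1 false
}
```

The offset 3 is a byte position, not a character count: "a" takes one byte and "ä" takes two.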
Finally, the Go standard library contains packages that handle UTF-8 strings and runes. For example:
The strings package contains functions that come in variants for working at byte level or at rune level, such as
// in package strings
func IndexByte(s string, c byte) int
and
// in package strings
func IndexRune(s string, r rune) int
Note that the second parameters, c and r, respectively, each represent a single byte or rune rather than a string. Use single quotes here when passing a character literal; for example,
idx := strings.IndexByte(s, 'a')
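Here is a short sketch of both functions side by side. The wrapper positions is a hypothetical helper for this illustration; both library functions return a byte offset, or -1 if nothing is found.

```go
package main

import (
	"fmt"
	"strings"
)

// positions reports where byte c and rune r first occur in s,
// both as byte offsets (-1 if absent).
// (Hypothetical helper, only for this illustration.)
func positions(s string, c byte, r rune) (byteIdx, runeIdx int) {
	return strings.IndexByte(s, c), strings.IndexRune(s, r)
}

func main() {
	s := "aä走."
	fmt.Println(positions(s, '.', '走')) // 6 3

	// 'ä' is the constant 228, which fits into a byte, so this compiles.
	// But no single byte of s has that value, because "ä" is stored as c3 a4.
	fmt.Println(strings.IndexByte(s, 'ä')) // -1
}
```

So for anything outside the ASCII range, IndexRune is the function to reach for.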
The unicode/utf8 package (package name utf8) contains various functions for working with strings that contain runes. For example, the equivalent to len(s) at rune level is
// in package unicode/utf8
func RuneCountInString(s string) (n int)
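As a quick sketch, the two counts can be compared directly. The helper name sizes is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the length of s in bytes and in runes.
// (Hypothetical helper, only for this illustration.)
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(sizes("aä走.")) // 7 4
	fmt.Println(sizes("abc."))  // 4 4
}
```

For pure ASCII text both numbers are the same; they only drift apart once multi-byte characters appear.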
To summarize
- Go treats text as Unicode encoded as UTF-8.
- Some characters in Unicode require more than one byte of storage.
- Data type rune represents a single Unicode character. It is equivalent to int32.
- The len function as well as the index and the slice operations work at byte level, whereas the range operator works at character level.
- The strings and unicode/utf8 packages provide functions for dealing with runes.
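One consequence of these points, sketched below: slicing at an arbitrary byte position can cut a character in half. Converting to a slice of runes first is one way around that, at the cost of copying the string. The helper firstRunes is hypothetical and assumes that s is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRunes returns the first n runes of s, or all of s if it is shorter.
// (Hypothetical helper; assumes s is valid UTF-8.)
func firstRunes(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	s := "aä走."

	cut := s[:2] // byte-level slice: "a" plus only the first byte of "ä"
	fmt.Println(utf8.ValidString(cut)) // false

	fmt.Println(firstRunes(s, 2)) // aä
}
```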
Links
Effective Go: For (especially the paragraph on range and Unicode)
Language reference: Rune literals
Package documentation: strings
Package documentation: unicode/utf8
Tip: An excellent introduction to Unicode can be found here (or in the Web archive)
Another tip: package golang.org/x/text and its subpackages contain functionality around internationalization and localization; some useful, others low-level stuff for other packages. For Unicode handling, you might find the golang.org/x/text/unicode package and its subpackages useful.