
  Strings 2 - Unicode

Transcript

Be aware that a byte does not necessarily correspond to a character. If a string in Go contains human-readable text, then that text is supposed to be Unicode text, encoded as UTF-8.

I won’t go into the details of Unicode here, but there is a minimum of knowledge about Unicode that you need in order to understand what’s going on when operating on strings.

Let’s start with a byte. A byte can hold a value from 0 to 255. This is more than enough to include all English letters, all punctuation marks, all Arabic digits, and there is even space for more. The ASCII character set, an early standard for encoding English text, uses only the values from 0 to 127.

However, if you count in all German umlauts, all French accents, and all of the special characters and character decorations found in many European languages from Irish to Romanian, from Finnish to Hungarian, then all these would not fit into one byte any more.

Then consider all Asian scripts (Chinese, Japanese, Korean, and so on) and you see where this is going. To manage all this, the Unicode standard was developed. The goal was to assign a unique number, called a code point, to each character of each of the written languages on Earth. Currently, the Unicode standard contains more than 128,000 characters.

So to store all possible characters of the Unicode standard, we would need three bytes per character, or better four bytes, as this fits our processor architectures better. However, this would be an enormous waste of memory and storage space for English text, which is, especially when it comes to programming, still quite commonly used.

A good compromise is a variable-length encoding. UTF-8 is such an encoding. ASCII characters between 0 and 127 are still represented by one single byte in UTF-8, whereas other characters may consume up to four bytes.

fmt.Println(len("a"))
fmt.Println(len("ä"))
fmt.Println(len("走"))
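To make the variable length visible, here is a minimal sketch that prints the raw bytes behind each of the three strings. The helper name utf8Bytes is made up for this illustration, and the example assumes that "ä" is stored as the single code point U+00E4.

```go
package main

import "fmt"

// utf8Bytes is a hypothetical helper: it returns the raw UTF-8 bytes
// that Go stores for s.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

func main() {
	for _, s := range []string{"a", "ä", "走"} {
		b := utf8Bytes(s)
		// "% x" prints each byte as two hex digits, separated by spaces.
		fmt.Printf("%s: %d byte(s): % x\n", s, len(b), b)
	}
	// a: 1 byte(s): 61
	// ä: 2 byte(s): c3 a4
	// 走: 3 byte(s): e8 b5 b0
}
```

The ASCII letter keeps its one-byte ASCII value, while the other two characters are spread over two and three bytes.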

To distinguish between pure ASCII characters and Unicode characters, a single Unicode character (more precisely, a Unicode code point) is called a rune in Go.
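A rune is just a number, so it can be printed as a character, as a decimal value, or in the usual U+XXXX notation. The following is a small sketch; the function name describe is an assumption made for this example only.

```go
package main

import "fmt"

// describe is a hypothetical helper that formats a rune in three ways:
// %c as a character, %d as a decimal number, %U in Unicode notation.
func describe(r rune) string {
	return fmt.Sprintf("%c %d %U", r, r, r)
}

func main() {
	fmt.Println(describe('a')) // a 97 U+0061
	fmt.Println(describe('ä')) // ä 228 U+00E4
	fmt.Println(describe('走')) // 走 36208 U+8D70
}
```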

Back to Go: How can we operate on character level rather than on byte level?

First, there is the range operator, which, as we have seen in the lesson about loops, iterates over runes rather than over bytes.

This loop shows the behavior of the range operator again:

for i, v := range "aä走." {
    fmt.Println(i, v, string(v))
}

v is of type rune, which is an alias for int32. So even if the character within the string consumes only one byte, the range operator returns it as a four-byte value.
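To contrast the two ways of walking a string, here is a sketch that loops over the same string once by index and once with range. The helper runeOffsets is hypothetical and only collects the index values that range reports.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it collects the byte offset
// at which each rune of s starts, as reported by range.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "aä走."

	// Indexing walks the string byte by byte: seven iterations.
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %x\n", i, s[i])
	}

	// range walks the string rune by rune: four iterations,
	// and the index jumps by the byte length of each rune.
	fmt.Println(runeOffsets(s)) // [0 1 3 6]
}
```

Note that the index delivered by range is still a byte offset, not a character count.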

Many standard operations also work fine with Unicode. For example, testing if a string is a substring of another string requires no special handling of Unicode characters. If the string is a substring at byte level, then it also is a substring at UTF-8 level.
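As a sketch of this, strings.Contains and strings.Index can be used on non-ASCII text without any extra work. Keep in mind that strings.Index reports a byte offset. The helper runeIndex is made up for this example; it converts that offset into a position counted in runes and assumes that both strings are valid UTF-8.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex is a hypothetical helper. It returns the position of substr
// in s counted in runes rather than bytes, or -1 if substr is not found.
func runeIndex(s, substr string) int {
	i := strings.Index(s, substr) // byte offset of the first match
	if i < 0 {
		return -1
	}
	// Count the runes in front of the match.
	return utf8.RuneCountInString(s[:i])
}

func main() {
	s := "aä走."
	fmt.Println(strings.Contains(s, "走")) // true
	fmt.Println(strings.Index(s, "走"))    // 3 (byte offset)
	fmt.Println(runeIndex(s, "走"))        // 2 (rune position)
}
```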

Finally, the Go standard library contains packages that handle UTF–8 strings and runes. For example:

The strings package contains functions that come in two variants, one working at byte level and one working at rune level, like, for example,

// in package strings
func IndexByte(s string, c byte) int

and

// in package strings
func IndexRune(s string, r rune) int

Note that the second parameters (c and r, respectively) represent single characters. Use single quotes here when passing a character literal; for example,

idx := strings.IndexByte(s, 'a')
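The difference between the two functions shows up as soon as the character is not plain ASCII. The sketch below uses a hypothetical helper, indexes, that runs both lookups for the same character. The literal 'ä' has the value 228, which fits into a byte, so the code compiles, but that single byte never occurs in the UTF-8 encoding of "ä".

```go
package main

import (
	"fmt"
	"strings"
)

// indexes is a hypothetical helper: it looks up c twice,
// once as a raw byte and once as a rune with the same value.
func indexes(s string, c byte) (asByte, asRune int) {
	return strings.IndexByte(s, c), strings.IndexRune(s, rune(c))
}

func main() {
	s := "aä走."

	b, r := indexes(s, '.')
	fmt.Println(b, r) // 6 6

	// The byte 0xE4 is not part of the string; "ä" is stored as c3 a4.
	b, r = indexes(s, 'ä')
	fmt.Println(b, r) // -1 1

	// A rune such as '走' does not fit into a byte at all,
	// so only IndexRune can search for it.
	fmt.Println(strings.IndexRune(s, '走')) // 3
}
```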

The “unicode/utf8” package contains various functions for working with strings that contain runes. For example, the equivalent to len(s) at rune level is

// in package unicode/utf8
func RuneCountInString(s string) (n int)
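Here is a short sketch that puts len and utf8.RuneCountInString side by side. The helper name sizes is an assumption for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper that returns the length of s
// in bytes and in runes.
func sizes(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := sizes("aä走.")
	fmt.Println(b, r) // 7 4

	// For pure ASCII text, both numbers are the same.
	b, r = sizes("Go")
	fmt.Println(b, r) // 2 2
}
```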

To summarize

  • Go treats text as Unicode encoded as UTF-8.
  • Some characters in Unicode require more than one byte of storage.
  • Data type rune represents a single Unicode character (code point). It is an alias for int32.
  • The len function as well as the index and the slice operations work at byte level, whereas the range operator works at character level.
  • The strings and utf8 packages provide functions for dealing with runes.
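Since the slice operation works at byte level, it can cut a multi-byte character in half. The following sketch shows this, together with one possible way to slice at character level by converting to []rune first. The helper firstRunes is hypothetical, and the conversion allocates a new slice, so this is an illustration rather than a recommendation for large strings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRunes is a hypothetical helper that returns the first n
// characters (runes) of s, or all of s if it is shorter than n.
func firstRunes(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	s := "aä走."

	// Byte-level slice: "a" plus only the first byte of "ä".
	cut := s[:2]
	fmt.Println(utf8.ValidString(cut)) // false

	// Rune-level slice: the first two characters.
	fmt.Println(firstRunes(s, 2))                   // aä
	fmt.Println(utf8.ValidString(firstRunes(s, 2))) // true
}
```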

Links

Effective Go: For (especially the paragraph on range and Unicode)

Language reference: Rune literals

Package documentation: strings

Package documentation: unicode/utf8

Tip: An excellent introduction to Unicode can be found here (or in the Web archive)

Another tip: package golang.org/x/text and its subpackages contain functionality around internationalization and localization; some useful, others low-level stuff for other packages. For Unicode handling, you might find the golang.org/x/text/unicode package and its subpackages useful.

