Strings, bytes, runes and characters in Go (golang.org)
76 points by kisielk on Oct 23, 2013 | 35 comments


Some people here seem to think that indexing, measuring and slicing operations based on runes (code points) instead of bytes (UTF-8 code units) by default would be a good idea. It's not - you get the worst of both worlds: indexing is not a constant-time operation, and a code point is still not a user-perceived character, because combining character sequences consist of multiple code points; even normalization doesn't help in general.
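A quick sketch of how those three notions disagree (the example string is mine, not from the article):

  package main

  import (
      "fmt"
      "unicode/utf8"
  )

  func main() {
      // "é" written as 'e' plus U+0301 COMBINING ACUTE ACCENT:
      // one user-perceived character, two code points, three bytes.
      s := "e\u0301"
      fmt.Println(len(s))                    // 3 bytes (UTF-8 code units)
      fmt.Println(utf8.RuneCountInString(s)) // 2 runes (code points)
      // Neither number is the 1 a reader would count on screen.
  }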

Other languages like C# seem to be different on the surface, but in fact they index and measure by code units as well (2-byte UTF-16 code units), not by code points.


> It's not - you get the worst of both worlds: indexing is not a constant-time operation

You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway, due to combining codepoints which may not have precombined forms (if only because there is no defined limit to the number of combining codepoints tacked onto the base, so normalization will not save you), or codepoints which are not visible to the user and which you may or may not want to see depending on the work you're doing.
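To make the normalization point concrete, here's a sketch assuming the norm package from the text subrepo (imported by its modern golang.org/x/text path): even after NFC, stacked combining marks can leave several code points behind.

  package main

  import (
      "fmt"
      "unicode/utf8"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      // 'e' followed by a combining acute and a combining grave accent.
      s := "e\u0301\u0300"
      c := norm.NFC.String(s)
      // NFC folds e + U+0301 into the precomposed 'é', but there is no
      // precomposed form that also carries the grave, so the normalized
      // text is still two code points for one user-perceived character.
      fmt.Println(utf8.RuneCountInString(s), utf8.RuneCountInString(c)) // 3 2
  }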

People really need to come to terms with the fact that a Unicode stream is exactly that: a stream.


> You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway

To find an index of a substring you need to scan the string, right. But once you have the byte index you can quickly jump to its position in the string, e.g. when you do a slice operation based on that index: s[i:]. If strings.Index() returned a code point index and not a byte index, you would have to scan the string again.
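A minimal sketch of that:

  package main

  import (
      "fmt"
      "strings"
  )

  func main() {
      s := "note: résumé attached"
      // strings.Index scans once and returns a byte offset...
      i := strings.Index(s, "résumé")
      // ...so the slice below is constant-time; no second scan needed.
      fmt.Println(s[i:]) // résumé attached
  }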


> To find an index of a substring you need to scan the string, right. But once you have the byte index you can quickly jump to its position in the string

Stop doing that and just get the bit of string you want in the first place?


> and a code point is still not a user-perceived character

How about indexing, measuring and slicing operations based on user-perceived characters, then?


I think the number of displayed characters is even font-dependent.


Because even that is not exactly trivial, particularly for non-Latin languages.


No distinction between string and byte array? I foresee all the fun of Python 3 in Go's future. Those of us programming in the real world need to deal with legacy character sets in strings obtained from elsewhere, and it's no fun at all to discover that what you thought was a string is actually an array of SJIS or ISO 8859-1 bytes.


It sounds like you've decided to dislike Go without ever having actually used it for anything in the "real world".

There is a difference between a string and a byte array. A string is a string. A byte array is a []byte (byte slice). You have to explicitly convert from one to the other. Neither is inherently UTF-8. A string is represented by a byte array under the hood, and string literals in your code are read as UTF-8 encoded. Strings themselves are not necessarily UTF-8 encoded, and if you need to use a different encoding there are libraries for that (unless you're using something really esoteric).
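A small sketch of those explicit conversions:

  package main

  import "fmt"

  func main() {
      s := "héllo"   // string literal in source code: UTF-8 bytes
      b := []byte(s) // explicit conversion, copies the bytes
      b[0] = 'H'     // byte slices are mutable; strings are not
      fmt.Println(string(b), s) // Héllo héllo

      // Nothing forces a string to hold valid UTF-8:
      latin1 := string([]byte{0xe9}) // 'é' in ISO 8859-1, not valid UTF-8 on its own
      fmt.Println(len(latin1))       // 1: it's still a perfectly legal string value
  }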


So what do you get when you read a file, or when a file is uploaded to your web server? What happens if you write a function that accepts a string as a parameter, but haven't noticed that you're implicitly assuming the string is UTF-8? (e.g. a function that formats one string using another - if the encodings are different you'll end up with a string that's invalid in either encoding, no?)

The distinction between a string with one encoding and a string with another is subtle but vitally important - exactly the sort of thing a type system should take care of.


There is a library in one of the Go subrepositories that handles transformation from other encodings to UTF-8: https://code.google.com/p/go/source/browse?repo=text#hg%2Fen...

If you're expecting to get data in other encodings, you could put together some detection and transformation at the point of ingress and convert to UTF-8 encoded text for the rest of your application.
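That subrepo is published these days under golang.org/x/text; assuming those import paths, the ingress conversion might look roughly like this:

  package main

  import (
      "fmt"

      "golang.org/x/text/encoding/charmap"
      "golang.org/x/text/encoding/japanese"
  )

  func main() {
      // Bytes that arrived in legacy encodings...
      latin1 := []byte{0x63, 0x61, 0x66, 0xe9} // "café" in ISO 8859-1
      sjis := []byte{0x93, 0xfa, 0x96, 0x7b}   // "日本" in Shift JIS

      // ...decoded to UTF-8 once, at the edge of the application.
      u1, err := charmap.ISO8859_1.NewDecoder().Bytes(latin1)
      if err != nil {
          panic(err)
      }
      u2, err := japanese.ShiftJIS.NewDecoder().Bytes(sjis)
      if err != nil {
          panic(err)
      }
      fmt.Println(string(u1), string(u2)) // café 日本
  }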


Julia uses the same strategy, but indexing a UTF-8 string returns the rune rather than the byte. If you try to get a byte in the middle of a rune's representation, it raises an error.

The `next(string, index)` function used for the iteration protocol works like the `utf8.DecodeRuneInString()` shown in the example, but it returns the next valid index rather than the character width.
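For comparison, the Go loop the comment is referring to looks roughly like this:

  package main

  import (
      "fmt"
      "unicode/utf8"
  )

  func main() {
      s := "héllo"
      for i := 0; i < len(s); {
          r, width := utf8.DecodeRuneInString(s[i:])
          fmt.Printf("%#U starts at byte %d\n", r, i)
          i += width // advance by the rune's width in bytes, not by one
      }
  }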


Rust has a similar approach (in that it raises an error when you attempt to do something not on a rune boundary). `string[index]` still returns a byte rather than a character, but strong static typing means that isn't a huge problem.


Go also has utf8.RuneCount([]byte) and utf8.RuneCountInString(string), which return the number of runes: http://golang.org/pkg/unicode/utf8/#RuneCount

(Check the docs; there's also RuneStart, which reports whether a given byte is the start of a rune, i.e. whether you've indexed into the middle of one.)
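For example:

  package main

  import (
      "fmt"
      "unicode/utf8"
  )

  func main() {
      s := "héllo"
      fmt.Println(len(s))                    // 6 bytes
      fmt.Println(utf8.RuneCountInString(s)) // 5 runes
      fmt.Println(utf8.RuneCount([]byte(s))) // 5 runes, counted from a []byte

      // RuneStart reports whether a byte could be the first byte of a rune:
      fmt.Println(utf8.RuneStart(s[1])) // true: start of 'é'
      fmt.Println(utf8.RuneStart(s[2])) // false: continuation byte inside 'é'
  }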


Coming from C#, it seems odd that a string would index on bytes and not chars (runes), and that it is essentially a read-only byte array. If you wanted a byte array, why wouldn't you use a byte array? Why have both strings and byte arrays?

In C# you can encode/decode strings to byte arrays based on your desired encoding, but a string is composed of characters; its in-memory representation is abstracted away.

Is this a performance or zero-copy thing? Not having to encode/decode to get at the bytes?


As someone's already pointed out, C# strings are composed of UTF-16 code units, not characters - this means that if you have a character outside the Basic Multilingual Plane it'll be represented as two code units using a surrogate pair, and the character count of the C# string will be wrong (the same is true of Java and JS, for example).

That's a hard problem, and avoiding it in every situation would require scanning the strings for surrogates beforehand, when you might never need to know that information. Go makes it explicit that knowing the exact character position and string length in characters comes at a cost.

There's a good discussion of this on Tim Bray's blog: http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF


Just for fun, here's Go handling a char outside the BMP (😃, U+1F603):

http://play.golang.org/p/qg7POYAAOL
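For readers who don't want to follow the link, a sketch of the same idea (not necessarily the exact snippet behind it):

  package main

  import (
      "fmt"
      "unicode/utf8"
  )

  func main() {
      s := "😃" // U+1F603, outside the Basic Multilingual Plane
      fmt.Println(len(s))                    // 4: UTF-8 spends four bytes on it
      fmt.Println(utf8.RuneCountInString(s)) // 1: still a single rune
      for i, r := range s {
          fmt.Printf("%#U at byte %d\n", r, i) // U+1F603 at byte 0
      }
  }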


https://github.com/mnemnion/emojure/

You can even export them without Capital letters ;-)


I did run into at least one eminently reasonable use of how Go source is defined to be in UTF-8. Comments in the crypto libs just use math symbols where they're handy, like this in crypto/rsa[1]:

  // Check that de ≡ 1 mod p-1, for each prime.
  // This implies that e is coprime to each p-1 as e has a multiplicative
  // inverse. Therefore e is coprime to lcm(p-1,q-1,r-1,...) =
  // exponent(ℤ/nℤ). It also implies that a^de ≡ a mod p as a^(p-1) ≡ 1
  // mod p. Thus a^de ≡ a mod n for all a coprime to n, as required.
Sadly, the spec requires identifiers to be just Unicode letters and digits, so we will never experience the power and glory of emoji function names in Go.

[1] http://golang.org/src/pkg/crypto/rsa/rsa.go
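A small illustration of that identifier rule (the emoji version won't compile, so it only appears in a comment):

  package main

  import "fmt"

  func main() {
      // Identifiers may be built from Unicode letters and digits, so these are legal:
      π := 3.14159
      日本語 := "nihongo"
      fmt.Println(π, 日本語)

      // But `func 😃()` is rejected: emoji are symbols, not letters.
  }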


Indexing on runes is expensive, since either you need to store strings as arrays of runes or iterate on each index operation. Indexing on runes is also less useful than it may first seem. Consider that a rune is distinct from a glyph--a single "character" on the screen may be composed of several runes. The occasions when you care about specific runes as opposed to substrings are uncommon. When dealing with substrings, there is no advantage to substring-of-runes as opposed to substring-of-bytes.
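A sketch of the cost: random access by rune either pays an up-front O(n) conversion or an O(n) scan per lookup.

  package main

  import "fmt"

  func main() {
      s := "héllo wörld"

      _ = s[1] // byte indexing is constant-time, but s[1] is only the first byte of 'é'

      // Option 1: materialize the runes (O(n) time and extra memory), then index:
      runes := []rune(s)
      fmt.Printf("%c\n", runes[1]) // é

      // Option 2: decode from the start on every lookup (O(n) per access):
      n := 0
      for _, r := range s {
          if n == 1 {
              fmt.Printf("%c\n", r) // é again, found by scanning
              break
          }
          n++
      }
  }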

Note that many languages that appear to offer indexing by rune (e.g., Java) do not in fact do so, since their 16-bit "character" type is incapable of representing all runes. The fact that this is only rarely an issue points at the fundamental rarity with which code needs to deal with runes-qua-runes.


Note that in C#, indexing on char is really indexing on UTF-16 code units, which is arguably worse because it seems to work, until it doesn't.


Interesting, I've never had to use StringInfo to get surrogate pair support. It seems like this could be fixed, since both char and string are abstracted from bytes, although indexing on variable-length chars would increase computation, or UTF-32 would increase memory.


Well, the first question is what a string's internal representation should be, and they went with UTF-8. Once you've decided on UTF-8, the question is whether to hide the representation or expose it. If you decide to expose it, there's hardly any difference between an immutable byte array and a string, so it's simpler in a way to have one type that can be used both ways.


No, read the article more carefully. A Go string's internal representation is a sequence of bytes. There's nothing UTF-8 about it.


I understood the article just fine. It's not wrong to say that in Go, text (called strings in other languages) is nearly always represented in UTF-8 format and is commonly stored in variables of type string, or sometimes []byte.

Perhaps I should have left out the word "internal" since it's exposed.


Yes, "internal representation" has a specific meaning. If you had said "standard representation" or something then I would have agreed with you.


Yes, there is: range on a string works on runes by decoding UTF-8.
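Concretely:

  package main

  import "fmt"

  func main() {
      for i, r := range "héllo" {
          fmt.Printf("byte %d: %#U\n", i, r)
      }
      // byte 0: U+0068 'h'
      // byte 1: U+00E9 'é'
      // byte 3: U+006C 'l'   <- note the jump: 'é' took bytes 1 and 2
      // byte 4: U+006C 'l'
      // byte 5: U+006F 'o'
  }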


That has nothing to do with "a string's internal representation". That is just how a range loop is defined/implemented.


The string type is immutable, which is the main reason for using it instead of []byte.
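A sketch of the difference:

  package main

  func main() {
      s := "hello"
      b := []byte(s)

      b[0] = 'H' // fine: byte slices are mutable

      // s[0] = 'H' // does not compile: cannot assign to s[0]
  }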


Same in C#, but why not have an immutable byte array for bytes and a string for chars/runes? Honestly, I've never needed an immutable byte array that wasn't for chars.


Go doesn't have the concept of an immutable array/slice. It's been proposed (largely to bridge the divide between string and []byte), but it doesn't fit quite right with the rest of the language.


Conversely, I could say why not use bytes for both, and that when I need an index lookup it's usually a byte offset. But I think neither statement is totally satisfying, and the more interesting question is why different languages settled on different encodings for strings in RAM.

Microsoft had reasons to pick UTF-16 for C# and the CLR. The Windows API speaks UTF-16, and when C# was first announced way back in 2000, UTF-8 was not yet a widely used encoding on the Web; Unicode was sadly not that widely used on the Web, period. The decision to use 16-bit Unicode in the Win32 API went even further back, to the development leading up to NT 3.1's release in July 1993. At that point, UTF-8 was a relative baby; it was presented at USENIX in January 1993. Also, back then, code points basically were 16 bits because surrogate pairs were but a twinkle in the Unicode Consortium's eye. The Unicode 2.0 standard added surrogate pairs to help them expand their CJK selection and generally let them add more chars of all sorts; it wasn't released until 1996. Now here we are and we have 😃, U+1F603.

Go came along in 2009; by then, many Web sites were being served in UTF-8, and Go'd be used in significant part in Web operations, and UTF-8 was also the default encoding in many Unix environments. Go initially didn't run on Windows at all and was ported by the community, so fitting in with Win32 wasn't an issue. Code points that wouldn't fit in 16 bits were a fact of life by '09, too. Arguments about inherent merits of encodings aside, it probably would have seemed to lots of folks that UTF-8 was a natural choice for that task at that time.

Also, Pike and Thompson are two of the three co-designers of Go and co-designed UTF-8, and UTF-8 was the encoding used by the Plan 9 OS/environment they built at Bell Labs, so, again, technical details aside, it was kind of a foregone conclusion which encoding they'd build the language around. :)

The one thing I do not want to do here is get in an argument about the inherent merits of character encodings, so I'm just not gonna do that. :)

On UTF-16 and Windows NT: http://support.microsoft.com/kb/99884, http://en.wikipedia.org/wiki/Windows_NT_3.1, and http://en.wikipedia.org/wiki/Unicode

On Unicode adoption on the Web, UTF-8, and Plan 9: http://googleblog.blogspot.com/2012/02/unicode-over-60-perce..., http://en.wikipedia.org/wiki/UTF-8, http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs


That blog post is an example of good technical writing.


After following the development of Go for a while, I've come to idolize Rob Pike's terse, accurate communication style.


Check out his books too: The Unix Programming Environment and The Practice of Programming. Both co-authored by Brian Kernighan.



