
Some people here seem to think that indexing, measuring and slicing operations based on runes (code points) instead of bytes (UTF-8 code units) by default would be a good idea. It's not - you get the worst of both worlds: indexing is not a constant-time operation and a code point is still not a user-perceived character, because combining character sequences consist of multiple code points. Even normalization doesn't help in general.
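To make the three levels concrete, here is a minimal Go sketch using only the standard library. The helper name `counts` is made up for illustration. It shows that the same user-perceived character "é" can be one code point or two, so counting runes does not give you a character count either.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the number of bytes
// (UTF-8 code units) and the number of runes (code points) in s.
func counts(s string) (nbytes, nrunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // "é" as a single code point
	combining := "e\u0301"  // "e" followed by a combining acute accent

	b1, r1 := counts(precomposed)
	b2, r2 := counts(combining)

	// Both strings render as one user-perceived character.
	fmt.Println(b1, r1) // 2 1
	fmt.Println(b2, r2) // 3 2
}
```

Note that `utf8.RuneCountInString` has to walk the whole string, which is the non-constant-time cost mentioned above.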

Other languages like C# seem to be different on the surface, but in fact they index and measure by code units as well (2-byte UTF-16 code units), not by code points.
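For comparison, a rough Go sketch of what measuring by UTF-16 code units looks like. This only imitates a UTF-16 length with Go's `unicode/utf16` package and is not C# code; `utf16Len` is a made-up helper name.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len is a hypothetical helper: it returns how many UTF-16 code
// units are needed to encode s.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	s := "\U0001F600" // one code point outside the Basic Multilingual Plane

	fmt.Println(len(s))                    // 4 (UTF-8 code units)
	fmt.Println(utf8.RuneCountInString(s)) // 1 (code points)
	fmt.Println(utf16Len(s))               // 2 (UTF-16 code units, a surrogate pair)
}
```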



> It's not - you get the worst of both worlds: indexing is not a constant-time operation

You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway, due to combining codepoints which may not have precomposed forms (if only because there is no defined limit to the number of combining codepoints tacked onto the base), so normalization will not save you. There are also codepoints which are not visible to the user and which you may or may not want to see, depending on the work you're doing.

People really need to come to terms with the fact that a unicode stream is exactly that, a stream.


> You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway

To find an index of a substring you need to scan the string, right. But once you have the byte index you can quickly jump to its position in the string, e.g. when you do a slice operation based on that index: s[i:]. If strings.Index() returned a code point index and not a byte index you would have to scan the string again.
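A small Go sketch of the difference. `tailFrom` and `sliceFromRune` are made-up helpers for illustration: the first slices directly at the byte offset returned by `strings.Index`, and the second shows the extra scan a code point index would need.

```go
package main

import (
	"fmt"
	"strings"
)

// tailFrom returns the part of s starting at the first occurrence of
// sep, or "" if sep does not occur.
func tailFrom(s, sep string) string {
	i := strings.Index(s, sep) // byte offset, not a code point count
	if i < 0 {
		return ""
	}
	return s[i:] // no second scan: the slice starts at byte i
}

// sliceFromRune returns s starting at its n-th code point (0-based).
// It has to walk the string, because code points have variable width
// in UTF-8.
func sliceFromRune(s string, n int) string {
	for i := range s { // i is the byte offset of each code point
		if n == 0 {
			return s[i:]
		}
		n--
	}
	return ""
}

func main() {
	s := "h\u00e9llo, w\u00f6rld" // "héllo, wörld"

	fmt.Println(strings.Index(s, "w\u00f6rld")) // 8 (the code point index would be 7)
	fmt.Println(tailFrom(s, "w\u00f6rld"))      // wörld
	fmt.Println(sliceFromRune(s, 7))            // wörld, but only after rescanning
}
```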


> To find an index of a substring you need to scan the string, right. But once you have the byte index you can quickly jump to its position in the string

Stop doing that and just get the bit of string you want in the first place?


> and a code point is still not a user-perceived character

How about indexing, measuring and slicing operations based on user-perceived characters, then?


I think the number of displayed characters is even font-dependent.


Because even that is not exactly trivial, particularly for non-Latin languages.



