No, when you are doing string manipulation, you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.
Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…
In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.
ASCII is mostly fine within your programs, like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously: anything that comes from a user or leaves the program needs to be handled correctly.
I said use a Unicode library if input data is actual human language. Which names and addresses are.
99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)
User-provided data, yes, but also data where you can treat non-ASCII bytes as garbage in -> garbage out. E.g. the config file might be typed by a human but if you need to support case-insensitive keys you still don't need to worry about Unicode.
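A minimal sketch of that in Python (the key name is made up): `bytes.lower()` folds only the ASCII letters A–Z, so case-insensitive config keys need no Unicode machinery, and any non-ASCII bytes pass through untouched.

```python
# ASCII-only case folding: bytes.lower() maps only A-Z to a-z,
# so non-ASCII bytes are left exactly as they came in.
def normalize_key(key: bytes) -> bytes:
    return key.lower()

config = {normalize_key(b"MaxConnections"): b"10"}  # hypothetical key

assert normalize_key(b"MAXCONNECTIONS") in config
# Garbage in -> garbage out, but nothing breaks:
assert normalize_key(b"caf\xc3\xa9") == b"caf\xc3\xa9"
```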
"The vast majority of the population does not read or write any English in their day to day lives."
This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
While English speakers are not a majority, it is the most popular language.
And one should also note that given English is the lingua franca of programming, I'd suspect that English as a second language is actually a majority for programmers.
So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
> "The vast majority of the population does not read or write any English in their day to day lives."
> This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
> While English speakers are not a majority, it is the most popular language.
That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, ASCII only works for a subset of even English if you ignore all loan words and diacritics in things like proper names.
> So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: the fact that the code itself is written in basic English simple enough to fit in ASCII does not mean you can ignore Unicode or skip thinking about text encoding.
> That’s why I still get mail with my name mangled
Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
> it’s better if a Chinese user name does not break your reporting or logging systems
You should not be casually dumping Chinese usernames into logs without warnings; in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails for login IDs. "Usernames in natural human language" is a valid concept only in the ASCII cultural sphere.
> Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
That is not always possible, and the translation from a local writing system to ASCII is often ambiguous rather than unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose and that their failure to represent how the world works is just that: a failure. There is no excuse for programs not to support UTF-8 from user input to any output, including all the processing in between.
It's funny how much software developers live in bubbles. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must account for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.
Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.
Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case insensitive, but you still need to verify or organize things using string comparison.
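For the Unicode side of that, a Python sketch: `str.casefold()` is the fold intended for caseless matching (a complete solution would also normalize, e.g. to NFC, which this sketch skips).

```python
# Caseless matching needs case folding, not just lower():
def caseless_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

assert caseless_equal("HTML", "html")        # ASCII behaves as expected
assert caseless_equal("straße", "STRASSE")   # "ß" folds to "ss"
assert "straße".lower() != "strasse"         # lower() alone is not enough
```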
Ah, I see, we disagree on what "human language" is. An abbreviation like HTML and its different capitalisations sound to me a lot like a feature of human language.
Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.
Markup languages, and SGML in particular, absolutely are designed for digital text communication by humans and to be written using plain text editors; that's kind of the entire point of avoiding binary data constructs.
And to GP, SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this day; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalisation.
HTML is a text-based medium. But that doesn't make it a human language. Some human languages are not text-based. And some text is not a human language.
ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.
But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.
I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
The other normal cases of string usage are file paths and user interfaces, where the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range.
With file paths the manipulations that you're most often doing is path based so you only care about '/', '\', ':', and '.' ASCII characters.
With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.
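A sketch of the path case with Python's pathlib (the file name is made up): suffix handling only inspects the ASCII '.' and '/' separators, so a non-ASCII name passes through untouched.

```python
from pathlib import PurePosixPath

# Changing the extension only looks at ASCII separators; the
# non-ASCII file name itself is never interpreted or modified.
p = PurePosixPath("reports/данные.txt")  # hypothetical file name
assert p.suffix == ".txt"
assert str(p.with_suffix(".csv")) == "reports/данные.csv"
```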
> I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure the capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII, and even larger than Unicode.
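To illustrate the gap, a Python sketch: the particle table below is a tiny made-up subset, and real collation needs locale data (e.g. ICU) rather than a hard-coded list.

```python
# Phone-book-style sorting of Dutch surnames: sort on the family
# name proper, with the particles ("tussenvoegsels") moved behind.
# The particle list here is illustrative and far from complete.
PARTICLES = ("van den ", "van der ", "van ", "de ")  # assumed subset

def sort_key(surname: str) -> str:
    low = surname.lower()
    for particle in PARTICLES:
        if low.startswith(particle):
            # "van den Berg" -> "Berg, van den"
            return surname[len(particle):] + ", " + surname[:len(particle)].strip()
    return surname

names = ["Smit", "van den Berg", "Bakker"]
assert sorted(names) == ["Bakker", "Smit", "van den Berg"]           # naive order
assert sorted(names, key=sort_key) == ["Bakker", "van den Berg", "Smit"]
```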
Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.
Further, I'd say that if you're creating text to present to the user, the most common operation would be replacement of some field in pre-defined text.
In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
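A sketch of that design in Python (the field names are illustrative, not from any particular system): the variants are computed once at data entry, and rendering afterwards is pure placeholder substitution.

```python
from dataclasses import dataclass

# Store every variant needed later; no string manipulation happens
# at render time beyond filling placeholders.
@dataclass(frozen=True)
class PersonName:
    display: str   # e.g. "van den Berg"
    sort_key: str  # e.g. "Berg, van den", computed at entry time

p = PersonName(display="van den Berg", sort_key="Berg, van den")
greeting = "Dear {name},".format(name=p.display)
assert greeting == "Dear van den Berg,"
```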
I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.
And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".
File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.
So use standard string processing libraries on path names at your own peril.
It's a good idea to consider file paths as a bag of bytes.
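A Python sketch of the bytes-first approach, using the 'surrogateescape' error handler so that even bytes that are not valid UTF-8 survive a decode/encode round trip:

```python
# Treat the path as bytes; decode only for display, losslessly.
raw = b"backup\xff.tar"                    # not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "backup\udcff.tar"          # bad byte smuggled as a surrogate
assert text.encode("utf-8", "surrogateescape") == raw  # lossless round trip
```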
IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.
I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption); trying to delete that file didn't work, because the delete would look for "FOO" and couldn't find it, the file being named "FOo".
> It's a good idea to consider file paths as a bag of bytes
(Nitpick: sequence of bytes)
Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)
Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.
That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).
Just using non-ASCII (UTF-8) characters in a username at all is problematic. That has been a major PSA item for Windows users in my language literally since the 90s and still is. Microsoft switched home folder names from the Microsoft Account username to a shortened user email for that reason.
Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.
Better to just convert WTF-16 (Windows filenames are not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.
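In Python the same trick is available via the 'surrogatepass' error handler, which, for lone surrogates, produces the same bytes WTF-8 would:

```python
# A lone surrogate is invalid in strict UTF-8, but 'surrogatepass'
# encodes it the generalized-UTF-8 (WTF-8) way and round-trips it.
lone = "\ud83d"                      # unpaired high surrogate
try:
    lone.encode("utf-8")             # strict UTF-8 refuses it
    raise AssertionError("should have raised")
except UnicodeEncodeError:
    pass

wtf8 = lone.encode("utf-8", "surrogatepass")
assert wtf8 == b"\xed\xa0\xbd"
assert wtf8.decode("utf-8", "surrogatepass") == lone
```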