No, when you are doing string manipulation, you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.
Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…
In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.
ASCII is mostly fine within your programs, like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously: anything that comes from a user or leaves the program needs to be handled correctly.
I said use a Unicode library if input data is actual human language. Which names and addresses are.
99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)
User-provided data, yes, but also data where you can treat non-ASCII bytes as garbage in -> garbage out. E.g. the config file might be typed by a human but if you need to support case-insensitive keys you still don't need to worry about Unicode.
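A minimal sketch of that in Python (the key name is made up): `bytes.lower()` folds only the ASCII letters A–Z, so case-insensitive config keys need no Unicode machinery, and any non-ASCII bytes pass through untouched.

```python
# ASCII-only case folding: bytes.lower() maps only A-Z to a-z,
# so non-ASCII bytes are left exactly as they came in.
def normalize_key(key: bytes) -> bytes:
    return key.lower()

config = {normalize_key(b"MaxConnections"): b"10"}  # hypothetical key

assert normalize_key(b"MAXCONNECTIONS") in config
# Garbage in -> garbage out, but nothing breaks:
assert normalize_key(b"caf\xc3\xa9") == b"caf\xc3\xa9"
```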
"The vast majority of the population does not read or write any English in their day to day lives."
This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
While English speakers are not a majority, it is the most popular language.
And one should also note that given English is the lingua franca of programming, I'd suspect that English as a second language is actually a majority for programmers.
So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
> "The vast majority of the population does not read or write any English in their day to day lives."
> This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
> While English speakers are not a majority, it is the most popular language.
That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, ASCII only works for a subset of even English if you ignore all loan words and diacritics in things like proper names.
> So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: the fact that the code itself is written in basic English simple enough to fit in ASCII does not mean you can ignore Unicode or skip thinking about text encoding.
> That’s why I still get mail with my name mangled
Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
> it’s better if a Chinese user name does not break your reporting or logging systems
You should not be casually dumping Chinese usernames into logs without warnings; in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails for login IDs. "Usernames in natural human language" is a valid concept only in the ASCII cultural sphere.
> Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
That is not always possible, and the translation from a local writing system to ASCII is often ambiguous rather than unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose and that their failure to represent how the world works is just that: a failure. There is no excuse for programs not to support UTF-8 from user input to any output, including all the processing in between.
It's funny how much software developers live in bubbles. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must account for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.
Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.
Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case insensitive, but you still need to verify or organize things using string comparison.
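For the Unicode side of that, a Python sketch: `str.casefold()` is the fold intended for caseless matching (a complete solution would also normalize, e.g. to NFC, which this sketch skips).

```python
# Caseless matching needs case folding, not just lower():
def caseless_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

assert caseless_equal("HTML", "html")        # ASCII behaves as expected
assert caseless_equal("straße", "STRASSE")   # "ß" folds to "ss"
assert "straße".lower() != "strasse"         # lower() alone is not enough
```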
Ah, I see, we disagree on what "human language" is. An abbreviation like HTML and its different capitalisations sound to me a lot like a feature of human language.
Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.
Markup languages, and SGML in particular, absolutely are designed for digital text communication by humans and to be written using plain text editors; that's kind of the entire point of avoiding binary data constructs.
And to GP, SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this day; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalisation.
HTML is a text-based medium. But that doesn't make it a human language. Some human languages are not text-based. And some text is not a human language.
ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.
But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.
I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
The other normal cases of string usage are file paths and user interfaces, where the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range.
With file paths the manipulations that you're most often doing is path based so you only care about '/', '\', ':', and '.' ASCII characters.
With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.
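A sketch of the path case with Python's pathlib (the file name is made up): suffix handling only inspects the ASCII '.' and '/' separators, so a non-ASCII name passes through untouched.

```python
from pathlib import PurePosixPath

# Changing the extension only looks at ASCII separators; the
# non-ASCII file name itself is never interpreted or modified.
p = PurePosixPath("reports/данные.txt")  # hypothetical file name
assert p.suffix == ".txt"
assert str(p.with_suffix(".csv")) == "reports/данные.csv"
```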
> I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure the capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII, and even larger than Unicode.
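To illustrate the gap, a Python sketch: the particle table below is a tiny made-up subset, and real collation needs locale data (e.g. ICU) rather than a hard-coded list.

```python
# Phone-book-style sorting of Dutch surnames: sort on the family
# name proper, with the particles ("tussenvoegsels") moved behind.
# The particle list here is illustrative and far from complete.
PARTICLES = ("van den ", "van der ", "van ", "de ")  # assumed subset

def sort_key(surname: str) -> str:
    low = surname.lower()
    for particle in PARTICLES:
        if low.startswith(particle):
            # "van den Berg" -> "Berg, van den"
            return surname[len(particle):] + ", " + surname[:len(particle)].strip()
    return surname

names = ["Smit", "van den Berg", "Bakker"]
assert sorted(names) == ["Bakker", "Smit", "van den Berg"]           # naive order
assert sorted(names, key=sort_key) == ["Bakker", "van den Berg", "Smit"]
```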
Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.
Further, I'd say that if you're creating text to present to the user, the most common operation would be replacement of some field in pre-defined text.
In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
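A sketch of that design in Python (the field names are illustrative, not from any particular system): the variants are computed once at data entry, and rendering afterwards is pure placeholder substitution.

```python
from dataclasses import dataclass

# Store every variant needed later; no string manipulation happens
# at render time beyond filling placeholders.
@dataclass(frozen=True)
class PersonName:
    display: str   # e.g. "van den Berg"
    sort_key: str  # e.g. "Berg, van den", computed at entry time

p = PersonName(display="van den Berg", sort_key="Berg, van den")
greeting = "Dear {name},".format(name=p.display)
assert greeting == "Dear van den Berg,"
```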
I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.
And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".
File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.
So use standard string processing libraries on path names at your own peril.
It's a good idea to consider file paths as a bag of bytes.
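A Python sketch of the bytes-first approach, using the 'surrogateescape' error handler so that even bytes that are not valid UTF-8 survive a decode/encode round trip:

```python
# Treat the path as bytes; decode only for display, losslessly.
raw = b"backup\xff.tar"                    # not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "backup\udcff.tar"          # bad byte smuggled as a surrogate
assert text.encode("utf-8", "surrogateescape") == raw  # lossless round trip
```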
IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.
I think I once hex-edited the FA-table to change a filename to have a lowercase name (or maybe it was disk corruption); trying to delete that file didn't work, because the delete would look for "FOO" and couldn't find it, the file being named "FOo".
> It's a good idea to consider file paths as a bag of bytes
(Nitpick: sequence of bytes)
Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)
Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.
That's what I mean, you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation that you generally need to do is to append a path, remove a path, change extension, things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as is (with some exceptions using OS API specific functions as needed).
Just using non-ASCII (UTF-8) characters in a username at all is problematic. That has been a major PSA item for Windows users in my language literally since the 90s and still is. Microsoft switched home folder names from the Microsoft Account username to a shortened user email for that reason.
Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.
Better to just convert WTF-16 (Windows filenames are not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.
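In Python the same trick is available via the 'surrogatepass' error handler, which, for lone surrogates, produces the same bytes WTF-8 would:

```python
# A lone surrogate is invalid in strict UTF-8, but 'surrogatepass'
# encodes it the generalized-UTF-8 (WTF-8) way and round-trips it.
lone = "\ud83d"                      # unpaired high surrogate
try:
    lone.encode("utf-8")             # strict UTF-8 refuses it
    raise AssertionError("should have raised")
except UnicodeEncodeError:
    pass

wtf8 = lone.encode("utf-8", "surrogatepass")
assert wtf8 == b"\xed\xa0\xbd"
assert wtf8.decode("utf-8", "surrogatepass") == lone
```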