Discovering one bug after another in the UTF-8 decoding logic in OpenBSD

jamal-kumar · on March 11, 2023

I love this site; aside from the design, the articles on it are good reads. I guess OpenBSD people have a hilarious idea of design sometimes. I remember when de Raadt and some others were making slides for conferences in MS Comic Sans just to "annoy design nerds" (not my wording IIRC) for a while.

messe · on March 11, 2023

> I remember when de raadt and some others were making slides for conferences in MS comic sans just to "annoy design nerds" (not my wording IIRC) for a while

I'm pretty sure that's still a thing, and as somebody who nerds out on typography (my undergrad thesis used the Tufte LaTeX package), I fucking love it.

vfclists · on March 11, 2023

> aside from the design the articles on it are good reads

The design is perfectly fine.

steponlego · on March 11, 2023

It’s horrid.

GrumpySloth · on March 10, 2023

It's weird to me that text encoding is put in the domain of the kernel. But I guess it's the way it needs to be until the world moves to microkernels.

jamal-kumar · on March 11, 2023

Yeah, what's even weirder is when it's so integral to the kernel that it can lead to crashes. I've seen more than one case now where someone just fuzzing around on text engines ends up taking down an entire system, at least a couple on Mac OS X and iOS alone. Dangerous stuff. [1] [2]

Linux might not be the most performant in the text engine game but at least it runs in userland.

[1] https://arstechnica.com/gadgets/2013/08/rendering-bug-crashe...

[2] https://www.engadget.com/unusual-buggy-character-string-appl...

phoehne · on March 11, 2023

I've been waiting for a "true" microkernel OS since the 1980s.

lproven · on March 11, 2023

QNX and Minix 3 don't count?

layer8 · on March 10, 2023

This seems to be about the system console built into the kernel.

GrumpySloth · on March 10, 2023

Yeah, I got that part. I've just assumed that would be handled by a server process in a microkernel architecture. But I've just checked seL4 and it has in-kernel support for the console as well. That's not what I expected.

MisterTea · on March 11, 2023

The system terminal has to be in-kernel otherwise you would not see kernel prints during boot, before any fs (and servers launched from it) is mounted.

hinkley · on March 10, 2023

Even with a microkernel, there's a lot of power in having a text console even when half the system is not functioning. If this code exists in a separate process, a lot of the handling bugs still need to be fixed to make the OS usable.

snvzz · on March 11, 2023

Serial port is trivial.

Lt_Riza_Hawkeye · on March 11, 2023

okay... and something needs to be able to decode the text sent to it over the serial port. All the bugs pointed out in the article also happen if you use the serial port console

comex · on March 11, 2023

Yes, and that “something” can be an app like gnome-terminal running on a full-fat text rendering library and GUI framework… on a completely different computer. That software could hypothetically have the same bug, but probably not, and at any rate it’s not taking up space in the core of a microkernel.

matja · on March 11, 2023

The 120k+ lines of code in drivers/tty/serial in Linux suggest that maybe it's not _that_ trivial.

somat · on March 11, 2023

Tangentially, I always found it strange that five and six byte utf-8 sequences are illegal. My naive thought is that it provided a good amount of future proofing for when Unicode needs to expand again.

dhosek · on March 11, 2023

Unicode was originally conceived of as a 16-bit encoding. Han unification was necessary to squeeze everything into 16 bits, but it (kind of) worked. Going from 16 bits to 20 bits¹ might not seem like a big increase, but remember that this is an exponential progression, and not a linear one. Those four bits increase the capacity of Unicode by a factor of 16. The amount of empty space is much greater than we’re likely to ever use. Furthermore, there are costs (albeit not necessarily large ones) to having those empty code points—even with efficient use of space using 2-level lookup tables, increasing the capacity of Unicode beyond its current size would add time and space complexity to code for doing things like identifying character codes, grapheme boundaries, case folding, sorting, etc.

⸻

1. Technically, it’s 21 bits, but since the maximum value for a code point is 0x10ffff, we can round down to 0xfffff, which means we’re treating the original 16-bit capacity of Unicode as a rounding error!
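
For anyone who wants to check the bit counting in that footnote, here is a minimal Python sketch. The constant names are made up for illustration; the only inputs are the 0x10ffff maximum and the 16-bit original space mentioned above.

```python
# Rough check of the "16 bits vs. 20/21 bits" arithmetic.
# Names are illustrative only, not from any library.
MAX_CODE_POINT = 0x10FFFF   # largest code point Unicode allows
BMP_SIZE = 0x10000          # size of the original 16-bit space

bits_needed = MAX_CODE_POINT.bit_length()    # bits to hold the largest value
total_code_points = MAX_CODE_POINT + 1       # code points 0 .. 0x10FFFF
planes = total_code_points // BMP_SIZE       # how many 16-bit "planes" fit

print(bits_needed, total_code_points, planes)
```

So the space is 17 blocks of 65,536 code points: the 16 that a plain 20-bit count gives, plus the original 16-bit block that the footnote rounds away.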

toast0 · on March 11, 2023

Five and six byte sequences are defined as illegal, because they encode code points that don't fit in UTF-16. Of the 1,112,064 assignable code points allowed by UTF-16, we've only assigned 149,186 in the past 30 years, so we've probably got a couple centuries before we'll run out. Deal with it then.
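
A back-of-the-envelope version of that estimate, as a hedged Python sketch. It assumes the assignment rate stays constant, which is a simplification, and takes the 149,186 and 30-year figures from the comment as given.

```python
# Where 1,112,064 comes from, and a naive "years left" estimate.
# Assumes a constant assignment rate, which is only a rough guess.
TOTAL_CODE_POINTS = 0x110000             # 0 .. 0x10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1         # reserved for UTF-16 pairs

assignable = TOTAL_CODE_POINTS - SURROGATES

ASSIGNED_SO_FAR = 149_186                # figure quoted in the comment
YEARS_SO_FAR = 30                        # figure quoted in the comment

rate_per_year = ASSIGNED_SO_FAR / YEARS_SO_FAR
years_left = (assignable - ASSIGNED_SO_FAR) / rate_per_year

print(assignable, round(rate_per_year), round(years_left))
```

At that pace the remaining space lasts a bit under two centuries, which matches the "couple centuries" ballpark.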

bentley · on March 11, 2023

I’m surprised that you were downvoted, because you’re totally right. UTF‐16 can represent almost a 21‐bit range—yes, greater than 16—thanks to surrogate pairs. The largest codepoint representable in UTF‐16 is U+10FFFF. That’s why UTF‐8 dropped the 5‐ and 6‐byte sequences, why even UTF‐32 is limited to 21 bits, and why U+D800 to U+DFFF (the range UTF‐16 uses to encode surrogate pairs) are explicitly disallowed from being represented in UTF‐8 and UTF‐32.
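
To make the surrogate-pair limit concrete, here is a small Python sketch. The function names are hypothetical, and the UTF-8 check uses Python's built-in strict decoder as a stand-in; it says nothing about how the OpenBSD code in the article behaves.

```python
# Sketch: UTF-16 surrogate pairs and the UTF-8 restrictions they imply.
# Function names are hypothetical; the decoder is Python's own.

def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point not representable as a surrogate pair")
    offset = cp - 0x10000                 # fits in 20 bits
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def utf8_decodes(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The largest pair (0xDBFF, 0xDFFF) maps back to U+10FFFF, the ceiling.
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))

print(utf8_decodes(b"\xf4\x8f\xbf\xbf"))      # U+10FFFF, 4 bytes
print(utf8_decodes(b"\xf4\x90\x80\x80"))      # would be U+110000
print(utf8_decodes(b"\xed\xa0\x80"))          # would be surrogate U+D800
print(utf8_decodes(b"\xf8\x88\x80\x80\x80"))  # old-style 5-byte form
```

Two 10-bit halves give 20 bits on top of the 0x10000 base, which is exactly why the ceiling lands at U+10FFFF; only the first of the four byte strings should be accepted by a conforming decoder.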

hinkley · on March 10, 2023

My eyes are bleeding. Reader mode to the rescue.

Always easier to see someone else's problems than our own, I suppose.

notfed · on March 11, 2023

In case anyone thinks this is just snark, it's not. The font shadow legitimately hurts my eyes and I am unable to read this article.

spc476 · on March 11, 2023

And aside from one <svg> block and one <table> block, all the tags are either <div>s or <span>s. That's ... impressive.

winrid · on March 11, 2023

That's because your eyes think it's out of focus so the muscles get tired.

It's pretty terrible. Plz no text shadow.

MBCook · on March 11, 2023

Font shadow? All the article text looks normal. Only the headings have a pink shadow.

I’m on iOS for what it’s worth.

asddubs · on March 11, 2023

no, the font does have a very subtle shadow. you can see it better when you highlight the text, at least on FF on linux

hinkley · on March 11, 2023

Font shadow misuse and multiple saturated colors are the most common ways to make something painful to look at. Tiny fonts too.

This one used shadows and flirts with too much saturated color.

volkadav · on March 11, 2023

idk, I think the site's design is kind of endearing even if it isn't super ergonomic. Maybe I'm just old and nostalgic for the wild west creativity of the early web before everyone slapped bootstrap on wordpress or whatever and started looking the same. :)

jameshart · on March 11, 2023

'endearing' is about right. It's programmer art - the kind of stylesheet someone puts together to demonstrate all the features of their CSS engine, without concern as to its suitability to any particular visual purpose.

chungy · on March 11, 2023

There are themes you can change (bottom of page), but I don't think any of them significantly improve it. Worse still, everything is stuffed into <div> elements and it also makes Firefox's "No Style" mode basically useless.

Not only did the author get way too clever with CSS effects, but couldn't even be bothered to use semantic HTML so it could degrade nicely.

mouse_ · on March 10, 2023

I think they're going for something like https://old.reddit.com/r/mildlyinfuriating/

bmacho · on March 11, 2023

It looks fine to me. (windows, edge)

skullone · on March 11, 2023

My goodness, yah, shadows on -everything- I thought it was just my eyes, and "ok, I gotta get off the computer", at least I wasn't the only one!

camgunz · on March 11, 2023

I love the site design, and the other themes are cool too!

smm11 · on March 11, 2023

I was just thinking the old internet was better, then I visited the link. Not that I'm not interested, but ...

lsllc · on March 11, 2023

It loads too quickly! Is there a "baud-rate" CSS selector that could be set to 19.2K? (unless you were lucky enough to have one of those fancy 56K modems!).

dhosek · on March 11, 2023

I think I was on 56K pretty early. Sending full-page 2400dpi bitmaps as PostScript files to the service bureau to generate film was a lengthy process.

muyuu · on March 11, 2023

it's not old enough

it uses fancy text effects you wouldn't have used in the 90s, and assets way too heavy for the 90s

adjust those two things to the 90s and the site is pretty much perfect for what it does