Discovering one bug after another in the UTF-8 decoding logic in OpenBSD (exoticsilicon.com)
97 points by t-3 on March 10, 2023 | 38 comments


I love this site, aside from the design the articles on it are good reads. I guess OpenBSD people have a hilarious idea of design sometimes. I remember when de raadt and some others were making slides for conferences in MS comic sans just to "annoy design nerds" (not my wording IIRC) for a while


> I remember when de raadt and some others were making slides for conferences in MS comic sans just to "annoy design nerds" (not my wording IIRC) for a while

I'm pretty sure that's still a thing, and as somebody who nerds out on typography (my undergrad thesis used the Tufte LaTeX package), I fucking love it.


> aside from the design the articles on it are good reads

The design is perfectly fine.


It’s horrid.


It's weird to me that text encoding is put in the domain of the kernel. But I guess it's the way it needs to be until the world moves to microkernels.


Yeah, what's even weirder is when it's so integral to the kernel that it can lead to crashes. I've seen more than one case now where someone just fuzzing around on text engines ends up taking down an entire system, at least a couple on mac osx and ios alone. Dangerous stuff. [1] [2]

Linux might not be the most performant in the text engine game but at least it runs in userland.

[1] https://arstechnica.com/gadgets/2013/08/rendering-bug-crashe...

[2] https://www.engadget.com/unusual-buggy-character-string-appl...


I've been waiting for a "true" microkernel OS since the 1980s.


QNX and Minix 3 don't count?


This seems to be about the system console built into the kernel.


Yeah, I got that part. I've just assumed that would be handled by a server process in a microkernel architecture. But I've just checked seL4 and it has in-kernel support for the console as well. That's not what I expected.


The system terminal has to be in-kernel; otherwise you would not see kernel prints during boot, before any fs (and the servers launched from it) is mounted.


Even with a microkernel, there's a lot of power in having a text console even when half the system is not functioning. If this code exists in a separate process, a lot of the handling bugs still need to be fixed to make the OS usable.


Serial port is trivial.


okay... and something needs to be able to decode the text sent to it over the serial port. All the bugs pointed out in the article also happen if you use the serial port console


Yes, and that “something” can be an app like gnome-terminal running on a full-fat text rendering library and GUI framework… on a completely different computer. That software could hypothetically have the same bug, but probably not, and at any rate it’s not taking up space in the core of a microkernel.


The 120k+ lines of code in drivers/tty/serial in Linux suggest that maybe it's not _that_ trivial.


Tangentially, I always found it strange that five and six byte utf-8 sequences are illegal. My naive thought is that it provided a good amount of future proofing for when Unicode needs to expand again.
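For a sense of what the longer forms would have bought, here is a rough sketch that assumes the original 1-to-6-byte UTF-8 layout (the one described in RFC 2279, before RFC 3629 capped sequences at four bytes). `payload_bits` is a name made up for this illustration, not anything from the article or from OpenBSD.

```python
# Sketch: how many code point bits an n-byte sequence carries in the
# original 1-to-6-byte UTF-8 layout. Assumed layout: a lead byte with
# n one-bits and a zero, followed by n-1 continuation bytes of 6 bits each.

def payload_bits(n: int) -> int:
    """Code point bits carried by an n-byte sequence (n in 1..6)."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be 1..6")
    if n == 1:
        return 7                      # 0xxxxxxx
    return (7 - n) + 6 * (n - 1)      # lead byte bits + continuation bits

UNICODE_MAX = 0x10FFFF                # current upper limit of the code space

for n in range(1, 7):
    bits = payload_bits(n)
    print(f"{n} byte(s): {bits} bits, max value {hex((1 << bits) - 1)}")

# Four bytes already have more room than the code space uses.
print(UNICODE_MAX < (1 << payload_bits(4)))
```

Under that layout the five- and six-byte forms would reach 26 and 31 bits, and even the four-byte form has room above U+10FFFF that is simply declared invalid.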


Unicode was originally conceived of as a 16-bit encoding. Han unification was necessary to squeeze everything into 16 bits, but it (kind of) worked. Going from 16 bits to 20 bits¹ might not seem like a big increase, but remember that this is an exponential progression, and not a linear one. Those four bits increase the capacity of Unicode by a factor of 16. The amount of empty space is much greater than we’re likely to ever use. Furthermore, there are costs (albeit not necessarily large ones) to having those empty code points—even with efficient use of space using 2-level lookup tables, increasing the capacity of Unicode beyond its current size would add time and space complexity to code for doing things like identifying character codes, grapheme boundaries, case folding, sorting, etc.

1. Technically, it’s 21 bits, but since the maximum value for a code point is 0x10ffff, we can round down to 0xfffff—which means we’re treating the original 16-bit capacity of Unicode as a rounding error!
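To make the "2-level lookup table" idea concrete, here is a minimal sketch, not how any particular library does it. It splits the code space into 256-code-point blocks, stores each distinct block once, and keeps a first-level index that points at them. The names `build_two_level` and `lookup` and the block size of 256 are assumptions chosen for illustration; the property being tabulated is the general category from Python's standard `unicodedata` module.

```python
import unicodedata

BLOCK = 256  # assumed block size; real implementations pick their own


def build_two_level(prop, limit=0x110000):
    """Return (index, blocks): index maps block number -> slot in blocks."""
    index, blocks, seen = [], [], {}
    for start in range(0, limit, BLOCK):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK))
        if block not in seen:          # store each distinct block only once
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Two array reads: pick the block, then the entry inside it."""
    return blocks[index[cp >> 8]][cp & 0xFF]


index, blocks = build_two_level(lambda cp: unicodedata.category(chr(cp)))

print(len(index), "first-level entries,", len(blocks), "distinct blocks")
print(lookup(index, blocks, 0x41), lookup(index, blocks, 0x1F600))
```

The large unassigned stretches collapse into a handful of shared blocks, which is the space saving the comment refers to. The first-level index still grows linearly with the size of the code space, so a bigger code space is not entirely free.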


Five- and six-byte sequences are defined as illegal because they encode code points that don't fit in UTF-16. Of the 1,112,064 assignable code points allowed by UTF-16, we've only assigned 149,186 in the past 30 years, so we've probably got a couple centuries before we'll run out. Deal with it then.
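The arithmetic behind those figures can be checked in a few lines. This is a back-of-the-envelope sketch: the 149,186 and 30-year figures are taken from the comment above, and the "years left" estimate assumes a constant assignment rate, which is a naive assumption made only for this illustration.

```python
# Code space reachable by UTF-16: 17 planes of 65,536 code points each.
PLANES = 17
PLANE_SIZE = 0x10000
TOTAL = PLANES * PLANE_SIZE              # U+0000 through U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1         # reserved for surrogate pairs

assignable = TOTAL - SURROGATES          # code points that can hold characters

ASSIGNED = 149_186                       # figure quoted in the comment
YEARS_SO_FAR = 30                        # figure quoted in the comment

# Naive linear extrapolation (assumption): same rate of assignment forever.
rate_per_year = ASSIGNED / YEARS_SO_FAR
years_left = (assignable - ASSIGNED) / rate_per_year

print(f"assignable code points: {assignable:,}")
print(f"fraction used so far:   {ASSIGNED / assignable:.1%}")
print(f"years left at this rate: {years_left:.0f}")
```

At a constant rate this comes out a little under two centuries, which matches the "couple centuries" estimate.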


I’m surprised that you were downvoted, because you’re totally right. UTF‐16 can represent almost a 21‐bit range—yes, greater than 16—thanks to surrogate pairs. The largest codepoint representable in UTF‐16 is U+10FFFF. That’s why UTF‐8 dropped the 5‐ and 6‐byte sequences, why even UTF‐32 is limited to 21 bits, and why U+D800 to U+DFFF (the range UTF‐16 uses to encode surrogate pairs) are explicitly disallowed from being represented in UTF‐8 and UTF‐32.
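A short sketch of the surrogate-pair arithmetic shows where the U+10FFFF ceiling comes from, and Python's built-in UTF-8 codec can be used to confirm the restrictions mentioned above. `utf16_encode` and `rejected` are helper names invented for this example.

```python
def utf16_encode(cp: int) -> list[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # fits in a single code unit
    v = cp - 0x10000                     # 20 bits remain
    return [0xD800 | (v >> 10),          # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]        # low surrogate: bottom 10 bits


def rejected(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the byte string."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The largest pair, DBFF DFFF, lands exactly on U+10FFFF.
print([hex(u) for u in utf16_encode(0x10FFFF)])

print(rejected(b"\xf4\x8f\xbf\xbf"))      # U+10FFFF itself: accepted
print(rejected(b"\xf4\x90\x80\x80"))      # would be U+110000: refused
print(rejected(b"\xed\xa0\x80"))          # would be U+D800: refused
print(rejected(b"\xf8\x88\x80\x80\x80"))  # 5-byte lead byte: refused
```

Two 10-bit halves give 2^20 values on top of the first 0x10000, so the highest reachable code point is 0x10000 + 0xFFFFF = 0x10FFFF.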


My eyes are bleeding. Reader mode to the rescue.

Always easier to see someone else's problems than our own, I suppose.


In case anyone thinks this is just snark, it's not. The font shadow legitimately hurts my eyes and I am unable to read this article.


And aside from one <svg> block and one <table> block, all the tags are either <div>s or <span>s. That's ... impressive.


That's because your eyes think it's out of focus so the muscles get tired.

It's pretty terrible. Plz no text shadow.


Font shadow? All the article text looks normal. Only the headings have a pink shadow.

I’m on iOS for what it’s worth.


no, the font does have a very subtle shadow. you can see it better when you highlight the text, at least on FF on linux


Font shadow misuse and multiple saturated colors are the most common ways to make something painful to look at. Tiny fonts too.

This one used shadows and flirts with too much saturated color.


idk, I think the site's design is kind of endearing even if it isn't super ergonomic. Maybe I'm just old and nostalgic for the wild west creativity of the early web before everyone slapped bootstrap on wordpress or whatever and started looking the same. :)


'endearing' is about right. It's programmer art - the kind of stylesheet someone puts together to demonstrate all the features of their CSS engine, without concern as to its suitability to any particular visual purpose.


There are themes you can change (bottom of page), but I don't think any of them significantly improve it. Worse still, everything is stuffed into <div> elements and it also makes Firefox's "No Style" mode basically useless.

Not only did the author get way too clever with CSS effects, but couldn't even be bothered to use semantic HTML so it could degrade nicely.


I think they're going for something like https://old.reddit.com/r/mildlyinfuriating/


It looks fine to me. (windows, edge)


My goodness, yah, shadows on -everything- I thought it was just my eyes, and "ok, I gotta get off the computer", at least I wasn't the only one!


I love the site design, and the other themes are cool too!


I was just thinking the old internet was better, then I visited the link. Not that I'm not interested, but ...


It loads too quickly! Is there a "baud-rate" CSS selector that could be set to 19.2K? (unless you were lucky enough to have one of those fancy 56K modems!).


I think I was on 56K pretty early. Sending full-page 2400dpi bitmaps as PostScript files to the service bureau to generate film was a lengthy process.


it's not old enough

it uses fancy text effects you wouldn't have used in the 90s, and assets way too heavy for the 90s

adjust those two things to the 90s and the site is pretty much perfect for what it does



