
I don't know; to me it reads like someone trying to show off, but I don't think it's helpful.

> The latter three overlong encodings aren't considered canonical, and software devs have to fix them to make string matching efficient and so forth, to save space, and most worrisomely to prevent attackers from sliding special characters into strings to crack systems - say by using larger, noncanonical encodings to evade filters that would catch and block the canonical, shorter encodings.

This is misleading. "Software devs" don't have to "fix them". Overlong encodings have been invalid in UTF-8 since Unicode 3.1, precisely because of the security considerations, and conformant implementations are required "not to interpret any ill-formed code unit subsequences" (Unicode Standard, section 3.9). RFC 3629 also states that "[i]mplementations of the decoding algorithm above MUST protect against decoding invalid sequences."

Fortunately, this is pretty straightforward to achieve: the Unicode Standard contains a nine-row table (Table 3-7, "Well-Formed UTF-8 Byte Sequences") listing all valid combinations of byte ranges for sequences of one to four bytes.
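
To give an idea of how little code that takes, here is a rough sketch of a validator driven directly by those nine rows. The names (WELL_FORMED, is_well_formed_utf8) are mine, not from the standard or any library, and in practice you would just rely on your language's strict decoder; this is only meant to show that the table by itself rules out overlong forms, surrogates, and anything above U+10FFFF.

```python
# Inclusive (low, high) byte ranges, one tuple per row of the table of
# well-formed UTF-8 byte sequences. Names here are illustrative only.
CONT = (0x80, 0xBF)  # ordinary continuation byte
WELL_FORMED = [
    ((0x00, 0x7F),),                           # U+0000..U+007F
    ((0xC2, 0xDF), CONT),                      # U+0080..U+07FF (C0, C1 excluded)
    ((0xE0, 0xE0), (0xA0, 0xBF), CONT),        # U+0800..U+0FFF
    ((0xE1, 0xEC), CONT, CONT),                # U+1000..U+CFFF
    ((0xED, 0xED), (0x80, 0x9F), CONT),        # U+D000..U+D7FF (no surrogates)
    ((0xEE, 0xEF), CONT, CONT),                # U+E000..U+FFFF
    ((0xF0, 0xF0), (0x90, 0xBF), CONT, CONT),  # U+10000..U+3FFFF
    ((0xF1, 0xF3), CONT, CONT, CONT),          # U+40000..U+FFFFF
    ((0xF4, 0xF4), (0x80, 0x8F), CONT, CONT),  # U+100000..U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a sequence matching a row."""
    i = 0
    while i < len(data):
        for row in WELL_FORMED:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
            ):
                i += len(row)  # consume the matched sequence
                break
        else:
            return False  # no row matched at this position: ill-formed
    return True
```

With this, the canonical b"/" passes, while the overlong forms of the same character (b"\xc0\xaf", b"\xe0\x80\xaf", b"\xf0\x80\x80\xaf") are all rejected, simply because no row allows those lead and second byte combinations.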
