Memory savings isn't the important thing. Small strings are, after all, small. You can fit a lot of them in cache. What you're saving is the CPU cost of malloc()+free().
I'll trust that they had good performance data that made them decide to do this. Still, I'm always a bit skeptical about these clever string tricks. It's an optimization that looks amazing in microbenchmarks but has costs in the real world. That's because you're adding a (potentially hard-to-predict) branch to every access to the string.
If your application constantly allocates and deallocates lots of tiny ASCII strings, this clever representation will be fantastic for you. However, if you use a mix of string sizes, tend to keep them around for a while, and do a lot of manipulation on those strings, you pay the branch-misprediction cost over and over.
Apple's implementation actually adds a (potentially hard-to-predict) branch to every single Objective-C message send.
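For concreteness, here's a minimal sketch of the kind of check the dispatch fast path has to make, assuming the tag bit lives where it's commonly described (the most significant pointer bit on arm64, the least significant bit on x86-64); the exact bit position and any obfuscation are Apple implementation details:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed tag-bit location; an implementation detail that differs by platform. */
    #if defined(__arm64__) || defined(__aarch64__)
    #define TAG_MASK (1ULL << 63)   /* high bit on arm64 */
    #else
    #define TAG_MASK 1ULL           /* low bit on x86-64 */
    #endif

    static inline bool is_tagged_pointer(const void *obj) {
        return ((uintptr_t)obj & TAG_MASK) != 0;
    }

    /* The dispatch fast path now begins with, roughly:
     *   if (is_tagged_pointer(receiver))  derive the class from the tag bits;
     *   else                              read the isa pointer as usual;
     * and that conditional is the extra branch on every message send. */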
They're extremely careful about the performance of objc_msgSend, because it's so significant to the overall performance of apps. The running time for a hot-path message send is down to single-digit CPU cycles, so any addition shows up pretty loudly. I'm sure they measured to make sure the check doesn't add too much overhead in real-world use, and that the wins are worth it.
It's not really CPU-intensive. To turn a 5-bit char into an 8-bit one, you just do a lookup into a tiny constant array, which doesn't even introduce any new branches. It's roughly the same performance as iterating over every character (which you're probably doing anyway if you need the conversion).
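To make that concrete, a minimal sketch of such a decode, assuming a 5-bit packed payload in the low bits of the pointer; the 32-character alphabet below is illustrative only (the real table is chosen by Apple from character-frequency data and may differ):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative 32-entry alphabet for a 5-bit encoding; the real table
     * is an Apple implementation detail. */
    static const char k5BitAlphabet[] = "eilotrm.apdnsIc ufkMShjTRxgC4013";

    /* Unpack `len` 5-bit codes from the payload into 8-bit ASCII:
     * one table lookup per character, no branch beyond the loop itself. */
    static void decode_5bit(uint64_t payload, size_t len, char *out) {
        for (size_t i = 0; i < len; i++) {
            out[i] = k5BitAlphabet[(payload >> (5 * i)) & 0x1F];
        }
        out[len] = '\0';
    }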
The savings aren't just in memory; this is probably a performance improvement too. Fewer allocations, fewer pointer indirections, and some operations (like equality checking) become O(1) instead of O(n) for these tiny strings.
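As a sketch of why equality gets cheap, assuming the canonical-encoding property usually described for tagged strings (equal contents produce equal bit patterns), the comparison collapses to a single integer compare:

    #include <stdbool.h>
    #include <stdint.h>

    /* Valid only when both values are tagged strings: identical contents
     * encode to identical bit patterns, so no character walk is needed. */
    static bool tagged_strings_equal(const void *a, const void *b) {
        return (uintptr_t)a == (uintptr_t)b;
    }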
The other way round (deciding whether a newly minted string of length 10 can be put in a tagged pointer) is slightly more complex. Also, for tiny strings (the only strings that this supports) on modern hardware, I'm not sure that O(n) takes much longer than that O(1).
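A rough sketch of that creation-time decision, assuming the commonly described cutoffs (up to 7 characters stored as 8-bit ASCII, up to 9 from a 6-bit alphabet, up to 11 from a 5-bit alphabet); the alphabets and limits here are illustrative and can change between OS releases:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative alphabets; the real tables are Apple implementation details. */
    static const char k6BitAlphabetEx[] =
        "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
    static const char k5BitAlphabetEx[] = "eilotrm.apdnsIc ufkMShjTRxgC4013";

    /* s is NUL-terminated with strlen(s) == len. */
    static bool fits_in_tagged_pointer(const char *s, size_t len) {
        if (len <= 7) {                      /* 8-bit encoding: plain ASCII assumed */
            for (size_t i = 0; i < len; i++)
                if ((unsigned char)s[i] > 0x7F) return false;
            return true;
        }
        if (len <= 9)  return strspn(s, k6BitAlphabetEx) == len;   /* 6-bit */
        if (len <= 11) return strspn(s, k5BitAlphabetEx) == len;   /* 5-bit */
        return false;
    }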
I would think that avoiding the allocations and the pointer indirections are the big wins.
Marshalling a string object ref to an actual value (before it can be used for string manipulations internally) must incur a fair bit of CPU overhead.
I guess they decided the unboxed memory savings were worth it.