It also destroys models ability to follow phonetic or syntactic constraints: htt...

It also destroys models ability to follow phonetic or syntactic constraints: https://paperswithcode.com/paper/most-language-models-can-be...

I've been waiting for real innovation to come in tokenization algorithms. I think a truly well trained character level model would be awesome. Far easier to guide, and might even know how to "count" it's own output.