I've been using Kokoro TTS with the CLI app, audiblez, mentioned in the "Similar Projects" section of the README. The model is fast and delivers impressive quality for its small size. Some issues I have faced, however, are:
a) It doesn't distinguish periods at the end of sentences from the dots in abbreviations such as "Mr." or "Mrs." The result is an awkward pause between "Mr." and the name.
b) It doesn't handle ellipses well.
c) Words are pronounced the same way regardless of context.
I fixed these issues here: https://github.com/cpttripzz/audiblez
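For anyone curious what a fix for (a) and (b) can look like, here is a minimal sketch of pre-processing the text before it reaches the TTS engine. This is not the code from the linked repository; the `ABBREVIATIONS` table and the `normalize_for_tts` function are hypothetical names made up for illustration, and a real abbreviation list would need to be much longer.

```python
import re

# Hypothetical abbreviation table (illustration only).
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "Dr.": "Doctor",
}

# Match a listed abbreviation that is not glued to a preceding word
# character and is followed by whitespace.
_ABBREV_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")(?=\s)"
)

# Three or more dots, or the single-character ellipsis, plus any
# whitespace around them.
_ELLIPSIS_RE = re.compile(r"\s*(?:\.{3,}|…)\s*")


def normalize_for_tts(text: str) -> str:
    """Expand abbreviations and turn ellipses into a comma pause."""
    text = _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    text = _ELLIPSIS_RE.sub(", ", text)
    return text


print(normalize_for_tts("Mr. Smith waited... then left."))
# Mister Smith waited, then left.
```

Spelling the abbreviation out removes the dot entirely, so the engine has nothing to mistake for a sentence end. Mapping every ellipsis to a comma is crude (a trailing ellipsis would probably be better as a period), but it shows the idea.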
The main problem with Kokoro is how flat and lifeless it sounds, though it is very fast. I prefer Chatterbox TTS, but it is around 20 times slower and will not work without a GPU.
Kokoro is small and fast because all the text -> phoneme conversion is done by “dumb code” and only the phoneme -> sound part is done using a neural net.
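To make the "dumb code" point concrete, here is a toy sketch of a lookup-based text -> phoneme step. It is not Kokoro's actual front end; the `LEXICON` table and the `to_phonemes` function are hypothetical, and the phoneme strings are only rough IPA.

```python
import re

# Hypothetical toy lexicon: one fixed pronunciation per spelling.
LEXICON = {
    "i": "aɪ",
    "will": "wɪl",
    "have": "hæv",
    "read": "ɹiːd",
    "it": "ɪt",
}


def to_phonemes(text):
    """Look each word up in the lexicon; unknown words become '?'."""
    words = re.findall(r"[a-z']+", text.lower())
    return [LEXICON.get(word, "?") for word in words]


# Present tense: "read" should rhyme with "need".
print(to_phonemes("I will read it"))
# Past participle: "read" should rhyme with "bed", but a plain lookup
# returns the same phonemes as above.
print(to_phonemes("I have read it"))
```

A lookup keyed on spelling alone cannot tell the two pronunciations of "read" apart, which is the kind of context problem described in (c) above. Handling it needs at least part-of-speech information at this stage.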