
This opens up huge possibilities. It's likely we could simply plug in Stable Diffusion using a linear layer, as well as Whisper and some TTS. That would get an end-to-end mixed image/sound/text engine running on a laptop.
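
To make the "linear layer" idea a bit more concrete, here is a minimal sketch, assuming it means the usual adapter trick: take the feature vectors one model produces and project them with a single linear map into the embedding width another model expects, then concatenate them with the text embeddings. Everything here is hypothetical (the names `make_linear_adapter` and `project`, the toy sizes 8 and 16), the weights are random rather than trained, and plain Python lists stand in for a tensor library so it runs without dependencies.

```python
import random

def make_linear_adapter(src_dim, dst_dim, seed=0):
    """Return (weight, bias) for a src_dim -> dst_dim linear layer (random, untrained)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(dst_dim)] for _ in range(src_dim)]
    bias = [0.0] * dst_dim
    return weight, bias

def project(features, weight, bias):
    """Apply y = x @ W + b to each feature vector in `features`."""
    return [
        [sum(x * w_row[j] for x, w_row in zip(vec, weight)) + bias[j]
         for j in range(len(bias))]
        for vec in features
    ]

# Toy sizes, chosen only for illustration; real models are far wider.
ENCODER_DIM = 8   # width of the (hypothetical) image/audio encoder output
LLM_DIM = 16      # width of the (hypothetical) language model embeddings

# Stand-in for 4 patch embeddings coming out of an image encoder.
image_features = [[1.0] * ENCODER_DIM for _ in range(4)]

weight, bias = make_linear_adapter(ENCODER_DIM, LLM_DIM)
image_tokens = project(image_features, weight, bias)

# Stand-in for 3 text token embeddings already in the LLM's space.
text_tokens = [[0.0] * LLM_DIM for _ in range(3)]

# The language model would then consume one mixed sequence.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 16
```

In practice the projection would have to be trained on paired data while the big models stay frozen, which is what would keep the whole thing cheap enough for a laptop.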

I wonder if there's a powerful enough ViT model that does OCR.


