This opens up huge possibilities. It's likely we could simply plug in Stable Diffusion through a linear layer, as well as Whisper and some TTS, and get an end-to-end mixed image/sound/text engine running on a laptop.
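To make the "linear layer" idea concrete, here's a rough sketch of what such an adapter does: it projects each hidden state from the language model into whatever embedding size the other model expects. The dimensions and the names `make_linear` and `project` are made up for illustration; a real adapter would be trained, and the actual sizes depend on the models involved.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a randomly initialised linear layer."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keep output magnitudes roughly in line with inputs
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(tokens, weights, bias):
    """Apply y = W x + b to every token embedding in the sequence."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weights, bias)]
        for tok in tokens
    ]

# Hypothetical sizes (assumptions, not taken from any real model):
# 5 tokens of 64-dim language-model states, projected to a 32-dim conditioning space.
LM_DIM, COND_DIM = 64, 32
rng = random.Random(1)
lm_states = [[rng.gauss(0.0, 1.0) for _ in range(LM_DIM)] for _ in range(5)]

weights, bias = make_linear(LM_DIM, COND_DIM)
cond = project(lm_states, weights, bias)
print(len(cond), len(cond[0]))  # 5 32
```

The untrained weights here only show the shape change; the open question is whether training just this one layer is enough to align the two models.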
I wonder if there's a powerful enough ViT model that does OCR.