I'm late to the party, but what stops us from training this SD model on audio spectrograms? Then you could tell it "some Mozart-style violin for 5 seconds, add drums in the background." The spectrogram is then translated back to sound, and suddenly you're a very decent music writer.
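The spectrogram-to-sound step is the only non-obvious part, so here's a minimal round-trip sketch with librosa: audio in, log-mel spectrogram "image" out, then back to audio via Griffin-Lim phase estimation. File names are placeholders, and the STFT parameters are just common defaults, not anything a real spectrogram-SD model would be committed to:

```python
import numpy as np
import librosa
import soundfile as sf

# load 5 s of a clip (path is a placeholder)
y, sr = librosa.load("mozart_violin.wav", sr=22050, duration=5.0)

# forward: mel power spectrogram, log-scaled like a grayscale image
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)  # what the model would see/generate

# inverse: undo the dB scaling, then recover phase with Griffin-Lim
S_power = librosa.db_to_power(S_db, ref=np.max(S))
y_hat = librosa.feature.inverse.mel_to_audio(S_power, sr=sr, n_fft=2048,
                                             hop_length=512)
sf.write("reconstructed.wav", y_hat, sr)
```

The catch is that the spectrogram throws away phase, so Griffin-Lim reconstruction sounds noticeably degraded; you'd probably want a neural vocoder for listenable output.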
With img2img you could give it an audio file's spectrogram, call it "S", and say "music in S style, but with flute".
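That maps directly onto the existing diffusers img2img pipeline, if such a spectrogram-trained checkpoint existed. A hedged sketch below; "spectro-sd" is a made-up checkpoint name and the file names are placeholders:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# hypothetical SD checkpoint fine-tuned on spectrograms
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "spectro-sd", torch_dtype=torch.float16
).to("cuda")

# spectrogram of clip "S", rendered as an image
init = Image.open("clip_S_melspec.png").convert("RGB").resize((512, 512))

# low strength keeps S's structure (melody/rhythm); the prompt steers timbre
out = pipe(prompt="music in S style, but with flute",
           image=init, strength=0.4, guidance_scale=7.5).images[0]
out.save("flute_variant_melspec.png")
```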
mp3 density: ~30 s per 1 MB (instrumental music with some accompaniment). jpg density: ~12 M pixels per MB (trees and some landscaping). I'd argue music carries a lot more information, if we can compare seconds with pixels. IMHO, OpenAI didn't do a great job: a small dataset and a limited model.
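Back-of-envelope, using the two figures above (decimal megabytes assumed, since the post doesn't specify):

```python
MB = 1_000_000

# mp3: ~30 s of music per 1 MB  ->  bitrate
mp3_kbps = MB * 8 / 30 / 1000            # ~267 kbps

# jpg: ~12 M pixels per 1 MB  ->  bits per pixel
jpg_bpp = MB * 8 / 12_000_000            # ~0.67 bits/pixel

# size comparison: 1 s of audio vs one 512x512 SD canvas
samples_per_sec = 44_100                 # CD sample rate
sd_canvas_pixels = 512 * 512             # ~262 k pixels

print(f"mp3 ~{mp3_kbps:.0f} kbps, jpg ~{jpg_bpp:.2f} bits/pixel")
```

So a single second of audio already burns dozens of kilobytes, while a whole 512x512 canvas fits in well under a tenth of a megabyte at that jpg quality; that's the sense in which music is the denser signal here.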