I did transcription for a while in 2021. It is absurdly hard, especially since these days humans only get the difficult jobs that AI has already taken a stab at.
The hardest one I did was for a sports network: a motocross event where most of what you could hear was the roar of the bikes. I had to transcribe two commentators over the top of that mess, and they were using insider slang nicknames for all the riders, not their published names, so I had to sit and Google forums to find the riders' names while I was listening. I'm not sure how these local models could handle that insanity at all, because they almost certainly lack the domain knowledge.
I was skeptical upon hearing the figure, but various sources do indeed back it up, and [0] is a pretty interesting paper (old but still relevant, since human transcribers haven't changed in accuracy).
I think it's actually hard to verify at scale how correct a transcription is. I'm curious where those error rate numbers come from, because they should be measured on people actually doing the job.
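For what it's worth, error rates like these are usually reported as word error rate (WER): the word-level edit distance between a trusted reference transcript and the transcript under test, divided by the number of reference words. Below is a minimal sketch of that calculation. The function name and the lowercase-and-split normalization are my own assumptions for illustration, not how any particular study did it, and it still needs a trusted reference transcript, which is exactly the part that is hard to get at scale.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # reference word dropped
                curr[j - 1] + 1,                   # extra word inserted
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the bikes roar past", "the bike roar"))
```

In the example, one word is substituted and one is dropped out of four reference words.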
You missed a giant factor: domain knowledge. Transcribing something outside your area of knowledge is very hard. I posted above about transcribing the commentary of a motorbike race where the commentators only used the riders' slang names.
If you transcribe a minute of conversation, you'll get about 5 words wrong. In an hour-long podcast, that adds up to 300 wrongly transcribed words.
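The arithmetic is simple enough to sketch. The 5 errors per minute and the 60 minutes come from the comment above. The speaking rate of 150 words per minute is purely an assumed figure, used only to turn the error count into a rough percentage, and the function names are made up for illustration.

```python
def errors_for_duration(errors_per_minute: float, minutes: float) -> float:
    """Total wrongly transcribed words over a recording of the given length."""
    return errors_per_minute * minutes


def implied_error_rate(errors_per_minute: float, words_per_minute: float) -> float:
    """Fraction of spoken words transcribed wrongly."""
    return errors_per_minute / words_per_minute


ERRORS_PER_MINUTE = 5            # figure from the comment above
PODCAST_MINUTES = 60             # one-hour podcast
ASSUMED_WORDS_PER_MINUTE = 150   # assumption: rough conversational speaking rate

total_errors = errors_for_duration(ERRORS_PER_MINUTE, PODCAST_MINUTES)
rate = implied_error_rate(ERRORS_PER_MINUTE, ASSUMED_WORDS_PER_MINUTE)

print(total_errors)      # 300
print(f"{rate:.1%}")     # 3.3%
```

Under that assumed speaking rate, 5 wrong words a minute works out to roughly a 3% error rate, so a figure that sounds small per minute still adds up to hundreds of wrong words per episode.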