I get the sense that the solution to this problem is more use of LLMs (running critical feedback and review in a loop) rather than less use of LLMs.
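Roughly what I mean by a feedback loop, as a sketch in Python. `call_llm` here is just a placeholder for whatever model client you're actually using, and the prompts are illustrative, not anything prescribed:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to your model client of choice."""
    raise NotImplementedError

def review_loop(task: str, max_rounds: int = 3) -> str:
    # Generate a first draft, then alternate critique and revision.
    draft = call_llm(f"Complete the following task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(
            "You are a strict reviewer. List concrete factual or logical "
            "problems in this answer, or reply PASS if there are none.\n\n"
            f"Task: {task}\n\nAnswer: {draft}"
        )
        if critique.strip() == "PASS":
            break
        draft = call_llm(
            f"Task: {task}\n\nPrevious answer: {draft}\n\n"
            f"Reviewer feedback: {critique}\n\n"
            "Rewrite the answer to address the feedback."
        )
    return draft
```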
If you can build good tooling around today's kinda-dumb LLMs to lower that number, we'll be in a pretty good position as the foundation models continue to improve.
Yeah, I'd imagine the problem is a lack of verification of the output against the retrieved documents. If the model just hallucinates, it ignores the given context, and that's something another LLM can absolutely check.
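A minimal sketch of that kind of grounding check, assuming the same placeholder `call_llm` as above. The checker only has to decide whether the answer is supported by the retrieved context, which is a much easier job than generating the answer in the first place:

```python
def is_grounded(answer: str, retrieved_docs: list[str]) -> bool:
    # Concatenate the retrieved documents into a single context block.
    context = "\n---\n".join(retrieved_docs)
    verdict = call_llm(
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Does the answer contain any claim that is not supported by the "
        "context? Reply with exactly SUPPORTED or UNSUPPORTED."
    )
    return verdict.strip().upper() == "SUPPORTED"
```

If the check fails, you can feed the failure back into the loop above and regenerate rather than surfacing the answer to the user.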