> but using code generated from an LLM is pure madness unless what you are building is truly going to be thrown away and rewritten from scratch, as is relying on it as a linting, debugging, or source of truth tool.
That does not match my experience at all. You obviously have to use your brain to review it, but for a lot of problems LLMs produce close to perfect code in record time. It depends a lot on your prompting skills though.
Perhaps I suck at prompting, but what I've noticed is that if an LLM has hallucinated something or learned a fake fact, it will keep using that fact no matter how you try to steer it away. The only way out of the loop is to already know the answer yourself, but in that case you wouldn't need the LLM.
I’ve found a good way to get unstuck here is to use another model, either of comparable or superior quality, or interestingly sometimes even a weaker version of the same product (e.g. Claude Haiku vs. Sonnet*). My mental model here is similar to pair programming, or simply bringing in a colleague when you’re stuck.
*I don’t know to what extent it’s worthwhile debating whether you could call these the same model vs. entirely different models, for any two products in the same family, beyond the case of simply quantising the same model and nothing else. Maybe you could include distillations of a base model too?
The idea of using a smaller version of the same (or a similar) model as a check is interesting. Overfitting is a very basic phenomenon, and it tends to be less pronounced in systems with fewer parameters. When this works, you may be seeing examples of exactly that.
> The idea of using a smaller version of the same (or a similar) model as a check is interesting.
I built my chat app around this idea, and to save money. When it comes to coding, I feel Sonnet 3.5 is still the best, but I don't start with it. I tend to use cheaper models at the beginning, since it usually takes a few iterations to get to a certain point and I don't want to waste tokens in the process. When I've reached a certain state, or when it's clear the LLM is not helping, I bring in Sonnet to review things.
Here is an example of how the conversation between models will work.
The reason why this works for my application is, I have a system prompt that includes the following lines:
# Critical Context Information
Your name is {{gs-chat-llm-model}} and the current date and time is {{gs-chat-datetime}}.
When I make an API call, I replace the template strings with the model name and the current date. I also include instructions in the first user message telling the model to sign off on each of its messages. With the system prompt and the message signature in place, you can then ask one model "what do you think of <LLM>'s response?".
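A minimal sketch of that flow, assuming a Python backend: the placeholder names come from the system prompt quoted above, while `call_model`, the model identifiers, and the message layout are illustrative stand-ins rather than the actual app's code.

```python
from datetime import datetime

# Placeholder names are taken from the system prompt above; everything else
# (function names, model identifiers, message format) is an illustrative guess.
SYSTEM_TEMPLATE = (
    "# Critical Context Information\n"
    "Your name is {{gs-chat-llm-model}} and the current date and time is "
    "{{gs-chat-datetime}}."
)

def render_system_prompt(model_name: str) -> str:
    """Substitute the template strings just before making the API call."""
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    return (SYSTEM_TEMPLATE
            .replace("{{gs-chat-llm-model}}", model_name)
            .replace("{{gs-chat-datetime}}", now))

def call_model(model_name: str, system: str, messages: list[dict]) -> str:
    """Stand-in for whatever chat-completion client the app actually uses."""
    return f"(reply from {model_name})\n-- {model_name}"

# The first user message asks the model to sign each reply, so later turns
# can point at a specific model's answer by name.
history = [{
    "role": "user",
    "content": ("Please sign off every reply with your name.\n\n"
                "How should I structure retries in this HTTP client?"),
}]

# Iterate with a cheaper model first...
cheap = "claude-haiku"
history.append({"role": "assistant",
                "content": call_model(cheap, render_system_prompt(cheap), history)})

# ...then bring in the stronger model to review the signed response.
history.append({"role": "user",
                "content": f"What do you think of {cheap}'s response?"})
strong = "claude-3.5-sonnet"
print(call_model(strong, render_system_prompt(strong), history))
```

Because each reply carries a signature and the system prompt tells each model its own name, the review request can unambiguously refer to the earlier model's answer.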
That is not my experience. I wrote recently [1] about how I use it; it's more like an intern, a pair programmer, or a rubber duck, none of which makes you worse.