As of today though, that doesn't work. Even straightforward tasks that are perfectly spec-ed can't be reliably done with agents, at least in my experience.
I recently used Claude for a refactor. I had an exact list of call sites, with positions and so on. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the position pointed at .result() or whatever). I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near the ones I specified.
That was after iterating a few times on the prompt (the first time it didn't want to do it because it was too much work, the second time it tried to do it via regex, etc.).
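To make the shape of the task concrete, here is a toy sketch in Python (the Builder class and .bar are made up; .foo and .result are just the placeholders from above). The requested edit was simply to insert a .foo() call right before the .result() call at each listed site:

```python
class Builder:
    """Stand-in for the real builder type (hypothetical)."""
    def __init__(self):
        self._calls = []

    def bar(self, x):
        self._calls.append(("bar", x))
        return self

    def foo(self):               # the call the model had to add
        self._calls.append(("foo", None))
        return self

    def result(self):
        return self._calls

# before: Builder().bar(1).result()
# after:  Builder().bar(1).foo().result()
print(Builder().bar(1).foo().result())
```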
> An operation that propagates a NaN operand to its result and has a single NaN as an input should produce a NaN with the payload of the input NaN if representable in the destination format.

> If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload.
As the comment below notes, "should" here means it is recommended, but not required. And there are indeed platforms that do not implement the recommendation.
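You can watch the recommendation at work (or not) with a few lines of bit twiddling. A sketch in Python; what it prints is platform-dependent by design:

```python
import struct

def nan_with_payload(payload):
    """Build a quiet double-precision NaN carrying `payload` in its low mantissa bits."""
    bits = 0x7FF8000000000000 | (payload & 0x0007FFFFFFFFFFFF)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def payload_of(x):
    """Read back the low 51 mantissa bits of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0x0007FFFFFFFFFFFF

a = nan_with_payload(42)
b = nan_with_payload(7)

# The "should" is exactly what you can't rely on: many platforms print 42
# here, but nothing requires it, and with two NaN inputs the standard does
# not even say which payload wins.
print(payload_of(a + 1.0))   # often 42, not guaranteed
print(payload_of(a + b))     # often 42 or 7, unspecified which
```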
I work in finance and we have prod Excel spreadsheets. Those spreadsheets are versioned like code artifacts, with automated testing and everything. Converting them to real applications is a major part of the work the technology division does.
They usually come about because some new and exciting line of business is started by a small team as a POC. Those teams don't get full technology backing; it would slow down the early iteration and cost a lot of money for an idea that may not be lucrative. Eventually they make a lot of money, and by then risk controls basically require them to document every single change they make in Excel. That eventually sucks enough that they complain and get a tech team to convert the spreadsheet.
In my experience they are the exception rather than the rule, and many more businesses have sheets that tend further toward Heath Robinson than would be admitted in public.
According to that OpenAI paper, models hallucinate in part because they are optimized against benchmarks that reward guessing. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just creating benchmarks that advantage yourself.
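For what it's worth, "penalizes guessing" can be as simple as a grading rule along these lines (a sketch; the penalty value is made up, not taken from the paper):

```python
def grade(prediction, truth, wrong_penalty=2.0):
    """Score one answer: abstention is worth 0, a wrong guess costs more than a right one gains."""
    if prediction is None:            # model said "I don't know"
        return 0.0
    return 1.0 if prediction == truth else -wrong_penalty

# Under this rule, guessing at low accuracy has negative expected value,
# while abstaining scores 0, so calibrated refusal is rewarded.
print(grade("Paris", "Paris"))   #  1.0
print(grade("Lyon", "Paris"))    # -2.0
print(grade(None, "Paris"))      #  0.0
```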
That is such a cop-out. If there were a really good benchmark for getting rid of hallucinations, it would be included in every eval comparison graph.
The real reason is that every benchmark I've seen shows Anthropic with lower hallucination rates.
The captcha would have to be something really boring and repetitive, like having to translate a word from one of ten languages into English on every click and then make a bullet list of what it means.
I think most of this trial-and-error "You are an experienced engineer" stuff probably hurts model performance. No one ever does comprehensive testing, so eh, yolo.
There are papers showing that models follow instructions less reliably the more instructions they are given. Now think about how many instructions are embedded in that MD, plus the system prompt, plus likely a local AGENTS.md, and in the end there is probably very little here that matters.
OTOH most model APIs are basically identical to each other. You can switch from one to the other using OpenRouter without even altering the code. Furthermore, they aren't reliable (drop rates can be as high as 20%), and compliance "guarantees" are, AFAIK, completely untested. Has anyone used the Copilot compliance guarantees to defend themselves in a copyright infringement suit yet?
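Concretely, "switch from one to the other" is just a model string when you go through OpenRouter's OpenAI-compatible endpoint. A sketch; the model IDs and key below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",    # placeholder
)

for model in ("anthropic/claude-3.5-sonnet", "openai/gpt-4o"):
    resp = client.chat.completions.create(
        model=model,   # the only thing that changes between providers
        messages=[{"role": "user", "content": "Reply with one word."}],
    )
    print(model, resp.choices[0].message.content)
```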
I think you are right that trust and operational certainty justify significant premiums. It would be great if trust and operational certainty were available.