As of today though, that doesn't work. Even straightforward tasks that are perfectly spec-ed can't be reliably done with agents, at least in my experience.
I recently used Claude for a refactor. I had an exact list of call sites, with positions and so on. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the position pointed at .result() or whatever). I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near the ones I specified.
That was after iterating a few times on the prompt (the first time it didn't want to do it because it was too much work, the second time it tried to do it via regex, etc.).
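To make the shape of the task concrete, here is a toy sketch in Python (the Builder class and .bar are made up; .foo and .result are just the placeholders from above). The requested edit was simply to insert a .foo() call right before the .result() call at each listed site:

```python
class Builder:
    """Stand-in for the real builder type (hypothetical)."""
    def __init__(self):
        self._calls = []

    def bar(self, x):
        self._calls.append(("bar", x))
        return self

    def foo(self):               # the call the model had to add
        self._calls.append(("foo", None))
        return self

    def result(self):
        return self._calls

# before: Builder().bar(1).result()
# after:  Builder().bar(1).foo().result()
print(Builder().bar(1).foo().result())
```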
> An operation that propagates a NaN operand to its result and has a single NaN as an input should produce a NaN with the payload of the input NaN if representable in the destination format.

> If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload.
As the comment below notes, "should" here means it is recommended, but not required. And there are indeed platforms that do not implement the recommendation.
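You can watch the recommendation at work (or not) with a few lines of bit twiddling. A sketch in Python; what it prints is platform-dependent by design:

```python
import struct

def nan_with_payload(payload):
    """Build a quiet double-precision NaN carrying `payload` in its low mantissa bits."""
    bits = 0x7FF8000000000000 | (payload & 0x0007FFFFFFFFFFFF)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def payload_of(x):
    """Read back the low 51 mantissa bits of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0x0007FFFFFFFFFFFF

a = nan_with_payload(42)
b = nan_with_payload(7)

# The "should" is exactly what you can't rely on: many platforms print 42
# here, but nothing requires it, and with two NaN inputs the standard does
# not even say which payload wins.
print(payload_of(a + 1.0))   # often 42, not guaranteed
print(payload_of(a + b))     # often 42 or 7, unspecified which
```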
I work in finance and we have prod Excel spreadsheets. Those spreadsheets are versioned like code artifacts, with automated testing and everything. Converting them to real applications is a major part of the work the technology division does.
They usually come about because some new and exciting line of business is started by a small team as a POC. Those teams don't get full technology backing; it would slow down the early iteration and cost a lot of money for an idea that may not be lucrative. Eventually they make a lot of money, and by then risk controls basically require them to document every single change they make in Excel. That eventually sucks enough that they complain and get a tech team to convert the spreadsheet.
In my experience they are the exception rather than the rule, and many more businesses have sheets that tend further toward Heath Robinson than would be admitted in public.
According to that OpenAI paper, models hallucinate in part because they are optimized against benchmarks that reward guessing. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just creating benchmarks that advantage yourself.
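For what it's worth, "penalizes guessing" can be as simple as a grading rule along these lines (a sketch; the penalty value is made up, not taken from the paper):

```python
def grade(prediction, truth, wrong_penalty=2.0):
    """Score one answer: abstention is worth 0, a wrong guess costs more than a right one gains."""
    if prediction is None:            # model said "I don't know"
        return 0.0
    return 1.0 if prediction == truth else -wrong_penalty

# Under this rule, guessing at low accuracy has negative expected value,
# while abstaining scores 0, so calibrated refusal is rewarded.
print(grade("Paris", "Paris"))   #  1.0
print(grade("Lyon", "Paris"))    # -2.0
print(grade(None, "Paris"))      #  0.0
```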
That is such a cop-out. If there were a really good benchmark for getting rid of hallucinations, it would be included in every eval comparison graph.
The real reason is that every benchmark I've seen shows Anthropic with lower hallucination rates.
The captcha would have to be something really boring and repetitive, like having to translate a word from one of ten languages into English on every click and then make a bullet list of what it means.
I think most of this trial-and-error "You are an experienced engineer" stuff probably hurts model performance. No one ever does comprehensive testing, so eh, yolo.
There are papers showing that models follow instructions less reliably the more instructions they are given. Now think about how many instructions are embedded in that MD, plus the system prompt, plus likely a local AGENTS.md, and in the end there is probably very little here that matters.
OTOH most model APIs are basically identical to each other. You can switch from one to the other using OpenRouter without even altering the code. Furthermore, they aren't reliable (drop rates can be as high as 20%), and compliance "guarantees" are, AFAIK, completely untested. Has anyone used the Copilot compliance guarantees to defend themselves in a copyright infringement suit yet?
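Concretely, "switch from one to the other" is just a model string when you go through OpenRouter's OpenAI-compatible endpoint. A sketch; the model IDs and key below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",    # placeholder
)

for model in ("anthropic/claude-3.5-sonnet", "openai/gpt-4o"):
    resp = client.chat.completions.create(
        model=model,   # the only thing that changes between providers
        messages=[{"role": "user", "content": "Reply with one word."}],
    )
    print(model, resp.choices[0].message.content)
```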
I think you are right that trust and operational certainty justify significant premiums. It would be great if trust and operational certainty were available.