It's totally possible to build entire software products in the fraction of the time it took before.
But, reading the comments here, behavior from one point version to the next (not even major versions, mind you) seems very divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtly different behavioral characteristics without notice. It'll be quicksand.
I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20
This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.
I have a codebase (personal project) and every time there is a new Claude Opus model I get it to do a full code review. Never had any breakages in the last couple of model updates. Worried one day it'll just generate a binary and delete all the code.
I was being facetious. I mean one day models might skip the middleman of code and compilation and take your specs and produce an ultra efficient binary.
Musk was saying that recently but I don't see it being efficient or worthwhile to do this. I could be proven brutally wrong, but code is language; executables aren't. There's also no real reason to bother with this when we have quick-compiling languages.
More realistically, I could see particular languages and frameworks proving to be better designed and more apt for AI code creation. For instance, I was always too lazy to use a strongly-typed language, preferring Ruby for the joy of writing in it (obsessing about types is for a particular kind of nerd that I've never wanted to be). But now with AI, everything's better with strong types in the loop, since reasoning about the code is arguably easier and the compiler provides stronger guarantees about what's happening. Similarly, we could see other linguistic constructs come to the forefront because of what they allow when the cost of implementation drops to zero.
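A minimal sketch of the kind of guarantee meant here, using Python type hints (the `Duration` wrapper and `add_latency` function are invented for illustration, not from any real codebase):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Duration:
    """Milliseconds, wrapped so a type checker can tell it apart from a bare int."""
    ms: int

def add_latency(base: Duration, extra: Duration) -> Duration:
    # Both operands are declared Duration, so an AI assistant (or reviewer)
    # never has to guess about units or argument order here.
    return Duration(base.ms + extra.ms)

total = add_latency(Duration(120), Duration(30))
print(total.ms)  # 150
# A static checker such as mypy would reject add_latency(120, 30) before the
# code ever runs; that pre-execution feedback is the loop that lets a model
# correct its own mistakes cheaply.
```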
You can map tokens to CPU instructions and train a model on that, that's what they do for input images I think.
I think the main limitation of current models here isn't the instruction format itself (assembly can be tokenized like anything else); it's that they're causal: the model would have to generate a binary from start to finish, strictly sequentially.
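The token mapping described above could be as simple as treating each instruction byte as a vocabulary entry, roughly the way vision models turn image patches into tokens. This is a toy sketch, not any real training pipeline:

```python
# Toy byte-level "tokenizer" for machine code: each byte of the encoded
# instruction stream becomes one token id out of a 256-entry vocabulary,
# analogous to patch tokens for images. Real pipelines are far more involved.
code = bytes([0x55, 0x48, 0x89, 0xE5, 0xC3])  # x86-64: push rbp; mov rbp, rsp; ret
tokens = list(code)
print(tokens)  # [85, 72, 137, 229, 195]

detok = bytes(tokens)
assert detok == code  # the mapping is trivially lossless in both directions
```

The causality problem is visible even in this sketch: the model would have to emit byte 0x55 before knowing what the rest of the function looks like, with no compiler pass to fix things up afterwards.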
If we learned anything over the last 50 years of programming, it's that this is hard, and it's why we invented programming languages. Why would it be simpler to generate the machine code directly? Sure, maybe an LLM-to-application path can exist, but my money is on there being a whole toolchain in the middle, and it will probably be the same old toolchain we use today, including an OS, probably Linux.
Isn't it more common that stuff builds on the existing infra instead of a super duper revolution that doesn't use the previous tech stack? It's much easier to add onto rather than start from scratch.
Those CPU instructions still need to be making calls out to things, though. Hallucinated source code will reveal its flaws through linters, compiler errors, test suites. A hallucinated binary will not reveal its flaws until it segfaults.
Programs that pass linters, compilers and test suites can still segfault. A good test harness that tests the binary comprehensively can limit this. The model could be trained on patterns of efficient assembly to use rather than source code.
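A toy illustration of the point above: a plausible-looking generated function that lints and runs cleanly, where only a small test harness surfaces the flaw (the buggy `clamp` is invented for illustration):

```python
def clamp(value, low, high):
    # Plausible-looking generated code: no syntax errors, no imports,
    # nothing for a linter to object to -- but the lower branch is wrong.
    if value > high:
        return high
    if value < low:
        return high  # bug: should return low
    return value

def test_clamp():
    # (input, low, high, expected) cases; returns the cases that fail.
    cases = [(5, 0, 10, 5), (15, 0, 10, 10), (-3, 0, 10, 0)]
    return [(v, lo, hi) for v, lo, hi, want in cases if clamp(v, lo, hi) != want]

print(test_clamp())  # [(-3, 0, 10)] -- the harness catches what the linter can't
```

The same discipline applied at the binary level is much harder, which is the asymmetry the parent comment is pointing at.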
I’ve thought an interesting outcome might be that it’s not even that there’s a binary generated. It’s just user input -> machine code LLM -> CPU. Like the only binary would be the LLM itself and it’s essentially mimicking software live. The paper “Diffusion as a Model of Environment Dream” (DIAMOND) is close to what I’m thinking, where they have a diffusion model generate frames of a game, updating with user input, but there’s no actual “game” code it’s just the model.
Like you’d have a machine code LLM that behaves like software, but instead of a static binary being executed it’s just the LLM itself “executing” on inputs and previous state. I’m horrible at communicating this idea but hopefully the gist is there.
You're going to need to spend crazy compute just compiling code to obtain the training data. And until it's one-shotting absolutely everything, you're going to be asking it what it's doing, and then it'll be "decompiling" its own output. I can't see this being more efficient than compiling the other way around.
I suspect the actual benefit would be more in virtualised interfaces such as Genie 3, skipping the binary step altogether: it's just manipulating pixels, and the pixels change based on the underlying statistical model's output rather than old-school computation.
This may seem obvious, but many people overlook it. The effect is especially clear when using an AI music model. For example, in Suno AI you can remaster an older AI generated track with a newer model. I do this with all my songs whenever a new model is released. It makes it super easy to see the improvements that were made to the models over time.
I keep giving the top Anthropic, Google and OpenAI models problems.
They come up with passable solutions and are good for getting juices flowing and giving you a start on a codebase, but they are far from building "entire software products" unless you really don't care about quality and attention to detail.
Yeah, I keep maintaining a specific app I built with gpt 5.1 codex max using that exact model, because it continues to work for the requests I send it, while attempts with other models, even 5.2 or 5.3 codex, seemed to have odd results. If I were superstitious I would say it's almost like the model that wrote the code likes to work on the code better. Perhaps there's something about the structure it created that it finds easier to understand, though.
I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.
Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).