Hacker News

I'm not so optimistic, as someone who works on agents for businesses and builds tools for them. The leap from the low 90s to 99% is the classic last-mile problem for LLM agents. The more generic and sprawling an agent is (can-do-it-all), the more likely it is to fail and disappoint.

Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. That doesn't mean there isn't a place for agents, but how we view them and their potential impact needs to be separated from those who benefit from the hype.

just my two cents



In general, most of the AI "breakthroughs" of the last decade were backed by proper scientific research and ideas:

- AlphaGo/AlphaZero (MCTS)

- OpenAI Five (PPO)

- GPT 1/2/3 (Transformers)

- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)

- ChatGPT (RLHF)

- SORA (Diffusion Transformers)

"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable


I disagree that there isn't an innovation.

The technology for reasoning models is the ability to do RL on verifiable tasks, with some (as-yet-unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning-fragment proposal machine and a (presumably neural) scoring machine for those reasoning fragments.

The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
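A minimal sketch of that propose-and-score loop, with random stubs standing in for both the (hypothetical) proposal and scoring networks — nothing here reflects any lab's actual implementation:

```python
import random

random.seed(0)

def propose_fragments(chain, k=3):
    # Hypothetical stand-in for the neural proposal model: returns k
    # candidate continuations of the reasoning chain so far.
    return [chain + [f"step{len(chain)}.{i}"] for i in range(k)]

def score_chain(chain):
    # Hypothetical stand-in for the neural scorer (e.g. a process
    # reward model) rating a partial reasoning chain.
    return random.random()

def greedy_reasoning_search(depth=4, k=3):
    # At each step, propose k fragments and keep the best-scoring one.
    chain = []
    for _ in range(depth):
        chain = max(propose_fragments(chain, k), key=score_chain)
    return chain

print(greedy_reasoning_search())
```

Real systems presumably use something richer than greedy selection (beam search, tree search), but the propose/score split is the core shape.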

It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.


Fair enough, I guess, even though the concept of agents/agentic tasks popped up before reasoning models were really a thing.


The idea of chatbots existed before ChatGPT, does that mean it's purely marketing hype?


'Agents' are just a design pattern for applications that leverage recent proper scientific breakthroughs. We now have models that are increasingly capable of reading arbitrary text and outputting valid json/xml. It seems like if we're careful about what text we feed them and what json/xml we ask for, we can get them to string together meaningful workflows and operations.

Obviously, this is working better in some problem spaces than others; it seems to mainly depend on how in-distribution the data domain is to the LLM's training set. Choices about context selection and the API surface exposed in function calls also seem to have a large effect on how well these models can do useful work.
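As a sketch of that design pattern — all names here are hypothetical, and a stub stands in for the model call — a minimal tool-calling loop might look like:

```python
import json

# Hypothetical tool the "agent" may call.
def get_weather(city):
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def fake_llm(context):
    # Stand-in for a real model call: emits a JSON tool invocation,
    # then a final answer once a tool result is in the context.
    if "tool_result" not in context:
        return json.dumps({"tool": "get_weather", "args": {"city": "Paris"}})
    return json.dumps({"final": f"It is {context['tool_result']['temp_c']}C"})

def run_agent(question):
    context = {"question": question}
    for _ in range(5):                       # cap the loop defensively
        msg = json.loads(fake_llm(context))  # model must emit valid JSON
        if "final" in msg:
            return msg["final"]
        result = TOOLS[msg["tool"]](**msg["args"])
        context["tool_result"] = result      # feed result back into context
    return None

print(run_agent("Weather in Paris?"))  # prints "It is 21C"
```

The whole "agent" is the loop: careful choice of what goes into `context` and which functions sit in `TOOLS` is most of the engineering.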


My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.


Agents have been a field in AI since the 1990s.

MDP, Q-learning, TD, RL, PPO are basically all about agents.

What we have today is still very much the same field as it was.
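For illustration, here is the classic tabular Q-learning update from that era on a toy chain MDP (the environment is made up for the example):

```python
import random

random.seed(1)

# Tiny deterministic chain MDP: states 0..4, actions 0 (left) / 1 (right),
# reward 1.0 only on reaching the terminal state 4.
N, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[s][a])
        s2, r, done = step(s, a)
        # Classic Q-learning update: bootstrap off the best next action.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)  # learned policy: always move right toward the goal
```

Same field, same math; what changed is that the "state" is now a context window and the "policy" is a language model.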


Yep. Agents are only powered by clever use of training data, nothing more. There hasn't been a real breakthrough in a long time.


"Long time" as in, 7 months since o1 and reasoning models were released? That was a pretty big breakthrough.


In the context of our conversation and what OP wrote, there has been no breakthrough since around 2018. What you're seeing is the harvesting of all the low-hanging fruit from a tree that was discovered years ago. But the fruit is almost gone. All top models perform at almost the same level. All the "agents" and "reasoning models" are just products of training data.

I wrote more about it here:

https://news.ycombinator.com/item?id=44426993

You may also be interested in this article, that goes into details even more:

https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only


This "all breakthroughs are old" argument is very unsatisfying. It reminds me of when people would describe LLMs as being "just big math functions". It is technically correct, but it misses the point.

AI researchers spent years figuring out how to apply RL to LLMs without degrading their general capabilities. That's the breakthrough. Not the existence of RL, but making it work for LLMs specifically. Saying "it's just RL, we've known about that for ages" does not acknowledge the work that went into this.

Similarly, using the fact that new breakthroughs look like old research ideas is not particularly good evidence that we are going to head into a winter. First, what are the limits of RL, really? Will we just get models that are highly performant at narrow tasks? Or will the skills we train LLMs for generalise? What's the limit? This is still an open question. RL for narrow domains like Chess yielded superhuman results, and I am interested to see how far we will get with it for LLMs.

This also ignores active research that has been yielding great results, such as AlphaEvolve. This isn't a new idea either, but does that really matter? They figured out how to apply evolutionary algorithms with LLMs to improve code. So, there's another idea to add to your list of old ideas. What's to say there aren't more old ideas that will pop up when people figure out how to apply them?

Maybe we will add a search layer with MCTS on top of LLMs to allow progress on really large math problems by breaking them down into a graph of sub-problems. That wouldn't be a new idea either. Or we'll figure out how to train better reranking algorithms to sort our training data, to get better performance. That wouldn't be new either! Or we'll just develop more and better tools for LLMs to call. There's going to be a limit at some point, but I am not convinced by your argument that we have reached peak LLM.
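A toy sketch of what such a search layer could look like, using UCB1 (the bandit rule at the heart of MCTS) to allocate rollouts among hypothetical sub-problem decompositions; a random stub stands in for the LLM evaluator, and the candidate names are invented:

```python
import math
import random

random.seed(2)

# Candidate decompositions of a (hypothetical) math problem.
candidates = ["split-by-lemma", "induction", "case-analysis"]
true_value = {"split-by-lemma": 0.3, "induction": 0.8, "case-analysis": 0.5}

counts = {c: 0 for c in candidates}
totals = {c: 0.0 for c in candidates}

def ucb(c, t):
    # UCB1: mean value plus an exploration bonus for rarely tried arms.
    if counts[c] == 0:
        return float("inf")  # try every candidate at least once
    mean = totals[c] / counts[c]
    return mean + math.sqrt(2 * math.log(t) / counts[c])

for t in range(1, 301):
    c = max(candidates, key=lambda c: ucb(c, t))
    # Stub "rollout": noisy reward around the decomposition's true value;
    # a real system would score an LLM attempt at the sub-problems.
    reward = min(1.0, max(0.0, random.gauss(true_value[c], 0.1)))
    counts[c] += 1
    totals[c] += reward

best = max(candidates, key=lambda c: totals[c] / counts[c])
print(best)  # should settle on "induction", the highest-value arm
```

Whether this transfers from toy bandits to open-ended math is exactly the open question.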


I understand your argument. The recipe that finally let RLHF + SFT work without strip-mining base knowledge was real R&D, and GPT-4-class models wouldn't feel so "chatty but competent" without it. I just still see ceiling effects that make the whole effort look more like climbing a very tall tree than building a Saturn V.

GPT-4.1 is marketed as a "major improvement," but under the hood it's still the KL-regularised PPO loop OpenAI first stabilized in 2022, only with a longer context window and a lot more GPUs for reward-model inference.

They retired GPT-4.5 after five months and told developers to fall back to 4.1. The public story is "cost to serve," not "breakthroughs left on the table." When you sunset your latest flagship because the economics don't close, that's not a moon-shot trajectory; it's weight-shaving on a treehouse.

Stanford's 2025 AI Index shows that model-to-model spreads on MMLU, HumanEval, and GSM8K have collapsed to low single digits; performance curves are flattening exactly where compute curves are exploding. A fresh MIT CSAIL paper modelling "Bayes slowdown" makes the same point mathematically: every extra order of magnitude of FLOPs is buying less accuracy than the one before.[1]

A survey published last week[2] catalogs the 2025 state of RLHF/RLAIF: reward hacking, preference-data scarcity, and training instability remain open problems, just mitigated by ever-heavier regularisation and bigger human-in-the-loop funnels. If our alignment patch still needs a small army of labelers and a KL muzzle to keep the model from self-lobotomising, calling it "solved" feels optimistic.

Scale, fancy sampling tricks, and patched-up RL got us to the leafy top: chatbots that can code and debate decently. But the same reports above show the branches bending under compute cost, data saturation, and alignment tax. Until we swap out the propulsion system for new architectures, richer memory, or learning paradigms that add information instead of reweighting it, we're in danger of planting a flag on a treetop and mistaking it for Mare Tranquillitatis.

Happy to climb higher together, friend, but I'm still packing a parachute, not a space suit.

1. https://arxiv.org/html/2507.07931v1

2. https://arxiv.org/html/2507.04136v1


But that's how progress works! To me it makes sense that LLMs first manage to do 80% of the task, then 90%, then 95%, then 98%, then 99%, then 99.5%, and so on. The last part IS the hardest, and each iteration of LLMs will get a bit further.

Just because they haven't reached 100% yet doesn't mean that LLMs as a whole are doomed. On the contrary, the fact that they are slowly approaching 100% shows that there IS a future for LLMs, and that they still have the potential to change things fundamentally, more so than they already have.


But they don’t do 80% of the task. They do 100% of the task, but 20% is wrong (and you don’t know which 20% without manually verifying all of it).

So they are really great for tasks where doing the work is a lot harder than verifying it, and mostly useless for tasks where doing the work and verifying it are similarly difficult.


Right — and I'd conjecture until LLMs get close to the accuracy of an entry-level employee, they may not have enough economic value to be viable beyond the hype/novelty phase. Why? Because companies already chose a "minimum quality to be valuable" bar when they set the bar for their most junior entry level. They could get lower-quality work for cheaper by just carving out an even lower-bar hiring tier. If they haven't, maybe it's because work below that quality level is just not a net-positive contribution at all.


I would go so far as to say that the reason people feel LLMs have stagnated is precisely because they seem to progress only a few percentage points between iterations, despite the fact that those points are the hardest.


The people who feel that LLMs have stagnated are similar to the ones who feel like LLMs are not useful.


> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.

Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.


>> many are optimizing happy paths in their demos and hiding the true reality

Yep. This is literally what every AI company does nowadays.


>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.

To your point: the most impressive AI tool I have used to date (not an LLM, but bear with me), and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that, before it existed, I would have thrown out or, if the client was lucky, would have charged thousands of dollars and spent weeks repairing to get it half as good as what that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre Zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.

Warning: don't feed it music lol, it tries to turn the sounds into words. That being said, you can get some wild effects when you do!


Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.


I mostly agree with this. The goal for AI companies is not to reach 99% or 100% of human level; it's >100% (doing tasks better than an average human could, or eventually an expert).

But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.


yep, it's the same problem as with outsourcing: getting the 90% "done" is easy, the last 10% is hard and depends completely on how the "90%" was achieved


Seen this happen many times with current agent implementations. With RL (and provided you have enough use-case data) you can get to high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is brittle and not the most reliable mechanism. Teaching a model specific tasks helps negate those issues and overall results in a better automation outcome, without devs having to put in so much effort to go from 90% to 99%. Another way to do it is parallel generation, then identifying at runtime which output seems most correct (majority voting or LLM-as-a-judge).
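The majority-voting idea is simple to sketch; here the sampled answers are hard-coded for illustration (an LLM-as-judge variant would replace the counter with a judge call over the candidates):

```python
from collections import Counter

def majority_vote(answers):
    # Self-consistency: keep the most common final answer across
    # parallel generations for the same prompt.
    return Counter(answers).most_common(1)[0][0]

# e.g. five parallel generations, three of which agree:
samples = ["42", "41", "42", "42", "43"]
print(majority_vote(samples))  # prints "42"
```

This works because independent sampling errors rarely agree on the same wrong answer, while correct reasoning paths tend to converge.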

I agree with you on the hype part. Unfortunately, that is the reality of current Silicon Valley. Hype gets you noticed and gets you users. Hype propels companies forward, so it is here to stay.



