> They're also not going to be able to direct three different agents at once in different areas of a large project that they've designed the architecture for.
I wonder what the practical limits are.
As a senior dev on a greenfield solo project it's too exhausting for me to have two parallel agents (front/back), most of the time they're waiting for me to spec, review or do acceptance test. Feels like sprinting, not something I could do day in and day out.
Might be due to tasks being too fine grained, but assuming larger ones are proportionally longer to spec and review, I don't see more than two (or, okay, three, maybe I'm just slow) being a realistic scenario.
More than that, I think we're firmly in the vibe coding (or maybe spec-driven vibe coding) territory.
At least on a team, the limit is the team's time to review all the code. We've also found that vibe engineered (or "supervised vibing" as I call it) code tends to have more issues in code review, because a false sense of security creates blind spots during self-review. Even more burden on the team.
We're experimenting with code review prompts and sub agents. Seems local reviews are best, so the bulk of the burden is on the vibing engineer, rather than the team.
Do you have a sense for how much overhead this is all adding? Or, to put it another way, what I’m really asking is what productivity gain (or loss) are you seeing versus traditional engineering?
In our experience, it depends on the task and the language. In the case of trivial or boilerplate code, even if someone pushes 3k-4k lines of code in one day, it's manageable because you can just go through it. However, 3k lines of interconnected modules, complex interactions, and intricate logic require a lot of brainpower and time to review properly, and in most cases there are multiple bugs, unconsidered edge cases, and other issues scattered throughout the code.
And empirical studies on informal code review show that human reviewers have only a small impact on error rates, and even that small impact disappears once they read more than roughly 200 SLOC per hour.
Interesting, do you have a link to the study? Our experience is different, at least when reviewing LLM generated code, we find quite a few errors, especially beyond 200 LOC. It also depends on what you're reviewing, 200 LOC != 200 LOC. A boilerplate 200 LOC change? A security sensitive 200 LOC change? A purely algorithmic and complex 200 LOC change?
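To make the numbers in this sub-thread concrete, here's the back-of-the-envelope arithmetic (the figures are the ones quoted above, used purely for illustration):

```python
# Rough review-throughput arithmetic, using the numbers from this thread.
careful_review_rate = 200   # SLOC per hour, the ceiling cited above
pushed_lines = 3000         # low end of a "3k-4k lines in one day" push

# Hours of focused review implied by one day's output.
hours_needed = pushed_lines / careful_review_rate
print(hours_needed)  # 15.0
```

That's roughly two full working days of careful review for a single day of generated code, which is the mismatch being described.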
Isn't the current state of things such that it's really hard to tell? I think the METR study showed that self-reported productivity boosts aren't necessarily reliable.
I have been messing with vibe engineering on a solo project and I have such a hard time telling if there's an improvement. It's this feeling of "what's faster, one lead engineer coding or one lead engineer guiding 3 energetic but naive interns"?
The problem with this is that software engineering is a very unorganized and fashion/emotion driven domain.
We don't have reliable productivity numbers for basically... anything.
I <feel> that I'm more productive with statically typed languages but I haven't seen large scale, reliable studies. Same with unit tests, integration tests, etc.
And then there are all the types of software engineering: web frontend, web API, mobile frontend, command line frontend, Windows GUI, MacOS GUI, Linux backend (10 million different stacks), Windows backend (1 million different stacks), throwaway projects, WordPress webpages, etc, etc.
A controlled experiment done with a representative sample would be lovely. In the long run it comes down to the incremental financial impact of LLMs.
In the short run, from what I see, firms are trying to play up the operational efficiency gains they have achieved. That signals promise to investors in the stock market, who translate it into expectations about the future, which are then reflected in the present value of equity.
But in reality it seems to be reducing head-count because they over-hired before the hype and furore of LLMs.
> In the short-run, from what I see, firms are trying to play up the operational efficiency gains they have achieved.
The thing is all of this is getting priced in, and will be table stakes for any business, so I don't see it as a big factor in future success.
As I've said here, on LinkedIn, and in a few other places, the businesses that will succeed with AI will be those who can use it to add/create value. They will outcompete and out-succeed businesses that can't move beyond cost cutting with AI[0].
[0] Which might not last forever anyway. Granted there are a decent number of players in the market, thankfully, but this wouldn't be the first time tech companies had hooked large numbers of individuals and businesses on a service and then jacked up the prices once they'd captured enough of the market. It's still very much in the SV and PE playbook. SolarWinds is a recent example of the latter.
I wanted to point you at https://neverworkintheory.org/ which attempted to bridge the gap between academia and software engineering. Turns out the site shut down, because (quoting their retrospective)
> Twelve years after It Will Never Work in Theory launched, the real challenge in software engineering research is not what to do about ChatGPT or whatever else Silicon Valley is gushing about at the moment. Rather, it is how to get researchers to focus on problems that practitioners care about and practitioners to pay attention to what researchers discover. This was true when we started, it was true 10 years ago, and it remains true today.
The entire retrospective [1] is well worth a read, and it unfortunately reinforces your exact point about software development being fashion/emotion driven.
The other problem is the perennial, how much of what we do actually has value?
Churning out 5x (or whatever - I’m deliberately being a bit hyperbolic) as much code sounds great on the face of it but what does it matter if little to none of it is actually valuable?
You correctly identify that software development is often driven by fashion and emotion but the much much bigger problem is that product and portfolio management is driven by fashion and emotion. How much stuff is built based on the whims of CEOs or other senior stakeholders without any real evidence to back it up?
I suppose the big advantage of being more “productive” is that you can churn through more wrong ideas more quickly and thus perhaps improve your chances of stumbling across something that is valuable.
But, of course, as I’ve just said: if that’s to work it’s absolutely predicated on real (and very substantial) productivity gains.
Perhaps I’m thinking about this wrong though: it’s not about production where standards, and the need to be vigilant, are naturally high, but really the gains should be seen mostly in terms of prototyping and validating multiple/many solutions and ideas.
"I suppose the big advantage of being more “productive” is that you can churn through more wrong ideas more quickly and thus perhaps improve your chances of stumbling across something that is valuable."
But I think there is a very big danger here - you build in the action but completely neglect the deep thinking behind a vision, strategy etc.
So yes you produce more stuff. But that stuff means more money spent - which is generally a sunk cost too.
In a bizarre way, I predict we will see the failure rate of software firms rise. Despite the fact these 'productivity' tools exist.
Yeah, I mean, you might be right. As others have commented, I think it's simply very hard to say what gains we're really going to see from AI-assisted software development at present.
And then of course there's the question of how many businesses have their key value proposition rendered obsolete, and to what extent it's rendered obsolete, by AI: doesn't have to be completely nullified for them to fail (which obviously applies to some software companies, but goes far beyond that sector).
The exhaustion point resonates — actually, the context switching fatigue is why we built Sculptor for ourselves (https://imbue.com/sculptor). We usually see devs running 4-6 agents in parallel in Sculptor today. Personally I think much of the fatigue comes from:
1) friction in spawning agents
2) friction in reviewing agent changes
3) context management annoyance when e.g. you start debugging part of the agent's work but then have to reload context to continue the original task
It's still super early, but we've felt a lot less fatigued using Sculptor so far. To make it easier to spawn agents without worrying, we run agents in containers so they can run in YOLO mode and don't interfere with each other. To make it easy to review changes, we made "Pairing Mode", which lets you instantly sync any agent's work from the container into your local IDE to test it, then switch to another.
For context management, we just shipped the ability to fork agents from any point in the convo history, so you can reuse an agent that you loaded with high-quality context and fork off to debug an agent's changes or try all the options it presented. It also lets you keep a few explorations going and check in when you have time.
Anyway, sorry, shilling the product a bit much but I just wanted to say that we've seen people successfully use more than 2 agents without feeling exhausted!
Switching between the two parallel agents (frontend & backend, same project) constantly requires context switches.
I'm speccing out the task in detail for one agent, then reviewing code for the previous task on the other agent and testing the implementation, then speccing the next part for that one (or asking for fixes/tweaks), then back to the first agent.
They're way faster in producing code than I am in reviewing and spelling out in details what I want, meaning I always have the other one ready.
When doing everything myself, there are periods where I need to think hard and periods where it's pretty straightforward and easy (typing out the stuff I envisioned, boilerplate, etc).
With two agents, I constantly need to be on full alert and totally focused (but switching contexts every few minutes), which is way more tiring for me.
With just one agent, the pauses in the workflow (while I'm waiting for it to finish) are long enough to get distracted but too short to do anything else (mostly).
Still figuring out the sweet spot for me personally.
I've been meaning to try out some text-to-speech to see if that makes it a bit easier. Part of the difficulty of "spelling out in detail what I want" is the need for precise written language, which is high cognitive load, which makes the context switching difficult.
Been wondering if natural speech could both speed up input and lower the cognitive load. Maybe have an embedded transform/compaction step that strips out all the ummms and gets to the point of what you were trying to say. That might make the context switching easier.
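A minimal sketch of that compaction step, assuming a plain regex pass over the raw transcript (the filler list and function name are illustrative, not any product's actual behavior — real dictation tools likely do something smarter):

```python
import re

# Common spoken fillers to drop from a raw transcript (illustrative list).
# Also consumes a comma on either side of the filler so punctuation
# doesn't pile up where words were removed.
FILLER_RE = re.compile(r",?\s*\b(um+|uh+|er+|ah+|you know|i mean)\b,?",
                       re.IGNORECASE)

def compact_transcript(raw: str) -> str:
    """Strip filler words from dictated text and tidy up the whitespace."""
    cleaned = FILLER_RE.sub("", raw)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(compact_transcript(
    "Um, add a retry to the, uh, fetch call, you know, with backoff"))
# → add a retry to the fetch call with backoff
```

An LLM pass would handle this more gracefully (including "getting to the point" of rambling speech), but even a dumb filter like this cuts a lot of the noise.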
This works really well already. You can fire up something like Wispr Flow and dump what you're saying directly into Claude Code or similar, it will ignore the ums and usually figure out what you mean.
I use ChatGPT voice mode in their iPhone app for this. I walk the dog for an hour and have a loose conversation with ChatGPT through my AirPods, then at the end I tell it to turn everything we discussed into a spec I can paste into Claude Code.