Those are some high profile (celebrity) developers.
I wonder if they have measured their results? I believe that the perceived speed-up of AI coding is often different from reality. The following paper backs this idea: https://arxiv.org/abs/2507.09089 . Can you provide data that contradicts this view, based on these (celebrity) developers or otherwise?
Almost off-topic, but this got me curious: how can I measure this myself? Say I want to put concrete numbers on this and actually measure it, how should I approach it?
My naive approach would be to just implement it twice, once together with an LLM and once without, but that has obvious flaws, the most obvious being that the order in which you do them impacts the results too much.
So how would I actually go about it and be able to provide data for this?
> My naive approach would be to just implement it twice, once together with an LLM and once without, but that has obvious flaws, the most obvious being that the order in which you do them impacts the results too much.
You'd get a set of 10-15 projects, and a set of 10-15 developers. Then each developer would implement the solution with LLM assistance and without such assistance. You'd ensure that half the developers did LLM first, and the others traditional first.
You'd only be able to detect large statistical effects, but that would be a good start.
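If it helps, here's a rough sketch of what that counterbalanced assignment could look like. The developer/project names and counts are just placeholders for whatever you can actually recruit:

    import random

    # Placeholder names/sizes; swap in whatever you can actually recruit.
    developers = [f"dev_{i}" for i in range(12)]
    projects = [f"project_{i}" for i in range(12)]

    random.shuffle(developers)
    half = len(developers) // 2

    # Counterbalance order: the first half do the LLM condition first,
    # the second half do the traditional condition first.
    assignments = []
    for i, dev in enumerate(developers):
        order = ["llm", "traditional"] if i < half else ["traditional", "llm"]
        # Each developer draws two distinct projects, one per condition.
        first_project, second_project = random.sample(projects, 2)
        assignments.append((dev, order[0], first_project))
        assignments.append((dev, order[1], second_project))

    for dev, condition, project in assignments:
        print(f"{dev}: {condition} on {project}")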
If it's just you, then generate a list of potential projects, flip a coin for each as to whether or not to use the LLM, and record how long it takes along with a bunch of other metrics that make sense to you.
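The record-keeping part can be as simple as this (a sketch only; the CSV filename and the extra metrics like bug count and lines changed are placeholders for whatever you actually care about):

    import csv
    import random
    import time
    from datetime import date

    LOG_FILE = "llm_trials.csv"  # placeholder filename

    def start_trial(project: str) -> dict:
        """Flip the coin for the condition and start the clock."""
        condition = random.choice(["llm", "no_llm"])
        print(f"Condition for {project}: {condition}")
        return {"project": project, "condition": condition, "start": time.time()}

    def finish_trial(trial: dict, bugs_found: int, lines_changed: int) -> None:
        """Log elapsed hours plus whatever other metrics you decided on."""
        hours = (time.time() - trial["start"]) / 3600
        with open(LOG_FILE, "a", newline="") as f:
            csv.writer(f).writerow([date.today(), trial["project"], trial["condition"],
                                    round(hours, 2), bugs_found, lines_changed])

    # Usage: t = start_trial("todo-app"); ...do the work...; finish_trial(t, 2, 450)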
Which seems to indicate that there would be a suitable way for a single individual to be able to measure this by themselves, which is why I asked.
What you're talking about is a study and beyond the scope of a single person, and also doesn't give me the information I'd need about myself.
> If it's just you, then generate a list of potential projects, flip a coin for each as to whether or not to use the LLM, and record how long it takes along with a bunch of other metrics that make sense to you.
That sounds like I can just go by "yeah, feels like I'm faster", which I thought was exactly what the parent wanted to avoid...
> That sounds like I can just go by "yeah, feels like I'm faster", which I thought was exactly what the parent wanted to avoid...
No it doesn't, but perhaps I assumed too much context. Like, you probably want to look up the Quantified Self movement, as they do lots of social-science-like research on themselves.
> Which seems to indicate that there would be a suitable way for a single individual to be able to measure this by themselves, which is why I asked.
I honestly think pick a metric you care about and then flip a coin to use an LLM or not is the best you're gonna get within the constraints.
> Like, you probably want to look up the Quantified Self movement, as they do lots of social-science-like research on themselves.
I guess I was looking for something a bit more concrete, that one could apply themselves, which would answer the "if they have measured their results? [...] Can you provide data that contradicts this view" part of the parent's comment.
> then flip a coin to use an LLM or not is the best you're gonna get within the constraints.
Do you think trashb, who asked the initial question above, would take the results of such an evaluation and say "Yeah, that's good enough and answers my question"?
> I guess I was looking for something a bit more concrete, that one could apply themselves, which would answer the "if they have measured their results? [...] Can you provide data that contradicts this view" part of the parent's comment.
This stuff is really, really hard. Social science is very difficult as there's a lot of variance in human ability/responses. Added to that is the variance surrounding setup and tool usage (claude code vs aider vs gemini vs codex etc).
Like, there's a good reason why social scientists try to use larger samples from a population, and get very nerdy with stratification et al. This stuff is difficult otherwise.
The gold standard (rather like the METR study) is multiple people with random assignment to tasks with a large enough sample of people/tasks that lots of the random variance gets averaged out.
At the one-person sample level, it's almost impossible to get results as good as this. You can eliminate the person-level variance (because it's just one person), but I think you'd need maybe 100 trials/tasks to get a good estimate.
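To put a rough number on that: a standard back-of-envelope power calculation (sketch below; it assumes you have statsmodels installed and that a plain two-sample t-test is the analysis) gives roughly 64 trials per condition for a "medium" effect at 80% power, so ~100 trials total only reliably catches medium-to-large effects.

    # Back-of-envelope power calculation: trials per condition needed to detect
    # a given standardised effect with a two-sample t-test.
    # (Assumes statsmodels is installed; effect sizes are Cohen's d.)
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for d in (0.2, 0.5, 0.8):  # conventional small / medium / large effects
        n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
        print(f"d = {d}: ~{n:.0f} trials per condition")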
Personally, that sounds really implausible, and even if you did accomplish this, I'd be sceptical of the results as one would expect a learning effect (getting better at both using LLM tools and side projects in general).
The simple answer here (to your original question) is no, you probably can't measure this yourself as you won't have enough data or enough controls around the collection of this data to make accurate estimates.
To get anywhere near a good estimate you'd need multiple developers and multiple tasks (and a set of people to rate the tasks such that the average difficulty remains constant).
Actually, I take that back. If you work somewhere with lots and lots of non-leetcode interview questions (take homes etc) you could probably do the study I suggested internally. If you were really interested in how this works for professional development, then you could randomise at the level of interviewee and track those that made it through and compare to output/reviews approx 1 year later.
But no, there's no quick and easy way to do this because the variance is way too high.
> Do you think trashb, who asked the initial question above, would take the results of such an evaluation and say "Yeah, that's good enough and answers my question"?
I actually think trashb would have been OK with my original study, but obviously that's just my opinion.
To wrap this up, what I was trying to say is that the feeling of being faster may not align with reality. Even for people who have a good understanding of the matter it may be difficult to estimate. So I would say be skeptical of claims like this and try to somehow quantify it in a way that matters for the tasks you do. This is something managers of software projects have been trying to tackle for a while now.
There is no exact measurement in this case, but you could get an idea by testing certain types of implementations: for example, whether you are finishing similar tasks on average 25% faster over a longer testing period with AI than without. Just the act of timing yourself doing tasks with or without AI may already give a crude indication of the difference.
You could also run a trial implementing coding tasks like LeetCode problems; however, you will introduce some bias from having done them previously, and the tasks may not align with your daily activities.
A trial with multiple developers working on the same task pool with or without AI could lead to more substantial results, but you won't be able to do that by yourself.
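To make that "crude indication" concrete, the analysis can be as simple as comparing average completion times; the numbers below are made up purely for illustration:

    import statistics

    # Made-up timings in hours for comparable tasks; replace with your own log.
    with_ai = [2.0, 3.5, 1.0, 4.0, 2.5]
    without_ai = [3.0, 4.0, 1.5, 5.5, 3.0]

    mean_with = statistics.mean(with_ai)
    mean_without = statistics.mean(without_ai)
    speedup = (mean_without - mean_with) / mean_without

    print(f"Average with AI:    {mean_with:.2f} h")
    print(f"Average without AI: {mean_without:.2f} h")
    print(f"Crude speedup estimate: {speedup:.0%}")

With enough logged tasks you could also put an interval around that estimate, but as said above, the variance will be large.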
So there seems to be a shared understanding of how difficult "measure your results" would be in this case, so could we also agree that asking someone:
> I wonder if they have measured their results? [...] Can you provide data that contradicts this view, based on these (celebrity) developers or otherwise?
isn't really fair? Because not even you or I really know how to do so in a fair and reasonable manner, unless we start to involve trials with multiple developers and so on.
> isn't really fair? Because not even you or I really know how to do so in a fair and reasonable manner, unless we start to involve trials with multiple developers and so on.
I think in a small conversation like this, it's probably not entirely fair.
However, we're hearing similar things from much larger organisations who definitely have the resources to do studies like this, and yet there's very little decent work available.
In fact, lots of the time they are deliberately misleading people ("25% of our code is generated by AI", where the AI is Copilot or other autocomplete). Like, that 25% stat was probably true historically with JetBrains products and any form of code generation (for protobufs and the like), so it's wildly deceptive.
This is a notoriously difficult thing to measure in a study. More relevantly though, IMO, it's not a small effect that might be difficult to notice - it's a huge, huge speedup.
How many developers have measured whether they are faster when programming in Python vs assembly? I doubt many have. And I doubt many have chosen Python over assembly because of any study that backs it up. But it's also not exactly a subtle difference - I'm fairly sure 99% of people will say that, in practice, it's obvious that Python is faster for programming than assembly.
I talked literally yesterday to a colleague who's a great senior dev, and he made a demo in an hour and a half that he says would've taken him two weeks to do without AI. This isn't a subtle, hard to measure difference. Of course this is in an area where AI coding shines (a new codebase for demo purposes) - but can we at least agree that in some things AI is clearly an order of magnitude speedup?