Terence Tao claims [0] that contributions by the public are counter-productive, since the energy required to check a contribution outweighs its benefit:
> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking ...
Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.
It depends a lot on the ratios here. There's a sharp flip between "interesting but useless" and "useful" once the tradeoff tips.
How fast can you check the contribution? How small a part is it? An unsolicited contribution is different from one you directed yourself. Do you need to reply? How fast are follow-ups? Multi-day back-and-forths are a pain; a fast, directed chat is different. And you don't have to worry about being rude to an LLM.
Then it comes down to how smart a frontier model is versus the people who write to mathematicians. The latter group contains both smart, helpful people and cranks.
Unlike the general public, the models can be trained. I mean, if you train a member of the general public, you've got a specialist, who is no longer a member of the general public.
Unlike the general public, though, these models have advanced dementia when it comes to learning from corrections, even within a single session. They keep regressing, and I haven't found a way to stop that yet.
What boggles the mind: we have striven for correctness for so long, and suddenly being right 70% of the time and wrong the remaining 30% is fine. The parallel with self-driving is pretty strong here: solving 70% of the cases is easy; the remaining 30% are hard, or maybe even impossible. Statistically speaking, these models do better than most humans, most of the time. But they do not do better than all humans, they can't do it all of the time, and when they get it wrong they make such tremendously basic mistakes that you have to wonder how they manage to get things right.
Maybe it's true that with ever-increasing model size and more and more data (proprietary data; the public sources are exhausted by now, so private data is the frontier where model owners can still gain an edge) we will reach a point where the models will be right 98% of the time or more, but the killer feature for me is an indication of the confidence level of the output. Because no matter whether it's junk or pearls, it all looks the same, and that is more dangerous than having nothing at all.
A common resistor has a +/- 10% tolerance; a milspec one is 1%. Yet we have ways of building robust systems out of such "subpar" components. The trick is to structure the system so that the error rate is built into the process and corrected for. Easier said than done, of course, for a lot of problems, but we do have techniques for this and we are learning more.
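To make the analogy concrete, here is a rough Python sketch (my own illustration; `unreliable_answer` is a made-up stand-in for any noisy component): if errors are roughly independent, sampling a ~70%-reliable component several times and taking a majority vote gives a much more reliable aggregate, the same way redundancy compensates for loose-tolerance parts.

```python
import random
from collections import Counter

def unreliable_answer(correct="42", p_correct=0.7):
    """Hypothetical noisy component: returns the right answer ~70% of the time."""
    if random.random() < p_correct:
        return correct
    return random.choice(["41", "43", "7"])  # assorted wrong answers

def majority_vote(n_samples=5):
    """Ask the noisy component several times and keep the most common answer."""
    votes = Counter(unreliable_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Compare a single call against a 5-way vote over many trials.
trials = 10_000
single = sum(unreliable_answer() == "42" for _ in range(trials)) / trials
voted = sum(majority_vote() == "42" for _ in range(trials)) / trials
print(f"single call: {single:.1%}, 5-way vote: {voted:.1%}")
```

The catch is the "easier said than done" part: this only helps when the errors are roughly independent, and a model that makes the same mistake every time it is asked defeats the vote.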
I think the real killer feature would be that they stop making basic mistakes, and that they gain some introspection. It's not a problem if they're wrong 30% of the time if they're able to gauge their own confidence like a human would. Then you can know to disregard the answer, or check it more thoroughly.
> It's not a problem if they're wrong 30% of the time if they're able to gauge their own confidence like a human would.
This is a case where I would not use human performance as the standard to beat. Training people to be both intellectually honest and statistically calibrated is really hard.
Perhaps, but an AI that can only answer like a precocious child, one who has spent years reading encyclopedias but hasn't learned to detect when it's thinking poorly or not remembering clearly, is much less useful.
> the killer feature for me is an indication of the confidence level of the output.
I don't think I did anything special to ChatGPT to get it to do this, but it's started reporting confidence levels to me, e.g. from my most recent chat:
> In China: you could find BEVs that cost same or even less than ICE equivalents in that size band. (Confidence ~0.70)
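Whether numbers like that mean anything is an empirical question: a stated ~0.70 is only useful if roughly 70% of the claims tagged ~0.70 actually check out. A minimal sketch of that check, assuming you have logged (stated confidence, verified correct) pairs yourself; the data below is made up for illustration.

```python
from collections import defaultdict

# Hypothetical log of (stated confidence, did the claim check out) pairs,
# collected by fact-checking a sample of answers yourself.
observations = [
    (0.7, True), (0.7, False), (0.7, True),
    (0.9, True), (0.9, True),
    (0.5, False),
]

def calibration_report(obs, bin_width=0.1):
    """Group claims by stated confidence and compare against the observed hit rate."""
    bins = defaultdict(list)
    for conf, correct in obs:
        bins[round(conf / bin_width) * bin_width].append(correct)
    for conf in sorted(bins):
        hits = bins[conf]
        print(f"stated ~{conf:.1f}: {sum(hits)}/{len(hits)} correct "
              f"({sum(hits) / len(hits):.0%} observed)")

calibration_report(observations)
```

A well-calibrated model would show observed hit rates tracking the stated numbers; big gaps would mean the confidence labels are decoration.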
I would counter that any computationally correct code that accelerates any existing research code base is a net positive. I don't care how that is achieved as long as it doesn't sacrifice accuracy and precision.
We're not exactly swimming in power generation, and efficient code uses less power.
It's perhaps practical, though, to ask it to do a lot of verification and demonstration of correctness in Lean or another proof environment, both to get its error rate down and to speed up the review of its results. After all, its time is close to "free."
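As a toy illustration of why that helps (my own example, not from the talk): a contribution delivered as a Lean proof is cheap to check, because the proof checker does the error-checking that otherwise eats the expert's time.

```lean
-- A toy Lean 4 example: if this file compiles, the statement is verified,
-- so the reviewer's burden shrinks to reading the statement itself.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Using the verified lemma; the proof checker, not the reviewer, guarantees it.
example : 2 + 3 = 3 + 2 := add_comm_example 2 3
```

The expensive part then moves from checking the argument to checking that the formal statement actually says what the contributor claims it says.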
[0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s