
I do a lot of human evaluations. There are lots of Bayesian / statistical models that can infer rater quality without ground truth labels. The other thing about preference data you have to worry about (which this article gets at) is: preferences of _whom_? Human raters are a heavily biased sample of the population; different ages, genders, religions, cultures, etc. all inform preferences. Lots of work is being done to model and leverage this.
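A minimal sketch of the "infer rater quality without ground truth" idea, in the spirit of Dawid–Skene-style EM (all data below is synthetic, and the accuracies are made-up assumptions for illustration):

```python
import numpy as np

# Toy EM: alternate between estimating hidden item labels and per-rater
# accuracies, starting from majority vote. No ground truth is used.
rng = np.random.default_rng(0)
n_items, n_raters = 200, 5
true_labels = rng.integers(0, 2, n_items)          # hidden; only for simulation
accuracy = np.array([0.95, 0.9, 0.85, 0.6, 0.55])  # raters 3-4 are near-random
flip = rng.random((n_items, n_raters)) > accuracy  # where each rater errs
votes = np.where(flip, 1 - true_labels[:, None], true_labels[:, None])

p = votes.mean(axis=1)  # initial P(label = 1) per item = majority share
for _ in range(50):
    # M-step: a rater's accuracy = expected agreement with the soft labels.
    acc = (p[:, None] * votes + (1 - p[:, None]) * (1 - votes)).mean(axis=0)
    acc = np.clip(acc, 1e-3, 1 - 1e-3)
    # E-step: re-score each item by weighted log-odds of the raters' votes.
    log_odds = (votes * np.log(acc / (1 - acc))
                + (1 - votes) * np.log((1 - acc) / acc)).sum(axis=1)
    p = 1 / (1 + np.exp(-log_odds))

est_labels = (p > 0.5).astype(int)
print("estimated rater accuracies:", np.round(acc, 2))
print("label recovery rate:", (est_labels == true_labels).mean())
```

The point is that rater reliability falls out of inter-rater agreement alone: the unreliable raters get down-weighted even though the procedure never sees `true_labels`.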

Then for LMArena there is a host of other biases / construct-validity problems: people are easily fooled, even PhD experts; in many cases it’s easier for a model to learn how to persuade than to actually learn the right answers.

But there are a lot of dismissive comments here, as if frontier labs don’t know this; they have some of the best talent in the world. They aren’t perfect, but by and large they know what they’re doing and what the tradeoffs of the various approaches are.

Human annotations are an absolute nightmare for quality, which is why coding agents are so nice: their outputs are verifiable, so you can train them in a way closer to e.g. AlphaGo, without the ceiling of human performance.



> in many cases it’s easier for a model to learn how to persuade than actually learn the right answers

So we should expect the models to eventually tend toward the same behaviors that politicians exhibit?


Maybe a happy-to-deceive marketing/sales role would be more accurate.


100% (am a Bayesian statistician).

Isn’t it fascinating how it comes down to quality of judgement (and the descriptions thereof)?

We need an LMArena rated by experts.


As a statistician, do you think you could, given access to the data, identify the subset of LMArena users that are experts?


Yes, for sure! I can think of a few ways.
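One such way, sketched on synthetic data (the user counts, reliabilities, and "battle" structure here are all invented assumptions, not anything from LMArena): score each voter by how well their individual votes predict the leave-one-out crowd consensus.

```python
import numpy as np

# Hypothetical expert-detection sketch on synthetic pairwise-vote data.
rng = np.random.default_rng(1)
n_pairs, n_users = 300, 40
truth = rng.integers(0, 2, n_pairs)  # latent "better model" per battle
# Assume a minority of users (the first 8) vote with the truth 90% of the
# time; the rest are close to coin-flipping.
reliability = np.where(np.arange(n_users) < 8, 0.9, 0.55)
votes = np.where(rng.random((n_pairs, n_users)) < reliability,
                 truth[:, None], 1 - truth[:, None])

# Leave-one-out consensus: for each user, the majority vote of everyone else,
# so a user's score isn't inflated by their own vote.
totals = votes.sum(axis=1, keepdims=True)
loo_consensus = ((totals - votes) / (n_users - 1) > 0.5).astype(int)
agreement = (votes == loo_consensus).mean(axis=0)

candidate_experts = np.argsort(agreement)[::-1][:8]
print("candidate experts:", sorted(candidate_experts.tolist()))
```

Caveat: this only finds *consistent* voters, and real experts can legitimately disagree with the crowd, so in practice you'd want to anchor it with gold questions or known-expert seeds rather than consensus alone.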


They always know; they just have non-AGI incentives and asymmetric upside to play along...



