
Greg nails something that seems to be passing the academic world of recommendations by: you can't measure recommendation quality with RMSE. It's just not a good metric. User happiness is the goal, not the ability to predict ratings of unrated items. I'm glad to have someone with a little more clout than me saying this.

Some ask, "What's the difference?" If I tell you about 5 albums that you've already heard of, are the recommendations good? Even if we're pretty certain you'll like them? If you're buying an obscure jazz album and you get "Kind of Blue" as a recommendation (probably the most popular jazz album in history, and one any jazz fan already knows), is that a good recommendation?

How do users build trust of recommendations? How does that factor into the algorithms? It turns out you need a mix of obvious and surprising results. All obvious and they don't discover anything; all surprising and they don't trust them.

Those are the right questions. A good recommendation algorithm is one that people interact with and discover things through.

This is an awesome read (in fact I, uhm, submitted it a few minutes before this from Greg's blog, but it's good enough that I upvoted it here too). As soon as I ran across it I immediately blogged, tweeted, and submitted it here. I'd had a draft of an essay along these lines kicking around for ages.



I think they use RMSE because it's easy, not because it's ideal. BellKor, one of the teams in the Netflix challenge, discussed this in the paper describing the method that won the progress prize: they checked whether minute differences in RMSE improved the quality of top-10 results, and they did, pretty significantly.
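
Roughly the kind of check that involves, as a sketch rather than anything from the paper itself (the data structures and the "rated it a 5" proxy are my own assumptions):

    import numpy as np

    def rmse(predicted, actual):
        # Standard Netflix-Prize-style error over held-out ratings.
        return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

    def top10_hit_rate(preds_by_user, ratings_by_user, n=10):
        # Crude proxy for top-N quality: of each user's n highest-predicted
        # held-out items, what fraction did the user actually rate a 5?
        hits = total = 0
        for user, ratings in ratings_by_user.items():
            preds = preds_by_user[user]
            top = sorted(ratings, key=lambda item: preds[item], reverse=True)[:n]
            hits += sum(1 for item in top if ratings[item] == 5)
            total += len(top)
        return hits / total if total else 0.0

Compare two models whose RMSE differs only in the third decimal place and see whether the hit rate moves; that's the flavor of comparison the paper makes.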


Just fished it out -- paper is here for the curious:

http://public.research.att.com/~volinsky/netflix/RecSys08tut...

It's one, amusingly, that I'd skipped because it seemed to be less technical. :-) Good stuff.


This hasn't been passing us by! Netflix were the ones who decided to make RMSE the criterion for their contest, and put up a million dollars to ride on it for good measure, so it's hardly a surprise all the papers are focused on it. Of course, RMSE doesn't measure user satisfaction; that's why we write papers describing the techniques that seem to work, and it's up to Netflix (and other recommendation service providers) to pick which of those they want to use given that they're maximizing something slightly different.


It's true that, not being in academia, I don't hear the conversations that fill the gaps between publications. But going just from the published output on collaborative filtering at the moment, there has been some convergence on RMSE as a benchmark. That's understandable, since it's easily measurable, and as you say, there are some folks throwing $1mm at it (which really isn't much considering what it'd do for their sales).


Still, wouldn't predicting how much somebody will like something form a good basis for a recommendation engine built on top of it? Maybe it's a waste of effort in many scenarios, but if you can do it well, can't you still layer all sorts of algorithms on top to pick the best recommendations from the predictions?


Well, that's the question underlying the article. Consider the hypothetical case of a very controversial movie: all 1's and 5's. Even if your system can tell that a user is quite likely to fall in the '5' camp, the only safe prediction for a high-variance movie is something close to the middle. Even if you are pretty sure the user would give this movie a 5, the squared error from the small chance of a 1 is enormous.
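
To make the arithmetic concrete (numbers made up purely for illustration): say 80% of viewers give the movie a 5 and 20% give it a 1. The squared-error-optimal prediction is the mean, not the most likely rating:

    # Hypothetical polarized movie: 80% of raters give it a 5, 20% give it a 1.
    p5, p1 = 0.8, 0.2

    def expected_squared_error(prediction):
        return p5 * (5 - prediction) ** 2 + p1 * (1 - prediction) ** 2

    print(expected_squared_error(5.0))  # bold guess: 0.2 * 16              = 3.2
    print(expected_squared_error(4.2))  # the mean:   0.8*0.64 + 0.2*10.24  = 2.56

Under RMSE the timid 4.2 beats the confident 5, even though 5 is by far the most likely rating.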

But a rating close to the middle is never going to be chosen as a recommendation if the algorithm recommends the movies with the highest predicted scores. Instead, an RMSE-based system will always prefer safe 4's over risky 5's. This doesn't mean that improved predictions can't yield improved recommendations, but I don't see truly great ones ever coming from a prediction-based approach.

Personally, I want a recommendation system that maximizes the percentage of recommendations that I would rate as 5's, and don't much care if the misses are 1's, 2's, or 3's.
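
That objective is basically precision-at-k restricted to 5-star ratings. A minimal sketch of what I mean, assuming you already have a ranked list of recommendations and the user's true ratings (names are mine, not from any paper):

    def fraction_of_fives(recommendations, true_ratings, k=10):
        # Of the top-k recommended items the user has actually rated,
        # what fraction would they rate a 5?  Misses (1's, 2's, 3's)
        # all count the same, unlike squared error.
        rated = [item for item in recommendations[:k] if item in true_ratings]
        if not rated:
            return 0.0
        return sum(1 for item in rated if true_ratings[item] == 5) / len(rated)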


And beyond that, the tolerance for misses is somewhat domain-specific. For something like recommending online (non-paid) content, a miss doesn't matter much: it's worth more to gamble on something a user will really like than to serve something you're sure they won't hate. If they get two great hits and three misses, it's probably still a net win for the user. On the other hand, if you're, say, doing online dating recommendations, you probably want to avoid the polarized cases, since one horrible recommendation could lose you a paying customer.



