
I can't find the part of the post where you explain that JSON parsing is the reason for the slowdown. You just say mrjob itself is slower. While I agree that mrjob's defaults encourage the use of JSON, I think it's unfair to blame lack of optimization on the framework, given that the bare Python example could just as easily have used JSON.

One real issue with mrjob is that it assumes you're only going to have one key and one value. It isn't straightforward to use multiple key fields. The workaround is to write a custom protocol (which, btw, is very simple [1]) that treats the line up to the first tab as the key and the rest of the line as the value, most likely also splitting the value on tabs and passing it along as a tuple. If we had made multipart keys simpler to use, maybe you would have chosen a more efficient format.
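To make that concrete, here's a rough sketch of such a protocol. The class name and encoding choices are mine, not mrjob's; all mrjob actually requires of a protocol is a read(line) method returning a (key, value) pair and a write(key, value) method returning bytes:

```python
class TabSeparatedValueProtocol(object):
    """Hypothetical custom protocol: key is the text up to the first
    tab, value is the remaining tab-separated fields as a tuple."""

    def read(self, line):
        # Split raw line bytes on the first tab only.
        key, _, rest = line.partition(b'\t')
        return (key.decode('utf-8'),
                tuple(rest.decode('utf-8').split('\t')))

    def write(self, key, value):
        # Rejoin key and value fields with tabs; emit bytes.
        return '\t'.join([key] + list(value)).encode('utf-8')
```

You'd then set it as your job's INTERNAL_PROTOCOL (or similar) instead of the default JSON-based one, which skips JSON encode/decode entirely between steps.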

Anyway, the main part I take issue with is:

"mrjob seems highly active, easy-to-use, and mature...but it appears to perform the slowest."

That's just not true. It would be fair to say that optimizing jobs with multipart keys isn't straightforward and therefore encourages non-optimal code, but that's moot if you're just using one key and one value, as most people do.

I'm really not trying to dump on you here. I liked the post! I would just prefer that it were more precise about these things.

[1] http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs...

EDIT: If anyone's thinking about downvoting this guy (someone did), don't. This is a discussion in good faith.


