I like Google's crush-tools, which works on delimited data (e.g. tab-delimited), a somewhat simpler and faster format than CSV. Lots of the built-in Unix tools also work on delimited data (cut, join, sort, etc.), but crush-tools fills in some gaps, like being able to do a 'uniq' on specific fields, do outer joins, sum columns, etc.:
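For a sense of the gaps being filled, here's a hedged sketch of two of them done with plain awk instead (crush-tools has dedicated commands for these, e.g. funiq; the filename and field numbers are just illustrative assumptions):

```shell
# "uniq" keyed on field 2 only: keep the first record seen for each
# distinct value in that field (tab-delimited input assumed)
awk -F'\t' '!seen[$2]++' data.tsv

# sum column 3 across all records
awk -F'\t' '{ total += $3 } END { print total }' data.tsv
```

The point of the dedicated tools is that you stop rewriting these one-liners (and getting the field separator subtly wrong) every time.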
I've used them on tens-of-GB datasets (not TB), and they're quite fast, except for the tools implemented in Perl, which is kind of a hidden gotcha. For example, calcfield is barely usable, because it evaluates a Perl expression for every record in the file. But things like funiq and cutfield are fast, at least as fast as the traditional Unix tools. And if you have pre-sorted data, aggregate2 is a nice aggregation tool for larger-than-memory datasets.
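The reason pre-sorted input lets you aggregate larger-than-memory data is that once records arrive grouped by key, you only ever need one group's running total in memory. A hedged sketch of that idea in awk (this is the streaming trick aggregate2 exploits, not its actual interface; key in field 1 and value in field 2 are assumptions):

```shell
# Stream grouped records, emitting "key<TAB>sum" each time the key
# changes; memory use is constant regardless of file size.
awk -F'\t' '
  NR > 1 && $1 != key { print key "\t" total; total = 0 }
  { key = $1; total += $2 }
  END { if (NR) print key "\t" total }
' sorted.tsv
```

If the data isn't already sorted, `sort` itself spills to disk, so `sort | awk` still stays within memory bounds, just slower.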
That's interesting. I've written Perl tools that apply pattern-matching to several-GB flat files, and while they were horrifyingly slow at first, I was able to get the runtime down to about a minute in the typical case. Honestly, I think I/O was a bigger limiting factor than Perl's processing speed the whole time.
https://code.google.com/p/crush-tools/