The problem is that your ability to explore the data and the data volume are inversely correlated. You are far more likely to find interesting things exploring an in-memory dataset with something like IPython and pandas than by throwing Pig jobs at a few dozen TB of gunk.
Big data is great if you know exactly what you are looking for. If you get to the stage where you are exploring a huge DB looking for relationships, you need to be very good at machine learning and statistical analysis (spurious correlations ahoy!) to come out significantly ahead. It's also an enormous time sink.
In short: the bigger the data, the simpler the analysis you can throw at it efficiently.
Very true. Wouldn't the typical approach to this involve probabilistic methods, like taking large-ish (but not "Big") samples from your multi-TB data and doing your EDA with those?
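Roughly what I have in mind, as a pandas sketch (the file name, chunk size, and 1% sampling rate are placeholders I made up, not anything from a real pipeline):

    import pandas as pd

    # Stream the big file in chunks and keep ~1% of each chunk, so the
    # retained sample fits in memory for ordinary pandas/IPython EDA.
    pieces = []
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        pieces.append(chunk.sample(frac=0.01, random_state=42))

    sample = pd.concat(pieces, ignore_index=True)
    print(sample.describe())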
That would work very well if our random sample accurately reflected the superset of data, which it almost always does, but you also want to consider the following...
Imagine our data is 98% junk, with the remaining 2% consisting of sequential patterns. We might spot this fairly easily on a graph of the whole dataset, but random sampling would greatly degrade that information.
We can extend that to any ordering or periodicity in the data: if the value at position n has a hidden dependency on the values at positions n ± 1, random sampling will break us.
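To make that concrete, here's a throwaway NumPy simulation (all the sizes, the block length, and the choice of a lag-1 check are just numbers I picked for illustration): it plants a weak sequential pattern in 2% of a long noisy series, and the neighbour-to-neighbour correlation that is obvious on the full ordered data essentially vanishes in a 1% random sample.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    data = rng.normal(size=n)                  # the 98% junk
    block_len = 200
    # 100 blocks of 200 positions ~= 2% of the series carry a smooth pattern
    for start in rng.choice(n - block_len, size=100, replace=False):
        t = np.arange(block_len)
        data[start:start + block_len] = 5 * np.sin(2 * np.pi * t / 50)

    def lag1(x):
        # correlation between each value and its immediate neighbour
        return np.corrcoef(x[:-1], x[1:])[0, 1]

    # a 1% random sample, kept in original order, loses the true neighbours
    sample_idx = np.sort(rng.choice(n, size=n // 100, replace=False))
    print("full series     :", round(lag1(data), 3))              # clearly non-zero
    print("1% random sample:", round(lag1(data[sample_idx]), 3))  # roughly zero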