The problem is that your ability to explore the data and the data volume are inversely correlated. You are far more likely to find interesting things exploring an in-memory dataset with something like IPython and pandas than by throwing Pig jobs at a few dozen TB of gunk.
Big data is great if you know exactly what you are looking for. If you get to the stage where you are exploring a huge DB looking for relationships, you need to be very good at machine learning and statistical analysis (spurious correlations ahoy!) to come out significantly ahead. It's also an enormous time sink.
In short: the bigger the data, the simpler the analysis you can throw at it efficiently.
Very true. Wouldn't the typical approach to this involve probabilistic methods, like taking large-ish (but not "Big") samples from your multi-TB data and doing your EDA with those?
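Roughly what I have in mind, as a pandas sketch (the file name, chunk size, and 1% sampling rate are placeholders I made up, not anything from a real pipeline):

    import pandas as pd

    # Stream the big file in chunks and keep ~1% of each chunk, so the
    # retained sample fits in memory for ordinary pandas/IPython EDA.
    pieces = []
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        pieces.append(chunk.sample(frac=0.01, random_state=42))

    sample = pd.concat(pieces, ignore_index=True)
    print(sample.describe())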
That would work very well if our random sample accurately reflected the superset of data, which it almost always does, but you also want to consider the following...
Imagine our data is 98% junk, with the remaining 2% consisting of sequential patterns. We might spot this fairly easily on a graph of the whole dataset, but random sampling would greatly degrade that information.
We can extend that to any ordering or periodicity in the data: if the value at position n has a hidden dependency on the values at positions n ± 1, random sampling will break us.
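To make that concrete, here's a throwaway NumPy simulation (all the sizes, the block length, and the choice of a lag-1 check are just numbers I picked for illustration): it plants a weak sequential pattern in 2% of a long noisy series, and the neighbour-to-neighbour correlation that is obvious on the full ordered data essentially vanishes in a 1% random sample.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    data = rng.normal(size=n)                  # the 98% junk
    block_len = 200
    # 100 blocks of 200 positions ~= 2% of the series carry a smooth pattern
    for start in rng.choice(n - block_len, size=100, replace=False):
        t = np.arange(block_len)
        data[start:start + block_len] = 5 * np.sin(2 * np.pi * t / 50)

    def lag1(x):
        # correlation between each value and its immediate neighbour
        return np.corrcoef(x[:-1], x[1:])[0, 1]

    # a 1% random sample, kept in original order, loses the true neighbours
    sample_idx = np.sort(rng.choice(n, size=n // 100, replace=False))
    print("full series     :", round(lag1(data), 3))              # clearly non-zero
    print("1% random sample:", round(lag1(data[sample_idx]), 3))  # roughly zero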