The first thing I do when I'm dealing with data that has more than, let's say, 10,000 rows is to put it into HDF format and work with that. It saves a ton of time while developing a script. I had a Python script that built a histogram; it took ~15 sec on a file with 100k rows, but after converting the file to HDF first it ran in ~0.5 sec. The import in Python is also much shorter (two lines).
HDF is built for high-performance numerical I/O. It's great: you can query the stored structures and even take slices of arrays right on the command line (with the h5tools).
It's also widely supported by Octave, Python, R, Matlab, and so on. And there's no real drawback, since you can always pipe the data into existing command-line tools via h5dump.
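In case it helps, here's a minimal sketch of what that conversion and the short import can look like in Python, using NumPy and h5py (the file names and the dataset name "values" are made up, and the original script may well have used PyTables instead):

    # Minimal sketch, not the commenter's actual script: convert a plain-text
    # column of numbers to HDF5 once, then load it quickly on every later run.
    # The file names and the dataset name "values" are made up.
    import numpy as np
    import h5py

    # One-off conversion: parse the slow text file, store it as an HDF5 dataset.
    data = np.loadtxt("measurements.txt")        # slow: text parsing
    with h5py.File("measurements.h5", "w") as f:
        f.create_dataset("values", data=data)

    # Later runs: roughly the "two line" import mentioned above.
    with h5py.File("measurements.h5", "r") as f:
        data = f["values"][:]                    # fast: raw binary read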
Wow, those benchmarks are old. 195MHz test machine?
Still, 100K rows is not much. I just did that 100K-row read/sum bit in Octave; it took about a quarter of a second to extract and sum. I assume HDF rocks for much larger sets.
http://www.hdfgroup.org/tools5desc.html#1
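For reference, a rough way to repeat that read/sum check, here in Python with h5py rather than Octave (same hypothetical file and dataset names as in the sketch above):

    # Quick-and-dirty timing of "read everything and sum it", text vs. HDF5.
    # Paths and the dataset name are hypothetical.
    import time
    import numpy as np
    import h5py

    t0 = time.perf_counter()
    total_txt = np.loadtxt("measurements.txt").sum()
    t1 = time.perf_counter()

    with h5py.File("measurements.h5", "r") as f:
        total_h5 = f["values"][:].sum()
    t2 = time.perf_counter()

    print("text: %.3fs  hdf5: %.3fs" % (t1 - t0, t2 - t1))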
HDF5:
http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Performance.pdf
http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Overview.pdf