[SciPy-User] IO of large ASCII table data

Tue Aug 17 14:13:14 EDT 2010

On Tue, Aug 17, 2010 at 11:07 AM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Tue, Aug 17, 2010 at 1:03 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>
>> On Tue, Aug 17, 2010 at 10:53 AM, Éric Depagne <edepagne at lcogt.net> wrote:
>> > On Tuesday 17 August 2010 10:41:26, Dan Lussier wrote:
>> >> I am looking to read in large (many million rows) ASCII space
>> >> separated tables into numpy arrays.
>> >>
>> >> In the past I have heard of people using Miller's TableIO to do this
>> >> but was wondering if a similarly fast method has been more recently
>> >> integrated into scipy/numpy?
>> >>
>> >> In consulting the documentation the most likely candidate is
>> >> numpy.genfromtxt(...).  Is this function pure Python or does it rely
>> >> on a C extension as was the case with Miller's TableIO?
>> >>
>> >> Any advice here would be great as my application could get seriously
>> >> bogged down (both time and memory) in reading these files into arrays
>> >> if I get onto the wrong track.
>> >>
>> >> Thanks.
>> > There is also numpy.loadtxt(), which can read data from a file.
>> > I use it to read large datasets. As for speed, here are the numbers I
>> > typically get: extracting 2.5 million lines of 10 columns takes about
>> > 3 minutes.
>>
>> For comparison, h5py (and pytables) are over 1500 times faster:
>>
>> Save data:
>>
>> >> import numpy as np
>> >> import h5py
>> >> arr = np.random.rand(2500000, 10)
>> >> f = h5py.File('/tmp/speed.hdf5', 'w')
>> >> f['arr'] = arr
>> >> f.close()
>>
>> Time the loading of data:
>>
>> $ ipython
>> >> import time
>> >> import h5py
>> >> f = h5py.File('/tmp/speed.hdf5')
>> >> t1=time.time(); a = f['arr'][:]; print(time.time() - t1)
>> 0.0953390598297
>>
>> Speed up:
>>
>> >> 3*60/0.0953390598297
>>   1887.9984795479013
>
> Keith,
>
> Note that files saved to the /tmp directory are likely on tmpfs, which is
> backed by RAM.  Your speed-up might not reflect the impact of
> disk I/O.
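
One way to check whether a given path really sits on tmpfs is to look it
up in the mount table. The following is only a sketch under a few
assumptions: a Linux system, a mount table in /proc/mounts format, and an
absolute path. The function name filesystem_type is made up for this
example, and mount points containing escaped spaces are not handled.

```python
import posixpath


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `path` must be absolute. `mounts_text` is text in /proc/mounts
    format: one mount per line, with whitespace-separated fields
    "device mountpoint fstype options dump pass".
    """
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Match only on whole path components, so that a mount at
        # "/tmp" does not claim "/tmpfoo".
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # The longest mount point wins; on a tie the later entry
            # wins, since it was mounted on top of the earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type
```

On Linux the mount table can be passed in with
filesystem_type('/tmp/speed.hdf5', open('/proc/mounts').read()); a
result of 'tmpfs' would mean the file lives in RAM-backed storage.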

Here's the time it takes to read from my home directory (regular HD):

>> f = h5py.File('/home/kg/speed.hdf5')
>> t1=time.time(); a = f['arr'][:]; print(time.time() - t1)
0.0879709720612

And here's the time it takes to read from my ram disk:

>> f = h5py.File('/dev/shm/speed.hdf5')
>> t1=time.time(); a = f['arr'][:]; print(time.time() - t1)
0.086874961853
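
To repeat the comparison from this thread on another machine, something
like the following could be used. It is a minimal sketch, not a careful
benchmark: it writes a small random table both as space-separated ASCII
and as HDF5, then times numpy.loadtxt against an h5py read. The helper
names (timed, read_hdf5, compare), the dataset name 'arr' and the default
table size are illustrative choices, and the code assumes Python 3 with
numpy and h5py installed.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def timed(fn):
    """Call fn() once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0


def read_hdf5(path, name="arr"):
    """Open the file read-only and copy the whole dataset into memory."""
    with h5py.File(path, "r") as f:
        return f[name][:]


def compare(rows=20000, cols=10):
    """Write one random table as ASCII and as HDF5, then time both reads."""
    arr = np.random.rand(rows, cols)
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "table.txt")
        h5_path = os.path.join(tmp, "table.hdf5")

        np.savetxt(txt_path, arr)  # space-separated ASCII table
        with h5py.File(h5_path, "w") as f:
            f.create_dataset("arr", data=arr)

        from_txt, t_txt = timed(lambda: np.loadtxt(txt_path))
        from_h5, t_h5 = timed(lambda: read_hdf5(h5_path))
    return arr, from_txt, from_h5, t_txt, t_h5


if __name__ == "__main__":
    _, _, _, t_txt, t_h5 = compare()
    print("loadtxt: %.4f s, h5py: %.4f s" % (t_txt, t_h5))
```

A file that was just written or recently read is probably served from
the operating system's page cache, so timings from a sketch like this may
say little about the disk itself.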