[SciPy-User] IO of large ASCII table data

Dan Lussier dtlussier at gmail.com
Tue Aug 17 14:41:21 EDT 2010


That's great.  Thanks.

I am going to give np.fromfile(...) a try, since my data is pretty
uniform, and will fix up the output as necessary.

On my system, reading a 1.2M-row by 11-column data file with
np.genfromtxt(...) took 50 seconds, while np.fromfile(...) plus an
np.reshape(...) to restore the 2-D shape took under 2 seconds.
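
Roughly what that looks like, for reference (the file name and column
count here are placeholders; the file is assumed to be plain
whitespace-separated floats with no header or missing values):

import numpy as np

ncols = 11
slow = np.genfromtxt('table.dat')                            # line-by-line parsing
fast = np.fromfile('table.dat', sep=' ').reshape(-1, ncols)  # flat read + reshape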

Dan

On Tue, Aug 17, 2010 at 1:13 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Tue, Aug 17, 2010 at 11:07 AM, Benjamin Root <ben.root at ou.edu> wrote:
>>
>>
>> On Tue, Aug 17, 2010 at 1:03 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>
>>> On Tue, Aug 17, 2010 at 10:53 AM, Éric Depagne <edepagne at lcogt.net> wrote:
>>> > On Tuesday 17 August 2010 at 10:41:26, Dan Lussier wrote:
>>> >> I am looking to read large (many millions of rows) ASCII
>>> >> space-separated tables into numpy arrays.
>>> >>
>>> >> In the past I have heard of people using Miller's TableIO to do this
>>> >> but was wondering if a similarly fast method has been more recently
>>> >> integrated into scipy/numpy?
>>> >>
>>> >> In consulting the documentation the most likely candidate is
>>> >> numpy.genfromtxt(...).  Is this function pure Python or does it rely
>>> >> on a C extension, as was the case with Miller's TableIO?
>>> >>
>>> >> Any advice here would be great as my application could get seriously
>>> >> bogged down (in both time and memory) when reading these files into arrays
>>> >> if I get onto the wrong track.
>>> >>
>>> >> Thanks.
>>> > There is also the numpy.loadtxt() method, which can read data from a
>>> > file. I use it to read large datasets. As for its speed, here are the
>>> > numbers I typically get: extracting 2.5 million lines and 10 columns
>>> > takes ~3 minutes.
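>>> >
>>> > A minimal call for a file like that (the file name is hypothetical,
>>> > and the data is assumed to be plain whitespace-separated numbers)
>>> > would be:
>>> >
>>> > >>> import numpy as np
>>> > >>> arr = np.loadtxt('table.dat')   # 2-D array, one row per line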
>>>
>>> For comparison, h5py (and pytables) are over 1500 times faster:
>>>
>>> Save data:
>>>
>>> >>> arr = np.random.rand(2500000, 10)
>>> >>> import h5py
>>> >>> f = h5py.File('/tmp/speed.hdf5')
>>> >>> f['arr'] = arr
>>>
>>> Time the loading of data:
>>>
>>> $ ipython
>>> >>> import time
>>> >>> import h5py
>>> >>> f = h5py.File('/tmp/speed.hdf5')
>>> >>> t1 = time.time(); a = f['arr'][:]; print time.time() - t1
>>> 0.0953390598297
>>>
>>> Speed-up:
>>>
>>> >>> 3*60/0.0953390598297
>>>   1887.9984795479013
>>
>> Keith,
>>
>> Note that files saved to the /tmp directory are likely on tmpfs, which is
>> RAM-backed.  Your speed-up might not reflect the impact of disk I/O.
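>>
>> (One way to check, on Linux at least, is to look up the filesystem type
>> for /tmp in /proc/mounts; a quick sketch:
>>
>> >>> [l.split()[2] for l in open('/proc/mounts') if l.split()[1] == '/tmp']
>>
>> An entry of 'tmpfs' means the file lives in RAM rather than on disk.)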
>
> Here's the time it takes to read from my home directory (regular HD):
>
> >>> f = h5py.File('/home/kg/speed.hdf5')
> >>> t1 = time.time(); a = f['arr'][:]; print time.time() - t1
> 0.0879709720612
>
> And here's the time it takes to read from my ram disk:
>
> >>> f = h5py.File('/dev/shm/speed.hdf5')
> >>> t1 = time.time(); a = f['arr'][:]; print time.time() - t1
> 0.086874961853