[Numpy-discussion] loading data

Francesc Alted faltet at pytables.org
Fri Jun 26 07:05:58 EDT 2009


On Friday 26 June 2009 12:38:11, Mag Gam wrote:
> Thanks everyone for the great and well thought out responses!
>
> To make matters worse, this is actually a 50 GB compressed CSV file. So
> it looks like this: 2009.06.01.plasmasub.csv.gz
> We get this data from another lab on the West Coast every night,
> so I don't have the option of receiving this file natively in HDF5.
> We are sticking with HDF5 because we have other applications that use
> this data and we wanted to standardize on HDF5.

Well, since you are adopting HDF5, the best solution would be for the West 
Coast lab to send the file directly in HDF5.  That would save you a lot of 
headaches.  If this is not possible, then I think the best thing would be to 
profile your code and see where the bottleneck is.  Running the converter 
under cProfile normally gives a good insight into what is consuming the most 
time.
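
For instance, a minimal sketch of profiling with cProfile; the convert() 
function below is just a placeholder for your own csv.gz -> HDF5 routine:

import cProfile
import pstats

def convert():
    # Placeholder for your actual csv.gz -> HDF5 conversion code.
    pass

# Run the converter under the profiler and dump the stats to a file.
cProfile.run("convert()", "convert.prof")

# Show the 10 functions with the largest cumulative time.
stats = pstats.Stats("convert.prof")
stats.sort_stats("cumulative").print_stats(10)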

There are three likely hot spots: the decompressor (gzip), np.loadtxt and the 
HDF5 writer function.  If the problem is gzip, then you won't be able to 
accelerate the conversion unless the other lab is willing to use a lighter 
compressor (lzop, for example).  If it is np.loadtxt(), then ask yourself 
whether you are trying to load everything in memory; if you are, don't do 
that; load & write slice by slice instead.  Finally, if the problem is on the 
HDF5 side, try to write array slices (and not record-by-record writes).
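
Here is a rough sketch of that slice-by-slice approach, assuming a purely 
numeric CSV and using a PyTables EArray on the HDF5 side (the output file 
name, the number of columns and the chunk size are just placeholders; h5py 
would work equally well):

import gzip
import itertools

import numpy as np
import tables

CHUNK_LINES = 100000   # lines parsed and written per slice; tune to your RAM
NCOLS = 5              # hypothetical number of columns in the CSV

h5file = tables.open_file("plasmasub.h5", mode="w")
# Enlargeable array so slices can be appended one after another.
earray = h5file.create_earray(h5file.root, "data",
                              atom=tables.Float64Atom(),
                              shape=(0, NCOLS))

with gzip.open("2009.06.01.plasmasub.csv.gz", "rt") as f:
    while True:
        lines = list(itertools.islice(f, CHUNK_LINES))
        if not lines:
            break
        # Parse only this slice, then append it to the HDF5 array.
        block = np.loadtxt(lines, delimiter=",")
        earray.append(block.reshape(-1, NCOLS))

h5file.close()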

> Also, I am curious about Neil's np.memmap. Do you have some sample
> code for mapping a compressed CSV file into memory and loading the
> dataset into a dset (HDF5 structure)?

No, np.memmap is meant to map *uncompressed binary* files in memory, so you 
can't follow this path.
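
Just to illustrate what np.memmap *is* meant for, a minimal sketch on an 
uncompressed flat binary file (the file name, dtype and shape are made up):

import numpy as np

# Create a small uncompressed binary file to map (illustration only).
np.arange(1000000, dtype=np.float64).tofile("plasmasub.raw")

# Map it; data is read lazily from disk as slices are accessed.
mm = np.memmap("plasmasub.raw", dtype=np.float64, mode="r",
               shape=(1000000,))
print(mm[:10])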

-- 
Francesc Alted


