best way to read a huge ASCII file.

marco.nawijn at colosso.nl
Tue Nov 29 11:43:38 EST 2016


On Tuesday, November 29, 2016 at 3:18:29 PM UTC+1, Heli wrote:
> Hi all, 
> 
> Let me update my question. I have an ASCII file (7 GB) which has around 100M lines. I read this file using:
> 
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 
> 
> x=f[:,1] 
> y=f[:,2] 
> z=f[:,3] 
> id=f[:,0] 
> 
> I will need the x, y, z and id arrays later for interpolation. The problem is that reading the file takes around 80 minutes while the interpolation only takes 15 minutes.
> 
> I tried to get the memory increment used by each line of the script using the Python memory_profiler module.
> 
> The following line, which reads the entire 7.4 GB file, increments the memory usage by 3206.898 MiB (3.36 GB). My first question is: why does it not increment the memory usage by 7.4 GB?
> 
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 
> 
> The following four lines do not increment the memory usage at all. 
> x=f[:,1] 
> y=f[:,2] 
> z=f[:,3] 
> id=f[:,0] 
> 
> Finally, I would still appreciate it if you could recommend the most efficient way to read/write files in Python. Are numpy's np.loadtxt and np.savetxt the best?
> 
> Thanks in advance,

Hi,

Have you considered storing the data in HDF5? There is an excellent Python
interface for this (see: http://www.h5py.org/). The advantage is that no
text-to-number conversion has to be applied anymore: you can operate
directly on the datasets in the HDF5 file.

If you go in this direction, the following would get you started:

>>> import h5py
>>> import numpy
>>> from numpy import uint32, float32, arange

>>> # Note: mode='w' truncates the file; use 'r' or 'a' to open an existing one
>>> fd = h5py.File('demo.h5', mode='w')
>>> observations = fd.create_group('/observations')
>>> N = 1000000 
>>> observations.create_dataset('id', data=arange(0, N, dtype=uint32))
>>> observations.create_dataset('x', data=numpy.random.random(N), dtype=float32)
>>> observations.create_dataset('y', data=numpy.random.random(N), dtype=float32)
>>> observations.create_dataset('z', data=numpy.random.random(N), dtype=float32)
>>> fd.close()
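
Reading the data back is equally simple, and the handle you get is a view
into the file, so slicing only pulls the requested values from disk. A
minimal sketch, reusing the demo.h5 file created above:

>>> fd = h5py.File('demo.h5', mode='r')
>>> x = fd['/observations/x']      # a dataset handle, not an in-memory copy
>>> x.shape, x.dtype
((1000000,), dtype('float32'))
>>> x_head = x[:1000]              # only these 1000 values are read from disk
>>> fd.close()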

Note that you can also combine x, y and z into a single dataset if you want
to; see the dataset documentation for more information:
http://docs.h5py.org/en/latest/high/dataset.html
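
For example, a single (N, 3) coordinate dataset could look like this (just a
sketch; the demo2.h5 file name and the x/y/z column order are assumptions):

>>> import h5py, numpy
>>> N = 1000000
>>> fd = h5py.File('demo2.h5', mode='w')
>>> xyz = numpy.random.random((N, 3)).astype(numpy.float32)
>>> dset = fd.create_dataset('/observations/xyz', data=xyz)
>>> dset[:5, 0]                    # the first five x values
>>> fd.close()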

I would also advise you to select the proper dtype for the arrays carefully,
in particular if you know the value range of your data. This can save you a
lot of disk space and will probably improve performance a little; for
example, storing 100M values as float32 instead of the default float64
halves each array from 800 MB to 400 MB.
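
To tie this together: converting the 7 GB text file to HDF5 is a one-time
cost, and you can do it in chunks so the whole file never has to fit in
memory. Below is a rough sketch using pandas for its chunked text reader;
the file names, the whitespace delimiter and the id/x/y/z column order are
assumptions based on your snippet:

import h5py
import numpy as np
import pandas as pd

CHUNK = 1000000  # rows per chunk; tune to the available memory

with h5py.File('points.h5', mode='w') as fd:
    grp = fd.create_group('/observations')
    # Resizable 1-D datasets, so each chunk can be appended as it is parsed.
    dsets = {name: grp.create_dataset(name, shape=(0,), maxshape=(None,),
                                      dtype=dt)
             for name, dt in [('id', np.uint32), ('x', np.float32),
                              ('y', np.float32), ('z', np.float32)]}
    reader = pd.read_csv('myfile.txt', delim_whitespace=True, header=None,
                         names=['id', 'x', 'y', 'z'], chunksize=CHUNK)
    for chunk in reader:
        n_old = dsets['id'].shape[0]
        n_new = n_old + len(chunk)
        for name, dset in dsets.items():
            dset.resize((n_new,))
            dset[n_old:n_new] = chunk[name].values

After that, every later run can open points.h5 in 'r' mode and read only the
slices it needs.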

Marco


