best way to read a huge ascii file.

Jussi Piitulainen jussi.piitulainen at helsinki.fi
Tue Nov 29 09:45:23 EST 2016


Heli writes:

> Hi all, 
>
> Let me update my question. I have an ASCII file (7 GB) which has around
> 100M lines.  I read this file using:
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 
>
> x=f[:,1] 
> y=f[:,2] 
> z=f[:,3] 
> id=f[:,0] 
>
> I will need the x, y, z and id arrays later for interpolations.  The
> problem is that reading the file takes around 80 minutes while the
> interpolation takes only 15 minutes.

(Are there only those four columns in the file? I guess yes.)

> The following line, which reads the entire 7.4 GB file, increases
> memory usage by 3206.898 MiB (3.36 GB).  The first question is: why
> does it not increase memory usage by 7.4 GB?
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 

In general, doubles take more space as text than as, well, doubles,
which (in those arrays) take eight bytes (64 bits) each:

>>> len("0.1411200080598672 -0.9899924966004454 -0.1425465430742778 20.085536923187668 ")
78
>>> 4*8
32
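
That also roughly accounts for the memory figure: assuming the file
really has about 100M lines of four doubles each (my reading of the
numbers above), the in-memory array should take about 3.2 GB, close
to the reported 3.36 GB increment:

>>> 100000000 * 4 * 8
3200000000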

> Finally, I would still appreciate it if you could recommend the most
> optimized way to read/write files in Python.  Are numpy's np.loadtxt
> and np.savetxt the best?

A document I found says "This function aims to be a fast reader for
simply formatted files", so as long as you want to save the numbers
as text, this is probably meant to be the best way:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
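
As a side note, delimiter=None and skiprows=0 are already the
defaults, so they can be dropped, and unpack=True makes loadtxt hand
back the columns as separate arrays.  A minimal sketch, with
"points.txt" standing in for your file:

import numpy as np

# ids comes from column 0 and x/y/z from columns 1-3, as in your code.
# (Renamed id to ids so as not to shadow the built-in id.)
ids, x, y, z = np.loadtxt("points.txt", unpack=True)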

There are binary load and save functions: numpy.save writes an array
to a binary .npy file and numpy.load reads it back.  They should be
faster, since no text parsing is involved.  The binary data file would
be opaque, but probably you are not editing it by hand anyway.
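
A minimal sketch of that approach (the file names here are made up):
convert the text file once, then let every later run load the binary
file directly.

import numpy as np

# One-time conversion: parse the text file (slow), then save the
# array in NumPy's binary .npy format.
f = np.loadtxt("points.txt")
np.save("points.npy", f)

# Every later run: load the binary file, skipping text parsing.
f = np.load("points.npy")
ids, x, y, z = f[:, 0], f[:, 1], f[:, 2], f[:, 3]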


