best way to read a huge ascii file.

Jussi Piitulainen jussi.piitulainen at helsinki.fi
Tue Nov 29 09:45:23 EST 2016


Heli writes:

> Hi all, 
>
> Let me update my question. I have an ASCII file (7 GB) which has around
> 100M lines.  I read this file using:
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 
>
> x=f[:,1] 
> y=f[:,2] 
> z=f[:,3] 
> id=f[:,0] 
>
> I will need the x, y, z and id arrays later for interpolations.  The
> problem is that reading the file takes around 80 minutes while the
> interpolation takes only 15 minutes.

(Are there only those four columns in the file? I guess yes.)

> The following line, which reads the entire 7.4 GB file, increases
> memory usage by 3206.898 MiB (3.36 GB).  The first question is: why
> does it not increase memory usage by 7.4 GB?
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 

In general, doubles take more space as text than as, well, doubles,
which (in those arrays) take eight bytes (64 bits) each:

>>> len("0.1411200080598672 -0.9899924966004454 -0.1425465430742778 20.085536923187668 ")
78
>>> 4*8
32
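
That also roughly accounts for the memory figure: assuming the file
really has about 100M lines of four doubles each (my reading of the
numbers above), the in-memory array should take about 3.2 GB, close
to the reported 3.36 GB increment:

>>> 100000000 * 4 * 8
3200000000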

> Finally, I would still appreciate it if you could recommend the most
> optimized way to read/write files in Python.  Are numpy's np.loadtxt
> and np.savetxt the best?

A document I found says "This function aims to be a fast reader for
simply formatted files", so as long as you want to save the numbers
as text, this is probably meant to be the best way:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
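
As a side note, delimiter=None and skiprows=0 are already the
defaults, so they can be dropped, and unpack=True makes loadtxt hand
back the columns as separate arrays.  A minimal sketch, with
"points.txt" standing in for your file:

import numpy as np

# ids comes from column 0 and x/y/z from columns 1-3, as in your code.
# (Renamed id to ids so as not to shadow the built-in id.)
ids, x, y, z = np.loadtxt("points.txt", unpack=True)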

There are binary load and save functions: numpy.save writes an array
to a binary .npy file and numpy.load reads it back.  They should be
faster, since no text parsing is involved.  The binary data file would
be opaque, but probably you are not editing it by hand anyway.
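
A minimal sketch of that approach (the file names here are made up):
convert the text file once, then let every later run load the binary
file directly.

import numpy as np

# One-time conversion: parse the text file (slow), then save the
# array in NumPy's binary .npy format.
f = np.loadtxt("points.txt")
np.save("points.npy", f)

# Every later run: load the binary file, skipping text parsing.
f = np.load("points.npy")
ids, x, y, z = f[:, 0], f[:, 1], f[:, 2], f[:, 3]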


