best way to read a huge ascii file.

Steve D'Aprano steve+python at pearwood.info
Tue Nov 29 18:20:01 EST 2016


On Wed, 30 Nov 2016 01:17 am, Heli wrote:

> The following line which reads the entire 7.4 GB file increments the
> memory usage by 3206.898 MiB (3.36 GB). First question is Why it does not
> increment the memory usage by 7.4 GB?
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0)

Floating point numbers stored as text typically take up far more space than
binary floats do. On disk, a string like "3.141592653589793" requires 17
bytes. Plus there are additional bytes used as separators between fields, and
at least one more byte (a newline) at the end of each record. Whereas, once
converted to a float (a C 64-bit double), it only requires 8 bytes. In a
numpy array, there's no separator needed and the values are tightly packed.

So it's quite reasonable to expect a saving of around 50%.
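For what it's worth, you can check those two figures directly (a quick
sketch, assuming numpy is importable as usual; not taken from your data):

import numpy as np

s = "3.141592653589793"
print(len(s))                        # 17 bytes as text, before separators/newlines
print(np.dtype(np.float64).itemsize) # 8 bytes once stored as a C double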

> The following 4 lines do not increment the memory at all.
> x=f[:,1]
> y=f[:,2]
> z=f[:,3]
> id=f[:,0]

Numpy slices are views, not copies.
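You can verify that with a small array (np.shares_memory is the obvious
check, assuming a reasonably recent numpy):

import numpy as np

f = np.arange(12.0).reshape(4, 3)   # stand-in for your loaded array
x = f[:, 1]                         # basic slice: a view, no new data buffer

print(np.shares_memory(f, x))       # True -- x reuses f's memory
x[0] = 99.0
print(f[0, 1])                      # 99.0 -- writing through the view changes f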


> Finally I still would appreciate if you could recommend me what is the
> most optimized way to read/write to files in python? are numpy np.loadtxt
> and np.savetxt the best?

You're not just reading a file. You're reading a file and converting
millions of strings to floats.
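If you want a feel for how much of that is parsing rather than raw I/O, a
rough timing sketch along these lines (my own illustration, not a measurement
of your file) makes the point:

import time
import numpy as np

rows = 200_000
data = np.random.rand(rows, 4)
# Render the same numbers as text lines, roughly what loadtxt has to chew through.
lines = [" ".join("%.17g" % v for v in row) for row in data]

t0 = time.perf_counter()
parsed = np.loadtxt(lines)          # parse text -> floats
t1 = time.perf_counter()
copied = data.copy()                # same volume of data, binary, no parsing
t2 = time.perf_counter()

print("parse text : %.3f s" % (t1 - t0))
print("binary copy: %.3f s" % (t2 - t1))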

You are processing 7GB of data in 80 minutes, or around 1.5MB per second. Do
you have reason to think that's unreasonably slow? (Apart from wishing that
it were faster.) Where are you reading the file from? How much RAM do you
have?



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



