best way to read a huge ascii file.

BartC bc at freeuk.com
Tue Nov 29 12:29:35 EST 2016


On 29/11/2016 14:17, Heli wrote:
> Hi all,
>
> Let me update my question. I have an ASCII file (7 GB) which has around 100M lines. I read this file using:
>
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0)
>
> x=f[:,1]
> y=f[:,2]
> z=f[:,3]
> id=f[:,0]
>
> I will need the x, y, z and id arrays later for interpolations. The problem is that reading the file takes around 80 minutes while the interpolation only takes 15 minutes.
>
> I tried to get the memory increment used by each line of the script using python memory_profiler module.
>
> The following line, which reads the entire 7.4 GB file, increments the memory usage by 3206.898 MiB (3.36 GB). The first question is: why does it not increment the memory usage by 7.4 GB?

Is there enough total RAM capacity for another 4.2 GB?

But if the file is text, and is being read into binary data in memory, the 
sizes will differ: binary data usually takes less space. I assume the 
loader doesn't load the entire text file first, do the conversions to 
binary, and then unload the file, as that would require about 10.6 GB during 
that process!
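
As a rough sanity check (a back-of-the-envelope sketch, assuming loadtxt 
produces a float64 array of roughly 100 million rows by 4 columns), the 
binary result works out to about 3 GB, the same ballpark as the 
3206.898 MiB reported:

import numpy as np

# Assumed shape: ~100 million rows, 4 float64 columns (id, x, y, z).
rows, cols = 100 * 10**6, 4
itemsize = np.dtype(np.float64).itemsize      # 8 bytes per value
total_bytes = rows * cols * itemsize
print(total_bytes / 2.0**20, "MiB")           # ~3052 MiB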

> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0)
>
> The following 4 lines do not increment the memory at all.
> x=f[:,1]
> y=f[:,2]
> z=f[:,3]
> id=f[:,0]

That's surprising, because if those were ordinary slices they would normally 
create copies (I suppose you don't set f to 0 or something after those 
lines). But with numpy data, I seem to remember that basic slices are 
actually views into the same underlying data, so no extra memory is needed.
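
For example, a quick sketch with a small stand-in array (not the real data) 
shows that a column slice shares the parent array's memory:

import numpy as np

f = np.zeros((3, 4))              # small stand-in for the loaded array
x = f[:, 1]                       # basic slicing returns a view, not a copy
print(np.shares_memory(f, x))     # True: the column reuses f's buffer
x[0] = 99.0
print(f[0, 1])                    # 99.0: writing through the view changes f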

> Finally I still would appreciate if you could recommend me what is the most optimized way to read/write to files in python? are numpy np.loadtxt and np.savetxt the best?

Why not post a sample couple of lines from the file? (We don't need the 
other 99,999,998, assuming they all have the same format.) Then we 
can see if there's anything obviously inefficient about it.
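
Something as simple as this (with "myfile" standing in as a placeholder for 
the real path) would pull out a couple of sample lines without reading the 
whole 7 GB:

with open("myfile") as fh:            # "myfile" is a placeholder path
    for _ in range(2):                # only the first couple of lines needed
        print(fh.readline().rstrip())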

-- 
Bartc


