best way to read a huge ascii file.

Steve D'Aprano steve+python at pearwood.info
Fri Nov 25 17:36:17 EST 2016


On Sat, 26 Nov 2016 02:17 am, Heli wrote:

> Hi,
> 
> I have a huge ASCII file (40 GB) with around 100M lines. I read this
> file using:
> 
> f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0)
[...]
> I will need the x,y,z and id arrays later for interpolations. The problem
> is reading the file takes around 80 min while the interpolation only takes
> 15 mins.

There's no way of telling whether this is good or bad performance. Where
are you reading it from? Over a network file share on the other side of the
world? From a USB hard drive on a USB 2 cable? From a blazing-fast SSD on a
server-class machine with a TB of RAM?

My suggestion is that before you spend any more time trying to optimize the
software, you try a simple test that will tell you whether or not you are
wasting your time: make a copy of this 40GB file. I'd expect that making a
copy should take *longer* than just reading the file, because you have to
read and write 40GB. If you find that it takes (let's say) 30 minutes to
read and write a copy of the file, and 80 minutes for numpy to read the
file, then it's worth looking at optimizing the process. But if it takes
(say) 200 minutes to copy the file, then probably not: when you're dealing
with large quantities of data, it simply takes time to move that many bytes
from place to place.
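
A cheaper variant of the copy test is to time a raw sequential read of the
file and see what your storage actually delivers. A minimal sketch
(untested; "path" is a placeholder for wherever your file lives, and the
64 MB chunk size is arbitrary):

    import time

    path = "myfile.txt"  # placeholder: wherever your 40GB file lives
    t0 = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(64 * 1024 * 1024)  # read 64 MB at a time
            if not chunk:
                break
            nbytes += len(chunk)
    elapsed = time.time() - t0
    print("read %d bytes in %.1f seconds (%.1f MB/s)"
          % (nbytes, elapsed, nbytes / elapsed / 1e6))

If that alone takes most of an hour, no amount of tuning loadtxt will save
you; the bottleneck is the storage, not numpy.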

It would also help if you told us a bit more about the machine you are
running on, specifically how much RAM you have. If you're trying to
process a 40GB file in memory on a machine with 2GB of RAM, you're going to
have a bad time...
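
For a rough sense of scale (assuming your four arrays, x, y, z and id, end
up as float64, which is what loadtxt gives you by default):

    lines = 100 * 10**6          # ~100M lines
    cols = 4                     # x, y, z, id
    bytes_per_value = 8          # float64
    print(lines * cols * bytes_per_value / 1e9)  # ~3.2, i.e. about 3.2 GB

And that's just the final arrays: as far as I know, loadtxt accumulates the
parsed rows in ordinary Python lists before building the array, so its peak
memory use while parsing can be several times that. If the peak exceeds
your RAM, you'll be swapping to disk, and that alone could explain the 80
minutes.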


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



