[Numpy-discussion] Efficient way to load a 1Gb file?

Russell E. Owen rowen at uw.edu
Thu Aug 11 14:50:14 EDT 2011


In article 
<CANm_+Zqmsgo8Q+Oz_0RCya-hJv4Q7PqynDb=LzrgvbTxGY3MWQ at mail.gmail.com>,
 Anne Archibald <aarchiba at physics.mcgill.ca> wrote:

> There was also some work on a semi-mutable array type that allowed
> appending along one axis, then 'freezing' to yield a normal numpy
> array (unfortunately I'm not sure how to find it in the mailing list
> archives). One could write such a setup by hand, using mmap() or
> realloc(), but I'd be inclined to simply write a filter that converted
> the text file to some sort of binary file on the fly, value by value.
> Then the file can be loaded in or mmap()ed.  A 1 Gb text file is a
> miserable object anyway, so it might be desirable to convert to (say)
> HDF5 and then throw away the text file.
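For what it's worth, a minimal sketch of the on-the-fly conversion Anne describes
might look something like this (the filenames are placeholders, and it assumes a
text file of whitespace-separated floats):

    import numpy as np

    def text_to_binary(txt_path, bin_path, dtype=np.float64):
        """Stream the text file to a raw binary file, value by value,
        so the full array never has to be built up in Python memory."""
        n_values = 0
        with open(txt_path) as src, open(bin_path, "wb") as dst:
            for line in src:
                row = np.array(line.split(), dtype=dtype)
                row.tofile(dst)
                n_values += row.size
        return n_values

    n = text_to_binary("data.txt", "data.bin")
    # Memory-map the binary file; data is only paged in as it is accessed.
    data = np.memmap("data.bin", dtype=np.float64, mode="r", shape=(n,))

Converting to HDF5 instead (with h5py or PyTables) gives the same streaming
behavior plus self-describing metadata, at the cost of an extra dependency.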

Thank you and the others for your help.

It seems a shame that loadtxt has no argument for the predicted length, 
which would allow preallocation and much less appending and copying of data.

And yes...reading the whole file first to figure out how many elements 
it has seems sensible to me -- at least as a switchable behavior, and 
preferably the default. 1 Gb isn't that large on modern systems, but 
loadtxt is filling up all 6 Gb of RAM reading it!

I'll suggest the HDF5 solution to my colleague. Meanwhile I think he's 
hacked around the problem by reading the file through once to figure out 
the array length, allocating that, and reading the data in with a Python 
loop. Sounds slow, but it's working.
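For the record, that two-pass approach can stay fairly simple. A rough sketch
(the filename is a placeholder; it assumes every line has the same number of
whitespace-separated values and there are no blank or comment lines):

    import numpy as np

    path = "data.txt"

    # First pass: count the rows (and get the column count from the
    # first line) without keeping anything in memory.
    with open(path) as f:
        ncols = len(f.readline().split())
        nrows = 1 + sum(1 for _ in f)

    # Second pass: fill a preallocated array, so nothing is appended or copied.
    data = np.empty((nrows, ncols), dtype=np.float64)
    with open(path) as f:
        for i, line in enumerate(f):
            data[i] = [float(x) for x in line.split()]

The second pass is a plain Python loop, so it isn't fast, but the memory
footprint stays at roughly one array's worth instead of the several-fold
blow-up loadtxt produces.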

-- Russell



