[Numpy-discussion] memory-efficient loadtxt

Paul Anton Letnes paul.anton.letnes at gmail.com
Wed Oct 3 12:05:49 EDT 2012


On 1 Oct 2012, at 21:07, Chris Barker wrote:

> Paul,
> 
> Nice to see someone working on these issues, but:
> 
> I'm not sure what problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot of overhead.

Oh, there's significant overhead, since we're not talking about a list - we're talking about a list of lists. My guesstimate from my hacking session (off the top of my head - I don't have my benchmarks in working memory :) is around 3-5 times more memory with the list-of-lists approach for a single column / 1D array, which is presumably the worst case (a length-1 list for each line of input). Hence, if you want to load a 2 GB file into RAM on a machine with 4 GB of the stuff, you're screwed with the old approach and a happy camper with mine.
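
For illustration, here's a rough way to see that overhead under CPython (sys.getsizeof accounting is approximate and version-dependent, so treat the numbers as a sketch, not my actual benchmark):

import sys
import numpy as np

n = 10**6
# One length-1 list per input line, as the list-of-lists approach
# builds for a single-column file.
rows = [[float(i)] for i in range(n)]

# Approximate CPython accounting: the outer list's pointer array,
# each inner list object, and each boxed float.
overhead = (sys.getsizeof(rows)
            + sum(sys.getsizeof(r) for r in rows)
            + sum(sys.getsizeof(v) for r in rows for v in r))

arr = np.asarray(rows)   # what ultimately gets returned
print(overhead)          # bytes held during accumulation
print(arr.nbytes)        # 8 * n bytes in the final float64 array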

> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so there's not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
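
(For reference, a minimal sketch of that kind of numpy-backed accumulator - not your actual code, the class name and growth policy are just my guesses:)

import numpy as np

class Accumulator(object):
    """Growable 1D buffer backed by an ndarray: amortized O(1)
    appends, but 8 bytes per value instead of a boxed Python
    object per value."""
    def __init__(self, dtype=np.float64):
        self._buf = np.empty(64, dtype=dtype)
        self._n = 0

    def append(self, value):
        if self._n == self._buf.size:
            # Double the capacity, much like a Python list does.
            grown = np.empty(2 * self._buf.size, dtype=self._buf.dtype)
            grown[:self._n] = self._buf
            self._buf = grown
        self._buf[self._n] = value
        self._n += 1

    def toarray(self):
        # This final copy is the factor-of-2 peak discussed below.
        return self._buf[:self._n].copy()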

I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor-of-2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missing something here?
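
To make the alternative concrete, here's roughly what I mean by the double-read approach (a minimal sketch for whitespace-delimited floats, not the actual patch):

import numpy as np

def loadtxt_two_pass(fname, dtype=np.float64):
    # Pass 1: count rows and columns without storing any data.
    nrows = 0
    ncols = 0
    with open(fname) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if nrows == 0:
                ncols = len(parts)
            nrows += 1

    # Allocate the result once, at its final size: no list-of-lists
    # during parsing and no second copy at the end.
    out = np.empty((nrows, ncols), dtype=dtype)

    # Pass 2: parse straight into the preallocated array.
    with open(fname) as f:
        i = 0
        for line in f:
            parts = line.split()
            if not parts:
                continue
            out[i] = [float(x) for x in parts]
            i += 1
    return out

The price is reading the file twice, which is why it only really pays off when the array is a sizable fraction of RAM.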

Cheers
Paul



