[Numpy-discussion] memory-efficient loadtxt

Chris Barker chris.barker at noaa.gov
Wed Oct 3 12:22:35 EDT 2012


On Wed, Oct 3, 2012 at 9:05 AM, Paul Anton Letnes
<paul.anton.letnes at gmail.com> wrote:

>> I'm not sure the problem you are trying to solve -- accumulating in a
>> list is pretty efficient anyway -- not a whole lot overhead.
>
> Oh, there's significant overhead, since we're not talking of a list - we're talking of a list-of-lists.

Hmm, a list of numpy scalars (custom dtype) would be a better option,
though maybe not all that much better -- you still pay for an extra
pointer and Python object per row.
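
Something like this, as a rough, untested sketch (the dtype and field
names are made up for illustration):

import numpy as np

# hypothetical record layout -- any structured dtype would do
row_dtype = np.dtype([('x', np.float64), ('y', np.float64), ('z', np.float64)])

rows = []  # a flat Python list, one numpy (void) scalar per row
for line in open('data.txt'):
    fields = tuple(float(s) for s in line.split())
    # np.array(...) makes a 0-d record array; [()] pulls out the scalar
    rows.append(np.array(fields, dtype=row_dtype)[()])

# one copy at the end packs the scalars into a contiguous array
result = np.array(rows, dtype=row_dtype)

Each scalar still costs a list slot (pointer) plus a Python object
header, which is the per-row overhead I mean above.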


> I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor-of-2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missing something here?

Indeed, I think that's how my current accumulator works -- the
__array__() method returns a copy of the data buffer, so that a later
under-the-hood re-allocation can't screw up the returned version.

But it is indeed accumulating in a numpy array, so it should be
possible, maybe even easy, to turn it into a regular array without a
data copy. You'd just have to destroy the original somehow (or mark it
as never-resize) so the two objects wouldn't clash over the buffer.
Messing with the OWNDATA flag might take care of that.
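
Roughly like this -- an untested sketch, with illustrative names
(Accumulator, finalize) that aren't from any real implementation:

import numpy as np

class Accumulator:
    def __init__(self, dtype, capacity=1024):
        self._data = np.empty(capacity, dtype=dtype)  # over-allocated buffer
        self._n = 0            # rows actually filled
        self._frozen = False   # set once a bare view has been handed out

    def append(self, record):
        if self._frozen:
            raise RuntimeError("accumulator was finalized; cannot resize")
        if self._n >= len(self._data):
            # grow geometrically; this re-allocates the buffer, which is
            # exactly why handing out bare views before now would be unsafe
            self._data.resize(2 * len(self._data), refcheck=False)
        self._data[self._n] = record
        self._n += 1

    def __array__(self):
        # what I do now: safe, but costs a second copy of the data
        return self._data[:self._n].copy()

    def finalize(self):
        # the no-copy handoff: mark the accumulator never-resize and
        # return a view onto the single buffer. The view's OWNDATA flag
        # is False, and its .base attribute keeps the buffer alive.
        self._frozen = True
        return self._data[:self._n]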

But it seems Wes has a better solution.

One other note, though -- if you have arrays that are that close to
max system memory, you are very likely to have other trouble anyway --
numpy does make a lot of copies!

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov


