numpy.memmap advice?

Carl Banks pavlovevidence at gmail.com
Thu Feb 19 15:26:33 EST 2009


On Feb 19, 10:36 am, Lionel <lionel.ke... at gmail.com> wrote:
> On Feb 19, 9:51 am, Carl Banks <pavlovevide... at gmail.com> wrote:
>
> > On Feb 19, 9:34 am, Lionel <lionel.ke... at gmail.com> wrote:
>
> > > On Feb 18, 12:35 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
>
> > > > On Feb 18, 10:48 am, Lionel <lionel.ke... at gmail.com> wrote:
>
> > > > > Thanks Carl, I like your solution. Am I correct in my understanding
> > > > > that memory is allocated at the slicing step in your example i.e. when
> > > > > "reshaped_data" is sliced using "interesting_data = reshaped_data[:,
> > > > > 50:100]"? In other words, given a huge (say 1 GB) file, a memmap object
> > > > > is constructed that memmaps the entire file. Some relatively small
> > > > > amount of memory is allocated for the memmap operation, but the bulk
> > > > > memory allocation occurs when I generate my final numpy sub-array by
> > > > > slicing, and this accounts for the memory efficiency of using memmap?
>
> > > > No, what accounts for the memory efficiency is that there is no bulk
> > > > allocation at all.  The ndarray you have points to the memory that's
> > > > in the mmap.  There is no copying of data or separate array allocation.
>
> > > Does this mean that every time I iterate through an ndarray that is
> > > sourced from a memmap, the data is read from the disk? The sliced
> > > array is at no time wholly resident in memory? What are the
> > > performance implications of this?
>
> > Ok, sorry for the confusion.  What I should have said is that there is
> > no bulk allocation *by numpy* at all.  The call to mmap does allocate
> > a chunk of RAM to reflect file contents, but the numpy arrays don't
> > allocate any memory of their own: they use the same memory as was
> > allocated by the mmap call.
>
> > Carl Banks
>
> Thanks for the explanations Carl. I'm sorry, but it's me who's the
> confused one here, not anyone else :-)
>
> I hate to waste everyone's time again, but something is just not
> "clicking" in that black-hole I call a brain. So..."numpy.memmap"
> allocates a chunk off the heap to coincide with the file contents. If
> I memmap the entire 1 GB file, a corresponding amount (approx. 1 GB)
> is allocated? That seems to contradict what is stated in the numpy
> documentation:
>
> "class numpy.memmap
> Create a memory-map to an array stored in a file on disk.
>
> Memory-mapped files are used for accessing small segments of large
> files on disk, without reading the entire file into memory."

Yes, it allocates room for the whole file in your process's LOGICAL
address space.  However, it doesn't actually reserve any PHYSICAL
memory, or read in any data from the disk, until you actually
access the data.  And even then it only reads small chunks in, not
the whole file.

So when you mmap your 1 GB file, the OS sets aside a 1 GB chunk of
address space to use for your memory map.  That's all it does: it
doesn't read anything from disk, and it doesn't reserve any physical
RAM.  Later, when you access a byte in the mmap via a pointer, the OS
notes that it hasn't yet loaded the data at that address, so it grabs
a small chunk of physical RAM and reads in a small amount of data from
the disk containing the byte you are accessing.  This all happens
automatically and transparently to you.
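
For instance, a minimal sketch (the filename, dtype, and sizes here
are made up; substitute your own):

import numpy as np

# Mapping a hypothetical 1 GB file of float32s: this only reserves
# logical address space; nothing is read from disk yet.
data = np.memmap("big_file.bin", dtype=np.float32, mode="r")

# Touching one element makes the OS page in just the small chunk of
# the file containing it, not the whole gigabyte.
x = data[1000000]

# Slicing gives a view into the same mapped memory; still no bulk
# read or copy.
first_chunk = data[:1000]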


> In my previous example that we were working with (100x100 data file),
> you used an offset to memmap the "lower-half" of the array. Does this
> mean that in the process of memmapping that lower half, RAM was set
> aside for 50x100 32-bit complex numbers? If so, and I decide to memmap
> an entire file, there is no memory benefit in doing so.

The mmap call sets aside room for all 100x100 32-bit complex numbers
in logical address space, regardless of whether you use the offset
parameter or not.  However, it might read in only part of the file
from disk, and will only reserve physical RAM for the parts it reads
in.
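
As a sketch of the offset case (hypothetical filename, and assuming
complex64, i.e. 8 bytes per element):

import numpy as np

rows, cols = 100, 100
itemsize = np.dtype(np.complex64).itemsize  # 8 bytes

# Map only the lower 50 rows by skipping the first 50 rows' worth
# of bytes.
lower_half = np.memmap("data.bin", dtype=np.complex64, mode="r",
                       offset=(rows // 2) * cols * itemsize,
                       shape=(rows // 2, cols))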


> At this point do you (or anyone else) recommend I just write a little
> "roll my own" function for my class that takes the coords I intend to
> load? Seems like the best way to keep memory to a minimum, but
> I'm just worried about performance. On the other hand, the most I'd be
> loading would be around 1k x 1k worth of data.

No, if your file is not too large to mmap, just do it the way you've
been doing it.  The documentation you've been reading is pretty much
correct, even if you approach it naively.  It is both memory and I/O
efficient.  You're overthinking things here; don't try to outsmart the
operating system.  It'll take care of the performance issues
satisfactorily.
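
Concretely, the pattern from earlier in the thread is the right one
(the filename, dtype, and shape are illustrative):

import numpy as np

# Map the whole file, then reshape and slice.  Every step below is a
# view into the mmap, so the OS pages in only what you touch.
data = np.memmap("data.bin", dtype=np.complex64, mode="r")
reshaped_data = data.reshape(100, 100)
interesting_data = reshaped_data[:, 50:100]

# If you do want a plain in-memory copy, make it explicit:
in_memory = np.array(interesting_data)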

The only thing you have to worry about is if the file is too large to
fit into your process's logical address space, which on a typical
32-bit system is 2-3 GB (depending on configuration) minus the space
occupied by Python and other heap objects, which is probably only a
few MB.
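
If you want to guard against that case up front, a rough sketch (a
hypothetical helper, not part of numpy) might look like:

import os
import numpy as np

def safe_memmap(path, dtype, limit=2**31):
    # Refuse to map files that clearly can't fit in a ~2 GB logical
    # address space; map a window with offset/shape instead.
    if os.path.getsize(path) > limit:
        raise ValueError("file too large to mmap whole; "
                         "map a window with offset/shape")
    return np.memmap(path, dtype=dtype, mode="r")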



Carl Banks


