numpy.memmap advice?

Thu Feb 19 16:31:51 EST 2009

On Feb 19, 12:26 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> On Feb 19, 10:36 am, Lionel <lionel.ke... at gmail.com> wrote:
>
> > On Feb 19, 9:51 am, Carl Banks <pavlovevide... at gmail.com> wrote:
>
> > > On Feb 19, 9:34 am, Lionel <lionel.ke... at gmail.com> wrote:
>
> > > > On Feb 18, 12:35 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
>
> > > > > On Feb 18, 10:48 am, Lionel <lionel.ke... at gmail.com> wrote:
>
> > > > > > Thanks Carl, I like your solution. Am I correct in my understanding
> > > > > > that memory is allocated at the slicing step in your example i.e. when
> > > > > > "reshaped_data" is sliced using "interesting_data = reshaped_data[:,
> > > > > > 50:100]"? In other words, given a huge (say 1Gb) file, a memmap object
> > > > > > is constructed that memmaps the entire file. Some relatively small
> > > > > > amount of memory is allocated for the memmap operation, but the bulk
> > > > > > memory allocation occurs when I generate my final numpy sub-array by
> > > > > > slicing, and this accounts for the memory efficiency of using memmap?
>
> > > > > > No, what accounts for the memory efficiency is that there is no bulk
> > > > > allocation at all.  The ndarray you have points to the memory that's
> > > > > in the mmap.  There is no copying data or separate array allocation.
>
> > > > > Does this mean that every time I iterate through an ndarray that is
> > > > > sourced from a memmap, the data is read from the disk? The sliced
> > > > array is at no time wholly resident in memory? What are the
> > > > performance implications of this?
>
> > > Ok, sorry for the confusion.  What I should have said is that there is
> > > no bulk allocation *by numpy* at all.  The call to mmap does allocate
> > > a chunk of RAM to reflect file contents, but the numpy arrays don't
> > > allocate any memory of their own: they use the same memory as was
> > > allocated by the mmap call.
>
> > > > Carl Banks
>
> > Thanks for the explanations Carl. I'm sorry, but it's me who's the
> > confused one here, not anyone else :-)
>
> > I hate to waste everyone's time again, but something is just not
> > "clicking" in that black-hole I call a brain. So..."numpy.memmap"
> > allocates a chunk off the heap to coincide with the file contents. If
> > I memmap the entire 1 Gb file, a corresponding amount (approx. 1 Gb)
> > is allocated? That seems to contradict what is stated in the numpy
> > documentation:
>
> > "class numpy.memmap
> > Create a memory-map to an array stored in a file on disk.
>
> > Memory-mapped files are used for accessing small segments of large
> > files on disk, without reading the entire file into memory."
>
> Yes, it allocates room for the whole file in your process's LOGICAL
> address space.  However, it doesn't actually reserve any PHYSICAL
> memory, or read in any data from the disk, until you actually
> access the data.  And then it only reads small chunks in, not the
> whole file.
>
> So when you mmap your 1 GB file, the OS sets aside a 1 GB chunk of
> address space to use for your memory map.  That's all it does: it
> doesn't read anything from disk, it doesn't reserve any physical RAM.
> Later, when you access a byte in the mmap via a pointer, the OS notes
> that it hasn't yet loaded the data at that address, so it grabs a small
> chunk of physical RAM and reads in a small amount of data from the disk
> containing the byte you are accessing.  This all happens automatically
> and transparently to you.
>
> > In my previous example that we were working with (100x100 data file),
> > you used an offset to memmap the "lower-half" of the array. Does this
> > mean that in the process of memmapping that lower half, RAM was set
> > aside for 50x100 32-bit complex numbers? If so, and I decide to memmap
> > an entire file, there is no memory benefit in doing so.
>
> The mmap call sets aside room for all 100x100 32-bit complex numbers
> in logical address space, regardless of whether you use the offset
> parameter or not.  However, it might only read part of the file in
> from disk, and will only reserve physical RAM for the parts it reads
> in.
>
> > At this point do you (or anyone else) recommend I just write a little
> > function for my class that takes the coords I intend to load and "roll
> > my own" function? Seems like the best way to keep memory to a minimum,
> > I'm just worried about performance. On the other hand, the most I'd be
> > loading would be around 1k x 1k worth of data.
>
> No, if your file is not too large to mmap, just do it the way you've
> been doing it.  The documentation you've been reading is pretty much
> correct, even if you approach it naively.  It is both memory and I/O
> efficient.  You're overthinking things here; don't try to outsmart the
> operating system.  It'll take care of the performance issues
> satisfactorily.
>
> The only thing you have to worry about is if the file is too large to
> fit into your process's logical address space, which on a typical 32-
> bit system is 2-3 GB (depending on configuration) minus the space
> occupied by Python and other heap objects, which is probably only a
> few MB.
>
> Carl Banks

I see. That was very well explained Carl, thank you.
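
For reference, here is a minimal sketch of the approach discussed in this thread. It is an illustration under assumptions, not code taken from the earlier messages: the file name, the use of a temporary directory, and the complex64 dtype (standing in for the "32-bit complex" samples mentioned above) are all made up for the example. It shows that slicing a memmap gives a view onto the mapping rather than a separate array, that the offset parameter (in bytes) can map just the lower half of a 100x100 file, and that an explicit copy is what actually materializes data in ordinary process memory.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the 100x100 data file discussed in the thread.
# The complex64 dtype is an assumption made for this example.
rows, cols = 100, 100
path = os.path.join(tempfile.mkdtemp(), "data.bin")
source = np.arange(rows * cols, dtype=np.float32).astype(np.complex64)
source = source.reshape(rows, cols)
source.tofile(path)

# Map the whole file read-only. This reserves address space for the file;
# the OS pages data in from disk only as elements are actually accessed.
mapped = np.memmap(path, dtype=np.complex64, mode="r", shape=(rows, cols))

# Slicing returns a view onto the same mapping, not a separately
# allocated array.
interesting = mapped[:, 50:100]
is_view = np.shares_memory(mapped, interesting)

# Map only the lower half of the array by skipping the first 50 rows.
# The offset is given in bytes.
itemsize = np.dtype(np.complex64).itemsize
offset = (rows // 2) * cols * itemsize
lower_half = np.memmap(path, dtype=np.complex64, mode="r",
                       offset=offset, shape=(rows // 2, cols))

# An explicit copy reads the selected region and stores it in a plain
# ndarray that no longer depends on the mapping.
in_ram = np.array(interesting)
```

Whether the explicit copy at the end is worthwhile depends on the access pattern; for a single pass over the data, working directly on the view is likely enough.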