numpy.memmap advice?

Lionel lionel.keene at gmail.com
Wed Feb 18 13:48:56 EST 2009


On Feb 18, 12:56 am, Carl Banks <pavlovevide... at gmail.com> wrote:
> On Feb 17, 3:08 pm, Lionel <lionel.ke... at gmail.com> wrote:
>
> > Hello all,
>
> > On a previous thread (http://groups.google.com/group/comp.lang.python/
> > browse_thread/thread/64da35b811e8f69d/67fa3185798ddd12?
> > hl=en&lnk=gst&q=keene#67fa3185798ddd12) I was asking about reading in
> > binary data. Briefly, my data consists of complex numbers, 32-bit
> > floats for real and imaginary parts. The data is stored as 4 bytes
> > Real1, 4 bytes Imaginary1, 4 bytes Real2, 4 bytes Imaginary2, etc. in
> > row-major format. I needed to read the data in as two separate numpy
> > arrays, one for real values and one for imaginary values.
>
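(For reference, a minimal sketch of getting the two separate arrays
without memmap, assuming the interleaved layout described above; the
file name "data.bin" is just a placeholder. Note that numpy.fromfile
reads the entire file into memory, which is exactly what memmap is
meant to avoid:)

    import numpy as np

    # Interleaved little-endian 32-bit floats: r1, i1, r2, i2, ...
    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
    raw = np.fromfile("data.bin", dtype=descriptor)  # loads everything
    real_part = raw["r"]   # view of the real fields
    imag_part = raw["i"]   # view of the imaginary fields
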
> > There were several very helpful performance tips offered, and one in
> > particular I've started looking into. The author suggested a
> > "numpy.memmap" object may be beneficial. It was suggested I use it as
> > follows:
>
> > descriptor = dtype([("r", "<f4"), ("i", "<f4")])
> > data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
> > print "First 100 real values:", data.r[:100]
>
> > I have two questions:
> > 1) What is "recarray"?
>
> Let's look:
>
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.>>> import numpy
> >>> numpy.recarray
>
> <class 'numpy.core.records.recarray'>
>
> >>> help(numpy.recarray)
>
> Help on class recarray in module numpy.core.records:
>
> class recarray(numpy.ndarray)
>  |  recarray(shape, dtype=None, buf=None, **kwds)
>  |
>  |  Subclass of ndarray that allows field access using attribute
>  |  lookup.
>  |
>  |  Parameters
>  |  ----------
>  |  shape : tuple
>  |      shape of record array
>  |  dtype : data-type or None
>  |      The desired data-type.  If this is None, then the data-type
>  |      is determined by the *formats*, *names*, *titles*, *aligned*,
>  |      and *byteorder* keywords.
>  |  buf : [buffer] or None
>  |      If this is None, then a new array is created of the given
>  |      shape and data-type.  If this is an object exposing the
>  |      buffer interface, then the array will use the memory from an
>  |      existing buffer.  In this case, the *offset* and *strides*
>  |      keywords can also be used.
> ...
>
> So there you have it.  It's a subclass of ndarray that allows field
> access using attribute lookup.  (IOW, you're creating a view of the
> memmap'ed data of type recarray, which is the type numpy uses to
> access structures by name.  You need to create the view because
> regular numpy arrays, which numpy.memmap creates, can't access fields
> by attribute.)
>
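A quick illustration of that difference, as a sketch (the zeros array
here is just a stand-in for the memmap'ed data):

    import numpy as np
    from numpy import recarray

    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
    plain = np.zeros(4, dtype=descriptor)  # ordinary structured ndarray

    print plain["r"]            # field access by key works on any ndarray
    # plain.r                   # AttributeError: plain ndarrays have no .r
    rec = plain.view(recarray)  # same memory, recarray type
    print rec.r                 # attribute-style access now works
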
> help() is a nice thing to use, and numpy is one of the better
> libraries when it comes to docstrings, so learn to use it.
>
> > 2) The documentation for numpy.memmap claims that it is meant to be
> > used in situations where it is beneficial to load only segments of a
> > file into memory, not the whole thing. This is definitely something
> > I'd like to be able to do as my files are frequently >1Gb. I don't
> > really see in the documentation how portions are loaded, however.
> > They seem to create small arrays and then assign the entire array
> > (i.e. file) to the memmap object. Let's assume I have a binary data
> > file of complex numbers in the format described above, and let's
> > assume that the size of the complex data array (that is, the entire
> > file) is 100x100 (rows x columns). Could someone please post a few
> > lines showing how to load the top-left 50 x 50 quadrant, and the lower-
> > right 50 x 50 quadrant into memmap objects? Thank you very much in
> > advance!
>
> You would memmap the whole region in question (in this case the whole
> file), then take a slice.  Actually you could get away with memmapping
> just the last 50 rows (bottom half).  The offset into the file would
> be 50*100*8, so:
>
> data = memmap(filename, dtype=descriptor, mode='r',
>               offset=(50*100*8)).view(recarray)
> reshaped_data = reshape(data, (50, 100))
> interesting_data = reshaped_data[:, 50:100]
>
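For the full picture asked about above, a sketch that maps the whole
file once and slices out both quadrants (the file name is a
placeholder; the 100x100 shape comes from the question):

    import numpy as np
    from numpy import memmap, recarray

    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

    # Map the whole file once and shape it as the 100x100 grid.
    data = memmap("data.bin", dtype=descriptor, mode='r').view(recarray)
    grid = data.reshape(100, 100)

    top_left     = grid[:50, :50]   # both are views; nothing is copied
    bottom_right = grid[50:, 50:]
    print "Top-left first real value:", top_left.r[0, 0]
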
> A word of caution: Every instance of numpy.memmap creates its own mmap
> of the whole file (even if it only creates an array from part of the
> file).  The implications of this are A) you can't use numpy.memmap's
> offset parameter to get around file size limitations, and B) you
> shouldn't create many numpy.memmaps of the same file.  To work around
> B, you should create a single memmap, and dole out views and slices.
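One way to follow that advice in practice, sketched with an assumed
helper function (map the file once at module level, hand out views on
demand):

    import numpy as np

    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

    # Create the mmap exactly once...
    _shared = np.memmap("data.bin", dtype=descriptor, mode='r')

    def get_rows(start, stop, ncols=100):
        # ...and dole out views; no new mmap, no copying.
        return _shared.reshape(-1, ncols)[start:stop]
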
>
> Carl Banks

Thanks Carl, I like your solution. Am I correct in my understanding
that memory is allocated at the slicing step in your example, i.e. when
"reshaped_data" is sliced using "interesting_data = reshaped_data[:,
50:100]"? In other words, given a huge (say 1 GB) file, a memmap object
is constructed that maps the entire file. Some relatively small amount
of memory is allocated for the memmap operation itself, but the bulk of
the memory allocation occurs when I generate my final numpy sub-array
by slicing, and this accounts for the memory efficiency of using
memmap?
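A quick way to poke at this empirically, if it helps (hypothetical
file name; basic slicing in numpy returns views, and a view's `.base`
points back at its parent array):

    import numpy as np

    data = np.memmap("data.bin", dtype="<f4", mode='r')
    s = data[1000:2000]        # basic slice: a view, no data copied yet
    print s.base is not None   # True: s shares the mmap's memory
    copied = np.array(s)       # an explicit copy is what allocates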


