numpy.memmap advice?

Carl Banks pavlovevidence at gmail.com
Wed Feb 18 03:56:10 EST 2009


On Feb 17, 3:08 pm, Lionel <lionel.ke... at gmail.com> wrote:
> Hello all,
>
> On a previous thread (http://groups.google.com/group/comp.lang.python/
> browse_thread/thread/64da35b811e8f69d/67fa3185798ddd12?
> hl=en&lnk=gst&q=keene#67fa3185798ddd12) I was asking about reading in
> binary data. Briefly, my data consists of complex numbers, 32-bit
> floats for real and imaginary parts. The data is stored as 4 bytes
> Real1, 4 bytes Imaginary1, 4 bytes Real2, 4 bytes Imaginary2, etc. in
> row-major format. I needed to read the data in as two separate numpy
> arrays, one for real values and one for imaginary values.
>
> There were several very helpful performance tips offered, and one in
> particular I've started looking into. The author suggested a
> "numpy.memmap" object may be beneficial. It was suggested I use it as
> follows:
>
> descriptor = dtype([("r", "<f4"), ("i", "<f4")])
> data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
> print "First 100 real values:", data.r[:100]
>
> I have two questions:
> 1) What is "recarray"?

Let's look:

[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.recarray
<class 'numpy.core.records.recarray'>
>>> help(numpy.recarray)

Help on class recarray in module numpy.core.records:

class recarray(numpy.ndarray)
 |  recarray(shape, dtype=None, buf=None, **kwds)
 |
 |  Subclass of ndarray that allows field access using attribute lookup.
 |
 |  Parameters
 |  ----------
 |  shape : tuple
 |      shape of record array
 |  dtype : data-type or None
 |      The desired data-type.  If this is None, then the data-type is
 |      determined by the *formats*, *names*, *titles*, *aligned*, and
 |      *byteorder* keywords.
 |  buf : [buffer] or None
 |      If this is None, then a new array is created of the given shape
 |      and data-type.  If this is an object exposing the buffer interface,
 |      then the array will use the memory from an existing buffer.  In
 |      this case, the *offset* and *strides* keywords can also be used.
...


So there you have it.  It's a subclass of ndarray that allows field
access using attribute lookup.  (IOW, you're creating a view of the
memmap'ed data of type recarray, which is the type numpy uses to
access structured fields by name.  You need to create the view because
the array numpy.memmap gives you behaves like a regular numpy array:
it can index fields by name, e.g. data['r'], but it can't access them
by attribute, e.g. data.r.)
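
For instance, a rough sketch of the difference (this assumes the same
descriptor as in your snippet; "complex.dat" is just a stand-in for
your data file):

>>> from numpy import dtype, memmap, recarray
>>> descriptor = dtype([("r", "<f4"), ("i", "<f4")])
>>> plain = memmap("complex.dat", dtype=descriptor, mode='r')
>>> plain["r"][:5]             # plain arrays: dict-style field access only
>>> rec = plain.view(recarray)
>>> rec.r[:5]                  # the recarray view adds attribute access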

help() is a nice thing to use, and numpy is one of the better
libraries when it comes to docstrings, so learn to use it.


> 2) The documentation for numpy.memmap claims that it is meant to be
> used in situations where it is beneficial to load only segments of a
> file into memory, not the whole thing. This is definitely something
> I'd like to be able to do as my files are frequently >1Gb. I don't
> really see in the documentation how portions are loaded, however.
> They seem to create small arrays and then assign the entire array
> (i.e. file) to the memmap object. Let's assume I have a binary data
> file of complex numbers in the format described above, and let's
> assume that the size of the complex data array (that is, the entire
> file) is 100x100 (rows x columns). Could someone please post a few
> lines showing how to load the top-left 50 x 50 quadrant, and the lower-
> right 50 x 50 quadrant into memmap objects? Thank you very much in
> advance!


You would memmap the whole region in question (in this case the whole
file), then take a slice.  Actually you could get away with memmapping
just the last 50 rows (the bottom half).  The offset into the file
would be 50 rows * 100 columns * 8 bytes per complex pair = 50*100*8
bytes, so:

data = memmap(filename, dtype=descriptor, mode='r',
              offset=50*100*8).view(recarray)
reshaped_data = data.reshape(50, 100)          # the bottom 50 rows
interesting_data = reshaped_data[:, 50:100]    # lower-right 50x50 quadrant
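
(And as before, interesting_data.r and interesting_data.i then give you
the real and imaginary parts of just that quadrant.)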


A word of caution: Every instance of numpy.memmap creates its own mmap
of the whole file (even if it only creates an array from part of the
file).  The implications of this are A) you can't use numpy.memmap's
offset parameter to get around file size limitations, and B) you
shouldn't create many numpy.memmaps of the same file.  To work around
B, you should create a single memmap, and dole out views and slices.
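
Something along these lines, for example (only a sketch, using the
100x100 layout from your question and the filename/descriptor from
your snippet; adjust the numbers for your real files):

from numpy import dtype, memmap, recarray

descriptor = dtype([("r", "<f4"), ("i", "<f4")])
# One memmap of the whole file; the slices below are views, not new mmaps.
whole = memmap(filename, dtype=descriptor, mode='r').view(recarray)
grid = whole.reshape(100, 100)
top_left = grid[:50, :50]        # top-left 50x50 quadrant
lower_right = grid[50:, 50:]     # lower-right 50x50 quadrant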


Carl Banks


