[Numpy-discussion] Change in memmap behaviour

Nathaniel Smith njs at pobox.com
Mon Jul 2 12:51:10 EDT 2012


On Mon, Jul 2, 2012 at 3:53 PM, Sveinung Gundersen <sveinugu at gmail.com> wrote:
> Hi,
>
> We are developing a large project for genome analysis
> (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data
> structure for storage. The stored data are accessed in slices and used as
> the basis for calculations. As the stored data may be large (up to 24 GB), the
> memory footprint is important.
>
> We experienced a problem with 64-bit addressing in the concatenate function
> (using the quite old numpy version 1.5.1rc1), and have thus updated numpy
> to 1.7.0.dev-651ef74, where the problem has been fixed. We have,
> however, experienced another problem connected to a change in memmap
> behaviour. This change seems to have come with the 1.6 release.
>
> Before (1.5.1rc1):
>
>>>> import platform; print platform.python_version()
> 2.7.0
>>>> import numpy as np
>>>> np.version.version
> '1.5.1rc1'
>>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>>> a[:] = 2
>>>> a[0:2]
> memmap([2, 2], dtype=int32)
>>>> a[0:2]._mmap
> <mmap.mmap object at 0x3c246f8>
>>>> a.sum()
> 40
>>>> a.sum()._mmap
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'numpy.int64' object has no attribute '_mmap'
>
> After (1.6.2):
>
>>>> import platform; print platform.python_version()
> 2.7.0
>>>> import numpy as np
>>>> np.version.version
> '1.6.2'
>>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>>> a[:] = 2
>>>> a[0:2]
> memmap([2, 2], dtype=int32)
>>>> a[0:2]._mmap
> <mmap.mmap object at 0x1b82ed50>
>>>> a.sum()
> memmap(40)
>>>> a.sum()._mmap
> <mmap.mmap object at 0x1b82ed50>
>
> The problem is then that calculations on memmap objects which produce
> scalar results previously returned a numpy scalar, with no reference to
> the memmap object. We could then just keep the result and leave the
> memmap for garbage collection. Now the memory usage of the system has
> increased dramatically, as we no longer have this option.

Your actual memory usage may not have increased as much as you think,
since memmap objects don't necessarily take much memory -- it sounds
like you're leaking virtual memory, but your resident set size
shouldn't go up as much.
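
If you want to check this, one quick way on Linux (purely
illustrative -- the /proc interface is Linux-specific, and the
filename and vm_stats helper here are made up) is to compare VmSize
(virtual) with VmRSS (resident) before and after creating a large
memmap:
  import numpy as np

  def vm_stats():
    # Pull the virtual and resident sizes out of /proc/self/status.
    stats = {}
    with open('/proc/self/status') as f:
      for line in f:
        if line.startswith(('VmSize:', 'VmRSS:')):
          key, value = line.split(':')
          stats[key] = value.strip()
    return stats

  print vm_stats()
  a = np.memmap('bigmemmap', dtype='int32', mode='w+', shape=10**8)
  print vm_stats()  # VmSize jumps by ~400 MB; VmRSS barely moves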

That said, this is clearly a bug, and it's even worse than you mention
-- *all* operations on memmap arrays are holding onto references to
the original mmap object, regardless of whether they share any memory:
  >>> a = np.memmap("/etc/passwd", np.uint8, "r")
  # arithmetic
  >>> (a + 10)._mmap is a._mmap
  True
  # fancy indexing (doesn't return a view!)
  >>> a[[1, 2, 3]]._mmap is a._mmap
  True
  >>> a.sum()._mmap is a._mmap
  True
Really, only slicing should return an np.memmap object at all.
Unfortunately, it is currently impossible to create an ndarray
subclass that returns base-class ndarrays from any operations --
__array_finalize__() has no way to do this. And this is the third
ndarray subclass in a row that I've looked at that wanted to be able
to do this, so I guess maybe it's something we should implement...
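
To see the constraint concretely: by the time __array_finalize__()
runs, the type of the new array is already fixed, so a subclass can
copy attributes over from its parent but has no way to demote the
result to a plain ndarray. A minimal illustration (MyArray is a
made-up subclass):
  import numpy as np

  class MyArray(np.ndarray):
    def __array_finalize__(self, obj):
      # We can copy attributes from obj here, but there is no way to
      # say "actually, make this a plain ndarray" -- type(self) is
      # already decided before this hook is called.
      pass

  a = np.arange(5).view(MyArray)
  print type(a + 1)    # still MyArray
  print type(a.sum())  # a 0-d MyArray on 1.6 -- the trap memmap hits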

In the short term, the numpy-upstream fix is to change
numpy.core.memmap:memmap.__array_finalize__ so that it only copies
over the ._mmap attribute of its parent if np.may_share_memory(self,
parent) is True. Patches gratefully accepted ;-)
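
As a sketch (untested, and glossing over the obj-is-None path taken
during explicit construction), the method would become something
like:
  def __array_finalize__(self, obj):
    # Keep the mmap reference only when the new array actually
    # overlaps the parent's buffer; sum() results, fancy-indexing
    # results, etc. live in freshly allocated memory and don't need it.
    if hasattr(obj, '_mmap') and np.may_share_memory(self, obj):
      self._mmap = obj._mmap
    else:
      self._mmap = None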

In the even shorter term, you have a few options for hacky
workarounds. You could monkeypatch the above fix into the memmap
class (a sketch follows below). You could
manually assign None to the _mmap attribute of offending arrays (being
careful only to do this to arrays where you know it is safe!). And for
reduction operations like sum() in particular, what you have right now
is not actually a scalar object -- it is a 0-dimensional array that
holds a single scalar. You can pull this scalar out by calling .item()
on the array, and then throw away the array itself -- the scalar won't
have any _mmap attribute.
  def scalarify(scalar_or_0d_array):
    # On 1.6 a reduction hands back a 0-d memmap (an ndarray
    # subclass), so .item() extracts the plain scalar from it; on
    # 1.5 the result is already a scalar and passes through as-is.
    if isinstance(scalar_or_0d_array, np.ndarray):
      return scalar_or_0d_array.item()
    else:
      return scalar_or_0d_array
  # works on both numpy 1.5 and numpy 1.6:
  total = scalarify(a.sum())
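
And the monkeypatch version of the upstream fix would look something
like this (untested; _patched_finalize is just an illustrative name):
  import numpy as np

  def _patched_finalize(self, obj):
    # Same may_share_memory check as the proposed fix above, applied
    # to the released memmap class at runtime.
    if hasattr(obj, '_mmap') and np.may_share_memory(self, obj):
      self._mmap = obj._mmap
    else:
      self._mmap = None

  np.memmap.__array_finalize__ = _patched_finalize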

-N


