[Python-Dev] [numpy wishlist] PyMem_*Calloc

Mon Apr 14 07:39:01 CEST 2014

Hi all,

The new tracemalloc infrastructure in python 3.4 is super-interesting
to numerical folks, because we really like memory profiling. Numerical
programs allocate a lot of memory, and sometimes it's not clear which
operations allocate memory (some numpy operations return views of the
original array without allocating anything; others return copies). So
people actually use memory tracking tools[1], even though
traditionally these have been pretty hacky (i.e., just checking RSS
before and after each line is executed), and numpy has even grown its
own little tracemalloc-like infrastructure [2], but it only works for
numpy data.

BUT, we also really like calloc(). One of the basic array creation
routines in numpy is numpy.zeros(), which returns an array full of --
you guessed it -- zeros. For pretty much all the data types numpy
supports, the value zero is represented by the bytestring consisting
of all zeros. So numpy.zeros() usually uses calloc() to allocate its
memory.

calloc() is more awesome than malloc()+memset() for two reasons.
First, calloc() for larger allocations is usually implemented using
clever VM tricks, so that it doesn't actually allocate any memory up
front, it just creates a COW mapping of the system zero page and then
does the actual allocation one page at a time as different entries are
written to. This means that in the somewhat common case where you
allocate a large array full of zeros, and then only set a few
scattered entries to non-zero values, you can end up using much much
less memory than otherwise. It's entirely possible for this to make
the difference between being able to run an analysis versus not.
memset() forces the whole amount of RAM to be committed immediately.

Secondly, even if you *are* going to touch all the memory, then
calloc() is still faster than malloc()+memset(). The reason is that
for large allocations, malloc() usually does a calloc() no matter what
-- when you get a new page from the kernel, the kernel has to make
sure you can't see random bits of other processes's memory, so it
unconditionally zeros out the page before you get to see it. calloc()
knows this, so it doesn't bother zeroing it again. malloc()+memset(),
by contrast, zeros the page twice, producing twice as much memory
traffic, which is huge.

SO, we'd like to route our allocations through PyMem_* in order to let
tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
this would force us to give up on the calloc() optimizations.

The obvious solution is to add a PyMem_*Calloc to the API. Would this
be possible? Unfortunately it would require adding a new field to the
PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
is exposed directly in the C API and passed by value:
  https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
(Too bad we didn't notice this a few months ago before 3.4 was
released :-(.) I guess we could just rename the struct in 3.5, to
force people to update their code. (I guess there aren't too many
people who would have to update their code.)

Thoughts?

-n

[1] http://scikit-learn.org/stable/developers/performance.html#memory-usage-profiling
[2] https://github.com/numpy/numpy/pull/309

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org