[SciPy-dev] Memory mapped files in scipy core

Joe Harrington jh at oobleck.astro.cornell.edu
Tue Nov 22 09:10:36 EST 2005


I'm not familiar with the mmap interface, but these insertion tricks
sound like they solve a particularly unpleasant problem in IDL that I
hit a lot.

It's considered nice in CS to allocate space on the fly, since it
keeps your allocation with the code that uses it.  It's particularly
useful if you have an unknown amount of data coming in.  To that end,
I have a routine, concat, that lets you tack an array onto any
side of an existing array:

x = concat(2, x, y)

puts x next to y in the second dimension.  The arrays may have any
dimension and shape.  If they're different shapes, concat fills in
any void space with a pad value.  If I'm reading in a dataset of some
thousands of images, I can just put that in a loop and then ask the
final array how big it is, rather than "peeking" at some ancillary
data to find out how much space to pre-allocate in x.  The
pre-allocation line (which has to be outside the loop in IDL, or has
to be protected by an if) is unnecessary.
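
To make the idea concrete, here is a rough sketch of what such a
routine might look like in Python with present-day numpy.  The name
concat_pad, the 0-based axis argument, and the pad keyword are my own
inventions for illustration, not anything that exists in scipy core:

import numpy as np

def concat_pad(axis, x, y, pad=0):
    """Join x and y along `axis`, padding the smaller array with `pad`
    in every other dimension so the shapes match."""
    x, y = np.asarray(x), np.asarray(y)
    ndim = max(x.ndim, y.ndim, axis + 1)
    # Give both arrays the same number of dimensions (trailing size-1 axes).
    x = x.reshape(x.shape + (1,) * (ndim - x.ndim))
    y = y.reshape(y.shape + (1,) * (ndim - y.ndim))
    target = [max(sx, sy) for sx, sy in zip(x.shape, y.shape)]
    def pad_to(a):
        # Pad every dimension except the join axis up to the target size.
        widths = [(0, 0) if d == axis else (0, t - s)
                  for d, (s, t) in enumerate(zip(a.shape, target))]
        return np.pad(a, widths, constant_values=pad)
    return np.concatenate([pad_to(x), pad_to(y)], axis=axis)

# Read an unknown number of images without pre-allocating anything:
stack = np.zeros((0, 0, 0))
for img in (np.ones((4, 5)), np.ones((3, 6))):
    stack = concat_pad(0, stack, img[np.newaxis], pad=np.nan)
print(stack.shape)    # ask the final array how big it is: (2, 4, 6)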

That's just one use.  There are many more, and options I'm not
discussing.

There are two problems with a non-mmapped approach to implementing
concat.  First, each call to concat allocates a new, larger array of
the necessary kind and copies in the existing data.  So obviously, it
gets slow very soon, copying all the data that has been read so far
over and over again.  Second, since the old and new arrays have to
exist at the same time during the copy, it's not possible to make an
array that takes up more than half the RAM.

Having the ability to insert data would be an immense help here.
You'd just insert onto the end, or the side (which is a series of
interior insertions under the hood).  No extra copies happen, and both
problems are solved.
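
As a very rough sketch of the idea (and not scipy core's actual
interface), here is one way to grow an on-disk array with numpy.memmap
so that the data already written is never copied.  The file name, the
append_image helper, and the fixed 64x64 image size are all
assumptions of mine:

import numpy as np

fname = "stack.dat"              # hypothetical scratch file
dtype = np.float64
shape = [0, 64, 64]              # grows along the first axis

def append_image(img):
    """Extend the file by one image and remap it; the data already on
    disk stays where it is."""
    shape[0] += 1
    with open(fname, "ab") as f:             # enlarge the file on disk
        f.write(np.zeros(img.shape, dtype=dtype).tobytes())
    mm = np.memmap(fname, dtype=dtype, mode="r+", shape=tuple(shape))
    mm[-1] = img                             # write only the new image
    mm.flush()
    return mm

open(fname, "wb").close()        # start with an empty file
for k in range(3):
    stack = append_image(np.full((64, 64), k, dtype=dtype))
print(stack.shape)               # (3, 64, 64)

Each call touches only the new image; everything already on disk is
left alone, and the array can grow well past the size of RAM.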

Concat solves a bunch of problems in IDL that Python may not have,
such as allowing x to be undefined in the initial loop iteration
(IDL's array syntax does not let you concatenate an array with an
undefined object).

So, I would make extensive use of this capability, and I think it
might become the default way to read in large datasets, particularly
if they are of variable size, or if data elements might be examined
and discarded during the reading process (so that even if you knew the
total number of elements, you would not know the final space you'd
need; you'd have to overallocate and then reshape at the end of the
loop in order not to have a bunch of empty space in your array).

Once I had an mmapped array, I suppose use would be pretty generic.  I
often need to take a cut or slice through all the data.  For example,
say I have a 3D stack of images.  Yesterday I looped over X and Y,
extracted all the Z values at a given X,Y, and did sigma rejection to
find bad pixels.  Then I poked in the median value for each outlier.
I don't have any particular operations I can think of that I only do
on large images.
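
For concreteness, yesterday's operation might look something like the
following in numpy; the fix_bad_pixels name and the 3-sigma threshold
are my own choices, and it works on whole Z columns at once rather
than looping over X and Y:

import numpy as np

def fix_bad_pixels(cube, nsig=3.0):
    """cube has shape (nz, ny, nx); flag Z-outliers at each pixel and
    replace them with that pixel's median over Z."""
    med = np.median(cube, axis=0)             # per-pixel median over Z
    std = cube.std(axis=0)                    # per-pixel scatter over Z
    bad = np.abs(cube - med) > nsig * std     # outlier mask
    cleaned = cube.copy()
    cleaned[bad] = np.broadcast_to(med, cube.shape)[bad]
    return cleaned

rng = np.random.default_rng(0)
cube = rng.normal(size=(50, 8, 8))
cube[7, 3, 4] = 100.0                         # plant one bad pixel
print(fix_bad_pixels(cube)[7, 3, 4])          # now close to the median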

I have hit the 2GB limit recently, but fortunately I only have two
32-bit machines left.  The 2GB limit does put a damper on things!  You'll see
a lot more call for mmapped files in the coming years.  For the
reasons given above, I think that will be true even if RAM does grow
with dataset size.

--jh--



