[Numpy-discussion] Missing/accumulating data

Joe Harrington jh at physics.ucf.edu
Tue Jun 28 12:50:52 EDT 2011


As with Travis, I have not had time to wade through the 150+ messages
on masked arrays, but I'd like to raise a concept I've mentioned in
the past that, if implemented slightly differently, would enable
broader use.  That is, the "masked array" problem is a subset of this
more general problem.  Please respond on this thread if you want me
to see the replies.

The problem is one of data accumulation and averaging, and it comes up
all the time in astronomy, medical imaging, and other fields that
handle stacks of images.  Suppose I observe a planet, taking images at
30 Hz for several hours, as with many video cameras.  I want to
register or map those images to a common coordinate system, stack
them, and average them.  I must recognize:

1. there are transient bad pixels due to cosmic ray hits on the CCD array
2. some pixels are permanently dead and always produce bad data
3. some data in the image is not part of the planet
4. some images have too much atmospheric blurring to be used
5. sometimes moons of planets like Jupiter pass in front
6. etc.

For each original image (see the code sketch after this list):
  Make a mask where 1 is good and 0 is bad.  This trims both bad
      pixels and pixels that are not part of the planet.
  Evaluate the images and zero the entire image (or regions of an
      image) if quality is low (or there is an obstruction).
  Shift (or map, in the case of lon-lat mapping) the image AND THE
      MASK to a common coordinate grid.
  Append the image and the mask to the ends of the stacks of prior
      images and masks.
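
A minimal NumPy/SciPy sketch of these per-image steps might look as
follows.  The input names (frames, offsets, bad_pixel_map,
planet_mask) and the use of scipy.ndimage.shift for the registration
step are illustrative assumptions, not part of any existing pipeline:

  import numpy as np
  from scipy import ndimage

  # Assumed inputs for illustration: a short stack of random frames.
  ny, nx, nframes = 64, 64, 10
  frames  = [np.random.rand(ny, nx) for _ in range(nframes)]
  offsets = [(0.5 * i, -0.3 * i) for i in range(nframes)]
  bad_pixel_map = np.zeros((ny, nx), dtype=bool)  # no dead pixels here
  planet_mask   = np.ones((ny, nx), dtype=bool)   # whole frame on planet

  def prepare_image(image, bad_pixel_map, planet_mask, offset,
                    quality_ok=True):
      # Build a good=1/bad=0 mask for one frame, then register the
      # image AND the mask onto the common grid.  offset is the
      # (dy, dx) shift to the common coordinates; measuring it is
      # outside the scope of this sketch.
      mask = np.ones(image.shape)
      mask[bad_pixel_map] = 0.0       # dead and cosmic-ray pixels
      mask[~planet_mask] = 0.0        # pixels not on the planet
      if not quality_ok:              # blurred frame or obstruction
          mask[:] = 0.0
      # Zero bad data before interpolating so it cannot bleed into
      # neighboring pixels; shift the image and the mask identically.
      image = ndimage.shift(image * mask, offset, order=1)
      mask  = ndimage.shift(mask, offset, order=1)
      return image, mask

  images, masks = [], []
  for frame, offset in zip(frames, offsets):
      im, m = prepare_image(frame, bad_pixel_map, planet_mask, offset)
      images.append(im)
      masks.append(m)
  image_stack = np.stack(images, axis=-1)  # (ny, nx, nframes): stacked
  mask_stack  = np.stack(masks,  axis=-1)  # down the third axis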

For each x and y position in the image/mask stacks:
  Perform some averaging of all pixels in the stack that are not masked,
      or otherwise use the mask info to decide how to use the image info.

The most common of these averages is to multiply the stack of images
by the stack of masks, zeroing the bad data, and then to sum both
stacks down the third axis.  This produces an accumulator image and a
count image, where the count image says how many data points went into
the accumulator at each pixel position.  Dividing the accumulator by
the count produces an image with no missing data and the highest
per-pixel SNR possible.
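
In array terms, assuming the image_stack and mask_stack built above,
this accumulate-and-count average takes only a few operations (a
sketch; filling never-observed pixels with zero is just one possible
convention):

  import numpy as np

  # image_stack and mask_stack are (ny, nx, nframes), as built above.
  accum = (image_stack * mask_stack).sum(axis=-1)  # sum of good data
  count = mask_stack.sum(axis=-1)                  # points per pixel

  # Guard pixels to which no frame contributed, which would
  # otherwise divide by zero.
  with np.errstate(invalid='ignore', divide='ignore'):
      average = np.where(count > 0, accum / count, 0.0)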

Two things here differ from the current ma package: the sense of
TRUE/FALSE and the ability to have numbers other than 0 or 1.  For
example, if data quality were part of the assessment, each pixel could
get a data-quality number, and the accumulation would include only
pixels whose quality exceeds a certain threshold, or there could be
weighted averaging.
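
As a sketch of that variant, assuming a hypothetical quality_stack
(same shape as image_stack, one quality number per pixel per frame)
and an arbitrary cutoff:

  import numpy as np

  # Hypothetical per-pixel quality numbers, for illustration only.
  quality_stack = np.random.rand(*image_stack.shape)

  cutoff  = 0.5                  # assumed quality threshold
  weights = np.where(quality_stack >= cutoff, quality_stack, 0.0)

  # Weighted average: pixels below the cutoff are excluded; the rest
  # contribute in proportion to their quality.
  wsum = weights.sum(axis=-1)
  with np.errstate(invalid='ignore', divide='ignore'):
      weighted_average = np.where(
          wsum > 0, (image_stack * weights).sum(axis=-1) / wsum, 0.0)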

Currently, we just implement this as described, without using the
masked array capability at all.  However, it is cumbersome to pass
around pairs of objects, and making a new object to hold them together
means breaking it apart when calling standard routines that don't
handle masks.  There are probably some very fast ways to do
interesting kinds of averages that we haven't explored but would
explore if this kind of capability were in the infrastructure.

So, I hope that if some kind of masking goes into the array object, it
allows TRUE=good, FALSE=bad, and at least an integer count, if not a
floating-point quality number.  One bit is not enough.  Of course,
these could be options.  Certainly adding 7 or 31 more bits to the
mask will create storage problems for some, as would flipping the
sense of the mask.  However, speed matters.  Some of those doing this
kind of processing are handling hours of staring video data, which is
1e6 frames a night from each camera.

--jh--


