[SciPy-dev] Should ndimage.measurements.* return lists if index is a list?

josef.pktd at gmail.com
Sat May 2 02:14:39 EDT 2009


On Fri, May 1, 2009 at 4:37 PM, Thouis (Ray) Jones <thouis at broad.mit.edu> wrote:
> On Fri, May 1, 2009 at 16:11,  <josef.pktd at gmail.com> wrote:
>> After a recent comment by Anne, I looked at the weights option in np.bincount.
>> Calculating group/label means with np.bincount is as fast as with the
>> current ndimage implementation, and it scales the same, but it works
>> only for all indices at once and not for min or max.
>> Weights can be calculated for any element-wise function.
>> Group labels can be anything that np.unique1d can handle; string
>> labels take twice as long.
>
> I think you can get min/max by sorting by (label, value) and then
> using diff on the sorted labels to find the indices where a label
> changes.

Yes, this works; it takes about twice as long for min and max as the
current ndimage implementation, see below.

In my previous script I hadn't tested against ndimage and needed two
corrections: the index for ndimage had one element too many, and
ndimage.variance uses the unbiased estimator of the variance, while I
had worked with the biased estimator. I added a ddof option to the
bincount version.
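
A minimal sketch of the bincount approach, assuming integer or
unique1d-compatible labels; the function name group_mean_var and the
details of the ddof handling are only for illustration, not the exact
script I posted earlier:

    import numpy as np

    def group_mean_var(values, labels, ddof=0):
        values = np.asarray(values)
        # map arbitrary labels to integer codes 0..k-1
        # (at the time of this post the function was called np.unique1d)
        uniques, codes = np.unique(labels, return_inverse=True)
        counts = np.bincount(codes)
        # group sums and means via the weights option of bincount
        means = np.bincount(codes, weights=values) / counts
        # sum of squared deviations per group, again via bincount
        sqdev = np.bincount(codes, weights=(values - means[codes])**2)
        variances = sqdev / (counts - ddof)
        return uniques, means, variances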

I had looked at this when I started to rewrite and generalize ANOVA in
stats and to manipulate "factors" as in R. Before I found that I could
(ab)use np.bincount, ndimage.measurements was by far the fastest way of
doing this for large arrays. There are currently similar uses in
stats._support that have a very slow implementation and could be made
faster this way.
ndimage.measurements is still more flexible than np.bincount, because
it allows the index to be specified separately: it works for subsets of
the labels and for supersets (indices that don't actually occur in the
label array).
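
For example (a small illustration, not taken from my timing script):

    import numpy as np
    from scipy import ndimage

    values = np.array([1.0, 2.0, 3.0, 4.0])
    factors = np.array([0, 0, 2, 2])
    # index can be a subset of the labels that occur ...
    ndimage.mean(values, labels=factors, index=[0])
    # ... or a superset: labels 1 and 3 do not occur in the label array
    ndimage.mean(values, labels=factors, index=[0, 1, 2, 3])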

One thing that neither bincount nor the current implementation allows,
but that would be very useful for stats, is an axis argument for 2d
arrays, e.g. groupmeans(values, ..., axis=0), similar to np.mean and np.var.
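
A rough sketch of how that could look, assuming a 2d values array and
integer labels along the chosen axis (groupmeans is not an existing
function, only an illustration of the interface):

    import numpy as np

    def groupmeans(values, labels, axis=0):
        # hypothetical helper: per-label means of a 2d array along `axis`,
        # built on one bincount call per column
        values = np.asarray(values)
        if axis == 1:
            values = values.T
        counts = np.bincount(labels)
        sums = np.column_stack([np.bincount(labels, weights=values[:, j])
                                for j in range(values.shape[1])])
        return sums / counts[:, np.newaxis]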

Josef

    import numpy as np
    from scipy import ndimage

    nobs = 100
    nfact = 5
    factors = np.random.randint(nfact, size=nobs)
    values = np.random.randn(nobs)

    # find minimum and maximum per factor:
    # sort by (factor, value); the first/last value of each factor block
    # in the sorted array is then the group minimum/maximum
    sortind = np.lexsort((values, factors))
    fs = factors[sortind]
    vs = values[sortind]
    # mark the positions where the (sorted) factor changes
    fsd = np.hstack((np.inf, np.diff(fs), np.inf))
    gmin = vs[fsd[:-1] != 0]
    gmax = vs[fsd[1:] != 0]
    print gmin
    print ndimage.minimum(values, labels=factors, index=np.arange(nfact))
    print gmax
    print ndimage.maximum(values, labels=factors, index=np.arange(nfact))


