[Numpy-discussion] Proposal for new ufunc functionality

Tue Apr 13 10:17:00 EDT 2010

On Tue, Apr 13, 2010 at 10:03 AM, Travis Oliphant
<oliphant at enthought.com> wrote:
>
> On Apr 12, 2010, at 5:31 PM, Robert Kern wrote:
>
> We should collect all of these proposals into a NEP.
>
> To clarify what I mean by "group-by" behavior.
>
> Suppose I have an array of floats and an array of integers. Each element
> in the array of integers represents a region in the float array of a certain
> "kind". The reduction should take place over like-kind values:
>
> Example:
>
> add.reduceby(array=[1,2,3,4,5,6,7,8,9], by=[0,1,0,1,2,0,0,2,2])
>
> results in the calculations:
>
> 1 + 3 + 6 + 7
> 2 + 4
> 5 + 8 + 9
>
> and therefore the output (notice the two arrays --- perhaps a structured
> array should be returned instead...)
>
> [0,1,2],
> [17, 6, 22]
>
> The real value is when you have tabular data and you want to do reductions
> in one field based on values in another field. This happens all the time
> in relational algebra and would be a relatively straightforward thing to
> support in ufuncs.
>
> I might suggest a simplification where the by array must be an array
> of non-negative ints such that they are indices into the output. For
> example (note that I replace 2 with 3 and have no 2s in the by array):
>
> add.reduceby(array=[1,2,3,4,5,6,7,8,9], by=[0,1,0,1,3,0,0,3,3]) ==
> [17, 6, 0, 22]
>
> This basically generalizes bincount() to other binary ufuncs.
>
>
> Interesting proposal. I do like having only one output.
>
> I'm particularly interested in reductions with "by" arrays of strings, i.e.
> something like:
>
> add.reduceby([10,11,12,13,14,15,16],
>              by=['red','green','red','green','red','blue', 'blue'])
>
> resulting in:
>
> 10+12+14
> 11+13
> 15+16
>
> In practice, these would have to be essentially mapped to the kind of
> integer array I used in the original example, and so I suppose if we couple
> your proposal with the segment function from the rest of my original
> proposal, then the same resulting functionality is available (with perhaps
> the extra intermediate integer array that may not be strictly necessary).
> But, having simple building blocks is usually better in the long run (and
> typically leads to better optimizations by human programmers).

Currently I'm using np.unique with return_inverse=True to recode the
strings into integers:

>>> np.unique(['red','green','red','green','red','blue', 'blue'],return_inverse=True)
(array(['blue', 'green', 'red'],
      dtype='|S5'), array([2, 1, 2, 1, 2, 0, 0]))

and then feed the resulting integer array into bincount.
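
Spelled out for the colour example quoted above, the two steps look roughly
like this. It is only a sketch: the variable names (values, labels, keys,
codes, sums) are illustrative, and bincount covers just the add case, through
its weights argument.

```python
import numpy as np

values = np.array([10, 11, 12, 13, 14, 15, 16])
labels = np.array(['red', 'green', 'red', 'green', 'red', 'blue', 'blue'])

# Recode the string labels as integers 0..n_groups-1.
# keys holds the sorted unique labels; codes[i] is the position of labels[i] in keys.
keys, codes = np.unique(labels, return_inverse=True)

# Sum the values that share a code. bincount returns floats when weights are given.
sums = np.bincount(codes, weights=values)

for key, total in zip(keys, sums):
    print(key, total)
```

The groups come back in the sorted order of the labels (blue, green, red),
not in order of first appearance.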

Your plans are a good generalization and speedup.
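
To make the generalization concrete, here is a rough sketch of the simplified,
index-based form suggested above, where "by" holds non-negative integer indices
into the output. It is written with ufunc.at, which only appeared in later
NumPy releases (1.8 and up). The function name reduceby and its signature are
hypothetical, not an existing NumPy API, and the sketch assumes the ufunc has
an identity (add, multiply) to fill groups that never occur. It says nothing
about performance.

```python
import numpy as np

def reduceby(ufunc, array, by):
    """Hypothetical helper: reduce `array` with `ufunc`, grouped by the indices in `by`."""
    array = np.asarray(array)
    by = np.asarray(by)
    if ufunc.identity is None:
        # Assumption of this sketch: empty groups are filled with the identity.
        raise ValueError("this sketch needs a ufunc with an identity")
    # One output slot per possible index, pre-filled with the identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=array.dtype)
    # Unbuffered in-place operation: repeated indices accumulate correctly.
    ufunc.at(out, by, array)
    return out

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
by = [0, 1, 0, 1, 3, 0, 0, 3, 3]
print(reduceby(np.add, data, by))
print(reduceby(np.multiply, data, by))
```

With np.add this reproduces the [17, 6, 0, 22] example from the quoted
message; index 2 never occurs, so it keeps the identity value.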

Josef

> Thanks,
> -Travis
>
> --
> Travis Oliphant
> Enthought Inc.
> 1-512-536-1057
> http://www.enthought.com
> oliphant at enthought.com