[Numpy-discussion] Feedback pls on proposed changes to bincount()

Sun Mar 11 04:10:45 EDT 2007

Hi,

I'd like to propose some minor modifications to the function 
bincount(arr, weights=None), and would like some feedback from other users 
of bincount() before I write this up as a proper patch.

Background:
bincount() has two forms:
- bincount(x) returns an integer array ians of length max(x)+1, where 
ians[n] is the number of times n appears in x.
- bincount(x, weights) returns a double array dans of length max(x)+1, 
where dans[n] is the sum of weights[i] over all positions i where 
x[i]==n.
In both cases, all elements of x must be non-negative.
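
For reference, the two current forms can be illustrated with a short example. The sample arrays below are made up for illustration, and the exception type shown is what numpy.bincount raises in the versions I am aware of.

```python
import numpy as np

# Made-up sample data: bin numbers and one weight per element.
x = np.array([0, 1, 1, 3, 3, 3])
weights = np.array([0.5, 1.0, 1.0, 2.0, 2.0, 2.0])

# Unweighted form: counts[n] is the number of times n appears in x.
counts = np.bincount(x)
print(counts.tolist())   # [1, 2, 0, 3]

# Weighted form: sums[n] is the sum of weights[i] over all i with x[i] == n.
sums = np.bincount(x, weights)
print(sums.tolist())     # [0.5, 2.0, 0.0, 6.0]

# A negative element is rejected with an exception.
rejected = False
try:
    np.bincount(np.array([0, -1, 2]))
except ValueError:
    rejected = True
print(rejected)          # True
```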

Proposed changes:
(1) Remove the restriction that elements of x must be non-negative.
Currently bincount() starts by finding max(x) and min(x). If the min 
value is negative, an exception is raised.  This change proposes 
dropping the initial search for min(x), and instead testing for 
non-negativity while summing values in the return arrays ians or dans. 
Any indexes where x is negative will be silently ignored. This 
will allow selective bincounts where values to ignore are flagged with a 
negative bin number.
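
A rough sketch of the behaviour change (1) describes, written as a pure-Python loop with plain lists standing in for arrays. The name bincount_skip_negative is hypothetical and this is not the code in _compiled_base.c; it only illustrates moving the sign test into the summing loop.

```python
def bincount_skip_negative(x, weights=None):
    """Hypothetical reference loop for change (1); lists stand in for arrays."""
    if len(x) == 0:
        return []
    # Only max(x) is searched for up front; no separate pass for min(x).
    size = max(max(x), -1) + 1
    out = [0] * size if weights is None else [0.0] * size
    for i, n in enumerate(x):
        if n < 0:
            continue  # negative bin number: treated as "ignore this element"
        out[n] += 1 if weights is None else weights[i]
    return out


# Elements flagged with -1 are left out of the result.
x = [0, 2, -1, 2, -1, 1]
w = [1.0, 0.5, 9.0, 0.5, 9.0, 2.0]
print(bincount_skip_negative(x))      # [1, 1, 2]
print(bincount_skip_negative(x, w))   # [1.0, 2.0, 1.0]
```

In this sketch an input with no non-negative elements yields an empty result; whether that is the right corner-case behaviour is an open question, not something the proposal settles.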

(2) Allow an optional argument for maximum bin number.
Currently bincount(x) returns an array whose length is dependent on 
max(x). It is sometimes preferable to specify the exact size of the 
returned array, so this change would add a new optional argument, 
max_bin, which is one less than the size of the returned array. Under 
this change, bincount() starts by finding max(x) only if max_bin is not 
specified. Then the returned array ians or dans is created with length 
max_bin+1, and any indexes that would overflow the output array are 
silently ignored.
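
The max_bin idea can be sketched the same way. Again this is a hypothetical pure-Python loop (bincount_max_bin is an invented name, and lists stand in for arrays). To keep the sketch limited to change (2), negative values still raise an exception as they do today.

```python
def bincount_max_bin(x, weights=None, max_bin=None):
    """Hypothetical reference loop for change (2); lists stand in for arrays."""
    if max_bin is None:
        # max(x) is only searched for when the caller gives no max_bin.
        max_bin = max(x) if len(x) else -1
    size = max_bin + 1
    out = [0] * size if weights is None else [0.0] * size
    for i, n in enumerate(x):
        if n < 0:
            raise ValueError("elements of x must be non-negative")
        if n > max_bin:
            continue  # would overflow the output array: silently ignored
        out[n] += 1 if weights is None else weights[i]
    return out


x = [0, 1, 1, 5]
print(bincount_max_bin(x))              # [1, 2, 0, 0, 0, 1]
print(bincount_max_bin(x, max_bin=3))   # [1, 2, 0, 0]  (the 5 is dropped)
print(bincount_max_bin(x, max_bin=7))   # [1, 2, 0, 0, 0, 1, 0, 0]
```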

(3) Allow an optional output array, y.
Currently bincount() creates a new output array each time. Sometimes it 
is preferable to add results to an existing output array, for example, 
when the input array is only available in smaller chunks, or for a 
progressive update strategy to avoid fp precision problems when adding 
lots of small weights to larger subtotals. Thus we can add an extra 
optional argument y that bypasses the creation of an output array.
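
The chunked use case for y might look like the following sketch. The helper bincount_into is hypothetical, lists stand in for arrays, and one detail is my assumption rather than part of the proposal: bins outside the range of y are skipped, so the size of y decides which bins are counted.

```python
def bincount_into(x, y, weights=None):
    """Hypothetical sketch of change (3): add results to an existing array y."""
    for i, n in enumerate(x):
        # Assumption: bins that do not fit in y are skipped rather than rejected.
        if 0 <= n < len(y):
            y[n] += 1 if weights is None else weights[i]
    return y


# Input arrives in smaller chunks; one output array is reused throughout.
chunks = [[0, 1, 1], [3, 1], [0, 3, 3]]
totals = [0] * 4
for chunk in chunks:
    bincount_into(chunk, totals)
print(totals)   # [2, 3, 0, 3]
```

The result matches a single bincount over the concatenated chunks, without ever holding the whole input in memory.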

With these three changes, the function signature of bincount() would become:
 bincount(x, weights=None, y=None, max_bin=None)

Anyway, that's the general idea. I'd be grateful for any feedback before 
I code this up as a patch to _compiled_base.c.

Cheers

Stephen