[Numpy-discussion] New function `count_unique` to generate contingency tables.

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Tue Aug 12 12:17:09 EDT 2014


Thanks. Prompted by that Stack Overflow question, and by similar problems I
had to deal with myself, I started working on a much more general extension
to numpy's functionality in this space. As you noted, things get a little
pandas-y, but I think a lot of pandas' functionality could or should be part
of the numpy core, a robust set of grouping operations in particular.

See the pastebin here:
http://pastebin.com/c5WLWPbp

I've posted about it on this list before, but without apparent interest, and
I haven't gotten around to bringing it up to professional standards yet
either. But there is a lot more that could be done in this direction.

Note that the count functionality in the Stack Overflow answer is relatively
indirect and inefficient, since it goes through the inverse index. The code
in the pastebin obtains the counts much more efficiently.
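
To illustrate the difference (this is only a rough sketch, not the pastebin
code itself), one way to count without an inverse index is to sort once and
measure the lengths of the runs of equal values. The function names
`count_sorted` and `count_via_inverse` are made up for this example; the
second one shows the inverse-index route for comparison.

```python
import numpy as np


def count_sorted(values):
    """Return (unique_values, counts) from one sort, with no inverse index."""
    values = np.asarray(values).ravel()
    if values.size == 0:
        return values, np.array([], dtype=np.intp)
    sorted_values = np.sort(values)
    # True wherever a new run of equal values starts.
    starts = np.concatenate(([True], sorted_values[1:] != sorted_values[:-1]))
    # Indices of the run starts, plus the array length as the final boundary.
    boundaries = np.concatenate((np.flatnonzero(starts), [sorted_values.size]))
    # The distance between consecutive boundaries is the length of each run.
    return sorted_values[starts], np.diff(boundaries)


def count_via_inverse(values):
    """Same result, but routed through the inverse index (for comparison)."""
    unique_values, inverse = np.unique(values, return_inverse=True)
    counts = np.bincount(inverse.ravel(), minlength=unique_values.size)
    return unique_values, counts


if __name__ == "__main__":
    y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
    print(count_sorted(y))
    print(count_via_inverse(y))
```

Both functions return the same values and counts; the sorted variant simply
skips the step of mapping every element back to its unique value.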


On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser <
warren.weckesser at gmail.com> wrote:

>
>
>
> On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser <
> warren.weckesser at gmail.com> wrote:
>
>> I created a pull request (https://github.com/numpy/numpy/pull/4958) that
>> defines the function `count_unique`.  `count_unique` generates a
>> contingency table from a collection of sequences.  For example,
>>
>> In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
>>
>> In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
>>
>> In [9]: (xvals, yvals), counts = count_unique(x, y)
>>
>> In [10]: xvals
>> Out[10]: array([1, 2])
>>
>> In [11]: yvals
>> Out[11]: array([3, 4, 5])
>>
>> In [12]: counts
>> Out[12]:
>> array([[3, 1, 0],
>>        [1, 1, 3]])
>>
>>
>> It can be interpreted as a multi-argument generalization of `np.unique(x,
>> return_counts=True)`.
>>
>> It overlaps with Pandas' `crosstab`, but I think this is a pretty
>> fundamental counting operation that fits in numpy.
>>
>> Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html)
>> and R's `table` perform the same calculation (with a few more bells and
>> whistles).
>>
>>
>> For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):
>>
>> In [28]: import pandas as pd
>>
>> In [29]: xs = pd.Series(x)
>>
>> In [30]: ys = pd.Series(y)
>>
>> In [31]: pd.crosstab(xs, ys)
>> Out[31]:
>> col_0  3  4  5
>> row_0
>> 1      3  1  0
>> 2      1  1  3
>>
>>
>> And here is R's `table`:
>>
>> > x <- c(1,1,1,1,2,2,2,2,2)
>> > y <- c(3,4,3,3,3,4,5,5,5)
>> > table(x, y)
>>    y
>> x   3 4 5
>>   1 3 1 0
>>   2 1 1 3
>>
>>
>> Is there any interest in adding this (or some variation of it) to numpy?
>>
>>
>> Warren
>>
>>
>
> While searching StackOverflow in the numpy tag for "count unique", I just
> discovered that I basically reinvented Eelco Hoogendoorn's code in his
> answer to
> http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-unique-values-in-an-array.
> Nice one, Eelco!
>
> Warren
>
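
For readers who want to reproduce the two-sequence table from the quoted
example with existing numpy functions, here is a minimal sketch. It is not
the `count_unique` implementation from the pull request; `contingency_table`
is a hypothetical name, and it handles exactly two sequences.

```python
import numpy as np


def contingency_table(x, y):
    """Hypothetical helper: 2-D table of counts for paired sequences x and y."""
    # Map every element to the index of its unique value along each axis.
    xvals, xi = np.unique(x, return_inverse=True)
    yvals, yi = np.unique(y, return_inverse=True)
    counts = np.zeros((xvals.size, yvals.size), dtype=np.intp)
    # np.add.at accumulates correctly when the same (row, col) pair repeats.
    np.add.at(counts, (xi.ravel(), yi.ravel()), 1)
    return (xvals, yvals), counts


if __name__ == "__main__":
    x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
    y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
    (xvals, yvals), counts = contingency_table(x, y)
    print(xvals, yvals)
    print(counts)
```

With the `x` and `y` from the quoted message this gives the same 2-by-3
table as shown there. Note that this version goes through the inverse index,
so it is the simple route rather than the efficient one.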