[Python-ideas] collections.Counter should implement __mul__, __rmul__

Wes Turner wes.turner at gmail.com
Sun Apr 15 22:18:36 EDT 2018


tf.bincount() returns a vector with integer counts.
https://www.tensorflow.org/api_docs/python/tf/bincount

Keras calls np.bincount in an MNIST example.

np.bincount returns an array with a __mul__
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.bincount.html
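In pure python, the same count-then-scale pattern looks like this with Counter — the dict comprehension is exactly what a Counter.__mul__ would replace (the example data here is made up):

```python
from collections import Counter

data = [0, 1, 1, 2, 2, 2]
c = Counter(data)  # pure-python analogue of np.bincount for these values

# np.bincount's result scales with a plain `2 * counts`; with Counter the
# scalar multiply has to be spelled out as a dict comprehension today:
doubled = Counter({k: 2 * v for k, v in c.items()})
assert doubled == Counter({2: 6, 1: 4, 0: 2})
```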

- sklearn.preprocessing.normalize

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
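For comparison, normalizing a Counter to a probability mass function today takes only a comprehension — a `total` property plus `__mul__`, as discussed later in the thread, would make it a one-liner (variable names here are illustrative):

```python
from collections import Counter

c = Counter("mississippi")
total = sum(c.values())

# Rescale every count so the values sum to 1.0 (a probability mass function):
pmf = {letter: n / total for letter, n in c.items()}

assert abs(sum(pmf.values()) - 1.0) < 1e-9
assert pmf["s"] == 4 / 11
```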


featuretools.primitives.NUnique has a normalize method.
https://docs.featuretools.com/generated/featuretools.primitives.NUnique.html#featuretools.primitives.NUnique

And I'm done sharing non-pure-python solutions for this problem, I promise.
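For anyone who wants the proposed behavior today, here is a minimal pure-python sketch. MulCounter and its `total` property are hypothetical names for illustration, not part of any proposal in this thread:

```python
from collections import Counter

class MulCounter(Counter):
    """Counter with the scalar operations discussed in this thread."""

    def __mul__(self, scalar):
        # Scalar broadcast: multiply every count by the same factor.
        return MulCounter({k: v * scalar for k, v in self.items()})

    __rmul__ = __mul__

    @property
    def total(self):
        return sum(self.values())

c = MulCounter("aab")
assert 2 * c == MulCounter({"a": 4, "b": 2})
assert c.total == 3
```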

On Sunday, April 15, 2018, Wes Turner <wes.turner at gmail.com> wrote:

>
>
> On Sunday, April 15, 2018, Peter Norvig <peter at norvig.com> wrote:
>
>> If you think of a Counter as a multiset, then it should support __or__,
>> not __add__, right?
>>
>> I do think it would have been fine if Counter did not support "+" at all
>> (and/or if Counter was limited to integer values). But given where we are
>> now, it feels like we should preserve `c + c == 2 * c`.
>>
>> As to the "doesn't really add any new capabilities" argument, that's
>> true, but it is also true for Counter as a whole: it doesn't add much over
>> defaultdict(int), but it is certainly convenient to have a standard way to
>> do what it does.
>>
>> I agree with your intuition that low level is better. `total` would be
>> useful. If you have total and mul, then as you and others have pointed out,
>> normalize is just c *= 1/c.total.
>>
>> I can also see the argument for a new FrequencyTable class in the
>> statistics module. (By the way, I refactored my
>> https://github.com/norvig/pytudes/blob/master/ipynb/Probability.ipynb a
>> bit, and now I no longer need a `normalize` function.)
>>
>
> nltk.probability.FreqDist (a collections.Counter subclass) doesn't have a
> __mul__ either
> http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist
>
> numpy.unique(..., return_counts=True) returns the counts as an ndarray
> (sorted by value) with a __mul__.
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
>
> scipy.stats.itemfreq returns an array sorted by value with a __mul__ and
> the items in the first column.
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.itemfreq.html
>
> pandas.Series.value_counts(normalize=False) returns a Series sorted by
> descending frequency.
> https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
>
>
>> On Sun, Apr 15, 2018 at 5:06 PM Raymond Hettinger <
>> raymond.hettinger at gmail.com> wrote:
>>
>>>
>>>
>>> > On Apr 15, 2018, at 2:05 PM, Peter Norvig <peter at norvig.com> wrote:
>>> >
>>> > For most types that implement __add__, `x + x` is equal to `2 * x`.
>>> >
>>> > ...
>>> >
>>> >
>>> > That is true for all numbers, list, tuple, str, timedelta, etc. -- but
>>> not for collections.Counter. I can add two Counters, but I can't multiply
>>> one by a scalar. That seems like an oversight.
>>>
>>> If you view the Counter as a sparse associative array of numeric values,
>>> it does seem like an oversight.  If you view the Counter as a Multiset or
>>> Bag, it doesn't make sense at all ;-)
>>>
>>> From an implementation point of view, Counter is just a kind of dict
>>> that has a __missing__() method that returns zero.  That makes it trivially
>>> easy to subclass Counter to add new functionality or just use dictionary
>>> comprehensions for bulk updates.
>>>
>>> >
>>> >
>>> > It would be worthwhile to implement multiplication because, among
>>> other reasons, Counters are a nice representation for discrete probability
>>> distributions, for which multiplication is an even more fundamental
>>> operation than addition.
>>>
>>> There is an open issue on this topic.  See:
>>> https://bugs.python.org/issue25478
>>>
>>> One stumbling point is that a number of commenters are fiercely opposed
>>> to non-integer uses of Counter. Also, some of the use cases (such as those
>>> found in Allen Downey's "Think Stats" and "Think Bayes" books) also need
>>> division and rescaling to a total (i.e. normalizing the total to 1.0) for a
>>> probability mass function.
>>>
>>> If the idea were to go forward, it still isn't clear whether the correct
>>> API should be low level (__mul__ and __div__ and a "total" property) or
>>> higher level (such as a normalize() or rescale() method that produces a new
>>> Counter instance).  The low level approach has the advantage that it is
>>> simple to understand and that it feels like a logical extension of the
>>> __add__ and __sub__ methods.  The downside is that it doesn't really add any
>>> new capabilities (being just short-cuts for a simple dict comprehension or
>>> call to c.values()).  And, it starts to feature creep the Counter class
>>> further away from its core mission of counting and ventures into the realm
>>> of generic sparse arrays with numeric values.  There is also a
>>> learnability/intelligibility issue in that __add__ and __sub__ correspond
>>> to "elementwise" operations while __mul__ and __div__ would be "scalar
>>> broadcast" operations.
>>>
>>> Peter, I'm really glad you chimed in.  My advocacy lacked sufficient
>>> weight to move this idea forward.
>>>
>>>
>>> Raymond
>>>
>>>
>>>
>>>
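To make Raymond's elementwise-versus-broadcast distinction concrete, here is the identity Peter cites, with the broadcast side spelled as the dict comprehension a Counter.__mul__ would replace:

```python
from collections import Counter

c = Counter("abracadabra")

# __add__ today is elementwise: c + c adds the counts key by key.
elementwise = c + c

# The proposed 2 * c is a scalar broadcast; today it needs a comprehension:
broadcast = Counter({k: 2 * v for k, v in c.items()})

assert elementwise == broadcast  # the `c + c == 2 * c` identity
```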