[Python-ideas] Additions to collections.Counter and a Counter derived class

David Mertz mertz at gnosis.cx
Wed Mar 15 12:59:54 EDT 2017


I added a couple comments at https://bugs.python.org/issue25478 about what
I mean.  Raymond replied as well.  So it feels like we should use that
thread there.

In a scientific context I often think of a Counter as a way to count
observations of a categorical variable.  "I saw 3 As, then 7 Bs, etc".  My
proposed attribute/property would let you ask `num_observations =
mycounter.total`.

As I comment on the issue, we could ask for frequency by spelling it like:

    freqs = {k:v/c.total for k, v in c.items()}

And yes, the question in my mind is whether we should define:

    @property
    def total(self):
        return sum(self.values())

Or keep a running count in all the mutator methods. Probably the property,
actually, since we don't want to slow down all the existing code.

I don't really want .normalize() as a method, but if we did, the spelling
should be `.normalized()` instead. This reverse/reversed and sort/sorted
that we already have to reflect mutators versus copies. But notice that
copying is almost always in functions rather than in methods.

The thing is, once something is normalized, it really IS NOT a Counter
anymore. My dict comprehension emphasizes that by creating a plain
dictionary. Even if .normalized() existed, I don't believe it should return
a Counter object (instead either a plain dict, or some new specialized
subclass like Frequencies).

On Wed, Mar 15, 2017 at 2:14 AM, Marco Cognetta <cognetta.marco at gmail.com>
wrote:

> Thanks for the reply. I will add it to the documentation and will work
> on a use case and recipe for the n highest frequency problem.
>
> As for
>
> >I'm not sure if this would be better as an attribute kept directly or as
> a property that called `sum(self.values())` when accessed.  I believe that
> having `mycounter.total` would provide the right normalization in a clean
> API, and also expose easy access to other questions one would naturally ask
> (e.g. "How many observations were made?")
>
> I am not quite sure what you mean, especially the observations part.
> For the attribute part, do you mean we would just have a hidden class
> variable like num_values that was incremented or decremented whenever
> something is added or removed, so we have O(1) size queries instead of
> O(n) (where n is the number of keys)? Then there could be a method
> like 'normalize' that printed all elements with their frequencies
> divided by the total count?
>
> On Wed, Mar 15, 2017 at 12:52 AM, David Mertz <mertz at gnosis.cx> wrote:
> > On Tue, Mar 14, 2017 at 2:38 AM, Marco Cognetta <
> cognetta.marco at gmail.com>
> > wrote:
> >>
> >> 1) Addition of a Counter.least_common method:
> >> This was addressed in https://bugs.python.org/issue16994, but it was
> >> never resolved and is still open (since Jan. 2013). This is a small
> >> change, but I think that it is useful to include in the stdlib.
> >
> >
> > -1 on adding this.  I read the issue, and do not find a convincing use
> case
> > that is common enough to merit a new method.  As some people noted in the
> > issue, the "least common" is really the infinitely many keys not in the
> > collection at all.
> >
> > But I can imagine an occasional need to, e.g. "find outliers."  However,
> > that is not hard to spell as `mycounter.most_common()[-1*N:]`.  Or if
> your
> > program does this often, write a utility function `find_outliers(...)`
> >
> >> 2) Undefined behavior when using Counter.most_common:
> >> 'c', 'c']), when calling c.most_common(3), there are more than 3 "most
> >> common" elements in c and c.most_common(3) will not always return the
> >> same list, since there is no defined total order on the elements in c.
> >>
> >> Should this be mentioned in the documentation?
> >
> >
> > +1. I'd definitely support adding this point to the documentation.
> >
> >>
> >> Additionally, perhaps there is room for a method that produces all of
> >> the elements with the n highest frequencies in order of their
> >> frequencies. For example, in the case of c = Counter([1, 1, 1, 2, 2,
> >> 3, 3, 4, 4, 5]) c.aforementioned_method(2) would return [(1, 3), (2,
> >> 2), (3, 2), (4, 2)] since the two highest frequencies are 3 and 2.
> >
> >
> > -0 on this.  I can see wanting this, but I'm not sure often enough to
> add to
> > the promise of the class.  The utility function to do this would be
> somewhat
> > less trivial to write than `find_outliers(..)` but not enormously hard.
> I
> > think I'd be +0 on adding a recipe to the documentation for a utility
> > function.
> >
> >>
> >> 3) Addition of a collections.Frequency or collections.Proportion class
> >> derived from collections.Counter:
> >>
> >> This is sort of discussed in https://bugs.python.org/issue25478.
> >> The idea behind this would be a dictionary that, instead of returning
> >> the integer frequency of an element, would return it's proportional
> >> representation in the iterable.
> >
> >
> > One could write a subclass easily enough.  The essential feature in my
> mind
> > would be to keep an attributed Counter.total around to perform the
> > normalization.  I'm +1 on adding that to collections.Counter itself.
> >
> > I'm not sure if this would be better as an attribute kept directly or as
> a
> > property that called `sum(self.values())` when accessed.  I believe that
> > having `mycounter.total` would provide the right normalization in a clean
> > API, and also expose easy access to other questions one would naturally
> ask
> > (e.g. "How many observations were made?")
> >
> >
> >
> > --
> > Keeping medicines from the bloodstreams of the sick; food
> > from the bellies of the hungry; books from the hands of the
> > uneducated; technology from the underdeveloped; and putting
> > advocates of freedom in prisons.  Intellectual property is
> > to the 21st century what the slave trade was to the 16th.
>



-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20170315/2f2ebc90/attachment.html>


More information about the Python-ideas mailing list