[Python-ideas] Additions to collections.Counter and a Counter derived class

Wed Mar 15 20:34:21 EDT 2017

On Wed, Mar 15, 2017 at 11:06:20AM -0700, David Mertz wrote:
> On Wed, Mar 15, 2017 at 10:39 AM, Steven D'Aprano <steve at pearwood.info>
> wrote:
> 
> > > But I can imagine an occasional need to, e.g. "find outliers."  However,
> > > that is not hard to spell as `mycounter.most_common()[-1*N:]`.  Or if your
> > > program does this often, write a utility function `find_outliers(...)`
> >
> > That's not how you find outliers :-)
> > Just because a data point is uncommon doesn't mean it is an outlier.
> >
> 
> That's kinda *by definition* what an outlier is in categorical data!
[...]
> This isn't exactly statistics, but it's like your product example.  There
> are infinitely many random strings that occurred zero times among US
> births.  But a "rare name" is one that occurred at least once, not one of
> these zero-occurring possible strings.

I'm not sure that "outlier" is defined for non-numeric data, or at 
least not formally defined. You'd need a definition of central location 
(which would be the mode) and a definition of spread, and I'm not sure 
how you would measure spread for categorical data. What's the spread of 
this data?

["Jack", "Jack", "Jill", "Jack"]

The mode is clearly "Jack", but beyond that I'm not sure what can be 
said except to give the frequencies themselves.
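
For what it's worth, here is a rough sketch of what Counter does give 
you for that data: the mode and the frequencies, and not much else. The 
variable names are mine, purely for illustration.

```python
from collections import Counter

data = ["Jack", "Jack", "Jill", "Jack"]
counts = Counter(data)

# most_common(1) returns a one-element list of (item, count) pairs,
# so the first pair is the mode and how often it occurred.
mode, mode_count = counts.most_common(1)[0]

# Relative frequencies: about the only "spread" information available
# for categorical data like this.
total = sum(counts.values())
frequencies = {name: n / total for name, n in counts.items()}

print(mode, mode_count)   # Jack 3
print(frequencies)        # {'Jack': 0.75, 'Jill': 0.25}
```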

One commonly used definition of outlier (due to John Tukey) is:

- sort your data and divide it into four equal quarters;
- the points between each quarter are known as quartiles, and 
  there are three of them: Q1, Q2 (the median), Q3;
- define the Interquartile Range IQR = Q3 - Q1;
- define lower and upper fences as Q1 - 1.5*IQR and Q3 + 1.5*IQR;
- anything not between the lower and upper fences is an outlier.

Or to be precise, a *suspected* outlier, since for very long-tailed 
distributions, rare values are to be expected and should not be 
discarded without good reason.
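
A minimal sketch of those steps for numeric data, taking the IQR as 
Q3 - Q1. The function names are hypothetical, and I'm assuming 
statistics.quantiles (Python 3.8+) with its "inclusive" method; other 
quartile conventions give slightly different fences.

```python
import statistics

def tukey_fences(data, k=1.5):
    """Return the (lower, upper) Tukey fences for numeric data."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def suspected_outliers(data, k=1.5):
    """Return the values lying outside the Tukey fences."""
    lower, upper = tukey_fences(data, k)
    return [x for x in data if x < lower or x > upper]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(tukey_fences(sample))        # (-3.0, 13.0)
print(suspected_outliers(sample))  # [100]
```

Note that this only makes sense for data you can sort and subtract, 
which is exactly what categorical data lacks.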

If your data is Gaussian, the fences sit at about 2.7 standard 
deviations either side of the mean, so this flags roughly the most 
extreme 0.7% of values.
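
The Gaussian figure can be checked with statistics.NormalDist (Python 
3.8+). This is just a sanity check under the assumption of an exactly 
normal distribution, with the IQR taken as Q3 - Q1:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Quartiles of the standard normal distribution.
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

# Tukey fences, then the probability mass lying outside them.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
fraction_outside = z.cdf(lower) + (1 - z.cdf(upper))

print(f"fences at {lower:.2f} and {upper:.2f}")
print(f"fraction outside: {fraction_outside:.2%}")  # about 0.70%
```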

> I realize from my example, however, that I'm probably more interested in
> the actual uncommonality, not the specific `.least_common()`.  I.e. I'd
> like to know which names occurred fewer than 10 times... but I don't know
> how many items that will include.  Or as a percentage, which names occur in
> fewer than 0.01% of births?

Indeed. While the frequencies themselves are useful, a 
least_common(count) method (by analogy with Counter.most_common) is 
not so useful, because you rarely know in advance how many items fall 
below the cut-off you care about.
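
Filtering on the counts directly is easy enough to write yourself. A 
rough sketch: the helper names and the birth figures below are made 
up for illustration and are not part of Counter.

```python
from collections import Counter

def rarer_than(counter, max_count):
    """Items seen at least once but fewer than max_count times."""
    return {item: n for item, n in counter.items() if 0 < n < max_count}

def rarer_than_fraction(counter, fraction):
    """Items seen at least once but in less than `fraction` of the total."""
    total = sum(counter.values())
    return {item: n for item, n in counter.items()
            if 0 < n < fraction * total}

# Hypothetical data, not real birth statistics.
births = Counter({"Jack": 950, "Jill": 45, "Zebedee": 5})

print(rarer_than(births, 10))             # {'Zebedee': 5}
print(rarer_than_fraction(births, 0.05))  # {'Jill': 45, 'Zebedee': 5}
```

Either way, the number of items returned falls out of the threshold 
rather than having to be guessed up front.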

-- 
Steve