[Python-ideas] Pre-PEP: adding a statistics module to Python

Mon Aug 5 19:58:30 CEST 2013

On 5 August 2013 17:38, Steven D'Aprano <steve at pearwood.info> wrote:
> On 06/08/13 01:58, Oscar Benjamin wrote:
>
>> Having looked at the reference implementation I'm slightly confused
>> about the mode API/implementation. It took a little while for me to
>> understand what the ``window`` parameter is for (I was only able to
>> understand it by studying the source) but I've got it now.
>>
>> ISTM that the mode class is splicing two fundamentally different
>> things together:
>> 1) Finding the most frequently occurring values in some collection of
>> data.
>> 2) Estimating the location of the peak of a hypothetical continuous
>> probability distribution from which some real-valued numeric data is
>> drawn.
>
> Both of these -- the most frequent value, and the peak in a distribution --
> are called the mode, and are fundamentally the same thing, and only differ
> between continuous and discrete data.
>
> In both cases, you are estimating a population mode from a sample.

With discrete data the sample will often be the population and your
mode() function will return the exact value that is indisputably the
mode.

> With
> discrete data, you can count the values, and the one with the highest
> frequency is the sample mode. With continuous data, you almost certainly
> will find that every value is unique. There are two approaches to
> calculating the sample mode for continuous data:

There are many more than two approaches. This is why I don't really
think mode estimation for continuous data is suitable for the stdlib
stats module. Computing the mode
of a sample of data having discrete values is a well-defined problem
(actually there are controversial aspects; see below) and there is
essentially one basic method for doing it.

Estimating the mode from a finite sample drawn from a continuous
probability distribution is not a well-posed problem: there is no
non-arbitrary way to do it. Every method uses heuristics and AFAIK
every method has parameters that must be arbitrarily specified (such
as ``window``). Different methods or parameter choices can in some
cases give wildly different results so I think that this is an
algorithm that needs to be used carefully and shouldn't be documented
as *the* way to compute the mode() for continuous data. At the very
least the docstring should explain what ``window`` does and how a
user should choose its value!
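
To illustrate how much an arbitrary parameter can matter, here is a
minimal sketch of the plain binning approach. The name binned_mode,
the tie-breaking rule and the sample data are all made up for this
example; it is not the reference implementation.

```python
import math
from collections import Counter


def binned_mode(data, width):
    """Return the centre of the most populated fixed-width bin.

    Bins are [k*width, (k+1)*width). Ties go to the lowest bin,
    which is itself an arbitrary choice.
    """
    counts = Counter(math.floor(x / width) for x in data)
    best = max(counts.values())
    k = min(k for k, c in counts.items() if c == best)
    return (k + 0.5) * width


# A broad cluster between 0 and 3 plus a tight cluster near 5.
data = [0.2, 0.8, 1.4, 1.9, 2.4, 2.9, 5.0, 5.1, 5.2]

print(binned_mode(data, 0.5))  # 5.25 (narrow bins favour the tight cluster)
print(binned_mode(data, 3.0))  # 1.5  (wide bins favour the broad cluster)
```

Same data, two defensible bin widths, and the estimate moves from one
cluster to the other.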

> bin the data first, then
> count the frequencies of the bins; or quoting from "Numerical Recipes"
> (reference in the source), by the technique known in the literature as
> "Estimating the rate of an inhomogeneous Poisson process from Jth waiting
> times". That's a mouthful, which is probably why it's so hard to find
> anything online about it. But check the reference given in the source. Any
> of the "Numerical Recipes..." by Press et al should have it. (There are
> versions for C, Fortran and Pascal.)

I have the C version published about 10 years after yours but I think
I've lent it to someone. I understand what it's doing from the code
and the name though.

>> It's also not common AFAIK in other statistical packages
>> (at least not under the name mode).
>
> Press et al claim it is poorly known, but much better than the binning
> method. It saddens me that twenty years on, it's still poorly known.

It is essentially a binning method but it's one that allows the
location and size of the bins to be chosen by the data rather than
arbitrarily fixed a priori. In that sense it is better than the
standard binning method.
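
As I read it, the idea can be sketched roughly like this: sort the
data, slide a window of J consecutive points along it, and take the
narrowest window as the densest region. The function name
shortest_window_mode and the choice of the window midpoint as the
estimate are my assumptions for illustration, not the code from the
reference implementation or from Press et al.

```python
def shortest_window_mode(data, window):
    """Estimate the mode as the midpoint of the narrowest interval
    that contains `window` consecutive sorted values."""
    xs = sorted(data)
    if not 2 <= window <= len(xs):
        raise ValueError("window must be between 2 and len(data)")
    # Each candidate "bin" spans xs[i] .. xs[i + window - 1]; its
    # position and width come from the data, not from a fixed grid.
    start = min(range(len(xs) - window + 1),
                key=lambda i: xs[i + window - 1] - xs[i])
    return (xs[start] + xs[start + window - 1]) / 2


data = [0.2, 0.8, 1.4, 1.9, 2.4, 2.9, 5.0, 5.1, 5.2]

print(shortest_window_mode(data, 3))  # close to 5.1
print(shortest_window_mode(data, 6))  # close to 1.55
```

The bins are now chosen by the data, but the result still depends on
the value picked for the window.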

None of the mode() functions from other stats packages that I was
listing include either the binning method or the Poisson process
method. They just compute the most frequently occurring values in a
sequence and make no attempt to estimate the mode of a continuous
distribution.

> Does my explanation satisfy your objection?

I would have called it a suggestion rather than an objection. I'm
certainly not objecting to this module; I just want it to be as good
as possible (according to my own definition of good!).

> If not, I will consider
> deferring mode for 3.5, which will give me some time to think about a better
> API and documentation.

I also find two other aspects of the mode function a little odd:

I can't work out why I would want the max_modes parameter to be
anything other than 1 or infinity. In fact I normally want it to be
infinity. I've looked at how a couple of other stats packages handle
this and it seems like the most common thing is just to arbitrarily
return any-old mode (which is rubbish). Yours will by default raise an
error if there isn't a unique mode (which is better). But it seems odd
to do mode(data, max_modes=float('inf')) to say that I want all the
modes. My preference really is just that modes() returns a list of all
modes and the user should decide what to do with however many values
they get back.
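
Something along these lines is what I have in mind. This is only a
sketch of the suggested behaviour for discrete data; the name modes
and its exact return type are my proposal, not part of the reference
implementation.

```python
from collections import Counter


def modes(data):
    """Return a list of every value that occurs with the highest
    frequency, in order of first appearance."""
    counts = Counter(data)
    if not counts:
        return []
    best = max(counts.values())
    return [value for value, count in counts.items() if count == best]


print(modes([1, 1, 2, 3, 3]))  # [1, 3]
print(modes([1, 2, 2]))        # [2]
```

A caller who needs a unique mode can check len(modes(data)) == 1 and
raise or pick a value as appropriate.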

The other thing is about this idea that if all values are equally
common then there is "no mode". I want to say that every value is a
mode rather than none. Otherwise you get strange differences between
e.g.: [2,2,3,3,4,4] and [1,2,2,3,3,4,4]. I've checked on the interweb
though and it seems that most people disagree with me on this point so
never mind!
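
To make the discontinuity concrete, here is a small sketch of the
"no mode" convention as I understand it. The function modes_strict is
hypothetical, and treating single-valued data as having a mode is my
own assumption.

```python
from collections import Counter


def modes_strict(data):
    """Return the most frequent values, or [] when two or more
    distinct values all occur equally often ("no mode")."""
    counts = Counter(data)
    if not counts:
        return []
    best = max(counts.values())
    winners = [v for v, c in counts.items() if c == best]
    # Assumption: a data set with one distinct value still has a mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes_strict([2, 2, 3, 3, 4, 4]))     # []
print(modes_strict([1, 2, 2, 3, 3, 4, 4]))  # [2, 3, 4]
```

Adding a single extra observation takes the answer from no modes to
three.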

Oscar