[Python-ideas] Pre-PEP: adding a statistics module to Python

Steven D'Aprano steve at pearwood.info
Mon Aug 5 18:38:30 CEST 2013


On 06/08/13 01:58, Oscar Benjamin wrote:

> Having looked at the reference implementation I'm slightly confused
> about the mode API/implementation. It took a little while for me to
> understand what the ``window`` parameter is for (I was only able to
> understand it by studying the source) but I've got it now.
>
> ISTM that the mode class is splicing two fundamentally different
> things together:
> 1) Finding the most frequently occurring values in some collection of data.
> 2) Estimating the location of the peak of a hypothetical continuous
> probability distribution from which some real-valued numeric data is
> drawn.

Both of these -- the most frequent value, and the peak of a distribution -- are called the mode. They are fundamentally the same thing; the only difference is whether the data is discrete or continuous.

In both cases, you are estimating a population mode from a sample. With discrete data, you can count the values, and the one with the highest frequency is the sample mode. With continuous data, you will almost certainly find that every value is unique, so counting alone tells you nothing.

There are two approaches to calculating the sample mode for continuous data. The first is to bin the data, then count the frequencies of the bins. The second is, quoting from "Numerical Recipes" (reference in the source), the technique known in the literature as "Estimating the rate of an inhomogeneous Poisson process from Jth waiting times". That's a mouthful, which is probably why it's so hard to find anything online about it. But check the reference given in the source. Any of the "Numerical Recipes..." books by Press et al should have it. (There are versions for C, Fortran and Pascal.)
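
To make the distinction concrete, here is a minimal sketch of both cases. It is not the reference implementation: the function names `discrete_mode` and `continuous_mode` are hypothetical, and the meaning given to `window` here (the number of steps J between the two sorted sample points being compared) is an assumption based on the usual description of the waiting-time technique. The idea is that where the density peaks, J consecutive sorted points are packed most tightly, so the midpoint of the narrowest such span serves as the mode estimate.

```python
from collections import Counter


def discrete_mode(data):
    """Return a list of the most frequent value(s) in data (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        raise ValueError("no mode for empty data")
    top = max(counts.values())
    # Keep every value tied for the highest count, in first-seen order.
    return [value for value, n in counts.items() if n == top]


def continuous_mode(data, window):
    """Estimate the mode of continuous data (hypothetical helper).

    Assumption: `window` is J, the number of steps between the two
    sorted points whose spacing is measured. The estimate is the
    midpoint of the narrowest interval xs[i] .. xs[i + window].
    """
    xs = sorted(data)
    n = len(xs)
    if not 1 <= window < n:
        raise ValueError("window must satisfy 1 <= window < len(data)")
    # Index of the tightest run of window + 1 consecutive sorted points.
    best = min(range(n - window), key=lambda i: xs[i + window] - xs[i])
    return (xs[best] + xs[best + window]) / 2


print(discrete_mode([1, 2, 2, 3, 3]))                             # [2, 3]
print(continuous_mode([1.0, 2.0, 2.5, 3.0, 7.0, 12.0], window=2))  # 2.5
```

A small window follows local clumps in the sample (and so is noisy), while a large window smooths over them, much as bin width does for the binning approach.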


> The 2) part does not seem like something that is normally in secondary
> school maths.

This is not *just* aimed at secondary school stats :-)


> It's also not common AFAIK in other statistical packages
> (at least not under the name mode).

Press et al claim it is poorly known, but much better than the binning method. It saddens me that twenty years on, it's still poorly known.

Does my explanation address your objection? If not, I will consider deferring mode to 3.5, which will give me some time to think about a better API and documentation.



-- 
Steven

