[SciPy-dev] Statistics Review progress

Tue Apr 11 02:02:24 EDT 2006

This weekend I made a first pass through stats.py and did a fair bit of cleanup.
I checked in my changes as separate revisions for easier perusal:

http://projects.scipy.org/scipy/scipy/timeline?from=04%2F09%2F06&daysback=0&changeset=on&update=Update

I got through about 25 or so functions. For the most part, I focused on the
docstrings and making the code look nice and use relatively modern numpy idioms.
Implementing proper unit tests would be the next step for this set of functions.
I have made comments on some of the functions in their associated tickets. If
you would like to keep track of your favorite functions, please look at their
tickets.

http://projects.scipy.org/scipy/scipy/query?status=new&status=assigned&status=reopened&milestone=Statistics+Review+Months&order=priority

There are some issues that I thought should be discussed here rather than the
tracker.

* anova(): I want to get rid of it. It and its support functions take up nearly
a quarter of the entire code of stats.py. It returns nothing but simply prints
it out to stdout. It uses globals. It depends on several undocumented,
uncommented functions that nothing else uses; getting rid of anova() closes a
lot of other tickets, too (I subscribe to the "making progress by moving
goalposts" model of development). It's impossible to unit-test since it returns
no values. Gary Strangman, the original author of most of stats.py, removed it
from his copy some time ago because he didn't have confidence in its implementation.

It appears to me that the function would need a complete rewrite to meet the
standards I have set for passing review. I propose to remove it now.

* Some of the functions like mean() and std() are replications of functionality
in numpy and even the methods of array objects themselves. I would like to
remove them, but I imagine they are being used in various places. There's a
certain amount of code breakage I'm willing to accept in order to clean up
stats.py (e.g. all of my other bullet items), but this seems just gratuitous.

* paired() is a bit similar in that it does not return any values but just
prints them out. It is somewhat better in that it does not do much computation
itself but simply calls other functions depending on the type of data. However,
it determines the type of data by asking the user through raw_input(). It seems
to me that this kind of function does not add anything to scipy.stats. This
functionality would make much more sense as a method in a class that represented
a pairwise dataset. The class could hold the information about the kind of data
it contained, and thus it would not need to ask the user through raw_input().

* histogram() had a bug in how it computed the default bins. I fixed it. If you
use this function's defaults, your results will now be different. This function
will be refactored later to use numpy.histogram() internally as it is much faster.

* Several functions rely on first histogramming the data and then computing some
values from the cumulative histogram. Namely, cmedian(), scoreatpercentile(),
and percentileatscore(). Essentially, these functions are using the histogram to
calculate an empirical CDF and then using that to find particular values.
However, since the fastest histogram implementation readily available,
numpy.histogram() sorts the array first, there really isn't any point in doing
the histogram. It is faster and more accurate simply to sort the data and
lineraly interpolate (sorted_x, linspace(0., 1., len(sorted_x)).

However, the current forms of these functions would be useful again if they are
slightly modified to accept *already histogrammed* data. They would make good
methods on a Histogram class, for instance.

* I changed the input arguments of pointbiserialr() so my head didn't hurt
trying to follow the old implementation. See the ticket for details:

  http://projects.scipy.org/scipy/scipy/ticket/100

* We really need to sort out the issue of biased and unbiased estimators. At
least, a number of scipy.stats functions compute values that could be computed
in two different ways, conventionally given labels "biased" and "unbiased". Now
while there is some disagreement as to which is better (you get to guess which I
prefer), I think we should offer both.

Normally, I try to follow the design principle that if the value of a keyword
argument is almost always given as a constant (e.g. bias=True rather than
bias=flag_set_somewhere_else_in_my_code), then the functionality should be
exposed as two separate functions. However, there are a lot of these functions
in scipy.stats, and I don't think we would be doing anyone any favors by
doubling the number of these functions. IMO, "practicality beats purity" in this
case.

Additionally, if people start writing classes that encapsulate datasets with
methods that estimate quantities (mean, variance, etc.) from that data, they are
likely to want "biased" or "unbiased" estimates for *all* of their quantities
together. A bias flag handles this use-case much more naturally.

The names "biased" and "unbiased" are, of course up for discussion, since the
label "biased" is not particularly precise. The default setting is also up for
discussion.

-- 
Robert Kern
robert.kern at gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco