[SciPy-dev] Statistics Review progress

Wed Apr 12 12:51:32 EDT 2006

Robert Kern wrote:
> * anova(): I want to get rid of it. It and its support functions take up nearly
> a quarter of the entire code of stats.py. It returns nothing and simply prints
> its results to stdout. It uses globals. It depends on several undocumented,
> uncommented functions that nothing else uses; getting rid of anova() closes a
> lot of other tickets, too (I subscribe to the "making progress by moving
> goalposts" model of development). It's impossible to unit-test since it returns
> no values. Gary Strangman, the original author of most of stats.py, removed it
> from his copy some time ago because he didn't have confidence in its implementation.
>
>   
+1

> * Some of the functions like mean() and std() are replications of functionality
> in numpy and even the methods of array objects themselves. I would like to
> remove them, but I imagine they are being used in various places. There's a
> certain amount of code breakage I'm willing to accept in order to clean up
> stats.py (e.g. all of my other bullet items), but this seems just gratuitous.
>   
-0  on removing them
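
For reference, a minimal sketch of the overlap being discussed, assuming a current NumPy: the same mean and standard deviation are available both as functions and as array methods. The data values are made up for illustration. Note that np.std() divides by N unless ddof is given.

```python
import numpy as np

# Made-up sample data, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# The same quantities as NumPy functions and as array methods.
mean_func = np.mean(x)
mean_meth = x.mean()
std_func = np.std(x)   # default ddof=0, i.e. division by N
std_meth = x.std()
```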

> * paired() is a bit similar in that it does not return any values but just
> prints them out. It is somewhat better in that it does not do much computation
> itself but simply calls other functions depending on the type of data. However,
> it determines the type of data by asking the user through raw_input(). It seems
> to me that this kind of function does not add anything to scipy.stats. This
> functionality would make much more sense as a method in a class that represented
> a pairwise dataset. The class could hold the information about the kind of data
> it contained, and thus it would not need to ask the user through raw_input().
>
>   
+1 on removing it
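
A rough sketch of the class idea, just to make it concrete. The name PairedDataset, its `related` attribute and the t_statistic() method are all hypothetical, and this is not what paired() does internally. The point is only that the kind of data is stored once and results are returned rather than printed.

```python
import math
from statistics import mean, stdev


class PairedDataset:
    """Hypothetical container: two samples plus the kind of relationship."""

    def __init__(self, x, y, related):
        if related and len(x) != len(y):
            raise ValueError("related samples must have equal length")
        self.x = list(x)
        self.y = list(y)
        self.related = related  # stored once, so no raw_input() prompt is needed

    def t_statistic(self):
        """Return (not print) a t statistic chosen from the stored data kind."""
        if self.related:
            # Paired t: mean of the per-pair differences over its standard error.
            d = [a - b for a, b in zip(self.x, self.y)]
            return mean(d) / (stdev(d) / math.sqrt(len(d)))
        # Independent samples: pooled-variance two-sample t.
        nx, ny = len(self.x), len(self.y)
        pooled = ((nx - 1) * stdev(self.x) ** 2
                  + (ny - 1) * stdev(self.y) ** 2) / (nx + ny - 2)
        return (mean(self.x) - mean(self.y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
```

Because the method returns a value, it can be unit-tested directly, which the printing version cannot.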

> * histogram() had a bug in how it computed the default bins. I fixed it. If you
> use this function's defaults, your results will now be different. This function
> will be refactored later to use numpy.histogram() internally as it is much faster.
>   
+1 on deleting this

> * Several functions rely on first histogramming the data and then computing some
> values from the cumulative histogram. Namely, cmedian(), scoreatpercentile(),
> and percentileatscore(). Essentially, these functions are using the histogram to
> calculate an empirical CDF and then using that to find particular values.
> However, since the fastest histogram implementation readily available,
> numpy.histogram(), sorts the array first, there really isn't any point in doing
> the histogram. It is faster and more accurate simply to sort the data and
> linearly interpolate (sorted_x, linspace(0., 1., len(sorted_x))).
>
> However, the current forms of these functions would be useful again if they were
> slightly modified to accept *already histogrammed* data. They would make good
> methods on a Histogram class, for instance.
>   
+1 on a Histogram class
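
The sort-and-interpolate approach described above could look roughly like this. It is a sketch only, assuming numpy.interp() and an empirical CDF that puts the smallest value at 0 and the largest at 1. The function name score_at_percentile is made up here and is not the existing scoreatpercentile() implementation.

```python
import numpy as np


def score_at_percentile(x, per):
    """Score at percentile `per` (0-100), found by sorting and interpolating."""
    sorted_x = np.sort(np.asarray(x, dtype=float))
    # Empirical CDF positions: smallest value at 0, largest at 1.
    cdf = np.linspace(0.0, 1.0, len(sorted_x))
    # Linearly interpolate the sorted data at the requested CDF level.
    return float(np.interp(per / 100.0, cdf, sorted_x))


median = score_at_percentile([3, 1, 2, 5, 4], 50)
```

No binning is involved, so the result does not depend on a choice of bin edges.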

> * I changed the input arguments of pointbiserialr() so my head didn't hurt
> trying to follow the old implementation. See the ticket for details:
>
>   http://projects.scipy.org/scipy/scipy/ticket/100
>
> * We really need to sort out the issue of biased and unbiased estimators. At
> least, a number of scipy.stats functions compute values that could be computed
> in two different ways, conventionally given labels "biased" and "unbiased". Now
> while there is some disagreement as to which is better (you get to guess which I
> prefer), I think we should offer both.
>   
Yes, we should offer both.  It would be nice if we could allow the user
to decide as well.

> Normally, I try to follow the design principle that if the value of a keyword
> argument is almost always given as a constant (e.g. bias=True rather than
> bias=flag_set_somewhere_else_in_my_code), then the functionality should be
> exposed as two separate functions. However, there are a lot of these functions
> in scipy.stats, and I don't think we would be doing anyone any favors by
> doubling the number of these functions. IMO, "practicality beats purity" in this
> case.
>   
That's my thinking too. 
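
A minimal sketch of the keyword-flag approach, assuming a current NumPy where var() and std() take a ddof argument. The function names and the bias=False default are placeholders picked for illustration, not a proposal for the final API or default.

```python
import numpy as np


def var(x, bias=False):
    """Variance of x; divide by N if bias is True, else by N-1."""
    ddof = 0 if bias else 1
    return float(np.var(np.asarray(x, dtype=float), ddof=ddof))


def std(x, bias=False):
    """Standard deviation of x, using the same bias flag as var()."""
    ddof = 0 if bias else 1
    return float(np.std(np.asarray(x, dtype=float), ddof=ddof))


# Made-up data; one flag selects the estimator in both functions.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
biased = var(data, bias=True)
unbiased = var(data, bias=False)
```

One function per quantity keeps the namespace small, at the cost of a flag that is usually passed as a constant.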

> Additionally, if people start writing classes that encapsulate datasets with
> methods that estimate quantities (mean, variance, etc.) from that data, they are
> likely to want "biased" or "unbiased" estimates for *all* of their quantities
> together. A bias flag handles this use-case much more naturally.
>
> The names "biased" and "unbiased" are, of course, up for discussion, since the
> label "biased" is not particularly precise. The default setting is also up for
> discussion.
>
>   
How about making the default minimize mean squared error --- i.e.,
division by N+1 for the variance calculation :-)
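
To put numbers on that: assuming i.i.d. normal samples (an assumption, the claim does not hold for arbitrary distributions), the mean squared error of sum((x - mean)**2) / divisor has a closed form, and N+1 gives the smallest value. The helper name variance_mse is made up for this sketch.

```python
def variance_mse(n, divisor, sigma2=1.0):
    """MSE of sum((x - mean)**2) / divisor for n i.i.d. normal samples.

    Uses the fact that the sum of squares is sigma2 times a chi-square
    variable with n-1 degrees of freedom (mean n-1, variance 2(n-1)).
    """
    m = n - 1
    variance_term = 2.0 * m * sigma2 ** 2 / divisor ** 2
    bias = (m / divisor - 1.0) * sigma2
    return variance_term + bias ** 2


n = 10
# Compare the "unbiased" (N-1), "biased" (N) and minimum-MSE (N+1) divisors.
mse = {d: variance_mse(n, d) for d in (n - 1, n, n + 1)}
```

For n = 10 the N-1 divisor has zero bias but the largest MSE of the three; N+1 trades some bias for lower variance.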

-Travis