Request for feedback on API design

Mon Dec 13 15:19:27 EST 2010

Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:

> I am soliciting feedback regarding the API of my statistics module:
>
> http://code.google.com/p/pycalcstats/
>
>
> Specifically the following couple of issues:
>
> (1) Multivariate statistics such as covariance have two obvious APIs:
>
>     A pass the X and Y values as two separate iterable arguments, e.g.: 
>       cov([1, 2, 3], [4, 5, 6])
>
>     B pass the X and Y values as a single iterable of tuples, e.g.:
>       cov([(1, 4), (2, 5), (3, 6)])
>
> I currently support both APIs. Do people prefer one, or the other, or 
> both? If there is a clear preference for one over the other, I may drop 
> support for the other.
>

I don't have an informed opinion on this.
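
That said, callers can convert between the two forms with zip, so either
API can serve data stored in the other layout.  A minimal sketch, assuming
a hypothetical two-argument population cov (my own stand-in, not the
module's actual implementation):

```python
def cov(xs, ys):
    """Hypothetical population covariance of two equal-length sequences (API A)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

pairs = [(1, 4), (2, 5), (3, 6)]      # data laid out for API B

# API B data -> API A arguments: unzip the pairs into two tuples.
xs, ys = zip(*pairs)
assert cov(xs, ys) == cov([1, 2, 3], [4, 5, 6])

# API A data -> API B layout: zip the two sequences back into pairs.
assert list(zip([1, 2, 3], [4, 5, 6])) == pairs
```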

> (2) Statistics text books often give formulae in terms of sums and 
> differences such as
>
> Sxx = n*Σ(x**2) - (Σx)**2

Interestingly, your Sxx is closely related to the variance:

if x is a list of n numbers and var is the population variance
(divisor n) then

    Sxx == (n**2)*var(x)

And more generally, if x and y have the same length n, then Sxy (*) is
related to the (population) covariance:

    Sxy == (n**2)*cov(x, y)
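
Both identities are easy to check numerically.  The sketch below uses
helper names of my own (pvar, pcov, Sxx, Sxy), not functions from the
module, and assumes the population definitions (divisor n) with integer
data so that Fraction keeps the arithmetic exact:

```python
from fractions import Fraction

def pvar(x):
    """Population variance: mean squared deviation from the mean."""
    n = len(x)
    mean = Fraction(sum(x), n)
    return sum((xi - mean) ** 2 for xi in x) / n

def pcov(x, y):
    """Population covariance of two equal-length sequences."""
    n = len(x)
    mx, my = Fraction(sum(x), n), Fraction(sum(y), n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

def Sxx(x):
    """n*Σ(x**2) - (Σx)**2, as in the original post."""
    n = len(x)
    return n * sum(xi ** 2 for xi in x) - sum(x) ** 2

def Sxy(x, y):
    """n*Σ(xy) - (Σx)(Σy), the generalisation in the footnote."""
    n = len(x)
    return n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)

x = [1, 2, 3, 4]
y = [4, 5, 7, 10]
n = len(x)
assert Sxx(x) == n ** 2 * pvar(x)
assert Sxy(x, y) == n ** 2 * pcov(x, y)
```

With the sample variance (divisor n-1) the factor becomes n*(n-1)
instead of n**2.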

So if you have a variance and covariance function, it would be redundant
to include Sxx and Sxy.  Another argument against including Sxx & co is
that their definition is not universally agreed upon.  For example, I
have seen

    Sxx = Σ(x**2) - (Σx)**2/n
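
The two definitions differ by a factor of n, and this second one is just
the sum of squared deviations from the mean.  A quick check, again with
illustrative names of my own (sxx_scaled, sxx_centered) and integer data:

```python
from fractions import Fraction

def sxx_scaled(x):
    """n*Σ(x**2) - (Σx)**2, the definition from the original post."""
    n = len(x)
    return n * sum(xi ** 2 for xi in x) - sum(x) ** 2

def sxx_centered(x):
    """Σ(x**2) - (Σx)**2/n, the alternative definition."""
    n = len(x)
    return sum(xi ** 2 for xi in x) - Fraction(sum(x) ** 2, n)

x = [1, 2, 3, 4]
mean = Fraction(sum(x), len(x))

# The two conventions differ exactly by a factor of n.
assert sxx_scaled(x) == len(x) * sxx_centered(x)

# The alternative form equals the sum of squared deviations from the mean.
assert sxx_centered(x) == sum((xi - mean) ** 2 for xi in x)
```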

HTH

-- 
Arnaud

(*) Here I take Sxy to be  n*Σ(xy) - (Σx)(Σy), generalising from your
definition of Sxx.