[Python-ideas] Pre-PEP 2nd draft: adding a statistics module to Python

Thu Aug 8 17:48:12 CEST 2013

On 8 August 2013 15:28, Steven D'Aprano <steve at pearwood.info> wrote:
> Attached is the second draft of the pre-PEP for adding a statistics module
> to Python. A brief summary of the most important changes:

It all looks good to me.

About this part in the PEP though:
'''
Open and Deferred Issues

    - At this stage, I am unsure of the best API for multivariate statistical
      functions such as linear regression, correlation coefficient, and
      covariance. Possible APIs include:

        * Separate arguments for x and y data:
          function([x0, x1, ...], [y0, y1, ...])

        * A single argument for (x, y) data:
          function([(x0, y0), (x1, y1), ...])

        * Selecting arbitrary columns from a 2D array:
          function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)

        * Some combination of the above.

      In the absence of a consensus of preferred API for multivariate stats,
      I will defer including such multivariate functions until Python 3.5.
'''

I don't think there has been any discussion of this yet, so it is less
a lack of consensus than a lack of discussion. Or would you simply
prefer to defer it for now?

I'm just going to say that it basically doesn't matter which of the
first two options you go for; the third one with the 2D array and
indices is an unnecessary complication.
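
To illustrate why the choice between the first two forms matters so
little: in plain Python, converting from one to the other is a single
call to zip(). A minimal sketch, with made-up data and variable names:

```python
# Form 1: separate sequences for the x and y observations (made-up data).
xdata = [1.0, 2.0, 3.0, 4.0]
ydata = [2.0, 4.1, 5.9, 8.2]

# Form 1 -> form 2: pair up corresponding observations.
pairs = list(zip(xdata, ydata))

# Form 2 -> form 1: "unzip" the pairs back into two columns.
xs, ys = (list(column) for column in zip(*pairs))
```

So whichever form the function accepts, a caller holding the other
form is one line away from it.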

Whichever form you choose, there will always be situations where the
data has to be transposed because it arrives in the other form. Numpy
actually accepts both forms, plus a transposed variant selected with
the ``rowvar`` argument, e.g.:

>>> help(numpy.corrcoef)
Help on function corrcoef in module numpy.lib.function_base:

corrcoef(x, y=None, rowvar=1, bias=0, ddof=None)
    Return correlation coefficients.

    ...

    Parameters
    ----------
    x : array_like
        A 1-D or 2-D array containing multiple variables and observations.
        Each row of `m` represents a variable, and each column a single
        observation of all those variables. Also see `rowvar` below.
    y : array_like, optional
        An additional set of variables and observations. `y` has the same
        shape as `m`.
    rowvar : int, optional
        If `rowvar` is non-zero (default), then each row represents a
        variable, with observations in the columns. Otherwise, the relationship
        is transposed: each column represents a variable, while the rows
        contain observations.

To me that seems like a bit of a mess (particularly since numpy arrays
are so easily transposed).

The more important question would be whether you intend to compute
covariance/correlation matrices rather than just individual pairwise
values. If the intention is to compute individual values then I would
say just keep the API clean and simple with:

>>> correlation(xdata, ydata)
0.7812312312

A signature like that hardly needs an explanation.
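
For concreteness, here is a rough sketch of what a pairwise function
with that signature could look like. This is only an illustration of
the proposed API shape, not the PEP's implementation: the name
correlation(), the error handling, and the choice of the Pearson
coefficient are all my assumptions.

```python
import math

def correlation(xdata, ydata):
    """Pearson correlation coefficient of two equal-length sequences.

    Hypothetical sketch: name and error behaviour are assumptions.
    """
    n = len(xdata)
    if n != len(ydata):
        raise ValueError("xdata and ydata must have the same length")
    if n < 2:
        raise ValueError("at least two data points are required")
    # Means of each variable.
    xbar = math.fsum(xdata) / n
    ybar = math.fsum(ydata) / n
    # Sums of products of deviations from the means.
    sxy = math.fsum((x - xbar) * (y - ybar) for x, y in zip(xdata, ydata))
    sxx = math.fsum((x - xbar) ** 2 for x in xdata)
    syy = math.fsum((y - ybar) ** 2 for y in ydata)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear data gives a coefficient of 1.0.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```

A caller holding (x, y) pairs instead can still use it by unzipping
first, e.g. correlation(*zip(*pairs)).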

Oscar