[SciPy-user] sparse version of stats.pearsonr ?

Peter Skomoroch peter.skomoroch at gmail.com
Mon Mar 9 20:18:16 EDT 2009


Here is what I have based on pearsonr in scipy.stats:

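from numpy import sqrt
# betai in scipy.stats was a thin wrapper around the regularized
# incomplete beta function in scipy.special
from scipy.special import betainc as betai

def sparse_vector_dot(x, y):
    '''Calculates the dot product of two sparse column vectors (n x 1)'''
    # .sum() rather than .data[0]: the 1x1 product stores no entries at
    # all when x and y share no nonzero positions
    return x.T.dot(y).sum()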

def sparse_pearsonr(x, y):
    """Calculates a Pearson correlation coefficient and the p-value for
    testing non-correlation using two sparse vectors as inputs.

    Parameters
    ----------
    x : sparse column vector (n x 1)
    y : sparse column vector with the same shape as x

    Returns
    -------
    (Pearson's correlation coefficient,
     2-tailed p-value)

    References
    ----------
    http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation"""

    # we form a third sparse vector z where the nonzero entries of z
    # are the union of the nonzero entries in x and y
    z = x + y
    n = z.getnnz()  # size of that union (entries that cancel in x + y may not be counted)
    mx = x.data.mean()
    my = y.data.mean()
    # we only want to subtract the mean for non-zero values...
    # so we copy & access the sparse vector components directly:
    xm, ym = x.copy(), y.copy()  # real copies, so x and y are left unmodified
    xm.data, ym.data = x.data-mx, y.data-my
    r_num = n*(sparse_vector_dot(xm,ym))
    r_den = n*sqrt(sparse_vector_dot(xm,xm)*sparse_vector_dot(ym,ym))
    r = (r_num / r_den)

    # Presumably, if r > 1, then it is only some small artifact of floating
    # point arithmetic.
    r = min(r, 1.0)
    df = n-2

    # Use a small floating point value to prevent divide-by-zero nonsense
    # fixme: TINY is probably not the right value and this is probably not
    # the way to be robust. The scheme used in spearmanr is probably better.
    TINY = 1.0e-20
    t = r*sqrt(df/((1.0-r+TINY)*(1.0+r+TINY)))
    prob = betai(0.5*df,0.5,df/(df+t*t))
    return r,prob
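
To make explicit what the function above computes for r, here is a minimal, dependency-free sketch of the same arithmetic. It is not the scipy.sparse code: plain {index: value} dicts stand in for the sparse vectors, and the names sparse_dot, center_stored and sparse_r are made up for this illustration. The factor n appears in both r_num and r_den, so it cancels and is left out. As in the function above, each mean is taken over the stored nonzero entries only, not over the full vector length.

```python
from math import sqrt

def sparse_dot(x, y):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the vector with fewer stored entries
    return sum(v * y[i] for i, v in x.items() if i in y)

def center_stored(x):
    """Subtract the mean of the stored values from each stored value."""
    m = sum(x.values()) / len(x)
    return {i: v - m for i, v in x.items()}

def sparse_r(x, y):
    """Correlation over stored entries, mirroring r in sparse_pearsonr."""
    xm, ym = center_stored(x), center_stored(y)
    return sparse_dot(xm, ym) / sqrt(sparse_dot(xm, xm) * sparse_dot(ym, ym))

# x and y overlap only at index 1
x = {0: 1.0, 1: 3.0}
y = {1: 2.0, 2: 4.0}
r = sparse_r(x, y)  # -0.5
```

If all stored values of a vector are equal, the denominator is zero and this sketch raises ZeroDivisionError. Because only the stored entries are centered, the result generally differs from what stats.pearsonr returns on the densified vectors.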

On Mon, Mar 9, 2009 at 6:53 PM, Peter Skomoroch
<peter.skomoroch at gmail.com> wrote:

> Before I re-invent the wheel, is there an existing version of
> stats.pearsonr(x,y) that will work with scipy.sparse vectors?
>
> -Pete
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>


