[SciPy-User] multivariate empirical distribution function, avoid double loop ?

Wed Aug 24 17:52:48 EDT 2011

On Wed, Aug 24, 2011 at 4:02 PM, David Baddeley
<david_baddeley at yahoo.com.au> wrote:
> Sounds like it could be a case for scipy.spatial.kdtree.

I don't see a way. Any suggestions?

"find all smaller points" doesn't induce a complete order, so I don't
see a way to define a distance.

The only way to reduce comparisons is graphical (?) rule out cases
that we know cannot be smaller.

I got a little bit further

def mvecdfvalues_noties(data):
    #use sort on first column
    ev = np.empty(data.shape[0], int)
    sortind0 = np.argsort(data[:,0])
    datas = data[sortind0, ...]
    for i,x in enumerate(datas):
        ev[i] = mvecdf(datas[:i, 1:], x[1:]) + 1
        #it should be possible to make this recursive

    ev2 = np.empty(data.shape[0], int)
    ev2[sortind0] = ev

    return ev2

poor man's timing

import time
t0 = time.time()
x4 = np.random.randn(5000,2)
mvecdfvalues(x4),
t1 = time.time()
mvecdfvalues_noties(x4)
t2 = time.time()
print t1-t0, t2-t1

>>>
5.492000103 1.59099984169

>>> 5.492000103 / 1.59099984169
3.4519174415292584

much better, but still pretty expensive in a Monte Carlo or Bootstrap.

Cheers,

Josef

>
> Cheers, David
>
> On Thu, 25 Aug 2011 06:59 NZST josef.pktd at gmail.com wrote:
>
>>On Wed, Aug 24, 2011 at 2:27 PM, Alan G Isaac <alan.isaac at gmail.com> wrote:
>>> On 8/24/2011 10:23 AM, josef.pktd at gmail.com wrote:
>>>> Does anyone know whether there is an algorithm that avoids the double
>>>> loop to get a multivariate empirical distribution function?
>>>
>>> I think that is pretty standard.
>>> I'll attach something posted awhile ago.
>>> It seemed right at the time, but I did
>>> not test it.  Once upon a time it was at
>>> http://svn.scipy.org/svn/scipy/trunk/scipy/sandbox/dhuard/stats.py
>>>
>>> Cheers,
>>> Alan
>>>
>>>
>>> def empiricalcdf(data, method='Hazen'):
>>>     """Return the empirical cdf.
>>>
>>>     Methods available (here i goes from 1 to N)
>>>         Hazen:       (i-0.5)/N
>>>         Weibull:     i/(N+1)
>>>         Chegodayev:  (i-.3)/(N+.4)
>>>         Cunnane:     (i-.4)/(N+.2)
>>>         Gringorten:  (i-.44)/(N+.12)
>>>         California:  (i-1)/N
>>>
>>>     :author: David Huard
>>>     """
>>>     i = np.argsort(np.argsort(data)) + 1.
>>>     nobs = len(data)
>>>     method = method.lower()
>>>     if method == 'hazen':
>>>         cdf = (i-0.5)/nobs
>>>     elif method == 'weibull':
>>>         cdf = i/(nobs+1.)
>>>     elif method == 'california':
>>>         cdf = (i-1.)/nobs
>>>     elif method == 'chegodayev':
>>>         cdf = (i-.3)/(nobs+.4)
>>>     elif method == 'cunnane':
>>>         cdf = (i-.4)/(nobs+.2)
>>>     elif method == 'gringorten':
>>>         cdf = (i-.44)/(nobs+.12)
>>>     else:
>>>         raise 'Unknown method. Choose among Weibull, Hazen, Chegodayev, Cunnane, Gringorten and California.'
>>>     return cdf
>>
>>
>>Unfortunately it's 1d only, and I am working on multivariate, at least
>>bivariate.
>>
>>Pierre has a 1d version similar to this in scipy.stats.mstats and a,
>>so far unused, copy is in statsmodels.
>>
>>Thanks,
>>Josef
>>
>>
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>_______________________________________________
>>SciPy-User mailing list
>>SciPy-User at scipy.org
>>http://mail.scipy.org/mailman/listinfo/scipy-user
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>