[SciPy-User] multivariate empirical distribution function, avoid double loop ?
josef.pktd at gmail.com
josef.pktd at gmail.com
Wed Aug 24 17:52:48 EDT 2011
On Wed, Aug 24, 2011 at 4:02 PM, David Baddeley
<david_baddeley at yahoo.com.au> wrote:
> Sounds like it could be a case for scipy.spatial.kdtree.
I don't see a way. Any suggestions?
"find all smaller points" doesn't induce a complete order, so I don't
see a way to define a distance.
The only way to reduce comparisons is graphical (?) rule out cases
that we know cannot be smaller.
I got a little bit further
def mvecdfvalues_noties(data):
#use sort on first column
ev = np.empty(data.shape[0], int)
sortind0 = np.argsort(data[:,0])
datas = data[sortind0, ...]
for i,x in enumerate(datas):
ev[i] = mvecdf(datas[:i, 1:], x[1:]) + 1
#it should be possible to make this recursive
ev2 = np.empty(data.shape[0], int)
ev2[sortind0] = ev
return ev2
poor man's timing
import time
t0 = time.time()
x4 = np.random.randn(5000,2)
mvecdfvalues(x4),
t1 = time.time()
mvecdfvalues_noties(x4)
t2 = time.time()
print t1-t0, t2-t1
>>>
5.492000103 1.59099984169
>>> 5.492000103 / 1.59099984169
3.4519174415292584
much better, but still pretty expensive in a Monte Carlo or Bootstrap.
Cheers,
Josef
>
> Cheers, David
>
> On Thu, 25 Aug 2011 06:59 NZST josef.pktd at gmail.com wrote:
>
>>On Wed, Aug 24, 2011 at 2:27 PM, Alan G Isaac <alan.isaac at gmail.com> wrote:
>>> On 8/24/2011 10:23 AM, josef.pktd at gmail.com wrote:
>>>> Does anyone know whether there is an algorithm that avoids the double
>>>> loop to get a multivariate empirical distribution function?
>>>
>>> I think that is pretty standard.
>>> I'll attach something posted awhile ago.
>>> It seemed right at the time, but I did
>>> not test it. Once upon a time it was at
>>> http://svn.scipy.org/svn/scipy/trunk/scipy/sandbox/dhuard/stats.py
>>>
>>> Cheers,
>>> Alan
>>>
>>>
>>> def empiricalcdf(data, method='Hazen'):
>>> """Return the empirical cdf.
>>>
>>> Methods available (here i goes from 1 to N)
>>> Hazen: (i-0.5)/N
>>> Weibull: i/(N+1)
>>> Chegodayev: (i-.3)/(N+.4)
>>> Cunnane: (i-.4)/(N+.2)
>>> Gringorten: (i-.44)/(N+.12)
>>> California: (i-1)/N
>>>
>>> :author: David Huard
>>> """
>>> i = np.argsort(np.argsort(data)) + 1.
>>> nobs = len(data)
>>> method = method.lower()
>>> if method == 'hazen':
>>> cdf = (i-0.5)/nobs
>>> elif method == 'weibull':
>>> cdf = i/(nobs+1.)
>>> elif method == 'california':
>>> cdf = (i-1.)/nobs
>>> elif method == 'chegodayev':
>>> cdf = (i-.3)/(nobs+.4)
>>> elif method == 'cunnane':
>>> cdf = (i-.4)/(nobs+.2)
>>> elif method == 'gringorten':
>>> cdf = (i-.44)/(nobs+.12)
>>> else:
>>> raise 'Unknown method. Choose among Weibull, Hazen, Chegodayev, Cunnane, Gringorten and California.'
>>> return cdf
>>
>>
>>Unfortunately it's 1d only, and I am working on multivariate, at least
>>bivariate.
>>
>>Pierre has a 1d version similar to this in scipy.stats.mstats and a,
>>so far unused, copy is in statsmodels.
>>
>>Thanks,
>>Josef
>>
>>
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>_______________________________________________
>>SciPy-User mailing list
>>SciPy-User at scipy.org
>>http://mail.scipy.org/mailman/listinfo/scipy-user
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
More information about the SciPy-User
mailing list