[SciPy-User] Kuiper test (was Re: scipy.stats.fit inquiry)

Anne Archibald peridot.faceted at gmail.com
Wed Oct 28 14:59:46 EDT 2009


I put the Kuiper test (and some other non-uniformity tests) up on github:
http://github.com/aarchiba/kuiper/
I'd like to clean up the tests somewhat (to reduce the code
duplication between test_kuiper, test_htest, and test_zm2) before
proposing any of it be included in scipy.
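
For reference, the core of the one-sample statistic is short (a minimal
sketch; the significance computation is omitted, and this isn't
necessarily the exact interface of the github code):

    import numpy as np

    def kuiper_statistic(data, cdf):
        # Kuiper's V = D+ + D-: the largest deviation of the empirical
        # CDF above the hypothesized CDF plus the largest deviation below.
        x = np.sort(data)
        n = len(x)
        u = cdf(x)
        d_plus = np.amax(np.arange(1, n + 1) / float(n) - u)
        d_minus = np.amax(u - np.arange(0, n) / float(n))
        return d_plus + d_minus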

Anne

2009/10/21 Anne Archibald <peridot.faceted at gmail.com>:
> 2009/10/21  <josef.pktd at gmail.com>:
>> When the parameters of the t distribution
>> are estimated, then the test has no power.
>> I don't know if adjusting the critical values
>> would help at least for comparing similar
>> distributions like norm and t.
>> (same problem with kstest)
>
> Yes, this test has its limitations. It's also not as sensitive as one
> might wish for distinguishing multimodal distributions from a
> constant density.
>
>> To save on the lambda function, the cdf could take args given as
>> an argument to kuiper. Maybe the specification of the cdf argument
>> could be taken from kstest, but not (the generality of) the rvs argument
>> in kstest.
>
> I loathe interfaces that pass args; I find they make the interface
> more confusing while adding no functionality. But I realize that
> they're standard and not going away (what about the currying
> functionality in new pythons?), so I'll add it.
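>
> For the common case, a frozen distribution already avoids both the
> lambda and the args machinery (assuming the kuiper(data, cdf) call
> style discussed above):
>
>     import numpy as np
>     from scipy import stats
>
>     data = np.random.standard_t(10, size=100)
>     # instead of kuiper(data, lambda x: stats.t.cdf(x, 10)):
>     kuiper(data, stats.t(10).cdf)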
>
>> Your cdf_from_intervals looks nice for a univariate step-function
>> distribution. There are several pieces for this on the mailing list,
>> but I never got around to collecting them.
>
> This was just a quick hack because I needed it; I'm not sure it's
> worth including in scipy.stats, since an appropriately general version
> might be just as difficult to call as to reimplement.
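>
> For concreteness, a version with my argument names (a piecewise-constant
> density on adjacent intervals has a piecewise-linear CDF, so np.interp
> does all the work):
>
>     import numpy as np
>
>     def cdf_from_intervals(breaks, weights):
>         # breaks: n+1 sorted interval edges; weights: unnormalized
>         # probability mass on each of the n intervals between them
>         cum = np.concatenate(([0.0], np.cumsum(weights)))
>         cum /= cum[-1]
>         return lambda x: np.interp(x, breaks, cum)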
>
>> Your interval functions, I haven't quite figured out. Is fold_intervals
>> supposed to convert a set of overlapping intervals to a non-overlapping
>> partitioning with associated weights? For example for merging
>> 2 histograms with non-identical bounds?
>
> That among other things. The context was that I had observations of an
> X-ray binary, consisting of several different segments with several
> different instruments. So I'd have one observation covering orbital
> phases 0.8 to 1.3 (i.e. 0.5 turns) with effective area 10 cm^2 (say),
> and another covering phases 0.2 to 2.1 (i.e. 1.9 turns) with effective
> area 12 cm^2 (say). So a constant flux from the source would produce a
> non-constant distribution of photons modulo 1; fold_intervals is
> designed to convert those two weighted spans to a collection of
> non-overlapping weighted intervals covering [0,1), for use in a CDF
> for the Kuiper test.  The histogram stuff is there to allow plotting
> the results in a familiar form.
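>
> In code, the folding might look like this (my own reconstruction, not
> necessarily what's on github): for each subinterval of [0, 1), count
> how many times each weighted span covers it modulo 1.
>
>     import numpy as np
>
>     def fold_intervals(intervals):
>         # intervals: sequence of (start, stop, weight) spans on the
>         # real line; returns break points of [0, 1] and the total
>         # weight on each subinterval between consecutive breaks
>         breaks = np.unique(np.concatenate(
>             [[0.0, 1.0]] + [[a % 1.0, b % 1.0] for a, b, w in intervals]))
>         mids = (breaks[:-1] + breaks[1:]) / 2.0
>         weights = np.zeros(len(mids))
>         for a, b, w in intervals:
>             # number of integers k with a <= mids + k < b:
>             weights += w * (np.ceil(b - mids) - np.ceil(a - mids))
>         return breaks, weights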
>
>> I haven't figured out the shift invariance.
>
> The idea is just that if you have, say, a collection X of samples from
> [0,1) that you are testing for uniformity, the Kuiper test returns the
> same result for X and for (X+0.3)%1. Since for pulsars there is often
> no natural start time, this is a sensible thing to ask from any test
> for uniformity.
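>
> Numerically, with the kuiper_statistic sketch above:
>
>     import numpy as np
>
>     x = np.random.uniform(size=500)
>     v1 = kuiper_statistic(x, lambda u: u)              # uniform on [0, 1)
>     v2 = kuiper_statistic((x + 0.3) % 1, lambda u: u)  # shifted sample
>     np.allclose(v1, v2)   # True; a K-S test lacks this property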
>
>>> I also have some code for the H test (essentially looking at the
>>> Fourier coefficients to find how many harmonics are worth including
>>> and what the significance is; de Jager et al. 1989 "A powerful test for
>>> weak periodic signals with unknown light curve shape in sparse data").
>>> But circular statistics seem to be a bit obscure, so I'm not sure how
>>> much effort should go into putting this in scipy.
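>>>
>>> The statistic itself is only a few lines (a sketch from the paper's
>>> definition; the significance approximation is separate):
>>>
>>>     import numpy as np
>>>
>>>     def h_statistic(phases, max_harmonics=20):
>>>         # de Jager et al. 1989: H = max over m of (Z_m^2 - 4m + 4),
>>>         # where Z_m^2 is 2/N times the Fourier power summed over the
>>>         # first m harmonics of the phases (points on [0, 1))
>>>         phases = np.asarray(phases)
>>>         n = len(phases)
>>>         k = np.arange(1, max_harmonics + 1)
>>>         power = np.abs(np.exp(2j * np.pi * k[:, None] * phases).sum(axis=1)) ** 2
>>>         z2 = 2.0 / n * np.cumsum(power)
>>>         return np.amax(z2 - 4.0 * k + 4.0)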
>>
>> For sure they are obscure to me, but there are a few circular descriptive
>> statistics in stats.morestats, and I saw a matlab toolbox on the
>> file exchange. I have figured out by now that there are some pretty
>> different statistics used in various fields.
>> I guess it's all up to you.
>
> To be honest, these are a little obscure even among pulsar
> astronomers; here as elsewhere histograms seem to dominate.
>
>> From your description (below), I would think that for circular
>> distributions we would need different generic functions that
>> don't fit in the current distribution classes: integration on a
>> circle (?) instead of integration on the real line.
>
> Basically, yes. Of course you can view integration on a circle as
> integration on a line parameterizing the circle, but there's no way
> around the fact that the first moment is a complex number which
> contains both position (argument) and concentration (magnitude)
> information.
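>
> Concretely:
>
>     import numpy as np
>
>     theta = np.random.vonmises(0.0, 2.0, size=1000)  # angles in [-pi, pi)
>     m1 = np.mean(np.exp(1j * theta))  # first circular moment, complex
>     position = np.angle(m1)           # mean direction (argument)
>     concentration = np.abs(m1)        # mean resultant length, in [0, 1]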
>
>> If I define boundaries, vonmises.a and vonmises.b, then
>> I think you would not be able to calculate pdf(x), cdf(x)
>> and so on for x outside of the support [a,b]. I don't know
>> whether it is possible to define a,b but not enforce them in
>> _argcheck and only use them as integration bounds.
>
> It's useful to be able to work with pdf and cdf outside the bounds,
> particularly since the effect of loc is to shift the bounds - so any
> code that takes loc as a free parameter would otherwise have to work
> hard to avoid going outside the bounds.
>
>> I checked briefly: vonmises.moment and vonmises.stats
>> work (they integrate over the ppf, not the pdf), while the
>> generic moment calculation with the pdf (_mom0_sc) and
>> entropy fail. fit seems to work with the normal random
>> variable, but I thought I got lots of "weird" fit results
>> before.
>>
>> For the real line, this looks "strange"
>>>>> stats.vonmises.pdf(np.linspace(0,3*np.pi,10), 2)
>> array([ 0.51588541,  0.18978364,  0.02568442,  0.00944877,  0.02568442,
>>        0.18978364,  0.51588541,  0.18978364,  0.02568442,  0.00944877])
>>>>> stats.vonmises.cdf(np.linspace(0,3*np.pi,10), 2)
>> array([  0.5       ,   0.89890776,   0.98532805,   1.        ,
>>         7.28318531,   7.28318531,   7.28318531,   7.28318531,
>>         7.28318531,  13.56637061])
>
> Unfortunately it seems that my attempt to fix the von Mises
> distribution (using vonmises_cython) went badly awry, and now it
> reports nonsense for values outside -pi..pi. I could have sworn I had
> tests for that. I intend to fix this during the weekend's code sprint.
> "Correct" behaviour (IMHO) would make the CDF the antiderivative of
> the PDF, even though this means it leaves [0,1].
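>
> The intended behaviour is one wrap-and-offset away from the base CDF
> (a sketch of the convention, not the vonmises_cython code itself):
>
>     import numpy as np
>     from scipy import stats
>
>     def vonmises_cdf_extended(x, kappa):
>         # antiderivative-of-the-pdf convention: the pdf is periodic, so
>         # cdf(x + 2*pi) == cdf(x) + 1 and the cdf leaves [0, 1]; the
>         # base cdf is only ever evaluated inside [-pi, pi)
>         turns = np.floor((x + np.pi) / (2 * np.pi))
>         return stats.vonmises.cdf(x - 2 * np.pi * turns, kappa) + turns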
>
> Incidentally, how would I get nosetests to run only some of the
> distributions tests (ideally just the ones I choose)? When I just do
> "nosetests scipy.stats" it takes *forever*.
>
>> ppf, calculated generically, works and looks only a little bit strange.
>>
>>>>> stats.vonmises.ppf([0, 1e-10,1-1e-10,1],2)
>> array([       -Inf, -3.14159264,  3.14159264,         Inf])
>>
>>>
>>> It's worth defining the boundaries, but I don't think you'll get
>>> useful moment calculations out of it, since circular moments are
>>> defined differently from linear moments: rather than int(x**n*pdf(x))
>>> they're int(exp(2j*pi*n*x)*pdf(x)). There are explicit formulas for
>>> these for the von Mises distribution, though, and they're useful
>>> because they are essentially the Fourier coefficients of the pdf.
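>>>
>>> For a von Mises distribution centered at zero, the n-th circular
>>> moment has a closed form in terms of modified Bessel functions:
>>>
>>>     from scipy import special
>>>
>>>     def vonmises_circular_moment(n, kappa):
>>>         # int(exp(2j*pi*n*x)*pdf(x)) for a centered von Mises density;
>>>         # real by symmetry, and -> 1 for every n as kappa -> infinity
>>>         return special.iv(n, kappa) / special.iv(0, kappa)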
>>>
>>> For context, I've done some work with representing circular
>>> probability distributions (= pulse profiles) in terms of their moments
>>> (= Fourier series), and in particular using a kernel density estimator
>>> with either a sinc kernel (= truncation) or a von Mises kernel (=
>>> scaling coefficients by von Mises moments). The purpose was to get
>>> not-too-biased estimates of the modulation amplitudes of X-ray
>>> pulsations; the paper has been kind of sidetracked by other work.
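>>>
>>> In outline (my names, and a sketch of the idea rather than the paper's
>>> exact estimator):
>>>
>>>     import numpy as np
>>>     from scipy import special
>>>
>>>     def fourier_kde(phases, n_harmonics, kappa=None):
>>>         # empirical Fourier coefficients of the circular density;
>>>         # kappa=None truncates the series (sinc kernel), otherwise
>>>         # the coefficients are tapered by the von Mises moments
>>>         # I_n(kappa)/I_0(kappa) (von Mises kernel)
>>>         n = np.arange(1, n_harmonics + 1)
>>>         c = np.exp(-2j * np.pi * n[:, None] * np.asarray(phases)).mean(axis=1)
>>>         if kappa is not None:
>>>             c *= special.iv(n, kappa) / special.iv(0, kappa)
>>>         def pdf(x):
>>>             terms = c[:, None] * np.exp(2j * np.pi * n[:, None] * np.atleast_1d(x))
>>>             return 1.0 + 2.0 * terms.sum(axis=0).real
>>>         return pdf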
>>
>> Sounds like statistics for seasonal time series to me, except you
>> might have a lot more regularity in the repeated pattern than in
>> economics or climate research.
>
> Well, yes and no. Many pulsars have a stable average pulse profile,
> which is what we usually care about, especially where we're in the
> regime of one photon every few tens of thousands of turns. On the
> other hand, when dealing with orbital variability, I ran into a
> statistical question I wasn't sure how to pose: if I add all the
> (partial) orbits of data I have together, I get a distribution that is
> pretty clearly nonconstant. But is that variability actually repeating
> from orbit to orbit, or is it really just random variability? In a
> seasonal context it sounds less exotic: if I combine data from the
> last three years, I find a statistically significant excess of rain in
> the fall. But is this just the effect of one very rainy season, that
> happened to be in the fall one year, or is it that each fall there is
> an excess? The individual years, unfortunately, don't seem to have
> enough events to detect significant non-uniformity.
>
> Anne
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>


