[SciPy-User] Kuiper test (was Re: scipy.stats.fit inquiry)

josef.pktd at gmail.com
Wed Oct 28 15:21:30 EDT 2009


On Wed, Oct 28, 2009 at 2:59 PM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> I put the Kuiper test (and some other non-uniformity tests) up on github:
> http://github.com/aarchiba/kuiper/
> I'd like to clean up the tests somewhat (to reduce the code
> duplication between test_kuiper, test_htest, and test_zm2) before
> proposing any of it be included in scipy.

Also, docstrings explaining the purpose of htest and zm2 would be
helpful. I have no idea what they are.

zm2 or Zm2?

In most cases (I think), we switched to non-random random tests in
stats by fixing a seed. It won't test any new cases, but in last year's
discussion that seemed to be the preferred way.
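
Something like this is what I mean (a minimal sketch; I'm assuming
your module exposes kuiper() returning the statistic and a p-value):

    import numpy as np
    from kuiper import kuiper   # hypothetical import from your module

    def test_kuiper_uniform_fixed_seed():
        # fixed seed -> reproducible "random" data, as in the stats tests
        np.random.seed(1234)
        x = np.random.uniform(size=1000)
        d, p = kuiper(x)
        assert p > 0.01   # uniform data should not be rejected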

Josef

>
> Anne
>
> 2009/10/21 Anne Archibald <peridot.faceted at gmail.com>:
>> 2009/10/21  <josef.pktd at gmail.com>:
>>> When the parameters of the t distribution
>>> are estimated, then the test has no power.
>>> I don't know if adjusting the critical values
>>> would help at least for comparing similar
>>> distributions like norm and t.
>>> (same problem with kstest)
>>
>> Yes, this test has its limitations. It's also not as sensitive as one
>> might wish for distinguishing multimodal distributions from a
>> constant.
>>
>>> To save on the lambda function, the cdf could take args given as
>>> an argument to kuiper. Maybe the specification of the cdf argument
>>> could be taken from kstest, but not (the generality of) the rvs argument
>>> in kstest.
>>
>> I loathe interfaces that pass args; I find they make the interface
>> more confusing while adding no functionality. But I realize that
>> they're standard and not going away (what about the currying
>> functionality in newer Pythons?), so I'll add it.
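
For concreteness, the two call styles would be something like this
(the args keyword on kuiper is hypothetical, mirroring kstest):

    import numpy as np
    from scipy import stats

    data = np.random.standard_t(10, size=200)

    # current style: close over the shape parameter with a lambda
    d, p = kuiper(data, lambda x: stats.t.cdf(x, 10))

    # kstest-style: pass shape parameters through args
    d, p = kuiper(data, stats.t.cdf, args=(10,))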
>>
>>> Your cdf_from_intervals looks nice for a univariate step-function
>>> distribution. There are several pieces for this on the mailing list,
>>> but I never got around to collecting them.
>>
>> This was just a quick hack because I needed it; I'm not sure it's
>> worth including in scipy.stats, since an appropriately general version
>> might be just as difficult to call as to reimplement.
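
For reference, a minimal sketch of what I understand such a function
to do (my reconstruction, not necessarily Anne's implementation): a
step-function pdf on given breakpoints yields a piecewise-linear cdf:

    import numpy as np

    def cdf_from_intervals(breaks, weights):
        # breaks: sorted breakpoints (len n+1); weights: mass per interval
        cum = np.concatenate(([0.0], np.cumsum(np.asarray(weights, float))))
        cum /= cum[-1]          # normalize so cdf(breaks[-1]) == 1
        return lambda x: np.interp(x, breaks, cum)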
>>
>>> I haven't quite figured out your interval functions. Is fold_intervals
>>> supposed to convert a set of overlapping intervals into a non-overlapping
>>> partition with associated weights? For example, for merging
>>> two histograms with non-identical bounds?
>>
>> That among other things. The context was that I had observations of an
>> X-ray binary, consisting of several different segments with several
>> different instruments. So I'd have one observation covering orbital
>> phases 0.8 to 1.3 (i.e. 0.5 turns) with effective area 10 cm^2 (say),
>> and another covering phases 0.2 to 2.1 (i.e. 1.9 turns) with effective
>> area 12 cm^2 (say). So a constant flux from the source would produce a
>> non-constant distribution of photons modulo 1; fold_intervals is
>> designed to convert those two weighted spans to a collection of
>> non-overlapping weighted intervals covering [0,1), for use in a CDF
>> for the Kuiper test.  The histogram stuff is there to allow plotting
>> the results in a familiar form.
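
If I understand correctly, the folding could be sketched like this (my
reconstruction; intervals are (start, stop, weight) spans in turns):

    import numpy as np

    def fold_intervals(intervals):
        # breakpoints where the coverage can change, folded into [0, 1]
        breaks = {0.0, 1.0}
        for a, b, w in intervals:
            breaks.update((a % 1.0, b % 1.0))
        breaks = np.array(sorted(breaks))
        mids = 0.5 * (breaks[:-1] + breaks[1:])
        totals = np.zeros(len(mids))
        for a, b, w in intervals:
            # how many times [a, b] covers each midpoint modulo 1
            totals += w * (np.floor(b - mids) - np.ceil(a - mids) + 1)
        return breaks, totals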
>>
>>> I haven't figured out the shift invariance.
>>
>> The idea is just that if you have, say, a collection X of samples from
>> [0,1) that you are testing for uniformity, the Kuiper test returns the
>> same result for X and for (X+0.3)%1. Since for pulsars there is often
>> no natural start time, this is a sensible thing to ask from any test
>> for uniformity.
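
So a quick check of the invariance would be (assuming kuiper(x) tests
against the uniform cdf on [0, 1) by default):

    import numpy as np

    np.random.seed(0)
    x = np.random.uniform(size=500)
    d1, p1 = kuiper(x)
    d2, p2 = kuiper((x + 0.3) % 1.0)
    # the two statistics agree up to floating point; a plain
    # Kolmogorov-Smirnov statistic would not be shift-invariant
    assert abs(d1 - d2) < 1e-12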
>>
>>>> I also have some code for the H test (essentially looking at the
>>>> Fourier coefficients to find how many harmonics are worth including
>>>> and what the significance is; de Jager et al. 1989, "A powerful test for
>>>> weak periodic signals with unknown light curve shape in sparse data").
>>>> But circular statistics seem to be a bit obscure, so I'm not sure how
>>>> much effort should go into putting this in scipy.
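
If I read the paper's description correctly, the statistic is
something like this (a sketch, not Anne's code; phases in turns):

    import numpy as np

    def h_test(phases, max_harmonics=20):
        phases = np.asarray(phases)
        n = len(phases)
        k = np.arange(1, max_harmonics + 1)
        # empirical Fourier coefficients c_k = sum_j exp(2*pi*i*k*phi_j)
        c = np.exp(2j * np.pi * k[:, None] * phases).sum(axis=1)
        z2 = 2.0 / n * np.cumsum(np.abs(c) ** 2)   # Z^2_m for m = 1..20
        h = z2 - 4.0 * k + 4.0                     # penalize extra harmonics
        m = np.argmax(h) + 1                       # harmonics actually used
        return h[m - 1], m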
>>>
>>> For sure they are obscure to me, but there are a few circular descriptive
>>> statistics in stats.morestats, and I saw a matlab toolbox on the
>>> file exchange. I have figured out by now that quite different
>>> statistics are used in various fields.
>>> I guess it's all up to you.
>>
>> To be honest, these are a little obscure even among pulsar
>> astronomers; here as elsewhere histograms seem to dominate.
>>
>>> From your description (below), I would think that for circular
>>> distributions we would need different generic functions that
>>> don't fit in the current distribution classes: integration on a
>>> circle (?) instead of integration on the real line.
>>
>> Basically, yes. Of course you can view integration on a circle as
>> integration on a line parameterizing the circle, but there's no way
>> around the fact that the first moment is a complex number which
>> contains both position (argument) and concentration (magnitude)
>> information.
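
Right, i.e. for a sample of phases (in turns) the first circular
moment would be:

    import numpy as np

    phases = np.random.uniform(size=1000)
    m1 = np.mean(np.exp(2j * np.pi * phases))
    mean_phase = np.angle(m1) / (2 * np.pi)   # position (argument)
    concentration = np.abs(m1)                # ~0 uniform, -> 1 concentrated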
>>
>>> If I define boundaries, vonmises.a and vonmises.b, then
>>> I think, you would not be able to calculate pdf(x), cdf(x)
>>> and so on for x outside of the support [a,b]. I don't know
>>> whether it is possible to define a,b but not enforce them in
>>> _argcheck and only use them as integration bounds.
>>
>> It's useful to be able to work with pdf and cdf outside the bounds,
>> particularly since the effect of loc is to shift the bounds - so any
>> code that takes loc as a free parameter would otherwise have to work
>> hard to avoid going outside the bounds.
>>
>>> I checked briefly: vonmises.moment and vonmises.stats
>>> work (integrating over the ppf, not the pdf);
>>> generic moment calculation with the pdf (_mom0_sc) and
>>> entropy fail. fit seems to work with the normal random
>>> variable, but I thought I got lots of "weird" fit results
>>> before.
>>>
>>> For the real line, this looks "strange"
>>>>>> stats.vonmises.pdf(np.linspace(0,3*np.pi,10), 2)
>>> array([ 0.51588541,  0.18978364,  0.02568442,  0.00944877,  0.02568442,
>>>        0.18978364,  0.51588541,  0.18978364,  0.02568442,  0.00944877])
>>>>>> stats.vonmises.cdf(np.linspace(0,3*np.pi,10), 2)
>>> array([  0.5       ,   0.89890776,   0.98532805,   1.        ,
>>>         7.28318531,   7.28318531,   7.28318531,   7.28318531,
>>>         7.28318531,  13.56637061])
>>
>> Unfortunately it seems that my attempt to fix the von Mises
>> distribution (using vonmises_cython) went badly awry, and now it
>> reports nonsense for values outside -pi..pi. I could have sworn I had
>> tests for that. I intend to fix this during the weekend's code sprint.
>> "Correct" behaviour (IMHO) would make the CDF the antiderivative of
>> the PDF, even though this means it leaves [0,1].
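
A slow but unambiguous reference implementation of that convention
(a quadrature sketch, just to pin down the intended behaviour):

    import numpy as np
    from scipy import integrate, stats

    def vonmises_cdf(x, kappa):
        # antiderivative convention: cdf(-pi) = 0, cdf(pi) = 1,
        # and cdf(x + 2*pi) = cdf(x) + 1 outside [-pi, pi)
        x = np.atleast_1d(np.asarray(x, dtype=float))
        turns = np.floor((x + np.pi) / (2 * np.pi))
        xr = x - 2 * np.pi * turns          # reduced to [-pi, pi)
        base = np.array([integrate.quad(stats.vonmises.pdf, -np.pi, xi,
                                        args=(kappa,))[0] for xi in xr])
        return base + turns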
>>
>> Incidentally, how would I get nosetests to run only some of the
>> distribution tests (ideally just the ones I choose)? When I just do
>> "nosetests scipy.stats" it takes *forever*.
>>
>>> ppf, calculated generically, works and looks only a little bit strange.
>>>
>>>>>> stats.vonmises.ppf([0, 1e-10,1-1e-10,1],2)
>>> array([       -Inf, -3.14159264,  3.14159264,         Inf])
>>>
>>>>
>>>> It's worth defining the boundaries, but I don't think you'll get
>>>> useful moment calculations out of it, since circular moments are
>>>> defined differently from linear moments: rather than int(x**n*pdf(x))
>>>> they're int(exp(2j*pi*n*x)*pdf(x)). There are explicit formulas for
>>>> these for the von Mises distribution, though, and they're useful
>>>> because they are essentially the Fourier coefficients of the pdf.
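
The closed form uses the modified Bessel functions (a sketch):

    import numpy as np
    from scipy import special

    def vonmises_circular_moment(n, kappa, mu=0.0):
        # E[exp(i*n*theta)] for theta ~ von Mises(mu, kappa); these are
        # exactly the Fourier coefficients of the pdf
        return np.exp(1j * n * mu) * special.iv(n, kappa) / special.iv(0, kappa)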
>>>>
>>>> For context, I've done some work with representing circular
>>>> probability distributions (= pulse profiles) in terms of their moments
>>>> (= Fourier series), and in particular using a kernel density estimator
>>>> with either a sinc kernel (= truncation) or a von Mises kernel (=
>>>> scaling coefficients by von Mises moments). The purpose was to get
>>>> not-too-biased estimates of the modulation amplitudes of X-ray
>>>> pulsations; the paper has been kind of sidetracked by other work.
>>>
>>> Sounds like statistics for seasonal time series to me, except you
>>> might have a lot more regularity in the repeated pattern than in
>>> economics or climate research.
>>
>> Well, yes and no. Many pulsars have a stable average pulse profile,
>> which is what we usually care about, especially where we're in the
>> regime of one photon every few tens of thousands of turns. On the
>> other hand, when dealing with orbital variability, I ran into a
>> statistical question I wasn't sure how to pose: if I add all the
>> (partial) orbits of data I have together, I get a distribution that is
>> pretty clearly nonconstant. But is that variability actually repeating
>> from orbit to orbit, or is it really just random variability? In a
>> seasonal context it sounds less exotic: if I combine data from the
>> last three years, I find a statistically significant excess of rain in
>> the fall. But is this just the effect of one very rainy season that
>> happened to be in the fall one year, or is it that each fall there is
>> an excess? The individual years, unfortunately, don't seem to have
>> enough events to detect significant non-uniformity.
>>
>> Anne


