[SciPy-User] Kuiper test (was Re: scipy.stats.fit inquiry)

Anne Archibald peridot.faceted at gmail.com
Wed Oct 28 22:48:42 EDT 2009


2009/10/28  <josef.pktd at gmail.com>:
> On Wed, Oct 28, 2009 at 2:59 PM, Anne Archibald
> <peridot.faceted at gmail.com> wrote:
>> I put the Kuiper test (and some other non-uniformity tests) up on github:
>> http://github.com/aarchiba/kuiper/
>> I'd like to clean up the tests somewhat (to reduce the code
>> duplication between test_kuiper, test_htest, and test_zm2) before
>> proposing any of it be included in scipy.
>
> Also, docstrings about the purpose of htest and zm2 would be
> helpful. I have no idea what they are for.

Ahem. Yes, the code was pretty impenetrable without them. Sorry about
that! I lose my (few) good coding habits very easily when there's a
proposal due.
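
In the meantime, briefly: zm2 computes the Z_m^2 statistic (Fourier
power summed over the first m harmonics of the folded phases), and
htest computes the H statistic of de Jager et al. (1989), which picks
the best number of harmonics with a penalty. Roughly this, as a minimal
sketch of the standard formulas (not the code in the repo; the function
names are just for illustration, and phases are assumed to lie in
[0, 1)):

import numpy as np

def zm2_stat(phases, m):
    # Z_m^2: Rayleigh-style Fourier power summed over harmonics 1..m
    k = np.arange(1, m + 1)[:, None]
    ang = 2 * np.pi * k * np.asarray(phases)      # shape (m, n_phases)
    return (2.0 / len(phases)) * np.sum(np.cos(ang).sum(axis=1) ** 2 +
                                        np.sin(ang).sum(axis=1) ** 2)

def h_stat(phases, max_m=20):
    # de Jager et al. (1989) H test: the best Z_m^2, penalized for the
    # number of harmonics used
    return max(zm2_stat(phases, m) - 4 * m + 4 for m in range(1, max_m + 1))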

> zm2 or Zm2 ?

That's a good question. The test is usually written in LaTeX as
$Z_m^2$, so in a sense Zm2 would be more natural, but it's not PEP 8
(and I'm really used to caps indicating a class by now).

> In most cases (I think), we switched to non-random "random" tests in
> stats by fixing a seed. It won't test any new cases, but in last
> year's discussion that seemed to be the preferred way.

That's a good point too. Do you set the seed in a per-function setup,
or just once per module? The latter seems simpler and saner but in
principle it introduces interrelations between tests. How
deterministic is nosetests? (e.g. does it always run all the tests in
a file in the same order no matter how or when it runs them?) I guess
running single tests will give them different random numbers than
running the module as a whole... anyway, seeding once per module
should be enough to avoid end users being bitten by statistical
failures.
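
For concreteness, per-module seeding is just something like this (nose
picks up a module-level setup function; the seed value is arbitrary):

import numpy as np

def setup_module():
    # fix the RNG once per test module so the "random" test data is
    # reproducible from run to run
    np.random.seed(1234)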

Anne

> Josef
>
>>
>> Anne
>>
>> 2009/10/21 Anne Archibald <peridot.faceted at gmail.com>:
>>> 2009/10/21  <josef.pktd at gmail.com>:
>>>> When the parameters of the t distribution
>>>> are estimated, the test has no power.
>>>> I don't know if adjusting the critical values
>>>> would help at least for comparing similar
>>>> distributions like norm and t.
>>>> (same problem with kstest)
>>>
>>> Yes, this test has its limitations. It's also not as sensitive as one
>>> might wish for distinguishing multimodal distributions from a
>>> constant (i.e. uniform) distribution.
>>>
>>>> To avoid needing a lambda, the cdf could take args passed as
>>>> an argument to kuiper. Maybe the specification of the cdf argument
>>>> could be taken from kstest, but not (the generality of) the rvs argument
>>>> in kstest.
>>>
>>> I loathe interfaces that pass args; I find they make the interface
>>> more confusing while adding no functionality. But I realize they're
>>> standard and not going away (what about the currying functionality,
>>> i.e. functools.partial, in newer Pythons?), so I'll add it.
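>>>
>>> Presumably something like this (a sketch only; the exact signature is
>>> not settled, and the import assumes the module layout in the repo):
>>>
>>> import numpy as np
>>> from scipy import stats
>>> from kuiper import kuiper
>>>
>>> data = np.random.standard_normal(100)
>>> loc, scale = 0.0, 1.0
>>>
>>> # current style, with a lambda:
>>> kuiper(data, lambda x: stats.norm.cdf(x, loc=loc, scale=scale))
>>> # with an args parameter, mirroring kstest:
>>> kuiper(data, stats.norm.cdf, args=(loc, scale))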
>>>
>>>> Your cdf_from_intervals looks nice for a univariate step-function distribution.
>>>> There are several pieces for this on the mailing list, but I never got around
>>>> to collecting them.
>>>
>>> This was just a quick hack because I needed it; I'm not sure it's
>>> worth including in scipy.stats, since an appropriately general version
>>> might be just as difficult to call as to reimplement.
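>>>
>>> The idea is just a piecewise-linear CDF built from interval weights;
>>> roughly this (a sketch, not the repo code, treating the weights as
>>> unnormalized densities on each piece):
>>>
>>> import numpy as np
>>>
>>> def cdf_from_intervals(breaks, weights):
>>>     # breaks: n+1 edges; weights: density on each of the n pieces
>>>     cum = np.concatenate(([0.0], np.cumsum(np.diff(breaks) * weights)))
>>>     cum /= cum[-1]
>>>     return lambda x: np.interp(x, breaks, cum)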
>>>
>>>> I haven't quite figured out your interval functions. Is fold_intervals
>>>> supposed to convert a set of overlapping intervals into a non-overlapping
>>>> partition with associated weights? For example, for merging
>>>> two histograms with non-identical bounds?
>>>
>>> That among other things. The context was that I had observations of an
>>> X-ray binary, consisting of several different segments with several
>>> different instruments. So I'd have one observation covering orbital
>>> phases 0.8 to 1.3 (i.e. 0.5 turns) with effective area 10 cm^2 (say),
>>> and another covering phases 0.2 to 2.1 (i.e. 1.9 turns) with effective
>>> area 12 cm^2 (say). So a constant flux from the source would produce a
>>> non-constant distribution of photons modulo 1; fold_intervals is
>>> designed to convert those two weighted spans to a collection of
>>> non-overlapping weighted intervals covering [0,1), for use in a CDF
>>> for the Kuiper test.  The histogram stuff is there to allow plotting
>>> the results in a familiar form.
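>>>
>>> In other words, roughly the following (an illustrative sketch, not the
>>> repo code):
>>>
>>> import math
>>>
>>> def fold_intervals(intervals):
>>>     # intervals: (start, stop, weight) spans measured in turns; returns
>>>     # non-overlapping (lo, hi, weight) pieces covering [0, 1)
>>>     breaks = sorted({0.0, 1.0} |
>>>                     {a % 1 for a, b, w in intervals} |
>>>                     {b % 1 for a, b, w in intervals})
>>>     pieces = []
>>>     for lo, hi in zip(breaks[:-1], breaks[1:]):
>>>         mid = 0.5 * (lo + hi)
>>>         # each span covers phase mid once for every turn k with
>>>         # start <= mid + k < stop
>>>         wt = sum(w * (math.ceil(b - mid) - math.ceil(a - mid))
>>>                  for a, b, w in intervals)
>>>         pieces.append((lo, hi, wt))
>>>     return pieces
>>>
>>> fold_intervals([(0.8, 1.3, 10.0), (0.2, 2.1, 12.0)])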
>>>
>>>> I haven't figured out the shift invariance.
>>>
>>> The idea is just that if you have, say, a collection X of samples from
>>> [0,1) that you are testing for uniformity, the Kuiper test returns the
>>> same result for X and for (X+0.3)%1. Since for pulsars there is often
>>> no natural start time, this is a sensible thing to ask from any test
>>> for uniformity.
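>>>
>>> Concretely (x is a sample of phases in [0, 1), and kuiper here is the
>>> function from the repo, testing against the uniform distribution):
>>>
>>> import numpy as np
>>> from kuiper import kuiper
>>>
>>> x = np.random.random(1000)
>>> kuiper(x)              # whatever this returns...
>>> kuiper((x + 0.3) % 1)  # ...this returns exactly the same thing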
>>>
>>>>> I also have some code for the H test (essentially looking at the
>>>>> Fourier coefficients to find how many harmonics are worth including
>>>>> and what the significance is; de Jager et al. 1989 "A powerful test for
>>>>> weak periodic signals with unknown light curve shape in sparse data").
>>>>> But circular statistics seem to be a bit obscure, so I'm not sure how
>>>>> much effort should go into putting this in scipy.
>>>>
>>>> They are certainly obscure to me, but there are a few circular descriptive
>>>> statistics in stats.morestats, and I saw a MATLAB toolbox on the
>>>> File Exchange. I have figured out by now that quite different
>>>> statistics are used in various fields.
>>>> I guess it's all up to you.
>>>
>>> To be honest, these are a little obscure even among pulsar
>>> astronomers; here as elsewhere histograms seem to dominate.
>>>
>>>> From your description (below), I would think that for circular
>>>> distributions we would need different generic functions that
>>>> don't fit in the current distribution classes: integration on a
>>>> circle (?) instead of integration on the real line.
>>>
>>> Basically, yes. Of course you can view integration on a circle as
>>> integration on a line parameterizing the circle, but there's no way
>>> around the fact that the first moment is a complex number which
>>> contains both position (argument) and concentration (magnitude)
>>> information.
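>>>
>>> In code terms, for an array x of phases in [0, 1):
>>>
>>> import numpy as np
>>>
>>> x = np.random.random(1000)                 # stand-in data
>>> m1 = np.exp(2j * np.pi * x).mean()         # first circular moment
>>> mean_phase = np.angle(m1) / (2 * np.pi)    # position (argument)
>>> concentration = np.abs(m1)                 # ~1 concentrated, ~0 spread out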
>>>
>>>> If I define boundaries, vonmises.a and vonmises.b, then
>>>> I think you would not be able to calculate pdf(x), cdf(x)
>>>> and so on for x outside of the support [a,b]. I don't know
>>>> whether it is possible to define a,b but not enforce them in
>>>> _argcheck and only use them as integration bounds.
>>>
>>> It's useful to be able to work with pdf and cdf outside the bounds,
>>> particularly since the effect of loc is to shift the bounds - so any
>>> code that takes loc as a free parameter would otherwise have to work
>>> hard to avoid going outside the bounds.
>>>
>>>> I checked briefly: vonmises.moment and vonmises.stats
>>>> work (integrating over the ppf, not the pdf), while the
>>>> generic moment calculation with the pdf (_mom0_sc) and
>>>> entropy fail. fit seems to work with the normal random
>>>> variable, but I thought I got lots of "weird" fit results
>>>> before.
>>>>
>>>> For the real line, this looks "strange"
>>>>>>> stats.vonmises.pdf(np.linspace(0,3*np.pi,10), 2)
>>>> array([ 0.51588541,  0.18978364,  0.02568442,  0.00944877,  0.02568442,
>>>>        0.18978364,  0.51588541,  0.18978364,  0.02568442,  0.00944877])
>>>>>>> stats.vonmises.cdf(np.linspace(0,3*np.pi,10), 2)
>>>> array([  0.5       ,   0.89890776,   0.98532805,   1.        ,
>>>>         7.28318531,   7.28318531,   7.28318531,   7.28318531,
>>>>         7.28318531,  13.56637061])
>>>
>>> Unfortunately it seems that my attempt to fix the von Mises
>>> distribution (using vonmises_cython) went badly awry, and now it
>>> reports nonsense for values outside -pi..pi. I could have sworn I had
>>> tests for that. I intend to fix this during the weekend's code sprint.
>>> "Correct" behaviour (IMHO) would make the CDF the antiderivative of
>>> the PDF, even though this means it leaves [0,1].
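>>>
>>> The invariant I'd want the tests to enforce is roughly this (a
>>> sketch):
>>>
>>> import numpy as np
>>> from numpy.testing import assert_allclose
>>> from scipy import stats
>>>
>>> x = np.linspace(-10, 10, 101)
>>> # the antiderivative of a periodic pdf gains exactly 1 per period
>>> assert_allclose(stats.vonmises.cdf(x + 2 * np.pi, 2),
>>>                 stats.vonmises.cdf(x, 2) + 1)
>>>
>>> together with periodicity of the pdf itself.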
>>>
>>> Incidentally, how would I get nosetests to run only some of the
>>> distributions tests (ideally just the ones I choose)? When I just do
>>> "nosetests scipy.stats" it takes *forever*.
>>>
>>>> ppf, calculated generically, works and looks only a little bit strange.
>>>>
>>>>>>> stats.vonmises.ppf([0, 1e-10,1-1e-10,1],2)
>>>> array([       -Inf, -3.14159264,  3.14159264,         Inf])
>>>>
>>>>>
>>>>> It's worth defining the boundaries, but I don't think you'll get
>>>>> useful moment calculations out of it, since circular moments are
>>>>> defined differently from linear moments: rather than int(x**n*pdf(x))
>>>>> they're int(exp(2j*pi*n*x)*pdf(x)). There are explicit formulas for
>>>>> these for the von Mises distribution, though, and they're useful
>>>>> because they are essentially the Fourier coefficients of the pdf.
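>>>>>
>>>>> For the record, for a von Mises distribution with mean mu and
>>>>> concentration kappa (angles in radians) the n-th circular moment is
>>>>> I_n(kappa)/I_0(kappa) * exp(1j*n*mu), e.g. (a sketch):
>>>>>
>>>>> import numpy as np
>>>>> from scipy.special import iv
>>>>>
>>>>> def vonmises_circular_moment(n, kappa, mu=0.0):
>>>>>     # ratio of modified Bessel functions times a phase factor
>>>>>     return iv(n, kappa) / iv(0, kappa) * np.exp(1j * n * mu)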
>>>>>
>>>>> For context, I've done some work with representing circular
>>>>> probability distributions (= pulse profiles) in terms of their moments
>>>>> (= Fourier series), and in particular using a kernel density estimator
>>>>> with either a sinc kernel (= truncation) or a von Mises kernel (=
>>>>> scaling coefficients by von Mises moments). The purpose was to get
>>>>> not-too-biased estimates of the modulation amplitudes of X-ray
>>>>> pulsations; the paper has been kind of sidetracked by other work.
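>>>>>
>>>>> A stripped-down version of that estimator, just to make the idea
>>>>> concrete (not the paper's code; angles in radians):
>>>>>
>>>>> import numpy as np
>>>>> from scipy.special import iv
>>>>>
>>>>> def circular_kde(samples, theta, m=20, kappa=None):
>>>>>     # empirical Fourier (trigonometric-moment) coefficients
>>>>>     k = np.arange(1, m + 1)
>>>>>     A = np.cos(np.outer(k, samples)).mean(axis=1)
>>>>>     B = np.sin(np.outer(k, samples)).mean(axis=1)
>>>>>     # sinc kernel = plain truncation at harmonic m; von Mises kernel =
>>>>>     # damp the coefficients by the von Mises moments I_k/I_0
>>>>>     damp = np.ones(m) if kappa is None else iv(k, kappa) / iv(0, kappa)
>>>>>     ck = np.cos(np.outer(theta, k))
>>>>>     sk = np.sin(np.outer(theta, k))
>>>>>     series = (ck * (damp * A) + sk * (damp * B)).sum(axis=1)
>>>>>     return (1 + 2 * series) / (2 * np.pi)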
>>>>
>>>> Sounds like statistics for seasonal time series to me, except you
>>>> might have a lot more regularity in the repeated pattern than in
>>>> economics or climate research.
>>>
>>> Well, yes and no. Many pulsars have a stable average pulse profile,
>>> which is what we usually care about, especially where we're in the
>>> regime of one photon every few tens of thousands of turns. On the
>>> other hand, when dealing with orbital variability, I ran into a
>>> statistical question I wasn't sure how to pose: if I add all the
>>> (partial) orbits of data I have together, I get a distribution that is
>>> pretty clearly nonconstant. But is that variability actually repeating
>>> from orbit to orbit, or is it really just random variability? In a
>>> seasonal context it sounds less exotic: if I combine data from the
>>> last three years, I find a statistically significant excess of rain in
>>> the fall. But is this just the effect of one very rainy season that
>>> happened to be in the fall one year, or is it that each fall there is
>>> an excess? The individual years, unfortunately, don't seem to have
>>> enough events to detect significant non-uniformity.
>>>
>>> Anne