[SciPy-Dev] studentized range approximations
Bruce Southey
bsouthey at gmail.com
Fri Jun 3 09:37:32 EDT 2011
On 06/02/2011 01:15 PM, Roger Lew wrote:
> Hi Josef,
>
> Thanks for the feedback. The "choice" of using cdf is more a carryover
> from how the algorithm is described by Gleason. Perhaps it would be
> best to have it match your intuition and accept the survival function?
>
> Feel free to treat it like your own for statsmodels. I will definitely
> check out some of your multcomp module so I'm not reinventing the wheel.
>
> In the grand scheme, I could see these having a home in scipy.special
> (after more extensive review of course). That is where I went to look
> for it when I didn't find it in distributions.
>
> Roger
>
> On Thu, Jun 2, 2011 at 1:38 AM, <josef.pktd at gmail.com
> <mailto:josef.pktd at gmail.com>> wrote:
>
> On Thu, Jun 2, 2011 at 12:53 AM, Roger Lew <rogerlew at gmail.com
> <mailto:rogerlew at gmail.com>> wrote:
> > Hi,
> > I have implemented some approximations for studentized range
> quantiles and
> > probabilities based on John R. Gleason's (1999) "An accurate,
> non-iterative
> > approximation for studentized range quantiles." Computational
> Statistics &
> > Data Analysis, (31), 147-158.
> > Probability approximations rely on scipy.optimize.fminbound. The
> functions
> > accept both scalars or array-like data thanks to
> numpy.vectorize. A fair
> > amount of validation and testing has been conducted on the code.
> More
> > details can be found here: http://code.google.com/p/qsturng-py/
> > I welcome any thoughts as to whether you all think this might be
> useful to
> > add to SciPy or make into a scikit. Any general comments would
> be helpful as
> > well. I should mention I'm a cognitive neuroscientist by trade,
> my use of
> > statistical jargon probably isn't that good.
>
> Hi Roger,
>
> I'm very interested in using this in scikits.statsmodels. The table
> that I am currently using is very limited
> http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.get_tukeyQcrit.html#scikits.statsmodels.sandbox.stats.multicomp.get_tukeyQcrit
>
> >From a quick look it looks very good.
> What I found a bit confusing is that qstrung takes the probability of
> the cdf and not of the survival function. Without reading the
> docstring carefully enough, I interpreted it as a p-value (upper tail)
> especially since pstrung returns the upper tail probability,
>
> >>> import scikits.statsmodels.sandbox.stats.multicomp as smmc
> >>> for i in range(3, 10):
> x = qsturng(0.95, i, 16)
> x, psturng(x, i, 16), smmc.get_tukeyQcrit(i, 16, 0.05),
> smmc.tukey_pvalues(x*np.ones(i), 16)[0]
>
>
> (3.647864471854692, 0.049999670839029453, array(3.6499999999999999),
> 0.050092818925981608)
> (4.0464124382823847, 0.050001178443752514, array(4.0499999999999998),
> 0.037164602483501508)
> (4.3332505094058114, 0.049999838126148499, array(4.3300000000000001),
> 0.029954033157223781)
> (4.5573603020371234, 0.049999276281813887, array(4.5599999999999996),
> 0.025276987281047769)
> (4.7410585998112742, 0.049998508166777755, array(4.7400000000000002),
> 0.022010630154416622)
> (4.8965400268915289, 0.04999983345598491, array(4.9000000000000004),
> 0.019614841752159107)
> (5.0312039650945257, 0.049999535359310343, array(5.0300000000000002),
> 0.017721848279719898)
>
> The last column is (in my interpretation) supposed to be 0.05. I was
> trying to get the pvalues for Tukeys range statistic through the
> multivariate t-distribution, but the unit test looks only at one point
> (and I ran out of time to work on this during Christmas break). Either
> there is a bug (it's still in the sandbox) or my interpretation is
> wrong.
>
> The advantage of the multivariate t-distribution is that it allows for
> arbitrary correlation, but it's not a substitute for pre-calculated
> tables for standard cases/distributions because it's much too slow.
>
> ------------
> As a bit of background on the multiple testing, multiple comparison
> status in statsmodels:
>
> The tukeyhsd test has one test case against R, but it has too many
> options (it allows unequal variances and unequal sample sizes, that
> still need to be checked.)
>
> http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.tukeyhsd.html#scikits.statsmodels.sandbox.stats.multicomp.tukeyhsd
>
> What I did manage to finish and verify against R
>
> http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.multipletests.html#scikits.statsmodels.sandbox.stats.multicomp.multipletests
>
> multiple testing for general linear models is very incomplete
>
> and as an aside: I'm not a statistician, and if the module in the
> statsmodels sandbox is still a mess then it's because I took me a long
> time and many functions to figure out what's going on.
> ----------
>
> scipy.special has a nice collection of standard distributions
> functions, but it would be very useful to have some additional
> distributions either in scipy or scikits.statsmodels available, like
> your studentized range statistic, (and maybe some others in multiple
> comparisons, like Duncan, Dunnet) and Anderson-Darling, and ...
>
> Thanks,
>
> Josef
>
>
> > Regards,
> > Roger
> > Roger Lew
> >
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev at scipy.org <mailto:SciPy-Dev at scipy.org>
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> >
> >
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org <mailto:SciPy-Dev at scipy.org>
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
The problem I have is that this is still an approximation that probably
covers what most situations.
Do you have the Copenhaver & Holland (1988) approach or know any
BSD-licensed Python code for it?
Also, the code probably needs to conform to numpy/scipy standards (which
I don't remember the links).
You also have vary less desirable features:
1) hard coded numbers like '1e38' - numpy does define infinity (np.inf).
2) comparisons to 'inf' ('v==inf') that are not desirable.
Bruce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20110603/7eba1da6/attachment.html>
More information about the SciPy-Dev
mailing list