[SciPy-Dev] studentized range approximations

Fri Jun 3 09:37:32 EDT 2011

On 06/02/2011 01:15 PM, Roger Lew wrote:
> Hi Josef,
>
> Thanks for the feedback. The "choice" of using cdf is more a carryover 
> from how the algorithm is described by Gleason. Perhaps it would be 
> best to have it match your intuition and accept the survival function?
>
> Feel free to treat it like your own for statsmodels. I will definitely 
> check out some of your multcomp module so I'm not reinventing the wheel.
>
> In the grand scheme, I could see these having a home in scipy.special 
> (after more extensive review of course). That is where I went to look 
> for it when I didn't find it in distributions.
>
> Roger
>
> On Thu, Jun 2, 2011 at 1:38 AM, <josef.pktd at gmail.com 
> <mailto:josef.pktd at gmail.com>> wrote:
>
>     On Thu, Jun 2, 2011 at 12:53 AM, Roger Lew <rogerlew at gmail.com
>     <mailto:rogerlew at gmail.com>> wrote:
>     > Hi,
>     > I have implemented some approximations for studentized range
>     quantiles and
>     > probabilities based on John R. Gleason's (1999) "An accurate,
>     non-iterative
>     > approximation for studentized range quantiles." Computational
>     Statistics &
>     > Data Analysis, (31), 147-158.
>     > Probability approximations rely on scipy.optimize.fminbound. The
>     functions
>     > accept both scalars or array-like data thanks to
>     numpy.vectorize. A fair
>     > amount of validation and testing has been conducted on the code.
>     More
>     > details can be found here: http://code.google.com/p/qsturng-py/
>     > I welcome any thoughts as to whether you all think this might be
>     useful to
>     > add to SciPy or make into a scikit. Any general comments would
>     be helpful as
>     > well. I should mention I'm a cognitive neuroscientist by trade,
>     my use of
>     > statistical jargon probably isn't that good.
>
>     Hi Roger,
>
>     I'm very interested in using this in scikits.statsmodels. The table
>     that I am currently using is very limited
>     http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.get_tukeyQcrit.html#scikits.statsmodels.sandbox.stats.multicomp.get_tukeyQcrit
>
>     >From a quick look it looks very good.
>     What I found a bit confusing is that qstrung takes the probability of
>     the cdf and not of the survival function. Without reading the
>     docstring carefully enough, I interpreted it as a p-value (upper tail)
>     especially since pstrung returns the upper tail probability,
>
>     >>> import scikits.statsmodels.sandbox.stats.multicomp as smmc
>     >>> for i in range(3, 10):
>            x = qsturng(0.95, i, 16)
>            x, psturng(x, i, 16), smmc.get_tukeyQcrit(i, 16, 0.05),
>     smmc.tukey_pvalues(x*np.ones(i), 16)[0]
>
>
>     (3.647864471854692, 0.049999670839029453, array(3.6499999999999999),
>     0.050092818925981608)
>     (4.0464124382823847, 0.050001178443752514, array(4.0499999999999998),
>     0.037164602483501508)
>     (4.3332505094058114, 0.049999838126148499, array(4.3300000000000001),
>     0.029954033157223781)
>     (4.5573603020371234, 0.049999276281813887, array(4.5599999999999996),
>     0.025276987281047769)
>     (4.7410585998112742, 0.049998508166777755, array(4.7400000000000002),
>     0.022010630154416622)
>     (4.8965400268915289, 0.04999983345598491, array(4.9000000000000004),
>     0.019614841752159107)
>     (5.0312039650945257, 0.049999535359310343, array(5.0300000000000002),
>     0.017721848279719898)
>
>     The last column is (in my interpretation) supposed to be 0.05. I was
>     trying to get the pvalues for Tukeys range statistic through the
>     multivariate t-distribution, but the unit test looks only at one point
>     (and I ran out of time to work on this during Christmas break). Either
>     there is a bug (it's still in the sandbox) or my interpretation is
>     wrong.
>
>     The advantage of the multivariate t-distribution is that it allows for
>     arbitrary correlation, but it's not a substitute for pre-calculated
>     tables for standard cases/distributions because it's much too slow.
>
>     ------------
>     As a bit of background on the multiple testing, multiple comparison
>     status in statsmodels:
>
>     The tukeyhsd test has one test case against R, but it has too many
>     options (it allows unequal variances and unequal sample sizes, that
>     still need to be checked.)
>
>     http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.tukeyhsd.html#scikits.statsmodels.sandbox.stats.multicomp.tukeyhsd
>
>     What I did manage to finish and verify against R
>
>     http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandbox.stats.multicomp.multipletests.html#scikits.statsmodels.sandbox.stats.multicomp.multipletests
>
>     multiple testing for general linear models is very incomplete
>
>     and as an aside: I'm not a statistician, and if the module in the
>     statsmodels sandbox is still a mess then it's because I took me a long
>     time and many functions to figure out what's going on.
>     ----------
>
>     scipy.special has a nice collection of standard distributions
>     functions, but it would be very useful to have some additional
>     distributions either in scipy or scikits.statsmodels available, like
>     your studentized range statistic, (and maybe some others in multiple
>     comparisons, like Duncan, Dunnet) and Anderson-Darling, and ...
>
>     Thanks,
>
>     Josef
>
>
>     > Regards,
>     > Roger
>     > Roger Lew
>     >
>     > _______________________________________________
>     > SciPy-Dev mailing list
>     > SciPy-Dev at scipy.org <mailto:SciPy-Dev at scipy.org>
>     > http://mail.scipy.org/mailman/listinfo/scipy-dev
>     >
>     >
>     _______________________________________________
>     SciPy-Dev mailing list
>     SciPy-Dev at scipy.org <mailto:SciPy-Dev at scipy.org>
>     http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
The problem I have is that this is still an approximation that probably 
covers what most situations.
Do you have the Copenhaver & Holland (1988) approach or know any 
BSD-licensed Python code for it?

Also, the code probably needs to conform to numpy/scipy standards (which 
I don't remember the links).
You also have vary less desirable features:
1) hard coded numbers like '1e38' - numpy does define infinity (np.inf).
2) comparisons to 'inf' ('v==inf') that are not desirable.

Bruce

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20110603/7eba1da6/attachment.html>