[SciPy-Dev] Deprecate stats.glm?
Bruce Southey
bsouthey at gmail.com
Thu Jun 3 13:14:44 EDT 2010
On 06/03/2010 10:32 AM, Nathaniel Smith wrote:
> On Thu, Jun 3, 2010 at 6:38 AM,<josef.pktd at gmail.com> wrote:
>
>> On Thu, Jun 3, 2010 at 8:50 AM, Warren Weckesser
>> <warren.weckesser at enthought.com> wrote:
>>
>>> stats.glm looks like it was started and then abandoned without being
>>> finished. It was last touched in November 2007. Should this function
>>> be deprecated so it can eventually be removed?
>>>
>> My thoughts when I looked at it was roughly:
>> leave it alone since it's working, but don't "advertise" it because we
>> should get a better replacement.
>> similar to linregress the more general version will be available when
>> scipy.stats gets the full OLS model.
>>
> Wait, what does 'glm' have to do with OLS (or t-tests) anyway? Surely
> if anything it *should* be a function that fits, you know, GLMs
> (generalized linear models)?
>
> I guess this is a vote for removing it, because GLMs are one of the
> fundamental stats models that people will look for, and having some
> weird, broken, other thing in the obvious place is just confusing and
> looks really bad.
>
> -- Nathaniel
Perhaps people should actually read the code before jumping to incorrect
conclusions. stats.glm is not similar to linregress unless you know how to
'trick' linregress with a dummy variable.
Granted, stats.glm is crippled, but it is well intended (like most
things in scipy.stats). According to the docstring, it was intended for
general linear models, like SAS's glm procedure and R's glm function
(without the generalized part). At present it just does a 1-way ANOVA with
only two levels, but it could do more.
>>> from scipy import stats
>>> drug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
...         2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
>>> postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9,
...           1, 9, 13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
>>> t_val, t_probs = stats.glm(postrt, drug)
>>> t_val
-1.5463854661015379
>>> t_probs
0.13324062984741347
>>> # create dummies to trick linregress
>>> idrug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
...          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> print stats.linregress(idrug, postrt)
(-3.9000000000000044, 9.2000000000000011, -0.280506586484015,
0.13324062984741378, 2.5220102526131258)
>>> # slope / stderr: this is the t-value of stats.glm
>>> -3.9000000000000044/2.5220102526131258
-1.5463854661015373
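To spell out why the two results agree, here is a minimal sketch in plain Python 3 (no SciPy needed) that computes the t-value both ways on the same data. It assumes that stats.glm, for two levels, amounts to a pooled-variance two-sample t-test; that is inferred from the matching numbers above, not checked against the stats.glm source. The helper names pooled_t and slope_t are made up for this illustration.

```python
import math

drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9,
          1, 9, 13, 10, 18, 5, 23, 12, 5, 16, 1, 20]


def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal) variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))


def slope_t(x, y):
    """Simple linear regression: return (slope, stderr, slope / stderr)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    stderr = math.sqrt(sse / (n - 2) / sxx)
    return slope, stderr, slope / stderr


# Route 1: compare the two drug groups directly.
group1 = [y for d, y in zip(drug, postrt) if d == 1]
group2 = [y for d, y in zip(drug, postrt) if d == 2]
t_pooled = pooled_t(group1, group2)

# Route 2: regress the response on a 0/1 dummy for drug == 1.
idrug = [1 if d == 1 else 0 for d in drug]
slope, stderr, t_slope = slope_t(idrug, postrt)

print(t_pooled, t_slope)
```

With a single 0/1 dummy the slope is just the difference between the two group means, and its standard error reduces to the pooled standard error of that difference, so both routes should give the t-value of about -1.546 reported above.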
I have major concerns about deprecating code when no alternative has been
proposed for such an important statistical function. As
David has said elsewhere, this is just Python code and has little or no
maintenance cost. The full solution is probably Jonathan Taylor's glm
class, but that uses the formula class and is for generalized linear
models. However, I don't see that landing in scipy anytime soon.
So the options are:
1) Rewrite the internals to address the current limitation - not
hard, but it would need an API change and, more importantly, better
options exist.
2) The Cookbook OLS recipe is a superior version of linregress but needs
changes to get ANOVA etc. added:
http://www.scipy.org/Cookbook/OLS
3) The best candidate that I know of that can replace both stats.linregress
and stats.glm is Skipper's try_ols_anova.py code from pystatsmodel (at
least as posted on the list). But I am not sure what the current state of
that is.
4) Some other option?
Bruce