[SciPy-Dev] Deprecate stats.glm?

Thu Jun 3 13:14:44 EDT 2010

On 06/03/2010 10:32 AM, Nathaniel Smith wrote:
> On Thu, Jun 3, 2010 at 6:38 AM,<josef.pktd at gmail.com>  wrote:
>    
>> On Thu, Jun 3, 2010 at 8:50 AM, Warren Weckesser
>> <warren.weckesser at enthought.com>  wrote:
>>      
>>> stats.glm looks like it was started and then abandoned without being
>>> finished.  It was last touched in November 2007.  Should this function
>>> be deprecated so it can eventually be removed?
>>>        
>> My thoughts when I looked at it was roughly:
>> leave it alone since it's working, but don't "advertise" it because we
>> should get a better replacement.
>> similar to linregress the more general version will be available when
>> scipy.stats gets the full OLS model.
>>      
> Wait, what does 'glm' have to do with OLS (or t-tests) anyway? Surely
> if anything it *should* be a function that fits, you know, GLMs
> (generalized linear models)?
>
> I guess this is a vote for removing it, because GLMs are one of the
> fundamental stats models that people will look for, and having some
> weird, broken, other thing in the obvious place is just confusing and
> looks really bad.
>
> -- Nathaniel
>    
Perhaps people should actually read the code before jumping to incorrect 
conclusions. It is not similar to linregress unless you know how to 
'trick' linregress.

Granted, stats.glm is crippled, but it is well intended (like most 
things in scipy.stats). The docstring intended it to handle general 
linear models, like SAS's glm procedure and R's glm function (without 
the generalized part). At present it just does 1-way ANOVA with only 
two levels, but it could do more.

 >>> from scipy import stats
 >>> drug=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 ...       2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
 >>> postrt=[6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9,
 ...         1, 9, 13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
 >>> t_val,t_probs=stats.glm(postrt,drug)
 >>> t_val
-1.5463854661015379
 >>> t_probs
0.13324062984741347
 >>> # create dummies to trick linregress
 >>> idrug=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 ...        0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 >>> print stats.linregress(idrug, postrt)
(-3.9000000000000044, 9.2000000000000011, -0.280506586484015,
0.13324062984741378, 2.5220102526131258)
 >>> # slope/stderr: this is the t-value of stats.glm
 >>> -3.9000000000000044/2.5220102526131258
-1.5463854661015373
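
For anyone who wants to check these numbers without stats.glm, here is a
minimal sketch. It assumes a SciPy version in which stats.ttest_ind,
stats.f_oneway and the named fields of the stats.linregress result
(slope, stderr) are available. The pooled two-sample t-test, the 1-way
ANOVA with two levels and the dummy-variable regression should all agree
with the t-value and probability shown in the session, since F = t**2 for
two groups. The variable names g1, g2 and t_reg are mine, not from
stats.glm.

```python
import numpy as np
from scipy import stats

# Same data as in the session above.
drug = np.array([1] * 10 + [2] * 20)
postrt = np.array([6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
                   0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
                   13, 10, 18, 5, 23, 12, 5, 16, 1, 20], dtype=float)

g1 = postrt[drug == 1]
g2 = postrt[drug == 2]

# Pooled-variance two-sample t-test (equal variances is the default).
t_val, t_prob = stats.ttest_ind(g1, g2)

# 1-way ANOVA with two levels: F equals t squared, same p-value.
f_val, f_prob = stats.f_oneway(g1, g2)

# Regression on a 0/1 dummy: slope / stderr is the same t statistic.
res = stats.linregress((drug == 1).astype(float), postrt)
t_reg = res.slope / res.stderr

print(t_val, t_prob)
print(f_val, f_prob)
print(t_reg, res.pvalue)
```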

I have major concerns about deprecating code when there is no 
alternative proposed for such an important statistical function. As 
David has said elsewhere, this is just Python code and has little or no 
maintenance cost. The full solution is probably Jonathan Taylor's glm 
class but that uses the formula class and is for generalized linear 
models. However, I don't see that in scipy anywhere soon.

So the options are:

1) Rewrite the internals to address the current limitation - not hard, 
but it would need an API change and, more importantly, better options 
exist.
2) The Cookbook OLS code is a superior version of linregress but needs 
changes to get ANOVA etc. added:
http://www.scipy.org/Cookbook/OLS
3) The best candidate that I know that can replace both stats.linregress 
and stats.glm is Skipper's try_ols_anova.py code from pystatsmodel (at 
least posted on the list).  But I am not sure what the current state of 
that is.
4) Some other option?
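
To make options 1 and 2 a bit more concrete, here is a rough sketch of a
1-way ANOVA done through OLS on dummy variables, which is not limited to
two levels. It is only an illustration under my own assumptions, not the
stats.glm internals and not the Cookbook or statsmodels code: the
function name oneway_anova_ols and its signature are hypothetical, and it
relies on numpy.linalg.lstsq and the scipy.stats.f distribution.

```python
import numpy as np
from scipy import stats


def oneway_anova_ols(y, groups):
    """F-test for a single factor, fitted by least squares (sketch)."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    levels = np.unique(groups)

    # Design matrix: intercept plus one 0/1 dummy per level, dropping
    # the first level as the reference category.
    columns = [np.ones_like(y)]
    columns += [(groups == lev).astype(float) for lev in levels[1:]]
    X = np.column_stack(columns)

    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    ss_resid = resid @ resid
    ss_total = np.sum((y - y.mean()) ** 2)
    df_model = len(levels) - 1
    df_resid = len(y) - len(levels)

    f_val = ((ss_total - ss_resid) / df_model) / (ss_resid / df_resid)
    p_val = stats.f.sf(f_val, df_model, df_resid)
    return f_val, p_val


# The drug data from the session above (two levels).
drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
          0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
          13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
f_val, p_val = oneway_anova_ols(postrt, drug)
print(f_val, p_val)
```

With two levels the F statistic is the square of the stats.glm t-value
and the probability is the same; with more levels the result can be
compared against stats.f_oneway.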

Bruce
