[SciPy-Dev] Deprecate stats.glm?

Thu Jun 3 15:15:10 EDT 2010

On 06/03/2010 12:53 PM, josef.pktd at gmail.com wrote:
> On Thu, Jun 3, 2010 at 1:14 PM, Bruce Southey<bsouthey at gmail.com>  wrote:
>    
>> On 06/03/2010 10:32 AM, Nathaniel Smith wrote:
>>
>> On Thu, Jun 3, 2010 at 6:38 AM,<josef.pktd at gmail.com>  wrote:
>>
>>
>> On Thu, Jun 3, 2010 at 8:50 AM, Warren Weckesser
>> <warren.weckesser at enthought.com>  wrote:
>>
>>
>> stats.glm looks like it was started and then abandoned without being
>> finished.  It was last touched in November 2007.  Should this function
>> be deprecated so it can eventually be removed?
>>
>>
>> My thoughts when I looked at it were roughly:
>> leave it alone since it's working, but don't "advertise" it because we
>> should get a better replacement.
>> As with linregress, the more general version will be available when
>> scipy.stats gets the full OLS model.
>>
>>
>> Wait, what does 'glm' have to do with OLS (or t-tests) anyway? Surely
>> if anything it *should* be a function that fits, you know, GLMs
>> (generalized linear models)?
>>
>> I guess this is a vote for removing it, because GLMs are one of the
>> fundamental stats models that people will look for, and having some
>> weird, broken, other thing in the obvious place is just confusing and
>> looks really bad.
>>
>> -- Nathaniel
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>
>> Perhaps people should actually read the code before jumping to incorrect
>> conclusions. It is not similar to linregress unless you know how to 'trick'
>> linreg.
>>      
> It's similar in the sense that it promises a lot, but is very limited
> or "crippled", and that the replacement is not just a quick rewrite.
>
>    
>> Granted, stats.glm is crippled, but it is well intended (like most
>> things in scipy.stats). The docstring intended it to fit general linear
>> models such as SAS's glm procedure and R's glm function (without the
>> generalized part). At present it just does 1-way ANOVA with only two
>> levels, but it could do more.
>>
>> >>> drug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>> ...         2, 2, 2, 2, 2, 2, 2, 2]
>> >>> postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
>> ...           13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
>> >>> t_val, t_probs = stats.glm(postrt, drug)
>> >>> t_val
>> -1.5463854661015379
>> >>> t_probs
>> 0.13324062984741347
>> >>> idrug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> ...          0, 0, 0, 0, 0, 0, 0, 0]  # create dummies to trick linreg
>> >>> print stats.linregress(idrug, postrt)
>> (-3.9000000000000044, 9.2000000000000011, -0.280506586484015,
>> 0.13324062984741378, 2.5220102526131258)
>> >>> -3.9000000000000044/2.5220102526131258  # this is the t-value of stats.glm
>> -1.5463854661015373
>>
>>
>> I have major concerns about deprecating code when there is no alternative
>> proposed for such an important statistical function. As David has said
>> elsewhere, this is just Python code and has little or no maintenance cost.
>> The full solution is probably Jonathan Taylor's glm class but that uses the
>> formula class and is for generalized linear models. However, I don't see
>> that in scipy anywhere soon.
>>      
> Currently the alternative is using ttest_ind, which produces the same result.
>    
Not exactly, since you have to reformat the input. You can also
reproduce ttest_ind with linregress...
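
To make the reformatting concrete, here is a minimal sketch using the
data quoted above. It assumes a recent SciPy in which linregress returns
a result object with slope, stderr and pvalue attributes (the 2010
version returned a plain tuple); the variable names are only for
illustration.

```python
from scipy import stats

drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
          0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
          13, 10, 18, 5, 23, 12, 5, 16, 1, 20]

# ttest_ind takes one sample per group, so split the response by drug level.
group1 = [y for y, d in zip(postrt, drug) if d == 1]
group2 = [y for y, d in zip(postrt, drug) if d == 2]
t_ttest, p_ttest = stats.ttest_ind(group1, group2)  # pooled variance by default

# linregress takes x and y, so recode drug as a 0/1 dummy instead.
idrug = [1 if d == 1 else 0 for d in drug]
fit = stats.linregress(idrug, postrt)
t_linreg = fit.slope / fit.stderr  # t-value of the slope

print(t_ttest, t_linreg)
print(p_ttest, fit.pvalue)
```

Both routes should give the t-value and p-value reported for stats.glm
above; the only difference is how the input has to be arranged.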

> The cost of glm is the confusion that it creates if there is such a
> big mismatch between name and result, which is exactly the response
> Nathaniel and I had.
>    
Generalized linear models are 'new' (so 1972), but general linear models
are older (I think back to the 1950s, when the relationship between
ANOVA and regression was shown). Yet both go back to the 1800s. But sure,
anyone is going to get confused if they come from the S/R world and
don't check to see if the function at least has distribution and link
arguments/options.
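
The ANOVA/regression relationship can be checked on the data quoted
above. The following is only a sketch in plain Python: one_way_anova_f
is a hypothetical helper, not anything in scipy.stats. It computes the
one-way ANOVA F statistic for any number of levels; with two levels, F
should equal the square of the t-value from the dummy regression.

```python
def one_way_anova_f(response, levels):
    """F statistic for a one-way layout with any number of levels."""
    groups = {}
    for y, g in zip(response, levels):
        groups.setdefault(g, []).append(y)
    n = len(response)
    k = len(groups)
    grand_mean = sum(response) / n
    ss_between = 0.0
    ss_within = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_between += len(ys) * (mean - grand_mean) ** 2
        ss_within += sum((y - mean) ** 2 for y in ys)
    # Mean square between groups over mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
          0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
          13, 10, 18, 5, 23, 12, 5, 16, 1, 20]

f_stat = one_way_anova_f(postrt, drug)
t_val = -1.5463854661015379  # t-value quoted earlier in this thread
print(f_stat, t_val ** 2)
```

Nothing in the helper is specific to two levels, which is the sense in
which a general linear model routine "could do more".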

> And Warren was proposing to deprecate it not to delete it right away.
>
>    
>> So the options are:
>>
>> 1) Rewrite the internals to address the current limitation - not hard
>> but would need an API change and, more importantly, better options exist.
>> 2) OLS is a superior version to linregress but needs changes to get ANOVA
>> etc added
>> http://www.scipy.org/Cookbook/OLS
>> 3) The best candidate that I know that can replace both stats.linregress and
>> stats.glm is Skipper's try_ols_anova.py code from pystatsmodel (at least
>> posted on the list).  But I am not sure what the current state of that is.
>> 4) Some other option?
>>      
> Yes, move the OLS model and associated code from statsmodels to
> scipy.stats (maybe we can discuss this after Skipper's gsoc), or use
> statsmodels as addition to scipy.stats.
>
> http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/sandbox/regression/try_ols_anova.py
> was just my initial experimental script,
Sorry - I just recalled his script but not the history.

> and I think we might still
> need a few versions (with Skipper's data and dummy handling and maybe
> Jonathan's formula framework) before we come to a final design.
>
> I don't think any duplication of effort to expand on stats.linregress
> or stats.glm is productive.
>
> Josef
>
>    
I totally agree, as adding that at the same time justifies deprecation
of both functions.
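
For reference, deprecating rather than deleting could look roughly like
the sketch below. It uses only the standard library; the deprecated
decorator, the warning text and the stand-in glm body are all
hypothetical and are not how scipy.stats does or would do it.

```python
import functools
import warnings


def deprecated(message):
    """Hypothetical decorator: warn on every call, then run the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("glm is deprecated; use ttest_ind instead")  # hypothetical text
def glm(data, para):
    """Stand-in body only; not the scipy.stats implementation."""
    return None


# Record the warning to show that callers still get a result plus a notice.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    glm([1, 2], [1, 2])

print(caught[0].category.__name__, caught[0].message)
```

Existing callers keep working during the deprecation period and are
pointed at the replacement before the function is removed.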

Bruce

>>
>> Bruce