[SciPy-Dev] Deprecate stats.glm?

Thu Jun 3 13:53:08 EDT 2010

On Thu, Jun 3, 2010 at 1:14 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 06/03/2010 10:32 AM, Nathaniel Smith wrote:
>
> On Thu, Jun 3, 2010 at 6:38 AM,  <josef.pktd at gmail.com> wrote:
>
>
> On Thu, Jun 3, 2010 at 8:50 AM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
>
>
> stats.glm looks like it was started and then abandoned without being
> finished.  It was last touched in November 2007.  Should this function
> be deprecated so it can eventually be removed?
>
>
> My thoughts when I looked at it were roughly:
> leave it alone since it's working, but don't "advertise" it because we
> should get a better replacement.
> Similar to linregress, the more general version will be available when
> scipy.stats gets the full OLS model.
>
>
> Wait, what does 'glm' have to do with OLS (or t-tests) anyway? Surely
> if anything it *should* be a function that fits, you know, GLMs
> (generalized linear models)?
>
> I guess this is a vote for removing it, because GLMs are one of the
> fundamental stats models that people will look for, and having some
> weird, broken, other thing in the obvious place is just confusing and
> looks really bad.
>
> -- Nathaniel
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
> Perhaps people should actually read the code before jumping to incorrect
> conclusions. It is not similar to linregress unless you know how to 'trick'
> linreg.

It's similar in the sense that it promises a lot, but is very limited
or "crippled", and that the replacement is not just a quick rewrite.

>
> Granted that stats.glm is crippled, but it is well intended (like most
> things in scipy.stats). The docstring intended it to fit general linear models
> such as SAS's glm procedure and R's glm function (without the generalized part).
> At present it just does 1-way ANOVA with only two levels, but could do more.
>
> >>> drug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> ...         2, 2, 2, 2, 2, 2, 2, 2]
> >>> postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
> ...           13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
> >>> t_val, t_probs = stats.glm(postrt, drug)
> >>> t_val
> -1.5463854661015379
> >>> t_probs
> 0.13324062984741347
> >>> idrug = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> ...          0, 0, 0, 0, 0, 0, 0, 0]  # create dummies to trick linreg
> >>> print stats.linregress(idrug, postrt)
> (-3.9000000000000044, 9.2000000000000011, -0.280506586484015,
> 0.13324062984741378, 2.5220102526131258)
> >>> -3.9000000000000044/2.5220102526131258  # this is the t-value of stats.glm
> -1.5463854661015373
>
>
> I have major concerns about deprecating code when there is no alternative
> proposed for such an important statistical function. As David has said
> elsewhere, this is just Python code and has little or no maintenance cost.
> The full solution is probably Jonathan Taylor's glm class but that uses the
> formula class and is for generalized linear models. However, I don't see
> that in scipy anywhere soon.

Currently the alternative is using ttest_ind, which produces the same result.
The cost of glm is the confusion that it creates if there is such a
big mismatch between name and result, which is exactly the response
Nathaniel and I had.
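
A minimal sketch of that equivalence, using the data from Bruce's example.
It assumes a SciPy version where linregress returns a result object with
.slope, .stderr and .pvalue attributes; the variable names are only
illustrative.

```python
from scipy import stats

# Data from the example quoted above: 10 observations on drug 1, 20 on drug 2.
drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
          13, 10, 18, 5, 23, 12, 5, 16, 1, 20]

# Split the response by treatment level.
group1 = [y for y, d in zip(postrt, drug) if d == 1]
group2 = [y for y, d in zip(postrt, drug) if d == 2]

# Two-sample t-test with pooled variance (the default, equal_var=True).
t_stat, p_value = stats.ttest_ind(group1, group2)

# The same test expressed as a regression on a 0/1 dummy variable.
idrug = [1 if d == 1 else 0 for d in drug]
fit = stats.linregress(idrug, postrt)
t_from_regression = fit.slope / fit.stderr

print(t_stat, p_value)                # about -1.546 and 0.1332, as in the thread
print(t_from_regression, fit.pvalue)  # same values up to rounding
```

With only two groups, the pooled-variance t-test and the regression on a
0/1 dummy give the same t-value and p-value, which is why ttest_ind can
stand in for what stats.glm currently does.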

And Warren was proposing to deprecate it, not to delete it right away.

>
> So the options are:
>
> 1) Rewrite the internals to address the current limitation - not hard
> but would need an API change and more importantly better options exist.
> 2) OLS is a superior version to linregress but needs changes to get ANOVA
> etc added
> http://www.scipy.org/Cookbook/OLS
> 3) The best candidate that I know that can replace both stats.linregress and
> stats.glm is Skipper's try_ols_anova.py code from pystatsmodel (at least
> posted on the list).  But I am not sure what the current state of that is.
> 4) Some other option?

Yes, move the OLS model and associated code from statsmodels to
scipy.stats (maybe we can discuss this after Skipper's GSoC), or use
statsmodels as an addition to scipy.stats.

http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/sandbox/regression/try_ols_anova.py
was just my initial experimental script, and I think we might still
need a few versions (with Skipper's data and dummy handling and maybe
Jonathan's formula framework) before we come to a final design.
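
For the two-level case, the dummy handling amounts to very little. The
following is a rough stdlib-only sketch, not the statsmodels code:
dummy_ols is a hypothetical helper name, and it only handles a single
0/1 dummy, so a real design would still need multi-level factors and a
proper model class.

```python
import math

# Data from the example quoted earlier in the thread.
drug = [1] * 10 + [2] * 20
postrt = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
          13, 10, 18, 5, 23, 12, 5, 16, 1, 20]


def dummy_ols(y, levels, reference):
    """OLS of y on an intercept plus one 0/1 dummy (hypothetical helper).

    The dummy is 0 for observations at the reference level and 1 otherwise.
    Returns (intercept, slope, standard error of the slope).
    """
    x = [0.0 if lev == reference else 1.0 for lev in levels]
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    stderr = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, stderr


intercept, slope, stderr = dummy_ols(postrt, drug, reference=2)
t_value = slope / stderr

print(intercept)  # mean of the reference group (drug 2), about 9.2
print(slope)      # difference in group means (drug 1 minus drug 2), about -3.9
print(t_value)    # about -1.546, the t-value reported by stats.glm
```

The intercept is the mean of the reference group and the slope is the
difference between the two group means, so the slope's t-value is the
two-sample t statistic.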

I don't think any duplication of effort to expand on stats.linregress
or stats.glm is productive.

Josef

>
>
> Bruce