[SciPy-Dev] Deprecate stats.glm?

Thu Jun 3 12:03:23 EDT 2010

On Thu, Jun 3, 2010 at 11:51 AM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> josef.pktd at gmail.com wrote:
>> On Thu, Jun 3, 2010 at 10:49 AM,  <josef.pktd at gmail.com> wrote:
>>
>>> On Thu, Jun 3, 2010 at 10:18 AM, Warren Weckesser
>>> <warren.weckesser at enthought.com> wrote:
>>>
>>>> josef.pktd at gmail.com wrote:
>>>>
>>>>> On Thu, Jun 3, 2010 at 8:50 AM, Warren Weckesser
>>>>> <warren.weckesser at enthought.com> wrote:
>>>>>
>>>>>
>>>>>> stats.glm looks like it was started and then abandoned without being
>>>>>> finished.  It was last touched in November 2007.  Should this function
>>>>>> be deprecated so it can eventually be removed?
>>>>>>
>>>>>>
>>>>> My thoughts when I looked at it were roughly:
>>>>> leave it alone since it's working, but don't "advertise" it because we
>>>>> should get a better replacement.
>>>>>
>>>>>
>>>> How does one not advertise it?
>>>>
>>>> The docstring is wrong, incomplete, and not useful.
>>>>
>>> That's how it's not advertised.
>>>
>>>
>>>> It has no tests.
>>>>
>>> It has no tests (except for examples on my computer), but the results
>>> (for the basic case that I looked at) are correct.
>>> If we increase test coverage or start removing functions that don't
>>> have tests yet, I would work on box-cox, and several other functions
>>> in morestats.py . Mainly a question of priorities.
>>>
>>>
>>>> Currently, it appears that it just duplicates ttest_ind.  As far as I
>>>> know, no one is working on it.
>>>>
>>>> Leaving it in wastes users' time reading about it.  It erodes confidence
>>>> in other functions in scipy:  "Is foo() a good function, or has it been
>>>> abandoned, like glm()?"
>>>>
>>>> To me, it is an ideal candidate for removal.
>>>>
>>> If we apply strict criteria along those lines, we can reduce the size
>>> of scipy.stats.stats and scipy.stats.morestats, I guess, by at least a
>>> third. (Which I would do if I could start from scratch).
>>> A big fraction of functions in scipy.stats are in the category "no one
>>> is working on it".
>>>
>>> For glm specifically, I don't see any big cost in leaving it in, nor
>>> in deprecating it, and in that case I usually stick to the status quo.
>>> But you can just as well deprecate it and point to ttest_ind.
>>>
>>> And for "bigger fish" like pdfmoments and pdf_approx, I never received
>>> a reply or opinion on the mailing list.
>>>
>>> statsmodels will have (or better, has in the sandbox) a generalization
>>> for glm, that works for any number of groups and includes both t_test
>>> and f_test.
>>>
>>
>> Actually, now that I have to think about glm again, I'm also in favor
>> of deprecating it, since I can always point to the general version in
>> statsmodels.
>>
>> Josef
>>
>>
>
> Heh... meanwhile I'm starting to think that my call for deprecation was
> premature, and maybe all it really needs is an updated, accurate
> docstring that explains what the current implementation does.  :)

You should stay firm to compensate for my reluctance to change things
that are not (obviously or really) broken. :)

As I said, I'm really pretty indifferent in this case. (But I wouldn't
want to see widespread use of it, because as Nathaniel said, the name
is very misleading for the current result.)

So, if you want to keep it, mention clearly that it only does a t-test.
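
For the docstring, something along these lines might illustrate what the
function amounts to. This is only a rough sketch under assumptions:
two_group_ttest is a made-up name, not an existing scipy function, and it
uses np.unique for the group labels (instead of _support.unique) so that
string labels are handled too. It is not the current glm code.

```python
import numpy as np
from scipy import stats


def two_group_ttest(data, labels):
    """Hypothetical helper: t-test between the two groups named in `labels`."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    # np.unique returns the sorted distinct labels; works for ints and strings
    groups = np.unique(labels)
    if len(groups) != 2:
        raise ValueError("expected exactly two groups, got %d" % len(groups))
    first = data[labels == groups[0]]
    second = data[labels == groups[1]]
    # ttest_ind defaults to the pooled-variance two-sample t-test
    return stats.ttest_ind(first, second)


rng = np.random.RandomState(0)  # fixed seed, only to make the example repeatable
x = (np.arange(20) > 9).astype(int)
y = x + rng.randn(20)

t_int, p_int = two_group_ttest(y, x)                                # integer labels
t_str, p_str = two_group_ttest(y, (np.arange(20) > 9).astype(str))  # string labels
t_ref, p_ref = stats.ttest_ind(y[:10], y[10:])                      # direct call
```

With both integer and string labels, the statistic and p-value should agree
with the direct ttest_ind call on the two halves of y.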

Josef

>
> Warren
>
>>
>>
>>
>>> Josef
>>>
>>>
>>>> Warren
>>>>
>>>>
>>>>> Similar to linregress, the more general version will be available when
>>>>> scipy.stats gets the full OLS model.
>>>>>
>>>>>
>>>>>
>>>>> >>> x = (np.arange(20)>9).astype(int)
>>>>> >>> y = x + np.random.randn(20)
>>>>> >>> stats.glm(y,x)
>>>>> (-1.7684287512254859, 0.093933208147769023)
>>>>>
>>>>> >>> stats.ttest_ind(y[:10], y[10:])
>>>>> (-1.7684287512254859, 0.093933208147768926)
>>>>>
>>>>> In its current form it doesn't do much that differs from ttest_ind,
>>>>> apart from the different argument structure.
>>>>>
>>>>> I think it could be made to work on string labels if _support.unique
>>>>> is replaced by np.unique (which we are doing in statsmodels).
>>>>>
>>>>>
>>>>>
>>>>> >>> x = (np.arange(20)>9).astype(str)
>>>>> >>> x
>>>>> array(['F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'T', 'T', 'T',
>>>>>        'T', 'T', 'T', 'T', 'T', 'T', 'T'],
>>>>>       dtype='|S1')
>>>>>
>>>>> >>> stats.glm(y,x)
>>>>> Traceback (most recent call last):
>>>>>   File "<pyshell#24>", line 1, in <module>
>>>>>     stats.glm(y,x)
>>>>>   File "C:\Josef\_progs\Subversion\scipy-trunk_after\trunk\dist\scipy-0.8.0.dev6416.win32\Programs\Python25\Lib\site-packages\scipy\stats\stats.py", line 3315, in glm
>>>>>     p = _support.unique(para)
>>>>>   File "C:\Josef\_progs\Subversion\scipy-trunk_after\trunk\dist\scipy-0.8.0.dev6416.win32\Programs\Python25\Lib\site-packages\scipy\stats\_support.py", line 45, in unique
>>>>>     if np.add.reduce(np.equal(uniques,item).flat) == 0:
>>>>> AttributeError: 'NotImplementedType' object has no attribute 'flat'
>>>>>
>>>>> Josef
>>>>>
>>>>>
>>>>>
>>>>>> Warren
>>>>>>
>>>>>> _______________________________________________
>>>>>> SciPy-Dev mailing list
>>>>>> SciPy-Dev at scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev