[SciPy-user] predicting values based on (linear) models

Bruce Southey bsouthey at gmail.com
Thu Jan 15 10:09:39 EST 2009


josef.pktd at gmail.com wrote:
> On Wed, Jan 14, 2009 at 11:24 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>   
>> On Jan 14, 2009, at 10:15 PM, josef.pktd at gmail.com wrote:
>>     
>>> The functions in stats that I tested or rewrote are usually identical
>>> to around 1e-15, but in some cases R has a more accurate test
>>> distribution for small samples (option "exact" in R), while in
>>> scipy.stats we only have the asymptotic distribution.
>>>       
>> We could try to reimplement part of it in C. In any case, it might
>> be worth outputting a warning (or at least being very explicit in the doc)
>> that the results may not hold for samples smaller than 10-20.
>>     
>
> I am not a "C" person and I never went much beyond HelloWorld in C.
> I just checked some of the doc strings, and I usually mention that
> we use the asymptotic distribution, but there are still pretty vague
> statements in some of the doc strings, such as
>
> "The p-values are not entirely reliable but are probably reasonable for
> datasets larger than 500 or so."
>
>
>   
The 'exact' tests are usually Fisher's exact tests
(http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
different from the asymptotic tests and can get computationally very
demanding. Also, I do not think that such statements should be part of
the doc strings.
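
To make the difference concrete, here is a minimal sketch of an exact
versus an asymptotic p-value for a single 2x2 table, using only the
Python standard library. The function names are mine and hypothetical,
not scipy.stats API, and the chi-square version assumes 1 degree of
freedom with no continuity correction.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


def chi2_asymptotic(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    return erfc(sqrt(stat / 2))


exact_p = fisher_exact_two_sided(3, 1, 1, 3)
asymptotic_p = chi2_asymptotic(3, 1, 1, 3)
print(exact_p, asymptotic_p)
```

For the table [[3, 1], [1, 3]] (8 observations) the exact p-value is
34/70, about 0.49, while the asymptotic one is about 0.16, so the two
can disagree badly in small samples. The exact version enumerates every
table with the same margins, which is where the cost comes from as the
counts grow.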

>>> Also, not all
>>> existing functions in scipy.stats are tested (yet).
>>>       
>> We should also try to make sure missing data are properly supported
>> (not always possible) and that the results are consistent between the
>> masked and non-masked versions.
>>
>>     
>
> I added a ticket so we don't forget to check this.
>
>
>
>   
>> IMHO, the readiness to incorporate user feedback is here. The feedback
>> is not, or at least not as much as we'd like.
>>     
>
> That depends on the subpackage: some problems in stats have been
> reported and known for quite some time, and the expected lifetime of a
> ticket can be pretty long. I was looking at different Python packages
> that use statistics, and many of them are reluctant to use scipy, while
> numpy looks very well established. But I suppose this will improve
> with time and the user base will increase, especially with the recent
> improvements in the build/distribution and the documentation.
>
> Josef
>   
There are different reasons for a lack of user base. One reason for R's
large user base is that many, many statistics classes use it.

Some of the reasons that I do not use scipy for stats (and I have not
looked at this in some time) include:
1) The difficulty of installation, which is considerably better now.
2) Lack of support for missing values, as virtually everything that I
have worked with involves missing values at some stage.
3) Lack of a suitable statistical modeling interface where you can
specify the model to be fit without having to create each individual
array. The approach must work for a range of scenarios.
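
As a rough illustration of points 2) and 3), the sketch below fits a
linear model from an R-style formula string and drops rows that contain
missing values. Everything here is an assumption of mine: fit_formula,
the "y ~ x1 + x2" syntax, and the use of None for missing values are
hypothetical and are not an existing scipy interface. It only handles
additive numeric terms and will fail with a ZeroDivisionError on a
singular design.

```python
def fit_formula(formula, data):
    """Ordinary least squares for 'y ~ x1 + x2' over a dict of column lists.

    Rows in which any needed value is None are dropped (listwise deletion).
    """
    response, rhs = (part.strip() for part in formula.split("~"))
    terms = [t.strip() for t in rhs.split("+")]
    needed = terms + [response]
    # Build (design row, response) pairs, prepending 1.0 for the intercept.
    rows = [
        ([1.0] + [data[t][i] for t in terms], data[response][i])
        for i in range(len(data[response]))
        if all(data[name][i] is not None for name in needed)
    ]
    k = len(terms) + 1
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [
        [sum(x[r] * x[c] for x, _ in rows) for c in range(k)]
        + [sum(x[r] * y for x, y in rows)]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    return dict(zip(["Intercept"] + terms, coefs))


# Two of the five rows have a missing value and are ignored.
data = {
    "y": [1.0, 3.0, None, 7.0, 9.0],
    "x": [0.0, 1.0, 2.0, 3.0, None],
}
fit = fit_formula("y ~ x", data)
print(fit)
```

The point is only that the user names the columns in the model and never
builds the design matrix by hand. A real interface would also need
categorical terms, interactions, and a better missing-value policy than
dropping rows.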

Bruce


