[SciPy-user] predicting values based on (linear) models

Thu Jan 15 12:36:43 EST 2009

josef.pktd at gmail.com wrote:
>> There are different reasons for a lack of user base. One of the reasons
>> for R is that many, many statistics classes use it.
>>
>> Some of the reasons that I do not use scipy for stats (and have not
>> looked at this in some time) included:
>> 1) The difficulty of installation which is considerably better now.
>> 2) Lack of support for missing values as virtually everything that I
>> have worked with involves missing values at some stage.
>> 3) Lack of a suitable statistical modeling interface where you can
>> specify the model to be fit without having to create each individual
>> array. The approach must work for a range of scenarios.
>>
>>     
>
> With 2 and 3 I have little experience.
> Missing observations, I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations: just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> that can handle those?
>   
No! We have had considerable discussion on this aspect in the past on 
the numpy/scipy lists. Basically, a missing observation should not be 
treated as a NaN (and there are different types of NaNs) because the two 
are not the same. In some cases, missing values disappear in the 
calculations, such as when creating the X'X matrix, but you probably do 
not want that to happen if you have real NaNs in your data (say, after 
taking the square root of an array that includes negative numbers).
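
To make the distinction concrete, here is a minimal sketch using numpy.ma. 
The data values are made up for illustration, and this is only one possible 
way to keep "missing" and "NaN" apart, not a description of how stats or 
mstats handle it.

```python
import numpy as np
import numpy.ma as ma

# Made-up data: the third value was never observed; -4.0 is a real measurement.
raw = np.array([1.0, 4.0, 0.0, -4.0, 9.0])
observed = ma.masked_array(raw, mask=[False, False, True, False, False])

# The square root of the plain array yields a genuine NaN for -4.0.
with np.errstate(invalid="ignore"):
    roots = np.sqrt(raw)

# Keep the two concepts separate: the mask marks "missing",
# while NaN marks "computed but undefined".
roots_ma = ma.masked_array(roots, mask=observed.mask)

n_missing = int(ma.getmaskarray(roots_ma).sum())     # missing observations
n_nan = int(np.isnan(roots_ma.compressed()).sum())   # real NaNs among observed values
mean_observed = observed.mean()                      # masked entries are skipped
```

Here the mask records which observations are absent, while the NaN from the 
negative value stays visible among the observed values instead of being 
silently dropped along with the missing ones.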

> Jonathan Taylor included a formula framework in stats.models similar
> to R, but I haven't looked very closely at it. I haven't learned much
> of R's syntax and I usually prefer to build my own arrays (with some
> exceptions such as polynomials) than hide them behind a mini model
> language.
> For both stats.models and for the interface for general stats
> functions, feedback would be very appreciated.
>
> Josef
If you look at R's lm function you can see that you can fit a model 
using a formula. Without a similar framework, you cannot do useful 
stats. You also need a 'mini model language' because the inputs 
must be created correctly, and doing that by hand gets very repetitive 
very quickly.

For example, in R (and all major stats languages like SAS) you can just 
fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where Y, x1, 
x2, and x3 are columns in the appropriate data frame (not necessarily in 
that order).

If I understand mstats.linregress correctly, I have to create two arrays 
just to fit one of these two models. In the second case, I have to 
create yet another array. If I have my original data in one array, I 
have now unnecessarily duplicated 3 columns of that array, not to mention 
done all this extra work, hopefully error free, just to match 2 lines of 
R code.
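
As a rough sketch of the kind of interface I mean, the function below fits 
a model by column name from a single structured array. Note that fit_ols is 
a hypothetical helper written for this example (it is not part of scipy or 
stats.models), the data are made up, and it uses plain least squares via 
numpy.linalg.lstsq rather than anything from mstats.

```python
import numpy as np

def fit_ols(data, response, predictors):
    """Hypothetical helper: least-squares fit of `response` on `predictors`,
    with columns looked up by name in a structured array."""
    y = np.asarray(data[response], dtype=float)
    # Design matrix: an intercept column plus one column per named predictor.
    columns = [np.ones(len(y))]
    columns += [np.asarray(data[name], dtype=float) for name in predictors]
    X = np.column_stack(columns)
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["Intercept"] + list(predictors), coef))

# Made-up data set held in one structured array; Y = 1 + 2*x1 + 3*x3 exactly.
data = np.array(
    [(4.0, 0.0, 2.0, 1.0),
     (3.0, 1.0, 1.0, 0.0),
     (11.0, 2.0, 4.0, 2.0),
     (10.0, 3.0, 3.0, 1.0),
     (18.0, 4.0, 5.0, 3.0)],
    dtype=[("Y", float), ("x1", float), ("x2", float), ("x3", float)],
)

model_a = fit_ols(data, "Y", ["x2"])         # roughly lm(Y ~ x2)
model_b = fit_ols(data, "Y", ["x3", "x1"])   # roughly lm(Y ~ x3 + x1)
```

The helper still builds a design matrix internally, but the user only names 
the columns, so each additional model is one line and the bookkeeping 
happens in one place.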

Jonathan's formula framework is along the right lines but, based on the 
doc string, is rather cumbersome and does not use array inputs. It 
probably would be more effective with a masked record array.
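
To illustrate what I have in mind with a masked record array, here is a 
small sketch using a numpy.ma array with a structured dtype. The helper 
complete_cases is hypothetical and the data are made up. It only shows how 
per-field masks could let each model use the rows that are complete for its 
own columns, which is one possible treatment of missing values and not the 
only one.

```python
import numpy as np
import numpy.ma as ma

names = ("Y", "x1", "x2")
data = np.array(
    [(4.0, 0.0, 2.0), (3.0, 1.0, 1.0), (11.0, 2.0, 4.0), (10.0, 3.0, 3.0)],
    dtype=[(name, float) for name in names],
)

# One boolean mask per field: x1 is missing in row 1, x2 is missing in row 3.
mask = np.zeros(data.shape, dtype=[(name, bool) for name in names])
mask["x1"][1] = True
mask["x2"][3] = True
records = ma.masked_array(data, mask=mask)

def complete_cases(records, columns):
    """Hypothetical helper: True for rows with no missing value in `columns`."""
    missing = np.zeros(records.shape, dtype=bool)
    for name in columns:
        missing |= ma.getmaskarray(records[name])
    return ~missing

keep_x1 = complete_cases(records, ["Y", "x1"])   # rows usable for Y ~ x1
keep_x2 = complete_cases(records, ["Y", "x2"])   # rows usable for Y ~ x2
```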

Bruce

PS Way back when, I did give feedback on the direction of the stats stuff.