[SciPy-user] predicting values based on (linear) models

josef.pktd at gmail.com
Thu Jan 15 13:25:45 EST 2009


On Thu, Jan 15, 2009 at 12:36 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> josef.pktd at gmail.com wrote:
>>> There are different reasons for a lack of user base. One reason for
>>> R's large user base is that many, many statistics classes use it.
>>>
>>> Some of the reasons that I do not use scipy for stats (and I have not
>>> looked at this in some time) included:
>>> 1) The difficulty of installation, which is considerably better now.
>>> 2) Lack of support for missing values, as virtually everything that I
>>> have worked with involves missing values at some stage.
>>> 3) Lack of a suitable statistical modeling interface where you can
>>> specify the model to be fit without having to create each individual
>>> array. The approach must work for a range of scenarios.
>>>
>>>
>>
>> With 2 and 3 I have little experience.
>> Missing observations I usually remove or clean in the initial data
>> preparation. mstats provides functions for masked arrays, but stats
>> mostly assumes no missing values. What would be the generic treatment
>> for missing observations: just dropping all observations that have
>> NaNs, or converting them to masked arrays and expanding the functions
>> that can handle those?
>>
> No! We have had considerable discussion on this aspect in the past on
> the numpy/scipy lists. Basically, a missing observation should not be
> treated as a NaN (and there are different types of NaNs) because they
> are not the same. In some cases missing values disappear in the
> calculations, such as when creating the X'X matrix, but you probably do
> not want that if you have real NaNs in your data (say, after taking the
> square root of an array that includes negative numbers).
>
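Just to make the two treatments concrete, a rough sketch (the data and
the particular mstats call are only for illustration):

    import numpy as np
    from scipy.stats import mstats

    x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

    # option 1: drop the missing observations up front
    x_clean = x[~np.isnan(x)]

    # option 2: mask the missing entries and use the masked-array
    # versions of the stats functions in scipy.stats.mstats
    xm = np.ma.masked_invalid(x)
    print(mstats.sem(xm))

The masked-array route at least keeps "missing" distinct from a NaN
that the computation itself produced.
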
>> Jonathan Taylor included a formula framework in stats.models similar
>> to R's, but I haven't looked very closely at it. I haven't learned much
>> of R's syntax, and I usually prefer to build my own arrays (with some
>> exceptions such as polynomials) rather than hide them behind a mini
>> model language.
>> For both stats.models and the interface for the general stats
>> functions, feedback would be much appreciated.
>>
>> Josef
>>
> If you look at R's lm function, you can see that you can fit a model
> using a formula. Without a similar framework you cannot do useful
> stats. Also, you must have a 'mini model language' because the inputs
> must be created correctly, and that gets very repetitive very quickly.
>
> For example, in R (and all major stats languages like SAS) you can just
> fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where Y, x1,
> x2, and x3 are columns of the appropriate dataframe (not necessarily in
> that order).

For the simple case, it could be done by accepting a sequence of args
and building the design matrix inside the function, e.g.

    ols(Y, x3, x1, x2**2)
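
A minimal sketch of what such an ols wrapper could look like (the name
and signature here are my own example, not an existing scipy function):

    import numpy as np

    def ols(y, *cols):
        # stack the regressor columns plus a constant into the design
        # matrix; column_stack accepts a mix of 1D and 2D pieces
        X = np.column_stack(cols + (np.ones(len(y)),))
        # least-squares fit, returning the estimated coefficients
        beta = np.linalg.lstsq(X, y)[0]
        return beta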

To build design matrices, I wrote for myself functions like
simplex(x, n), where x holds the data columns; it builds the matrix of
interaction terms x[:,0], x[:,1], x[:,0]*x[:,1], ..., x[:,0]**n,
which, if I read the R stats help correctly, corresponds to
(x1 + x2)^n in R's formula notation.

My ols call would then be
    ols(Y, simplex((x3, x1), 2))
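
A stripped-down version of such a builder might look like this (my
real function does a bit more; this one just generates all monomials
of the columns up to total degree n):

    import numpy as np
    from itertools import combinations_with_replacement

    def simplex(x, n):
        # x: sequence of 1D data columns, n: maximum total degree
        # builds all products of the columns with total degree 1..n,
        # which is roughly what R's (x1 + x2)^n expands to
        cols = [np.asarray(c) for c in x]
        terms = []
        for deg in range(1, n + 1):
            for combo in combinations_with_replacement(cols, deg):
                terms.append(np.prod(combo, axis=0))
        return np.column_stack(terms)

With n=2 and columns x3, x1 this produces x3, x1, x3**2, x3*x1, x1**2,
to which the ols wrapper above appends the constant.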

This uses explicit functions and avoids the mini-language, but it
requires a few design-matrix-building functions.

Being able to attach some meta-information to data arrays would be
nice, but I haven't used such features much, except for building my
own classes in Python or structs in MATLAB.
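
numpy's structured arrays already give a lightweight version of this
(named fields, though no general metadata); a small sketch with
made-up data, reusing the ols wrapper sketched above:

    import numpy as np

    # named columns, loosely like an R dataframe
    data = np.array([(1.0, 2.0, 0.5), (2.0, 1.5, 0.7), (3.0, 3.0, 0.9)],
                    dtype=[('Y', float), ('x1', float), ('x2', float)])

    # pull the columns out by name when building the model
    beta = ols(data['Y'], data['x1'], data['x2'])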

Josef


