[SciPy-User] [ANN] scikit.statsmodels 0.2.0 release

josef.pktd at gmail.com
Fri Feb 19 14:49:32 EST 2010


On Fri, Feb 19, 2010 at 1:04 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Fri, Feb 19, 2010 at 12:55 PM, Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>> On Fri, Feb 19, 2010 at 12:45:04PM -0500, Skipper Seabold wrote:
>>> > Also, there will be differences in APIs, as far as I understand the
>>> > statsmodels API. For instance, I believe that constructors of models
>>> > should work without passing them the data (the data could be optional). The
>>> > reason being that on-line estimators shouldn't be passed
>>> > initialisation data. As a consequence, maybe the 'fit' method should
>>> > take the data... All this is quite open to me, and I don't want to draw
>>> > any premature conclusions.
>>
>>
>>> Just a quick comment (disclaimer: all my own thoughts and
>>> misunderstandings...feel free to correct me).  Historically, the
>>> statsmodels package accepted a design matrix during model instantiation,
>>> and then you passed your dependent variable to the fit method.  To my
>>> mind, though, this didn't make much sense for how I think of a
>>> model (probably somewhat discipline-specific?).  For the estimators
>>> that we have, we are usually fitting a parametric model in order to
>>> test a given theory about the data-generating process.  The model
>>> doesn't make much sense to me without the data (my data is not
>>> real-time and I am not data mining).
>>
>> Suppose you implement recursive estimation, say Kalman filtering? There
>> are use cases for that, and we want to solve them.

Or incremental least squares, as the discussion with Nathaniel shows.
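
For instance, a bare-bones sketch (my own illustration, none of this is
in statsmodels) of incremental least squares that only ever holds one
chunk of data in memory, accumulating the cross-product matrices:

    import numpy as np

    def data_chunks(n_chunks=10, nobs=50, seed=0):
        # stand-in for data streamed from a file or database
        rng = np.random.RandomState(seed)
        beta = np.array([1., 2., 3.])
        for _ in range(n_chunks):
            x = rng.randn(nobs, 3)
            yield x, np.dot(x, beta) + rng.randn(nobs)

    xtx = np.zeros((3, 3))
    xty = np.zeros(3)
    for x, y in data_chunks():
        xtx += np.dot(x.T, x)  # accumulate X'X
        xty += np.dot(x.T, y)  # accumulate X'y

    params = np.linalg.solve(xtx, xty)  # same solution as full-data OLS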

>>
>> Sometimes your data doesn't fit in memory. If you have a forward
>> selection regression model on huge data, say genomics data that is
>> never going to fit in your memory and that you are fishing out of a
>> database, the API is also going to break down.
>>
>> Also, being able to give initial guesses to the estimator, to do a
>> warm restart of a convex optimisation for instance, might significantly
>> change the computational cost of, e.g., a cross-validation.

For many of the linear models so far we didn't need starting values,
but for the maximum likelihood estimators, fit takes a starting_value
argument. For GLS, and in general for two-step or multi-stage
methods, we can call the same estimator several times with updated
starting values; e.g., GLSAR has both a fit and an iterative_fit
method, where iterative_fit does the updating internally.
One problem is that fit returns a Results instance, but with deferred
calculations that cost might not be high anymore.
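
As a rough sketch of that two-step pattern (based on the 0.2.0-era
interface; exact names and defaults may differ in your version):

    import numpy as np
    import scikits.statsmodels as sm

    nobs = 100
    x = np.column_stack((np.ones(nobs), np.arange(nobs, dtype=float)))
    y = np.dot(x, [1., 0.5]) + np.random.randn(nobs)

    # linear model with AR(1) errors; iterative_fit re-estimates rho
    # from the residuals and refits, for several rounds, internally
    model = sm.GLSAR(y, x, rho=1)
    results = model.iterative_fit(maxiter=6)
    print results.params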

We currently don't have any updating on data in statsmodels, but that
doesn't need to stay this way. Since we don't have any
recursive/iterative (over observations) algorithms yet, it would
currently make no difference in the calculations. I don't have a
clear idea yet how, or how much, to do with "online" estimation and
batch processing in time series, but cross-validation is a big reason
to find a way.
Depending on the use case, the best style of implementation also
differs considerably: with one call to convolve, correlate, or fft I
get a very fast C loop over the entire time series, but that is not
useful if I want to do Kalman filtering.
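
To make that concrete, a small sketch (again just an illustration) of
the same AR(1) recursion done batch-style in one compiled call versus
observation by observation, which is the form a Kalman-type filter
needs:

    import numpy as np
    from scipy import signal

    x = np.random.randn(1000)

    # batch style: one call, fast compiled loop over the whole series
    y_batch = signal.lfilter([1.], [1., -0.5], x)

    # recursive style: one observation at a time, so the state (and,
    # e.g., a Kalman gain) could be updated online
    y_rec = np.zeros_like(x)
    for t in range(len(x)):
        y_rec[t] = x[t] + 0.5 * (y_rec[t-1] if t > 0 else 0.0)

    assert np.allclose(y_batch, y_rec)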

I don't think very similar APIs need to be a target either, as long
as they are easy to figure out: machine learning is "training" the
model, and we are "fitting" it to the data. Since the audiences and
main use cases can be rather different, the terminology also varies.

What I find more of an issue for easy and efficient usage across
packages is how cheap it is to get the data in and the results out.
Since the core routines in learn and statsmodels just require numpy
arrays, this should remain easy and compatible, without having to go
through lots of layers first.
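
For example, with the current statsmodels (0.2.0-style import; details
may differ), plain ndarrays go in and come back out, so the results can
be handed straight to another package:

    import numpy as np
    import scikits.statsmodels as sm

    x = sm.add_constant(np.random.randn(100, 2))  # plain ndarray in
    y = np.dot(x, [1., 2., 3.]) + np.random.randn(100)

    res = sm.OLS(y, x).fit()
    print res.params  # plain ndarray out, usable anywhere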


>>
>> On the other hand, my experience is that trying to solve all the possible
>> usecases beforehand without working code and examples just leads to
>> developers staring at a white board. So I'd rather move forward, and
>> think about API based on examples.
>>
>
> All valid points, and I agree.  In the past, I've found it very hard
> to code for use cases that I am not aware of ;)

Me too. The API for the current models looks ok, but new models will
require adjustments to the generic API.

Josef

>
>> I just wanted to warn that we are probably not going to follow the
>> existing APIs exactly, and that there are reasons for that. I am not
>> trying to bash existing APIs; that is a pointless activity, IMHO.
>>
>> Gaël