[SciPy-User] [ANN] scikit.statsmodels 0.2.0 release

Fri Feb 19 12:19:54 EST 2010

On Fri, Feb 19, 2010 at 10:57:01AM -0600, Bruce Southey wrote:
> Will it end up as cython?

I am trying to convince the engineer who is doing the work to go down
that way, but he does not like cython. I am hesitant to impose my point
of view on a highly qualified engineer, but I must admit I don't like
having this hand-written C binding.

> (I just used the supplied Python bindings of libsvm so this could be 
> interesting.)

Well, we provide much more, like access to the weights, or vectorized
predict :).

> > Let's say that the focus between scikit.learn and statsmodels is most
> > probably going to be slightly different.

> Having done both (with papers), I find this type of comment assuming 
> because underlying both is the same concepts. What I would like to avoid 
> is having different user syntax for basic models for the same model. For 
> example, with logistic regression in SAS you have to be careful of which 
> is the default event setting as it varies across procedures. At least 
> these SAS procedures use the same unmodified dataset unlike some of the 
> R packages that do lars/lasso.

Indeed, I agree. We'll try to look very closely at statsmodels and not
differ if we can. However (rant ahead), we hear this story everywhere we
go: match our API. So we are struggling between pymvpa, mdp and
statsmodels (I am probably forgetting a few), which all differ slightly.
We are willing to adapt as long as it does not damage our use cases, but
it would be nice to have a common discussion.

Also, there will be API differences, as far as I understand the
statsmodels API. For instance, I believe that constructors of models
should work without being passed the data (the data could be optional).
The reason is that on-line estimators shouldn't be passed data at
initialization. As a consequence, maybe the 'fit' method should take the
data... All this is quite open to me, and I don't want to draw any
premature conclusions.
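
To make the constructor/fit split concrete, here is a minimal sketch of
what such estimators could look like. Everything in it is an assumption
for illustration only: the class names, the 'alpha' parameter and the
'update' method are hypothetical, and this is not the scikit.learn or
statsmodels API.

```python
import numpy as np


class RidgeSketch:
    """Hypothetical batch estimator: the constructor takes no data."""

    def __init__(self, alpha=1.0):
        # Only hyperparameters are set here (alpha is an assumed name).
        self.alpha = alpha
        self.coef_ = None

    def fit(self, X, y):
        # The data arrive in fit() as plain numpy arrays.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_features = X.shape[1]
        # Solve the regularized normal equations (X'X + alpha*I) w = X'y.
        A = np.dot(X.T, X) + self.alpha * np.eye(n_features)
        self.coef_ = np.linalg.solve(A, np.dot(X.T, y))
        return self

    def predict(self, X):
        # Vectorized prediction for all rows of X at once.
        return np.dot(np.asarray(X, dtype=float), self.coef_)


class RunningMean:
    """Hypothetical on-line estimator: data only ever arrive in batches."""

    def __init__(self):
        # Nothing to pass at construction time: no data exist yet.
        self.n_samples_ = 0
        self.mean_ = None

    def update(self, X):
        # Incrementally fold a new batch into the running column means.
        X = np.asarray(X, dtype=float)
        if self.mean_ is None:
            self.mean_ = np.zeros(X.shape[1])
        total = self.n_samples_ + X.shape[0]
        self.mean_ = self.mean_ + (X.sum(axis=0) - X.shape[0] * self.mean_) / total
        self.n_samples_ = total
        return self


# Usage: both objects are built without data, then fed arrays.
ridge = RidgeSketch(alpha=0.0).fit([[1, 0], [0, 1], [1, 1]], [1, 2, 3])
running = RunningMean().update([[1, 2], [3, 4]]).update([[5, 6]])
```

With this split the same constructor call works whether the data come
all at once or in successive batches, which is the case the on-line
argument above is about.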

We have not done any API design so far, because we are trying to
get a feel for what the existing APIs are, and because we want to have
working code to throw use cases at. Also, we are extremely open to
comments: just subscribe to the scikit.learn mailing list (not everybody
involved with scikit.learn follows this high-traffic mailing list).

> >> What would be nice is the acceptance of input data types between learn
> >> and statsmodels especially for things like logistic regression. While I
> >> understand the need for duplicate functions, it may be desirable to share
> >> at least code since both code bases are still relatively 'new'.

> > Well, as far as I am concerned, data types are numpy arrays. I am wary
> > of implementing higher level abstractions. It's more the APIs that may
> > differ, and that we will have to keep in sync.

> I do agree especially now that I have learnt the 'array' approach of 
> doing things.

> In some way my view of integration of things is Zelig -not that I have 
> really looked at it (as it is in R) :
> http://gking.harvard.edu/zelig/

Well, let us try not to have to build a common API and integration a
posteriori, but build them right from the start. A bit of API work is
well worth the effort, I believe. And please feel free to pitch in.

> The seamless ability to link packages is rather appealing and both 
> scikits share at least numpy.

And scipy, I believe.

Cheers,

Gaël