[SciPy-dev] GSoC Project Proposal: Datasource and Jonathan Taylor's statistical models

Sat Mar 28 00:11:09 EDT 2009

I think it would be very good if you can improve some of the
statistics in scipy.

On Fri, Mar 27, 2009 at 6:09 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> Bruce Southey wrote:
>> Not getting into the merits of either part, I think you are asking for
>> trouble doing both because there is no clear connection between the two
>> parts. Knowing one part is not going to help you with the other. (The
>> argument that it helps get 'your feet wet' is rather lame.)
>
> Your point is well taken.  I think I will focus on the second part, as
> there seems to be much more interest in the statistical functionality.
>  And my work would undoubtedly be better if focused.

I think there is enough to do in improving statistics that you don't
need to add another side project. And, as a warmup, increasing test
coverage would be very useful.

>
>>I would strongly suggest that the main emphasis is just to get
>>Jonathan's code integrated into Scipy and perhaps something from various
>>places like the Scikit learn (how many logistic regression or least

From a help search in R, I would say there are 5 to 10 logistic
regression implementations.

>>squares codes do we really need?) and EconPy
>>http://code.google.com/p/econpy/wiki/EconPy
>
> I will have a closer look through Scikit learn and econpy and revise.
>

One of my current favorites is pymvpa. It looks mostly well written and
has quite good coverage of multivariate statistics. For distributions,
pymc is the most complete (which will not concern you much, given the
focus on models), and of course there is nipy. (All are under MIT or
BSD licenses.)
Most machine learning libraries have more restrictive licenses, which
constrains how much we can look at them.

For regression and implementation details, I also look quite often at
the Econometrics Toolbox by James P. LeSage
(http://www.spatial-econometrics.com/), written in matlab, which has
"classical" econometrics algorithms in the public domain.

>>I would think that it is essential to get these to work with masked
>>arrays (allows missing observations) or record array (enables the use of
>>'variable' names in model statements like most statistics packages do).
>
> I agree.  There has been some discussion of the most appropriate way
> to handle this in your thread previously mentioned (e.g., it would not
> always be appropriate to force conversion to a masked array, should
> stats and mstats be merged, etc.), and I would appreciate any
> direction that could be offered.  I like the idea of the "usemask"
> flag here http://mail.scipy.org/pipermail/scipy-dev/2009-February/011414.html
> but obviously would defer to others for the best solution.  Should I
> be spending most of my time looking through mstats rather than stats?
>
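
To make the masked array / record array point concrete, here is a
minimal sketch using only numpy. The function name `ols_masked`, the
`usemask` keyword and the -999 missing-value code are hypothetical
choices for illustration; this is not the interface of "models" or
of the "usemask" proposal linked above.

```python
import numpy as np
import numpy.ma as ma


def ols_masked(y, X, usemask=True):
    """Least-squares fit with an intercept.

    Hypothetical helper: if usemask is True, rows where y or any
    column of X is masked are dropped before fitting.
    """
    y = ma.asarray(y, dtype=float)
    X = ma.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if usemask:
        # keep only rows with no masked entry in y or X
        keep = ~(ma.getmaskarray(y) | ma.getmaskarray(X).any(axis=1))
    else:
        keep = np.ones(y.shape[0], dtype=bool)
    yk = ma.getdata(y)[keep]
    Xk = ma.getdata(X)[keep]
    design = np.column_stack([np.ones(len(yk)), Xk])
    params, _, _, _ = np.linalg.lstsq(design, yk, rcond=None)
    return params, keep


# A structured (record-like) array lets us refer to variables by name.
# Made-up data: y = 1 + 2*x, with one observation coded as missing.
data = np.array([(1.0, 3.0), (2.0, 5.0), (3.0, -999.0), (4.0, 9.0)],
                dtype=[("x", float), ("y", float)])
y = ma.masked_values(data["y"], -999.0)
params, keep = ols_masked(y, data["x"])
print(params)  # intercept and slope from the three complete rows
```

With usemask=False the missing-value code would enter the fit as if
it were data, which is the kind of silent error a mask avoids.
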
>>I would like to see the inclusion of Statistical Reference Datasets Project:
>>http://www.itl.nist.gov/div898/strd/
>>
>>The datasets would allow us to validate the accuracy of the code.
>
> Very good idea.

The problem is that it has very limited coverage. I recently
scraped/parsed the ANOVA examples (they cover only balanced designs) to
check stats.f_oneway and the anova in pymvpa. I took a non-linear
regression case to test optimize.curve_fit. There are also linear
regression and descriptive statistics cases, which would be more for
numpy, and one more category.
I was looking for other benchmarks, but with only limited success.
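
A minimal sketch of what such a check can look like. The data here
are made up (a small balanced design, not the NIST values), and the
reference F statistic is computed by hand from the sums of squares
instead of being taken from a certified result:

```python
import numpy as np
from scipy import stats

# Made-up balanced one-way layout: three groups of three observations.
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([3.0, 4.0, 5.0])]


def f_oneway_by_hand(groups):
    """One-way ANOVA F statistic from between/within sums of squares."""
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    k = len(groups)
    n = len(all_obs)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_ref = f_oneway_by_hand(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)

# The library result should agree with the reference value.
assert np.isclose(f_scipy, f_ref)
```

With certified datasets the structure would be the same, only with
the reference value read from the published results.
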

>
> Thanks for some initial feedback.  I will take under advisement and
> revise my proposal as needed.
>

A straightforward port of "models" would not be a lot of work, mainly
increasing test coverage and fixing any bugs that the tests reveal.
However, changes to the structure (refactoring) and completing missing
pieces such as additional test statistics can be quite time consuming.
From my experience with stats, one of the biggest time sinks in
checking someone else's code can be hunting for a reference to
fix some numbers that are not quite right compared to R or matlab
(e.g. tie handling or some "exotic" distributions). Being able to
follow some good books is very helpful.
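
As a small illustration of how tie handling alone changes the
numbers, here is a sketch with made-up data. It assumes a scipy
version in which stats.rankdata accepts the `method` keyword:

```python
from scipy import stats

# Made-up sample with one tie (the two 20s).
x = [10, 20, 20, 30]

# Default: tied values share the average of the ranks they span.
avg_ranks = stats.rankdata(x)

# Alternative convention: tied values all get the smallest rank.
min_ranks = stats.rankdata(x, method="min")

print(avg_ranks)  # ranks 1, 2.5, 2.5, 4
print(min_ranks)  # ranks 1, 2, 2, 4
```

Any statistic built on ranks inherits this choice, so a comparison
against another package has to confirm which convention it uses.
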

Some time will also be needed for design when pulling new code into
scipy, because code written for a specialized package might not be in
the right form for a general-purpose library like scipy.

I assume we will have more discussion later,

Josef