[SciPy-dev] GSoC Project Proposal: Datasource and Jonathan Taylor's statistical models

Fri Mar 27 15:21:33 EDT 2009

Skipper Seabold wrote:
> Hello all,
>
> I am a first year PhD student in Economics at American University, and
> I would very much like to participate in the GSoC with the NumPy/SciPy
> community.  I am looking for some feedback and discussion before I
> submit a proposal.
>
> Judging by the ideas page and the discussion in this thread (
> http://mail.scipy.org/pipermail/scipy-dev/2009-February/011373.html )
> I think the following project proposal would be useful to the
> community.
>
> My proposal would have two parts, the first would be to improve
> datasource and integrate it into the numpy/scipy io.  I see this as a
> way to get my feet wet working on a project.  I do not imagine that it
> would take more than 2-3 weeks work on my end.
>   
Can you provide a link to datasource?

> The second part would be to get Jonathan Taylor's statistical models
> from the NiPy project into scipy.stats.  I think that I would be a
> good candidate for this work, as I am currently studying statistics
> and learning the ins and outs of NumPy/SciPy, so I don't mind doing
> some of the less appealing work as this is also a great learning
> opportunity.  Also I see this as a great way to get involved in the
> SciPy community in an area that currently needs some attention.  I am
> a student, so I would be able to help maintain the code, bug fix, and
> address other areas of the statistical capabilities that need
> attention.
>   
I would be willing to help to some degree.

I would strongly suggest that the main emphasis be simply to get
Jonathan's code integrated into SciPy, perhaps together with pieces
from other places such as the Scikit learn (how many logistic
regression or least squares codes do we really need?) and EconPy:
http://code.google.com/p/econpy/wiki/EconPy

Anything more than this would be too complex to address, and this
alone would provide a very solid base for future development.

> Below is a general outline of my proposal with some areas that I have
> identified as needing work.  I am eager to discuss some aspects of the
> projects with those that are interested and to work on the appropriate
> milestones.
>
> 1) Improve datasource and integrate it into all the numpy/scipy io
>
> Bug Fixes
>     Catch and handle malformed URLs
>
> Refactoring
>
> Enhancements
>     Improve findfile method
>     Improve cache method
>     Add zip archive, tar file handling capabilities
>     Improve networking interface to handle timeouts and proxies if
> there is sufficient interest
>
> Documentation
>     Document changes
>
> Tests
>     Implement test coverage for new changes
>
> Copy/Move to scipy.io
>   
This looks like quite a lot of work for a short period, especially if
you do both parts (I am also biased toward having the stats part
finished).
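
For the "catch and handle malformed URLs" item, a minimal sketch of the
kind of check I have in mind, using only the standard library. The
function name and the list of accepted schemes are my assumptions, not
anything taken from datasource itself:

```python
from urllib.parse import urlparse

# Assumed set of schemes worth accepting; adjust to whatever datasource supports.
SUPPORTED_SCHEMES = ("http", "https", "ftp")

def is_wellformed_url(path):
    """Return True if `path` has a supported scheme and a network location."""
    try:
        parts = urlparse(path)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # e.g. an unclosed IPv6 bracket in the host part.
        return False
    return parts.scheme in SUPPORTED_SCHEMES and bool(parts.netloc)
```

A caller could run this before attempting a download and fall back to
treating the argument as a local path when it returns False.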

> 2) Integrate Jonathan Taylor's statistical models into scipy.stats
>
> These models are currently in the NiPy project
> Merge relevant branches (branch trunk-josef models has the most recent
> changes, I believe)
>
> I will focus mostly on bringing over the linear models, which I
> believe would include at the least:
> bspline.py, contrast.py, gam.py, glm.py, model.py, regression.py, utils.py
>   
Not that it is really that important, but these are not all 'linear 
models' :-)

> Bug Fixes
>     Bug hunting
>     Improve existing test coverage
>
> Refactoring
>     Eliminate existing and created duplicate functionality
>     Make sure parameters are consistent, etc.
>   
I would not be that concerned with duplicate functionality because it is
better to train people to use the new code and deprecate the old code.
There are some cases where you may want different versions; for example,
code that assumes normality will be faster than code for generalized
linear models where non-normal distributions are allowed.
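
To illustrate that trade-off, here is a rough sketch (not taken from
Jonathan's code; the function names are hypothetical). Under normality
the fit is a single least-squares solve, whereas a Poisson GLM fitted by
iteratively reweighted least squares repeats a weighted solve until the
coefficients stop changing:

```python
import numpy as np

def fit_ols(X, y):
    """Least squares under the normal model: one linear solve."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_poisson_irls(X, y, tol=1e-10, max_iter=50):
    """Poisson regression (log link) by iteratively reweighted least squares."""
    mu = y + 0.5                 # start away from zero so log(mu) is finite
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for n_iter in range(1, max_iter + 1):
        z = eta + (y - mu) / mu  # working response
        sw = np.sqrt(mu)         # square root of the working weights
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            return beta, n_iter
        eta = X @ beta
        mu = np.exp(eta)
    return beta, max_iter

# Made-up illustrative counts, not real data.
x = np.arange(10) / 10.0
X = np.column_stack([np.ones_like(x), x])
y = np.array([1, 2, 1, 3, 2, 4, 3, 5, 4, 6], dtype=float)

beta_ols = fit_ols(X, y)
beta_glm, n_iter = fit_poisson_irls(X, y)
```

The GLM routine needs several passes (n_iter) over the data where the
normal-theory routine needs one, which is the reason to keep both.
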
> Enhancements
>
>   
I would think that it is essential to get these to work with masked
arrays (which allow missing observations) or record arrays (which enable
the use of 'variable' names in model statements, as most statistics
packages do).
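
As a rough sketch of what I mean (the data, the -999 missing-value code
and the fit_line helper are all made up for illustration and are not
part of the models code), a fit could look up variables by name in a
record array and drop rows that are masked as missing:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical data: -999 marks a missing observation.
raw = np.array([(1.0, 2.1), (2.0, -999.0), (3.0, 6.2), (4.0, 7.9)],
               dtype=[("x", float), ("y", float)])
data = raw.view(np.recarray)  # fields are reachable by name: data["x"], data.y

def fit_line(data, response, predictor, missing=-999.0):
    """Fit response = b0 + b1 * predictor, skipping rows with missing values."""
    y = ma.masked_values(data[response], missing)
    x = ma.masked_values(data[predictor], missing)
    # Keep only rows where both variables are observed.
    keep = ~(ma.getmaskarray(x) | ma.getmaskarray(y))
    X = np.column_stack([np.ones(keep.sum()), ma.getdata(x)[keep]])
    beta, *_ = np.linalg.lstsq(X, ma.getdata(y)[keep], rcond=None)
    return beta

beta = fit_line(data, "y", "x")  # intercept and slope from the 3 complete rows
```
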
> Documentation
>     Document changes
>     Make any necessary changes to stats/info.py
>   
Actually, the reference here is the SciPy documentation marathon. I
would also suggest that examples and tutorials are important here.
> Testing
>     Make sure test coverage is adequate
>   
I would like to see the inclusion of the NIST Statistical Reference
Datasets (StRD) project:
http://www.itl.nist.gov/div898/strd/

The datasets would allow us to validate the accuracy of the code.
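
One way such a check could be written is to compare fitted coefficients
against certified values and count the digits that agree (the log
relative error). The sketch below is only an illustration: the data and
the "certified" values are placeholders I made up, not NIST numbers, and
the helper name is hypothetical. It also assumes the certified values
are nonzero.

```python
import numpy as np

def log_relative_error(computed, certified):
    """Approximate number of correct significant digits per coefficient."""
    computed = np.asarray(computed, dtype=float)
    certified = np.asarray(certified, dtype=float)
    with np.errstate(divide="ignore"):  # an exact match gives +inf
        return -np.log10(np.abs(computed - certified) / np.abs(certified))

# Placeholder problem (NOT a NIST dataset): y = 1 + 2*x exactly.
x = np.arange(1.0, 11.0)
y = 1.0 + 2.0 * x
certified = np.array([1.0, 2.0])  # placeholder "certified" values

X = np.column_stack([np.ones_like(x), x])
computed, *_ = np.linalg.lstsq(X, y, rcond=None)
lre = log_relative_error(computed, certified)
```

A real test would load a dataset and its certified values from the StRD
files and require the log relative error to exceed some agreed
threshold.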

Regards
Bruce