[SciPy-User] R vs Python for simple interactive data analysis

Mon Aug 29 17:03:00 EDT 2011

On Mon, Aug 29, 2011 at 12:13 PM,  <josef.pktd at gmail.com> wrote:
> On Mon, Aug 29, 2011 at 12:59 PM,  <josef.pktd at gmail.com> wrote:
>> On Mon, Aug 29, 2011 at 11:42 AM,  <josef.pktd at gmail.com> wrote:
>>> On Mon, Aug 29, 2011 at 11:34 AM, Christopher Jordan-Squire
>>> <cjordan1 at uw.edu> wrote:
>>>> On Mon, Aug 29, 2011 at 10:27 AM,  <josef.pktd at gmail.com> wrote:
>>>>> On Mon, Aug 29, 2011 at 11:10 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
>>>>>> <cjordan1 at uw.edu> wrote:
>>>>>>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>>>>>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>>>>>>>> <jason-sage at creativetrax.com> wrote:
>>>>>>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>>>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>>>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>>>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>>>>>>>> which are summarized in the README file.
>>>>>>>>>>>
>>>>>>>>>>>  From the README:
>>>>>>>>>>>
>>>>>>>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>>>>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>>>>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>>>>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>>>>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>>>>>>>> but then you lose all the advantages of IPython."
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>>>>>>>> automatically inserting spaces:
>>>>>>>>>>>
>>>>>>>>>>> In [5]: %cpaste
>>>>>>>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>>>>>>>> :if 1>0:
>>>>>>>>>>> :    print 'hi'
>>>>>>>>>>> :--
>>>>>>>>>>> hi
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jason
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> SciPy-User mailing list
>>>>>>>>>>> SciPy-User at scipy.org
>>>>>>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This strikes me as a textbook example of why we need an integrated
>>>>>>>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>>>>>>>> a chance and see if there are some places where pandas would really
>>>>>>>>>> help out.
>>>>>>>>>
>>>>>>>>> We used to have a formula class in scipy.stats and I do not follow
>>>>>>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>>>>>>>>> had this (extremely flexible but very hard to comprehend). It was what
>>>>>>>>> I had argued was needed ages ago for statsmodels. But it needs a
>>>>>>>>> community effort because the syntax required serves multiple
>>>>>>>>> communities with different annotations and needs. That is also seen
>>>>>>>>> from the different approaches taken by the stats packages from S/R,
>>>>>>>>> SAS, Genstat (and those are just the ones I have used).
>>>>>>>>>
>>>>>>>>
>>>>>>>> We have held this discussion at _great_ length multiple times on the
>>>>>>>> statsmodels list and are in the process of trying to integrate
>>>>>>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>>>>>>>> the statsmodels base.
>>>>>>>>
>>>>>>>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>>>>>>>>
>>>>>>>> and more recently
>>>>>>>>
>>>>>>>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>>>>>>>>
>>>>>>>> https://github.com/statsmodels/formula
>>>>>>>> https://github.com/statsmodels/charlton
>>>>>>>>
>>>>>>>> Wes and I made some effort to go through this at SciPy. From where I
>>>>>>>> sit, I think it's difficult to disentangle the data structures from
>>>>>>>> the formula implementation, or maybe I'd just prefer to finish
>>>>>>>> tackling the former because it's much more straightforward. So I'd
>>>>>>>> like to first finish the pandas-integration branch that we've started
>>>>>>>> and then focus on the formula support. This is on my (our, I hope...)
>>>>>>>> immediate long-term goal list. Then I'd like to come back to the
>>>>>>>> community and hash out the 'rules of the game' details for formulas
>>>>>>>> after we have some code for people to play with, which promises to be
>>>>>>>> "fun."
>>>>>>>>
>>>>>>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>>>>>>>>
>>>>>>>> FWIW, I could also improve the categorical function to be much nicer
>>>>>>>> for the given examples (i.e., take a list, drop a reference category),
>>>>>>>> but I don't know that it's worth it, because it's really just a
>>>>>>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>>>>>>>> more stop-gap?
>>>>>>>>
>>>>>>>
>>>>>>> I want more usability, but I agree that a stop-gap probably isn't the
>>>>>>> right way to go, unless it has things we'd eventually want anyways.
>>>>>>>
>>>>>>>> If I understand Chris' concerns, I think pandas + formula will go a
>>>>>>>> long way towards bridging the gap between Python and R usability, but
>>>>>>>
>>>>>>> Yes, I agree. pandas + formulas would go a long, long way towards more
>>>>>>> usability.
>>>>>>>
>>>>>>> Though I really, really want a scatterplot smoother (i.e., lowess) in
>>>>>>> statsmodels. I use it a lot, and the final part of my R file was
>>>>>>> entirely lowess. (And, I should add, that was the part people liked
>>>>>>> best since one of the main goals of the assignment was to generate
>>>>>>> nifty pictures that could be used to summarize the data.)
>>>>>>>
>>>>>>
>>>>>> Working my way through the pull requests. Very time poor...
>>>>>>
>>>>>>>> it's a large effort and there are only a handful (at best) of people
>>>>>>>> writing code -- Wes being the only one who's more or less "full time"
>>>>>>>> as far as I can tell. The 0.4 statsmodels release should be very
>>>>>>>> exciting though, I hope. I'm looking forward to it, at least. Then
>>>>>>>> there's only the small problem of building an infrastructure and
>>>>>>>> community like CRAN so we can have specialists writing and maintaining
>>>>>>>> code...but I hope once all the tools are in place this will seem much
>>>>>>>> less daunting. There certainly seems to be the right sentiment for it.
>>>>>>>>
>>>>>>>
>>>>>>> At the very least creating and testing models would be much simpler.
>>>>>>> For weeks I've been wanting to see if gmm is the same as gee by
>>>>>>> fitting both models to the same dataset, but I've been putting it off
>>>>>>> because I didn't want to construct the design matrices by hand for
>>>>>>> such a simple question. (GMM--Generalized Method of Moments--is a
>>>>>>> standard econometrics model and GEE--Generalized Estimating
>>>>>>> Equations--is a standard biostatistics model. They're both
>>>>>>> generalizations of quasi-likelihood and appear very similar, but I
>>>>>>> want to fit some models to figure out if they're exactly the same.)
>>>>>
>>>>> Since GMM is still in the sandbox, the interface is not very polished,
>>>>> and it's missing some enhancements. I recommend asking on the mailing
>>>>> list if it's not clear.
>>>>>
>>>>> Note GMM itself is very general and will never be a quick interactive
>>>>> method. The main work will always be to define the moment conditions
>>>>> (a bit similar to non-linear function estimation, optimize.leastsq).
>>>>>
>>>>> There are and will be special subclasses, e.g. IV2SLS, that have
>>>>> predefined moment conditions, but, still, it's up to the user to
>>>>> construct design and instrument arrays.
>>>>> And as far as I remember, the GMM/GEE package in R doesn't have a
>>>>> formula interface either.
>>>>>
>>>>
>>>> Both of the two gee packages in R I know of have formula interfaces.
>>>>
>>>> http://cran.r-project.org/web/packages/geepack/
>>>> http://cran.r-project.org/web/packages/gee/index.html
>>
>> This is very different from what's in GMM in statsmodels so far. The
>> help file is very short, so I'm mostly guessing.
>> It seems to be for (a subset of) generalized linear models with
>> longitudinal/panel covariance structures. Something like this will
>> eventually be available (once we get panel data models) as a special
>> case of GMM in statsmodels, assuming it's similar to what I know from
>> the econometrics literature.
>>
>> Most of the subclasses of GMM that I currently have are focused on
>> instrumental variable estimation, including non-linear regression.
>> This should be expanded over time.
>>
>> But GMM itself is designed for subclassing by someone who wants to use
>> her/his own moment conditions, as in
>> http://cran.r-project.org/web/packages/gmm/index.html
>> or for us to implement specific models with it.
>>
>> If someone wants to use it, then I have to quickly add the options for
>> the kernels of the weighting matrix, which I keep postponing.
>> Currently there is only a truncated, uniform kernel that assumes
>> observations are ordered by time, but users can provide their own
>> weighting function.
>>
>> Josef
>>
>>>
>>> I have to look at this. I mixed up some acronyms, I meant GEL and GMM
>>> http://cran.r-project.org/web/packages/gmm/index.html
>>> the vignette was one of my readings, and the STATA description for GMM.
>>>
>>> I never really looked at GEE. (That's Skipper's private work so far.)
>>>
>>> Josef
>>>
>>>>
>>>> -Chris JS
>>>>
>>>>> Josef
>>>>>
>>>>>>>
>>>>>>
>>>>>> Oh, it's not *that* bad. I agree, of course, that it could be better,
>>>>>> but I've been using mainly Python for my work, including GMM and
>>>>>> estimating equations models (mainly empirical likelihood and
>>>>>> generalized maximum entropy) for the last ~two years.
>>>>>>
>>>>>> Skipper
>>>
>>
>
> just to make another point:
>
> Without someone adding mixed effects, hierarchical, panel/longitudinal
> models, and .... it will not help to have a formula interface to them.
> (Thanks to Scott we will soon have survival)
>

I don't think I understand.

I assumed that the formula framework is essentially orthogonal to the
models themselves. In the sense that it should be simple to adapt a
formula framework to new models. At least if they're some variety of
linear model, and provided the formula framework is designed to allow
for grouping syntax from the beginning. I think ease of extension to
new models is a major goal, in fact, since we want it to be easy for
people to contribute new models.
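
To make that concrete, here is a minimal sketch of what I mean, in plain
Python. Everything in it (design_matrix, FormulaMixin, from_formula,
ToyModel) is a hypothetical name made up for illustration, not the API of
statsmodels, Charlton, or Formula, and it only handles additive terms with
no grouping syntax. The point is just that the formula code only has to
produce (endog, exog), so any model built on that pair could pick up a
formula interface without model-specific work.

```python
# Hypothetical sketch: none of these names exist in statsmodels,
# Charlton, or Formula. Additive terms only, no grouping syntax.


def design_matrix(formula, data):
    """Turn 'y ~ x + g' plus a dict of columns into (endog, exog, names)."""
    lhs, rhs = [part.strip() for part in formula.split("~")]
    terms = [term.strip() for term in rhs.split("+")]
    n = len(data[lhs])

    names = ["Intercept"]
    columns = [[1.0] * n]
    for term in terms:
        values = data[term]
        if all(isinstance(v, str) for v in values):
            # Categorical term: one 0/1 column per level, dropping the
            # first (sorted) level as the reference category.
            levels = sorted(set(values))
            for level in levels[1:]:
                names.append("%s[%s]" % (term, level))
                columns.append([1.0 if v == level else 0.0 for v in values])
        else:
            # Numeric term: use the column as-is.
            names.append(term)
            columns.append([float(v) for v in values])

    exog = [list(row) for row in zip(*columns)]  # columns -> rows
    endog = [float(v) for v in data[lhs]]
    return endog, exog, names


class FormulaMixin(object):
    """Gives any (endog, exog) model a formula-based constructor."""

    @classmethod
    def from_formula(cls, formula, data):
        endog, exog, names = design_matrix(formula, data)
        model = cls(endog, exog)
        model.exog_names = names
        return model


class ToyModel(FormulaMixin):
    """Stand-in for a new model; it knows nothing about formulas."""

    def __init__(self, endog, exog):
        self.endog = endog
        self.exog = exog


data = {
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "g": ["a", "b", "a", "c"],
}
model = ToyModel.from_formula("y ~ x + g", data)
print(model.exog_names)
print(model.exog[1])
```

A new model written against (endog, exog) would only need to inherit the
mixin. Grouping terms for mixed or panel models would presumably need a
richer return value than a single exog array, which is why I think the
framework has to allow for them from the start.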

-Chris JS

> Josef