[SciPy-User] R vs Python for simple interactive data analysis

Mon Aug 29 12:59:08 EDT 2011

On Mon, Aug 29, 2011 at 11:42 AM,  <josef.pktd at gmail.com> wrote:
> On Mon, Aug 29, 2011 at 11:34 AM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
>> On Mon, Aug 29, 2011 at 10:27 AM,  <josef.pktd at gmail.com> wrote:
>>> On Mon, Aug 29, 2011 at 11:10 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
>>>> <cjordan1 at uw.edu> wrote:
>>>>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>>>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>>>>>> <jason-sage at creativetrax.com> wrote:
>>>>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>>>>>> which are summarized in the README file.
>>>>>>>>>
>>>>>>>>> From the README:
>>>>>>>>>
>>>>>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>>>>>> but then you lose all the advantages of IPython."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>>>>>> automatically inserting spaces:
>>>>>>>>>
>>>>>>>>> In [5]: %cpaste
>>>>>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>>>>>> :if 1>0:
>>>>>>>>> :    print 'hi'
>>>>>>>>> :--
>>>>>>>>> hi
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> SciPy-User mailing list
>>>>>>>>> SciPy-User at scipy.org
>>>>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>>>>>
>>>>>>>>
>>>>>>>> This strikes me as a textbook example of why we need an integrated
>>>>>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>>>>>> a chance and see if there are some places where pandas would really
>>>>>>>> help out.
>>>>>>>
>>>>>>> We used to have a formula class in scipy.stats and I do not follow
>>>>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>>>>>>> had this (extremely flexible but very hard to comprehend). It was what
>>>>>>> I had argued was needed ages ago for statsmodels. But it needs a
>>>>>>> community effort because the syntax required serves multiple
>>>>>>> communities with different annotations and needs. That is also seen
>>>>>>> from the different approaches taken by the stats packages from S/R,
>>>>>>> SAS, Genstat (and those are just the ones I have used).
>>>>>>>
>>>>>>
>>>>>> We have held this discussion at _great_ length multiple times on the
>>>>>> statsmodels list and are in the process of trying to integrate
>>>>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>>>>>> the statsmodels base.
>>>>>>
>>>>>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>>>>>>
>>>>>> and more recently
>>>>>>
>>>>>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>>>>>>
>>>>>> https://github.com/statsmodels/formula
>>>>>> https://github.com/statsmodels/charlton
>>>>>>
>>>>>> Wes and I made some effort to go through this at SciPy. From where I
>>>>>> sit, I think it's difficult to disentangle the data structures from
>>>>>> the formula implementation, or maybe I'd just prefer to finish
>>>>>> tackling the former because it's much more straightforward. So I'd
>>>>>> like to first finish the pandas-integration branch that we've started
>>>>>> and then focus on the formula support. This is on my (our, I hope...)
>>>>>> immediate long-term goal list. Then I'd like to come back to the
>>>>>> community and hash out the 'rules of the game' details for formulas
>>>>>> after we have some code for people to play with, which promises to be
>>>>>> "fun."
>>>>>>
>>>>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>>>>>>
>>>>>> FWIW, I could also improve the categorical function to be much nicer
>>>>>> for the given examples (i.e., take a list, drop a reference category),
>>>>>> but I don't know that it's worth it, because it's really just a
>>>>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>>>>>> more stop-gap?
>>>>>>
>>>>>
>>>>> I want more usability, but I agree that a stop-gap probably isn't the
>>>>> right way to go, unless it has things we'd eventually want anyways.
>>>>>
>>>>>> If I understand Chris' concerns, I think pandas + formula will go a
>>>>>> long way towards bridging the gap between Python and R usability, but
>>>>>
>>>>> Yes, I agree. pandas + formulas would go a long, long way towards more
>>>>> usability.
>>>>>
>>>>> Though I really, really want a scatterplot smoother (i.e., lowess) in
>>>>> statsmodels. I use it a lot, and the final part of my R file was
>>>>> entirely lowess. (And, I should add, that was the part people liked
>>>>> best since one of the main goals of the assignment was to generate
>>>>> nifty pictures that could be used to summarize the data.)
>>>>>
>>>>
>>>> Working my way through the pull requests. Very time poor...
>>>>
>>>>>> it's a large effort and there are only a handful (at best) of people
>>>>>> writing code -- Wes being the only one who's more or less "full time"
>>>>>> as far as I can tell. The 0.4 statsmodels release should be very
>>>>>> exciting though, I hope. I'm looking forward to it, at least. Then
>>>>>> there's only the small problem of building an infrastructure and
>>>>>> community like CRAN so we can have specialists writing and maintaining
>>>>>> code...but I hope once all the tools are in place this will seem much
>>>>>> less daunting. There certainly seems to be the right sentiment for it.
>>>>>>
>>>>>
>>>>> At the very least creating and testing models would be much simpler.
>>>>> For weeks I've been wanting to see if gmm is the same as gee by
>>>>> fitting both models to the same dataset, but I've been putting it off
>>>>> because I didn't want to construct the design matrices by hand for
>>>>> such a simple question. (GMM--Generalized Method of Moments--is a
>>>>> standard econometrics model and GEE--Generalized Estimating
>>>>> Equations--is a standard biostatistics model. They're both
>>>>> generalizations of quasi-likelihood and appear very similar, but I
>>>>> want to fit some models to figure out if they're exactly the same.)
>>>
>>> Since GMM is still in the sandbox, the interface is not very polished,
>>> and it's missing some enhancements. I recommend asking on the mailing
>>> list if it's not clear.
>>>
>>> Note GMM itself is very general and will never be a quick interactive
>>> method. The main work will always be to define the moment conditions
>>> (a bit similar to non-linear function estimation, optimize.leastsq).
>>>
>>> There are and will be special subclasses, e.g. IV2SLS, that have
>>> predefined moment conditions, but, still, it's up to the user to
>>> construct design and instrument arrays.
>>> And as far as I remember, the GMM/GEE package in R doesn't have a
>>> formula interface either.
>>>
>>
>> Both of the two gee packages in R I know of have formula interfaces.
>>
>> http://cran.r-project.org/web/packages/geepack/
>> http://cran.r-project.org/web/packages/gee/index.html

This is very different from what's in GMM in statsmodels so far. The
help file is very short, so I'm mostly guessing.
It seems to be for a subset of generalized linear models with
longitudinal/panel covariance structures. Something like this will
eventually (once we get panel data models) be available as a special case
of GMM in statsmodels, assuming it's similar to what I know from the
econometrics literature.

Most of the subclasses of GMM that I currently have are focused on
instrumental variable estimation, including non-linear regression.
This should be expanded over time.

But GMM itself is designed for subclassing by someone who wants to use
her/his own moment conditions, as in
http://cran.r-project.org/web/packages/gmm/index.html
or for us to implement specific models with it.
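
To make "own moment conditions" concrete, here is a minimal sketch in
plain NumPy/SciPy. It is not the statsmodels sandbox interface; the
function names (moment_conditions, gmm_objective, fit_gmm) and the
simulated data are made up for illustration, and it only does one-step
GMM with an identity weighting matrix for linear IV moments.

```python
import numpy as np
from scipy import optimize


def moment_conditions(params, y, x, z):
    """Per-observation moments z_i * (y_i - x_i'params), shape (nobs, n_moments)."""
    resid = y - x @ params
    return z * resid[:, None]


def gmm_objective(params, y, x, z, weights):
    """Quadratic form gbar' W gbar in the averaged moment conditions."""
    gbar = moment_conditions(params, y, x, z).mean(axis=0)
    return gbar @ weights @ gbar


def fit_gmm(y, x, z, weights=None):
    """One-step GMM estimate; identity weighting unless weights is given."""
    if weights is None:
        weights = np.eye(z.shape[1])
    start = np.zeros(x.shape[1])
    res = optimize.minimize(gmm_objective, start,
                            args=(y, x, z, weights), method="BFGS")
    return res.x


# Simulated data (hypothetical): one endogenous regressor, two instruments.
rng = np.random.default_rng(0)
nobs = 500
z = np.column_stack([np.ones(nobs), rng.normal(size=(nobs, 2))])
u = rng.normal(size=nobs)
x_endog = z[:, 1] + 0.5 * z[:, 2] + 0.8 * u + rng.normal(size=nobs)
x = np.column_stack([np.ones(nobs), x_endog])
y = x @ np.array([1.0, 2.0]) + u

params_gmm = fit_gmm(y, x, z)
print(params_gmm)
```

Only moment_conditions is model specific here; swapping it for a
different set of moments leaves the objective and the optimizer call
unchanged, which is the point of designing the base class for
subclassing.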

If someone wants to use it, then I have to quickly add the options for
the kernels of the weighting matrix, which I keep postponing.
Currently there is only a truncated, uniform kernel that assumes
observations are ordered by time, but users can provide their own
weighting function.
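
As a rough sketch of what such a weighting function can look like (plain
NumPy, not the sandbox code): the covariance of the moment conditions is
estimated as the lag-0 term plus kernel-weighted autocovariances up to
maxlag, and the weighting matrix is its inverse. The names
uniform_kernel, bartlett_kernel and moment_covariance are hypothetical,
the moments are left uncentered, and the Bartlett kernel is included only
as an example of a user-supplied alternative.

```python
import numpy as np


def uniform_kernel(lag, maxlag):
    """Truncated uniform kernel: weight 1 for every lag up to maxlag."""
    return 1.0


def bartlett_kernel(lag, maxlag):
    """Bartlett kernel: weights decline linearly with the lag."""
    return 1.0 - lag / (maxlag + 1.0)


def moment_covariance(moments, maxlag, kernel=uniform_kernel):
    """Kernel-weighted covariance of moment conditions.

    moments has shape (nobs, n_moments) with rows ordered by time.
    """
    nobs = moments.shape[0]
    cov = moments.T @ moments / nobs
    for lag in range(1, maxlag + 1):
        # Autocovariance between observations that are `lag` periods apart.
        gamma = moments[lag:].T @ moments[:-lag] / nobs
        cov += kernel(lag, maxlag) * (gamma + gamma.T)
    return cov


# Hypothetical moment conditions evaluated at some parameter estimate.
rng = np.random.default_rng(0)
moments = rng.normal(size=(200, 3))

cov_uniform = moment_covariance(moments, maxlag=2)
cov_bartlett = moment_covariance(moments, maxlag=2, kernel=bartlett_kernel)
weights = np.linalg.inv(cov_uniform)
```

Note that the truncated uniform kernel does not guarantee a positive
semidefinite estimate in finite samples, which is one reason to offer
other kernels.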

Josef

>
> I have to look at this. I mixed up some acronyms, I meant GEL and GMM
> http://cran.r-project.org/web/packages/gmm/index.html
> the vignette was one of my readings, and the STATA description for GMM.
>
> I never really looked at GEE. (That's Skipper's private work so far.)
>
> Josef
>
>>
>> -Chris JS
>>
>>> Josef
>>>
>>>>>
>>>>
>>>> Oh, it's not *that* bad. I agree, of course, that it could be better,
>>>> but I've been using mainly Python for my work, including GMM and
>>>> estimating equations models (mainly empirical likelihood and
>>>> generalized maximum entropy) for the last ~two years.
>>>>
>>>> Skipper
>