[SciPy-User] R vs Python for simple interactive data analysis

Mon Aug 29 10:57:43 EDT 2011

On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>> <jason-sage at creativetrax.com> wrote:
>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>> github repo. My overall impression is that R is much stronger for
>>>>> interactive data analysis. Click on the link for more details why,
>>>>> which are summarized in the README file.
>>>>
>>>> From the README:
>>>>
>>>> "In fact, using Python without the IPython qtconsole is practically
>>>> impossible for this sort of cut and paste, interactive analysis.
>>>> The shell IPython doesn't allow it because it automatically adds
>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>> alignment. Cutting and pasting works for the standard python shell,
>>>> but then you lose all the advantages of IPython."
>>>>
>>>>
>>>>
>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>> automatically inserting spaces:
>>>>
>>>> In [5]: %cpaste
>>>> Pasting code; enter '--' alone on the line to stop.
>>>> :if 1>0:
>>>> :    print 'hi'
>>>> :--
>>>> hi
>>>>
>>>> Thanks,
>>>>
>>>> Jason
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>>
>>> This strikes me as a textbook example of why we need an integrated
>>> formula framework in statsmodels. I'll make a pass through when I get
>>> a chance and see if there are some places where pandas would really
>>> help out.
>>
>> We used to have a formula class in scipy.stats and I do not follow
>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>> had this (extremely flexible but very hard to comprehend). It was what
>> I had argued was needed ages ago for statsmodels. But it needs a
>> community effort because the syntax required serves multiple
>> communities with different annotations and needs. That is also seen
>> from the different approaches taken by the stats packages from S/R,
>> SAS, Genstat (and those are just the ones I have used).
>>
>
> We have held this discussion at _great_ length multiple times on the
> statsmodels list and are in the process of trying to integrate
> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
> the statsmodels base.
>
> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>
> and more recently
>
> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>
> https://github.com/statsmodels/formula
> https://github.com/statsmodels/charlton
>
> Wes and I made some effort to go through this at SciPy. From where I
> sit, I think it's difficult to disentangle the data structures from
> the formula implementation, or maybe I'd just prefer to finish
> tackling the former because it's much more straightforward. So I'd
> like to first finish the pandas-integration branch that we've started
> and then focus on the formula support. This is on my (our, I hope...)
> immediate long-term goal list. Then I'd like to come back to the
> community and hash out the 'rules of the game' details for formulas
> after we have some code for people to play with, which promises to be
> "fun."
>
> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>
> FWIW, I could also improve the categorical function to be much nicer
> for the given examples (i.e., take a list, drop a reference category),
> but I don't know that it's worth it, because it's really just a
> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
> more stop-gap?
>

I want more usability, but I agree that a stop-gap probably isn't the
right way to go, unless it has things we'd eventually want anyways.
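
For concreteness, the "take a list, drop a reference category" behavior
amounts to something like the following. This is only a minimal sketch in
plain Python, not the statsmodels categorical function; the name
dummy_code, its reference argument, and the choice of the first sorted
level as the default reference are all assumptions made for illustration.

```python
def dummy_code(values, reference=None):
    """Return (kept_levels, rows) of 0/1 indicators for a list of labels.

    One level (the reference) is dropped so the columns are not collinear
    with an intercept. Hypothetical helper, not a statsmodels API.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first sorted level
    if reference not in levels:
        raise ValueError("reference level %r not found" % (reference,))
    kept = [lev for lev in levels if lev != reference]
    # One row per observation, one 0/1 column per non-reference level.
    rows = [[1 if v == lev else 0 for lev in kept] for v in values]
    return kept, rows


names, rows = dummy_code(["a", "b", "c", "a", "b"])
print(names)
print(rows)
```

Observations at the reference level come out as all-zero rows, which is
what lets an intercept column absorb that level.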

> If I understand Chris' concerns, I think pandas + formula will go a
> long way towards bridging the gap between Python and R usability, but

Yes, I agree. pandas + formulas would go a long, long way towards more
usability.

Though I really, really want a scatterplot smoother (i.e., lowess) in
statsmodels. I use it a lot, and the final part of my R file was
entirely lowess. (And, I should add, that was the part people liked
best since one of the main goals of the assignment was to generate
nifty pictures that could be used to summarize the data.)
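
To be clear about what is being asked for, the core of lowess is small.
Below is a rough sketch in plain Python, not the R or statsmodels
implementation: it does a single pass of tricube-weighted local linear
fits and skips the robustness iterations that full lowess adds. The
function names and the frac parameter (fraction of points used in each
local fit) are made up for illustration.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_smooth(x, y, frac=0.5):
    """Smooth y against x with tricube-weighted local linear fits.

    Simplified sketch: one pass, no robustness reweighting.
    """
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))  # neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth h: distance from x0 to its k-th nearest point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        if h > 0.0:
            w = [tricube((xi - x0) / h) for xi in x]
        else:
            # All k nearest points coincide with x0; weight only those.
            w = [1.0 if xi == x0 else 0.0 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        # Fall back to the weighted mean when the local fit is degenerate.
        slope = sxy / sxx if sxx > 0.0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


xs = [float(i) for i in range(10)]
ys = [2.0 * xi + 1.0 for xi in xs]
smoothed = local_linear_smooth(xs, ys, frac=0.5)
```

On exactly linear data like the xs, ys above, the local fits should
reproduce the line; on noisy data, a larger frac gives a smoother curve.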

> it's a large effort and there are only a handful (at best) of people
> writing code -- Wes being the only one who's more or less "full time"
> as far as I can tell. The 0.4 statsmodels release should be very
> exciting though, I hope. I'm looking forward to it, at least. Then
> there's only the small problem of building an infrastructure and
> community like CRAN so we can have specialists writing and maintaining
> code...but I hope once all the tools are in place this will seem much
> less daunting. There certainly seems to be the right sentiment for it.
>

At the very least creating and testing models would be much simpler.
For weeks I've been wanting to see if GMM is the same as GEE by
fitting both models to the same dataset, but I've been putting it off
because I didn't want to construct the design matrices by hand for
such a simple question. (GMM--Generalized Method of Moments--is a
standard econometrics model and GEE--Generalized Estimating
Equations--is a standard biostatistics model. They're both
generalizations of quasi-likelihood and appear very similar, but I
want to fit some models to figure out if they're exactly the same.)

-Chris JS

> Skipper