[SciPy-User] R vs Python for simple interactive data analysis

Skipper Seabold jsseabold at gmail.com
Sun Aug 28 15:54:49 EDT 2011


On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>> <jason-sage at creativetrax.com> wrote:
>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>> This comparison might be useful to some people, so I stuck it up on a
>>>> github repo. My overall impression is that R is much stronger for
>>>> interactive data analysis. Click on the link for more details why,
>>>> which are summarized in the README file.
>>>
>>> From the README:
>>>
>>> "In fact, using Python without the IPython qtconsole is practically
>>> impossible for this sort of cut and paste, interactive analysis.
>>> The shell IPython doesn't allow it because it automatically adds
>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>> alignment. Cutting and pasting works for the standard python shell,
>>> but then you lose all the advantages of IPython."
>>>
>>>
>>>
>>> You might use %cpaste in the ipython normal shell to paste without it
>>> automatically inserting spaces:
>>>
>>> In [5]: %cpaste
>>> Pasting code; enter '--' alone on the line to stop.
>>> :if 1>0:
>>> :    print 'hi'
>>> :--
>>> hi
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>
>> This strikes me as a textbook example of why we need an integrated
>> formula framework in statsmodels. I'll make a pass through when I get
>> a chance and see if there are some places where pandas would really
>> help out.
>
> We used to have a formula class in scipy.stats, and nipy
> (http://nipy.sourceforge.net/nipy/stable/index.html), which I do not
> follow, also had one (extremely flexible but very hard to comprehend).
> It was what I had argued was needed ages ago for statsmodels. But it
> needs a community effort because the syntax required serves multiple
> communities with different annotations and needs. That is also seen
> from the different approaches taken by the stats packages from S/R,
> SAS, Genstat (and those are just the ones I have used).
>

We have held this discussion at _great_ length multiple times on the
statsmodels list and are in the process of trying to integrate
Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
the statsmodels base.

http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework

and more recently

https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?

https://github.com/statsmodels/formula
https://github.com/statsmodels/charlton
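
For anyone who hasn't followed those threads, here is a toy sketch of
the basic job a formula layer does: turn a string like "y ~ x1 + x2"
plus some named columns into a response and a design matrix. This is
only an illustration under my own assumptions. The function name
design_from_formula is made up, it is not the Charlton or Formula API,
and it only understands '+' between plain column names with an
intercept always added.

```python
def design_from_formula(formula, data):
    """Toy formula handling: 'y ~ x1 + x2' -> (y, column names, X).

    Hypothetical helper for illustration only. `data` is assumed to be
    a dict mapping column names to equal-length lists of numbers.
    """
    # Split the response (left of '~') from the predictors (right of '~').
    lhs, rhs = [part.strip() for part in formula.split("~", 1)]
    terms = [term.strip() for term in rhs.split("+")]

    y = list(data[lhs])
    columns = ["Intercept"] + terms

    # One row per observation: a leading 1.0 for the intercept,
    # then the value of each named predictor.
    X = [[1.0] + [float(data[term][i]) for term in terms]
         for i in range(len(y))]
    return y, columns, X


data = {
    "y": [1.0, 2.0, 3.0],
    "x1": [0.5, 1.5, 2.5],
    "x2": [10, 20, 30],
}
y, columns, X = design_from_formula("y ~ x1 + x2", data)
print(columns)
for row in X:
    print(row)
```

Everything a real framework has to decide (interactions, categorical
terms, transformations, missing data) is left out here, and those are
exactly the parts that need the community discussion.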

Wes and I made some effort to go through this at SciPy. From where I
sit, I think it's difficult to disentangle the data structures from
the formula implementation, or maybe I'd just prefer to finish
tackling the former because it's much more straightforward. So I'd
like to first finish the pandas-integration branch that we've started
and then focus on the formula support. This is on my (our, I hope...)
immediate long-term goal list. Then I'd like to come back to the
community and hash out the 'rules of the game' details for formulas
after we have some code for people to play with, which promises to be
"fun."

https://github.com/statsmodels/statsmodels/tree/pandas-integration

FWIW, I could also improve the categorical function to be much nicer
for the given examples (i.e., take a list, drop a reference category),
but I don't know that it's worth it, because it's really just a
stop-gap and ideally users shouldn't have to rely on it. Thoughts on
adding more stop-gaps?
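
To be concrete about "take a list, drop a reference category", this is
roughly the behavior I have in mind, written as a standalone sketch.
The name dummy_code and its signature are hypothetical and this is not
what the current categorical function does; defaulting the reference
to the first sorted level is also just an assumption.

```python
def dummy_code(values, reference=None):
    """Return (kept_levels, rows) of 0/1 indicators for a list of labels.

    Hypothetical helper for illustration; not the statsmodels API.
    The reference level gets no column, so it is encoded as all zeros.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level by default
    if reference not in levels:
        raise ValueError("reference %r not found in values" % (reference,))

    # Keep one indicator column per non-reference level.
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept]
            for value in values]
    return kept, rows


names, rows = dummy_code(["a", "b", "c", "a", "b"], reference="a")
print(names)
for row in rows:
    print(row)
```

Dropping the reference column is what keeps the dummies from being
collinear with an intercept, which is why users currently have to do
it by hand.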

If I understand Chris' concerns, I think pandas + formula will go a
long way towards bridging the gap between Python and R usability, but
it's a large effort and there are only a handful (at best) of people
writing code -- Wes being the only one who's more or less "full time"
as far as I can tell. The 0.4 statsmodels release should be very
exciting though, I hope. I'm looking forward to it, at least. Then
there's only the small problem of building an infrastructure and
community like CRAN so we can have specialists writing and maintaining
code...but I hope once all the tools are in place this will seem much
less daunting. There certainly seems to be the right sentiment for it.

Skipper


