Strengths of R (was Re: [SciPy-dev] IPython updated (Emacs works now))

Wed Feb 20 12:39:42 EST 2002

On Tue, 19 Feb 2002, eric wrote:

> Hey Tony,
> 
> > Note that one thing I'm working on doing is extending ESS (an Emacs mode for
> > data analysis, usually R or SAS, but also for other "interactive" data analysis
> > stuff) to accommodate IPython, in the hopes of (very slowly) moving towards using
> > R and SciPy for most of my work.
> 
> What are the benefits of R over Python/SciPy?  Is there a philosophical
> difference that makes it better suited for statistics (and other things for that
> matter), or is it simply that it has more stats functionality and is much more
> mature?  If there is a different philosophy behind it, can you summarize it?
> Maybe we can incorporate some of its strong points into SciPy's stats module.
> Travis Oliphant is working on it as we speak.  We could definitely do with some
> of you stats guys' input!

John Barnard, who is on this list, should speak up, since he is one of the few other people I know of (Doug Bates and some of his students at U Wisc being the other exceptions) who actually use Python for "work" (database or computation).

R (and the language it implements, S) is intended primarily for interactive data analysis.  So, a good bit of thought has gone into data structures (lists/dataframes in R parlance, which are a bit like arrays with row/column labels that can be used interchangeably with row/column numbers) and data types such as factors (and factor coding; some analytic approaches are not robust to the choice of coding style for nominal, i.e. categorical, non-ordered, data).  It has a means for handling missing data (similar to the MA extension for Numeric).

It also has a strong modeling structure: fitting linear models (using least squares or weighted least squares) is done in a language which "looks right", e.g.
  lm(y ~ x)
fits a model which looks like
  y = a + b x + e, e following the usual linear models assumptions
(the intercept a is included by default).  Smoothing methods (splines, kernels) are done in similar fashion.  Models are data objects as well, which means that you can act on them appropriately, comparing two fitted models, etc.
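
To make the "models are data objects" point concrete for the Python folks, here is a rough sketch of what a call like lm(y ~ x) computes, written in plain Python.  It is only an illustration, not how R implements it: the names LinearFit and fit_line are made up for this example, None stands in for a missing value, and dropping incomplete pairs before fitting is an assumption of the sketch.  R's formula includes an intercept unless told otherwise, so the sketch fits y = a + b x + e.

```python
from dataclasses import dataclass


@dataclass
class LinearFit:
    """Hypothetical container for a fitted model y = a + b*x + e."""
    intercept: float
    slope: float
    rss: float  # residual sum of squares
    n: int      # number of complete (non-missing) observations used


def fit_line(x, y, with_slope=True):
    """Ordinary least squares for one predictor (illustrative only).

    Pairs containing None are dropped first; None plays the role of a
    missing value in this sketch.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y)
             if xi is not None and yi is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete observations")

    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n

    if with_slope:
        sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
        if sxx == 0:
            raise ValueError("x has no variation; slope is undefined")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
        slope = sxy / sxx
    else:
        # Intercept-only ("null") model: the fitted value is the mean of y.
        slope = 0.0

    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs)
    return LinearFit(intercept, slope, rss, n)


# Fitted models are ordinary objects, so two of them can be compared.
x = [1.0, 2.0, 3.0, 4.0, None]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
full = fit_line(x, y)
null = fit_line(x, y, with_slope=False)
print(full)
print("reduction in RSS from adding x:", null.rss - full.rss)
```

The point is less the arithmetic than the shape of the result: the fit comes back as an object that can be stored, printed, and compared with another fit, which is roughly what R gives you for every model.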

R, as opposed to the commercial versions S or S-PLUS, also has a flexible packaging scheme for add-on packages (for things like expression array analysis, spatial-temporal data analysis, graphics, generalized linear models, marginal models, and it seems like hundreds more).  It can also call out to C, Fortran, Java, Python, and Perl (and C++, but that's recent, in the last year or so).  Database work is simple as well, though the interfaces are not up to those of Perl (or Python).

It also has lexical scoping, and was originally based on Scheme (though the syntax is like Python).
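
For anyone who has not run into the term: lexical scoping means a function looks up its free variables in the environment where it was defined, not where it is called.  A small Python sketch of the idea (make_scaler is a made-up name, and this shows the concept only, not R code):

```python
def make_scaler(factor):
    """Return a function that remembers `factor` from its defining scope."""
    def scale(values):
        # `factor` is resolved in the enclosing make_scaler call,
        # not in whatever scope happens to call scale().
        return [factor * v for v in values]
    return scale


# A global with the same name does not affect the closures below.
factor = 100
double = make_scaler(2)
triple = make_scaler(3)

print(double([1, 2, 3]))
print(triple([1, 2, 3]))
```

Each returned function carries its own value of factor, which is what makes it easy to build small, specialised functions on the fly during an interactive session.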

However, it's not a true OO language like Python, and some things seem to be hacks.  This is mostly an aesthetic problem, not a functional problem.

It's worth a look if you do data analysis.  In many ways, the strength is in the ease of programming good graphics, analyses, etc, with output which is easily read and intelligible.

It has problems in terms of the scale of data it can handle and speed.  It's not as clean to read as Python (i.e. I _LIKE_ meaningful indentation, which makes me weird :-), and it isn't as generally flexible (it took me twice as long to write a reader for Flow Cytometry Standard data file formats in R as in Python), but annotation of the resulting data is much easier in R than in Python (and default summary statistics, both numerical and graphical, are easier to work with).

So, I don't think I'll be giving up R, but I am looking forward to SciPy (esp things like the sparse array work, which is much more difficult to handle in R, in a nice format).

One thing that I did write was RPVM, for using PVM (a LAN cluster library) with R; I also patched up PyPVM so that the previous author's work actually worked :-).  That is how I'm thinking of doing the interfacing.

In general, R is great for pre- and post-processing small- and medium-sized datasets, as well as for descriptive and inferential statistics, but for custom analyses, one would still go to C, Fortran, or C++ after prototyping (much like Python).

I can try to say more, but it's hard to describe a full language quickly.  See http://www.r-project.org/ for more details (and for ESS, http://software.biostat.washington.edu/statsoft/ess/, if anyone is interested).

best,
-tony

HOWEVER, it doesn't have