basic statistics in python

Tim Churches tchur at optushome.com.au
Sat Mar 16 10:31:58 EST 2002


Siegfried Gonzi wrote:
> 
> Tim Churches wrote:
> 
> > Note that stats.py also returns the 2-tailed p-value as well (which can
> > also easily be obtained from R via RPy).
> >
> > Tim C
> 
> As a side note (but not related to the above problem): there exists also
> some peculiarities with the R-language:

"Particularities" might be a better word to avoid the perjorative sense
of "pecularity".

> 
> For example:
> 
> > data <- c(0.23,1.0023,1.223,1.235,5.6,9.0,23.3456,34.458,34.56,78.9)
> 
> > summary(data)
> 
> delivers:
> 
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   0.230   1.226   7.300  18.960  31.680  78.900
> 
> Everything is correct, except the 1st quantile and 3rd quantile.

You mean 1st quartile and 3rd quartile, not quantile. And the values
calculated by R are not wrong, just different (see below).

> 
> First, I could not believe it and fired up XLispStat:
> 
> (setf data (make-array 10 :initial-contents  '(0.23 1.0023 1.223 1.235
> 5.6 9.0 23.3456
> 34.458 34.56 78.9)))
> 
> (quantile data 0.25) and (quantile data 0.75) respectively:
> 
> delivers: 1.229 and 28.9018
> 
> The R-language calculates not only on Windows the values wrong; even on
> Unix: the values are the same as on Windows.

The results given by R are indeed the same on both platforms, but they
are not wrong.

> 
> Maybe they use some other method for calculating the quantiles.

There are a number of methods for calculating quantiles. In R, the
summary() function calls the quantile() function to calculate the 1st
and 3rd quartiles and the median. The quantile() function uses linear
interpolation to calculate the sample quantile for the probabilities of
0.25 and 0.75, whereas XLispStat is just taking the arithmetic mean of
the 2nd and 3rd, and the 6th and 7th, values respectively (using
zero-based indexing/counting, since this is the Python mailing list).
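XLispStat's approach is easy to sketch in Python (the function name below is
mine, purely for illustration - it is not a stats.py or XLispStat function):

```python
# Sketch of the mean-of-adjacent-values quantile method that XLispStat
# appears to use: average the two order statistics that straddle the
# position (n - 1) * p.  (Helper name is illustrative only.)
data = [0.23, 1.0023, 1.223, 1.235, 5.6, 9.0,
        23.3456, 34.458, 34.56, 78.9]

def midpoint_quantile(xs, p):
    """Average the two ordered values straddling position (n - 1) * p."""
    xs = sorted(xs)
    h = (len(xs) - 1) * p        # e.g. 9 * 0.25 = 2.25
    lo = int(h)                  # index of the value just below that position
    if h == lo:                  # landed exactly on an order statistic
        return xs[lo]
    return (xs[lo] + xs[lo + 1]) / 2.0   # plain average, no weighting

print(midpoint_quantile(data, 0.25))   # approx 1.229, XLispStat's value
print(midpoint_quantile(data, 0.75))   # approx 28.9018, XLispStat's value
```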

The methods used by R are fully described in the R manual (see
help(quantile)), but a commonsense explanation of the R approach is as
follows (again using zero-based indexing/counting). Since the number of
elements here (ten) is even, the median (50th percentile) is the average
of the 4th and 5th (ordered) values, which can be thought of as the
"4.5th value", i.e. halfway between the 4th value and the 5th value.
Note that (9 - 0) * 0.50 = 4.5. By extension, the 25th percentile (1st
quartile) should be the "2.25th value", i.e. one quarter of the way
between the 2nd value and the 3rd value (noting that (9 - 0) * 0.25 =
2.25), and the 75th percentile (3rd quartile) should be the "6.75th
value", i.e. three-quarters of the way between the 6th and 7th values
(noting that (9 - 0) * 0.75 = 6.75), which is what R returns. If you
think of your sequence of numbers as an empirical distribution function,
the approach used by R makes better sense, IMHO.

> 
> Personally: I can not cope with the R-language. It is rich of many
> build-in functions; but most of the time I am not successful in finding
> what I am searching for.

The R documentation is very extensive - if printed, I think the manual
for the base package runs to over 800 pages, and there are scores of
extension libraries each with their own documentation on top of that.
Finding the right function can require persistence - R certainly does
not take a minimalist approach to built-in functions like Python does -
but the online help/manual pages are searchable at the word level, so it
is unlikely that you will fail to find something if it exists.

> 
> The graphics are good; but I would always prefer Dislin as long as I do
> not need any specialized graphics from the field of statistics.

The main problem with Dislin is that it is not open source, nor is it
free on Unix or Mac OS X platforms. R is distributed under the GPL.

> 
> In R one can even read in binary files (you can even swap the binary
> order). But plotting a large array is a pain in the neck. In Dislin it
> is very fast and one can overlay maps (e.g. coastlines) without any
> problems (you get even the x- and y-axis annotation right:
> -180E....+180W,...).

There is a maps library available for R but geographical mapping is not
its forte. 

Tim C




More information about the Python-list mailing list