[SciPy-dev] An interesting exercise - reproduce R analysis using python.

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Jan 27 00:31:47 EST 2009


On Mon, Jan 26, 2009 at 3:55 AM, Alex Holcombe <alexh at psych.usyd.edu.au> wrote:
> Alan Jackson <alan <at> ajackson.org> writes:
>
>>
>> A quite good online book, "Using R for Data Analysis and Graphics" by
>> John Maindonald has been updated and has all the data and exercises
>> posted on the net. Relevance to this group? I think it might be quite
>> fun and instructive to work through the 'R' exercises with
>> numpy/matplotlib/scipy instead. I'm considering doing it, partly just
>> to broaden my numpy/matplotlib skillset, but I thought I would post
>> this here in case anyone else might find it an interesting challenge.
>>
>> Exercises and datasets can be found from links here :
>>
>> http://blog.revolution-computing.com/2009/01/using-r-for-data-analysis-and-graphics.html
>>
>
> A first step for me in having the R functionality I need within python is to
> take the data from an experiment and plot the dependent variable as a function
> of a subset of the independent variables in the experiment. For example plotting
> vehicle fuel efficiency as a function of mpg and vehicle weight. The loadtxt
> function easily imports my data files in a structure called a recarray, similar
> to a data.frame in R, a lot like a flat spreadsheet with a name for each column.
> I would like to collapse the other variables to determine the mean fuel
> efficiency for every combination of mpg and vehicle weight. If done in Excel, I
> think this involves a "PivotTable". For python, I wrote a function where I pass
> a recarray, the dependent variable name, and the names of the variables
> (datafile columns) that I want to collapse by, and it passes back
> multi-dimensional arrays providing the mean, standard deviation, and number of
> data points for every combination of the variables.
> collapseBy(data,DV,*factors)
> I am new to SciPy and don't know how to make my code follow its style, but I've
> posted the code here http://openwetware.org/images/7/7b/CollapseBy.txt
> and blogged on the subject here
> http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/
> I would hope that something like this could be put into SciPy?
>
> Alex Holcombe, http://www.psych.usyd.edu.au/staff/alexh/
>

Your code is stylistically pretty difficult to read for someone used
to Python code; I recommend taking a look at
http://www.python.org/dev/peps/pep-0008/. I also had trouble
following all the (temporary) variables, so I tried to come up with
something that is easier to read.

Since I'm not using record arrays, I wrote it for a standard array
holding the data, with column indices selecting the dependent and
explanatory variables. Below is the best of the three versions I came
up with; the two other versions are in the attached file. I didn't do
any timing, and I don't have mdp installed, so I couldn't compare
directly with your program.

I was also looking at some references for R, and I was pretty
surprised by the number of tutorials and books that are available for it.

Josef

#----------------------------------
import numpy as np

data = np.random.randint(1,3, size=(10,5))
keep = [1, 4]     # columns in data of the explanatory variables to condition on
dv = 0            # column in data of the dependent variable

# version1b: using dictionary to store indices to data
#-----------------------------------------------------

# build dictionary with unique combination as keys
#   and corresponding row indices as values
result1 = {}
for index, row in enumerate(data):
    result1.setdefault(tuple(row[keep]), []).append(index)

# calculate statistics for each combination (key)
stat1 = []
for k,v in sorted(result1.iteritems()):
    m = data[v,dv].mean()
    s = data[v,dv].std()
    stat1.append(list(k) + [m, s, len(v)])

# convert result statistic to numpy arrays
stat1n = np.array(stat1)

print "combination                mean        var         count"
print stat1n
#-----------------------------------------------
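
Since your original collapseBy works on a record array from loadtxt,
here is a rough sketch of the same dictionary-based grouping done with
field names instead of column indices. The field names ('mpg',
'weight', 'hp') and the little data set are made up for illustration;
any structured/record array with named columns should work the same
way, but I haven't tested this against your function.

#----------------------------------
import numpy as np

# hypothetical record array; in practice this would come from loadtxt
# (or recfromtxt) with a structured dtype
dt = [('mpg', float), ('weight', float), ('hp', float)]
rows = [(21.0, 2.6, 110.0), (22.8, 2.3, 93.0), (21.0, 2.6, 110.0),
        (18.7, 3.4, 175.0), (22.8, 2.3, 93.0), (18.7, 3.4, 175.0)]
recdata = np.array(rows, dtype=dt).view(np.recarray)

dvname = 'mpg'              # dependent variable (field name)
factors = ['weight', 'hp']  # explanatory variables to condition on

# group row indices by the unique combinations of the factor values
groups = {}
for index in range(len(recdata)):
    key = tuple(recdata[name][index] for name in factors)
    groups.setdefault(key, []).append(index)

# mean, standard deviation and count of the dependent variable
# for each combination
stat = []
for key, idx in sorted(groups.iteritems()):
    y = recdata[dvname][idx]
    stat.append(list(key) + [y.mean(), y.std(), len(idx)])

print np.array(stat)
#----------------------------------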
-------------- next part --------------
'''Calculate basic statistics of a dependent variable conditional on some
explanatory variables, while ignoring the other explanatory variables.

Grouping works only for discrete (categorical) explanatory variables.
'''

import numpy as np

data = np.random.randint(1,3, size=(10,5))
keep = [1, 4]     # columns in data of the explanatory variables to condition on
dv = 0            # column in data of the dependent variable


# version1: using dictionary to store data rows
#----------------------------------------------

# build dictionary with unique combination as keys
#   and corresponding data as values
result = {}
for row in data:
    result.setdefault(tuple(row[keep]), []).append(row)

# calculate statistics for each combination (key)
stat = []
for k,v in sorted(result.iteritems()):
    y = np.asarray(v)[:,dv]
    stat.append(list(k) + [y.mean(), y.std(), y.shape[0]])

# convert result statistic to numpy arrays
statn = np.array(stat)

print "combination                mean        var         count"  
print statn
assert np.sum(statn[:,len(keep)]*statn[:,-1])/data.shape[0] \
           == data[:,dv].mean()

# version1b: using dictionary to store indices to data
#-----------------------------------------------------

# build dictionary with unique combination as keys
#   and corresponding row indices as values
result1 = {}
for index, row in enumerate(data):
    result1.setdefault(tuple(row[keep]), []).append(index)

# calculate statistics for each combination (key)
stat1 = []
for k,v in sorted(result1.iteritems()):
    m = data[v,dv].mean()
    s = data[v,dv].std()
    stat1.append(list(k) + [m, s, len(v)])

# convert result statistic to numpy arrays
stat1n = np.array(stat1)

print "combination                mean        var         count"  
print stat1n

assert np.all(stat1n == statn)




# version2: using itertools groupby
#----------------------------------

import itertools

# sort rows by the kept columns; repr is used as the sort key to avoid
# element-wise comparison of numpy arrays (numpy sorting could be used instead)
datasorted = np.array(sorted(list(data), key=lambda x: repr(x[keep])))

stat2 = []
for k, v in itertools.groupby(datasorted, lambda x: repr(x[keep])):
    v2 = np.array(list(v))
    y = v2[:, dv]
    stat2.append(v2[0, keep].tolist() + [y.mean(), y.std(), y.shape[0]])

stat2n = np.array(stat2)

print "combination                mean        var         count"  
print statn
assert np.all(stat2n == statn)
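

# version2b (sketch): the same grouping without the repr trick, using
# np.lexsort to order the rows.  Only a rough alternative under the same
# data/keep/dv setup and imports as above; not timed or otherwise compared.
#----------------------------------------------------------------------

# np.lexsort uses the last key as the primary sort key, so reverse the
# kept columns: primary key = column keep[0], secondary = keep[1], ...
order = np.lexsort(data[:, keep].T[::-1])
datasorted2 = data[order]

stat2b = []
for k, v in itertools.groupby(datasorted2, lambda x: tuple(x[keep])):
    v2 = np.array(list(v))
    y = v2[:, dv]
    stat2b.append(list(k) + [y.mean(), y.std(), y.shape[0]])

stat2bn = np.array(stat2b)
print stat2bn

# lexsort is stable, so this should agree with version1 up to floating point
assert np.allclose(stat2bn, statn)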

