data management with python from perl
Peter Otten
__peter__ at web.de
Wed Oct 8 06:38:52 EDT 2003
ben moretti wrote:
> hi
>
> i'm learning python, and one area i'd use it for is data management in
> scientific computing. in the case i've tried i want to reformat a data
> file from a normalised list to a matrix with some sorted columns. to
> do this at the moment i am using perl, which is very easy to do, and i
> want to see if python is as easy.
>
> so, the data i am using is some epiphyte population abundance data for
> particular sites, and it looks like this:
>
> 1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00
>
> (i have changed the data to protect the innocent) the first four
> columns relate to the location, the fifth to the substrate, the sixth
> is the epiphyte species and the seventh the abundance. i need to turn
> this into a substrate x species matrix with columns 1 to 4 retained as
> sorting columns and the intersection of speces and substrate being the
> abundance. the species name needs to be the column headers. this is
> going to go into a multivariate analysis of variance programme that
> only takes its data in that format. here is an example of the output
>
> region location site stand substrate genus species 1 genus species
> 2 genus species 3 genus species 4 genus species 5 genus species
> 6 genus species 7
>
> <..etc..>
>
> 1 1 1 MO 906 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> <..etc...>
>
> so, to do this in perl - and i won't bore you with the whole script -
> i read the file, split it into tokens and then populate a hash of
> hashes, the syntax of which is
>
>
$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
> = $tokens[6]
>
> with the various location and species values are the keys of the hash,
> and the abundance is the $tokens[6] value. this now gives me a
> multidimensional data structure that i can use to loop over the keys
> and sort them by each as i go, then to write out the data into a
> matrix as above. the syntax for this is generally like
>
> # level 1 - region
> foreach $region (sort {$a <=> $b} keys %HoH) {
>
> # level 2 - location
> foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {
>
> # level 3 - site
> foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} })
>
> <... etc ...>
>
> there is a bit more perl obviously, but that is the general gist of
> it. multidimensional hash and then looping and sorting to get the data
> out.
>
> ok. so how do i do this in python? i've tried the "perlish" way but
> didn't get very far, however i know it must be able to be done!
The best solution would probably to be to rely on a database that supports
pivot tables.
However, I've put together a simple class to generate a pivot table to get
you started. It's only 2D, i. e. f(row,col) -> value, but if I have
understood you correctly that should be sufficient (I am not good at
reading perl).
To read your data from a (text) file, have a look at Python's csv module.
Peter
<code>
import sets
class Adder(object):
""" Adds all values entered via set()
"""
def __init__(self, value=0):
self.value = value
def set(self, value):
self.value += value
def get(self):
return self.value
_none = object()
class First(object):
""" Accepts any value the first time set() is called,
requires the same value on subsequent calls of set().
"""
def __init__(self):
self.value = _none
def set(self, value):
if self.value is _none:
self.value = value
else:
if value != self.value:
raise ValueError, "%s expected but got %s" % (self.value,
value)
def get(self):
return self.value
class Pivot(object):
""" A simple Pivot table generator class
"""
def __init__(self, valueAccumulator, rowHeaders):
self.rows = sets.Set()
self.columns = sets.Set()
self.values = {}
self.valueAccumulator = valueAccumulator
self.rowHeaders = rowHeaders
def extend(self, table, extractRow, extractColumn, extractValue):
for record in table:
r = extractRow(record)
c = extractColumn(record)
self.rows.add(r)
self.columns.add(c)
try:
fxy = self.values[r, c]
except KeyError:
fxy = self.valueAccumulator()
self.values[r, c] = fxy
fxy.set(extractValue(record))
def toTable(self, defaultValue=None, columnCompare=None,
rowCompare=None):
""" returns a list of lists.
"""
table = []
rows = list(self.rows)
rows.sort(rowCompare)
columns = list(self.columns)
columns.sort(columnCompare)
headers = self.rowHeaders + [c for c in columns]
table.append(headers)
for row in rows:
record = list(row)
for column in columns:
v = self.values.get((row, column), None)
if v is not None:
v = v.get()
record.append(v)
table.append(record)
return table
def printTable(p):
for row in p.toTable():
print row
if __name__ == "__main__":
table = [
"Jack Welsh Beer 1",
"Richard Maier Beer 1",
"Bill Bush Wine 2",
"Bill Bush Wine 2",
]
table = [row.split() for row in table]
print table
print "-" * 10
p = Pivot(Adder, ["Christian", "Surname"])
def extractRow(record):
return record[0], record[1]
def extractValue(record):
return int(record[3])
def extractColumn(record):
return record[2]
p.extend(table, extractRow, extractColumn, extractValue)
printTable(p)
columns = "region location site stand substrate species
abundance".split()
table = [
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 2", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 3", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 706, "species 4", 1],
# [1.0, 1.0, 1.0, "GO", 706, "species 4", 2],# uncomment me
[1.0, 1.0, 1.0, "GO", 806, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 106, "species 1", 1],
]
p = Pivot(First, columns[:5])
p.extend(table, lambda r: tuple(r[:5]),
lambda r: r[5],
lambda r: r[6])
printTable(p)
</code>
More information about the Python-list
mailing list