data management with python from perl

Wed Oct 8 06:38:52 EDT 2003

ben moretti wrote:

> hi
> 
> i'm learning python, and one area i'd use it for is data management in
> scientific computing. in the case i've tried i want to reformat a data
> file from a normalised list to a matrix with some sorted columns. to
> do this at the moment i am using perl, which is very easy to do, and i
> want to see if python is as easy.
> 
> so, the data i am using is some epiphyte population abundance data for
> particular sites, and it looks like this:
> 
> 1.00  1.00    1.00    "MO"    906.00  "genus species 1"       1.00
> 1.00  1.00    1.00    "MO"    906.00  "genus species 2"       1.00
> 1.00  1.00    1.00    "MO"    906.00  "genus species 3"       1.00
> 1.00  1.00    1.00    "MO"    906.00  "genus species 4"       1.00
> 
> (i have changed the data to protect the innocent) the first four
> columns relate to the location, the fifth to the substrate, the sixth
> is the epiphyte species and the seventh the abundance. i need to turn
> this into a substrate x species matrix with columns 1 to 4 retained as
> sorting columns and the intersection of speces and substrate being the
> abundance. the species name needs to be the column headers. this is
> going to go into a multivariate analysis of variance programme that
> only takes its data in that format. here is an example of the output
> 
> region        location        site    stand   substrate       genus species 1 genus species
> 2     genus species 3 genus species 4 genus species 5 genus species
> 6     genus species 7
> 
> <..etc..>
> 
> 1     1       1       MO      906     0       0       0       0       0       0       0       0       0       0       0       0       0       0
> 
> <..etc...>
> 
> so, to do this in perl - and i won't bore you with the whole script -
> i read the file, split it into tokens and then populate a hash of
> hashes, the syntax of which is
> 
>
$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
> = $tokens[6]
> 
> with the various location and species values are the keys of the hash,
> and the abundance is the $tokens[6] value. this now gives me a
> multidimensional data structure that i can use to loop over the keys
> and sort them by each as i go, then to write out the data into a
> matrix as above. the syntax for this is generally like
> 
> # level 1 - region
> foreach $region (sort {$a <=> $b} keys %HoH) {
> 
> # level 2 - location
> foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {
> 
> # level 3 - site
> foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} })
> 
> <... etc ...>
> 
> there is a bit more perl obviously, but that is the general gist of
> it. multidimensional hash and then looping and sorting to get the data
> out.
> 
> ok. so how do i do this in python? i've tried the "perlish" way but
> didn't get very far, however i know it must be able to be done!

The best solution would probably to be to rely on a database that supports
pivot tables. 
However, I've put together a simple class to generate a pivot table to get
you started. It's only 2D, i. e. f(row,col) -> value, but if I have
understood you correctly that should be sufficient (I am not good at
reading perl).
To read your data from a (text) file, have a look at Python's csv module.

Peter

<code>
import sets

class Adder(object):
    """ Adds all values entered via set()
    """
    def __init__(self, value=0):
        self.value = value
    def set(self, value):
        self.value += value
    def get(self):
        return self.value

_none = object()
class First(object):
    """ Accepts any value the first time set() is called,
        requires the same value on subsequent calls of set().
    """
    def __init__(self):
        self.value = _none
    def set(self, value):
        if self.value is _none:
            self.value = value
        else:
            if value != self.value:
                raise ValueError, "%s expected but got %s" % (self.value,
value)
    def get(self):
        return self.value

class Pivot(object):
    """ A simple Pivot table generator class
    """
    def __init__(self, valueAccumulator, rowHeaders):
        self.rows = sets.Set()
        self.columns = sets.Set()
        self.values = {}
        self.valueAccumulator = valueAccumulator
        self.rowHeaders = rowHeaders
    def extend(self, table, extractRow, extractColumn, extractValue):
        for record in table:
            r = extractRow(record)
            c = extractColumn(record)
            self.rows.add(r)
            self.columns.add(c)
            try:
                fxy = self.values[r, c]
            except KeyError:
                fxy = self.valueAccumulator()
                self.values[r, c] = fxy
            fxy.set(extractValue(record))

    def toTable(self, defaultValue=None, columnCompare=None,
rowCompare=None):
        """ returns a list of lists.
        """
        table = []
        rows = list(self.rows)
        rows.sort(rowCompare)
        columns = list(self.columns)
        columns.sort(columnCompare)
        headers = self.rowHeaders + [c for c in columns]
        table.append(headers)
        for row in rows:
            record = list(row)
            for column in columns:
                v = self.values.get((row, column), None)
                if v is not None:
                    v = v.get()
                record.append(v)
            table.append(record)
        return table

def printTable(p):
    for row in p.toTable():
        print row

if __name__ == "__main__":
    table = [
        "Jack Welsh Beer 1",
        "Richard Maier Beer 1",
        "Bill Bush Wine 2",
        "Bill Bush Wine 2",
    ]
    table = [row.split() for row in table]
    print table
    print "-" * 10
    p = Pivot(Adder, ["Christian", "Surname"])
    def extractRow(record):
        return record[0], record[1]
    def extractValue(record):
        return int(record[3])
    def extractColumn(record):
        return record[2]
    p.extend(table, extractRow, extractColumn, extractValue)

    printTable(p)

    columns = "region location site stand substrate species
abundance".split()

    table = [
        [1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 2", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 3", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 706, "species 4", 1],
#        [1.0, 1.0, 1.0, "GO", 706, "species 4", 2],# uncomment me
        [1.0, 1.0, 1.0, "GO", 806, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 106, "species 1", 1],
    ]
    p = Pivot(First, columns[:5])
    p.extend(table, lambda r: tuple(r[:5]),
                    lambda r: r[5],
                    lambda r: r[6])
    printTable(p)
</code>