A little advice please? (Convert my boss to Python)

Mon Apr 15 17:48:05 EDT 2002

My boss is considering moving to Python (from Poplog), so I've coded up
something to read in records from a text file (sample below) and return a
statistic based on the number of unique records (those occurring exactly
once) and paired records (those occurring exactly twice), as he requested.
Easy enough (code below), but now he wants to compare it for speed against
some existing code (Poplog?).  He also wants to be able to calculate the
statistic for only a subset of the variables.

So what I'm looking for is speed, and some advice so that I don't end up
trying too many alternatives.  Maybe I should be using Numeric arrays (and
slicing out the superfluous columns)?  Maybe an n-dimensional array (one
dimension per variable), just counting the cells that hold 1 or 2?  (Then
slice / marginalise and recount for queries on subsets of variables.  I like
the sound of this, but maybe there are limits on array size / number of
dimensions?)
Maybe I could avoid reading the data for superfluous variables and compare
records without the need to 'line.split()'?  Maybe my approach of using the
record as a dictionary key and incrementing the values is not the best way
of counting uniques and pairs?
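
One way to picture the "marginalise and recount" idea without Numeric is to
keep the dictionary of full-record counts and collapse it onto the wanted
columns.  The sketch below is only an illustration: `marginalise` and
`uniques_and_pairs` are made-up names, the column indices are 0-based
positions after the identification column is dropped, and whether this beats
rereading the file depends on how many distinct records there are.

```python
def marginalise(counts, columns):
    """Collapse a {record_tuple: count} table onto the given column indices."""
    marginal = {}
    for record, n in counts.items():
        key = tuple([record[i] for i in columns])
        marginal[key] = marginal.get(key, 0) + n
    return marginal


def uniques_and_pairs(counts):
    """Return (U, P): keys that occur exactly once and exactly twice."""
    values = list(counts.values())
    return values.count(1), values.count(2)


# Full-record counts for the five sample rows shown in this post
# (identification column dropped, so index 0 is Var1).
full_counts = {
    ('N', '0', '2', '3', '0'): 1,
    ('N', '0', '2', '2', '0'): 1,
    ('N', '1', '3', '3', '1'): 1,
    ('Y', '0', '2', '2', '2'): 1,
    ('N', '0', '2', '1', '0'): 1,
}

# Query on Var3 and Var4 only (indices 2 and 3).
var3_var4 = marginalise(full_counts, [2, 3])
U, P = uniques_and_pairs(var3_var4)
```

On the sample rows this gives U = 3 and P = 1 for Var3 and Var4, because the
record (2, 2) occurs twice once the other columns are ignored.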

Anyone done anything similar?  Any advice?  TIA.

Duncan

Identification   Var1  Var2  Var3  Var4  Var5 ...

0000000000001    N     0     2     3     0
0000000000002    N     0     2     2     0
0000000000003    N     1     3     3     1
0000000000004    Y     0     2     2     2
0000000000005    N     0     2     1     0

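def do_stuff(filename, S):  # S is a floating-point number
    # Read the whole file; each line is one whitespace-separated record.
    f = open(filename, 'r')
    lines = f.readlines()
    f.close()
    # Drop the identification column, then blank lines, then the header row.
    records = [line.split()[1:] for line in lines]
    records = [record for record in records if record]
    records = records[1:]
    # Count how often each distinct record occurs.  The dictionary is named
    # 'counts' so that the built-in dict is not shadowed.
    counts = {}
    for record in records:
        key = tuple(record)
        counts[key] = counts.get(key, 0) + 1
    # U = records seen exactly once, P = records seen exactly twice.
    U = P = 0
    for value in counts.values():
        if value == 1:
            U += 1
        elif value == 2:
            P += 1

    return U*S/(U*S+P*(1-S))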
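
For the subset-of-variables request, a possible variant is sketched below.
It is an illustration under assumptions rather than a tuned replacement:
`statistic_for_columns` and its `columns` argument are hypothetical names, it
takes any iterable of lines (so the file need not be read with readlines()
first), and it assumes the same layout as the sample, with the header as the
first non-blank line.

```python
def statistic_for_columns(lines, S, columns=None):
    """Same statistic as do_stuff, computed on selected variables only.

    lines   -- iterable of text lines, header first (an open file works)
    S       -- floating-point weight, as in do_stuff
    columns -- 0-based variable indices (0 = Var1), or None for all of them
    """
    counts = {}
    header_skipped = False
    for line in lines:
        fields = line.split()[1:]          # drop the identification column
        if not fields:
            continue                       # blank line
        if not header_skipped:
            header_skipped = True          # first non-blank line is the header
            continue
        if columns is not None:
            fields = [fields[i] for i in columns]
        key = tuple(fields)
        counts[key] = counts.get(key, 0) + 1

    values = list(counts.values())
    U = values.count(1)                    # records seen exactly once
    P = values.count(2)                    # records seen exactly twice
    return U * S / (U * S + P * (1 - S))


# The sample rows from this post, as a list of lines.
sample = [
    "Identification Var1 Var2 Var3 Var4 Var5",
    "",
    "0000000000001 N 0 2 3 0",
    "0000000000002 N 0 2 2 0",
    "0000000000003 N 1 3 3 1",
    "0000000000004 Y 0 2 2 2",
    "0000000000005 N 0 2 1 0",
]

result = statistic_for_columns(sample, 0.5, columns=[2, 3])
```

With S = 0.5 and only Var3 and Var4 selected, the sample rows give U = 3 and
P = 1, so the result is 1.5 / (1.5 + 0.5) = 0.75.  Like do_stuff, this raises
ZeroDivisionError when the denominator is zero.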