Elementary string-parsing

Odysseus odysseus1479-at at yahoo-dot.ca
Sun Feb 3 22:21:18 EST 2008


I'm writing my first 'real' program, i.e. one that has a purpose beyond 
serving as a learning exercise. I'd like to solicit comments on both the 
efficiency and the 'pythonicity' of my efforts at translating strings 
from an external source into useful data. My only significant 
programming experience is in PostScript, and I feel that I haven't yet 
'found my feet' concerning the object-oriented aspects of Python, so I'd 
be especially interested to know where I may be neglecting to take 
advantage of them.

My input is in the form of correlated lists of strings, which I want to 
merge (while ignoring some extraneous items). I populate a dictionary 
called "found" with these data, still in string form. It contains 
sub-dictionaries of various items keyed to strings extracted from the 
list "names"; these sub-dictionaries in turn contain the associated 
items I want from "cells". After loading in the strings (I have omitted 
the statements that pick up strings that require no further processing, 
some of them coming from a third list), I convert selected items in 
place. Here's the function I wrote:

def extract_data():
    i = 0
    while i < len(names):
        name = names[i][6:] # strip off "Name: "
        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}
###
Following is my first parsing step, for those data that represent real 
numbers. The two obstacles I'm contending with here are that the figures 
have commas grouping the digits in threes, and that sometimes the data 
are non-numeric -- I'll deal with those later. Is there a more elegant 
way of removing the commas than the split-and-rejoin below?
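
One candidate I can think of is the string method replace, which would 
drop the commas in a single call. A minimal sketch for comparison, 
assuming "---" and "n/a" are the only non-numeric placeholders that turn 
up; the helper name to_number is made up for illustration and is not 
part of my program:

```python
def to_number(text):
    """Convert a comma-grouped figure such as '1,234.5' to a float.

    The placeholders '---' and 'n/a' (assumed to be the only non-numeric
    values) are returned unchanged so they can be dealt with later.
    """
    if text in ("---", "n/a"):
        return text
    # str.replace removes every comma at once, with no split-and-rejoin
    return float(text.replace(",", ""))


if __name__ == "__main__":
    for sample in ("1,234,567.5", "42", "n/a"):
        print(repr(to_number(sample)))
```

Wrapping the conversion in a function like this would also keep the 
placeholder test in one place.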
###
        for k in ('time', 'score1', 'score2'):
            v = found[name][k]
            if v != "---" and v != "n/a": # skip non-numeric data
                v = ''.join(v.split(",")) # remove commas between 000s
                found[name][k] = float(v)
###
The next one is much messier. A couple of the strings represent times, 
which I think will be most useful in 'native' form, but the input is in 
the format "DD Mth YYYY HH:MM:SS UTC". Near the beginning of my program 
I have "from calendar import timegm". Before I can feed the data to this 
function, though, I have to convert the month abbreviation to a number. 
I couldn't come up with anything more elegant than look-up from a list: 
the relevant part of my initialization is
'''
m_abbrevs = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
'''
I'm also rather unhappy with the way I kludged the seventh and eighth 
values in the tuple passed to timegm, i.e. the day of the week and the 
day of the year respectively. (I would hate to have to calculate them.) 
The function doesn't seem to care what values I give it for these -- as 
long as I don't omit them -- so I guess they're only there for the sake 
of matching the output of the inverse function. Is there a version of 
timegm that takes a tuple of only six (or seven) elements, or any better 
way to handle this situation?
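
One possibility I've come across is time.strptime, which appears to 
parse this layout directly and fills in the weekday and day-of-year 
fields itself. A sketch under some assumptions: the month abbreviations 
are always the English ones in my list, the program runs in a locale 
where "%b" matches them, and every timestamp ends in a literal "UTC". 
The name parse_epoch is made up for illustration:

```python
import time
from calendar import timegm


def parse_epoch(stamp):
    """Convert 'DD Mth YYYY HH:MM:SS UTC' to seconds since the epoch.

    Assumes English month abbreviations and a trailing literal 'UTC'.
    """
    # %b matches the abbreviated month name in the current locale;
    # strptime also computes the weekday and day-of-year fields itself
    parsed = time.strptime(stamp, "%d %b %Y %H:%M:%S UTC")
    # timegm treats the time tuple as UTC, the inverse of time.gmtime
    return timegm(parsed)


if __name__ == "__main__":
    print(parse_epoch("03 Feb 2008 22:21:18 UTC"))
```

If that holds up, neither the month look-up nor the dummy -1 values 
would be needed.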
###
        for k in ('epoch1', 'epoch2'):
            dlist = found[name][k].split(" ")
            m = 0
            while m < 12:
                if m_abbrevs[m] == dlist[1]:
                    dlist[1] = m + 1
                    break
                m += 1
            tlist = dlist[3].split(":")
            found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
                                     int(dlist[0]), int(tlist[0]),
                                     int(tlist[1]), int(tlist[2]),
                                     -1, -1, 0))
        i += 1

The function appears to be working OK as is, but I would welcome any & 
all suggestions for improving it or making it more idiomatic.

-- 
Odysseus


