[Chicago] Parsing Metra's Online Schedule

Wed Apr 9 23:52:01 CEST 2008

I don't ride Metra anymore so I don't have much motivation to help
with Cosmin's Metra Schedule App. However, I happened to have some
free time on my hands (I'm unemployed), so I thought it might be fun
to make a rudimentary parser. Surprisingly, I hardly did any real
HTML parsing, since Metra's pages actually use PRE tags instead of
TABLE tags to display the tabular parts. So it devolved into typical
regex hacking.

The following code should not be construed as a complete solution.
All it does is parse the text inside the three PRE tags and merge the
data into a single 2D matrix (a list of rows). I also included a small
function that creates an HTML table out of the data. I've only tested
the code on a single page so far, but I think all the schedule pages
are laid out pretty much the same way. Basically, you can use this as
a starting point.
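
To make the "single 2D matrix" part concrete, here is a rough sketch
of how rows from several PRE blocks end up side by side. The function
name stitch() and the station names and times are made up for
illustration; the real pages may well look different.

```python
def stitch(blocks):
    """Join the rows of several blocks side by side into one matrix.

    Each block is a list of (town, times) pairs. The town name is kept
    only from the first block so it shows up once per row.
    """
    merged = []
    for i, block in enumerate(blocks):
        for j, (town, times) in enumerate(block):
            if j >= len(merged):
                merged.append([])
            merged[j] += [town] + times if i == 0 else times
    return merged


# Hypothetical data standing in for two parsed PRE blocks.
blocks = [
    [("Aurora", ["5:10", "5:40"]), ("Route 59", ["5:17", "5:47"])],
    [("Aurora", ["6:10"]), ("Route 59", ["6:17"])],
]
matrix = stitch(blocks)
for row in matrix:
    print(row)
```

Each output row starts with the town name, followed by the times from
every block in order.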

P.S. You need lxml to run the code.

------------------------------------------------------------
import re
import lxml.html as lh

def get_rows(tree):
    texts = [n.text_content() for n in tree.xpath('//pre')]

    trainNumRow = [' ']
    ampmRow = [' ']
    timeRows = []   # list of lists

    for i, text in enumerate(texts):
        lines = [line for line in text.split('\n')
                 if line.strip()]

        trainNums = lines[0].split()
        trainNumRow += trainNums
        ampmRow += lines[1].split()

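        for j, line in enumerate(lines[2:]):
            # a cell is a time (maybe prefixed with x), a ?--- filler, or a |
            matches = list(re.finditer(r"x?\d+\:\d+|.---|\|", line))
            # stop at the first line that isn't a full schedule row
            if len(matches) != len(trainNums):
                break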

            pos = matches[0].start()
            town = line[:pos].strip()
            times = [m.group() for m in matches]

            if j >= len(timeRows):
                timeRows.append([])

            timeRows[j] += [town]+times if i==0 else times

    yield trainNumRow
    yield ampmRow
    for row in timeRows:
        yield row

def make_table_file(filename, rows):
    """Write the rows out as a bare-bones HTML table."""
    import codecs
    # the with block closes the file even if a write fails
    with codecs.open(filename, 'w', 'utf-8') as fout:
        fout.write('<table border="1">')
        for row in rows:
            fout.write('<tr>')
            for v in row:
                if v.endswith('---'):   # get rid of the stupid \x97 char
                    v = '----'
                fout.write('<td>%s</td>' % v)
            fout.write('</tr>')
        fout.write('</table>')

if __name__ == '__main__':
    tree = lh.parse('test.html')
    rows = get_rows(tree)
    make_table_file('table.html', rows)

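To see what the regex in get_rows() picks out of a single line, here
is a minimal sketch. The sample line is invented (I'm assuming a
filler cell looks like a \x97 followed by three dashes, as the comment
in make_table_file suggests), so treat it as an illustration rather
than real Metra data.

```python
import re

# Same pattern as in get_rows(): a time (optionally prefixed with "x"),
# any character followed by "---", or a "|".
CELL = re.compile(r"x?\d+\:\d+|.---|\|")

# Hypothetical schedule line, not copied from a real page.
line = "Aurora           5:10   x5:40   \x97---   |"

matches = list(CELL.finditer(line))
# Everything before the first cell is the town name.
town = line[:matches[0].start()].strip()
cells = [m.group() for m in matches]

print(town)
print(cells)
```

The town name is simply whatever precedes the first match, which is
why get_rows() never needs a separate pattern for station names.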