[Chicago] Parsing Metra's Online Schedule
Feihong Hsu
hsu.feihong at yahoo.com
Wed Apr 9 23:52:01 CEST 2008
I don't ride Metra anymore so I don't have much motivation to help
with Cosmin's Metra Schedule App. However, I happened to have some
free time on my hands (I'm unemployed), so I thought it might be fun
to make a rudimentary parser. Surprisingly, I hardly did any real
HTML parsing, since Metra's pages actually use PRE tags instead of
TABLE tags to display the tabular parts. So it devolved into typical
regex hacking.

The following code should not be construed as a complete solution.
All it does is parse the text inside the 3 PRE tags and put the data
into a single 2D matrix. I also included a small function that
creates an HTML table out of the data. I only tested my code on a
single page so far, but I think all the schedule pages are pretty
much the same. Basically, you can use this as a starting point.

P.S. You need lxml to run the code.
------------------------------------------------------------
import codecs
import re

import lxml.html as lh


def get_rows(tree):
    # Each PRE block holds one chunk of the schedule as plain text
    texts = [n.text_content() for n in tree.xpath('//pre')]
    trainNumRow = [' ']
    ampmRow = [' ']
    timeRows = []    # list of lists
    for i, text in enumerate(texts):
        lines = [line for line in text.split('\n')
                 if line.strip()]
        # First line: train numbers; second line: AM/PM markers
        trainNums = lines[0].split()
        trainNumRow += trainNums
        ampmRow += lines[1].split()
        for j, line in enumerate(lines[2:]):
            matches = list(re.finditer(r"x?\d+\:\d+|.---|\|", line))
            # Stop at the first line that doesn't have one cell per train
            if len(matches) != len(trainNums):
                break
            # Everything before the first cell is the town name
            pos = matches[0].start()
            town = line[:pos].strip()
            times = [m.group() for m in matches]
            if j >= len(timeRows):
                timeRows.append([])
            # Only the first PRE block contributes the town column
            timeRows[j] += [town] + times if i == 0 else times
    yield trainNumRow
    yield ampmRow
    for row in timeRows:
        yield row


def make_table_file(filename, rows):
    with codecs.open(filename, 'w', 'utf-8') as fout:
        fout.write('<table border="1">')
        for row in rows:
            fout.write('<tr>')
            for v in row:
                if v.endswith('---'):    # get rid of the stupid \x97 char
                    v = '----'
                fout.write('<td>%s</td>' % v)
            fout.write('</tr>')
        fout.write('</table>')


if __name__ == '__main__':
    tree = lh.parse('test.html')
    rows = get_rows(tree)
    make_table_file('table.html', rows)
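
To see what the regex in get_rows picks out of a line, here is a small
sketch that needs only the standard library. The sample line is made up
(it is not copied from a Metra page), and split_schedule_line is a
hypothetical helper name, not something from the code above.

```python
import re

# Same pattern as in get_rows: a time (optionally prefixed with "x"),
# any single character followed by "---", or a lone "|".
CELL = re.compile(r"x?\d+\:\d+|.---|\|")


def split_schedule_line(line):
    """Hypothetical helper: return (town, cells) for one schedule line."""
    matches = list(CELL.finditer(line))
    if not matches:
        return line.strip(), []
    # Everything before the first matched cell is treated as the town name
    town = line[:matches[0].start()].strip()
    return town, [m.group() for m in matches]


# Made-up sample line; "\x97" stands in for the odd dash character
sample = "Aurora        5:10   x5:40   |   \x97---"
town, cells = split_schedule_line(sample)
print(town, cells)
```

Note that the ".---" alternative takes whatever character comes right
before the three hyphens, so a cell written as four plain hyphens after
a space would match starting at the space. That is probably fine for
the pages this was written against, but worth checking on other pages.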