[Chicago] Parsing Metra's Online Schedule

Thu Apr 10 00:23:00 CEST 2008

Thanks. I added your code to
http://github.com/cosmin/metratime/commit/60a7723281c98a5297b31222493bf2e9a8bf10e4
This should work for most pages. As far as I remember, there was a
single page in the schedule with an odd exception (something like an
HTML comment in the middle of the schedule). But I last looked at it
about a year ago, so it might have changed.
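
If that odd page is still around, one possible workaround is to strip
HTML comments from the raw page source before handing it to the parser.
This is only a sketch: strip_html_comments is a hypothetical helper that
is not in metratime, and the sample text below is made up.

```python
import re

# Hypothetical helper (not part of metratime): remove HTML comments from
# the raw page source before the <pre> blocks are parsed.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(html):
    """Return html with every <!-- ... --> comment removed."""
    return COMMENT_RE.sub("", html)

# Made-up sample resembling a schedule block with a stray comment in it.
sample = (
    "<pre>Chicago   8:00  9:00\n"
    "<!-- odd note -->\n"
    "Cicero    8:15  9:15</pre>"
)
cleaned = strip_html_comments(sample)
```

The blank line left behind where the comment was should be harmless,
since the parser already skips lines that are empty after strip().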

I'll try to get all the data parsed out sometime today and will probably
publish it in JSON format in case anyone else wants to do something
clever with it (and I won't require the use of my web framework in
exchange).
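
For the JSON part, something along these lines might do. It is a rough
sketch that assumes the row layout produced by get_rows in the quoted
code (train numbers first, then AM/PM markers, then one row per town,
each with a leading label cell). The name rows_to_json, the JSON keys,
and the sample rows are all made up for illustration.

```python
import json

def rows_to_json(rows):
    """Serialize parsed schedule rows to a JSON string.

    Assumes rows[0] holds train numbers and rows[1] holds AM/PM markers,
    each with a blank first cell, and that every later row is
    [town, time, time, ...].
    """
    rows = list(rows)   # get_rows is a generator, so materialize it
    train_nums, ampm, time_rows = rows[0], rows[1], rows[2:]
    data = {
        "trains": train_nums[1:],   # drop the blank label cell
        "ampm": ampm[1:],
        "stops": [{"town": r[0], "times": r[1:]} for r in time_rows],
    }
    return json.dumps(data, indent=2)

# Made-up sample rows in the same shape that get_rows yields.
sample_rows = [
    [" ", "101", "103"],
    [" ", "AM", "AM"],
    ["Chicago", "8:00", "9:00"],
    ["Cicero", "8:15", "|"],
]
schedule_json = rows_to_json(sample_rows)
```

Keeping the town name next to its list of times means a consumer does
not have to know that the first column is special.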

- Cosmin

On Wed, Apr 9, 2008 at 4:52 PM, Feihong Hsu <hsu.feihong at yahoo.com> wrote:
> I don't ride Metra anymore so I don't have much motivation to help
>  with Cosmin's Metra Schedule App. However, I happened to have some
>  free time on my hands (I'm unemployed), so I thought it might be fun
>  to make a rudimentary parser. Surprisingly, I hardly did any real
>  HTML parsing, since Metra's pages actually use PRE tags instead of
>  TABLE tags to display the tabular parts. So it devolved into typical
>  regex hacking.
>
>  The following code should not be construed as a complete solution.
>  All it does is parse the text inside the 3 PRE tags and put the data
>  into a single 2D matrix. I also included a small function that
>  creates an HTML table out of the data. I only tested my code on a
>  single page so far, but I think all the schedule pages are pretty
>  much the same. Basically, you can use this as a starting point.
>
>  P.S. You need lxml to run the code.
>
>  ------------------------------------------------------------
>  import re
>  import lxml.html as lh
>
>  def get_rows(tree):
>     texts = [n.text_content() for n in tree.xpath('//pre')]
>
>     trainNumRow = [' ']
>     ampmRow = [' ']
>     timeRows = []   # list of lists
>
>     for i, text in enumerate(texts):
>         lines = [line for line in text.split('\n')
>                  if line.strip()]
>
>         trainNums = lines[0].split()
>         trainNumRow += trainNums
>         ampmRow += lines[1].split()
>
>         for j, line in enumerate(lines[2:]):
>             matches = [m for m in re.finditer(r"x?\d+\:\d+|.---|\|", line)]
>             if len(matches) != len(trainNums):
>                 break
>
>             pos = matches[0].start()
>             town = line[:pos].strip()
>             times = [m.group() for m in matches]
>
>             if j >= len(timeRows):
>                 timeRows.append([])
>
>             timeRows[j] += [town]+times if i==0 else times
>
>     yield trainNumRow
>     yield ampmRow
>     for row in timeRows:
>         yield row
>
>  def make_table_file(filename, rows):
>     import codecs
>     fout = codecs.open(filename, 'w', 'utf-8')
>     fout.write('<table border="1">')
>     for row in rows:
>         fout.write('<tr>')
>         for v in row:
>             if v.endswith('---'):   # get rid of the stupid \x97 char
>                 v = '----'
>             fout.write('<td>%s</td>' % v)
>         fout.write('</tr>')
>     fout.write('</table>')
>     fout.close()
>
>  if __name__ == '__main__':
>     tree = lh.parse('test.html')
>     rows = get_rows(tree)
>     make_table_file('table.html', rows)
>

-- 
Cosmin Stejerean
http://blog.offbytwo.com