HTMLParser problems.

Thu Oct 30 16:36:53 EST 2003

I'm trying to take a webpage that has an n x n table of entries (bus times)
and convert it to a 2D array (list of lists).  Initially this was simple, but
I need to be able to access whole 'columns' of data, so the 2D array cannot
be sparse.  In the HTML file I'm parsing, however, there can be empty cells,
which are represented in the table as &nbsp; entities.  The sparse output
breaks my ability to use entire columns and have entries correspond properly.

Is there a simple way to tell the parser, whenever it sees a &nbsp; in table
data, to return say... "-1" or "NaN"?
The HTMLParser documentation is a bit... terse.  I was considering using
the handle_entityref() method, but I would assume the data has already been
parsed at that point.
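
On the question of when handle_entityref() fires: as far as I can tell, the
parser calls its handlers in document order, so the entity callback arrives
between the start and end tag of the cell it sits in. A minimal sketch to
check this, assuming Python 3's html.parser with convert_charrefs=False (the
Python 2 HTMLParser module used in this post has no such option and, to my
knowledge, behaves like the False setting). The EventLogger class and the
sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback in the order the parser makes it."""

    def __init__(self):
        # convert_charrefs=False keeps &nbsp; as a separate entityref event
        # instead of folding it into handle_data as "\xa0".
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


logger = EventLogger()
logger.feed("<td>6:01 P</td><td>&nbsp;</td>")  # hypothetical two-cell sample
logger.close()
for event in logger.events:
    print(event)
```

The second cell produces an entityref event and no data event, which is why a
handle_data-only parser silently skips it.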

I could try:
        def handle_entityref(self,entity):
                if self.in_td == 1:
                    if entity == "nbsp":
                        self.row.append(-1)

But that seems ugly... (comments?).
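
One alternative that may be less fragile than special-casing the entity is to
buffer each cell's text and decide at </td> whether the cell was empty. The
sketch below assumes Python 3's html.parser, where the default
convert_charrefs=True delivers &nbsp; to handle_data() as "\xa0", which
str.strip() removes. TableParser, the placeholder argument, and the sample
table are hypothetical names and data, not part of the code in this post:

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect <td> cells into rows, writing a placeholder for empty cells."""

    def __init__(self, placeholder="-1"):
        super().__init__(convert_charrefs=True)
        self.placeholder = placeholder
        self.matrix = []
        self.row = []
        self.cell = None  # list of text chunks while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td":
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            # "\xa0" (from &nbsp;) counts as whitespace, so strip() removes it.
            text = "".join(self.cell).strip()
            self.row.append(text or self.placeholder)
            self.cell = None
        elif tag == "tr":
            if self.row:
                self.matrix.append(self.row)
            self.row = []


# Hypothetical sample; the positions of the empty cells are invented.
sample = (
    "<table>"
    "<tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr>"
    "<tr><td>&nbsp;</td><td>5:34 C</td><td>&nbsp;</td></tr>"
    "</table>"
)
parser = TableParser()
parser.feed(sample)
parser.close()
for row in parser.matrix:
    print(row)
```

Because every </td> appends exactly one entry, each row ends up with one value
per cell, whether or not the cell held any text. Under Python 2 the same idea
should work by ignoring "nbsp" in handle_entityref() and keeping the check at
</td>, though I have not tested that variant.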

As an example here is some code I'm using and partial output:

#!/usr/local/bin/python
import string, urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
        def __init__(self):
                self.in_td = 0
                self.in_tr = 0
                self.matrix = []
                self.row = []
                self.reset()

        def handle_starttag(self,tag,attrs):
                if tag == "td":
                        self.in_td = 1
                elif tag == "tr":
                        self.in_tr = 1

        def handle_data(self,data):
                if self.in_td == 1:
                    data = string.lstrip(data)
                    if data != "":
                        self.row.append(data)

        def handle_endtag(self,tag):
                if tag == "td":
                        self.in_td = 0
                elif tag == "tr":
                        self.in_tr = 0
                        if self.row != []:
                            self.matrix.append(self.row)
                        self.row=[]

parser = foo()
socket = urllib.urlopen("http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
parser.feed(socket.read())
socket.close()
parser.close()
for row in parser.matrix:
    print row

A partial output of the above code is:
['5:12 C', '5:52 W']
['5:34 C']
['5:50 P']
['6:01 P', '6:10 G', '6:09 S', '6:59 U']
['6:10 P', '6:26 G', '6:23 C']
['6:23 P', '6:42 G', '6:35 W']
['6:34 P', '6:54 G', '6:47 S']
['6:46 P', '6:59 C']
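
The rows above have different lengths, so an entry's position in its row no
longer says which column it came from. As a rough illustration of what a
dense matrix would allow, assuming every row carries one entry per column
with "-1" as the placeholder, whole columns can be pulled out by transposing
with zip(). The dense values below are hypothetical, since the output above
does not show which cells were empty (written for Python 3):

```python
# Hypothetical dense matrix: one entry per column, "-1" marks an empty cell.
matrix = [
    ["5:12 C", "-1", "5:52 W"],
    ["-1", "5:34 C", "-1"],
]

# zip(*matrix) pairs up the i-th entry of every row, i.e. column i.
columns = [list(col) for col in zip(*matrix)]

# Real times in the first column, with placeholders filtered out.
first_column_times = [entry for entry in columns[0] if entry != "-1"]

print(columns)
print(first_column_times)
```

Note that zip() stops at the shortest row, so on ragged rows like the ones
above it would silently drop entries rather than raise an error.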

Any tips, suggestions, or comments would be greatly appreciated,

--
Sean
P.S.  If I already answered my question, that's great, but it would be nice
to have this in the group's archive for people with similar problems in the
future.