HTMLParser problems.
Peter Otten
__peter__ at web.de
Fri Oct 31 14:30:52 EST 2003
Sean Cody wrote:
> I'm trying to take a webpage that has a nxn table of entries (bus times)
> and
> convert it to a 2D array (list of lists). Initially this was simple but I
> need to be able to access whole 'columns' of data so the 2D array cannot
> be sparse but in the HTML file I'm parsing there can be sparse entries
> which
> are represented in the table as &nbsp; entities. The sparse output breaks my
> ability to use entire columns and have entries correspond properly.
>
> Is there a simple way to tell the parser whenever you see a &nbsp; in table
> data return say... "-1" or "NaN"?
> The HTMLParser documentation is a bit.... terse. I was considering using
> the handle_entityref() method but I would assume the data has already been
> parsed at that point.
>
> I could try:
> def handle_entityref(self,entity):
>     if self.in_td == 1:
>         if entity == "nbsp":
>             self.row.append(-1)
>
> But that seems ugly... (comments?).
>
> As an example here is some code I'm using and partial output:
[...]
> parser.feed(socket.read())
The simplest solution is to replace the above line with
parser.feed(socket.read().replace("&nbsp;", "NaN"))
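As a rough illustration of that one-line fix, here is a minimal sketch written for Python 3's html.parser module rather than the Python 2 HTMLParser module used in this thread. The CellCollector class and the sample HTML string are made up for the example and are not part of the original script.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a table cell.
        if self.in_td and data.strip():
            self.cells.append(data.strip())


# Made-up sample row with one empty (&nbsp;) cell.
html = "<table><tr><td>7:15</td><td>&nbsp;</td><td>7:45</td></tr></table>"

collector = CellCollector()
# Replace the entity in the raw text before the parser ever sees it.
collector.feed(html.replace("&nbsp;", "NaN"))
collector.close()
print(collector.cells)  # ['7:15', 'NaN', '7:45']
```

Note that the replacement is applied to the whole page, so an &nbsp; outside the timetable would be rewritten as well.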
Below is an only slightly more robust solution. It implements a rudimentary
"what table are we in?" check and can handle table cells with multiple data
chunks.
import htmllib,os,string,urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
    def __init__(self):
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0
        self.empty = "NaN"
        self.reset()
    def handle_starttag(self,tag,attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == 2:
            if tag == "td":
                assert self.cell is None
                self.cell = []
            elif tag == "tr":
                self.row = []
                self.matrix.append(self.row)
    def handle_data(self,data):
        if self.in_table == 2:
            if self.cell is not None:
                data = string.strip(data)
                if data or True:
                    self.cell.append(data)
    def handle_endtag(self,tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == 2:
            if tag == "td":
                s = " ".join(self.cell).replace("\n", " ")
                if s == "":
                    s = self.empty
                self.row.append(s)
                self.cell = None
            elif tag == "tr":
                self.row = None

parser = foo()
if 0:
    instream = urllib.urlopen(
        "http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
else:
    instream = file("105413bottom.html")
data = instream.read()
parser.feed(data)
instream.close()
parser.close()
for row in parser.matrix:
    assert len(row) == 4
    print row
I've replaced the urlopen() call with access to a local file as you might
want to run your tests with a local copy of the time table, too.
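For readers on Python 3, where the module is html.parser and &nbsp; arrives in handle_data() as a non-breaking space by default, the same idea might be sketched as follows. This is an adaptation under assumptions, not the code from this thread: the TableMatrix name, the depth and empty parameters, and the SAMPLE markup are invented for the example. The sample uses a single table, so depth defaults to 1, whereas the script in this message looks at nesting level 2.

```python
from html.parser import HTMLParser


class TableMatrix(HTMLParser):
    """Collect the cells of tables at one nesting depth into a list of rows."""

    def __init__(self, depth=1, empty="NaN"):
        super().__init__()
        self.depth = depth      # table nesting level to read (assumption)
        self.empty = empty      # placeholder stored for blank cells
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == self.depth:
            if tag == "tr":
                self.row = []
                self.matrix.append(self.row)
            elif tag == "td" and self.row is not None:
                self.cell = []

    def handle_data(self, data):
        if self.in_table == self.depth and self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == self.depth:
            if tag == "td" and self.cell is not None:
                # split() also discards the non-breaking space left by &nbsp;
                text = " ".join("".join(self.cell).split())
                self.row.append(text or self.empty)
                self.cell = None
            elif tag == "tr":
                self.row = None


# Made-up timetable fragment with two empty cells.
SAMPLE = """
<table>
<tr><td>6:05</td><td>&nbsp;</td><td>6:35</td></tr>
<tr><td>7:05</td><td>7:20</td><td>&nbsp;</td></tr>
</table>
"""

parser = TableMatrix()
parser.feed(SAMPLE)
parser.close()

# Every row has the same length, so zip() can turn rows into columns.
columns = [list(col) for col in zip(*parser.matrix)]
print(parser.matrix)
print(columns)
```

Because every empty cell is filled with the placeholder, all rows have equal length and whole columns can be pulled out with zip(), which is what the original question was after.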
Peter