Parsing HTML (continued)
Charlie Clark
charlie at begeistert.org
Fri Aug 10 14:55:37 EDT 2001
Danny Yoo gave me the following example:
>import htmllib, formatter
>
>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>
>    def end_b(self):
>        self.in_bold = 0
>
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>
>    def end_u(self):
>        self.in_underline = 0
>
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>
>    def handle_data(self, data):
>        if self.in_bold:
>            print "BOLD:", data
>        elif self.in_underline:
>            print "UNDERLINE:", data
>###
Well, I've had more success than I would have imagined possible, but I'm
still struggling with some things in this Sisyphean task. What I'm still
having difficulty with:
1) Nested tags
<br> and HTML entities cause difficulties because they can appear freely
inside other tags. I've been setting flags and collecting data, only to
get tripped up by a <br> or an HTML entity, and since I'm parsing German
text there are a lot of those.
2) Doing work only on specific attributes
I've written little string searches to fast forward in a page and
reduce the size of what has to be parsed. For the same reason I'd like
to be able to stop parsing on a specific event.
I've now got a particularly nasty webpage which distributes its
relevant content across various blocks, and triggering on simple anchors
catches too much data. How do I go about this? The specific example
would be checking the colour of a specific table cell:
<td height="40" bgcolor="eeeeff" width="50">
There don't seem to be predefined methods for tables in htmllib, so do
they all get handled as "unknown tag"? Would the thing to do be to define
a start_td or a do_td? And what do the _bgn methods do? I ask because the
example in the "Python Standard Library" works with "anchor_bgn" and not
"do_a" or "start_a".
I'm thinking along the lines of:

self.text = 0    # flag for whether I need the text

def ....td(self, attrs):
    if self.bgcolor == "eeeeff":
        store data, nested_tags
    else:
        fast_forward(next_td)
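
To make the idea above concrete, here is a minimal sketch, not tested against the real page. It is an assumption-laden stand-in: it uses Python 3's html.parser.HTMLParser rather than htmllib (which is not part of current Python), and the class name CellGlancer, its attribute names, and the sample markup are all made up for illustration. It keeps a flag while inside a <td> with the wanted bgcolor, treats a nested <br> as a line break instead of an interruption, and lets the parser decode entities before the text arrives.

```python
from html.parser import HTMLParser


class CellGlancer(HTMLParser):
    """Hypothetical helper: collect the text of <td> cells with one bgcolor."""

    def __init__(self, wanted_bgcolor="eeeeff"):
        # convert_charrefs=True decodes entities such as &uuml; before
        # handle_data() is called, so they need no special handling here.
        super().__init__(convert_charrefs=True)
        self.wanted_bgcolor = wanted_bgcolor.lower()
        self.in_wanted_cell = False  # flag: inside a matching <td>?
        self.chunks = []             # text pieces of the current cell
        self.cells = []              # text of every finished matching cell

    def _flush(self):
        # Store the current cell (if we were collecting one) and reset.
        if self.in_wanted_cell:
            self.cells.append("".join(self.chunks).strip())
        self.in_wanted_cell = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()  # tolerate a missing </td> before this cell
            # attrs is a list of (name, value) pairs; value may be None
            bgcolor = (dict(attrs).get("bgcolor") or "").lower()
            self.in_wanted_cell = bgcolor == self.wanted_bgcolor
        elif tag == "br" and self.in_wanted_cell:
            self.chunks.append("\n")  # keep the break, stay in the cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here as well.
        if self.in_wanted_cell:
            self.chunks.append(data)


# Made-up sample markup, only for illustration.
sample = (
    '<table><tr>'
    '<td height="40" bgcolor="eeeeff" width="50">'
    'Gr&uuml;&szlig;e<br>aus <b>K&ouml;ln</b></td>'
    '<td bgcolor="ffffff">not wanted</td>'
    '</tr></table>'
)
parser = CellGlancer()
parser.feed(sample)
parser.close()
print(parser.cells)
```

Cells with any other bgcolor are still parsed, but their text is dropped in handle_data, so this skips the data rather than truly fast-forwarding past it.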
many thanx,
Charlie