[Tutor] re question

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Fri Aug 8 16:42:57 EDT 2003



Hi Jonathan,




> Find all instances of such-and-such between two tags (for which I've
> received a helpful response).

I actually prefer using SGMLParser over HTMLParser.  SGMLParser has some
documentation here:

    http://www.python.org/doc/lib/module-sgmllib.html

but examples can't hurt.  *grin* Here's a small example of a program that
pays attention to italics:


###
>>> import sgmllib
>>> class MyParser(sgmllib.SGMLParser):
...     def __init__(self):
...         sgmllib.SGMLParser.__init__(self)
...         self.in_italic = 0
...         self.italicized_words = []
...     def start_i(self, attributes):
...         self.in_italic = 1
...     def end_i(self):
...         self.in_italic = 0
...     def handle_data(self, data):
...         if self.in_italic:
...             self.italicized_words.append(data)
...     def getItalics(self):
...         return self.italicized_words
...
>>> p = MyParser()
>>> p.feed("""Hello, this is a bit of <i>italicized</i> text.
... <i>hello</i> world!""")
>>> p.getItalics()
['italicized', 'hello']
###



> Strip out all (or possibly all-but-whitelist) tags from an HTML page
> (substitute "" for "<.*?>" over multiple lines?).

We can do this by redefining our parser not to do anything except pay
attention to handle_data():

###
>>> def strip_html(text):
...     class silly_parser(sgmllib.SGMLParser):
...         def __init__(self):
...             sgmllib.SGMLParser.__init__(self)
...             self.text = []
...         def handle_data(self, data):
...             self.text.append(data)
...         def get_text(self):
...             return self.text
...     p = silly_parser()
...     p.feed(text)
...     return ''.join(p.get_text())
...
>>> strip_html("""<html><body><p>Hello, this is <b>bolded</b>
...               test!</p></body></html>""")
'Hello, this is bolded\n              test!'
###


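Note: we can adjust this parser to be a little more discriminating about
the kinds of tags it ignores (all-but-whitelist?) by overriding the
unknown_starttag() and unknown_endtag() methods, which SGMLParser calls
for any tag that has no start_xxx()/end_xxx() handler of its own.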
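
Here's a rough sketch of the all-but-whitelist idea.  One loud assumption:
sgmllib was removed in Python 3, so this version is written against
html.parser.HTMLParser instead, where the hooks are named
handle_starttag(tag, attrs) and handle_endtag(tag).  The names
WhitelistStripper and the "allowed" parameter are made up for
illustration, and whitelisted tags are written back without their
attributes.

```python
from html.parser import HTMLParser


class WhitelistStripper(HTMLParser):
    """Drop every tag except those named in `allowed`; keep all text."""

    def __init__(self, allowed=()):
        super().__init__()
        self.allowed = set(allowed)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped on purpose: only the bare tag survives.
        if tag in self.allowed:
            self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        # Text between tags is always kept.
        self.pieces.append(data)


def strip_html(text, allowed=()):
    p = WhitelistStripper(allowed)
    p.feed(text)
    p.close()   # flush any text the parser is still buffering
    return "".join(p.pieces)


print(strip_html("<p>Hello, this is <b>bolded</b> <i>text</i>!</p>",
                 allowed=["b"]))
```

With an empty whitelist this behaves like the plain tag stripper above; it
is a sketch, not a sanitizer, so don't lean on it for untrusted input.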



> Iterate over links / images and selectively change the targets / sources
> (which would take me a lot of troubleshooting to do with RE).
>
> Possibly other things; I'm not trying to compete with HTMLParser but get
> some basic functionality. I'd welcome suggestions on how to do this with
> HTMLParser.


With the examples above, you should be able to retarget them to do the
other tasks in short order.  These are the sort of things where SGMLParser
really shines; it's worth learning how to use it well.
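
For the links-and-images task, here's one possible sketch.  The same
assumption applies: it uses html.parser.HTMLParser from Python 3, since
sgmllib is gone there.  The idea is to re-emit the document as it is
parsed, passing each href of an <a> and each src of an <img> through a
callback.  LinkRewriter, rewrite_links, to_absolute and the example.com
base URL are all hypothetical names chosen for illustration.  Comments
and doctype declarations are not copied to the output, and attribute
quoting is normalized to double quotes.

```python
from html import escape
from html.parser import HTMLParser


class LinkRewriter(HTMLParser):
    """Re-emit a document, passing link/image targets through `rewrite`."""

    # Which attribute holds the target for each tag we care about.
    TARGET_ATTR = {"a": "href", "img": "src"}

    def __init__(self, rewrite):
        # convert_charrefs=False so entities in the text reach the
        # handlers below and can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []

    def _open_tag(self, tag, attrs, closer=">"):
        target = self.TARGET_ATTR.get(tag)
        parts = [tag]
        for name, value in attrs:
            if value is None:          # bare attribute with no value
                parts.append(name)
                continue
            if name == target:
                value = self.rewrite(tag, value)
            # The parser hands us unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def rewrite_links(text, rewrite):
    p = LinkRewriter(rewrite)
    p.feed(text)
    p.close()
    return "".join(p.out)


def to_absolute(tag, url):
    # Hypothetical policy: prefix relative targets with a made-up base URL.
    if url.startswith(("http://", "https://", "#", "mailto:")):
        return url
    return "http://example.com/" + url.lstrip("/")


print(rewrite_links(
    '<p>See <a href="docs/a.html">this</a> &amp; <img src="/pic.png"></p>',
    to_absolute))
```

Since the callback receives the tag name as well as the URL, it can treat
links and images differently, or leave some targets alone entirely.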


If you run into problems, please feel free to email the Tutor list again,
and we can talk about SGMLParser in more depth.  Good luck!



