Help on regular expression match

John J. Lee jjl at pobox.com
Sat Sep 24 06:54:02 EDT 2005


"Fredrik Lundh" <fredrik at pythonware.com> writes:
[...]
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
> 
>     from HTMLParser import HTMLParser
> 
>     class MyHTMLParser(HTMLParser):
> 
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
> 
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
> 
> see:
> 
>     http://docs.python.org/lib/module-HTMLParser.html
>     http://docs.python.org/lib/htmlparser-example.html
>     http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

It's worth noting that the HTMLParser module is less tolerant of the bad
HTML you find in the real world than the sgmllib module, which has a
similar interface.  There are also third-party libraries, such as
BeautifulSoup and mxTidy, that you may find useful for parsing "HTML as
deployed" (i.e. bad HTML, often).

Also, htmllib is an extension to sgmllib, and will do your link
parsing with even less effort:

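import htmllib, formatter, urllib2
# NullFormatter throws away rendered output; only the parsed links matter here
pp = htmllib.HTMLParser(formatter.NullFormatter())
pp.feed(urllib2.urlopen("http://python.org/").read())
pp.close()
# anchorlist holds the href of every <a> tag the parser saw
print pp.anchorlist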
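
For anyone reading this on Python 3: htmllib and sgmllib were removed
in Python 3.0, and the HTMLParser module now lives at html.parser.  A
rough equivalent of the anchorlist trick might look like the sketch
below.  The names AnchorCollector and extract_links are made up for
illustration, and the sample string stands in for a fetched page.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collect href values, much like htmllib's anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value is not None:
                    self.anchorlist.append(value)


def extract_links(text):
    """Feed text to a fresh parser and return the hrefs it found."""
    parser = AnchorCollector()
    parser.feed(text)
    parser.close()
    return parser.anchorlist


sample = '<p><a href="http://python.org/">Python</a> and <A HREF="/doc/">docs</A></p>'
print(extract_links(sample))  # ['http://python.org/', '/doc/']
```

This is only a sketch: it does not resolve relative links or cope with
a <base> tag, both of which you would probably want for real pages.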


Module HTMLParser does have better support for XHTML, though.
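
One concrete piece of that XHTML support, as far as I can tell, is the
handle_startendtag hook, which is called for XHTML-style empty tags
such as <br />.  The sketch below uses the Python 3 module name
(html.parser); EmptyTagLogger and tag_events are hypothetical names
used only to show which hook fires for which tag.

```python
from html.parser import HTMLParser


class EmptyTagLogger(HTMLParser):
    """Hypothetical helper: record which handler fires for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags like <br />. The default
        # implementation would call handle_starttag then handle_endtag.
        self.events.append(("startend", tag))


def tag_events(text):
    """Parse text and return the list of (handler, tag) pairs seen."""
    parser = EmptyTagLogger()
    parser.feed(text)
    parser.close()
    return parser.events


print(tag_events("<p>line<br />break</p>"))
```

If you don't override handle_startendtag, the empty tag simply shows up
as a start tag followed by an end tag, so existing handlers keep working.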


John



