understanding htmllib

Fredrik Lundh fredrik at pythonware.com
Wed Oct 4 03:06:06 EDT 2006


David Bear wrote:

> I'm trying to understand how to use the HTMLParser in htmllib but I'm not
> seeing enough examples.
> 
> I just want to grab the contents of everything enclosed in a '<body>' tag,
> i.e. items from where <body> begins to where </body> ends. I start by doing
> 
> class HTMLBody(HTMLParser):
>    def __init__(self):
>       self.contents = []
> 
>    def handle_starttag()..
> 
> Now I'm stuck. I can't see that there is a method on handle_starttag that
> would return everything to the end tag. And I haven't seen anything on how
> to define my own handle_unknowntag..

htmllib is designed to be used together with a formatting object.  if 
you just want to work with tags, use sgmllib instead.  some variation of 
the SGMLFilter example on this page might be what you need:

     http://effbot.org/librarybook/sgmllib.htm
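
a minimal sketch of the event-stream approach, assuming a current Python 3, where sgmllib and htmllib are no longer in the standard library; html.parser.HTMLParser offers the same style of handler methods.  BodyExtractor and body_contents are illustrative names, not taken from the page above, and comments and doctype declarations inside the body are dropped:

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect the markup found between <body> and </body>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as it appeared
            self.contents.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags like <br/>; emit them once, unchanged
        if self.in_body:
            self.contents.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.contents.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.contents.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.contents.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.contents.append("&#%s;" % name)


def body_contents(html):
    """Return everything between the body tags as one string."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.contents)
```

calling body_contents(open("page.html").read()) would then give the body markup as a single string, without the enclosing body tags.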

if you want a DOM-like structure instead of an event stream, use

     http://www.crummy.com/software/BeautifulSoup/

usage:

 >>> import BeautifulSoup as BS
 >>> soup = BS.BeautifulSoup(open("page.html"))
 >>> str(soup.body)
'<body>\n<h1>Body Title</h1>\n<p>Paragraph</p>\n</body>'
 >>> soup.body.renderContents()
'\n<h1>Body Title</h1>\n<p>Paragraph</p>\n'

</F>



