html parser, unexpected '<' char in declaration

Sakcee sakcee at gmail.com
Mon Feb 20 18:01:53 EST 2006


html = '''<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'''

>>> import htmllib
>>> import formatter
>>> parser=htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(html)

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration


The error is caused by the unclosed DOCTYPE declaration.

What is the best way to handle this kind of document? Should I use a
regex to check for and strip the declaration, or does HTMLParser offer
something for this? Can I override the default sgmllib behaviour?
I have to work with htmllib because existing modules depend on it.
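
For reference, the regex approach I have in mind would look roughly like
the sketch below. strip_doctype is a hypothetical helper of my own, not
something htmllib or sgmllib provides, and it assumes the DOCTYPE has no
internal subset (no "[ ... ]" part), so it is only a rough pre-filter.

```python
import re

# Hypothetical helper (not part of htmllib or sgmllib): removes a DOCTYPE
# declaration, closed or unclosed, before the text reaches the parser.
# Assumption: the declaration contains no internal subset ("[ ... ]").
# The pattern consumes everything up to the next '<' or '>' and then an
# optional closing '>', so an unclosed declaration ends at the next tag.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE[^<>]*>?', re.IGNORECASE)


def strip_doctype(text):
    """Return text with any DOCTYPE declaration removed."""
    return _DOCTYPE_RE.sub('', text)


html = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
        '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
        '</body></html>')

# The cleaned string no longer contains the unclosed declaration.
cleaned = strip_doctype(html)
```

The cleaned string could then be passed to parser.feed(cleaned) in place
of the original one. This only removes the declaration; it does not try
to repair any other malformed markup.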


Thanks



