SGMLParseError

Jay Parlar jparlar at home.com
Wed Aug 22 21:18:02 EDT 2001


While trying to get htmllib's HTMLParser to work, I occasionally run into the following problem:

>>> import urllib
>>> from formatter import AbstractFormatter, DumbWriter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter()))
>>> parser.feed(urllib.urlopen('http://cbc.ca').read())
CBC.CA    Wednesday, Aug 22, 2001 nmweb02   shop[1] · help[2] · contact[3]
· search[4]   (image)[5] Email News Digest[6] | Audio[7] | Video[8] |
CBC Radio Newscast[9] | CBC Newsworld Newscast[10]    Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "c:\program files\python21\lib\sgmllib.py", line 91, in feed
    self.goahead(0)
  File "c:\program files\python21\lib\sgmllib.py", line 158, in goahead
    k = self.parse_declaration(i)
  File "c:\program files\python21\lib\sgmllib.py", line 238, in parse_declaration
    raise SGMLParseError(
SGMLParseError: unexpected char in declaration: '<'

It doesn't happen with every page (in fact, I have code which runs HTMLParser on over 400 separate pages, and 
only five of the pages cause this), but I really can't have it happening at all.

I've checked the list archives, and haven't found any solutions to this problem. Is there anything I can do besides 
catching the error? I'd really like some solution other than ignoring the pages that create this type of error. 
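
For reference, a minimal sketch of the catch-and-record approach mentioned above, written for a current Python rather than 2.1. The names parse_all, parse_one, error_type and demo_parse are hypothetical and used only for illustration. In the setup described here, parse_one would wrap parser.feed and error_type would be sgmllib.SGMLParseError; the stand-in parser below only imitates the failure so the sketch runs on its own.

```python
def parse_all(pages, parse_one, error_type=Exception):
    """Run parse_one on every (name, html) pair.

    Pages that raise error_type are recorded instead of aborting the run.
    Returns (parsed_names, failures), where failures holds (name, message).
    """
    parsed = []
    failed = []
    for name, html in pages:
        try:
            parse_one(html)
        except error_type as exc:
            # Keep the page name and message so the bad markup can be inspected later.
            failed.append((name, str(exc)))
        else:
            parsed.append(name)
    return parsed, failed


def demo_parse(html):
    # Hypothetical stand-in for the real parser: rejects a '<' that
    # appears directly inside a '<!' declaration.
    if "<!<" in html:
        raise ValueError("unexpected char in declaration: '<'")


pages = [("good", "<html><p>ok</p></html>"), ("bad", "<!<html>")]
parsed, failed = parse_all(pages, demo_parse, ValueError)
```

This keeps the run going and leaves a list of the failing pages, but it still skips them, which is exactly what I'd like to avoid.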

I've also been trying to use MSHTML as my parser, but that's giving me a whole array of problems of its own. The people 
on the Microsoft group I've been going to don't seem to be nearly as helpful as the Python people are :)

Jay Parlar
----------------------------------------------------------------
Software Engineering III
McMaster University
Hamilton, Ontario, Canada

"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."





