BeautifulSoup/sgmllib crash

John Nagle nagle at animats.com
Wed May 14 01:37:24 EDT 2008


    Here's another example of the annoying "attributes must be ASCII
but sgmllib doesn't check" problem.

Run "http://www.serversdirect.com" through BeautifulSoup, and watch it
blow up at this bogus HTML:

      <LI>Support Multi-Core Intel® Xeon® processor 3200/3000 sequence 
</LISUPPORT sequence 32003000 processor xeon® intel® multi-core>

The parser takes the ® symbol as part of the end tag's name, which then ends up in an attribute lookup:

    SGMLParser.feed(self, markup or "")
  File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/sgmllib.py", line 138, in goahead
    k = self.parse_endtag(i)
  File "/usr/local/lib/python2.5/sgmllib.py", line 315, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)

And we're downhill from there.  Probably worth fixing, since it's one of the
few real-world HTML bugs that totally blow up BeautifulSoup.
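
For illustration only, here is a rough sketch of the kind of guard that would sidestep the crash: check that the tag name is plain ASCII before using it in the getattr() lookup. This is not sgmllib's actual code. The helper find_end_handler, the _SAFE_TAG pattern, and the DemoParser class are all hypothetical names made up for this example, and the set of characters allowed in a tag name is an assumption.

```python
import re

# Assumption: a "safe" tag name starts with an ASCII letter and continues
# with ASCII letters, digits, and a few punctuation characters.
_SAFE_TAG = re.compile(r'^[A-Za-z][-.A-Za-z0-9:_]*$')


def find_end_handler(parser, tag):
    """Return parser.end_<tag> if it exists, else None.

    Hypothetical helper: tag names with spaces or non-ASCII characters
    are rejected before they ever reach getattr().
    """
    if not _SAFE_TAG.match(tag):
        return None
    return getattr(parser, 'end_' + tag, None)


class DemoParser(object):
    """Stand-in for a parser with one end-tag handler (illustration only)."""

    def end_li(self):
        return 'closed li'


# The bogus end tag from the page, lowercased as in the traceback.
bogus_tag = u'lisupport sequence 32003000 processor xeon\xae intel\xae multi-core'

good_handler = find_end_handler(DemoParser(), 'li')
bad_handler = find_end_handler(DemoParser(), bogus_tag)
```

With a check like this, the bogus end tag would simply have no handler and could be treated as an unknown tag instead of raising UnicodeEncodeError.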

					John Nagle
					SiteTruth


