sgmllib bug in Python 2.5, works in 2.4.

John Nagle nagle at animats.com
Sun Feb 4 21:49:15 EST 2007


(Was previously posted as a followup to something else by accident.)

    I'm running a website page through BeautifulSoup.  It parses OK
with Python 2.4, but Python 2.5 fails with an exception:

Traceback (most recent call last):
    File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
      self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
    File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
      BeautifulStoneSoup.__init__(self, *args, **kwargs)
    File "./sitetruth/BeautifulSoup.py", line 973, in __init__
      self._feed()
    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
      SGMLParser.feed(self, markup or "")
    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
      self.goahead(0)
    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
      k = self.parse_starttag(i)
    File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
      self.finish_starttag(tag, attrs)
    File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
      self.handle_starttag(tag, method, attrs)
    File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
      method(attrs)
    File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
      self._feed(self.declaredHTMLEncoding)
    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
      SGMLParser.feed(self, markup or "")
    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
      self.goahead(0)
    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
      k = self.parse_starttag(i)
    File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
      self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

      The failing code is in "_convert_ref", which is new in Python 2.5;
there is no such function in 2.4.  Per the traceback, parse_starttag
applies it to attribute values.  I think the code is trying to
handle single quotes inside of double quotes, or something like that.
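
      Another possible reading of the traceback, offered as a guess rather
than a diagnosis: byte 0xa7 is decimal 167, so the attribute value may contain
a character reference such as "&#167;" that gets converted to a one-byte
string and then mixed with unicode text.  Python 2 decodes byte strings as
ASCII when it does that, which would produce exactly this error.  The sketch
below only models that assumed behavior.  It is written for Python 3, and
CHARREF, convert_charref and substitute_refs are hypothetical names, not
sgmllib's actual code.

```python
import re

# Hypothetical pattern for decimal character references such as "&#167;".
# This is an illustration, not the pattern sgmllib actually uses.
CHARREF = re.compile(r"&#(\d+);")


def convert_charref(match):
    """Assumed behavior: turn a reference in 0..255 into a one-byte string."""
    n = int(match.group(1))
    if not 0 <= n <= 255:
        # Leave out-of-range references untouched.
        return match.group(0).encode("ascii")
    return bytes([n])


def substitute_refs(attrvalue):
    """Replace character references in a text (unicode) attribute value.

    Each replacement comes back as bytes and is decoded as ASCII before it is
    joined to the text.  That explicit decode stands in for the implicit
    coercion Python 2 performs when a byte string meets a unicode string.
    """
    pieces = []
    pos = 0
    for match in CHARREF.finditer(attrvalue):
        pieces.append(attrvalue[pos:match.start()])
        pieces.append(convert_charref(match).decode("ascii"))
        pos = match.end()
    pieces.append(attrvalue[pos:])
    return "".join(pieces)
```

      With this model, substitute_refs("a&#65;b") returns "aAb", while
substitute_refs("&#167;") raises UnicodeDecodeError for byte 0xa7 at
position 0, the same message as in the traceback above.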

      To replicate, run

	http://www.bankofamerica.com
or
	http://www.gm.com

through BeautifulSoup.

Something about this code doesn't like big companies. Web sites of smaller
companies are going through OK.

Also reported as a bug:

[ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5


					John Nagle


