sgmllib bug in Python 2.5, works in 2.4.

John Nagle nagle at animats.com
Wed Feb 7 03:37:50 EST 2007


John Nagle wrote:
> (Was previously posted as a followup to something else by accident.)
> 
>    I'm running a website page through BeautifulSoup.  It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
> 
> Traceback (most recent call last):
>    File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
>      self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
>    File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
>      BeautifulStoneSoup.__init__(self, *args, **kwargs)
>    File "./sitetruth/BeautifulSoup.py", line 973, in __init__
>      self._feed()
>    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>      SGMLParser.feed(self, markup or "")
>    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>      self.goahead(0)
>    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>      k = self.parse_starttag(i)
>    File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
>      self.finish_starttag(tag, attrs)
>    File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
>      self.handle_starttag(tag, method, attrs)
>    File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
>      method(attrs)
>    File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
>      self._feed(self.declaredHTMLEncoding)
>    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>      SGMLParser.feed(self, markup or "")
>    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>      self.goahead(0)
>    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>      k = self.parse_starttag(i)
>    File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
>      self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
> 
>      The code that's failing is in "_convert_ref", which is new in
> Python 2.5; that function wasn't present in 2.4.  I think the code is
> trying to handle single quotes inside of double quotes, or something
> like that.
> 
>      To replicate, run
> 
>     http://www.bankofamerica.com
> or
>     http://www.gm.com
> 
> through BeautifulSoup.
> 
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
> 
> Also reported as a bug:
> 
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5

    Found the problem and updated the bug report with a fix.  But someone
else will have to check it in.

    There's a place in SGMLParser where someone assumed that values 0..255
were valid ASCII characters.  But in fact the allowed range is 0..127.
The effect is that Unicode strings containing values between 128 and 255
will make SGMLParser raise the UnicodeDecodeError shown in the traceback.
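
    The range problem can be seen in isolation with a short sketch.  This
is not the sgmllib code; it only assumes that a byte with value 0xa7 ends
up being decoded as ASCII, which is what Python 2 does implicitly when a
byte string is combined with a Unicode string.  The helper name
try_ascii_decode is made up for illustration.

```python
def try_ascii_decode(data):
    """Decode bytes as ASCII; return the text, or the error message on failure."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)

# 0x41 is inside the ASCII range 0..127, so it decodes cleanly.
print(try_ascii_decode(b"\x41"))

# 0xa7 (decimal 167) is outside 0..127, so the ASCII codec rejects it.
print(try_ascii_decode(b"\xa7"))
```

The second call reports the same kind of error as the traceback: the
'ascii' codec cannot decode byte 0xa7.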

    In fact, you can even make this happen with an ASCII
source file by using an HTML entity which has a Unicode representation
between 128 and 255 (such as the one for "§", code point 167, which is
the 0xa7 in the traceback), then using something Unicode-oriented like
BeautifulSoup on it.
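
    As a rough illustration of the range check involved, here is a
minimal sketch of a character-reference converter that only resolves
values in 0..127.  The function name convert_charref_ascii_only is
hypothetical, and this is not the fix attached to the bug report; it
just shows the idea of leaving anything above 127 unresolved instead of
turning it into a non-ASCII byte.

```python
def convert_charref_ascii_only(name):
    """Hypothetical helper: resolve a decimal character reference
    (the "167" in "&#167;") only when it names an ASCII character."""
    try:
        n = int(name)
    except ValueError:
        return None  # not a decimal number; leave the reference alone
    if not 0 <= n <= 127:
        return None  # outside ASCII; leave the reference unresolved
    return chr(n)

print(convert_charref_ascii_only("65"))   # A
print(convert_charref_ascii_only("167"))  # None
```

Note that the input "167" is itself plain ASCII text, which is why an
all-ASCII source file can still trigger the problem.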

				John Nagle




