sgmllib bug in Python 2.5, works in 2.4.

Stefan Rank stefan.rank at ofai.at
Mon Feb 5 02:12:06 EST 2007


on 05.02.2007 03:49 John Nagle said the following:
> (Was previously posted as a followup to something else by accident.)
> 
>     I'm running a website page through BeautifulSoup.  It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
> 
> Traceback (most recent call last):
>     File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
>       self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
>     File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
>       BeautifulStoneSoup.__init__(self, *args, **kwargs)
>     File "./sitetruth/BeautifulSoup.py", line 973, in __init__
>       self._feed()
>     File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>       SGMLParser.feed(self, markup or "")
>     File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>       self.goahead(0)
>     File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>       k = self.parse_starttag(i)
>     File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
>       self.finish_starttag(tag, attrs)
>     File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
>       self.handle_starttag(tag, method, attrs)
>     File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
>       method(attrs)
>     File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
>       self._feed(self.declaredHTMLEncoding)
>     File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>       SGMLParser.feed(self, markup or "")
>     File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>       self.goahead(0)
>     File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>       k = self.parse_starttag(i)
>     File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
>       self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal
> not in range(128)
> 
>       The code that's failing is in "_convert_ref", which is new in Python 2.5.
> That function wasn't present in 2.4.  I think the code is trying to
> handle single quotes inside of double quotes, or something like that.
> 
>       To replicate, run
> 
> 	http://www.bankofamerica.com
> or
> 	http://www.gm.com
> 
> through BeautifulSoup.
> 
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
> 
> Also reported as a bug:
> 
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
> 
> 
> 					John Nagle

Hi,

I had a similar problem recently and did not have time to file a 
bug-report. Thanks for doing that.

The problem is in the code that handles entity and character references in 
SGMLParser.parse_starttag. It does not seem to be careful about mixing 
unicode and str values.

My quick'n'dirty workaround was to remove the offending character 
reference from the page text before feeding it to BeautifulSoup::

   text = text.replace('&#174;', '') # remove rights reserved sign entity
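
A slightly more general version of the same workaround, as a rough sketch only. It assumes the failure is triggered by decimal character references with code points in the range 128..255 (the traceback's byte 0xa7 would correspond to such a reference), and it simply drops them instead of fixing the parser. The names `_CHARREF` and `strip_high_charrefs` are made up for this illustration; hexadecimal references and named entities are left untouched.

```python
import re

# Hypothetical helper pattern: matches decimal character references
# such as "&#174;" and captures the number.
_CHARREF = re.compile(r'&#(\d+);')


def strip_high_charrefs(text):
    """Remove decimal character references with code points 128..255.

    Assumption: these are the references that trip the parser; all other
    references are returned unchanged.
    """
    def _repl(match):
        codepoint = int(match.group(1))
        if 128 <= codepoint <= 255:
            return ''            # drop the reference entirely
        return match.group(0)    # keep everything else as-is
    return _CHARREF.sub(_repl, text)
```

Used the same way as the one-liner, i.e. `text = strip_high_charrefs(text)` before handing the text to BeautifulSoup. Note that this loses the affected characters, so it is only suitable when they do not matter for the analysis.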

cheers,
stefan
