sgmllib bug in Python 2.5, works in 2.4.
Stefan Rank
stefan.rank at ofai.at
Mon Feb 5 02:12:06 EST 2007
on 05.02.2007 03:49 John Nagle said the following:
> (Was previously posted as a followup to something else by accident.)
>
> I'm running a website page through BeautifulSoup. It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
>
> Traceback (most recent call last):
>   File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
>     self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
>   File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
>   File "./sitetruth/BeautifulSoup.py", line 973, in __init__
>     self._feed()
>   File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>     SGMLParser.feed(self, markup or "")
>   File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
>     self.finish_starttag(tag, attrs)
>   File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
>     self.handle_starttag(tag, method, attrs)
>   File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
>     method(attrs)
>   File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
>     self._feed(self.declaredHTMLEncoding)
>   File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>     SGMLParser.feed(self, markup or "")
>   File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
>     self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
>
> The code that's failing is in "_convert_ref", which is new in Python 2.5.
> That function wasn't present in 2.4. I think the code is trying to
> handle single quotes inside of double quotes, or something like that.
>
> To replicate, run
>
> http://www.bankofamerica.com
> or
> http://www.gm.com
>
> through BeautifulSoup.
>
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
>
> Also reported as a bug:
>
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
>
>
> John Nagle
Hi,
I had a similar problem recently and did not have time to file a
bug report. Thanks for doing that.
The problem is in the code that handles entity and character references in
SGMLParser.parse_starttag. It does not seem to be careful about mixing
unicode and str values.
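
A minimal sketch of the kind of failure involved, assuming the cause is a byte string holding a non-ASCII byte being decoded as ASCII when it is combined with unicode text. Python 2 performs that decode implicitly; here it is written out explicitly so the snippet runs on a current interpreter. The helper name is made up for illustration and is not part of sgmllib or BeautifulSoup.

```python
def decode_like_implicit_coercion(raw_bytes):
    """Decode bytes as ASCII, which is what Python 2 does behind the
    scenes when a str is combined with a unicode object.
    (Hypothetical helper, for illustration only.)"""
    return raw_bytes.decode('ascii')


# 0xa7 is the byte named in the traceback; it is outside the ASCII range.
try:
    decode_like_implicit_coercion(b'\xa7')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed should match the last line of the traceback above.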
My quick'n'dirty workaround was to remove the offending character
reference from the page text before feeding it to BeautifulSoup::
    text = text.replace('&#174;', '') # remove rights reserved sign entity
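
A slightly more general variant of the same workaround, as a sketch only: rather than removing one specific reference, drop every decimal character reference above 127 before the markup reaches the parser. The function name and the regular expression are illustrative assumptions, not part of BeautifulSoup or sgmllib, and hexadecimal references are deliberately left alone.

```python
import re

# Matches decimal character references such as &#174; (assumed pattern).
_DECIMAL_CHARREF = re.compile(r'&#(\d+);')


def strip_high_charrefs(markup, replacement=''):
    """Replace decimal character references above 127 with `replacement`.

    References in the ASCII range (e.g. &#38;) are kept unchanged.
    (Hypothetical helper, for illustration only.)
    """
    def _substitute(match):
        codepoint = int(match.group(1))
        if codepoint > 127:
            return replacement
        return match.group(0)

    return _DECIMAL_CHARREF.sub(_substitute, markup)


cleaned = strip_high_charrefs('<a title="Foo&#174; &#38; Bar">')
```

Note that this throws the affected characters away, so it only suits cases where losing them from the parsed page is acceptable.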
cheers,
stefan