Unicode problem in BeautifulSoup; worked in Python 2.4, fails in Python 2.5.

Mizipzor mizipzor at gmail.com
Sun Feb 4 17:47:54 EST 2007


On Feb 4, 11:39 pm, John Nagle <n... at animats.com> wrote:
>     I'm running a website page through BeautifulSoup.  It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
>
> Traceback (most recent call last):
>    File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
>      self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
>    File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
>      BeautifulStoneSoup.__init__(self, *args, **kwargs)
>    File "./sitetruth/BeautifulSoup.py", line 973, in __init__
>      self._feed()
>    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>      SGMLParser.feed(self, markup or "")
>    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>      self.goahead(0)
>    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>      k = self.parse_starttag(i)
>    File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
>      self.finish_starttag(tag, attrs)
>    File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
>      self.handle_starttag(tag, method, attrs)
>    File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
>      method(attrs)
>    File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
>      self._feed(self.declaredHTMLEncoding)
>    File "./sitetruth/BeautifulSoup.py", line 998, in _feed
>      SGMLParser.feed(self, markup or "")
>    File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
>      self.goahead(0)
>    File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
>      k = self.parse_starttag(i)
>    File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
>      self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
>
>      The code that's failing is in "_convert_ref", which is new in Python 2.5.
> That function wasn't present in 2.4.  I think the code is trying to
> handle single quotes inside of double quotes, or something like that.
>
>      To replicate, run
>
>        http://www.bankofamerica.com
> or
>        http://www.gm.com
>
> through BeautifulSoup.
>
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
>
> Also reported as a bug:
>
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
>
>                                         John Nagle
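
For reference, here is a minimal sketch of the class of error shown in the traceback. It is written for Python 3, not the Python 2.5 of the report, and it does not reproduce the sgmllib code path. It only illustrates that byte 0xa7 lies outside ASCII, so an ASCII decode fails, while an explicit 8-bit codec accepts it. The helper name decode_ascii and the choice of Latin-1 are assumptions made for illustration.

```python
# Python 3 sketch; the original report concerns Python 2.5 and sgmllib.
raw = b"\xa7"  # the byte named in the traceback


def decode_ascii(data):
    """Return the decoded text, or the error message if ASCII decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# ASCII covers only bytes 0x00-0x7f, so this yields an error message.
message = decode_ascii(raw)
print(message)

# An explicit 8-bit codec succeeds; Latin-1 is an assumption, not a
# statement about the encoding of the pages in the report.
text = raw.decode("latin-1")
print(repr(text))
```

Whether the affected pages are actually Latin-1 is not something the traceback establishes, so the second decode only shows that the byte is valid under some 8-bit codec.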

I think this post got rather misplaced, hehe.



