BeautifulSoup error

Ben Finney bignose+hates-spam at benfinney.id.au
Fri Jun 16 01:20:48 EDT 2006


William Xu <william.xwl at gmail.com> writes:

> >>> import urllib
> >>> from BeautifulSoup import BeautifulSoup
> >>> url = 'http://www.google.com'
> >>> port = urllib.urlopen(url).read()

Gets the data from the HTTP response. (I'm not sure why you call this
"port".) The data is HTML text encoded to a string of bytes according
to the character encoding specified in the response header fields.

> >>> soup = BeautifulSoup()
> >>> soup.feed(port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.3/sgmllib.py", line 94, in feed
>     self.rawdata = self.rawdata + data
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb8 in position 565: ordinal not in range(128)
> >>>

The concatenation in sgmllib mixes a Unicode string with your byte
string, so Python implicitly decodes the data in 'port' to Unicode
using its default text encoding, 'ascii'. Byte 0xb8 is outside the
ASCII range (0-127), so the decode barfs.
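
A minimal sketch of the same failure in isolation. It assumes Python 3,
where the decode has to be requested explicitly rather than happening
behind your back during concatenation; the sample bytes are made up for
illustration and are not the actual Google response.

```python
# Hypothetical sample: a few ASCII bytes followed by the byte 0xb8,
# the same byte value reported in the traceback.
data = b"abc\xb8"

try:
    # 'ascii' only covers byte values 0-127, so 0xb8 cannot be decoded.
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The error names the offending byte and its position, which is usually
enough to tell that the data is in some 8-bit or multi-byte encoding
rather than plain ASCII.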

> Any ideas to solve this?

Get the character encoding specified in the HTTP response, and decode
the data to Unicode from that encoding.
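
A rough sketch of that approach, not a definitive implementation. It
assumes Python 3 and works on a Content-Type header value you have
already obtained from the response. The helper names
(charset_from_content_type, decode_body) are hypothetical, and falling
back to 'iso-8859-1' when no charset is given is an assumption you may
want to change.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or default."""
    if not content_type:
        return default
    # Let the standard library parse the header parameters for us.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset", default)


def decode_body(raw, content_type, default="iso-8859-1"):
    """Decode response bytes to text using the declared charset."""
    return raw.decode(charset_from_content_type(content_type, default))


# Hypothetical response body and header, for illustration only.
body = b"price: 5\xb8"
text = decode_body(body, "text/html; charset=ISO-8859-1")
print(repr(text))
```

With a live response from urllib.request.urlopen() in Python 3, the
same charset parameter should be available through
response.headers.get_content_charset(), so the header parsing step can
be skipped. Once the data is decoded to Unicode, hand that to the
parser instead of the raw bytes.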

-- 
 \       "Man cannot be uplifted; he must be seduced into virtue."  -- |
  `\                                       Donald Robert Perry Marquis |
_o__)                                                                  |
Ben Finney



