Need debugging knowhow for my creeping Unicodephobia

Chris Rebert clp2 at rebertia.com
Wed Feb 10 16:43:14 EST 2010


On Wed, Feb 10, 2010 at 1:03 PM, kj <no.email at please.post> wrote:
> In <402ac982-0750-4977-adb2-602b19149d81 at m24g2000prn.googlegroups.com>
> Jonathan Gardner <jgardner at jonathangardner.net> writes:
<huge snip>
>>It sounds like someone, probably beautiful soup, is trying to turn
>>your strings into unicode. A full stacktrace would be useful to see
>>who did what where.
>
> Unfortunately, there's not much in the stacktrace:
>
> Traceback (most recent call last):
>  File "./download_tt.py", line 427, in <module>
>    x = "%s %s" % (table['id'], table.tr.renderContents())
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41:
> ordinal not in range(128)

I think I've found the problem. According to the BeautifulSoup docs,
renderContents() returns a str (i.e. a byte sequence, UTF-8-encoded by
default) as opposed to a unicode (i.e. an abstract code point sequence).

Thus, as was said previously, you're combining the unicode from
table['id'] and the str from renderContents(), so Python tries to
implicitly convert the str to unicode by decoding it as ASCII.
However, it's not ASCII but UTF-8, hence the error about it having
non-ASCII bytes.
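
To see the failure in isolation, here is a minimal sketch. The sample
markup is made up (I don't have your page), and the explicit
.decode("ascii") only stands in for the conversion Python 2 does
implicitly when a str meets a unicode, assuming the default encoding
has not been changed from ASCII:

```python
# Hypothetical stand-in for what renderContents() returns:
# UTF-8 encoded bytes containing a non-breaking space (U+00A0).
rendered = u"<td>1\u00a0000</td>".encode("utf8")

def implicit_decode(raw):
    """Mimic Python 2's implicit str -> unicode conversion (ASCII)."""
    return raw.decode("ascii")

try:
    implicit_decode(rendered)
    error = None
except UnicodeDecodeError as exc:
    # U+00A0 is encoded as the two bytes 0xc2 0xa0; ASCII rejects 0xc2.
    error = exc

print(repr(error))
```

The byte that trips the codec is the first byte of a multi-byte UTF-8
sequence, which matches the 0xc2 in your traceback.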

Solution: Convert the output of renderContents() back to unicode.

x = u"%s %s" % (table['id'], table.tr.renderContents().decode('utf8'))

Now only unicode objects are being combined.
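
If you end up mixing the two types in more than one place, a small
helper may be less error-prone than sprinkling .decode() calls around.
This is only a sketch: to_text() is a name I made up, the sample
values are stand-ins for table['id'] and renderContents(), and it
assumes the bytes really are UTF-8:

```python
def to_text(value, encoding="utf8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical stand-ins for table['id'] and table.tr.renderContents().
table_id = u"results"
row_html = u"<td>1\u00a0000</td>".encode("utf8")

# Both arguments are text by the time they reach the format operator.
x = u"%s %s" % (to_text(table_id), to_text(row_html))
print(repr(x))
```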

Your problem is particularly ironic considering how well BeautifulSoup
handles Unicode overall; I was unable to locate a renderContents()
equivalent that returned unicode.

Cheers,
Chris
--
http://blog.rebertia.com