Need debugging knowhow for my creeping Unicodephobia

Wed Feb 10 16:33:41 EST 2010

On Wed, Feb 10, 2010 at 1:03 PM, kj <no.email at please.post> wrote:

> >What are y and z?
>
>   x = "%s %s" % (table['id'], table.tr.renderContents())
>
> where the variable table represents a BeautifulSoup.Tag instance.
>
> >Are they unicode or strings?
>
> The first item (table['id']) is unicode, and the second is str.
>

The problem is that you are mixing unicode and strings; if you want to support
unicode at all, you should use it -everywhere- within your program.

At the entry-points, you should convert from strings to unicode, then use
unicode -exclusively-.

Here, you are trying to combine a unicode variable with a string variable,
and Python is -trying- to convert it for you, but it can't. I suspect you
don't realize this implicit conversion is going on: Python is trying to
produce an output string according to what you've requested. To do that, it
upgrades the format string from str to unicode, then writes into the result
first table['id'], then table.tr.renderContents().

It's the table.tr.renderContents() call that, I believe, is your problem: it
returns a byte string, but one that contains encoded data. You have to
'upgrade' it to unicode yourself before you can pump it over into x, because
Python doesn't know what encoding that is. It just knows there's a byte
somewhere in there that doesn't fit into ASCII.

I can't quite tell you what the encoding is, though it may be UTF8. Try:
table.tr.renderContents().decode("utf8")
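
If you'd rather not sprinkle .decode() calls around, here is a minimal sketch
of a helper that does the conversion in one place. The name to_text, the
UTF-8 default, and the sample values are my own assumptions for illustration;
they are not part of BeautifulSoup, and your page may use another encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings explicitly."""
    # On Python 2, bytes is just another name for str.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical stand-ins for table['id'] and table.tr.renderContents().
row_id = u"t1"
rendered = u"He\u2014llo".encode("utf-8")

# Everything is unicode by the time it reaches the format string.
x = u"%s %s" % (to_text(row_id), to_text(rendered))
```

The format string is unicode as well, so nothing is left for Python to
convert implicitly.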

It's an easy mistake to make.

To further illustrate, do this:

>>> "%s %s" % (u'hi', u'bye')
u'hi bye'

Even though you're assigning a regular string to "x", you're using a format
string to merge a unicode string into it. So Python implicitly upgrades your
format string, as if you had typed:

>>> u"%s %s" % (u'hi', 'bye')

Upgrading a string to unicode is easy, provided a string contains nothing
but plain ASCII. The moment it contains encoded data, the default
translation fails. So you have to explicitly tell Python how to convert the
byte string back into unicode-- meaning, you have to decode it.

Take this example:

>>> s = u"He\u2014llo".encode("utf8")
>>> s
'He\xe2\x80\x94llo'
>>> "%s %s" % (u'hi', s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
>>> "%s %s" % (u'hi', s.decode("utf8"))
u'hi He\u2014llo'

I started out with a string, "s", containing Hello with an em-dash stuck in
the middle. I encoded it as UTF8: it's now a byte string similar to what may
exist on any webpage you get. Remember, pure "unicode" exists only in
memory. Everything on the internet or a file is /encoded/. You work with
unicode in memory, then encode it to some form when writing it out--
anywhere.

When I first try to combine the two, I get an error like you got.

But if I explicitly decode my 's', it all works fine. The encoded form is
translated back into unicode.
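
To make the "unicode in memory, encoded everywhere else" point concrete, here
is a small sketch of a round trip through a file. The temporary path and the
choice of UTF-8 are assumptions for the example; io.open is used because it
takes an encoding argument and does the encoding and decoding at the boundary.

```python
import io
import os
import tempfile

text = u"He\u2014llo"  # unicode, as it lives in memory
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # throwaway location

# Writing: the unicode text is encoded to UTF-8 bytes on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What actually sits on disk is encoded bytes, not "unicode".
with io.open(path, "rb") as f:
    raw = f.read()

# Reading: the bytes are decoded back to unicode on the way in.
with io.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()
```

Here raw holds the same bytes as the 's' in the session above, and decoded
is equal to the original text.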

As for how to debug and deal with issues like this... never mix unicode and
regular strings, use unicode strings exclusively in your logic, and decode
as early as possible. Then the only problem you're likely to run into is
situations where a claimed encoding/charset is a lie. (Web servers sometimes
claim charset/encoding A, or a file may claim charset/encoding B, even
though the file was written out by someone with charset/encoding C... and
there's no easy way to fix such situations.)
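
For the lying-charset case, about the best you can do is guess. Below is a
rough sketch that tries the claimed encoding first and then a few fallbacks.
The function name and the fallback list are my own assumptions. Keep in mind
that a decode that succeeds only proves the bytes are valid in that encoding,
not that it is the one the author actually used.

```python
def decode_best_effort(raw, claimed, fallbacks=("utf-8", "cp1252")):
    """Try the claimed encoding, then fallbacks; return (text, encoding)."""
    for encoding in (claimed,) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python doesn't know.
            continue
    # Nothing fit: keep what we can and mark the undecodable bytes.
    return raw.decode("utf-8", "replace"), None

# Bytes written as cp1252 but (wrongly) advertised as ASCII.
raw = u"caf\u00e9".encode("cp1252")
text, used = decode_best_effort(raw, "ascii")
```

Log which encoding was actually used; when the output looks garbled later,
that is the first thing you will want to know.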

HTH,
--S