unicode encoding usability problem

Neil Hodgson nhodgson at bigpond.net.au
Fri Feb 18 18:31:51 EST 2005


Martin v. Löwis:

> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.

   Tried both -U and sys.setdefaultencoding("undefined") on a couple of my
most used programs and saw a few library problems. One program reads job
advertisements from a mailing list, ranks them according to keywords, and
displays them using unicode to ensure that HTML entities like • are
displayed correctly. That program worked without changes.

   The second program reads my spam-filled mailbox, removing messages that
match a set of header criteria. It uses decode_header and make_header from
the email.Header library module to convert each header from a set of encoded
strings into a single unicode string. As email.Header is strongly concerned
with unicode, I expected it would be able to handle the two modifications
well.
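
   For readers who have not used these two functions, the conversion step
can be sketched roughly as below. This is an illustration, not the code from
my program: header_to_text is a hypothetical name, and the import uses the
lowercase email.header spelling found in later Python releases rather than
email.Header.

```python
from email.header import decode_header, make_header


def header_to_text(raw_header):
    """Collapse an RFC 2047 encoded header into one text string."""
    # decode_header returns a list of (value, charset) pairs,
    # one per encoded or plain chunk of the header.
    parts = decode_header(raw_header)
    # make_header rebuilds a Header object from those pairs;
    # converting it to str yields a single decoded string.
    return str(make_header(parts))


if __name__ == "__main__":
    print(header_to_text("=?iso-8859-1?q?caf=E9?="))
```

   The header criteria can then be matched against the single decoded
string rather than against each encoded chunk.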

   With -U, there was one bug in my code, which assumed that a string would
be 8-bit, and that was easily fixed. In email.Charset, __init__ expects a
non-unicode argument, as it immediately calls unicode(input_charset, 'ascii'),
which fails when the argument is already unicode. This can be fixed explicitly
in __init__, but I would argue for a more lenient approach, with
unicode(u, enc, err) always ignoring the enc and err arguments when the input
is already unicode. Next, sre breaks when building a mapping array because
array.array cannot take a unicode type code. This should probably be fixed
in array rather than sre, as mapping = array.array('b'.encode('ascii'),
mapping).tostring() is too ugly. The final issue was in encodings.idna, where
there is ace_prefix = "xn--"; uace_prefix = unicode(ace_prefix, "ascii"),
which again could avoid breakage if unicode were more lenient.
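
   The lenient behaviour could be approximated in user code along these
lines. This is only a sketch of the idea, not a patch to the built-in:
lenient_unicode and text_type are hypothetical names, and the try/except
around text_type is an assumption added so the same code also runs on
interpreters where the unicode built-in does not exist.

```python
try:
    text_type = unicode  # interpreters that have a separate unicode type
except NameError:
    text_type = str      # interpreters where str is already Unicode


def lenient_unicode(value, encoding="ascii", errors="strict"):
    """Decode byte strings; pass Unicode strings through unchanged."""
    if isinstance(value, text_type):
        # Already Unicode: ignore encoding and errors instead of failing.
        return value
    # Assumes value is a byte string with a decode method.
    return value.decode(encoding, errors)


if __name__ == "__main__":
    ace_prefix = u"xn--"
    # Works whether ace_prefix is a byte string or already Unicode.
    print(lenient_unicode(ace_prefix, "ascii"))
```

   With a helper like this, the email.Charset and encodings.idna calls would
behave the same with or without -U.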

   With sys.setdefaultencoding("undefined"), there were more problems and
they were harder to work around. One addition that could help would be a
function similar to str but with an optional encoding that would be used
when the input failed to convert to string because of a UnicodeError.
Something like

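def stri(x, enc='us-ascii'):
    # Try the ordinary conversion first.
    try:
        return str(x)
    except UnicodeError:
        # Fall back to an explicit encoding instead of the default one.
        return unicode(x).encode(enc)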

   Neil




