unicode woes

Martin v. Löwis loewis at informatik.hu-berlin.de
Thu Sep 26 05:27:42 EDT 2002


Ulli Stein <mennosimons at gmx.net> writes:

> Some weeks ago the management decided to deploy unicode so that we can 
> handle every charset uniformly. The problem which showed at first: we 
> realized that we cannot change the encoding in site.py afterwards, i.e. we 
> have to specify the encoding all the way. 

You mean, every time? Not necessarily. If you want to write Unicode to
a file, I recommend using codecs.open.
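For example, a minimal sketch (the file name and the encoding are
placeholders; use whatever your application needs):

import codecs

# Open the file with an encoding attached; write() then accepts
# Unicode objects directly and encodes them for you.
f = codecs.open("out.txt", "w", "utf-8")
f.write(u"\u20ac")
f.close()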

> It got worse and worse because 
> the unicode encoding spread like a virus through all the source code. 
> Simple Exceptions which we wanted to print to the console throw another 
> Exception:
> UnicodeError: ASCII decoding error: ordinal not in range(128)

I cannot understand this problem. I get

>>> def foo():
...   raise Exception(u"\u20ac")
... 
>>> foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 2, in foo
Exception>>> 

So even though it doesn't print the Unicode string, it does not cause
a UnicodeError, either.

> Furthermore, in Python you have no way to detect, which encoding a
> special unicode string actually is. We ended up having our own u()
> function.

What do you mean, 'actually'? A single Unicode string can be encoded
in many encodings, e.g. u"Hallo" can be encoded in ascii, latin-1,
koi8-r, or utf-8. Every Unicode string can be encoded in utf-8.
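For example:

>>> u"Hallo".encode("ascii")
'Hallo'
>>> u"Hallo".encode("koi8-r")
'Hallo'
>>> u"\u20ac".encode("utf-8")
'\xe2\x82\xac'

The Unicode object itself does not carry any of these encodings around;
an encoding only enters the picture when you convert to or from bytes.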

> The problem with that: we have to pass _every_ string through
> it. Ouch!

Again, I cannot understand the problem. Can you please show a bit of
code that demonstrates how you use that u() function, and what it
does?

> Now our source code is as ugly as our program is slow. Isn't there
> another way?

Not sure what problem you are trying to solve, so it is hard to tell
whether there is another way to solve it.

> When will the types string and unicode be merged together like in Java? If 
> this is not changing and we do not find another way, we will have to port 
> _all_ source code to C++.

Don't despair; I'm certain that there is an easier way. Try to follow
these principles:

- Never mix byte strings and Unicode strings (unless the byte strings
  are restricted to bytes <127, perhaps).
- In a Unicode application, convert all byte strings to Unicode as early
  as possible.
- Convert all Unicode data back to byte strings as late as possible.
- If you need to be sure that something can be printed for diagnostic
  output, use repr.
- For normal output, use a codecs.StreamWriter (see the sketch below).
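
A minimal sketch of that last point, assuming UTF-8 is acceptable on
your console (substitute any encoding you like):

import codecs, sys

# Wrap stdout in a StreamWriter; write() then accepts Unicode
# objects and encodes them on the way out.
out = codecs.getwriter("utf-8")(sys.stdout)
out.write(u"\u20ac\n")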

> def u(obj):

I'm not quite sure what this u function is supposed to do. If this is
meant to be an equivalent of str() which returns a Unicode object, I
recommend that you implement this as

def u(obj):
  try:
    # Succeeds for Unicode objects and for anything whose string
    # form is plain ASCII.
    return unicode(obj)
  except UnicodeError:
    # A byte string containing non-ASCII data: decode it explicitly.
    return unicode(obj, getEncoding())

>         encoding = getEncoding()
>         if encoding == None:
>                 raise EncodingNotInitialized("There was no encoding defined previously.")

This is quite inefficient. Why is it that no encoding could be defined
previously? There is always the system default encoding. Also, can't
you arrange for the encoding to be a global variable that is updated
whenever the encoding changes? That would avoid the function call.
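
A sketch of that arrangement (the names are made up; adapt them to
your module):

import sys

encoding = sys.getdefaultencoding()  # read directly, no function call

def setEncoding(name):
  # Call this from the few places that switch encodings.
  global encoding
  encoding = name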

>         
>         if type(obj) == types.StringType:
>                 return unicode(obj, encoding)

I recommend putting the most frequent case first. I also recommend
avoiding type tests. Instead, just invoke

   return unicode(obj, encoding)

and catch the exception you get if that fails (TypeError).
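
Put together, a sketch of that pattern (getEncoding is your function):

def u(obj):
  encoding = getEncoding()
  try:
    # The frequent case: a byte string that needs decoding.
    return unicode(obj, encoding)
  except TypeError:
    # Unicode objects, numbers, lists etc.; unicode() handles them.
    return unicode(obj)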

>         elif type(obj) == types.UnicodeType:
>                 # As we can't be sure what encoding the unicode object has we receive,
>                 # we will simply re-encode it.
>                 return unicode(obj.encode(encoding), encoding)

What good does that do? You perform a roundtrip conversion, which
either fails (with a UnicodeError) or gives you back an equivalent
object. I recommend writing

                  return obj

instead. If you really do need the UnicodeError when the Unicode
string cannot be encoded in encoding, write

                  obj.encode(encoding)
                  return obj

That saves at least one of the conversions. However, it is not clear to
me why you need the UnicodeError at all.

>         elif type(obj) == types.IntType or type(obj) == types.LongType:
>                 return unicode("%d" % obj, encoding)
>         elif type(obj) == types.FloatType:
>                 return unicode("%f" % obj, encoding)
>         elif type(obj) in [ types.TupleType, types.ListType, types.DictType, types.DictionaryType ]:
>                 return unicode(str(obj), encoding)

These can be summarized as

                  return unicode(obj)
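
unicode() already handles all of these directly:

>>> unicode(42)
u'42'
>>> unicode(3.14)
u'3.14'
>>> unicode((1, 2))
u'(1, 2)'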

It is not clear why you special-case None; if this is a string
conversion function, returning None is just as surprising. Instead, I
would expect u(None) to work as

>>> unicode(None)
u'None'

> >>> "öäüß%s"%unicode("öäü", "iso8859-15")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)

What do you expect to happen? You have a byte string, and you want to
insert a unicode string. Those two cannot be mixed unless you convert
one form to the other. For that, you need to know what encoding to
use; Python uses the default encoding, which only converts ASCII in
the standard installation.

If you follow the suggestion to never mix byte strings and Unicode
strings, this would not happen.
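
For example, decoding the format string as well makes both operands
Unicode, and the interpolation succeeds:

>>> unicode("öäüß%s", "iso8859-15") % unicode("öäü", "iso8859-15")
u'\xf6\xe4\xfc\xdf\xf6\xe4\xfc'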

Regards,
Martin


