unicode woes

Thu Sep 26 04:52:08 EDT 2002

Hi,

At the company where I work we have a big Python project. Four developers 
have been working on it for over half a year now, and it has grown really 
big. Nevertheless, thanks to Python's syntax and its high level, we still 
felt very confident in our decision to implement the project in Python.

Some weeks ago the management decided to adopt Unicode so that we can 
handle every charset uniformly. The first problem that showed up: we 
realized that we cannot change the encoding set in site.py afterwards, 
i.e. we have to specify the encoding explicitly everywhere. It got worse 
and worse because the unicode conversions spread like a virus through all 
the source code. Even simple exceptions that we wanted to print to the 
console raise another exception:
UnicodeError: ASCII decoding error: ordinal not in range(128)

Furthermore, Python gives you no way to detect which encoding a particular 
unicode string actually has. We ended up writing our own u() function. The 
problem with that: we have to pass _every_ string through it. Ouch! Now our 
source code is as ugly as our program is slow. Isn't there another way? 
When will the string and unicode types be merged, as in Java? If this does 
not change and we do not find another way, we will have to port _all_ 
source code to C++.

Please help.

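import types

# getEncoding(), EncodingNotInitialized and EncodingError come from our own
# project code and are not shown here.
def u(obj):
        encoding = getEncoding()
        if encoding is None: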
                raise EncodingNotInitialized("There was no encoding defined previously.")

        if type(obj) == types.StringType:
                return unicode(obj, encoding)
        elif type(obj) == types.UnicodeType:
                # We can't be sure what encoding the unicode object we receive has,
                # so we simply re-encode it.
                return unicode(obj.encode(encoding), encoding)
        elif type(obj) == types.NoneType:
                return None
        elif type(obj) == types.IntType or type(obj) == types.LongType:
                return unicode("%d" % obj, encoding)
        elif type(obj) == types.FloatType:
                return unicode("%f" % obj, encoding)
        elif type(obj) in [ types.TupleType, types.ListType, types.DictType, types.DictionaryType ]:
                return unicode(str(obj), encoding)
        else:
                raise EncodingError("Unconvertable object passed to the u-function.")
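For comparison, here is a minimal sketch of a narrower helper. It rests on one assumption worth stating: a unicode object is a sequence of code points and carries no encoding of its own, so only byte strings need decoding. The name to_text and the explicit encoding parameter are hypothetical, not part of our code base, and the two type aliases are only there so the sketch runs on both old and new interpreters.

```python
# Hypothetical sketch, not the project's u(): decode byte strings once,
# pass already-decoded text through unchanged.
text_type = type(u"")                    # unicode in Python 2, str in Python 3
bytes_type = type(u"".encode("ascii"))   # str in Python 2, bytes in Python 3

def to_text(obj, encoding):
    """Return obj as text; only byte strings are decoded."""
    if obj is None:
        return None
    if isinstance(obj, text_type):
        # Text has no encoding, so there is nothing to re-encode.
        return obj
    if isinstance(obj, bytes_type):
        return obj.decode(encoding)
    # Numbers and containers: fall back to their default text form.
    return text_type(obj)
```

With a helper like this, only data arriving from outside (files, sockets, the console) would need to pass through it; strings that are already unicode could be used as they are.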

>>> print "äöüß"
äöüß
>>> a = "äöüß"
>>> str(a)
'\xe4\xf6\xfc\xdf'
>>> print a
äöüß
>>> "öäüß%s"%unicode("öäü", "iso8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
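The last statement mixes a byte string (the format string) with a unicode object (the argument). Python then converts the byte string to unicode using the default ASCII codec, and that is where the error comes from. A minimal sketch of the explicit alternative follows, assuming the bytes are ISO 8859-15 as in the session above. The variable names are illustrative only.

```python
# Byte strings as they would arrive from an ISO 8859-15 console or file.
raw_template = u"\xf6\xe4\xfc\xdf".encode("iso8859-15")   # "öäüß" as bytes
raw_arg = u"\xf6\xe4\xfc".encode("iso8859-15")            # "öäü" as bytes

# Decode both sides explicitly before formatting, so no implicit
# ASCII conversion is ever attempted.
template = raw_template.decode("iso8859-15") + u"%s"
result = template % raw_arg.decode("iso8859-15")

# Encode again only at the point of output.
output_bytes = result.encode("iso8859-15")
```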