unicode woes

Boudewijn Rempt boud at valdyas.org
Thu Sep 26 05:10:06 EDT 2002


Ulli Stein wrote:

> 
> Some weeks ago the management decided to deploy unicode so that we can
> handle every charset uniformly. The problem which showed at first: we
> realized that we cannot change the encoding in site.py afterwards, i.e. we
> have to specify the encoding all the way. 

Use a sitecustomize.py file that contains:

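import sys

# site.py deletes sys.setdefaultencoding after sitecustomize has been
# imported, so keep a reference to it under another name.
sys.setappdefaultencoding = sys.setdefaultencoding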

Then, in the application starter script use:

import sys

if hasattr(sys, 'setappdefaultencoding'):
    # The alias exists only if sitecustomize.py was picked up.
    sys.setappdefaultencoding('utf-8')
elif sys.getdefaultencoding() != 'utf-8':
    print 'Warning: encoding not set for unicode - see ReadMe file'

That should solve a lot of your problems. Then you need to carefully
determine how the data that enters and leaves your application should
be encoded. For instance, you might be getting data from files, databases,
the network or paste operations. If that data comes in many different
encodings, you need to check the encoding at every entry point. And whenever
you write data, you need to be aware of the encoding you write it in.

But at least _inside_ your application all implicit conversions between
plain strings and unicode objects will use utf-8, which makes them easy to
handle.
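
A minimal sketch of that entry-point/exit-point discipline, assuming each
entry point knows the encoding of its source. The helper names to_text and
to_bytes are hypothetical, not part of any library, and the b'' and u''
literals are used only so that the sketch also runs on newer interpreters:

```python
def to_text(raw, encoding):
    """Decode bytes arriving at an entry point into a unicode object."""
    return raw.decode(encoding)


def to_bytes(text, encoding='utf-8'):
    """Encode a unicode object explicitly just before it leaves the program."""
    return text.encode(encoding)


# Hypothetical input: a Latin-1 encoded name read from a legacy file.
name = to_text(b'Andr\xe9', 'latin-1')
# Written back out as UTF-8; the e-acute becomes the two bytes C3 A9.
utf8_name = to_bytes(name)
```

With explicit calls like these at the borders, the default encoding only
matters for conversions you forgot to make explicit.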

Don't be too sanguine about the unicode handling of Java, by the way. Lots 
of room for hard-to-find unicode-related bugs there.

> It got worse and worse because
> the unicode encoding spread like a virus through all the source code.
> Simple Exceptions which we wanted to print to the console throw another
> Exception:
> UnicodeError: ASCII decoding error: ordinal not in range(128)

Other people will be able to tell you why it was a good design decision to
make it difficult to set a default application encoding different from 
ASCII. I'm still not convinced, but that might be just my denseness.
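
For the console error quoted above, one possible workaround (a sketch, not
the only approach) is to encode the message explicitly before printing it,
replacing whatever the console encoding cannot represent. The helper name
console_safe and the sample message are hypothetical:

```python
def console_safe(text, encoding='ascii'):
    """Return text encoded for the console; unencodable characters become '?'."""
    return text.encode(encoding, 'replace')


# Hypothetical error message containing a non-ASCII character (u-umlaut).
message = console_safe(u'Datei nicht gefunden: \xfcbung.txt')
```

On the Python 2 interpreters discussed here, the result is a plain string
that print can write without attempting an implicit ASCII conversion.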

-- 
Boudewijn Rempt | http://www.valdyas.org


