unicode woes

Ulli Stein mennosimons at gmx.net
Thu Sep 26 05:52:49 EDT 2002


Boudewijn Rempt wrote:

> Ulli Stein wrote:
> 
>> 
>> Some weeks ago the management decided to deploy unicode so that we can
>> handle every charset uniformly. The first problem that showed up: we
>> realized that we cannot change the encoding set in site.py afterwards,
>> i.e. we have to specify the encoding explicitly everywhere.
> 
> Use a sitecustomize.py file that contains:
> 
> import sys
> 
> sys.setappdefaultencoding=sys.setdefaultencoding
> 
> Then, in the application starter script use:
> 
> import sys
> 
> if hasattr(sys, 'setappdefaultencoding'):
>     sys.setappdefaultencoding('utf-8')
> elif sys.getdefaultencoding() != 'utf-8':
>     print 'Warning: encoding not set for unicode - see ReadMe file'
> 
> That should solve a lot of your problems. Then you need to carefully
> determine how the data that enters and leaves your application should
> be encoded. For instance, you might be getting data from files, databases,
> network or paste operations. If that data is in many different encodings, you
> need to check the encoding at every entry point. And whenever you write
> data, you need to be aware of the encoding.
> 
> But at least _inside_ your application all data will be utf-8, which makes
> it easy to handle.
> 
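The "decode at every entry point, encode at every exit" advice can be sketched roughly as below. This is only an illustration: read_text and write_text are hypothetical helpers, the in-memory streams stand in for real files or sockets, and it assumes the data is kept internally as decoded text rather than as UTF-8 byte strings.

```python
# -*- coding: utf-8 -*-
import io


def read_text(stream, encoding):
    """Hypothetical helper: decode bytes as soon as they enter the program."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Hypothetical helper: encode text only at the moment it leaves."""
    stream.write(text.encode(encoding))


# Two made-up inputs carrying the same word in different encodings.
latin1_source = io.BytesIO(u"Grüße".encode("latin-1"))
utf8_source = io.BytesIO(u"Grüße".encode("utf-8"))

# Each entry point names its own encoding explicitly.
a = read_text(latin1_source, "latin-1")
b = read_text(utf8_source, "utf-8")

# Inside the program both values are plain text and can be mixed freely.
out = io.BytesIO()
write_text(out, a + u" / " + b, "utf-8")
result = out.getvalue()
```

Once both inputs are decoded they compare equal, regardless of how they were encoded on the way in.
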
> Don't be too sanguine about the unicode handling of Java, by the way. Lots
> of room for hard-to-find unicode-related bugs there.
> 
>> It got worse and worse because
>> the unicode encoding spread like a virus through all the source code.
>> Simple exceptions that we wanted to print to the console raise another
>> exception:
>> UnicodeError: ASCII decoding error: ordinal not in range(128)
> 
> Other people will be able to tell you why it was a good design decision to
> make it difficult to set a default application encoding different from
> ASCII. I'm still not convinced, but that might be just my denseness.
> 

But you would still have to use the Python unicode() function
everywhere.

Or how would you do this (nonsense string):
str = str + "blaäö" + str[:5] + "ßüä"

Would you then write:
str = str + unicode("blaäö") + str[:5] + unicode("ßüä")?
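
A rough sketch of what the two spellings amount to. It is written so that it also runs on a current interpreter, so bytes.decode() stands in for unicode(byte_string, encoding), and to_text is a hypothetical helper, not part of Python. The point it tries to show: converting a non-ASCII byte string without naming an encoding goes through the ASCII default and fails, whereas u"..." literals need no conversion call at all.

```python
# -*- coding: utf-8 -*-

# The bytes a plain (non-unicode) literal would hold in a UTF-8 source file.
raw = u"blaäö".encode("utf-8")


def to_text(data, encoding="ascii"):
    """Hypothetical stand-in for unicode(data, encoding)."""
    return data.decode(encoding)


# With the ASCII default, the conversion itself raises UnicodeError.
try:
    to_text(raw)
    ascii_failed = False
except UnicodeError:
    ascii_failed = True

# Naming the encoding explicitly works.
text = to_text(raw, "utf-8")

# Unicode literals avoid the conversion calls entirely.
s = u"prefix-"
s = s + u"blaäö" + s[:5] + u"ßüä"
```

So wrapping every literal in unicode() would probably not be needed; marking the literals themselves as unicode should be enough.
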

And what about the "encoding" variable in site.py: if another Python 
application running in parallel to ours changes the encoding string, 
will it affect our app? Or does every application have its own encoding 
string?
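
A small experiment along these lines might help check it. It assumes the setting lives in each interpreter process (site.py is run at startup by every interpreter) and uses a made-up attribute, sys.app_encoding, as a stand-in for the real setting. It does not cover the case where someone edits the shared site.py file on disk, which would affect every interpreter started afterwards.

```python
import subprocess
import sys

# Hypothetical stand-in for "another application changes its encoding":
# the child process sets an attribute on its own copy of the sys module.
child_code = "import sys; sys.app_encoding = 'latin-1'; print(sys.app_encoding)"

# Run the child in a separate interpreter process and capture its output.
child_output = subprocess.check_output([sys.executable, "-c", child_code])
child_value = child_output.decode("ascii").strip()

# The parent process never sees the child's change.
parent_has_attr = hasattr(sys, "app_encoding")
```

If the assumption holds, the child reports its own value while the parent's sys module stays untouched.
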

U.


