unicode encoding usability problem

aurora aurora00 at gmail.com
Fri Feb 18 12:10:04 EST 2005


I have long found Python's default encoding of strict ASCII frustrating.  
For one thing, I would prefer to get garbage characters rather than an  
exception. But the biggest issue is that Unicode exceptions often pop up  
in unexpected places, and only when a non-ASCII or unicode character  
first finds its way into the system.
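
(As an aside, the codec machinery can already give you garbage  
characters instead of an exception, but only if you ask for it  
explicitly with an error handler; a quick Python 2 illustration:

>>> '\xe5'.decode('ascii', 'replace')
u'\ufffd'
>>> '\xe5'.decode('ascii', 'ignore')
u''

The default everywhere, though, is 'strict'.)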

Below is an example. The program may run fine at the beginning. But as  
soon as a unicode character u'b' is introduced, the program blows up  
unexpectedly.

>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:  
ordinal not in range(128)
>>>


One may suggest that the correct way to handle this is to decode  
explicitly, such as

   a.decode('latin-1') == b
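
Spelled out in a session (assuming the bytes really are latin-1, which  
is itself a guess the programmer has to make):

>>> a = '\xe5'
>>> b = u'b'
>>> a.decode('latin-1') == b
False
>>> a.decode('latin-1')
u'\xe5'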


This brings up another issue. Most references and books focus  
exclusively on entering unicode literals and using the encode/decode  
methods. The fallacy is that strings are such a basic data type, used  
throughout a program, that you really don't want to make an individual  
decision every time you use one (and take a penalty for any  
negligence). Java has a much more usable model: unicode is used  
internally, and encoding/decoding decisions are needed only twice, when  
dealing with input and output.
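
You can approximate that model in Python with the codecs module. A  
minimal sketch, assuming latin-1 input and utf-8 output (the filenames  
here are made up):

   import codecs

   # Decode exactly once, at the input boundary.
   infile = codecs.open('input.txt', 'r', 'latin-1')
   text = infile.read()     # text is a unicode object from here on
   infile.close()

   # All internal work is done on unicode objects.
   text = text.upper()

   # Encode exactly once, at the output boundary.
   outfile = codecs.open('output.txt', 'w', 'utf-8')
   outfile.write(text)
   outfile.close()

But nothing in the language nudges you toward this style; every str  
object in the program is a decoding decision waiting to happen.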

I am sure these errors are a nuisance to those who are only half  
conscious of unicode. Even for those who choose to use unicode, it is  
almost impossible to ensure their programs work correctly.


