[Tutor] Unicode Problem

Fri Sep 10 20:40:42 CEST 2004

Hi Rainer,
      A disclaimer first: this is what I have learnt, and I could be
wrong. Please do correct me if I am.

>     >>> print u'ä', 'ä'
>     ä ä
> So print can handle Unicode and non-Unicode.

Although that is correct, you should remember that Python will only
treat as unicode those strings that you prefix with a 'u'.

>>> print type(u'ä'), type('ä')
<type 'unicode'> <type 'str'>
>>>

So, you see here, Python treats the second 'ä' as a normal "str", that
is, a string of bytes in whatever encoding your terminal uses.
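
To make the difference between the two types concrete, here is a
minimal sketch. Note the assumption: it is written in Python 3 syntax
so that it runs on a current interpreter, where "bytes" plays the role
of the Python 2 "str" and "str" plays the role of the Python 2
"unicode". The variable names are only for illustration.

```python
# Python 3 sketch (assumption): bytes ~ Python 2 str, str ~ Python 2 unicode.
text = '\u00e4'          # one Unicode character, a with diaeresis
raw_latin1 = b'\xe4'     # the same character as one ISO-8859-1 byte
raw_utf8 = b'\xc3\xa4'   # the same character as two UTF-8 bytes

# The two kinds of string are distinct types.
print(type(text).__name__, type(raw_latin1).__name__)

# A text string counts characters, a byte string counts bytes.
print(len(text), len(raw_latin1), len(raw_utf8))
```

The point is that a byte string carries no record of which encoding
produced it, so something has to name the encoding before it can be
turned into text.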

>     >>> print u'ä' + 'ä'
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in
>     position 0: ordinal not in range(128)

What is happening here is that Python is trying to 'promote' the
second 'ä' to unicode before doing the concatenation (since the first
'ä' is a unicode string, the resulting concatenated string would also
be unicode).
      To do this, Python (the + operator, not the print statement)
tries to decode what it thinks is a normal ascii str using the default
encoding, which, well, is normally set to ascii. So:

>>> print u'ä' + 'ä'
is the equivalent of:
>>> print u'ä' + 'ä'.decode('ascii')
(What I don't get is why this is called decode.)
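
On the naming, as far as I understand it: a byte string is text that
has already been encoded, so turning it back into unicode is called
decoding, and going from unicode to bytes is called encoding. A small
sketch of the round trip follows. It assumes Python 3 syntax (bytes
stands in for the Python 2 str), and the variable names are made up
for the example.

```python
# Python 3 sketch (assumption): bytes ~ Python 2 str, str ~ Python 2 unicode.
text = '\u00e4'                      # Unicode text, a with diaeresis

# encode: text -> bytes, and the result depends on the codec chosen.
latin1 = text.encode('iso-8859-1')   # b'\xe4'
utf8 = text.encode('utf-8')          # b'\xc3\xa4'

# decode: bytes -> text, using the same codec that produced the bytes.
round_trip_ok = (latin1.decode('iso-8859-1') == text
                 and utf8.decode('utf-8') == text)
print(latin1, utf8, round_trip_ok)
```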

when you probably wanted the behaviour below (assuming your terminal
sends iso-8859-1, which the 0xe4 byte in the traceback suggests):
>>> print u'ä' + 'ä'.decode('iso-8859-1')
ää
>>>
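
The same contrast can be sketched so that it runs on a current
interpreter. This assumes Python 3 syntax, where bytes stands in for
the Python 2 str, and it assumes the byte in question is 0xe4 as shown
in the traceback. Python 3 no longer decodes implicitly when you mix
the two types, so the ascii attempt is written out by hand here.

```python
# Python 3 sketch (assumption): bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b'\xe4'   # the byte reported in the traceback

# What the implicit promotion attempted: decoding with the ascii codec.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False   # 0xe4 is outside the 0-127 ASCII range

# Naming the codec explicitly gives the intended result.
combined = '\u00e4' + raw.decode('iso-8859-1')

print(ascii_ok)
print(combined == '\u00e4\u00e4')
```

If the bytes had come from a UTF-8 terminal instead, the codec to name
would be 'utf-8'; the codec has to match whatever produced the bytes.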

HTH
Regards
Steve