string processing question

Fri May 1 05:28:12 EDT 2009

Scott David Daniels wrote:
> To discover what is happening, try something like:
>     python -c 'for a in "ä", unicode("ä"): print len(a), a'
>
> I suspect that in your encoding, "ä" is two bytes long, and in
> unicode it is converted to a single character.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

Yes it is. That is one of the two problems I see.
The solution for this is to call unicode(<string>, <encoding>) on each string.

I'd like to have my Python programs Unicode-enabled.

:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

It seems that the default encoding is "ascii", so unicode() cannot cope
with "ä".
If I specify "utf8" for the encoding, unicode() works.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>                 
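A minimal sketch of the same observation, written so that it does not depend on the terminal's encoding. The helper names lengths and decodes_as are hypothetical, and the byte string is spelled out with escapes on the assumption that the shell passes "ä" as UTF-8:

```python
RAW = b"\xc3\xa4"  # the two UTF-8 bytes of a-umlaut (assumed terminal encoding)


def lengths(data, encoding="utf-8"):
    """Return (number of bytes, number of characters) for a byte string."""
    return len(data), len(data.decode(encoding))


def decodes_as(data, encoding):
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

Here lengths(RAW) mirrors the "2" and "1" printed above, and decodes_as(RAW, "ascii") mirrors the UnicodeDecodeError raised when no encoding is given.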

But the print statement yields a UnicodeEncodeError
if I pipe the output to a program or a file.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
2 ä
1 :>

So it seems to me that piping the output changes the behavior of the
print statement:

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)'  | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
ä 2 <type 'str'>
:>
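One likely explanation: in Python 2, sys.stdout.encoding is taken from the terminal when stdout is a tty and is None when stdout is a pipe or file, so print falls back to the default "ascii" codec. A possible workaround, sketched here under that assumption, is to encode explicitly before writing. The function name write_utf8 is hypothetical:

```python
import io
import sys


def write_utf8(text, stream=None):
    """Encode a unicode string as UTF-8 and write the bytes to a byte stream."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write((text + u"\n").encode("utf-8"))


# Demonstration against an in-memory byte stream instead of a real pipe.
buf = io.BytesIO()
write_utf8(u"\xe4", buf)
```

Because the encoding step no longer depends on sys.stdout.encoding, the output should be the same bytes whether it goes to a terminal, to "| cat", or to a file.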

How can I make my Python programs Unicode-enabled, such that:
- Input strings can have different encodings (mostly ascii, latin_1 or utf8)
- My python programs should always output "utf8".

Is that a good idea??
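The two requirements above could be sketched roughly as follows. This is only one possible approach, not an established recipe: the names to_text and to_utf8 are hypothetical, and the order of guesses (strict UTF-8 first, then latin_1) is an assumption that works because ASCII is a subset of UTF-8 and latin_1 accepts any byte sequence:

```python
def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string by trying each encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if no listed encoding fits; replace undecodable bytes.
    return raw.decode(encodings[0], "replace")


def to_utf8(raw):
    """Normalize input bytes of unknown encoding to UTF-8 bytes."""
    return to_text(raw).encode("utf-8")
```

The guess is not reliable in every case: latin_1 input that happens to form valid UTF-8 sequences would be decoded as UTF-8. Knowing the real input encoding is safer whenever it is available.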

TIA
-- 
Kurt Müller, mu at problemlos.ch