q: how to output a unicode string?

Frank Stajano usenet423.4.fms at neverbox.com
Wed Apr 25 06:41:13 EDT 2007


Diez B. Roggisch wrote:
> Frank Stajano wrote:
> 
>> A simple unicode question. How do I print?
>>
>> Sample code:
>>
>> # -*- coding: utf-8 -*-
>> s1 = u"héllô wórld"
>> print s1
>> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
>> # u'\xe9' in position 1: ordinal not in range(128)
>>
>>
>> What I actually want to do is slightly more elaborate: read from a text
>> file which is in utf-8, do some manipulations of the text and print the
>> result on stdout. I understand I must open the file with
>>
>> f = codecs.open("input.txt", "r", "utf-8")
>>
>> but then I get stuck as above.
>>
>> I tried
>>
>> s2 = s1.encode("utf-8")
>> print s2
>>
>> but got
>>
>> hÃ©llÃ´ wÃ³rld
> 
> Which is perfectly alright - it's just that your terminal isn't prepared to
> decode UTF-8, but some other encoding, like latin1.

Aha! Thanks for spotting this. You are right about the terminal 
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with a

  cat t2.py

(t2.py being the program above) which displays the source code garbled 
in the same way.
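
For the record, the garbling can be reproduced without any terminal at all. This is only a sketch, assuming the terminal interprets incoming bytes as latin1 (the example encoding Diez mentions; the actual setting of rxvt here was not checked). The names utf8_bytes and as_latin1 are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: what a latin1 terminal would make of UTF-8 output.
s1 = u"h\xe9ll\xf4 w\xf3rld"          # same text as the sample, written with escapes

utf8_bytes = s1.encode("utf-8")        # the bytes the program sends out
as_latin1 = utf8_bytes.decode("latin-1")  # assumed terminal view: one character per byte

# Each accented letter was sent as two bytes, so it shows up as two characters.
print("%d characters intended, %d displayed" % (len(s1), len(as_latin1)))

# Nothing is lost: reversing the wrong step recovers the original text.
recovered = as_latin1.encode("latin-1").decode("utf-8")
print(recovered == s1)
```

So the bytes written by the program are fine; only their interpretation on the receiving side differs.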

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why did I get UnicodeEncodeError: 'ascii' codec can't encode in the 
first case? It seems as if, within Idle, a utf-8 codec is being 
selected automagically. Why should that happen there and not in the 
first case?
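
One way to investigate this might be to look at what encoding the output stream advertises in each environment. The sketch below assumes that print picks its codec from sys.stdout.encoding and falls back to ascii when none is set; that is an assumption here, not something confirmed in this thread. The helper encode_for_output is hypothetical.

```python
# -*- coding: utf-8 -*-
import sys

s1 = u"h\xe9ll\xf4 w\xf3rld"

# The encoding the output stream reports; it may be None or missing
# altogether, for instance on a replacement stream or when output is piped.
stream_encoding = getattr(sys.stdout, "encoding", None)
print("sys.stdout.encoding = %r" % (stream_encoding,))

def encode_for_output(text, encoding):
    """Hypothetical helper: encode text explicitly, replacing what does not fit."""
    return text.encode(encoding or "ascii", "replace")

# With ascii the accented letters cannot be represented and become '?'.
print(repr(encode_for_output(s1, "ascii")))
# With utf-8 every character can be encoded.
print(repr(encode_for_output(s1, "utf-8")))
```

Running the first two statements both in the terminal and at the Idle prompt should show whether the two environments report different encodings.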

>> Then, in the hope of being able to write the string to a file if not to
>> stdout, I also tried
>>
>>
>> import codecs
>> f = codecs.open("out.txt", "w", "utf-8")
>> f.write(s2)
>>
>> but got
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> ordinal not in range(128)
> 
> Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!
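
For anyone finding this thread later, a minimal sketch of the round trip as I understand it: write the unicode object, read it back, and manipulate it as text. The temporary path and the upper() call are stand-ins of my own, not part of the original program.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

s1 = u"h\xe9ll\xf4 w\xf3rld"

# A scratch file stands in for out.txt so that nothing gets overwritten.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

f = codecs.open(path, "w", "utf-8")
f.write(s1)              # pass the unicode object; the file object encodes it
f.close()

f = codecs.open(path, "r", "utf-8")
text = f.read()          # decoded back into a unicode object
f.close()

shouted = text.upper()   # example manipulation, done on text rather than bytes
print(text == s1)
```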

> The error you get stems from f.write wanting a unicode-object, but s2 is a
> bytestring (you explicitly converted it before), so python tries to decode
> the bytestring with the default encoding - ascii - to a unicode string.
> This of course fails.
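
The failure can be reproduced by performing the hidden conversion by hand. This is a sketch under the assumption that the implicit step amounts to decoding the bytestring as ascii; the variable message exists only for the illustration.

```python
# -*- coding: utf-8 -*-
s1 = u"h\xe9ll\xf4 w\xf3rld"
s2 = s1.encode("utf-8")        # a bytestring, as in the original attempt

try:
    s2.decode("ascii")         # the conversion assumed to happen implicitly
    message = None
except UnicodeDecodeError as err:
    message = str(err)

# Byte 0xc3 is the first half of the UTF-8 encoding of the accented e
# and lies outside the ascii range, hence the complaint about position 1.
print(message)

# Decoding with the encoding that was actually used works.
print(s2.decode("utf-8") == s1)
```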

I think I have a better understanding of it now. If the terminal hadn't 
fooled me, I probably wouldn't have assumed that the code I originally 
wrote (following the first examples I found) was wrong! I assume that 
when you say "bytestring" you mean "a string of bytes in a certain 
encoding (here utf-8) that can be used as an external representation for 
the unicode string which is instead a sequence of code points".
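
A short sketch of that distinction, assuming I have it right: one sequence of code points, several possible byte representations. The names code_points and sizes are just for this example.

```python
# -*- coding: utf-8 -*-
s1 = u"h\xe9ll\xf4 w\xf3rld"

# The unicode string: a sequence of code points, independent of any encoding.
code_points = [ord(ch) for ch in s1]
print("%d code points" % len(s1))

# Bytestrings: external representations that depend on the chosen encoding.
sizes = {}
for name in ("utf-8", "latin-1", "utf-16-le"):
    data = s1.encode(name)
    assert data.decode(name) == s1   # each one decodes back to the same text
    sizes[name] = len(data)
    print("%-9s %2d bytes" % (name, len(data)))
```

The byte counts differ per encoding while the number of code points stays the same.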

Thanks again


