[Tutor] symbol encoding and processing problem

Wed Oct 17 18:53:26 CEST 2007

Evert Rol wrote:
>>>  >>> print unicode("125° 15' 5.55''", 'utf-8')
>>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in  
>>> position 3: ordinal not in range(128)
>>
>> This is the same as the first encode error.
> 
> This is the thing I don't get; or only partly: I'm sending a utf-8 
> encoded string to print.

No, you are sending a unicode string to print.
   unicode("125° 15' 5.55''", 'utf-8')
means the same as
   "125° 15' 5.55''".decode('utf-8')
which is, "create a unicode string from this utf-8-encoded byte string". 
Once you have decoded to Unicode it is no longer utf-8.
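
A small sketch of that difference, in case it helps. The variable names are just for illustration, and the bytes are written with escapes so the encoding of the source file doesn't matter (this should run on Python 2.6+ and on Python 3, where the two types are called bytes and str):

```python
# UTF-8 bytes for the string "125(degree sign) 15' 5.55''".
raw = b"125\xc2\xb0 15' 5.55''"

# Decoding turns the byte string into a unicode string.
text = raw.decode('utf-8')

# The degree sign takes two bytes in UTF-8 but is one character in unicode.
print(len(raw))     # 16
print(len(text))    # 15

# Position 3 holds the character named in the error message (U+00B0).
print(repr(text[3]))

# Encoding back to UTF-8 gives the original bytes again.
print(text.encode('utf-8') == raw)
```

The unicode string has no encoding of its own; utf-8 only comes back into the picture when you encode it again.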

> print apparently ignores that, and still tries 
> to print things using ascii encoding. If I'm correct in that assessment, 
> then why would print ignore that?

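print just knows that you want to print a unicode string. stdout is 
byte-oriented so the unicode chars have to be converted to a byte 
stream. This is done by encoding with sys.stdout.encoding when it is set, 
or otherwise with sys.getdefaultencoding(). In your case the encoding 
used is ascii, so in effect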
   print unicode("125° 15' 5.55''", 'utf-8')
is the same as
   print u"125° 15' 5.55''"
which is the same as
   print u"125° 15' 5.55''".encode(sys.getdefaultencoding())
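
To see that step in isolation, here is a rough sketch that does the ascii encoding by hand instead of letting print do it, and then encodes explicitly to utf-8. It assumes the implicit encoding is ascii, as in your traceback; the names failed and bad_position are only for the example (Python 2.6+ or Python 3):

```python
# The same text as above, decoded from its UTF-8 bytes.
text = b"125\xc2\xb0 15' 5.55''".decode('utf-8')

# Encoding to ascii by hand fails on the degree sign.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as err:
    failed = True
    bad_position = err.start    # index of the offending character
    print(err)

# Choosing the encoding yourself avoids the error.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
```

The exception reports the same position 3 as your original error, which is the degree sign.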

> Ie, use encode('utf-8') where necessary?

Yes.

> But I did see some examples pass by using
> 
>   import sys
>   sys.setdefaultencoding('utf-8')

Yes, that will make the examples pass; it just isn't the recommended 
solution.

> Oh well, in general I tend to play long enough with things like this 
> that 1) I get it (script) working, and 2) I have a decent feeling (90%) 
> that I actually understand what is going on, and why other things 
> failed. Which is roughly where I am now ;-).

The key thing is to realize that there are implicit conversions between 
str and unicode and they will break if the data is not ascii. The best 
fix is to make the conversions explicit by providing the correct encoding.
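
One way to keep the conversions explicit is to decode once on input and let a codecs writer encode on output. This is only a sketch: the BytesIO object stands in for whatever byte-oriented stream you are really writing to, and utf-8 is an assumption about what your terminal or file expects (Python 2.6+ or Python 3):

```python
import codecs
import io

# Bytes as they might arrive from a file or a socket.
raw_in = b"125\xc2\xb0 15' 5.55''"

# Explicit conversion on input: bytes -> unicode.
text = raw_in.decode('utf-8')

# A byte-oriented stream, wrapped so that every write is encoded as UTF-8.
sink = io.BytesIO()
writer = codecs.getwriter('utf-8')(sink)
writer.write(text)

# What actually reached the byte stream.
raw_out = sink.getvalue()
print(raw_out == raw_in)
```

Inside the program everything is unicode, and the encoding is named in exactly two places, so nothing is left to the ascii default.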

Kent