can't get utf8 / unicode strings from embedded python

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Aug 23 21:54:01 EDT 2013


On Fri, 23 Aug 2013 13:49:23 -0700, David M. Cotter wrote:

> note everything works great if i use Ascii, but:
> 
> in my utf8-encoded script i have this:
> 
>>	print "frøânçïé"


I see you are using Python 2, in which case there are probably two or 
three errors being made here.

Firstly, in Python 2, the compiler assumes that the source code is 
encoded in ASCII, or more precisely ASCII plus arbitrary bytes. Since your 
source code is *actually* UTF-8, the bytes in the file are:

70 72 69 6E 74 20 22 66 72 C3 B8 C3 A2 6E C3 A7 C3 AF C3 A9 22

But Python doesn't know the file is encoded in UTF-8; it thinks it is 
reading ASCII plus junk, so when it reads the file it parses those bytes 
into a line of code:

print "~~~~~"

where the ~~~~~ represents the 13 bytes between the quotes, ten of which 
are junk as far as Python is concerned. So that's the first problem to 
fix. You can fix this by adding an encoding cookie at the beginning of 
your module, in the first or second line:

# -*- coding: utf-8 -*-
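
To see what the cookie changes, here is a rough sketch that compiles the same source bytes twice, once behind a UTF-8 cookie and once behind a Latin-1 cookie. The helper name run_with_cookie and the "<sketch>" filename are made up for illustration; it feeds bytes to compile() directly rather than reading a real file, and it is written to run on Python 2.6+ or Python 3.3+.

```python
# Source bytes for one assignment; the literal holds the UTF-8 bytes
# of the string from the original script.
body = b's = u"fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9"\n'


def run_with_cookie(codec_name):
    """Hypothetical helper: compile `body` behind an encoding cookie
    and return the value bound to s."""
    cookie = ("# -*- coding: %s -*-\n" % codec_name).encode("ascii")
    namespace = {}
    exec(compile(cookie + body, "<sketch>", "exec"), namespace)
    return namespace["s"]


as_utf8 = run_with_cookie("utf-8")      # bytes read as UTF-8
as_latin1 = run_with_cookie("latin-1")  # same bytes, one char per byte

print(len(as_utf8))    # 8
print(len(as_latin1))  # 13
```

Same bytes, different cookie, different string: only the UTF-8 cookie gives back the eight characters that were typed.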


The second problem is that even once you've fixed the source encoding, 
you're still not dealing with a proper Unicode string. In Python 2, you 
need to use u"..." delimiters for Unicode; otherwise you have a byte 
string, and what gets displayed depends entirely on the encoding of your 
terminal. For example, if I set my terminal encoding to IBM-850, I get:

fr°Ônþ´Ú

from those bytes. If I set it to South European ISO-8859-3 (Latin-3) I 
get this:

frĝânçïé

Clearly not what I intended. So change the line of code to:

print u"frøânçïé"
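
The difference between the byte string and the Unicode string can be sketched like this. The variable names are mine, and the non-ASCII characters are written as escapes so the snippet does not depend on any source encoding; it should run on Python 2.6+ and Python 3.3+.

```python
text = u"fr\xf8\xe2n\xe7\xef\xe9"   # the Unicode string: 8 characters
raw = text.encode("utf-8")           # what a UTF-8 encoded file stores

# bytearray yields integers on both Python 2 and Python 3.
hex_dump = " ".join("%02X" % b for b in bytearray(raw))
non_ascii = sum(1 for b in bytearray(raw) if b > 0x7F)

print(hex_dump)    # 66 72 C3 B8 C3 A2 6E C3 A7 C3 AF C3 A9
print(len(text))   # 8
print(len(raw))    # 13
print(non_ascii)   # 10
```

Without the u prefix, Python 2 hands the 13 bytes to print untouched; with it, Python knows it is holding eight characters and can encode them for the output stream.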

Those two changes ought to fix the problem, but if they don't, try 
setting your terminal encoding to UTF-8 as well and see if that helps.


[...]
> but what it *actually* prints is this:
> 
>>	print "frøânçïé"
> --> fr√∏√¢n√ß√Ø√©

It's hard to say what *exactly* is happening here, because you don't 
explain how the output of the Python print statement gets into your C++ 
Log code. Am I right in guessing that it captures stdout?

If so, then what I expect is happening is that Python has read in the 
source code of

print "~~~~~"

with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying 
those junk bytes according to whatever encoding it happens to be using. 
Since you are seeing this:

fr√∏√¢n√ß√Ø√©

my guess is that you're using a Mac, and the encoding is set to the 
MacRoman encoding. Am I close?
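
A guess like that can be checked by decoding the UTF-8 bytes with a few candidate codecs and comparing the results with what the terminal shows. This is only a sketch: the list of codecs is my own pick, and the output is escaped so that it prints safely whatever encoding your own terminal uses.

```python
# UTF-8 bytes of the string from the original script.
raw = b"fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9"

views = {}
for codec in ("utf-8", "mac_roman", "cp850", "latin-1"):
    views[codec] = raw.decode(codec)

for codec in sorted(views):
    shown = views[codec]
    # Escape non-ASCII characters so the line is printable anywhere.
    escaped = shown.encode("ascii", "backslashreplace").decode("ascii")
    print("%-10s %2d chars  %s" % (codec, len(shown), escaped))
```

Only the UTF-8 decoding gives eight characters; every single-byte codec turns each of the 13 bytes into a character of its own, which is where the extra symbols come from.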

To summarise:

* Add an encoding cookie, to tell Python to use UTF-8 when parsing your 
source file.

* Use a Unicode string u"frøânçïé".

* Consider setting your terminal to use UTF-8, otherwise it may not be 
able to print all the characters you would like.

* You may need to change the way data gets into your C++ Log function. If 
it expects bytes, you may need to use u"...".encode('utf-8') rather than 
just u"...". But since I don't understand how data is getting into your 
Log function, I can't be sure about this.
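
As a sketch of that last point, assuming the Log function wants bytes: log_bytes below is a hypothetical stand-in for your C++ function, and io.BytesIO stands in for wherever the bytes end up. You can either encode at each call, or wrap the byte stream once with a UTF-8 writer.

```python
import codecs
import io


def log_bytes(sink, payload):
    """Hypothetical stand-in for a C++ Log function that takes bytes."""
    if not isinstance(payload, bytes):
        raise TypeError("log_bytes expects an encoded byte string")
    sink.write(payload + b"\n")


text = u"fr\xf8\xe2n\xe7\xef\xe9"
sink = io.BytesIO()

# Option 1: encode explicitly at the call site.
log_bytes(sink, text.encode("utf-8"))

# Option 2: wrap the byte stream once; later writes are encoded for you.
writer = codecs.getwriter("utf-8")(sink)
writer.write(text + u"\n")

captured = sink.getvalue()
print(len(captured))  # 28: two copies of 13 bytes, plus two newlines
```

Either way, the decision about which encoding to use is made in one visible place instead of being left to whatever the output stream defaults to.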


I think that is everything. Does that fix your problem?


-- 
Steven


