can't get utf8 / unicode strings from embedded python

David M. Cotter me at davecotter.com
Sat Aug 24 02:45:29 EDT 2013


> I see you are using Python 2
correct

> Firstly, in Python 2, the compiler assumes that the source code is encoded in ASCII
gar, i must have been looking at doc for v3, as i thought it was all assumed to be utf8

> # -*- coding: utf-8 -*- 
okay, did that, still no change

> you need to use u" ... " delimiters for Unicode, otherwise the results you get are completely arbitrary and depend on the encoding of your terminal. 
okay, well, i'm on a mac, and not using "terminal" at all.  but if i were, it would be utf8
but it's still not flying :(

> For example, if I set my terminal encoding to IBM-850
okay how do you even do that?  this is not an interactive session, this is embedded python, within a C++ app, so there's no terminal.  

but that is a good question: all the docs say "default encoding" everywhere (as in "If string is a Unicode object, this function computes the default encoding of string and operates on that"), but fail to specify just HOW i can set the default encoding.  if i could just say "hey, default encoding is utf8", i think i'd be done?
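
a minimal sketch for at least seeing which encodings are in play, assuming nothing beyond the standard `sys` and `locale` modules. `report_encodings` is a hypothetical helper name, not something from the docs. on a stock python 2 the "default" entry is normally `ascii`; this only inspects the values, it does not change them.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python may use when turning text into bytes."""
    return {
        # used for implicit str <-> unicode conversions
        "default": sys.getdefaultencoding(),
        # a replaced sys.stdout may have no .encoding attribute at all
        "stdout": getattr(sys.stdout, "encoding", None),
        # what the operating system locale suggests
        "locale": locale.getpreferredencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%-8s %s" % (name, value))
```

run from inside the embedded interpreter, the output would show whether any of the three mentions a mac encoding rather than utf8.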

> So change the line of code to: 
> print u"frøânçïé" 
okay, sure... 
but i get the exact same results

> Those two changes ought to fix the problem, but if they don't, try setting your terminal encoding to UTF-8 as well
well, i'm not sure what you mean by that.  i don't have a terminal here.
i'm logging to a utf8 log file (when i print)


> but what it *actually* prints is this: 
> 
>        print "frøânçïé" 
> --> fr√∏√¢n√ß√Ø√© 

>It's hard to say what *exactly* is happening here, because you don't explain how the python print statement somehow gets into your C++ Log code. Do I guess right that it catches stdout?
yes, i'm redirecting stdout to my own custom print class, and then from that function i call into my embedded C++ print function
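
a rough sketch of that kind of redirect, assuming a made-up `Utf8LogWriter` class and a plain python list standing in for the C++ print function (both are hypothetical, the real class is not shown here). the idea is to encode unicode text to utf8 explicitly at the boundary, so no implicit conversion gets a chance to pick another encoding.

```python
import sys


class Utf8LogWriter(object):
    """File-like stand-in for sys.stdout that hands UTF-8 bytes to a sink."""

    def __init__(self, sink):
        # sink is a hypothetical callable standing in for the C++ log function
        self.sink = sink

    def write(self, text):
        # byte strings pass through untouched; unicode text is encoded explicitly
        if not isinstance(text, bytes):
            text = text.encode("utf-8")
        self.sink(text)

    def flush(self):
        # nothing is buffered here, but callers may still expect flush() to exist
        pass


received = []
original_stdout = sys.stdout
sys.stdout = Utf8LogWriter(received.append)
try:
    # escapes keep this file pure ASCII; the text is "fr", o-slash, a-circumflex, ...
    print(u"fr\xf8\xe2n\xe7\xef\xe9")
finally:
    sys.stdout = original_stdout
```

after this runs, `received` holds the utf8 bytes of the string followed by the newline that print adds, which is what a utf8 log file would want.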

>If so, then what I expect is happening is that Python has read in the source code of 

>print "~~~~~" 

>with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying those junk bytes according to whatever encoding it happens to be using. 
>Since you are seeing this: 

>fr√∏√¢n√ß√Ø√© 

>my guess is that you're using a Mac, and the encoding is set to the MacRoman encoding. Am I close?
you hit the nail on the head there, i think.  using that as a hint, i took this text "fr√∏√¢n√ß√Ø√©" and pasted that into a "macRoman" document, then *reinterpreted* it as UTF8, and voilà: "frøânçïé"
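
the same reinterpretation can be sketched in python itself, assuming the standard `mac_roman` codec matches what the text editor calls "macRoman". the variable names are just for illustration, and the string is written with escapes so the snippet does not depend on the source file encoding.

```python
# "fr", o-slash, a-circumflex, "n", c-cedilla, i-diaeresis, e-acute
original = u"fr\xf8\xe2n\xe7\xef\xe9"

# the bytes that a UTF-8 log file should contain
utf8_bytes = original.encode("utf-8")

# what appears when those bytes are wrongly read as MacRoman
mojibake = utf8_bytes.decode("mac_roman")

# undoing the mistake: back to bytes via MacRoman, then read as UTF-8
recovered = mojibake.encode("mac_roman").decode("utf-8")
```

here `mojibake` is the garbled text from the log and `recovered` compares equal to `original`, so nothing is lost along the way; the bytes are only being labelled with the wrong encoding.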

so, it seems that i AM getting my utf8 bytes, but i'm getting them converted to macRoman.  huh?  where is macRoman specified, and how do i change that to utf8?  i think that's the missing golden ticket


