can't get utf8 / unicode strings from embedded python

Sat Aug 24 14:31:45 EDT 2013

Le samedi 24 août 2013 18:47:19 UTC+2, David M. Cotter a écrit :
> > What _are_ you using? 
> 
> i have scripts in a file, that i am invoking into my embedded python within a C++ program.  there is no terminal involved.  the "print" statement has been redirected (via sys.stdout) to my custom print class, which does not specify "encoding", so i tried the suggestion above to set it:
> 
> 
> 
> static const char *s_RedirectScript = 
> 
> 	"import " kEmbeddedModuleName "\n"
> 
> 	"import sys\n"
> 
> 	"\n"
> 
> 	"class CustomPrintClass:\n"
> 
> 	"	def write(self, stuff):\n"
> 
> 	"		" kEmbeddedModuleName "." kCustomPrint "(stuff)\n"
> 
> 	"class CustomErrClass:\n"
> 
> 	"	def write(self, stuff):\n"
> 
> 	"		" kEmbeddedModuleName "." kCustomErr "(stuff)\n"
> 
> 	"sys.stdout = CustomPrintClass()\n"
> 
> 	"sys.stderr = CustomErrClass()\n"
> 
> 	"sys.stdout.encoding = 'UTF-8'\n"
> 
> 	"sys.stderr.encoding = 'UTF-8'\n";
> 
> 
> 
> 
> 
> but it didn't help.
> 
> 
> 
> I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string.  who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?

--------

Always encode a "unicode" into the coding of the "system"
which will host it.

Adapting the hosting system to your "unicode" (encoded
unicode) is not a valid solution. A non sense.

sys.std***.encodings do nothing. They only give you
information about the coding of the hosting system.

The "system" can be anything, a db, a terminal, a gui, ...

Shortly, your "writer" should encode your "stuff"
to your "host" in a adequate way. It is up to you to
manage coherence. If your passive "writer" support only one
coding, adapt "stuff", if "stuff" lives in its own coding
(due to c++ ?) adapt your "writer".

Example from my interactive interpreter. It is in Python 3,
not important, basically the job is the same in Python 2.
This interpreter has the capability to support many codings,
and the coding of this host system can be changed on the
fly.

A commented session.

By default, a string, type str, is a unicode. The
host accepts "unicode". So, by default the sys.stdout
coding is '<unicode>'. 

>>> sys.stdout.encoding = '<unicode>'
>>> print("frøânçïé")
frøânçïé
>>> 

Setting the host to utf-8 and printing the above string gives
"something", but encoding into utf-8 works fine.

>>> sys.stdout.encoding = 'utf-8'
>>> sys.stdout.encoding
'utf-8'
>>> print("frøânçïé")
frÃ¸Ã¢nÃ§Ã¯Ã©
>>> print("frøânçïé".encode('utf-8'))
'frøânçïé'

Setting the host to 'mac-roman' works fine too,
as long it is properly encoded!

>>> sys.stdout.encoding = 'mac-roman'
>>> print("frøânçïé".encode('mac-roman'))
'frøânçïé'

But

>>> print("frøânçïé".encode('utf-8'))
'fr√∏√¢n√ß√Ø√©'

Ditto for cp850

>>> sys.stdout.encoding = 'cp850'
>>> print("frøânçïé".encode('cp850'))
'frøânçïé'

If the repertoire of characters of a coding scheme does not
contain the characters -> replace

>>> sys.stdout.encoding = 'cp437'
>>> print("frøânçïé".encode('cp437'))
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "c:\python32\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 2: character maps to
<undefined>
>>> print("frøânçïé".encode('cp437', 'replace'))
'fr?ânçïé'

Curiousities

>>> sys.stdout.encoding = 'utf-16-be'
>>> print("frøânçïé")
 f r ø â n ç ï é 
>>> print("frøânçïé".encode('utf-16-be'))
'frøânçïé' 
>>> sys.stdout.encoding = 'utf-32-be'
>>> print("frøânçïé".encode('utf-32-be'))
'frøânçïé'

jmf