Unicode problem with exec

Fri Jun 23 18:59:35 EDT 2006

On 23/06/2006 9:06 PM, Thomas Heller wrote:
> I'm using code.InteractiveConsole but it doesn't work correctly
> with non-ascii characters.  I think it boils down to this problem:
> 
> Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u"ä"

This is utterly useless for diagnostic purposes. What you see is NOT 
what you've got. Use repr().

What you've got, as the error message says, is u'\x84', which is not
u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"; it is a control character.

See below.
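
To make the "use repr()" advice concrete, here is a minimal sketch. It
assumes a current Python 3 interpreter rather than the 2.4 session quoted
here; in Python 3, ascii() does the escaping that repr() does in 2.x. The
variable names are mine, purely for illustration.

```python
# Python 3 sketch (assumption: not the Python 2.4 used in this thread).
# ascii() escapes every non-ASCII character, so what you see IS what
# you've got, whatever the console font or code page does.
wanted = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"  # U+00E4
got = "\x84"                                        # U+0084

wanted_repr = ascii(wanted)
got_repr = ascii(got)

print(wanted_repr)    # '\xe4'
print(got_repr)       # '\x84'
print(wanted == got)  # False
```

Two values that might look alike (or like nothing at all) on a console
are plainly different once escaped.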

> ä
> >>> exec 'print u"ä"'
> Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "<string>", line 1, in ?
>  File "c:\python24\lib\encodings\cp850.py", line 18, in encode
>    return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x84' in 
> position 0: character maps to <undefined>
> >>> ^Z
> 
> Why does the exec call fail, and is there a workaround?
> 

Executive summary:

The exec statement didn't fail; it was the print statement that failed,
trying to print to your CP850 console a Unicode character that doesn't
exist in CP850.

This happened because you copied a character whose repr() is '\x84' from 
your MS-DOS console and pasted it into 'u"<insert any old rubbish 
here>"' :-)
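
If you want to check the two characters yourself, something like the
following should do. It is only a sketch, written for Python 3 (an
assumption on my part; the session in this thread is 2.4), and the names
"intended" and "pasted" are hypothetical labels, not anything from your
code.

```python
# Python 3 sketch: compare the character that was meant with the one
# that was actually inside the u"..." literal.
import unicodedata

intended = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
pasted = "\u0084"    # what the traceback reports

intended_name = unicodedata.name(intended)
pasted_category = unicodedata.category(pasted)  # "Cc" means control char

# The intended character exists in CP850 ...
intended_bytes = intended.encode("cp850")

# ... the control character does not, so encoding it for the console fails.
try:
    pasted.encode("cp850")
    pasted_encodable = True
except UnicodeEncodeError:
    pasted_encodable = False

print(intended_name, intended_bytes)
print(pasted_category, pasted_encodable)
```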

Details:

Windows XP, in a console screen:

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.
|>> uc = u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"
|>> uc
u'\xe4' <<== agrees with Unicode book
|>> encoded = uc.encode('cp850')
|>> encoded
'\x84' <<== agrees with 
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT
|>> print uc
ä <<== looks like LATIN SMALL LETTER A WITH DIAERESIS, as expected
|>> print encoded
ä <<== looks like LATIN SMALL LETTER A WITH DIAERESIS, as expected
|>> print u"\x84"
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "c:\python24\lib\encodings\cp850.py", line 18, in encode
     return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x84' in 
position 0: character maps to <undefined>
<<== as expected

Looks like Python is working fine to me ...
So, what's happening? Look at this:

|>> char1 = u"ä" <<= corresponds to your "print"
|>> char2 = "ä" <<= corresponds to your exec -- which was given a STRING 
constant, like this, not a Unicode constant.

The character in char1 was copied from the DOS console.
The second line was obtained by editing a copy of the first line in the
DOS console.

|>> char1
u'\xe4'
|>> char2
'\x84' <<= Aha!

What you have done is effectively: exec 'print u"\x84"'
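
The mix-up can be reproduced with plain bytes. This is a sketch for
Python 3 (an assumption; 2.4 has no bytes type), and it assumes that the
byte inside the STRING constant was taken one-for-one as a code point,
which is what Latin-1 decoding does and what the traceback suggests.

```python
# Python 3 sketch: one byte, two interpretations.
console_bytes = "\u00e4".encode("cp850")  # what the CP850 console supplies

# Interpreted with the console's own code page: the intended character.
as_cp850 = console_bytes.decode("cp850")

# Interpreted byte-for-code-point (Latin-1): a C1 control character.
as_latin1 = console_bytes.decode("latin-1")

print(console_bytes)     # b'\x84'
print(ascii(as_cp850))   # '\xe4'
print(ascii(as_latin1))  # '\x84'
```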

Workaround/kludge/bypass:

exec u'print u"ä"'
.....^

Much better: embed non-ASCII characters in source code *ONLY* when you 
have a proper coding header: http://www.python.org/dev/peps/pep-0263/
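
A rough illustration of what the coding header buys you, as a Python 3
sketch (again an assumption; this is not the 2.4 setup in the thread).
The helper run() and the "<demo>" filename are made up for the example.
The same source bytes are compiled twice, once with a header that
matches how they were written and once with one that does not.

```python
# Python 3 sketch: compile() on bytes honours the PEP 263 coding header.
body = 's = "\u00e4"\n'.encode("cp850")  # source bytes as a CP850 editor saves them

right_header = b"# -*- coding: cp850 -*-\n" + body
wrong_header = b"# -*- coding: latin-1 -*-\n" + body


def run(source_bytes):
    """Compile and execute source bytes; return the value bound to s."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]


right = run(right_header)
wrong = run(wrong_header)

print(ascii(right))  # '\xe4'
print(ascii(wrong))  # '\x84'
```

With the matching header you get the character you typed; with the
wrong one you get the same control character as in the traceback.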

HTH,
John