Python 3.1.1 bytes decode with replace bug

Dave Angel davea at ieee.org
Sun Oct 25 07:06:56 EDT 2009


Joe wrote:
>> For the reason BK explained, the important difference is that I ran in
>> the IDLE shell, which handles screen printing of unicode better ;-)
>>     
>
> Something still does not seem right here to me.
>
> In the example above the bytes were decoded to 'UTF-8' with the
>   
*nope*  you're decoding FROM utf-8 to unicode.
> replace option so any characters that were not UTF-8 were replaced and
> the resulting string is '\ufffdabc' as BK explained.  I understand
> that the replace worked.
>
> Now consider this:
>
> Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = '\ufffdabc'
> >>> print(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'
>
> This too fails for the exact same reason (and doesn't involve decode).
>
> In the original example I decoded to UTF-8 and in this example the
> default encoding is UTF-8 so why is cp437 being used?
>
> Thanks in advance for your assistance!
>
>
>   
Benjamin had it right, but you still don't understand what he said.

The problem in your original example, and in the current one, is not in 
decode(), but in encode(), which is implicitly called by print(), when 
needed to convert from Unicode to some byte format of the console.  Take 
your original example:

>>> b'\x80abc'.decode('utf-8', 'replace')


The decode() is explicit, and converts *FROM* a utf-8 byte string to a
unicode one.  But since you're running in the interactive interpreter,
there's an implicit print of the result, which converts the unicode
string into whatever your console encoding is.  That calls encode() (or
one of its variants, here charmap_encode()) on the unicode string.
There is no relationship between the two steps.
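To make the two steps concrete, here is a minimal sketch that performs
them separately.  It assumes the console encoding is cp437, as in your
traceback, and calls encode() by hand in place of the implicit one; the
variable names are just for illustration.

```python
# Step 1: decode bytes -> str.  0x80 cannot start a UTF-8 sequence, so
# the 'replace' handler substitutes U+FFFD (the replacement character).
text = b'\x80abc'.decode('utf-8', 'replace')

# Step 2: encode str -> bytes, which is what printing to a cp437 console
# has to do.  cp437 has no mapping for U+FFFD, so strict encoding fails.
try:
    text.encode('cp437')
    encode_failed = False
    failed_char = None
except UnicodeEncodeError as exc:
    encode_failed = True
    failed_char = exc.object[exc.start:exc.end]

print(ascii(text), encode_failed, ascii(failed_char))
```

The decode succeeds and the encode is what raises, which is why the
traceback points into cp437.py and never mentions utf-8.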

In your current example, you're explicitly doing the print(), but still
have the same implicit encoding to cp437, which gets the equivalent
error.  That's the encoding that your Python 3.x is choosing for the
stdout console, based on country-specific Windows settings.  It is
reported by sys.stdout.encoding, and is unrelated to what
sys.getdefaultencoding() returns.  On US-English Windows, the console
code page is normally cp437, as your traceback shows.  I don't know how
to override it generically, but I know it's possible to replace stdout
with a wrapper that does your preferred encoding.  You probably want to
keep cp437, but change the error handling to ignore (or replace).  Or if
this is a one-time problem, I suspect you could do the encoding
manually, to a bytes object, then print that.
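
A rough sketch of both workarounds follows.  It is untested on your
machine and makes some assumptions: cp437 is hard-coded to match your
traceback, and an io.BytesIO object (named fake_console here) stands in
for the real console byte stream so the result can be inspected.

```python
import io

text = '\ufffdabc'

# One-off approach: encode by hand with a lenient error handler.
ignored = text.encode('cp437', 'ignore')    # drops the unencodable character
replaced = text.encode('cp437', 'replace')  # substitutes '?' for it

# Wrapper approach: a text layer that keeps cp437 but never raises.
# fake_console is a stand-in for the console's byte stream.
fake_console = io.BytesIO()
out = io.TextIOWrapper(fake_console, encoding='cp437',
                       errors='replace', newline='\n')
print(text, file=out)
out.flush()
written = fake_console.getvalue()

print(ignored, replaced, written)
```

In a real session the wrapper would presumably go around
sys.stdout.buffer, using sys.stdout.encoding, and then be assigned to
sys.stdout, but I have not tried that on a Windows console.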

DaveA



