Python 3.1.1 bytes decode with replace bug

Mon Oct 26 00:54:13 EDT 2009

"Dave Angel" <davea at ieee.org> wrote in message 
news:4AE43150.9010901 at ieee.org...
> Joe wrote:
>>> For the reason BK explained, the important difference is that I ran in
>>> the IDLE shell, which handles screen printing of unicode better ;-)
>>>
>>
>> Something still does not seem right here to me.
>>
>> In the example above the bytes were decoded to 'UTF-8' with the
>>
> *nope*  you're decoding FROM utf-8 to unicode.
>> replace option so any characters that were not UTF-8 were replaced and
>> the resulting string is '\ufffdabc' as BK explained.  I understand
>> that the replace worked.
>>
>> Now consider this:
>>
>> Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit
>> (AMD64)] on
>> win32
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>>>>> s = '\ufffdabc'
>>>>> print(s)
>>>>>
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in
>> encode
>>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
>> UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in
>> position
>> 0: character maps to <undefined>
>>
>>>>> import sys
>>>>> sys.getdefaultencoding()
>>>>>
>> 'utf-8'
>>
>> This too fails for the exact same reason (and doesn't involve decode).
>>
>> In the original example I decoded to UTF-8 and in this example the
>> default encoding is UTF-8 so why is cp437 being used?
>>
>> Thanks in advance for your assistance!
>>
>>
>>
> Benjamin had it right, but you still don't understand what he said.
>
> The problem in your original example, and in the current one, is not in 
> decode(), but in encode(), which is implicitly called by print(), when 
> needed to convert from Unicode to some byte format of the console.  Take 
> your original example:
>
>>>>>>  b'\x80abc'.decode('utf-8', 'replace')
>
>
> The decode() is explicit, and converts *FROM* a UTF-8 byte string to a
> unicode one.  But since you're running in the interactive interpreter,
> there's an implicit print of the result, which converts unicode into
> whatever your default console encoding is.  That calls encode() (or one
> of its variants, charmap_encode()) on the unicode string.  There is no
> relationship between the two steps.
>
> In your current example, you're explicitly doing the print(), but still 
> have the same implicit encoding to cp437, which gets the equivalent error. 
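> That's the encoding that your Python 3.x is choosing for the stdout
> console, based on country-specific Windows settings (sys.stdout.encoding
> reports it; sys.getdefaultencoding() is unrelated to the console).  In
> the US, that implicit encoding is typically cp437, which is why the
> traceback names cp437.py.  I don't know how to override it generically,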
> but I know it's possible to replace stdout with a wrapper that does your 
> preferred encoding.  You probably want to keep cp437, but change the error 
> handling to ignore.  Or if this is a one-time problem, I suspect you could 
> do the encoding manually, to a byte array, then print that.
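
A minimal sketch of the two approaches Dave mentions, assuming the console
really is cp437 as in the traceback.  The name tolerant_writer is a
hypothetical helper made up for this illustration, not something from the
standard library; the encode() error handlers and io.TextIOWrapper are
standard.

```python
import io

# Explicit decode: bytes -> str.  The invalid byte 0x80 becomes U+FFFD.
text = b'\x80abc'.decode('utf-8', 'replace')

# Encode by hand, choosing what happens to characters cp437 cannot represent.
replaced = text.encode('cp437', 'replace')          # unencodable char -> b'?'
ignored = text.encode('cp437', 'ignore')            # unencodable char dropped
escaped = text.encode('cp437', 'backslashreplace')  # unencodable char -> escape


def tolerant_writer(binary_stream, encoding='cp437', errors='replace'):
    """Hypothetical helper: text layer over a binary stream that does not
    raise UnicodeEncodeError, because 'errors' is not the default 'strict'."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors=errors)


# Assumed usage on a real console (not executed here):
#   sys.stdout.buffer.write(replaced + b'\n')        # print the bytes directly
#   sys.stdout = tolerant_writer(sys.stdout.buffer)  # or make print() tolerant
```

With 'replace' the output keeps a visible '?' where the bad character was,
which is usually easier to notice than the silent drop that 'ignore' gives.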

You can also replace the Unicode replacement character U+FFFD with a valid 
cp437 character before displaying it:

>>> b'\x80abc'.decode('utf8','replace').replace('\ufffd','?')
'?abc'
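
The same idea can be generalized so that it covers any character the
console cannot show, not only U+FFFD.  This is only a sketch: printable_for
is a hypothetical name, and the encoding is passed in rather than detected.

```python
def printable_for(text, encoding, errors='replace'):
    """Hypothetical helper: round-trip 'text' through 'encoding' so that
    every character the encoding cannot represent is substituted first."""
    return text.encode(encoding, errors).decode(encoding)


# The original example: U+FFFD is not in cp437, so it becomes '?'.
cleaned = printable_for(b'\x80abc'.decode('utf8', 'replace'), 'cp437')
```

The returned value is an ordinary str, so it can be passed to print() on a
console that uses the same encoding without raising UnicodeEncodeError.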

-Mark