UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed May 23 02:03:38 EDT 2018


On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:

> On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
[...]
>> You can find an encoding which is capable of decoding a file. That's
>> not the same thing.
> 
> If the result is correct, it is the same thing.

But how do you know what is correct and what isn't? In the most general 
case, even if you know the language nominally being used, you might not 
be able to recognise good output from bad:

    Max Steele strained his mighty thews against his bonds, but
    the §-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Giµx-Õƒin The Terrible of the planet Œe∂¥, laughed: "I 
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

If this text is encoded using MacRoman, then decoded as Latin-1, the 
decoding succeeds, and the result looks barely any more stupid than the 
original:

    Max Steele strained his mighty thews against his bonds, but
    the ¤-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Giµx-ÍÄin The Terrible of the planet Îe¶´, laughed: "I
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

but it clearly isn't the original text.
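
For anyone who wants to reproduce this, here is a minimal sketch. It 
assumes Python 3 and its standard "mac_roman" and "latin-1" codecs; the 
variable names are only illustrative. It takes one line of the sample 
text, encodes it as MacRoman and decodes the bytes as Latin-1. No 
exception is raised at any point, which is the whole problem.

```python
# One line of the sample text, as the author intended it.
original = "Emperor, Giµx-Õƒin The Terrible of the planet Œe∂¥"

# Write it out as MacRoman bytes...
raw = original.encode("mac_roman")

# ...and read it back with the wrong codec. Latin-1 maps every byte
# value to some character, so this can never raise an error.
garbled = raw.decode("latin-1")

print(original)
print(garbled)

# The decode "worked", but the text is not what was written.
print(garbled == original)

# The damage is only reversible if you already know both codecs.
recovered = garbled.encode("latin-1").decode("mac_roman")
print(recovered == original)
```

The last step recovers the original only because we knew in advance 
which pair of encodings was involved. Nothing in the bytes says so.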

Mojibake is especially difficult to deal with in short text snippets 
like file names or user names. These can contain arbitrary characters, 
so there is rarely any way to recognise the "correct" string. If you 
think Giµx-Õƒin The Terrible is a ludicrous example of text, you ought 
to look at user names on web forums.
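
As a rough illustration of the short-snippet problem, the sketch below 
takes the bytes of a single name and decodes them with a handful of 
single-byte codecs. The list of codecs is an arbitrary choice of mine, 
not anything special; all four are standard Python codec names. Every 
one of them decodes the bytes without complaint, and nothing in the 
output tells you which result is the right one.

```python
# Pretend these bytes arrived as a user name with no encoding label.
# (They were produced with MacRoman, but the receiver does not know that.)
name_bytes = "Giµx-Õƒin".encode("mac_roman")

# An arbitrary selection of single-byte codecs to try.
codecs_to_try = ("mac_roman", "latin-1", "cp1252", "cp437")

candidates = {}
for codec in codecs_to_try:
    # None of these raises UnicodeDecodeError for these particular bytes.
    candidates[codec] = name_bytes.decode(codec)

for codec, text in candidates.items():
    print(f"{codec:10} {text}")

# Several different, equally "valid" strings from the same bytes.
print(len(set(candidates.values())), "distinct decodings")
```

With a long passage of English you could at least check the candidates 
against a dictionary. With a nine-character user name there is nothing 
to check against.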



-- 
Steve



