UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Tue May 29 04:34:50 EDT 2018

On 2018-05-23 06:03:38 +0000, Steven D'Aprano wrote:
> On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:
> > On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
> >> You can find an encoding which is capable of decoding a file. That's
> >> not the same thing.
> > 
> > If the result is correct, it is the same thing.
> 
> But how do you know what is correct and what isn't? In the most general 
> case, even if you know the language nominally being used, you might not 
> be able to recognise good output from bad:
> 
>     Max Steele strained his mighty thews against his bonds, but
>     the §-rays had left him as weak as a kitten. The evil Galactic
>     Emperor, Giµx-Õƒin The Terrible of the planet Œe∂¥, laughed: "I 
>     have you now, Steele, and by this time tomorrow my armies will
>     have overrun your pitiful Earth defences!"
> 
> If this text is encoded using MacRoman, then decoded in Latin-1, it 
> works, and looks barely any more stupid than the original:
> 
>     Max Steele strained his mighty thews against his bonds, but
>     the ¤-rays had left him as weak as a kitten. The evil Galactic
>     Emperor, Giµx-ÍÄin The Terrible of the planet Îe¶´, laughed: "I
>     have you now, Steele, and by this time tomorrow my armies will
>     have overrun your pitiful Earth defences!"
> 
> but it clearly isn't the original text.
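
For anyone who wants to reproduce the quoted mix-up, here is a minimal sketch. It assumes Python's standard "mac_roman" and "latin-1" codecs; the helper name `misread` and the shortened sample string are made up for illustration, not taken from the thread.

```python
def misread(text, written_as="mac_roman", read_as="latin-1"):
    """Encode text with one codec and decode the bytes with another.

    Hypothetical helper for illustration only. Latin-1 assigns a
    character to every byte value, so the decode step never raises;
    it just silently yields different characters.
    """
    return text.encode(written_as).decode(read_as)


# A few of the non-ASCII fragments from the quoted paragraph.
original = "the §-rays, Õƒin, Œe∂¥"
garbled = misread(original)

print(original)
print(garbled)

# The bytes themselves are undamaged: undoing the wrong decode and
# applying the right one recovers the original text.
recovered = garbled.encode("latin-1").decode("mac_roman")
print(recovered == original)
```

The decode succeeds without any exception, which is the point being argued: a codec that can decode the file is not evidence that it is the codec the file was written in.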

Please note that I wrote "almost always", not "always". It is of course
possible to construct contrived examples where it is impossible to find
the correct encoding, because all encodings lead to equally ludicrous
results.

I would still maintain that the kind of person who invents names like
this for their fanfic is also unlikely to be able to tell you what
encoding they used ("What's an encoding? I just clicked on 'Save'!").

> Mojibake is especially difficult to deal with when you are dealing with 
> short text snippets like file names or user names which can contain 
> arbitrary characters, where there is rarely any way to recognise the 
> "correct" string.

For single file names or user names, sure. But if you have a list of
them, there is still a high probability that many of them will contain
recognizable words which can be used to deduce the (or a) correct
encoding. (Unless it's from the Ministry of Silly Names.)
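
A rough sketch of that idea, under stated assumptions: the function name `guess_encoding`, the candidate list and the tiny word list are all hypothetical, and a real word list would have to come from a dictionary for the language(s) you expect. It only counts how many decoded words are recognizable, nothing cleverer.

```python
import re


def guess_encoding(raw_names, candidates, known_words):
    """Pick the candidate encoding that yields the most recognizable words.

    Hypothetical heuristic for illustration: decode every name with each
    candidate, skip candidates that cannot decode all names, and count
    the words found in known_words. On a tie the earlier candidate wins.
    Returns (encoding, score), or (None, -1) if no candidate decodes.
    """
    best = (None, -1)
    for enc in candidates:
        try:
            decoded = [raw.decode(enc) for raw in raw_names]
        except UnicodeDecodeError:
            continue  # this candidate cannot even decode the data
        score = sum(
            1
            for name in decoded
            for word in re.findall(r"\w+", name.lower())
            if word in known_words
        )
        if score > best[1]:
            best = (enc, score)
    return best


# Made-up file names, stored as Latin-1 bytes.
raw_names = [n.encode("latin-1") for n in ("Müller.txt", "Straße.doc", "Café.pdf")]
known_words = {"müller", "straße", "café"}

print(guess_encoding(raw_names, ["utf-8", "mac_roman", "latin-1"], known_words))
```

Here UTF-8 is rejected because the bytes are not valid UTF-8, MacRoman decodes them but produces no known words, and Latin-1 scores on all three names. Closely related encodings (Latin-1 and Windows-1252, say) will often tie on data like this, which is why it finds "a" correct encoding rather than "the" one.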

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>