UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Tue May 29 04:15:28 EDT 2018


On 2018-05-23 08:43:02 +1000, Chris Angelico wrote:
> On Wed, May 23, 2018 at 8:31 AM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
> >> > 1) For any given file it is almost always possible to find the correct
> >> >    encoding (or *a* correct encoding, as there may be more than one).
> >>
> >> You can find an encoding which is capable of decoding a file. That's
> >> not the same thing.
> >
> > If the result is correct, it is the same thing.
> >
> > If I have an input file
> >
> >     4c 69 65 62 65 20 47 72 fc df 65 0a
> >
> > and I decode it correctly to
> >
> >     Liebe Grüße
> >
> > it doesn't matter whether I used ISO-8859-1 or ISO-8859-2. The mapping
> > for all bytes in the input file is the same in both encodings.
> 
> Sure, but if you try it as ISO-8859-5 or -7, you won't get an error,
> but you also won't get that string. So it DOES matter.

I get
    Liebe Grќпe
or
    Liebe Grόίe
which I can immediately recognize as wrong: they mix Cyrillic or Greek
letters, respectively, with Latin letters in the same word, which doesn't
happen in any natural language. Of course "Grќпe" could be a nickname in
an online forum (I've seen stranger names than that), but since "Liebe
Grüße" is a common German phrase, it is much, much more likely to be the
correct interpretation. Also, a real file will usually contain more than
two words. So if the text is German, it will contain more words with
umlauts, and each byte which is part of a correctly spelled German word
when interpreted according to ISO-8859-1 increases the probability that
decoding with ISO-8859-1 will produce the correct result. There remains
a tiny probability that all those matches are mere coincidence, but I
wrote "almost always", not "always", so I can live with an error rate of
0.000001% (or something like that).
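
For anyone who wants to try this at home, here is a rough sketch of the
check described above. It decodes the twelve example bytes with all four
encodings and flags words whose letters come from more than one script.
The helper names (scripts, mixed_script_words) are made up for this
example, and taking the first word of the Unicode character name as "the
script" is only a crude approximation I'm assuming is good enough here,
not a proper script lookup.

```python
import unicodedata

# The example bytes from the quoted message.
data = bytes.fromhex("4c 69 65 62 65 20 47 72 fc df 65 0a")


def scripts(word):
    """Approximate the scripts used in a word.

    Assumption: the first word of the Unicode name (LATIN, CYRILLIC,
    GREEK, ...) is a good enough stand-in for the script.
    """
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


def mixed_script_words(text):
    """Return the words whose letters come from more than one script."""
    return [w for w in text.split() if len(scripts(w)) > 1]


results = {}
for enc in ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"):
    text = data.decode(enc)  # none of these raises an error
    results[enc] = (text.strip(), mixed_script_words(text))
    print(enc, results[enc])
```

ISO-8859-1 and ISO-8859-2 give the same string with no suspicious words,
while ISO-8859-5 and ISO-8859-7 each produce one word mixing two scripts.
A real detector would also want a dictionary or n-gram check for the
"correctly spelled German word" part; this only covers the script-mixing
argument.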

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>