UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Tue May 29 06:09:18 EDT 2018


On 2018-05-29 19:46:24 +1000, Chris Angelico wrote:
> On Tue, May 29, 2018 at 6:15 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > So if the text is German, it will contain more words with
> > umlauts, and each byte that is part of a correctly spelled German word
> > when interpreted according to ISO-8859-1 increases the probability that
> > decoding with ISO-8859-1 will produce the correct result. There remains
> > a tiny probability that all those matches are mere coincidence, but I
> > wrote "almost always", not "always", so I can live with an error rate of
> > 0.000001% (or something like that).
> 
> That's basically what the chardet module does, and its error rate is
> far FAR higher than that. If you think it's easy to detect encodings,
> I'm sure the chardet maintainers will be happy to accept pull
> requests!

We were talking about humans, not programs.
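
For illustration only, the reasoning quoted above can be sketched in a
few lines of Python: decode the bytes with each candidate encoding and
count how many of the resulting words are correctly spelled German.
This is a rough sketch under stated assumptions, not a description of
how chardet or a human reader works. The word list GERMAN_WORDS, the
helpers score() and best_guess(), and the candidate encodings are all
made up for this example; a serious attempt would need a real
dictionary and a lot more text.

```python
# Hypothetical mini word list; a real check would need a full dictionary.
GERMAN_WORDS = {"für", "über", "größe", "mädchen", "schön", "straße"}


def score(data: bytes, encoding: str) -> int:
    """Count decoded words found in GERMAN_WORDS; return -1 if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    # Strip trailing punctuation and compare case-insensitively.
    return sum(
        word.strip(".,;:!?").lower() in GERMAN_WORDS for word in text.split()
    )


def best_guess(data: bytes, candidates=("utf-8", "iso-8859-1", "cp437")) -> str:
    """Pick the candidate encoding whose decoding yields the most known words."""
    return max(candidates, key=lambda enc: score(data, enc))


# Sample bytes as they might arrive without any encoding label.
sample = "Das Mädchen geht über die Straße.".encode("iso-8859-1")

for enc in ("utf-8", "iso-8859-1", "cp437"):
    print(enc, score(sample, enc))
print("best guess:", best_guess(sample))
```

With this sample, UTF-8 fails to decode at all, cp437 decodes but
produces no known words, and only ISO-8859-1 yields the three words
with umlauts or ß. One short sentence is weak evidence, of course; the
argument above relies on a longer text containing many such matches.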

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

