UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Tue May 29 08:04:19 EDT 2018


On 2018-05-29 21:13:43 +1000, Chris Angelico wrote:
> You can always solve a subset of problems. Using your own knowledge of
> German, you are able to better solve problems involving German text.
> But that doesn't make you any better than chardet at validating
> Chinese text, or Korean text, or Klingon text, or any other language
> you don't know.

But I don't have to. Chardet has to be reasonably good at identifying
any encoding. I only have to be good at identifying the encoding of
files which I need to import (or otherwise process).

Please go back to the original posting. The poster has one file which he
wants to read, and asked how to determine the encoding. He was told
categorically that this is impossible and he must ask the source.

THIS is what I'm responding to, not the problem of finding a generic
solution which works for every possible file.

The OP has one file. He wants to read it. The very fact that he wants to
read this particular file makes it very likely that he knows something
about the contents of the file. So he has domain knowledge. Which makes
it very likely that he can distinguish a correct from an incorrect
decoding. He probably can't distinguish Korean poetry from a Vietnamese
shopping list, but his file probably isn't either.
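
To make that concrete, here is a minimal sketch of how one might use such
domain knowledge. It is only an illustration, not what the OP actually did:
the function name rank_decodings, the candidate list, the sample text and the
"expected words" are all made up for this example. The idea is to try a few
plausible encodings, drop those that fail outright, and prefer the decoding in
which words you expect to see actually show up.

```python
def rank_decodings(data, candidates, expected_words):
    """Return (encoding, hits) pairs for every candidate that decodes cleanly.

    `hits` counts how many of the expected words occur in the decoded text.
    """
    results = []
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right for these bytes
        hits = sum(1 for word in expected_words if word in text)
        results.append((encoding, hits))
    # Most expected words first; ties keep the order of `candidates`.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample: German text with typographic quotes, stored as UTF-8.
# The closing quote U+201D is the byte sequence E2 80 9D in UTF-8, and 0x9d
# is exactly the byte that cp1252 cannot decode.
sample = "Die Größe der Straße beträgt \u201cfünf\u201d Meter".encode("utf-8")

ranking = rank_decodings(sample, ["cp1252", "latin-1", "utf-8"], ["Größe", "Straße"])
print(ranking)
```

With this sample, cp1252 is dropped because it raises UnicodeDecodeError on
the byte 0x9d, latin-1 decodes without error but yields mojibake containing
none of the expected words, and utf-8 comes out on top. In practice you would
read the bytes with open(path, "rb") and pick words you know must occur in
your file.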

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>