UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Chris Angelico rosuav at gmail.com
Tue May 29 07:13:43 EDT 2018


On Tue, May 29, 2018 at 8:59 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> On 2018-05-29 20:28:54 +1000, Chris Angelico wrote:
>> Sure, but you're describing a set of rules. If you can define a set of
>> rules that pin down the encoding, you could teach chardet to follow
>> those rules. If you can't teach chardet to follow those rules, you
>> can't teach a human to follow them either. What is the human going to
>> do? Guess?
>
> Xkcd to the rescue:
>
> https://xkcd.com/1425/
>
> There are a lot of things which are easy to do for a human (recognize a
> bird, understand a sentence), but very hard to write a program for
> (mostly because we don't understand how our brain works, I think).
>
> I haven't looked in detail at how chardet works but it looks like it has a
> few simple statistical models for the probability of characters and
> character sequences. This is very different from what a human does, who
> a) recognises whole words, and b) knows what they mean and whether they
> make sense in context.
>
> For a sufficiently narrow range of texts, you can write a program which
> is better at recognizing encoding or language than any human can (as an
> obvious improvement to chardet, you could supply it with dictionaries of
> all languages). However, in the general case that would need a strong
> AI. And we aren't there yet, by far.

I would go further. Some things are merely beyond current technology
(the "is it a bird" example is just now coming into current tech), but
others are fundamentally impossible. Here's a challenge: Go through a
collection of usernames and identify the language that they were
derived from. Some of them are arbitrary collections of letters and
have no "base language". Others are concatenations of words, not
individual words. A few are going to be mash-ups. Others might be
reversed or otherwise mangled. Okay. Now figure out how to pronounce
those, because that depends on the language.

Impossible? Yep. Now replace "language" with "encoding" and it's still
just as impossible. Sometimes you'll get it wrong and it won't matter
(because the end result of your guess is the same as the end result of
the actual encoding), but other times it will matter.
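
To make "sometimes it won't matter, sometimes it will" concrete, here is a
minimal sketch using only Python's built-in codecs. The sample byte strings
and the helper name decode_or_none are made up for illustration; they are
not taken from the original poster's file.

```python
# Hypothetical sample bytes, chosen purely for illustration.
same = b"caf\xe9"     # 0xE9 is e-acute in both Latin-1 and Windows-1252
differs = b"\x80 5"   # 0x80 is the euro sign in cp1252, a C1 control in Latin-1
undefined = b"\x9d"   # 0x9D has no mapping in Python's cp1252 codec


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Guessing the "wrong" one of these two encodings is harmless here...
wrong_guess_harmless = (
    decode_or_none(same, "latin-1") == decode_or_none(same, "cp1252")
)

# ...but changes the resulting text here...
wrong_guess_matters = (
    decode_or_none(differs, "latin-1") != decode_or_none(differs, "cp1252")
)

# ...and here cp1252 cannot decode the byte at all, which is the kind of
# failure named in the subject line.
cp1252_fails = decode_or_none(undefined, "cp1252") is None

print(wrong_guess_harmless, wrong_guess_matters, cp1252_fails)
```

Note that Latin-1 never raises, since it maps every byte to a code point, so
"it decoded without an error" tells you nothing about whether the guess was
right.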

You can always solve a subset of problems. Using your own knowledge of
German, you are able to better solve problems involving German text.
But that doesn't make you any better than chardet at validating
Chinese text, or Korean text, or Klingon text, or any other language
you don't know. In fact, you are WORSE than a computer, because a
computer can be programmed to be fluent in six million forms of
communication, where a human is notable with six. (My apologies if you
happen to know Chinese, Korean, or Klingon. Pick other languages.)
Suppose you were to teach a machine all your tricks for understanding
German text - but someone else teaches the same machine how to
understand other languages too. We're right back where we started,
unable to recognize which language something is. Or needing external
information about the language in order to better guess the encoding.
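
As a rough illustration of why external information is needed, the sketch
below decodes one byte string under several candidate encodings. The
candidate list and the function name plausible_decodings are my own
assumptions for the example; a real detector would weigh far more than mere
validity.

```python
# These bytes are the German word "Gruesse" (with u-umlaut and sharp s)
# encoded as Latin-1. Only the bytes are given to the function below.
data = b"Gr\xfc\xdfe"

# Hypothetical candidate list; any set of codecs could be substituted.
candidates = ["latin-1", "cp1252", "cp1251", "koi8_r", "utf-8"]


def plausible_decodings(raw, encodings):
    """Map each encoding that decodes raw without error to the resulting text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # Invalid in this encoding, so it can be ruled out.
            continue
    return results


decoded = plausible_decodings(data, candidates)
for name, text in decoded.items():
    print(f"{name:8} {text!r}")
```

UTF-8 is ruled out because the bytes are invalid there, but the remaining
candidates all decode cleanly to different strings. Picking among them
requires knowing (or guessing) that the text is German rather than, say,
something written in Cyrillic.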

ChrisA


