UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Tue May 29 06:59:29 EDT 2018


On 2018-05-29 20:28:54 +1000, Chris Angelico wrote:
> On Tue, May 29, 2018 at 8:09 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > On 2018-05-29 19:46:24 +1000, Chris Angelico wrote:
> >> That's basically what the chardet module does, and its error rate is
> >> far FAR higher than that. If you think it's easy to detect encodings,
> >> I'm sure the chardet maintainers will be happy to accept pull
> >> requests!
> >
> > We were talking about humans, not programs.
> >
> 
> Sure, but you're describing a set of rules. If you can define a set of
> rules that pin down the encoding, you could teach chardet to follow
> those rules. If you can't teach chardet to follow those rules, you
> can't teach a human to follow them either. What is the human going to
> do? Guess?

Xkcd to the rescue:

https://xkcd.com/1425/

There are a lot of things which are easy to do for a human (recognize a
bird, understand a sentence), but very hard to write a program for
(mostly because we don't understand how our brain works, I think).

I haven't looked in detail at how chardet works, but it looks like it
has a few simple statistical models for the probability of characters
and character sequences. This is very different from what a human does,
who a) recognises whole words, and b) knows what they mean and whether
they make sense in context.

For a sufficiently narrow range of texts, you can write a program which
is better at recognizing the encoding or language than any human (as an
obvious improvement to chardet, you could supply it with dictionaries of
all languages). However, in the general case that would need strong AI.
And we aren't there yet, by far.
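
The dictionary idea can be sketched roughly like this. It is a toy
illustration only, not how chardet works: the function name
guess_encoding, the candidate list and the six-word GERMAN set are all
hypothetical stand-ins (a real attempt would need full word lists for
each language).

```python
# Hypothetical sketch: choose the candidate encoding whose decoding
# produces the most dictionary words. Not chardet's actual algorithm.
import re


def guess_encoding(data, candidates, dictionary):
    """Return (encoding, score) for the best-scoring candidate,
    or (None, -1) if no candidate can decode the bytes."""
    best = (None, -1)
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        words = re.findall(r"\w+", text.lower())
        score = sum(1 for w in words if w in dictionary)
        if score > best[1]:
            best = (enc, score)
    return best


# Tiny stand-in for a real German word list (assumption).
GERMAN = {"die", "größe", "der", "tür", "ist", "schön"}

sample = "Die Größe der Tür ist schön".encode("iso-8859-1")
print(guess_encoding(sample, ["utf-8", "iso-8859-1", "cp437"], GERMAN))
```

Note that cp437 decodes the sample without any error and still matches
the ASCII-only words, so only the words containing non-ASCII letters
separate it from iso-8859-1. With short or mostly-ASCII input the scores
tie, which is exactly where a human reader's sense of context helps.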

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>