UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Chris Angelico rosuav at gmail.com
Tue May 29 06:28:54 EDT 2018


On Tue, May 29, 2018 at 8:09 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> On 2018-05-29 19:46:24 +1000, Chris Angelico wrote:
>> On Tue, May 29, 2018 at 6:15 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
>> > So if the text is German it will contain more words with
>> > umlauts and each byte which is part of a correctly spelled German word
>> > when interpreted according to ISO-8859-1 increases the probability that
>> > decoding with ISO-8859-1 will produce the correct result. There remains
>> > a tiny probability that all those matches are mere coincidence, but I
>> > wrote "almost always", not "always", so I can live with an error rate of
>> > 0.000001% (or something like that).
>>
>> That's basically what the chardet module does, and its error rate is
>> far FAR higher than that. If you think it's easy to detect encodings,
>> I'm sure the chardet maintainers will be happy to accept pull
>> requests!
>
> We were talking about humans, not programs.
>

Sure, but you're describing a set of rules. If you can define a set of
rules that pin down the encoding, you could teach chardet to follow
those rules. If you can't teach chardet to follow those rules, you
can't teach a human to follow them either. What is the human going to
do? Guess?
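
To make that concrete, here's a rough sketch of the "count the
correctly spelled German words" rule written down as code. This is not
how chardet works internally; the function names, the candidate list
and the tiny word list are all made up for illustration, and a real
attempt would need a proper dictionary.

```python
# Hypothetical illustration only: GERMAN_WORDS, score() and
# guess_encoding() are invented names, and the vocabulary is far too
# small for real use.
GERMAN_WORDS = {"für", "über", "schön", "grüße", "mädchen"}


def score(data, encoding, vocabulary=GERMAN_WORDS):
    """Count decoded words found in the vocabulary; -1 if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    words = (word.strip(".,;:!?") for word in text.lower().split())
    return sum(1 for word in words if word in vocabulary)


def guess_encoding(data, candidates=("utf-8", "iso-8859-1", "cp1252")):
    """Return the best-scoring candidate; on a tie the earlier one wins."""
    return max(candidates, key=lambda enc: score(data, enc))


latin1_bytes = "Grüße für das Mädchen".encode("iso-8859-1")
print(guess_encoding(latin1_bytes))
print(score(latin1_bytes, "iso-8859-1"), score(latin1_bytes, "cp1252"))
```

Note that ISO-8859-1 and cp1252 get the same score for that input,
because they agree on every byte in it. The rule can't separate them;
only the order of the candidates does, and that's a guess.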

ChrisA


