UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Thu Jan 29 20:22:50 EST 2009

Benjamin Kaplan <benjamin.kaplan <at> case.edu> writes:

> First of all, you're right that might be confusing. I was thinking of
auto-detect as in "check the platform and locale and guess what they usually
use". I wasn't thinking of it like the web browsers use it.I think it uses
locale.getpreferredencoding().

You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll
raise a request on the bug tracker to get some more precise wording in the
open() docs.

> On my machine, I get sys.getpreferredencoding() == 'utf-8' and
locale.getdefaultencoding()== 'cp1252'. 

sys <-> locale ... +1 long-range transposition typo of the year :-)

> If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the file in
binary mode. I suggested utf-8 and latin1 because those are the most likely
candidates for his file since cp1252 was already excluded.

The OP is on a Windows machine. His file looks like a source code file. He is
unlikely to be creating latin1 files himself on a Windows box. Under the
hypothesis that he is accidentally or otherwise reading somebody else's source
files as data, it could be any encoding. In one package with which I'm familiar,
the encoding is declared as cp1251 in every .py file; AFAICT the only file with
non-ASCII characters is an example script containing his wife's name!

The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 --
admittedly all as implausible as the latin1 control character.

> Looking at a character map, 0x9d is a control character in latin1, so the page
is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.

Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL
LETTER Y WITH ACUTE) in the OP's report 
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý","

Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just
went up a notch.

The preceding character U+00BB (looks like >>) doesn't cause an exception
because 0xBB unlike 0x9D is defined in cp1252.

Curiously looking at the \uxxxx escape sequences:
\u2021 is "double dagger", \u201a is "single low-9 quotation mark" ... what
appears to be the value part of an item in a hard-coded dictionary is about as
comprehensible as the Voynich manuscript.

Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)

Cheers,
John