UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Thu Jan 29 16:19:43 EST 2009

Benjamin Kaplan <bsk16 <at> case.edu> writes:

> 
> 
> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan <mail <at>
> anjanesh.net> wrote:
> > It does auto-detect it as cp1252- look at the files in the traceback and
> > you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
> > encoding, try opening it as utf-8 or latin1 and see if that fixes it.

Benjamin, "auto-detect" has strong connotations of the open() call (with mode
including text and encoding not specified) reading some/all of the file and
trying to guess what the encoding might be -- a futile pursuit and not what the
docs say: 

"""encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent,
but any encoding supported by Python can be passed. See the codecs module for
the list of supported encodings"""

On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It
would be interesting to know
(1) what is produced on Anjanesh's machine
(2) how the default encoding is derived (I would have thought I was a prime
candidate for 'cp1252')
(3) whether the 'default encoding' of open() is actually the same as the
'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs
don't say so.
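
One way to answer (1) and (3) empirically is to print the values side by side.
The sketch below assumes Python 3; the helper name report_default_encodings is
made up for illustration. Including locale.getpreferredencoding(False) reflects
how later Python 3 documentation describes the default for open(), which may
not match every version or platform.

```python
import locale
import os
import sys
import tempfile

def report_default_encodings():
    """Collect the encodings Python reports so they can be compared."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as f:  # text mode, encoding not specified
            open_default = f.encoding
    finally:
        os.remove(path)
    return {
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
        "open() with no encoding": open_default,
    }

for name, value in report_default_encodings().items():
    print(name, "->", value)
```

If the last line differs from the first, then the 'default encoding' of open()
is not the one sys.getdefaultencoding() reports on that machine.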

> Thanks a lot ! utf-8 and latin1 were accepted !

Benjamin and Anjanesh, please understand that
any_random_rubbish.decode('latin1') will be "accepted". This is *not* useful
information to be greeted with thanks and exclamation marks. It is merely a
by-product of the fact that decoding with *any* single-byte character set like
latin1 that uses all 256 possible byte values cannot fail, by definition; no
character "maps to <undefined>".
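
The point about latin1 can be checked directly. A minimal sketch, assuming
Python 3 (the helper decodes_ok is a hypothetical name): every one of the 256
byte values decodes under latin1, while Python's cp1252 codec leaves a few
bytes, 0x9d among them, unmapped.

```python
def decodes_ok(data, encoding):
    """Return True if data decodes under encoding without an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

every_byte = bytes(range(256))

# latin1 maps each byte value to a character, so this says nothing
# about whether latin1 is the *right* encoding.
print("latin1 accepts all 256 bytes:", decodes_ok(every_byte, 'latin1'))

# cp1252 has no character assigned to 0x9d, hence the 'charmap' error.
print("cp1252 accepts 0x9d:", decodes_ok(b'\x9d', 'cp1252'))

# A lone 0x9d is not valid utf-8 either; it could only appear as part
# of a multi-byte sequence.
print("utf-8 accepts a lone 0x9d:", decodes_ok(b'\x9d', 'utf-8'))
```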

> If you want to read the file as text, find out which encoding it actually is.
> In one of those encodings, you'll probably see some nonsense characters. If you
> are just looking at the file as a sequence of bytes, open the file in binary
> mode rather than text. That way, you'll avoid this issue altogether (just make
> sure you use byte strings instead of unicode strings).

In fact, inspection of Anjanesh's report:
"""UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined> 
The string at position 10442 is something like this :
"query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """ 

yields two observations:
(1) there is nothing in the reported string that can be unambiguously identified
as corresponding to "0x9d"
(2) it looks like a small snippet from a Python source file!

Anjanesh, is it a .py file? If so, is there something like "# encoding: cp1252"
or "# encoding: utf-8" near the start of the file? *Please* tell us what
sys.getdefaultencoding() returns on your machine.

Instead of "something like", please report exactly what is there:

print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
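
The same dump can be wrapped in a small function that closes the file and
avoids reading all of it into memory. This is only a sketch: bytes_around and
its parameters are illustrative names. It assumes the position in the error
message is a byte offset into the file, which may not hold if the file was
decoded in chunks.

```python
import os
import tempfile

def bytes_around(path, pos, before=20, after=20):
    """Return the raw bytes from pos-before through pos+after, inclusive."""
    start = max(pos - before, 0)
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(pos - start + after + 1)

# Demo on a throwaway file; for the real case the call would be
# bytes_around('the_file', 10442).
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'x' * 30 + b'\x9d' + b'y' * 30)
    snippet = bytes_around(demo_path, 30)
    print(ascii(snippet))
finally:
    os.remove(demo_path)
```

With the default arguments the offending byte sits at index 20 of the returned
bytes, so it is easy to pick out in the printed output.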

Cheers,
John