UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Sun Jun 10 17:29:36 EDT 2018

On 10Jun2018 13:04, bellcanadardp at gmail.com <bellcanadardp at gmail.com> wrote:
>here is the full error once again
>to summarize, my script works fine in python2
>i get this error trying to run it in python3
>plz see below after the error, my settings for python 2 and python 3
>for me it seems i need to change some settings to 'utf-8'..either just in python 3, since thats where i am having issues or change the settings to 'utf-8' both in python 2 and 3....i would appreciate feedback b4 i do some trial and error
>thanks for the consideration
>tommy
>
>***********************************************
>Traceback (most recent call last):
>File "createIndex.py", line 132, in <module>
>c.createindex()
>File "creatIndex.py", line 102, in createIndex
>pagedict=self.parseCollection()
>File "createIndex.py", line 47, in parseCollection
>for line in self.collFile:
>File "C:\Users\Robert\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
>return codecs.charmap_decode(input,self.errors,decoding_table)[0]
>UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7414: character maps to <undefined>

Ok, this is more helpful. It says that the decoding error, raised in 
...\cp1252.py, occurred while reading lines from the file self.collFile.

What is that file? And how was it opened?

Also, your settings below may indeed be important.

>***************************************************
>python 3 settings
>import sys
> import locale
>locale.getpreferredencoding()
>'cp1252'

The setting above is the default encoding used when you open a file in text 
mode in Python 3, but you can override it.

In Python 3 this matters a lot, because Python 3 strings are Unicode. In Python 
2, strings are just bytes, and are not "decoded" (there is a whole separate 
"unicode" type for that when it matters).

So in Python 3 the text file reader is decoding the text in the file according 
to what it expects the encoding to be.

Find the place where self.collFile is opened. You can specify the decoding 
method there by adding the "encoding=" parameter to the open() call. It is 
defaulting to "cp1252" because that is what locale.getpreferredencoding() 
returns, but presumably the actual file data are not encoded that way.
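
As a rough illustration of that default and of overriding it, here is a small 
sketch. The temporary file is only a stand-in (an assumption, since the real 
collFile path is not shown), and the default reported will vary by platform 
and Python version:

```python
import locale
import os
import tempfile

# Create a small stand-in file; the real collFile path is not known here.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"plain ascii text\n")

# What Python 3 normally falls back to when open() gets no encoding=.
print("preferred encoding:", locale.getpreferredencoding())

# Without encoding=, the file object reports whatever default was chosen.
with open(path) as f:
    default_encoding = f.encoding

# With encoding=, the default is overridden for this one file.
with open(path, encoding="utf-8") as f:
    explicit_encoding = f.encoding

print("default: ", default_encoding)
print("explicit:", explicit_encoding)

os.remove(path)
```

On a Windows machine like the one in the traceback, the first two values 
would most likely be cp1252.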

You can (a) find out what encoding _is_ used in the file and specify that or 
(b) tell Python to be less picky. Choice (a) is better if it is feasible.
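
For choice (a), one crude way to narrow things down is to read the raw bytes 
and see which candidate encodings decode them without error. This is only a 
sketch: first_decodable() and its candidate list are hypothetical names of my 
own, and a clean decode shows the encoding is possible, not that it is right:

```python
import os
import tempfile


def first_decodable(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(path, "rb") as f:   # binary mode: no decoding happens here
        data = f.read()
    for name in candidates:
        try:
            data.decode(name)     # strict error handling by default
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in file containing curly quotes encoded as UTF-8.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("\u201cquoted\u201d\n".encode("utf-8"))

detected = first_decodable(demo_path)
print(detected)
os.remove(demo_path)
```

Put the strictest encodings first. An encoding such as iso8859-1 accepts every 
byte, so it would always "succeed" and tells you nothing.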

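If you have to guess because you don't know the encoding, one possibility is 
that collFile contains utf-8 or utf-16; of these 2, utf-8 seems more likely 
given the 0x9d byte causing the trouble: 0x9d is undefined in cp1252 but is a 
normal continuation byte in utf-8 (for example, the right double quote U+201D 
is encoded as E2 80 9D).  Try adding: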

  encoding='utf-8'

to the open() call, eg:

  self.collFile = open('path-to-the-coll-file', encoding='utf-8')

at the appropriate place.
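
To see why a 0x9d byte points towards utf-8, here is a small sketch using the 
right double quotation mark (U+201D). That character is just one plausible 
source of the byte, not a diagnosis of what is actually in your file:

```python
# U+201D (right double quotation mark) encoded as UTF-8 ends in byte 0x9d.
sample = "\u201d".encode("utf-8")
print(sample)

# cp1252 has no character assigned to 0x9d, so a strict decode fails.
try:
    sample.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    print("cp1252 failed:", exc)

# The same bytes decode cleanly as UTF-8.
decoded = sample.decode("utf-8")
print(decoded == "\u201d")
```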

If that just produces a different decoding error, you have 2 choices: pick an 
encoding where every byte is "valid", such as 'iso8859-1', or tell the decoder 
to just cope with the errors by adding errors="replace", errors="ignore" or 
errors="backslashreplace" to the open() call. (errors="namereplace" only works 
when writing, not when reading.)

Both these choices have downsides.

There are several ISO8859 encodings, and they might all be wrong for your file, 
leading to _incorrect_ text lines.

The errors="..." parameter also has downsides: you will end up with missing 
(errors="ignore") or altered (errors="replace" or errors="backslashreplace") 
text, because the decoder has to do something with the undecodable bytes: drop 
them or substitute something else. The former silently loses data, while the 
latter puts in text that was not in the file, but at least it is visible if you 
inspect the data later.
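
To make those tradeoffs concrete, here is a sketch that decodes one made-up 
byte string several ways. It uses bytes.decode() for brevity, but open() 
accepts the same encoding= and errors= values. The handlers shown are three 
that are valid when decoding:

```python
# Made-up sample: "cafe" with an e-acute stored as the single byte 0xe9,
# which is valid in cp1252/iso8859-1 but not valid UTF-8.
raw = b"caf\xe9 au lait"

results = {
    # Every byte maps to some character, right or wrong.
    "iso8859-1": raw.decode("iso8859-1"),
    # Undecodable bytes become U+FFFD, the replacement character.
    "replace": raw.decode("utf-8", errors="replace"),
    # Undecodable bytes are silently dropped.
    "ignore": raw.decode("utf-8", errors="ignore"),
    # Undecodable bytes become a visible \xNN escape sequence.
    "backslashreplace": raw.decode("utf-8", errors="backslashreplace"),
}

for name, text in results.items():
    print(f"{name:17} {text!r}")
```

Here iso8859-1 happens to give the right answer only because the sample really 
was in that encoding; on utf-8 data it would produce wrong characters without 
any error.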

The full documentation for Python 3's open() call is here:

  https://docs.python.org/3/library/functions.html#open

where the various encoding= and errors= choices are described.

Cheers,
Cameron Simpson <cs at cskk.id.au>