UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Wed Jun 13 06:55:58 EDT 2018

On Sunday, 10 June 2018 17:29:59 UTC-4, Cameron Simpson  wrote:
> On 10Jun2018 13:04, bellcanadardp at gmail.com <bellcanadardp at gmail.com> wrote:
> >Here is the full error once again.
> >To summarize, my script works fine in Python 2.
> >I get this error when trying to run it in Python 3.
> >Please see my settings for Python 2 and Python 3 below, after the error.
> >It seems to me that I need to change some settings to 'utf-8', either just in Python 3, since that is where I am having issues, or in both Python 2 and 3. I would appreciate feedback before I do some trial and error.
> >Thanks for the consideration,
> >tommy
> >
> >***********************************************
> >Traceback (most recent call last):
> >File "createIndex.py", line 132, in <module>
> >c.createIndex()
> >File "createIndex.py", line 102, in createIndex
> >pagedict=self.parseCollection()
> >File "createIndex.py", line 47, in parseCollection
> >for line in self.collFile:
> >File 
> >"C:\Users\Robert\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", 
> >line 23, in decode
> >return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> >UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7414: character maps to <undefined>
> 
> Ok, this is more helpful. It says that the decoding error, which occurred in 
> ...\cp1252.py, was decoding lines from the file self.collFile.
> 
> What is that file? And how was it opened?
> 
> Also, your settings below may indeed be important.
> 
> >***************************************************
> >python 3 settings
> >import sys
> >import locale
> >locale.getpreferredencoding()
> >'cp1252'
> 
> The setting above is the default encoding used when you open a file in text 
> mode in Python 3, but you can override it.
> 
> In Python 3 this matters a lot, because Python 3 strings are Unicode. In Python 
> 2, strings are just bytes, and are not "decoded" (there is a whole separate 
> "unicode" type for that when it matters).
> 
> So in Python 3 the text file reader is decoding the text in the file according 
> to what it expects the encoding to be.
> 
> Find the place where self.collFile is opened. You can specify the decoding 
> method there by adding the "encoding=" parameter to the open() call. It is 
> defaulting to "cp1252" because that is what locale.getpreferredencoding() 
> returns, but presumably the actual file data are not encoded that way.
> 
> You can (a) find out what encoding _is_ used in the file and specify that or 
> (b) tell Python to be less picky. Choice (a) is better if it is feasible.
> 
> If you have to guess because you don't know the encoding, one possibility is 
> that collFile contains utf-8 or utf-16; of these 2, utf-8 seems more likely 
> given the 0x9d byte causing the trouble.  Try adding:
> 
>   encoding='utf-8'
> 
> to the open() call, eg:
> 
>   self.collFile = open('path-to-the-coll-file', encoding='utf-8')
> 
> at the appropriate place.
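
As a rough illustration of why a stray 0x9d byte hints at UTF-8: the sketch below assumes, purely for the sake of example, that the file contains a right double quotation mark (U+201D), whose UTF-8 encoding ends in 0x9d. The real contents of the file are unknown, so the sample character is an assumption.

```python
# Assumed sample: a right double quotation mark (U+201D) encoded as UTF-8.
data = "\u201d".encode("utf-8")
print(data)  # the three bytes 0xe2 0x80 0x9d

# Decoding with the matching codec recovers the character.
text = data.decode("utf-8")

# Decoding the same bytes as cp1252 fails on the 0x9d byte.
error = None
try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

cp1252 assigns no character to 0x9d, which is why the same bytes decode under one codec and fail under the other.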
> 
> If that just produces a different decoding error, you have 2 choices: pick an 
> encoding where every byte is "valid", such as 'iso8859-1', or tell the 
> decoder to just cope with the errors by adding the errors="replace", 
> errors="ignore" or errors="backslashreplace" parameter to the open() call.
> 
> Both these choices have downsides.
> 
> There are several ISO8859 encodings, and they might all be wrong for your file, 
> leading to _incorrect_ text lines.
> 
> The errors="..." parameter also has downsides: you will end up with 
> missing (errors="ignore") or incorrect (errors="replace" or 
> errors="backslashreplace") text, because the decoder has to do something with the 
> data: drop it or replace it with something wrong. The former loses data while 
> the latter puts in bad data, but at least it is visible if you inspect the data 
> later.
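
To make these trade-offs concrete, here is a small sketch using a made-up byte string (not data from the actual file) that contains the 0x9d byte. It compares three error handlers that apply when decoding, plus the iso8859-1 fallback.

```python
# Made-up sample bytes; 0x9d has no mapping in cp1252.
raw = b"ab\x9dcd"

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("cp1252", errors="ignore")

# "replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("cp1252", errors="replace")

# "backslashreplace" keeps a visible escape of the original byte value.
escaped = raw.decode("cp1252", errors="backslashreplace")

# iso8859-1 maps every byte to a code point, so it never fails,
# but 0x9d becomes an invisible control character (U+009D).
latin1 = raw.decode("iso8859-1")

for name, value in [("ignore", ignored), ("replace", replaced),
                    ("backslashreplace", escaped), ("iso8859-1", latin1)]:
    print(name, ascii(value))
```

The same errors= values can be passed to open() together with encoding=.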
> 
> The full documentation for Python 3's open() call is here:
> 
>   https://docs.python.org/3/library/functions.html#open
> 
> where the various encoding= and errors= choices are described.
> 
> Cheers,
> Cameron Simpson <cs at cskk.id.au>

Hello community forums,
It still fails, but I tried many things and I believe we are getting close.

First, this script comes from an open-source library module I found online, so there are many things in the source code I may not truly understand, and I can't reach the original authors or sub-authors.

collFile has to be a variable that refers to the file Collection.dat; that's my best guess.
Also, the line in the error doesn't actually open the file, so I didn't try your code, but I did try something similar before and it didn't work. Here is a snippet of the code around the error line:
*******
def parseCollection(self):
    ''' returns the id, title and text of the next page in the collection '''
    doc = []
    for line in self.collFile:
        if line == '</page>\n':
            break
        doc.append(line)

*********
So as you can see, there is no open() line there where I could set the encoding to utf-8.
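
One possible reading of the snippet: since parseCollection() only iterates over self.collFile, the open() call presumably lives elsewhere in the class, for example in a constructor or setup method, and that is where encoding= would go. The sketch below assumes such a layout; the class body, the constructor argument and the sample data are hypothetical and not taken from the actual createIndex.py.

```python
import os
import tempfile


class CreateIndex:
    """Hypothetical stand-in for the class in createIndex.py."""

    def __init__(self, coll_path):
        # Assumed location of the open() call: encoding= belongs here.
        self.collFile = open(coll_path, encoding="utf-8")

    def parseCollection(self):
        doc = []
        # Bytes are decoded as they are read, so a decoding error
        # surfaces on this line rather than at the open() call.
        for line in self.collFile:
            if line == "</page>\n":
                break
            doc.append(line)
        return doc


# Made-up sample file containing curly quotes encoded as UTF-8.
fd, path = tempfile.mkstemp(suffix=".dat")
with os.fdopen(fd, "wb") as f:
    f.write("<page>\n\u201cquoted\u201d\n</page>\n".encode("utf-8"))

c = CreateIndex(path)
print(c.collFile.encoding)  # shows which encoding the file object uses
lines = c.parseCollection()
c.collFile.close()

# With the wrong encoding, open() succeeds and the read is what fails.
failed_on_read = False
with open(path, encoding="cp1252") as bad:
    try:
        next(bad)
    except UnicodeDecodeError:
        failed_on_read = True

os.remove(path)
```

Searching the script for "collFile" or "open(" should locate the real call.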

Now here is what I tried anyway:
*******************
# -*- encoding: utf-8 -*-
#for line in self.collFile.decode("utf-8"):
#pagedict=self.parseCollection.encode("utf-8")()
**********************

I tried the 1st line at the top of the script: no effect.
I tried the 2nd line at the error line for collFile: no effect.
I tried the 3rd line at the parseCollection line: no effect.

Now here is what I found on Stack Overflow. I made some progress but still get errors, so maybe I am not inserting the code in the right place.

I tried this in various spots in the createIndex.py file:

******************
import locale
def getpreferredencoding(do_setlocale = True):
    return "utf-8"
locale.getpreferredencoding = getpreferredencoding
****************************
I tried this in the actual file, in many different spots, but it had no effect.

So I tried it at the Python 3.6.4 interpreter prompt. After entering the code, I call the function locale.getpreferredencoding() and I do finally get "utf-8" instead of the default "cp1252".

But when I go to run my script, I still get the same "can't decode byte" error in the cp1252 file.

So it seems to me that the code that changed the preferred encoding is not being applied to my file.
How can I make it permanent, or actually have it take effect? I do see the change at the interpreter, but not yet in the script.
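
A possible explanation for the difference, offered as a sketch rather than a diagnosis: a change made at the interpreter prompt lives only in that interpreter process, and running the script starts a fresh process with its own copy of the locale module. The example below patches the function in the current process and then asks a child interpreter what it sees; the name patched is made up for the illustration.

```python
import locale
import subprocess
import sys

original = locale.getpreferredencoding


def patched(do_setlocale=True):
    # Same idea as the snippet above: always claim utf-8.
    return "utf-8"


locale.getpreferredencoding = patched
try:
    # Start a separate interpreter, as happens when a script is run.
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding.__name__)"],
        stdout=subprocess.PIPE,
        check=True,
    )
finally:
    # Restore the real function so the rest of this process is unaffected.
    locale.getpreferredencoding = original

parent_name = patched.__name__
child_name = result.stdout.decode("ascii").strip()
print(parent_name, child_name)
```

The child process still reports the standard function, not the patched one. Passing encoding= explicitly to the open() call avoids depending on the default altogether.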

tommy