UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Sat Oct 20 13:23:17 EDT 2018

On 10/20/2018 8:24 AM, pjmclenon at gmail.com wrote:
> On Saturday, October 13, 2018 at 7:24:14 PM UTC-4, MRAB wrote:

> i have a sort of decode error
> UnicodeDecodeError; 'utf-8' can't decode byte 0xb0 in position 83064: invalid start byte
> *****************
> and it seems to refer to my code line:
> ***********
> data = f.read()
> ***************
> which is part of this block of code
> ********************
> # Read content of files
>      for path in files:
>          with open(join("docs", path), encoding="utf-8") as f:
>          #with open(join("docs", path)) as f:
>              data = f.read()
>              process_data(data)
> ***********************************************
> 
> would the solution fix be this?
> **********************
> data = f.read(), decoding = "utf-8"  #OR
> data = f.read(), decoding = "ascii" # is this the right fix or previous or both wrong??

Both statements are invalid syntax.  The encoding is set in the open()
call, not in f.read().
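
As a rough illustration of where the codec can legitimately be named,
here is a small sketch.  The temporary file is only a stand-in for one
of the files under "docs"; its contents are made up.

```python
import os
import tempfile

# Hypothetical stand-in for one of the files under "docs".
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9".encode("utf-8"))
    path = tmp.name

# Text mode: the codec is an argument to open(); read() takes no encoding.
with open(path, encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns bytes, which can be decoded afterwards.
with open(path, mode="rb") as f:
    same_text = f.read().decode("utf-8")

os.remove(path)
print(text == same_text)
```

Either way the codec is named once, in open() or in bytes.decode(),
and never as an extra item after f.read().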

What you need to find out: is '0xb0' a one-byte error or is 'utf-8' the 
wrong encoding?  Things I might do:

1. Change the encoding in open() to 'ascii' and see if the exception 
message still refers to position 83064 or if there is a non-ascii 
character earlier in the file.  The latter would mean that there is at 
least one earlier non-ascii sequence that was decoded as utf-8.  This 
would suggest that 'utf-8' might be correct and that the '0xb0' byte is 
an error.

2. In the latter case, add "errors='handler'", where 'handler' is 
something other than the default 'strict', such as 'replace', 'ignore' 
or 'backslashreplace'.  Look in the doc or see help(open) for 
alternatives.

3. In open(), replace "encoding='utf-8'" with "mode='rb'" so that 
f.read() creates data as bytes instead of a text string.  Then print, 
say, data[83000:83200] to see the context of the non-ascii byte.

4. Change the encoding in open() to 'latin-1'.  The file will then be 
read as text without error, even if latin-1 is the wrong encoding.
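
The four checks above can be sketched on in-memory bytes.  This is only
an illustration: the sample bytes and the helper first_decode_error are
made up, and with the real file raw would come from reading it with
mode='rb'.

```python
def first_decode_error(raw, encoding):
    """Return the UnicodeDecodeError from decoding raw, or None on success."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


# Made-up sample: valid UTF-8 text followed by a stray 0xb0 byte.
raw = "caf\u00e9 at 20".encode("utf-8") + b"\xb0" + b"C outside"

# 1. Compare where 'utf-8' and 'ascii' first fail.
utf8_err = first_decode_error(raw, "utf-8")
ascii_err = first_decode_error(raw, "ascii")
print("utf-8 fails at", utf8_err.start, "ascii fails at", ascii_err.start)

# 2. A non-strict error handler decodes the rest and marks the bad byte.
#    ascii() keeps the printed result safe for any console encoding.
marked = raw.decode("utf-8", errors="backslashreplace")
print(ascii(marked))

# 3. Look at the raw bytes around the reported position.
pos = utf8_err.start
print(raw[max(0, pos - 10):pos + 10])

# 4. latin-1 maps every byte to a character, so it never raises,
#    but multi-byte UTF-8 sequences come out as the wrong characters.
as_latin1 = raw.decode("latin-1")
print(ascii(as_latin1))
```

In this sample 'ascii' fails earlier than 'utf-8' does, which is the
pattern that points to a utf-8 file containing one bad byte.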

-- 
Terry Jan Reedy