UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Wed May 30 18:08:19 EDT 2018


On 2018-05-29 16:20:36 +0000, Steven D'Aprano wrote:
> On Tue, 29 May 2018 14:04:19 +0200, Peter J. Holzer wrote:
> 
> > The OP has one file. 
> 
> We don't know that. All we know is that he had one file which he was 
> unable to read. For all we know, he has a million files, and this was 
> merely the first of many failures.

This is of course possible. It is also possible that the file is updated
daily and the person updating the file is always choosing a random
encoding[2], so his program will always fail the next day. 

But that isn't what he has told us. And I don't find it very helpful to
invent some specific scenario and base the answers on that invented
scenario instead of what the OP has told us.


> > He wants to read it. The very fact that he wants to
> > read this particular file makes it very likely that he knows something
> > about the contents of the file. So he has domain knowledge.
> 
> An unjustified assumption. I've wanted to read many files with only the 
> vaguest guess of what they might contain.
> 
> As for his domain knowledge, look again at the OP's post. His solution 
> was to paper over the error, make the error go away, by moving to Python 
> 2 which is more lax about getting the encoding right:

By "domain knowledge" I didn't mean knowledge of Python or encodings. I
meant knowledge about whatever the contents of the file are about.

My users (mostly) have no idea what an "encoding" is. But they know what
their data is about, and can tell me whether an unidentified character
in the "unit" field is a € or a ¥[1] (and also whether the value is
nominal or real or PPP-adjusted and lots of other stuff I don't need to
know to determine the encoding but may (or may not) need to import the
data correctly).
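
As a minimal sketch of how that can work in practice (the helper name
decode_candidates and the list of candidate encodings are hypothetical,
chosen only for illustration and not taken from the OP's setup): decode
the bytes around the unidentified character with a few plausible
encodings, then let the person who knows the data pick the reading that
makes sense.

```python
def decode_candidates(raw, encodings=("cp1252", "latin-1", "iso8859-15", "cp437")):
    """Decode `raw` with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when that encoding cannot decode the bytes at all.
    The default candidate list is an assumption; adapt it to the data source.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # e.g. byte 0x9d is undefined in cp1252
            results[name] = None
    return results


if __name__ == "__main__":
    # Show each reading side by side so a domain expert can choose.
    for name, text in decode_candidates(b"100 \xa4").items():
        print(f"{name:12} {text!r}")
```

For the sample bytes above, ISO-8859-15 yields a euro sign where Latin-1
and cp1252 yield the generic currency sign. Someone who knows the "unit"
field can tell which of those is plausible; a None entry rules an
encoding out entirely.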


> "i actually got my script to function by running it in python 2.7"
> 
> So he didn't identify the correct encoding,

No, but just because he didn't doesn't mean it is impossible.

        hp

[1] I don't know if there are actually two encodings where these two
    characters share the same byte value. This is an invented example.

[2] BTDT (almost).

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>