UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer hjp-python at hjp.at
Tue May 29 06:17:23 EDT 2018


On 2018-05-29 19:47:37 +1000, Chris Angelico wrote:
> On Tue, May 29, 2018 at 6:34 PM, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > On 2018-05-23 06:03:38 +0000, Steven D'Aprano wrote:
> >> Mojibake is especially difficult to deal with when you are dealing with
> >> short text snippets like file names or user names which can contain
> >> arbitrary characters, where there is rarely any way to recognise the
> >> "correct" string.
> >
> > For single file names or user names, sure. But if you have a list of
> > them, there is still a high probability that many of them will contain
> > recognizable words which can be used to deduce the (or a) correct
> > encoding. (Unless it's from the Ministry of Silly Names).
> 
> Ohh... are you assuming that, in a list of file names, all of them use
> the same encoding? Ah, yes, well, that WOULD make it easier, wouldn't
> it. Sadly, not the case.

Not in general, but it *IS* the case we were talking about here. The
task is to find *an* encoding which can be used to decode *a* file. This
of course assumes that such an encoding exists. If there are several
encodings in the same file (I use "file" loosely here), then there is no
single encoding which can be used to decode it, so the task is
impossible. (You may still be able to split the file into chunks, each of
which uses a single encoding, and determine the encoding of each chunk,
but this is a different task - and one for which the solution "ask the
source" is even less likely to work.)
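
A minimal sketch of that task in Python, assuming the whole input really
does use one encoding. The function names, the candidate list and the word
list are made up for illustration; a real word list would have to come from
a dictionary for the language(s) you expect. It first drops every candidate
that cannot decode the bytes at all, then prefers the one that yields the
most recognizable words:

```python
def decodable_with(data, encodings):
    """Return {encoding: text} for each candidate that decodes data without error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            # This candidate cannot be the encoding of the whole input.
            continue
    return results


def guess_encoding(data, encodings, known_words):
    """Pick the candidate whose decoded text contains the most known words.

    Returns None if no candidate can decode the data. On a tie, the
    candidate listed first wins.
    """
    decoded = decodable_with(data, encodings)
    if not decoded:
        return None

    def score(enc):
        return sum(1 for word in decoded[enc].split() if word in known_words)

    return max(decoded, key=score)


# Hypothetical sample: UTF-8 text containing a right double quotation mark,
# whose UTF-8 encoding ends in the byte 0x9d from the subject line.
data = "Käse Brötchen “Müller”".encode("utf-8")
candidates = ["ascii", "utf-8", "cp1252", "latin-1"]
words = {"Käse", "Brötchen"}

print(sorted(decodable_with(data, candidates)))
print(guess_encoding(data, candidates, words))
```

Note that latin-1 never fails, because every byte value maps to some
character, so "decodes without error" alone is not enough. That is why
the second step looks for recognizable words.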

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>