What extended ASCII character set uses 0x9D?

MRAB python at mrabarnett.plus.com
Thu Aug 17 21:21:35 EDT 2017


On 2017-08-18 01:14, John Nagle wrote:
>       I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
> set best represents what's there.
> 
>      Here's a hard case:
> 
>    g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
> 
>    g1.decode("utf8")
>      UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position
> 21: invalid start byte
> 
>     g1.decode("windows-1252")
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
> 21: character maps to <undefined>
> 
> 0x9d is unmapped in "windows-1252", according to
> 
> https://en.wikipedia.org/wiki/Windows-1252
> 
> So the Python codec isn't wrong here.
> 
> Trying "latin-1"
> 
>     g1.decode("latin-1")
>    '\\"Perfect Gift Idea\\"\x9d Each time'
> 
> That just converts 0x9d in the input to 0x9d in Unicode.
> That's "Operating System Command" (the "Windows" key?)
> That's clearly wrong; some kind of quote was intended.
> Any ideas?
> 
It's preceded by something in quotes, so it might be ™ (trademark 
symbol, '\u2122') or something similar. No idea which encoding that 
would be, though.
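
One way to narrow it down is to ask each candidate codec what it does with
0x9D. The sketch below is only that, a sketch: the codec list is my own
guess at plausible candidates, and the helper names (survey_byte,
utf8_ends_with) are made up for illustration. The second helper checks a
separate guess, namely that the byte might be the tail of a truncated
multi-byte UTF-8 sequence rather than a character in a single-byte set.

```python
import unicodedata

# Hypothetical candidate list. These are standard Python codec names,
# but which ones are worth checking for this data is an assumption.
CANDIDATE_CODECS = ["cp1252", "latin-1", "cp437", "cp850",
                    "mac_roman", "cp1250", "cp1251"]


def survey_byte(byte_value, codecs=CANDIDATE_CODECS):
    """Return {codec: decoded character, or None if the byte is unmapped}."""
    raw = bytes([byte_value])
    found = {}
    for codec in codecs:
        try:
            found[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # The codec has no mapping for this byte.
            found[codec] = None
    return found


def utf8_ends_with(char, byte_value):
    """True if the UTF-8 encoding of char ends in byte_value."""
    return char.encode("utf-8")[-1] == byte_value


if __name__ == "__main__":
    # Print code points and names only, so the output stays ASCII-safe.
    for codec, ch in survey_byte(0x9D).items():
        if ch is None:
            print(f"{codec:10} unmapped")
        else:
            name = unicodedata.name(ch, "<no name>")
            print(f"{codec:10} U+{ord(ch):04X} {name}")

    # Trademark sign, left and right double quotation marks.
    for ch in ("\u2122", "\u201c", "\u201d"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"UTF-8 ends in 0x9D? {utf8_ends_with(ch, 0x9D)}")
```

Of the three characters checked at the end, only the right double quotation
mark (U+201D, UTF-8 bytes E2 80 9D) ends in 0x9D, which would fit "some kind
of quote was intended" if the two leading bytes were lost somewhere upstream.
That is a guess about the data, not something the dump confirms.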


