What extended ASCII character set uses 0x9D?

Thu Aug 17 20:38:52 EDT 2017

On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico <rosuav at gmail.com> wrote:
> On Fri, Aug 18, 2017 at 10:14 AM, John Nagle <nagle at animats.com> wrote:
>>     I'm cleaning up some data which has text description fields from
>> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
>> And some are in some other character set. So I have to examine and
>> sanity check each field in a database dump, deciding which character
>> set best represents what's there.
>>
>>    Here's a hard case:
>>
>>  g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
>>
>>  g1.decode("utf8")
>>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21:
>> invalid start byte
>>
>>   g1.decode("windows-1252")
>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21:
>> character maps to <undefined>
>>
>> 0x9d is unmapped in "windows-1252", according to
>>
>> https://en.wikipedia.org/wiki/Windows-1252
>>
>> So the Python codec isn't wrong here.
>>
>> Trying "latin-1"
>>
>>   g1.decode("latin-1")
>>  '\\"Perfect Gift Idea\\"\x9d Each time'
>>
>> That just converts 0x9d in the input to 0x9d in Unicode.
>> That's "Operating System Command" (the "Windows" key?)
>> That's clearly wrong; some kind of quote was intended.
>> Any ideas?
>
> Another possibility is that it's some kind of dash or ellipsis or
> something, but I can't find an encoding that maps 0x9D to one. (You
> already have quote characters in there.) The nearest I can actually find is:
>
>>>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
> '\\"Perfect Gift Idea\\"\u200c Each time'
>>>> unicodedata.name("\u200c")
> 'ZERO WIDTH NON-JOINER'
>
> which, honestly, doesn't make a lot of sense either. :(

In CP437 it's ¥, which makes some sense in the "gift idea" context. But
then I'd expect a number to appear with it.
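
For comparing the candidates mentioned in this thread side by side, here is
a rough sketch that decodes the single byte 0x9D under each codec and prints
the resulting code point. The helper name decode_byte and the CANDIDATES
list are my own for illustration; extend the list with whatever other
codecs you suspect.

```python
import unicodedata

# Codecs discussed in this thread; the list is illustrative, not exhaustive.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "cp1256", "cp437"]


def decode_byte(byte_value, encodings=CANDIDATES):
    """Map each encoding name to the decoded character, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = bytes([byte_value]).decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


for enc, ch in decode_byte(0x9D).items():
    if ch is None:
        print(f"{enc}: undecodable")
    else:
        # Control characters have no Unicode name, so supply a default.
        name = unicodedata.name(ch, "<unnamed control>")
        print(f"{enc}: U+{ord(ch):04X} {name}")
```

This only shows what each codec would produce; it can't tell you which one
the original author of the data actually used.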

It could also just be junk data.
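
If it does turn out to be junk, one way to handle the cleanup described in
the original question is to try strict decodes in order of preference and
flag whatever fails for manual review. This is only a sketch: the function
name classify_field and the choice to fall back to UTF-8 with replacement
characters are my assumptions, not anything from the original data
pipeline.

```python
def classify_field(raw, encodings=("utf-8", "cp1252")):
    """Return (encoding, text) for the first strict decode that succeeds.

    If every strict decode fails, return (None, text) where undecodable
    bytes are replaced with U+FFFD so the field can be flagged for review.
    """
    for enc in encodings:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, raw.decode("utf-8", errors="replace")


g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
enc, text = classify_field(g1)
print(enc, repr(text))
```

Note that latin-1 is deliberately left out of the default list: it maps
every byte to something, so it would never fail and would hide cases like
this one.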