What extended ASCII character set uses 0x9D?

John Nagle nagle at animats.com
Fri Aug 18 02:24:58 EDT 2017


On 08/17/2017 10:12 PM, Ian Kelly wrote:

>> Here's some more 0x9d usage, each from a different data item:
>>
>>
>> Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
>> Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
> 
> This one seems like a good hint since \x99 here looks like it should
> be an apostrophe. But what character set has an apostrophe there? The
> best I can come up with is that 0xE2 0x80 0x99 is "right single
> quotation mark" in UTF-8. Also known as the "smart apostrophe", so it
> could have been entered by a word processor.
> 
> The problem is that if that's what it is, then two out of the three
> bytes are outright missing. If the same thing happened to \x9d then
> who knows what's missing from it?
> 
> One possibility is that it's the same two bytes. That would make it
> 0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
> appearing after ending double quotes that seems plausible, although
> one has to wonder why it appears *in addition to* the ASCII double
> quotes.

     I was wondering if it was a signal to some word processor to
apply smart quote handling.

>> This has me puzzled.  It's often, but not always after a close quote.
>> "TM" or "(R)" might make sense, but what non-Unicode character set
>> has those.  And  "green"(tm) makes no sense.
> 
> CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
> both have ® at \xae.

    That's helpful.  All those text snippets failed Windows-1252
decoding, though, because 0x9d isn't in Windows-1252.
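
    A quick way to check both observations above in Python 3. This is
only a sketch: the sample bytes are cut down from the snippet quoted
earlier, and decodes_as is a hypothetical helper name, not something
from the original data pipeline.

```python
# 0xE2 0x80 0x99 and 0xE2 0x80 0x9D are the UTF-8 encodings of the
# right single and right double quotation marks.
assert b"\xe2\x80\x99".decode("utf-8") == "\u2019"
assert b"\xe2\x80\x9d".decode("utf-8") == "\u201d"


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Sample bytes cut down from the quoted snippet (an assumption about
# what the raw data looks like before any escaping).
sample = b"Lounge\x9d or Heatwave Interactive\x99s"

# 0x99 alone is fine in Windows-1252 (it maps to the trademark sign),
# but 0x9d is unassigned there, so the whole sample is rejected.
print(decodes_as(b"Interactive\x99s", "cp1252"))
print(decodes_as(sample, "cp1252"))

# A bare 0x9d is a UTF-8 continuation byte with no lead byte, so the
# sample is not valid UTF-8 either.
print(decodes_as(sample, "utf-8"))
```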

    I'm coming around to the idea that some of these snippets
were mis-converted at some earlier stage, which is why they make no
sense.  Someone pointed out that some of the data was UTF-8 that had
been run through an ASCII-type lowercasing algorithm, so that's a
reasonable assumption.  Thanks for looking at this, everyone.  If a
string won't parse as either UTF-8 or Windows-1252, I'm just going to
convert the bogus bytes to the Unicode replacement character (U+FFFD).
I might remove 0x9d bytes entirely, since that never seems to affect
readability.
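
    Roughly, that fallback could look like the sketch below. The
function name decode_lenient and the drop_9d flag are made up for
illustration, and using the cp1252 codec with errors="replace" for the
last step is an assumption; the plan above doesn't say which codec does
the final replacement.

```python
def decode_lenient(data: bytes, drop_9d: bool = False) -> str:
    """Try UTF-8, then Windows-1252, then fall back to U+FFFD."""
    # UTF-8 goes first: cp1252 accepts nearly any byte string, so
    # trying it first would silently mis-decode valid UTF-8.
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Neither codec accepts the string. Optionally drop the stray 0x9d
    # bytes, then turn whatever else is undecodable into U+FFFD.
    if drop_9d:
        data = data.replace(b"\x9d", b"")
    return data.decode("cp1252", errors="replace")


print(repr(decode_lenient(b"Lounge\x9d or")))
print(repr(decode_lenient(b"Lounge\x9d or", drop_9d=True)))
```

    The 0x9d bytes are only dropped after strict UTF-8 decoding has
failed, so a genuine right double quotation mark (0xE2 0x80 0x9D) in
valid UTF-8 is left alone.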

				John Nagle



