What extended ASCII character set uses 0x9D?

Ian Kelly ian.g.kelly at gmail.com
Fri Aug 18 01:12:06 EDT 2017


On Thu, Aug 17, 2017 at 9:46 PM, John Nagle <nagle at animats.com> wrote:
>    The 0x9d thing seems unrelated to the Polish names thing.  0x9d
> shows up in the middle of English text that's otherwise ASCII.
> Is this something that can appear as a result of cutting and
> pasting from Microsoft Word?
>
>    I'd like to get 0x9d right, because it comes up a lot. The
> Polish name thing is rare.  There's only about a dozen of those
> in 400MB of database dump. There are hundreds of 0x9d hits.
>
> Here's some more 0x9d usage, each from a different data item:
>
>
> Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
> Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"

This one seems like a good hint, since \x99 here looks like it should
be an apostrophe. But what character set has an apostrophe there? The
best I can come up with is that 0xE2 0x80 0x99 is "right single
quotation mark" in UTF-8. That is also known as the "smart apostrophe",
so it could have been entered by a word processor.
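
For anyone who wants to check the byte sequence, here is a minimal
sketch using only the standard library. The variable names are mine
and purely illustrative:

```python
import unicodedata

# U+2019 is the "smart apostrophe" a word processor typically inserts.
apostrophe = "\u2019"
encoded = apostrophe.encode("utf-8")

print(unicodedata.name(apostrophe))  # RIGHT SINGLE QUOTATION MARK
print(encoded)                       # b'\xe2\x80\x99'

# The final byte is the \x99 seen in the dump.
print(hex(encoded[-1]))              # 0x99
```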

The problem is that if that's what it is, then two out of the three
bytes are outright missing. If the same thing happened to \x9d then
who knows what's missing from it?

One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D, which is "right double quotation mark". Since it keeps
appearing after closing double quotes, that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
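
A rough way to test that guess is to put the two presumed-missing
bytes back in front of each stray byte and see what decodes. This is
only a sketch under the assumption that the lost prefix really is
0xE2 0x80; PREFIX and guess_char are hypothetical names, not anything
from the data or a library:

```python
import unicodedata

# Assumption: the two bytes that went missing are always 0xE2 0x80.
PREFIX = b"\xe2\x80"

def guess_char(last_byte):
    """Decode PREFIX plus the surviving byte as a UTF-8 sequence."""
    return (PREFIX + bytes([last_byte])).decode("utf-8")

for b in (0x99, 0x9D):
    ch = guess_char(b)
    print(hex(b), "U+%04X" % ord(ch), unicodedata.name(ch))
```

If the assumption holds, \x99 comes out as the right single quotation
mark and \x9d as the right double quotation mark. It says nothing
about how the bytes were lost in the first place.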

> This has me puzzled.  It's often, but not always after a close quote.
> "TM" or "(R)" might make sense, but what non-Unicode character set
> has those.  And  "green"(tm) makes no sense.

CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
both have ® at \xae.
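
These mappings can be checked against Python's own codecs. The sketch
below also probes \x9d under CP-1252; the result reflects Python's
codec table, which I'm assuming matches whatever produced the dump:

```python
import unicodedata

# \x99 under CP-1252
tm = b"\x99".decode("cp1252")
print(unicodedata.name(tm))  # TRADE MARK SIGN

# \xae under both CP-1252 and Latin-1
for codec in ("cp1252", "latin-1"):
    print(codec, unicodedata.name(b"\xae".decode(codec)))

# \x9d under CP-1252: Python's codec leaves this byte unmapped.
try:
    b"\x9d".decode("cp1252")
    nine_d = "mapped"
except UnicodeDecodeError:
    nine_d = "unmapped"
print("0x9d in cp1252:", nine_d)
```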


