What extended ASCII character set uses 0x9D?

MRAB python at mrabarnett.plus.com
Fri Aug 18 05:58:15 EDT 2017


On 2017-08-18 04:46, John Nagle wrote:
> On 08/17/2017 05:53 PM, Chris Angelico wrote:
>   > On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <nagle at animats.com> wrote:
>   >> On 08/17/2017 05:14 PM, John Nagle wrote:
>   >>>       I'm cleaning up some data which has text description fields from
>   >>> multiple sources.
>   >> A few more cases:
>   >>
>   >> bytearray(b'\xe5\x81ukasz zmywaczyk')
>   >
>   > This one has to be Polish, and the first character should be the
>   > letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
>   > very similar to the E5 81 that you have.
>   >
>   > So here's an insane theory: something attempted to lower-case the byte
>   > stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
>   > like 0x45 or "E", which lower-cases by having 32 added to it, yielding
>   > 0xE5. Reversing this transformation yields sane data for several of
>   > your strings - they then decode as UTF-8:
>   >
>   > miguel Ángel santos
>   > lidija kmetič
>   > Łukasz zmywaczyk
>   > jiří urbančík
>   > Ľubomír mičko
>   > petr urbančík
> 
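The lower-casing theory quoted above can be sketched in Python. This is only a rough illustration, assuming the damage was "add 0x20 to any byte whose low seven bits look like A-Z"; the function name undo_ascii_lowercase is made up for this example and is not something anyone in the thread posted.

```python
def undo_ascii_lowercase(raw: bytes) -> str:
    """Decode as UTF-8, undoing a suspected ASCII-style lower-casing.

    Assumption: bytes 0xC1-0xDA were turned into 0xE1-0xFA. Whenever
    decoding fails at such a byte, subtract 0x20 from it and retry.
    """
    data = bytearray(raw)
    while True:
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError as err:
            bad = data[err.start]
            if not 0xE1 <= bad <= 0xFA:
                raise  # not explained by the lower-casing theory
            data[err.start] = bad - 0x20


# One of the samples from earlier in the thread.
print(undo_ascii_lowercase(b"\xe5\x81ukasz zmywaczyk"))
```

It only repairs bytes that make the UTF-8 decoder fail, so a damaged sequence that happens to remain valid UTF-8 would pass through unchanged.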
>      I think you're right for those.  I'm working from a MySQL dump of
> supposedly LATIN-1 data, but LATIN-1 will accept anything. I've
> found UTF-8 and Windows-1252 in there. It's quite possible that someone
> lower-cased UTF-8 stored in a LATIN-1 field.  There are lots of
> questions on the web which complain about getting a Python decode error
> on 0x9d, and the usual answer is "Use Latin-1". But that doesn't really
> decode properly, it just doesn't generate an exception.
> 
>   > That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
>   > are still a puzzle.
> 
>      The 0x9d thing seems unrelated to the Polish names thing.  0x9d
> shows up in the middle of English text that's otherwise ASCII.
> Is this something that can appear as a result of cutting and
> pasting from Microsoft Word?
> 
>      I'd like to get 0x9d right, because it comes up a lot. The
> Polish name thing is rare.  There's only about a dozen of those
> in 400MB of database dump. There are hundreds of 0x9d hits.
> 
> Here's some more 0x9d usage, each from a different data item:
> 
> 
> Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
> Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
> 
> for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\" Everything
> 
> Netwise Depot is  a \\"One Stop Web Shop\\"\x9d that provides
> 
> sustainable \\"green\\"\x9d living
> 
> are looking for a \\"Do It for Me\\"\x9d solution
> 
> 
> This has me puzzled.  It's often, but not always, after a close quote.
> "TM" or "(R)" might make sense, but what non-Unicode character set
> has those?  And "green"(tm) makes no sense.
> 
I googled for """Netwise Depot is  a""" and found this page:

     https://www.crunchbase.com/organization/netwise-depot#/entity

It has the text:

     Netwise Depot is a "One Stop Web Shop" that provides a holistic solution

Put that through Python's ascii() function and you get:

     'Netwise Depot is a "One Stop Web Shop"\x9d that provides a holistic solution'
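
That step can be reproduced with a short snippet. It is a minimal sketch, assuming the page text carries an invisible U+009D right after the closing quote; the string below is typed in by hand, not fetched from the site.

```python
import unicodedata

# Assumption: hand-typed copy of the page text with U+009D after the quote.
text = 'Netwise Depot is a "One Stop Web Shop"\x9d that provides'

# ascii() escapes every non-ASCII character, so the hidden one shows up.
escaped = ascii(text)
print(escaped)  # 'Netwise Depot is a "One Stop Web Shop"\x9d that provides'

# U+009D is a C1 control character, which is why it is invisible on the page.
category = unicodedata.category("\x9d")
print(category)  # Cc
```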

OK. Try another one.

Google for """Guitar Pro, JamPlay, RedBana""":

     https://www.crunchbase.com/organization/the-rights-workshop#/entity

Look familiar?

That page has:

    Guitar Pro, JamPlay, RedBana's Audition, Doppleganger™s

Is that where the data comes from?
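
The \x99 and \x9d bytes can be checked against the codecs mentioned in this thread. This is a minimal sketch using Python's built-in cp1252 and latin-1 codecs; the helper name try_decode is made up for illustration. The last two lines record a fact about UTF-8 that may or may not be how this particular data was damaged.

```python
def try_decode(raw: bytes, codec: str) -> str:
    """Return the decoded text, or a marker if the codec rejects the bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return "<undefined>"


# Windows-1252 maps 0x99 to the trade mark sign, as in "Doppleganger™s".
tm = try_decode(b"\x99", "cp1252")

# Python's cp1252 codec leaves 0x9d unassigned, hence the decode errors.
undefined = try_decode(b"\x9d", "cp1252")

# Latin-1 accepts every byte, but 0x9d becomes the control character U+009D.
control = try_decode(b"\x9d", "latin-1")

print(ascii(tm), undefined, ascii(control))

# Only a guess for this data: 0x9d and 0x99 are also the last UTF-8 bytes
# of the right double and right single quotation marks.
print("\u201d".encode("utf-8"))  # b'\xe2\x80\x9d'
print("\u2019".encode("utf-8"))  # b'\xe2\x80\x99'
```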


