What extended ASCII character set uses 0x9D?

John Nagle nagle at animats.com
Fri Aug 18 17:42:49 EDT 2017


On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <nagle at animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>>       I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
> 
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
> 
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
> 
> miguel Ángel santos
> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík

    You're exactly right.  The database has columns "name" and
"normalized name".  The name was normalized by forcing it to
lower case byte by byte, as if it were ASCII, even when it was
UTF-8.  That resulted in errors like

KACMAZLAR MEKANİK  -> kacmazlar mekanä°k

Anita Calçados -> anita calã§ados

Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta
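
    The mangling in these examples can be reproduced with a short
Python sketch.  This is an illustration of the theory quoted above,
not the code that actually produced the database; the function names
ascii_style_lower and undo_ascii_style_lower are made up here.  It
assumes the original text was valid UTF-8 and that the only damage was
adding 0x20 to bytes that look like 'A'..'Z' once the high bit is
ignored.

```python
def ascii_style_lower(data: bytes) -> bytes:
    """Lower-case bytes as if each were ASCII, ignoring the high bit (the bug)."""
    out = bytearray()
    for b in data:
        if 0x41 <= (b & 0x7F) <= 0x5A:  # looks like 'A'..'Z' with the high bit masked
            b += 0x20                   # e.g. 'A' -> 'a', but also 0xC5 -> 0xE5
        out.append(b)
    return bytes(out)


def undo_ascii_style_lower(data: bytes) -> str:
    """Best-effort repair of a byte string damaged by ascii_style_lower()."""
    try:
        return data.decode("utf-8")     # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        pass
    # Bytes 0xE1..0xFA are where 0xC1..0xDA land after the bad lower-casing.
    repaired = bytes(b - 0x20 if 0xE1 <= b <= 0xFA else b for b in data)
    return repaired.decode("utf-8")     # still raises if the damage was something else


mangled = ascii_style_lower("Łukasz zmywaczyk".encode("utf-8"))
print(mangled)                          # b'\xe5\x81ukasz zmywaczyk'
restored = undo_ascii_style_lower(mangled)
```

    The repair is only a heuristic: ASCII letters stay lower-cased,
and a string that mixes damaged two-byte sequences with genuine
three-byte sequences will not decode after the shift.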

    The "name" field is OK; it's just the "normalized name" field
that is sometimes garbled. Now that I know this, and have properly
captured the "name" field in UTF-8 where appropriate, I can
regenerate the "normalized name" field.  MySQL/MariaDB know how
to lower-case UTF-8 properly.
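
    For comparison, the same regeneration could be sketched in
Python, assuming the "name" values arrive as UTF-8 bytes.  The
function name regenerate_normalized is hypothetical, and Python's
Unicode lower-casing may not match the database's result for every
character.  The point is only that the bytes are decoded first and
lower-cased as text, not byte by byte.

```python
def regenerate_normalized(name_bytes: bytes) -> str:
    """Decode UTF-8 first, then lower-case as Unicode text."""
    name = name_bytes.decode("utf-8")   # raises UnicodeDecodeError on bad input
    return name.lower()                 # Unicode-aware, unlike the byte-wise version


normalized = regenerate_normalized("Anita Calçados".encode("utf-8"))
```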

    Clean data at last.  Thanks.

    The database, by the way, is a historical snapshot of startup
funding, from Crunchbase.

				John Nagle


