What extended ASCII character set uses 0x9D?

MRAB python at mrabarnett.plus.com
Thu Aug 17 22:15:54 EDT 2017


On 2017-08-18 01:53, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <nagle at animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>>      I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
> 
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
> 
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
> 
> miguel Ángel santos

I think that's:

miguel ángel santos

> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík
> 
> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.
> 
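A minimal sketch of the theory quoted above, assuming the damage was an ASCII-style lower-casing that ignored the high bit, so that bytes 0xC1..0xDA had 0x20 added just as 0x41..0x5A would. The function names are made up for illustration. The reversal is only a guess: a byte that was in 0xE1..0xFA to begin with is indistinguishable from a damaged one, so the result is only accepted if it decodes as UTF-8.

```python
def simulate_damage(data: bytes) -> bytes:
    """Model the suspected bug: lower-case each byte as if it were ASCII,
    looking only at the low 7 bits (so 0xC5 is treated like 0x45, 'E')."""
    return bytes(b + 0x20 if 0x41 <= (b & 0x7F) <= 0x5A else b for b in data)


def undo_damage(data: bytes) -> bytes:
    """Subtract 0x20 from every high-bit byte in 0xE1..0xFA.
    Plain ASCII letters are left alone; their original case is lost."""
    return bytes(b - 0x20 if 0xE1 <= b <= 0xFA else b for b in data)


def try_repair(data: bytes):
    """Return decoded text, or None if neither attempt is valid UTF-8."""
    for candidate in (data, undo_damage(data)):
        try:
            return candidate.decode("utf-8")
        except UnicodeDecodeError:
            continue
    return None


print(try_repair(b"\xe5\x81ukasz zmywaczyk"))
```

Strings that still fail after this, such as the 0x81 0x81 and 0x9d cases, come back as None and need a different explanation.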