What extended ASCII character set uses 0x9D?

Chris Angelico rosuav at gmail.com
Thu Aug 17 21:02:59 EDT 2017


On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
> On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
>> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle <nagle at animats.com> wrote:
>>> A few more cases:
>>>
>>> bytearray(b'miguel \xe3\x81ngel santos')
>>
>> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
>> rest of the name.
>>
>>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>>
>> If that were b'\xc5\x81' it would be Ł in UTF-8 which would fit the
>> rest of the name.
>>
>> I suspect the others contain similar errors. I don't know if it's the
>> result of some form of Mojibake or maybe just transcription errors.
>
> Oh shit, I think I know what happened. In ASCII you can lower-case
> letters by just adding 32 (0x20) to them. Somebody tried to do that
> here and fucked up the encoding. That's why all the ASCII letters in
> the strings are lower-case while these ones aren't.

That applies to some, but not all.
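
For the two names Ian decoded, the idea can be checked with a quick
sketch. It assumes (my guess, not something anyone in the thread has
confirmed) that the text was correct UTF-8 and that something
lower-cased it as if it were Latin-1, which amounts to adding 0x20 to
bytes in the 0xC0-0xDE range as well as to ASCII A-Z. The helper name
latin1_lower is made up for the illustration.

```python
def latin1_lower(data: bytes) -> bytes:
    """Lower-case raw bytes as if they were Latin-1 text (hypothetical helper)."""
    # Latin-1 maps every byte to one code point, so this round trip is lossless
    # apart from the case change itself.
    return data.decode('latin-1').lower().encode('latin-1')


# Assumed original spellings, written with escapes to avoid source-encoding issues.
names = ['Miguel \u00c1ngel Santos', '\u0141ukasz Zmywaczyk']

for name in names:
    utf8 = name.encode('utf-8')
    # The lead byte of each two-byte sequence (0xC3, 0xC5) gets 0x20 added;
    # the continuation byte 0x81 has no case in Latin-1 and is left alone.
    print(utf8, '->', latin1_lower(utf8))
```

Both results match the byte strings John posted, so the assumption is
at least consistent with those two cases.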

> bytearray(b'M\x81\x81\xfcnster')

This should be Münster, where the ü is U+00FC. You have 81 81 FC. I
don't know of any encoding that does this, but it looks indicative -
and it's not the lower-casing. The 0x9d isn't explained by the
lower-casing either, but maybe it's somehow related to 0x2d, which is
an ASCII hyphen?
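
A rough way to see that no common single encoding produces those bytes
is to encode the expected spelling with a few candidates and compare.
The list of codecs is just my own pick of likely suspects, not an
exhaustive search.

```python
# Bytes as posted in the thread.
observed = b'M\x81\x81\xfcnster'

# Expected spelling, with the u-umlaut written as an escape (U+00FC).
expected = 'M\u00fcnster'

# Candidate codecs: an arbitrary selection chosen for illustration.
candidates = ['utf-8', 'latin-1', 'cp1252', 'cp437', 'cp850']

encoded = {name: expected.encode(name) for name in candidates}
for name, data in encoded.items():
    print(f'{name:8} {data!r} match={data == observed}')

# True only if one of the candidates reproduces the posted bytes exactly.
any_match = any(data == observed for data in encoded.values())
print('any match:', any_match)
```

None of them match. Latin-1 and cp1252 give the trailing FC, and 0x81
happens to be ü in cp437 and cp850, which may or may not be a
coincidence.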

ChrisA


