What extended ASCII character set uses 0x9D?

Chris Angelico rosuav at gmail.com
Fri Aug 18 02:31:29 EDT 2017


On Fri, Aug 18, 2017 at 4:24 PM, John Nagle <nagle at animats.com> wrote:
>    I'm coming around to the idea that some of these snippets
> have been previously mis-converted, which is why they make no sense.
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm, that's a reasonable
> assumption.  Thanks for looking at this, everyone.  If a string won't
> parse as either UTF-8 or Windows-1252, I'm just going to convert the
> bogus stuff to the Unicode replacement character. I might remove
> 0x9d chars, since that never seems to affect readability.

That sounds like a good plan. Unless you can pin down a single
coherent encoding (even a broken one, like "UTF-8, then add 32 to
every byte between 0xC1 and 0xDA"), all you can do is decode each
string on its own. There just isn't enough context to do anything
smarter than flipping unparseable bytes to U+FFFD.
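
For what it's worth, here is a rough sketch of that fallback chain in
Python. The function name decode_snippet and the drop_0x9d flag are
made up for illustration, and using the cp1252 codec with
errors="replace" for the last step is my assumption about what
"convert the bogus stuff" would mean in practice, not something
settled in this thread.

```python
def decode_snippet(raw: bytes, drop_0x9d: bool = False) -> str:
    """Decode one byte string: strict UTF-8 first, then Windows-1252.

    Hypothetical helper; the name and the drop_0x9d flag are
    illustrative assumptions, not an established API.
    """
    # Strict UTF-8 first: non-ASCII text rarely validates by accident.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Only strip 0x9D once UTF-8 has failed. In valid UTF-8 it is an
    # ordinary continuation byte (U+201D is E2 80 9D), so removing it
    # earlier would damage good text.
    if drop_0x9d:
        raw = raw.replace(b"\x9d", b"")

    # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
    # unmapped; errors="replace" turns those into U+FFFD.
    return raw.decode("cp1252", errors="replace")
```

For example, decode_snippet(b"caf\xe9") falls through to Windows-1252
and gives "café", while decode_snippet(b"a\x9db") gives "a\ufffdb"
unless drop_0x9d=True is passed, in which case the byte is removed.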

ChrisA


