Reversible malformed UTF-8 to malformed UTF-16 encoding

Florian Weimer fw at deneb.enyo.de
Tue Mar 19 16:32:34 EDT 2019


I've seen occasional proposals like this one coming up:

| I therefore suggested 1999-11-02 on the unicode at unicode.org mailing
| list the following approach. Instead of using U+FFFD, simply encode
| malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed
| UTF-8 sequences consist excludively of the bytes 0x80 - 0xff, and
| each of these bytes can be represented using a 16-bit value from the
| UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong
| "K" (U+004B) 0xc1 0x8b from the above example would be represented
| in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8
| encoded surrogate character is also treated like a malformed
| sequence, then there is no way that a single high-half surrogate
| could precede the encoded malformed sequence and cause a valid
| UTF-16 sequence to emerge.

<http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html>

Has this ever been implemented in any Python version?  I seem to
remember something like that, but all I could find was me talking
about this in 2000.

It's not entirely clear whether this is a good idea as the default
encoding for security reasons, but it might be nice to be able to read
XML or JSON which is not quite properly encoded, only nearly so,
without treating it as ISO-8859-1 or some other arbitrarily chose
single-byte character set.



More information about the Python-list mailing list