Reversible malformed UTF-8 to malformed UTF-16 encoding

Tue Mar 19 17:49:20 EDT 2019

On 2019-03-19 20:32, Florian Weimer wrote:
> I've seen occasional proposals like this one coming up:
> 
> | I therefore suggested 1999-11-02 on the unicode at unicode.org mailing
> | list the following approach. Instead of using U+FFFD, simply encode
> | malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed
> | UTF-8 sequences consist excludively of the bytes 0x80 - 0xff, and
> | each of these bytes can be represented using a 16-bit value from the
> | UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong
> | "K" (U+004B) 0xc1 0x8b from the above example would be represented
> | in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8
> | encoded surrogate character is also treated like a malformed
> | sequence, then there is no way that a single high-half surrogate
> | could precede the encoded malformed sequence and cause a valid
> | UTF-16 sequence to emerge.
> 
> <http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html>
> 
> Has this ever been implemented in any Python version?  I seem to
> remember something like that, but all I could find was me talking
> about this in 2000.
> 
> It's not entirely clear whether this is a good idea as the default
> encoding for security reasons, but it might be nice to be able to read
> XML or JSON which is not quite properly encoded, only nearly so,
> without treating it as ISO-8859-1 or some other arbitrarily chose
> single-byte character set.
> 
Python 3 has "surrogate escape". Have a read of PEP 383 -- Non-decodable 
Bytes in System Character Interfaces.