Reversible malformed UTF-8 to malformed UTF-16 encoding

Tue Mar 19 18:53:41 EDT 2019

* MRAB:

> On 2019-03-19 20:32, Florian Weimer wrote:
>> I've seen occasional proposals like this one coming up:
>> 
>> | I therefore suggested 1999-11-02 on the unicode at unicode.org mailing
>> | list the following approach. Instead of using U+FFFD, simply encode
>> | malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed
>> | UTF-8 sequences consist excludively of the bytes 0x80 - 0xff, and
>> | each of these bytes can be represented using a 16-bit value from the
>> | UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong
>> | "K" (U+004B) 0xc1 0x8b from the above example would be represented
>> | in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8
>> | encoded surrogate character is also treated like a malformed
>> | sequence, then there is no way that a single high-half surrogate
>> | could precede the encoded malformed sequence and cause a valid
>> | UTF-16 sequence to emerge.
>> 
>> <http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html>
>> 
>> Has this ever been implemented in any Python version?  I seem to
>> remember something like that, but all I could find was me talking
>> about this in 2000.

> Python 3 has "surrogate escape". Have a read of PEP 383 -- Non-decodable 
> Bytes in System Character Interfaces.

Thanks, this is the information I was looking for.