[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 02:26:53 CEST 2011

On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> What about things like the surrogateescape codec that
> deliberately use code units in non-standard ways? Will
> tricks like that still be possible if the code-unit
> level is hidden from the programmer?

I would think that it should still be possible to explicitly put
surrogates into a string, using the appropriate \uxxxx escape or
chr(i) or some such approach; the basic string operations IMO
shouldn't bother with checking for well-formed character sequences
(just as they shouldn't care about normal forms). But decoding bytes
from UTF-16 should not leave any surrogate pairs in, since
interpreting those is part of the decoding.
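
A minimal sketch of the distinction drawn above, assuming a current CPython 3 (3.3 or later, where PEP 393 is in effect); the variable names are only illustrative. It builds a string from explicit surrogates, decodes a UTF-16 surrogate pair, and round-trips an undecodable byte through surrogateescape:

```python
# Surrogates can be put into a str explicitly; basic string operations
# such as concatenation do not check for well-formed sequences.
high = "\ud83d"        # lone high surrogate via a \uxxxx escape
low = chr(0xDE00)      # lone low surrogate via chr(i)
manual = high + low    # stays two separate code points

# Decoding UTF-16 interprets the surrogate pair, so no surrogates remain.
utf16_bytes = b"\x3d\xd8\x00\xde"   # code units D83D DE00, little-endian
decoded = utf16_bytes.decode("utf-16-le")

# surrogateescape deliberately maps an undecodable byte to a lone surrogate
# and maps it back to the same byte on encoding.
escaped = b"caf\xe9".decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")

# ascii() is used because printing lone surrogates directly can fail.
print(len(manual), len(decoded))
print(ascii(manual), ascii(decoded), ascii(escaped))
```

Here the manually built string and the decoded one do not compare equal, which is the garbage-in-garbage-out behaviour at the string level.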

I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).
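
As a rough illustration of a separate "detect garbage" helper, here is a sketch; find_surrogates and is_well_formed are hypothetical names, not existing library functions. The encoding lines show what a current CPython 3 does with a string containing a surrogate pair, which is not necessarily what a wide build did at the time of this thread:

```python
def find_surrogates(text):
    """Return the indexes of surrogate code points (U+D800..U+DFFF) in text."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_well_formed(text):
    """Hypothetical check: True if text contains no surrogate code points."""
    return not find_surrogates(text)


pair = "\ud83d\ude00"     # two separately created surrogates forming a pair
clean = "\U0001F600"      # the single code point the pair would stand for

# The strict UTF-8 codec in current CPython rejects surrogates.
try:
    pair.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The surrogatepass error handler encodes each surrogate separately instead.
passed_through = pair.encode("utf-8", "surrogatepass")

print(find_surrogates(pair), is_well_formed(clean), strict_encode_ok)
print(passed_through)
```

An application worried about garbage could call such a check at its boundaries, leaving the basic string operations unchecked.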

-- 
--Guido van Rossum (python.org/~guido)