[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Tue Sep 2 08:51:54 CEST 2008

Adam Olsen <rhamph at gmail.com> added the comment:

Marc, I don't understand what you're saying.  UTF-16's surrogates are
not optional.  Unicode 2.0 and later require them, and Python is
supposed to support it.

Likewise, UCS-4 originally allowed a much larger range of code points,
but it no longer does; allowing them would mean supporting only old,
archaic versions of the standards (which is clearly not desirable.)

You are right in that I shouldn't have said "a pair of ill-formed code
units".  I should have said "a pair of unassigned code points", which is
how UCS-2 always have and always will classify them.

Although python may allow ill-formed sequences to be created internally
(primarily lone surrogates on UTF-16 builds), it cannot encode or decode
them.  The standard is clear that these are to be treated as errors,
which the .decode()'s "errors" argument controls.  You could add a new
value for "errors" to pass-through the garbage, but I fail to see a use
case for it.

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3297>
_______________________________________