[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

Marc-Andre Lemburg report at bugs.python.org
Wed Apr 29 18:54:27 CEST 2009


Marc-Andre Lemburg <mal at egenix.com> added the comment:

While it's probably ok to fix the codecs, there's an issue which makes
this difficult at least for the utf-8 codec:

The marshal module uses utf-8 to write Unicode objects, and these must
be able to store the full range of supported UCS2/UCS4 code points,
including lone surrogates.

If the utf-8 codec were changed to raise an error for these, marshal
would no longer be able to write/read Unicode objects.
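
To make the dependency concrete, here is a minimal sketch, assuming a
current CPython 3 interpreter (its behavior is an assumption here, not
something this comment states). The helper names strict_utf8_accepts and
marshal_roundtrip are hypothetical and exist only for illustration. The
sketch checks whether the strict utf-8 codec accepts a lone surrogate and
whether marshal can still round-trip one.

```python
import marshal

# A lone (unpaired) high surrogate: a code point, but not a valid
# Unicode scalar value.
LONE_SURROGATE = "\ud800"


def strict_utf8_accepts(text: str) -> bool:
    """Return True if the strict utf-8 codec can encode `text`."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def marshal_roundtrip(text: str) -> str:
    """Serialize `text` with marshal and read it back."""
    return marshal.loads(marshal.dumps(text))


if __name__ == "__main__":
    print("strict utf-8 accepts lone surrogate:",
          strict_utf8_accepts(LONE_SURROGATE))
    print("marshal round-trips lone surrogate:",
          marshal_roundtrip(LONE_SURROGATE) == LONE_SURROGATE)
```

If the strict codec rejects the string while marshal still round-trips
it, then marshal cannot be going through the strict utf-8 path, which is
the coupling described above.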

It is likely that other existing Python code (outside the std lib) also
relies on this ability.

Changing this would only be possible in 3.1.

The marshal module would then also have to be changed to use a different
encoding, one that does support lone surrogates.
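
One possible shape for such an encoding, sketched under assumptions: the
"surrogatepass" error handler that later Python 3 releases provide for
the utf-8 codec lets lone surrogates through in both directions. Whether
marshal should use it is not claimed here, and the helper names
encode_lenient and decode_lenient are hypothetical.

```python
def encode_lenient(text: str) -> bytes:
    """Encode to utf-8, passing lone surrogates through unchanged."""
    return text.encode("utf-8", "surrogatepass")


def decode_lenient(data: bytes) -> str:
    """Decode utf-8, accepting byte sequences for lone surrogates."""
    return data.decode("utf-8", "surrogatepass")


if __name__ == "__main__":
    lone = "\ud800"  # unpaired high surrogate
    raw = encode_lenient(lone)
    print(raw)
    print(decode_lenient(raw) == lone)
```

Ordinary text encodes to the same bytes as with the strict codec, so
only strings containing lone surrogates are affected by the choice of
error handler.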

See issue 3297 for a discussion of UTF-8/16 vs. UCS2/4, the
implications, motivations, etc.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3672>
_______________________________________

