[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

Mon Jan 30 08:30:00 CET 2012

Kang-Hao (Kenny) Lu <kennyluck at csail.mit.edu> added the comment:

Attached patch does the following beyond what the patch from haypo does:
  * call the error handler
  * reject 0xd800~0xdfff when decoding utf-32
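
For illustration only (this is not code from the patch), the range check in the second bullet can be sketched in pure Python. The helper name decode_utf32_be_strict is hypothetical, and the sketch raises ValueError instead of calling a codec error handler:

```python
import struct


def decode_utf32_be_strict(data: bytes) -> str:
    """Hypothetical helper: decode big-endian UTF-32, rejecting surrogates."""
    if len(data) % 4:
        raise ValueError("truncated data")
    chars = []
    for offset in range(0, len(data), 4):
        # Read one 32-bit big-endian code unit.
        (code_point,) = struct.unpack_from(">I", data, offset)
        # 0xD800..0xDFFF are surrogates and are not valid in UTF-32.
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(
                f"surrogate U+{code_point:04X} at byte offset {offset}"
            )
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range at byte offset {offset}")
        chars.append(chr(code_point))
    return "".join(chars)
```

A real codec would hand the offending byte range to the active error handler at the point where this sketch raises.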

The following are on my TODO list, although this patch doesn't depend on any of them and can be reviewed and landed separately:
  * make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
  * fix an error in the error handler for utf-16-le. (In Python 3.2, b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
  * make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)
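
A small script along these lines can be used to check the behaviour discussed in this list. It is a sketch that assumes an interpreter in which these changes have landed (CPython 3.4 or later, as far as the codecs documentation states); on Python 3.2 the results differ, as described above. The helper name encodes_strictly is made up for this example:

```python
lone = "\udc80"  # a lone low surrogate


def encodes_strictly(text, encoding):
    """Return True if text encodes with the default 'strict' handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Which UTF codecs accept a lone surrogate in strict mode?
strict_results = {
    name: encodes_strictly(lone, name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

# surrogatepass lets the surrogate through in both directions.
passed = lone.encode("utf-16-be", "surrogatepass")
round_trip = passed.decode("utf-16-be", "surrogatepass")

# The 'ignore' case from the list above: lone surrogate followed by "A".
ignored = b"\xdc\x80\x00\x41".decode("utf-16-be", "ignore")

print(strict_results, passed, ignored)
```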

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that checks that "\udc80" is encoded as b"+3IA-". Given that UTF-7 is not really part of the Unicode Standard and is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.
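
To show where b"+3IA-" comes from, here is a short sketch that rebuilds it by hand: UTF-7 wraps the big-endian UTF-16 code units in unpadded base64 between "+" and "-". The variable names are only for this example, and the final line assumes the utf-7 codec still accepts lone surrogates:

```python
import base64

code_unit = 0xDC80  # the lone surrogate from the test

# UTF-7 base64-encodes big-endian UTF-16 code units without '=' padding.
utf16_bytes = code_unit.to_bytes(2, "big")
b64 = base64.b64encode(utf16_bytes).rstrip(b"=")
manual = b"+" + b64 + b"-"

# Assumption: the codec passes the surrogate through unchanged.
codec_result = "\udc80".encode("utf-7")

print(manual, codec_result)
```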

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 stream doesn't get corrupted. This is not "usual" and does not match

  # Implements the ``replace`` error handling: malformed data is replaced
  # with a suitable replacement character such as ``'?'`` in bytestrings 
  # and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Besides UTF-7, UTF-16 and UTF-32, are there other encodings supported by Python that are not ASCII compatible? I think it would be better to use the encoded U+FFFD whenever possible and fall back to '?'. What do you think?
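
The corruption concern can be reproduced without the patch by splicing replacement bytes into a UTF-16-BE stream by hand. This is only a sketch of the alignment problem; try_decode is a hypothetical helper and nothing here calls the codec's own 'replace' handler:

```python
before = "A".encode("utf-16-be")
after = "B".encode("utf-16-be")

# Candidate replacements for one unencodable character.
one_byte = before + b"?" + after  # single ASCII byte: breaks alignment
two_byte = before + b"\x00?" + after  # '?' as a full UTF-16-BE code unit
with_fffd = before + "\ufffd".encode("utf-16-be") + after  # encoded U+FFFD


def try_decode(data):
    """Return the decoded text, or None if the stream is malformed."""
    try:
        return data.decode("utf-16-be")
    except UnicodeDecodeError:
        return None


for stream in (one_byte, two_byte, with_fffd):
    print(stream, try_decode(stream))
```

The single-byte variant leaves an odd number of bytes, so every code unit after the replacement is shifted and the stream no longer decodes cleanly.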

Some other comments on my own patch:
  * In the STORECHAR macro for utf-16 and utf-32, I changed all instances of "ch & 0xFF" to "(unsigned char) ch". I don't have enough C knowledge to know if this is actually better or if it makes any difference at all.
  * The code for utf-16 and utf-32 is duplicated from the utf-8 one, whose complexity comes from issue #8092. I'm not sure if there are ways to simplify it. For example, are there suitable functions so that we don't need to check for integer overflow at these places?

----------
nosy: +kennyluck
Added file: http://bugs.python.org/file24368/utf-16&32_reject_surrogates.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________