[New-bugs-announce] [issue16585] surrogateescape broken w/ multibytecodecs' encode

Philip Jenvey report at bugs.python.org
Fri Nov 30 21:20:22 CET 2012


New submission from Philip Jenvey:

The documentation claims that surrogateescape is "implemented by all standard Python codecs":

http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with the multibyte codecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple
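
The session above can also be wrapped in a small script that records either outcome, which may help when comparing interpreter versions. This is only a sketch: try_encode is a hypothetical helper, not part of any library. On interpreters affected by this issue the second call is expected to yield the TypeError shown above; other versions may behave differently.

```python
def try_encode(text, encoding, errors):
    """Return the encoded bytes, or the exception object if encoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(encoding, errors)
    except (TypeError, UnicodeEncodeError) as exc:
        return exc


# U+30FB is encodable in gb18030; U+DC80 is a lone surrogate, which is how
# surrogateescape represents the undecodable byte 0x80 on decode.
plain = try_encode("\u30fb", "gb18030", "strict")
escaped = try_encode("\u30fb\udc80", "gb18030", "surrogateescape")

print(repr(plain))
print(repr(escaped))
```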

The problem is that multibytecodec.c requires error handlers to always return unicode, whereas surrogateescape returns bytes here.

(surrogatepass similarly returns bytes, but it claims to be UTF-8 only.)
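
The difference in return types can be seen by calling the registered handlers directly. The following is a minimal sketch using codecs.lookup_error; the exception is constructed by hand, and its encoding name and reason string are placeholders chosen for illustration.

```python
import codecs

# A hand-built UnicodeEncodeError describing the lone surrogate U+DC80 at
# index 0. The encoding name and reason are placeholders for illustration.
exc = UnicodeEncodeError("ascii", "\udc80", 0, 1, "illustrative failure")

# surrogateescape hands back the original byte as a bytes object ...
escape_result = codecs.lookup_error("surrogateescape")(exc)

# ... whereas the "replace" handler hands back a str for the codec to encode.
replace_result = codecs.lookup_error("replace")(exc)

print(escape_result)
print(replace_result)
```

A codec that accepts only a str replacement would reject the first tuple, which appears to be what happens in the traceback above.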

The error handler spec seems to imply that error handlers should always return unicode, because it says "The encoder will encode the replacement":

http://docs.python.org/3/library/codecs.html#codecs.register_error

But that is clearly not the case in practice: some codecs special-case bytes results and copy them directly to the output, e.g.:

http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
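
The special-casing can be observed from Python with a custom handler. The sketch below registers two handlers under the made-up names "example-bytes" and "example-str" (both names and both functions are assumptions for illustration) and uses them with the ascii codec, which is one of the codecs that accepts a bytes replacement.

```python
import codecs


def bytes_handler(exc):
    """Hypothetical handler: replace each unencodable character with 0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (b"\xff" * (exc.end - exc.start), exc.end)


def str_handler(exc):
    """Hypothetical handler: replace each unencodable character with '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return ("?" * (exc.end - exc.start), exc.end)


codecs.register_error("example-bytes", bytes_handler)
codecs.register_error("example-str", str_handler)

text = "a\u30fbb"

# The bytes result is copied to the output as-is, even though 0xFF is not
# valid ASCII; the str result is encoded by the codec like any other text.
via_bytes = text.encode("ascii", "example-bytes")
via_str = text.encode("ascii", "example-str")

print(via_bytes)
print(via_str)
```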

----------
components: Interpreter Core
messages: 176711
nosy: pjenvey
priority: normal
severity: normal
status: open
title: surrogateescape broken w/ multibytecodecs' encode
versions: Python 3.2, Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue16585>
_______________________________________

