[issue26260] utf8 decoding inconsistency between P2 and P3

STINNER Victor report at bugs.python.org
Mon Feb 1 11:54:25 EST 2016


STINNER Victor added the comment:

> PAYLOAD.decode('utf8')  passes in P2.7.* and fails in P3.4

Well, the Python 2 decoder didn't respect the Unicode standard. Please see:
http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is now stricter. You can still decode surrogate characters if you need them *for a good reason* using:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'
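
For illustration, a minimal sketch (assuming Python 3; the variable names are only illustrative) of how the same bytes behave with the default strict handler versus 'surrogatepass':

```python
data = b'\xed\xa0\x80'  # UTF-8-style byte sequence for the lone surrogate U+D800

# The default error handler is 'strict', which rejects encoded surrogates.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogate code point through.
text = data.decode('utf-8', 'surrogatepass')

# Encoding back needs the same handler, because a lone surrogate
# cannot be encoded to UTF-8 in strict mode either.
round_trip = text.encode('utf-8', 'surrogatepass')
```

Note that the resulting str contains a lone surrogate, so a plain text.encode('utf-8') would raise UnicodeEncodeError.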

By the way, there is also:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'

which is very different but may also help.
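
A rough sketch of that difference (again assuming Python 3, with illustrative names): 'surrogateescape' does not interpret the three bytes as one code point. Instead each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN, and encoding with the same handler gives the original bytes back:

```python
raw = b'\xed\xa0\x80'

# Each undecodable byte becomes its own lone surrogate (0xED -> U+DCED, ...).
escaped = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes.
restored = escaped.encode('utf-8', 'surrogateescape')

# Valid UTF-8 is decoded normally; only the bad byte is escaped.
mixed = b'abc\xff'.decode('utf-8', 'surrogateescape')
```

This is mostly useful when bytes of unknown encoding have to be carried through str values unchanged, rather than when surrogate code points themselves are wanted.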

I suggest closing the issue as NOT A BUG.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26260>
_______________________________________

