[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Thu Apr 1 15:19:05 CEST 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

John Machin wrote:
> 
> John Machin <sjmachin at users.sourceforge.net> added the comment:
> 
> Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

The UTF-8 codec was written at a time when UTF-8 still allowed
sequences of 5 or 6 bytes:

http://www.rfc-editor.org/rfc/rfc2279.txt

Use of those encodings has always raised an error, though. For error
handling purposes, the codec still has to recognize those sequences.
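
To illustrate, here is a minimal sketch, not the codec's actual
implementation. The helper names rfc2279_sequence_length and
decodes_strictly are hypothetical, and the sketch assumes a Python 3
interpreter, where the call is bytes.decode rather than str.decode. It
maps a lead byte to the sequence length RFC 2279 assigned to it and
shows that strict decoding rejects 5- and 6-byte forms:

```python
def rfc2279_sequence_length(lead_byte):
    """Return the sequence length RFC 2279 assigns to a lead byte.

    Returns None for bytes that cannot start a sequence
    (continuation bytes 0x80-0xBF, and 0xFE/0xFF).
    """
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx
    if 0xC0 <= lead_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead_byte <= 0xF7:
        return 4  # 11110xxx
    if 0xF8 <= lead_byte <= 0xFB:
        return 5  # 111110xx (RFC 2279 only)
    if 0xFC <= lead_byte <= 0xFD:
        return 6  # 1111110x (RFC 2279 only)
    return None


def decodes_strictly(data):
    """Return True if data decodes as UTF-8 with the default 'strict' handler."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample byte strings laid out in the RFC 2279 5- and 6-byte forms.
five_byte = b"\xf8\x88\x80\x80\x80"
six_byte = b"\xfc\x84\x80\x80\x80\x80"

for sample in (five_byte, six_byte):
    print(rfc2279_sequence_length(sample[0]), decodes_strictly(sample))

# With 'replace', the invalid bytes become U+FFFD and decoding continues.
replaced = (five_byte + b"A").decode("utf-8", "replace")
print(ascii(replaced))
```

How many U+FFFD characters the 'replace' handler emits for such a
sequence depends on how the codec delimits the invalid span, which is
the question this issue is about, so the sketch does not assume a
particular count.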

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________