[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Thu Apr 1 16:13:01 CEST 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

John Machin wrote:
> 
> John Machin <sjmachin at users.sourceforge.net> added the comment:
> 
> @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. 

I know.

> The standard now says 21 bits is it. 

It says that the current Unicode codespace only uses 21 bits. In the
early days 16 bits were considered enough, so it wouldn't surprise me
if they extended that range again at some point in the future. After
all, leaving 11 bits unused in UCS-4 is a huge waste of space.
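
For reference, the bit counts above can be checked with a few lines of
Python. This is only a sketch: the constant names are illustrative, and
the upper bound 0x10FFFF is the codespace limit given in RFC 3629.

```python
# Upper end of the Unicode codespace as fixed by RFC 3629 (U+10FFFF).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# UCS-4 stores each code point in 32 bits; the rest stays unused.
unused_in_ucs4 = 32 - bits_needed

print(bits_needed, unused_in_ucs4)
```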

If you have a reference that the Unicode consortium has decided
to stay with that limit forever, please quote it.

> F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal of supporting the new resyncing and FFFD-emitting rules might be better met however by throwing away the code in the default clause and instead merely setting the entries for F5-FF in the utf8_code_length array to zero.

Fair enough. Let's do that.

The reference in the table should then be updated to RFC 3629.
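
A minimal sketch of what a lead-byte length table following RFC 3629
could look like, written in Python rather than C for brevity. The names
build_utf8_code_length and UTF8_CODE_LENGTH are illustrative, and marking
C0 and C1 as invalid is an assumption taken from RFC 3629; the actual
utf8_code_length array in the CPython source may differ in those entries.

```python
def build_utf8_code_length():
    """Map each possible first byte to its expected sequence length.

    A value of 0 means the byte can never start a valid sequence.
    """
    table = [0] * 256
    for b in range(0x00, 0x80):   # ASCII: single-byte sequences
        table[b] = 1
    # 0x80-0xBF are continuation bytes and stay 0.
    # 0xC0-0xC1 could only start overlong forms and stay 0 (assumption).
    for b in range(0xC2, 0xE0):   # two-byte sequences
        table[b] = 2
    for b in range(0xE0, 0xF0):   # three-byte sequences
        table[b] = 3
    for b in range(0xF0, 0xF5):   # four-byte sequences, up to U+10FFFF
        table[b] = 4
    # 0xF5-0xFF stay 0: invalid under RFC 3629.
    return table


UTF8_CODE_LENGTH = build_utf8_code_length()
```

With the F5-FF entries at zero, a decoder that treats a zero length as
"invalid start byte" needs no separate default clause for those bytes.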

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________