[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Thu Apr 1 08:09:01 CEST 2010

John Machin <sjmachin at users.sourceforge.net> added the comment:

@ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C2 and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow.

The simple way of summarising the above is to say that a byte that is not a valid continuation byte in the current state ("failing byte") is not a part of the current (now known to be invalid) sequence, and the decoder must try again ("resync") with the failing byte.

Do you agree with my example 3?

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________