[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Thu Apr 1 05:19:33 CEST 2010

John Machin <sjmachin at users.sourceforge.net> added the comment:

@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.

Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________