[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
Marc-Andre Lemburg
report at bugs.python.org
Wed Mar 31 20:07:45 CEST 2010
Marc-Andre Lemburg <mal at egenix.com> added the comment:
I guess the term "failing" byte is somewhat underdefined.
Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) recommends: "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD".
Fortunately, they explain what they are after: if a subsequent byte in the sequence does not have the high bit set, it's not to be considered part of the UTF-8 sequence of the code point.
Implementing that should be fairly straightforward by adjusting the endinpos variable accordingly.
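To make the intended behaviour concrete, here is a rough pure-Python sketch of the "maximal subpart" rule. It is not the decoder's actual code: the names decode_replace and _trail_ranges are made up for illustration, the byte ranges are my reading of the well-formed UTF-8 table in chapter 3, and the local variable end only plays a role similar to endinpos. It is written for Python 3, where indexing a bytes object yields integers.

```python
REPLACEMENT = "\ufffd"


def _trail_ranges(lead):
    """Allowed (low, high) ranges for the bytes after `lead`, or None.

    Hypothetical helper; ranges follow the well-formed UTF-8 table.
    """
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if 0xE0 <= lead <= 0xEF:
        second = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F)}.get(lead, (0x80, 0xBF))
        return [second, (0x80, 0xBF)]
    if 0xF0 <= lead <= 0xF4:
        second = {0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(lead, (0x80, 0xBF))
        return [second, (0x80, 0xBF), (0x80, 0xBF)]
    return None  # 0x80..0xC1 and 0xF5..0xFF never start a sequence


def decode_replace(data):
    """Decode UTF-8 bytes, one U+FFFD per maximal ill-formed subpart."""
    out = []
    pos = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:  # plain ASCII
            out.append(chr(lead))
            pos += 1
            continue
        ranges = _trail_ranges(lead)
        end = pos + 1  # analogous to endinpos: one past the consumed bytes
        if ranges is None:  # stray continuation byte or invalid lead byte
            out.append(REPLACEMENT)
            pos = end
            continue
        # Consume trail bytes only while they are valid at their position.
        for low, high in ranges:
            if end < len(data) and low <= data[end] <= high:
                end += 1
            else:
                break
        if end - pos == len(ranges) + 1:  # complete, well-formed sequence
            out.append(data[pos:end].decode("utf-8"))
        else:  # maximal subpart of an ill-formed sequence
            out.append(REPLACEMENT)
        pos = end  # the offending byte is re-examined, not swallowed
    return "".join(out)


print(ascii(decode_replace(b"\xf0\x41")))      # '\ufffdA'
print(ascii(decode_replace(b"\xe2\x82\x41")))  # '\ufffdA'
```

The point of the sketch is the last assignment in the loop: an ASCII byte that interrupts a multi-byte sequence is not consumed as part of the error, so the "A" survives in both examples.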
Any takers?
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________