[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg report at bugs.python.org
Thu Apr 1 17:01:39 CEST 2010


Marc-Andre Lemburg <mal at egenix.com> added the comment:

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti at gmail.com> added the comment:
> 
> Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities:
> 1) consider all the bytes in range F5-FD as invalid without looking for the other bytes;
> 2) try to read the next 4/5/6 bytes and fail if they are not continuation bytes.
> We can also look at what others do (e.g. browsers and other languages).

Marking those entries as 0 in the length table would make each of
them consume only one byte. Compared to the current state, however,
that would produce more replacement code points in the output, so
perhaps applying the same logic as for the other sequences is a
better strategy.
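
To make the trade-off concrete, here is a minimal sketch of the two
strategies. It is not the codec's actual implementation: toy_decode,
legacy_length and the consume_sequence flag are hypothetical names made
up for this illustration, and the toy handles only ASCII plus invalid
bytes (every non-ASCII byte is treated as invalid). It also assumes
that "the same logic as for the other sequences" means consuming the
lead byte together with its continuation bytes when they are all
present.

```python
REPLACEMENT = "\ufffd"


def is_continuation(byte):
    """True for UTF-8 continuation bytes (10xxxxxx)."""
    return 0x80 <= byte <= 0xBF


def legacy_length(lead):
    """Sequence length implied by a lead byte in the F5-FD range.

    Hypothetical helper: F5-F7 introduce 4 bytes, F8-FB 5 bytes and
    FC-FD 6 bytes in the original (pre-restriction) UTF-8 scheme.
    """
    if 0xF5 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 1


def toy_decode(data, consume_sequence):
    """Toy 'replace' decoder: ASCII passes through, the rest is invalid.

    consume_sequence=False models a length-table entry of 0: an invalid
    lead byte uses exactly one byte.
    consume_sequence=True models reading the whole 4/5/6-byte sequence
    when all following bytes are continuation bytes.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            out.append(chr(byte))
            i += 1
            continue
        step = 1
        if consume_sequence and 0xF5 <= byte <= 0xFD:
            length = legacy_length(byte)
            tail = data[i + 1:i + length]
            if len(tail) == length - 1 and all(is_continuation(t) for t in tail):
                step = length
        out.append(REPLACEMENT)
        i += step
    return "".join(out)


sample = b"a\xf8\x80\x80\x80\x80b"  # invalid 5-byte sequence between 'a' and 'b'
one_byte = toy_decode(sample, consume_sequence=False)
whole_seq = toy_decode(sample, consume_sequence=True)
print(one_byte.count(REPLACEMENT), whole_seq.count(REPLACEMENT))
```

With the one-byte strategy every trailing continuation byte is itself
reported as invalid, which is where the extra replacement code points
come from.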

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________

