[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

Sat May 7 11:21:44 CEST 2011

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

_codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, gb18030ext, gbk, gb18030.

And there are other Asian multibyte encodings: Big5 family (Big5, CP950, ...), ISO 2022 family, JIS family, Korean encodings (KSX1001, EUC_KR, CP949, ...), ...

All of them ignore all the bytes of a multibyte sequence if one byte of the sequence is invalid (like 0xFF 0x0A: with the replace error handler it is decoded to ? instead of ?\n).
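A minimal sketch for checking this on several codecs at once, assuming Python 3 bytes syntax (the report itself uses the Python 2 form '\xff\n'.decode(...)). The helper name decode_all and the choice of encodings are only illustrative. The output depends on the Python version: a decoder that drops the whole sequence gives a lone U+FFFD, one that drops only the invalid byte gives U+FFFD followed by the newline.

```python
# Bytes from the report: 0xFF is not a valid lead byte, 0x0A is '\n'.
DATA = b"\xff\n"

# Illustrative subset of the CJK codecs mentioned above (an assumption,
# not an exhaustive list).
ENCODINGS = ("gb2312", "gbk", "gb18030", "big5", "euc_kr")


def decode_all(raw, encodings=ENCODINGS, errors="replace"):
    """Decode raw with each codec and return {encoding: decoded text}."""
    return {name: raw.decode(name, errors) for name in encodings}


results = decode_all(DATA)
for name, text in results.items():
    # ascii() makes U+FFFD and the newline visible in the output.
    print(name, ascii(text), "newline kept:", text.endswith("\n"))
```

Running the same loop with errors="ignore" shows whether the newline survives in that mode as well.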

I don't think that you can/should patch only one encoding: we should use the same rule for all encodings.

By the way, do you have any document explaining which result is the correct one (? or ?\n)? For UTF-8, we have well-defined standards explaining exactly what to do with invalid byte sequences => see issue #8271. It is easy to fix the decoders, but I would like to be sure that your proposed change is the right way to decode these encodings.

Changing the multibyte decoders can also have security implications. Read for example the section "Check byte strings before decoding them to character strings" of my book:
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/issues.html#check-byte-strings-before-decoding-them-to-character-strings
(https://github.com/haypo/unicode_book/wiki)
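
One possible reading of that advice, as a hedged sketch and not code taken from the book: decode with the strict error handler and reject the input on failure, so the result never depends on how an ignore or replace handler resynchronizes. The function name decode_strict is hypothetical, and Python 3 bytes syntax is assumed.

```python
def decode_strict(raw, encoding="gb2312"):
    """Return the decoded text, or None if raw is not valid in encoding.

    With the 'strict' handler an invalid sequence raises
    UnicodeDecodeError instead of being dropped or replaced.
    """
    try:
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return None


print(decode_strict(b"abc\n") is not None)   # plain ASCII is accepted
print(decode_strict(b"\xff\n") is None)      # the bytes from the report are rejected
```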

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12016>
_______________________________________