[issue23614] Opaque error message on UTF-8 decoding to surrogates

Fri Mar 13 18:57:35 CET 2015

Ezio Melotti added the comment:

Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 of the book, or 40 of the pdf) shows that if the start byte is ED, the first continuation byte must be in the range 80..9F.  To decode a sequence starting with ED you need two more valid continuation bytes.  Since the following byte (B4) is outside the allowed range 80..9F, it is an invalid continuation byte, and the decoder doesn't know how to decode the byte in position 0 (i.e. ED).
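
As a rough illustration of the rule in that table, here is a minimal sketch of the second-byte check.  The names SECOND_BYTE_RANGES and second_byte_ok are made up for this example and are not part of CPython; the real decoder is written in C and is structured differently.

```python
# Allowed range for the byte that follows each lead byte, per Unicode Table 3-7.
# Each entry: (lead_low, lead_high, second_low, second_high).
SECOND_BYTE_RANGES = [
    (0xC2, 0xDF, 0x80, 0xBF),
    (0xE0, 0xE0, 0xA0, 0xBF),
    (0xE1, 0xEC, 0x80, 0xBF),
    (0xED, 0xED, 0x80, 0x9F),  # narrower range keeps U+D800..U+DFFF out
    (0xEE, 0xEF, 0x80, 0xBF),
    (0xF0, 0xF0, 0x90, 0xBF),
    (0xF1, 0xF3, 0x80, 0xBF),
    (0xF4, 0xF4, 0x80, 0x8F),
]


def second_byte_ok(lead, second):
    """Return True if `second` may follow `lead` in well-formed UTF-8."""
    for lead_lo, lead_hi, lo, hi in SECOND_BYTE_RANGES:
        if lead_lo <= lead <= lead_hi:
            return lo <= second <= hi
    return False  # ASCII or invalid lead byte: no continuation byte expected


print(second_byte_ok(0xED, 0xB4))  # False
print(second_byte_ok(0xED, 0x9F))  # True

# The built-in decoder reports the failure at the lead byte.
try:
    b"\xed\xb4\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)
```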

It is also true that this particular sequence, if allowed, would result in a surrogate.  However, by looking at the first two bytes only, you don't have enough information to be sure about that (e.g. ED B4 00 doesn't decode to a surrogate, so Pike's error message is imprecise).
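
To make the difference concrete, here is a small sketch using the existing "surrogatepass" error handler, which accepts encoded surrogates.  The helper name decodes_to_surrogate is made up for this example.

```python
def decodes_to_surrogate(data):
    """Return True if `data` is exactly one UTF-8-encoded surrogate."""
    try:
        text = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        # surrogatepass only rescues byte sequences that encode a surrogate
        return False
    return len(text) == 1 and 0xD800 <= ord(text) <= 0xDFFF


# ED B4 80 would be U+DD00, a low surrogate.
print(hex(ord(b"\xed\xb4\x80".decode("utf-8", "surrogatepass"))))  # 0xdd00
print(decodes_to_surrogate(b"\xed\xb4\x80"))  # True

# ED B4 00 shares the first two bytes but the third byte is not a
# continuation byte, so it is simply malformed.
print(decodes_to_surrogate(b"\xed\xb4\x00"))  # False
```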

If handling this special case doesn't require too much extra code, it would be ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte (possible start of a surrogate)
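
For illustration only, the hint could be prototyped in pure Python along these lines.  decode_with_hint is a hypothetical wrapper, not an existing API, and an actual change would have to go into the C decoder instead; this only sketches the check (ED followed by a byte in A0..BF).

```python
def decode_with_hint(data):
    """Decode UTF-8, adding a hint when the failure looks like a surrogate.

    Hypothetical helper: it wraps bytes.decode() and rewrites the error
    reason; it does not change what the decoder accepts.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        lead = data[exc.start]
        nxt = data[exc.start + 1] if exc.start + 1 < len(data) else None
        # ED followed by A0..BF is how an encoded surrogate would begin.
        if lead == 0xED and nxt is not None and 0xA0 <= nxt <= 0xBF:
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end,
                exc.reason + " (possible start of a surrogate)",
            ) from None
        raise


try:
    decode_with_hint(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc)
```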

----------
type:  -> enhancement

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________