[issue23614] Opaque error message on UTF-8 decoding to surrogates

Fri Mar 13 22:53:27 CET 2015

Chris Angelico added the comment:

Nice document. Is that actually how Python's decoder checks things? Does the decoder have different definitions of "valid continuation byte" based on the lead byte? If that's the case... well, ten out of ten for complying with the spec, to be sure, but unfortunately it leads to some opaque error messages!

I haven't looked into the code even a little bit, but would it be possible to have a specific error message attached to certain "invalid continuation bytes"?

* E0 followed by 80..9F: "non-shortest form"
* ED followed by A0..BF: "surrogate"
* F4 followed by 90..BF: "outside defined range"
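
To make the idea concrete, here is a rough sketch of the kind of lookup I mean. It is purely illustrative and is not how CPython's decoder is written: the names SPECIAL_CASES and explain_second_byte are made up for this example, and it covers only the three cases listed above.

```python
# Hypothetical helper, NOT CPython's decoder: maps the special-case
# (lead byte, second byte) pairs listed above to a more specific reason.
SPECIAL_CASES = [
    (0xE0, range(0x80, 0xA0), "non-shortest form"),      # E0 then 80..9F
    (0xED, range(0xA0, 0xC0), "surrogate"),              # ED then A0..BF
    (0xF4, range(0x90, 0xC0), "outside defined range"),  # F4 then 90..BF
]


def explain_second_byte(lead, second):
    """Return a specific reason for rejecting `second` after `lead`,
    or None if the pair is not one of the special cases above."""
    for special_lead, rejected, reason in SPECIAL_CASES:
        if lead == special_lead and second in rejected:
            return reason
    return None


print(explain_second_byte(0xED, 0xB4))
```

For the ED B4 pair discussed here, the lookup reports "surrogate"; any pair outside the three cases falls through to None, which a real decoder would keep reporting as a plain invalid continuation byte.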

If that's too hard, it'd at least be helpful to point out that the "invalid continuation byte" is not the same as the "byte 0x?? in position ?" - the rejection here is actually of the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte 0xb4 for this start byte
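
A message like that can be approximated from outside the decoder, as a sketch only. The function name describe_decode_error is hypothetical, and the code assumes (from observed CPython 3 behaviour, not from a documented guarantee) that for an "invalid continuation byte" error, exc.end is the index of the byte that was actually rejected.

```python
def describe_decode_error(data):
    """Return an extended error message for invalid UTF-8 in `data`,
    or None if `data` decodes cleanly.

    Assumption: when the reason is "invalid continuation byte",
    exc.end indexes the rejected byte (observed on CPython 3).
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        message = str(exc)
        if exc.reason == "invalid continuation byte" and exc.end < len(data):
            # Name the byte that was rejected, not just the start byte.
            message += " 0x%02x for this start byte" % data[exc.end]
        return message
    return None


print(describe_decode_error(b"\xed\xb4\x80"))
```

On CPython 3 this should print a message ending in "invalid continuation byte 0xb4 for this start byte", which at least makes it clear that the B4 is the byte being rejected.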

(BTW, I think Pike's decoder just always emits two bytes, no matter what the actual errant stream is; after all, there's no way to know how many bytes "ought to have been" one character when there's an error in it. So it's incomplete, yes, but when you're dealing with wrong data, completeness isn't all that possible anyway.)

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________