[Python-Dev] Re: Regression in unicodestr.encode()?
Tim Peters
tim.one@comcast.net
Tue, 09 Apr 2002 21:13:37 -0400
[Guido]
> I knew all that, but I thought I'd read about a hack to encode NUL
> using c0 80, specifically to get around the limitation on encoded
> strings containing a NUL.
Ah, that violates the "shortest encoding" rule, so is invalid UTF-8. I'm
sure people have done it, though, and that many UTF-8 decoders accept it.
Python's doesn't:
>>> unicode('\xc0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>>
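To make the "shortest encoding" rule concrete, here is a minimal sketch,
assuming a modern Python 3 interpreter rather than the one used in the
session above. The helper names shortest_utf8_len and decode_two_byte are
hypothetical, invented for illustration; they are not part of any library.
The sketch decodes c0 80 the lax way, shows that the result (NUL) fits in
one byte and so the two-byte form is non-shortest, and then checks that the
built-in codec refuses it.

```python
def shortest_utf8_len(cp):
    """Return the byte count of the shortest UTF-8 encoding of code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def decode_two_byte(b):
    """Lax decode of a 2-byte sequence: payload bits only, no shortest-form check."""
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)


# c0 80 carries the payload bits 00000 000000, i.e. code point 0 (NUL).
cp = decode_two_byte(b"\xc0\x80")

# NUL fits in a single byte, so spending two bytes on it is non-shortest.
overlong = shortest_utf8_len(cp) < 2

# The built-in codec enforces the rule and rejects the sequence.
try:
    b"\xc0\x80".decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

The only valid UTF-8 encoding of NUL is the single byte 00, which is exactly
the byte the c0 80 trick is trying to avoid.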
Believe it or not, accepting non-shortest encodings is considered to be "a
security hole"(!). That's a sad story of its own <wink> ...
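One commonly cited form of the problem (not necessarily the story alluded to
above) is a check done on raw bytes followed by a lax decode. The sketch
below assumes Python 3; lenient_decode_2byte is a hypothetical decoder
written only for this illustration, and the filter is deliberately naive.
The bytes c0 af are a non-shortest encoding of "/" (2f), so a byte-level
test for "/" sees nothing, while the lax decoder happily produces one.

```python
def lenient_decode_2byte(data):
    """Hypothetical lax decoder: ASCII and 2-byte sequences, no shortest-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # 2-byte sequence; a strict decoder would reject lead bytes c0 and c1.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported byte sequence")
    return "".join(out)


raw = b"..\xc0\xafetc"

# Naive byte-level filter: no literal "/" byte is present, so the input passes.
passes_byte_filter = b"/" not in raw

# The lax decoder turns c0 af into "/", after the check has already run.
decoded = lenient_decode_2byte(raw)

# A strict decoder refuses the input outright.
try:
    raw.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

With a strict decoder the same character has exactly one byte sequence, so a
check on the bytes and a check on the decoded text cannot disagree in this
way.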