[Python-Dev] Re: Regression in unicodestr.encode()?

Tue, 09 Apr 2002 21:13:37 -0400

[Guido]
> I knew all that, but I thought I'd read about a hack to encode NUL
> using c0 80, specifically to get around the limitation on encoded
> strings containing a NUL.

Ah, that violates the "shortest encoding" rule, so is invalid UTF-8.  I'm
sure people have done it, though, and that many UTF-8 decoders accept it.
Python's doesn't:

>>> unicode('\xc0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>>
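
For anyone trying this on Python 3, here is a rough sketch of the same
check.  The helper names (naive_decode_2byte, strict_decode_fails) are made
up for illustration and are not part of any library; the first one only
shows what a decoder that skips the shortest-form check would compute for a
2-byte sequence.

```python
# Python 3 sketch. Both helpers are hypothetical, written only for this example.

def naive_decode_2byte(data: bytes) -> int:
    """Decode one 2-byte UTF-8 sequence WITHOUT the shortest-form check."""
    lead, trail = data
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder rejects `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_nul = b"\xc0\x80"
print(naive_decode_2byte(overlong_nul))   # code point a lenient decoder would produce
print(strict_decode_fails(overlong_nul))  # whether the built-in decoder rejects it
print("\x00".encode("utf-8"))             # the shortest-form encoding of NUL
```

The two bytes carry eleven payload bits that are all zero, which is why a
lenient decoder maps them to U+0000 even though a single 00 byte already
encodes that code point.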

Believe it or not, accepting non-shortest encodings is considered to be "a
security hole"(!).  That's a sad story of its own <wink> ...
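
A minimal sketch of why it gets called that, assuming a program that
filters the raw bytes first and decodes leniently afterwards.  Both
functions below are hypothetical stand-ins (lenient_decode handles only 1-
and 2-byte sequences), not anyone's real code.

```python
# Python 3 sketch. `looks_safe` and `lenient_decode` are hypothetical.

def looks_safe(raw: bytes) -> bool:
    """Naive byte-level filter: reject anything containing a slash."""
    return b"/" not in raw


def lenient_decode(data: bytes) -> str:
    """Decode 1- and 2-byte UTF-8 sequences with no shortest-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif b & 0xE0 == 0xC0 and i + 1 < len(data) and data[i + 1] & 0xC0 == 0x80:
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


# C0 AF is a non-shortest encoding of "/" (U+002F).
payload = b"..\xc0\xaf..\xc0\xafetc"
print(looks_safe(payload))      # the filter sees no 2F byte
print(lenient_decode(payload))  # but the decoded text contains slashes
```

A decoder that rejects non-shortest forms never lets the second step
produce a character the first step could not see.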