[Python-Dev] Re: Regression in unicodestr.encode()?
Guido van Rossum
guido@python.org
Tue, 09 Apr 2002 20:50:23 -0400
> [Guido van Rossum]
> > Hm, but isn't there a way to encode a NUL that doesn't produce a NUL?
> > In some variant?
[François]
> There is also a rule about the shortest coding. It is invalid UTF-8
> to use more bytes than required, and a given UCS character has a
> unique UTF-8 representation. Moreover, decoders should raise an
> exception on non-minimal UTF-8 codings, and I do not know how Python
> behaves with this. The Gambit author once told me he found a way to
> implement the test very efficiently.
>
> One could use multi-byte sequences, that is, a sequence having no NULs,
> that would fool a lazy UTF-8 decoder into producing a NUL. But for this,
> one has to break the shortest coding rule, and start from invalid UTF-8.
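
To make the quoted point concrete, here is a minimal sketch, not taken from any real codec. The function names are hypothetical and only the two-byte case is handled. The "lazy" decoder merges the payload bits without checking for the shortest form, so the NUL-free bytes c0 80 come out as code point 0. The strict variant adds the one comparison that rejects such overlong forms.

```python
def lazy_decode_two_byte(data: bytes) -> int:
    """Hypothetical lazy decoder for a single two-byte UTF-8 sequence.

    It checks the bit patterns 110xxxxx 10xxxxxx but does NOT check
    that the result actually needed two bytes (the shortest-form rule).
    """
    lead, cont = data  # raises ValueError unless exactly two bytes
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 pattern")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_decode_two_byte(data: bytes) -> int:
    """Same as above, plus the shortest-form check."""
    code_point = lazy_decode_two_byte(data)
    # Anything below 0x80 fits in one byte, so a two-byte form is overlong.
    if code_point < 0x80:
        raise ValueError("overlong (non-minimal) UTF-8 sequence")
    return code_point
```

With these definitions, lazy_decode_two_byte(b"\xc0\x80") returns 0 even though the input contains no NUL byte, while strict_decode_two_byte raises ValueError on the same input. A legitimate sequence such as b"\xc3\xa9" decodes to 0xE9 in both.
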
I knew all that, but I thought I'd read about a hack to encode NUL
using c0 80, specifically to get around the limitation on encoded
strings containing a NUL. But I can't find the reference so I'll shut
up.
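
For what it is worth, the c0 80 form for NUL is the convention used by Java's "modified UTF-8". The following is a rough sketch of the idea only: the helper names are made up for illustration and are not part of Python's codec machinery. It assumes Python 3, where the standard utf-8 codec is strict and refuses c0 80 on input.

```python
def encode_nul_free(text: str) -> bytes:
    """Hypothetical helper: UTF-8, except U+0000 is written as C0 80.

    In valid UTF-8 the byte 0x00 can only come from U+0000, so a plain
    byte-level replace is enough.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Hypothetical inverse: map C0 80 back to a NUL, then decode strictly.

    The byte 0xC0 never appears in valid UTF-8, so the replace cannot
    touch any legitimate sequence.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def strict_codec_rejects_c0_80() -> bool:
    """Check that the standard strict decoder refuses the overlong form."""
    try:
        b"\xc0\x80".decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

The encoded result contains no zero byte, so it survives APIs that treat NUL as a terminator, at the price of no longer being valid UTF-8 for a strict decoder.
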
--Guido van Rossum (home page: http://www.python.org/~guido/)