[Python-Dev] Re: Regression in unicodestr.encode()?

Guido van Rossum guido@python.org
Tue, 09 Apr 2002 20:50:23 -0400


> [Guido van Rossum]
> > Hm, but isn't there a way to encode a NUL that doesn't produce a NUL?
> > In some variant?

[François]
> There is also a rule about the shortest coding.  It is invalid UTF-8
> to use more bytes than required, and a given UCS character has a
> unique UTF-8 representation.  Moreover, decoders should raise an
> exception on non-minimal UTF-8 codings, and I do not know how Python
> behaves with this.  The Gambit author once told me he found a way to
> implement the test very efficiently.
> 
> One could use multi-byte sequences, that is, a sequence having no NULs,
> that would fool a lazy UTF-8 decoder into producing a NUL.  But for this,
> one has to break the shortest coding rule, and start from invalid UTF-8.
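
To make the shortest-coding point concrete, here is a minimal sketch in Python. The helper names `lazy_decode_2byte` and `strict_rejects` are hypothetical and exist only for this illustration. The "lazy" decoder handles one- and two-byte sequences and skips the minimality check, so the NUL-free byte pair c0 80 decodes to a NUL. Python 3's built-in UTF-8 codec rejects the same bytes.

```python
def lazy_decode_2byte(data: bytes) -> str:
    """Hypothetical lazy decoder: 1- and 2-byte sequences only,
    with no check that the shortest encoding was used."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) sequence.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Two-byte sequence: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
            # A strict decoder would reject results below 0x80 here.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

With these definitions, `lazy_decode_2byte(b"\xc0\x80")` yields a one-character string containing NUL, while `strict_rejects(b"\xc0\x80")` is true.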

I knew all that, but I thought I'd read about a hack to encode NUL
using c0 80, specifically to get around the limitation on encoded
strings containing a NUL.  But I can't find the reference so I'll shut
up.
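
For illustration, the c0 80 hack could look roughly like the sketch below. The function names `encode_nul_free` and `decode_nul_free` are made up for this example, and the output is deliberately not valid UTF-8, so a strict decoder will refuse it. The byte replacement is safe because in valid UTF-8 a 00 byte only ever encodes U+0000, and a c0 byte never appears at all.

```python
def encode_nul_free(text: str) -> bytes:
    """Encode as UTF-8, then spell each NUL as the overlong pair c0 80.
    The result contains no 00 byte, but is not valid UTF-8."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Reverse encode_nul_free: restore c0 80 to 00, then decode strictly."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

For example, `encode_nul_free("a\x00b")` gives `b"a\xc0\x80b"`, and `decode_nul_free` turns that back into the original string. Java's "modified UTF-8" is one format known to use this convention for NUL.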

--Guido van Rossum (home page: http://www.python.org/~guido/)