[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Tue, 02 May 2000 08:30:02 -0400


[MAL]
> > > Unicode itself can be understood as a multi-word character
> > > encoding, just like UTF-8.  The reason is that Unicode entities
> > > can be combined to produce single display characters (e.g.
> > > u"e"+u"\u0301" will print "é" in a Unicode-aware renderer).
> > > Slicing such a combined Unicode string will have the same
> > > effect as slicing UTF-8 data.
[/F]
> > really?  does it result in a decoder error?  or does it just result
> > in a rendering error, just as if you slice off any trailing character
> > without looking...
[MAL]
> In the example, if you cut off the u"\u0301", the "e" would
> appear without the acute accent; cutting off the u"e" would
> probably result in a rendering error or, worse, put the accent
> over the next character to the left.
> 
> UTF-8 is better in this respect: it warns you about
> the error by raising an exception when the data is converted
> to Unicode.

I think /F's point was that the Unicode standard prescribes different
behavior here: for UTF-8, a missing or lone continuation byte is an
error; for Unicode, accents are separate characters that may be
inserted and deleted in a string but whose display is undefined under
certain conditions (e.g. a combining accent with no base character
before it).
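
To make the contrast concrete, here is a minimal sketch of both
behaviors (written in today's Python; the u"..." prefix is still
accepted in Python 3).  Slicing between a base character and its
combining accent is legal for a Unicode string, while slicing the
UTF-8 bytes in the middle of a multi-byte sequence makes the decoder
raise:

    # Base character plus combining accent: two code points, one
    # display character ("e" followed by COMBINING ACUTE ACCENT).
    s = u"e" + u"\u0301"
    print(len(s))    # 2
    print(s[:1])     # "e" -- the accent is silently sliced off

    # The same text as UTF-8: the accent is a two-byte sequence.
    data = s.encode("utf-8")          # b'e\xcc\x81'
    try:
        data[:2].decode("utf-8")      # cuts off the continuation byte
    except UnicodeDecodeError as exc:
        print(exc)                    # truncated UTF-8 raises on decode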

(I just noticed that this combining-character rendering doesn't work
in Tkinter, but it does work in wish.  Strange.)

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and len(u)
!= len(v).  /F's world will collapse. :-)
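
For the record, later Python versions grew unicodedata.normalize(),
which implements exactly the normalization MAL mentions; a minimal
sketch (assuming a Python where that function is available):

    import unicodedata

    u = u"\u00e9"    # "é" precomposed: LATIN SMALL LETTER E WITH ACUTE
    v = u"e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

    print(u == v)            # False -- raw comparison sees the difference
    print(len(u), len(v))    # 1 2  -- same display character, unequal lengths
    print(unicodedata.normalize("NFC", u) ==
          unicodedata.normalize("NFC", v))    # True once both are normalized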

--Guido van Rossum (home page: http://www.python.org/~guido/)