[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 17:18:21 +0200


Just van Rossum wrote:
> 
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish.  Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"È" should compare equal to u"e\u0301".

                            ^
                            |

Here's a good example of what encoding errors can do: the
above character was an "e" with acute accent (u"é"). It looks like
one mailer converted it to some other code page and another
converted it back to Latin-1, even though the Content-Type
message header clearly states that the document uses
ISO-8859-1.
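
For illustration only, here is one way such a substitution can arise,
sketched in present-day Python 3. The choice of Mac Roman as the
intermediate code page is an assumption; the actual mailers involved
are unknown.

```python
# Sketch: a Latin-1 byte misread under a different code page.
# ASSUMPTION: the intermediate code page is Mac Roman ("mac_roman");
# the original message does not say which one was involved.

original = "\u00e9"                       # e with acute accent
wire_bytes = original.encode("latin-1")   # the single byte 0xE9

# A mailer that ignores the Content-Type header and reads the byte
# as Mac Roman sees a different character.
misread = wire_bytes.decode("mac_roman")

# Re-encoding the misread character as Latin-1 makes the damage permanent.
resent_bytes = misread.encode("latin-1")

print(ascii(original), wire_bytes, ascii(misread), resent_bytes)
```

The point is only that a single byte is reinterpreted, so nothing in
the message itself signals that anything went wrong.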

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v).  /F's world will collapse. :-)
> 
> Does the Unicode spec *really* specify that u should compare equal to v?

The behaviour is needed in order to implement sorting of Unicode
strings. See the www.unicode.org site for more information and the
tech reports describing this.

Note that I haven't mentioned anything about "automatic"
normalization. This should be a method on Unicode strings
and could then be used in sorting compare callbacks.
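
A minimal sketch of that idea in present-day Python 3, where
normalization is exposed as the function unicodedata.normalize()
rather than as a string method. The helper name normalized_compare
and the choice of the NFC form are assumptions made for this example,
and comparing by code point after normalizing is not full
language-aware collation.

```python
import unicodedata
from functools import cmp_to_key


def normalized_compare(a, b):
    """Compare callback (hypothetical helper): normalize both strings
    to NFC, then compare them by code point. Returns -1, 0 or 1."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


u = "\u00e9"     # precomposed e with acute accent
v = "e\u0301"    # "e" followed by a combining acute accent

# Without normalization the two spellings differ, and so do their lengths.
print(u == v, len(u), len(v))

# With explicit normalization inside the callback they compare equal.
print(normalized_compare(u, v))

# The callback plugs into sorting; nothing is normalized automatically.
words = ["zebra", "e\u0301cole", "\u00e9cole", "apple"]
ordered = sorted(words, key=cmp_to_key(normalized_compare))
print([ascii(w) for w in ordered])
```

Because normalization happens only inside the callback, plain ==
keeps its exact code point semantics and the stored strings are left
untouched.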

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/