[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Wed, 03 May 2000 01:05:28 +0200


Paul Prescod wrote:
> 
> Combining characters are a whole 'nother level of complexity. Character
> sets are hard. I don't accept the argument that "Unicode itself has
> complexities so that gives us license to introduce even more
> complexities at the character representation level."
> 
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> 
> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> think we need to get the bytes/characters level right and then we can
> worry about display-equivalent characters (or leave that to the Python
> programmer to figure out...).

I just wanted to point out that the argument "slicing doesn't
work with UTF-8" is moot.
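
To make that concrete, here is a minimal sketch in present-day Python 3
(an assumption; the thread itself predates it), using the standard
unicodedata module. The variable names are only illustrative. It shows
that slicing by code point can split a combining sequence just as byte
slicing can split a UTF-8 sequence, and that normalization is what
makes the comparison robust:

```python
import unicodedata

precomposed = "\u00e9"    # "é" as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the two spellings compare unequal.
raw_equal = precomposed == decomposed

# After NFC normalization the decomposed form matches the precomposed one.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# Slicing by code point cuts the accent off the base letter.
first_slice = decomposed[:1]

print(raw_equal, nfc_equal, repr(first_slice))
```

So even with a pure code-point representation, slicing is only safe if
the programmer knows where the combining sequences begin and end.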

I do see a point against UTF-8 auto-conversion given the example
that Guido mailed me:

"""
s = 'ab\341\210\264def'        # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at s[5:8], not
at s[3:6].
"""
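
The mismatch can be reproduced without any auto-conversion. The sketch
below uses Python 3 (an assumption, with explicit str and bytes types
rather than the 8-bit strings under discussion) and illustrative
variable names. It only demonstrates that the same substring sits at
different offsets depending on whether one counts characters or UTF-8
bytes:

```python
text = "ab\u1234def"            # 6 characters
utf8 = text.encode("utf-8")     # U+1234 takes three bytes in UTF-8

# These are the bytes written as 'ab\341\210\264def' in octal above.
same_bytes = utf8 == b"ab\xe1\x88\xb4def"

char_pos = text.find("def")     # offset counted in characters
byte_pos = utf8.find(b"def")    # offset counted in bytes

print(same_bytes, char_pos, byte_pos)
```

An index computed on the Unicode side (3) is therefore not usable as an
index into the UTF-8 encoded string, where "def" starts at offset 5.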

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/