[Python-3000] String comparison

Guido van Rossum guido at python.org
Thu Jun 7 19:10:12 CEST 2007


On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> What bothers me about the "sequence of code points" way of thinking is
> that len("Löwis") is nondeterministic.

It doesn't have to be, *for this specific example*. After what I've
read so far, I'm okay with normalization happening on the text of the
source code before it reaches the lexer, if that's what people prefer.
I'm also okay with normalization happening by default in the text I/O
layer, as long as there's a way to disable it that doesn't require me
to switch to bytes.
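
For concreteness, the two normalization forms of that name really do have different lengths as code-point sequences. A minimal sketch using the stdlib unicodedata module:

```python
import unicodedata

# "Löwis" with a precomposed o-umlaut (NFC) vs. a combining mark (NFD)
nfc = "L\u00f6wis"    # 5 code points: U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
nfd = "Lo\u0308wis"   # 6 code points: 'o' followed by U+0308 COMBINING DIAERESIS

# The two forms round-trip through normalization...
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc

# ...but as raw code-point sequences their lengths differ.
print(len(nfc), len(nfd))   # 5 6
```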

However, I'm *not* okay with requiring all text strings to be
normalized, or normalizing them before comparing/hashing, after
slicing/concatenation, etc. If you want to have an abstraction that
guarantees you'll never see an unnormalized text string you should
design a library for doing so. I encourage you or others to contribute
such a library (*). But the 3.0 core language's 'str' type (like
Python 2.x's 'unicode' type) will be an array of code points that is
neutral about normalization.
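
Such a library could, for instance, normalize before every comparison. A sketch of that policy (nfc_equal is a hypothetical helper, not anything in the stdlib):

```python
import unicodedata

def nfc_equal(a, b):
    # Hypothetical helper: compare after NFC normalization -- the kind of
    # guarantee a wrapper library could enforce on top of the neutral str type.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "L\u00f6wis"
decomposed = "Lo\u0308wis"

# Plain str comparison is code-point-wise, so these are unequal...
assert precomposed != decomposed
# ...while the normalizing comparison treats them as the same text.
assert nfc_equal(precomposed, decomposed)
```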

Python is a general programming language, not a text manipulating
library. As a general programming language, it must be possible to
represent unnormalized sequences of code points -- otherwise, it could
not implement algorithms for normalization in Python! (Again, forcing
me to do this using UTF-8-encoded bytes or lists of ints is
unacceptable.)
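
To illustrate, here is a toy sketch (not the real Unicode algorithm -- it ignores Hangul and canonical reordering) of a single canonical-decomposition step built on unicodedata. Its inputs and outputs are plain str values, and the output is deliberately unnormalized under NFC:

```python
import unicodedata

def decompose_once(s):
    # Apply one round of canonical decomposition to each code point,
    # using the character database's decomposition mappings.
    out = []
    for ch in s:
        d = unicodedata.decomposition(ch)
        if d and not d.startswith("<"):   # skip compatibility mappings like <super>
            out.extend(chr(int(cp, 16)) for cp in d.split())
        else:
            out.append(ch)
    return "".join(out)

print(decompose_once("\u00f6"))   # 'o' followed by U+0308
```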

There are also Jython and IronPython to consider. These are extensively
integrated with the Java and .NET runtimes, respectively, where strings
are represented as sequences of code points. Having a correspondence
between the "natural" string types across language boundaries is very
important.

Yes, this makes text processing harder if you want to get every corner
case right. We need to educate our users about Unicode and point them
to relevant portions of the standard. I don't think that can be
avoided anyway -- the complexity is inherent to the domain of
multi-alphabet text processing, and cannot be argued away by insisting
that the language handle it.

(*) It looks like such a library will not have a way to talk about
"\u0308" at all, since it is considered unnormalized. Things like
bidirectionality will probably have to be handled in a different way
(without referencing the code points indicating text direction) as
well.
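
For example, a lone combining diaeresis is a perfectly valid one-element str in today's model, even though a normalization-enforcing library would presumably have to reject it as a defective combining sequence:

```python
import unicodedata

combining = "\u0308"   # a lone COMBINING DIAERESIS, with nothing to combine with

# Valid as a raw code-point sequence...
assert len(combining) == 1
# ...and the character database confirms it is a combining mark
# (nonzero canonical combining class).
assert unicodedata.combining(combining) != 0
```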

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
