[Python-3000] String comparison

Stephen J. Turnbull turnbull at sk.tsukuba.ac.jp
Wed Jun 6 14:33:19 CEST 2007


Rauli Ruohonen writes:

 > Strings are internal to Python. This is a whole separate issue from
 > normalization of source code or its parts (such as identifiers).

Agreed.  But please note that we're not talking about representation.
We're talking about the result of evaluating a comparison:

    if u"L\u00F6wis" == u"Lo\u0308wis":
        print "Python is Unicode conforming in this respect."
    else:
        print "I guess it's time to start learning Ruby."

I think it's reasonable to be astonished if Python doesn't at least
try to print "Python is Unicode conforming in this respect." for the
above snippet by default.
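
A minimal sketch of the comparison in question, assuming Python 3 string
literals and the standard unicodedata module. The helper name
canonically_equal is hypothetical, used only for illustration; it is not
something Python provides, and normalizing both operands to NFC is just
one possible way to get a comparison that respects canonical equivalence:

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper: one possible way to make the comparison
    respect canonical equivalence, not Python's definition of ==.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "L\u00F6wis"   # o-with-diaeresis as a single code point
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS

# Plain == compares code point sequences, so these two differ.
raw_equal = precomposed == decomposed

# After normalization the two spellings compare equal.
normalized_equal = canonically_equal(precomposed, decomposed)
```

Whether == itself should behave like canonically_equal by default is
exactly the open question; the sketch only shows the two results side by
side.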

 > It is up to Python to define what "==" means, just like it defines
 > what "is" means.

You are of course correct.  However, if, given that u prefix, Python
chooses to define == in a way that does not respect canonical
equivalence, what's the point of having Unicode strings?

 > Always doing normalization would still force you to use bytes for
 > processing code point sequences (e.g. XML, which must not be
 > normalized), which is not nice.

I'm not talking about "nice" yet, just about Unicode conformance.  How
to implement conformant behavior is of course entirely up to Python.
As is choosing *whether* to conform or not, but it seems bizarre to me
that one might choose to implement UAX#31 verbatim, and also have
u"L\u00F6wis" == u"Lo\u0308wis" evaluate to False.

 > FWIW, I don't buy that normalization is expensive, as most strings are
 > in NFC form anyway, and there are fast checks for that (see UAX#15,
 > "Detecting Normalization Forms"). Python does not currently have
 > a fast path for this, but if it's added, then normalizing everything
 > to NFC should be fast.

If O(n) is "fast".
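
For concreteness, the check-then-normalize fast path might be sketched as
follows. This assumes a Python whose unicodedata module has
is_normalized() (3.8 or later), which did not exist when this thread was
written; to_nfc is a hypothetical name. Note that the check itself still
has to scan the string, so the fast path is linear in the length of the
input even when nothing needs to change:

```python
import unicodedata


def to_nfc(s):
    """Return s in NFC, skipping the rewrite when s is already NFC.

    Hypothetical helper; assumes unicodedata.is_normalized (Python 3.8+).
    """
    # The check walks the string, so this branch is still O(n).
    if unicodedata.is_normalized("NFC", s):
        return s  # already normalized: hand back the same object
    # Slow path: build a new, normalized string.
    return unicodedata.normalize("NFC", s)


already_nfc = "L\u00F6wis"
needs_work = "Lo\u0308wis"

same_object = to_nfc(already_nfc) is already_nfc
converted = to_nfc(needs_work)
```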


