[Python-3000] String comparison

Thu Jun 7 18:47:17 CEST 2007

On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I apologize for mistyping the example.  *I* *was* talking about a
> string literal containing Unicode characters.

Then I misunderstood you too. To avoid such problems, I will use XML
character references to denote code points here. Wherever you see such
a thing in this e-mail, replace it in your mind with the corresponding
code point *immediately*. E.g. len(r'&#00c5;') == 1, but
len(r'\u00c5') == 6. This is not a proposal for Python syntax, it is a
device to make what I say clear.

> However, on my terminal, you can't see the difference!  So I (ab)used
> the \u escapes to make clear that in one case the representation used
> 5 characters and in the other 6.

Your code was:

> if u"L\u00F6wis" == u"Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

I take it, by your explanation above, that you meant that the (py3k)
source code is this:

if "L&#00F6;wis" == "Lo&#0308;wis":
    print "Python is Unicode conforming in this respect."

I agree that here == should be true, but only because Python should
normalize the source code to look like this before processing it:

if "L&#00F6;wis" == "L&#00F6;wis":
    print "Python is Unicode conforming in this respect."

In the following code == should be false:

if "L\u00F6wis" == "Lo\u0308wis":
    print "Python is Unicode conforming in this respect."
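
As a minimal sketch of that last case, assuming an interpreter where a str is a sequence of code points (as in Python 3), the two escaped literals can be compared directly. The variable names are only illustrative.

```python
# Both literals use \u escapes, so the code points are exactly what was typed;
# no source-level normalization can merge 'o' + U+0308 into U+00F6 here.
precomposed = "L\u00F6wis"   # 5 code points: U+00F6 is 'o' with diaeresis
decomposed = "Lo\u0308wis"   # 6 code points: 'o' followed by combining U+0308

same = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))
print(same, lengths)
```

The comparison is code point by code point, so the differing lengths alone are enough to make == false.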

> I think the default case should be that text operations produce the
> expected result in the text domain, even at the expense of array
> invariants.

If you really want that, then you need a type for sequences of graphemes.
E.g. 'c\u0308' is already normalized according to all four normalization
forms (NFC, NFD, NFKC, NFKD), but it's still one grapheme ('c' with
diaeresis, c̈) and two code points. This type could be provided in the
standard library.
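
A rough sketch of the grapheme point, using only the standard unicodedata module. The helper rough_graphemes is hypothetical and only approximates grapheme clusters by attaching combining marks to the preceding code point; it is not the full segmentation algorithm of UAX #29.

```python
import unicodedata

def rough_graphemes(text):
    """Group each code point with the combining marks that follow it.

    Approximation only: a nonzero canonical combining class is treated
    as "belongs to the previous cluster".
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "c\u0308"  # 'c' followed by COMBINING DIAERESIS; no precomposed form exists
forms = [unicodedata.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")]

unchanged = all(f == s for f in forms)  # already normalized in every form
code_points = len(s)
graphemes = len(rough_graphemes(s))
print(unchanged, code_points, graphemes)
```

A real grapheme type would presumably index and slice by such clusters instead of by code points.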

> People who need arrays of code points have several ways to get them,
> and the usual comparison operators will work on them as desired.

But regexps and other string operations won't work on them, and those
operations, not the comparison operators, are the whole point of strings.
If comparisons were enough, then the string type could be removed as
redundant: there's already the array module (or numpy) if you're only
concerned about efficient storage.

> While people who need operations on *text* still have no
> straightforward way to get them, and no promise of one as I read your
> remarks.

Then you missed some of Guido's earlier remarks:

Guido:
: I'm all for adding a way to do normalized string comparisons to the
: library. But I'm not about to change the == operator to apply
: normalization first.
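
One possible shape for such a library function, sketched with unicodedata.normalize. The name normalized_equal and the default of NFC are assumptions made for illustration, not an existing API.

```python
import unicodedata

def normalized_equal(a, b, form="NFC"):
    """Compare two strings after applying the same normalization form.

    Hypothetical helper: == itself is left alone and keeps comparing
    code points.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

plain = "L\u00F6wis" == "Lo\u0308wis"                     # code point comparison
normalized = normalized_equal("L\u00F6wis", "Lo\u0308wis")  # text comparison
print(plain, normalized)
```

This keeps the array invariants of == intact while making the text-domain comparison an explicit call.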