[Python-Dev] Unicode comparisons & normalization

Ka-Ping Yee ping@lfw.org
Wed, 3 May 2000 01:30:02 -0700 (PDT)


On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
> 
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

I just looked through this document.  Indeed, there's a lot
of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison
method does *not* satisfy the common assumption

    a > b  implies  a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases
where differences in the *later* part of a string can have
greater influence than differences in an earlier part of a
string.  It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c  implies  a > c
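
To make the failed assumption concrete, here is a toy two-level key
in Python.  It is only a sketch of the idea of multi-level comparison,
not the UTR 10 algorithm: the name toy_key and the use of combining
classes as a secondary weight are assumptions made for illustration.

```python
import unicodedata

def toy_key(s):
    """Toy two-level sort key (NOT the UTR 10 algorithm).

    Level 1: the base characters, with combining marks removed.
    Level 2: combining classes, consulted only if level 1 ties.
    """
    decomposed = unicodedata.normalize("NFD", s)
    primary = [ch for ch in decomposed if not unicodedata.combining(ch)]
    secondary = [unicodedata.combining(ch) for ch in decomposed]
    return (primary, secondary)

a, b = "\u00e9", "e"      # e with acute accent vs. plain e
c, d = "a", "z"

# Code-point comparison: the first differing character decides.
print(a > b, a + c > b + d)                                        # True True

# Toy key: the accent only breaks ties, so the later base letter wins.
print(toy_key(a) > toy_key(b), toy_key(a + c) > toy_key(b + d))    # True False

# In this example, a + c still sorts after a.
print(toy_key(a + c) > toy_key(a))                                 # True
```

Because the key is an ordinary tuple, transitivity comes for free
from Python's tuple comparison.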

There are sufficiently many significant transformations
described in the UTR 10 document that I'm pretty sure it
is possible for two things to collate equally but not be
equivalent.  (Even after Unicode normalization, there is
still the possibility of rearrangement in step 1.2.)
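
A small sketch of how two strings could collate equally without being
equivalent, even after normalization.  The key function below is
hypothetical, and treating the soft hyphen (U+00AD) as ignorable is an
assumption of this toy, not something taken from the UTR 10 tables.

```python
import unicodedata

# Assumption for this toy only: the soft hyphen carries no weight.
IGNORED = {"\u00ad"}

def toy_collation_key(s):
    """Normalize to NFC, then drop the characters this toy ignores."""
    normalized = unicodedata.normalize("NFC", s)
    return tuple(ch for ch in normalized if ch not in IGNORED)

x = "coop"
y = "co\u00adop"          # same letters with a soft hyphen in the middle

# Normalization does not make the two strings equal...
print(unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y))  # False

# ...but their toy collation keys are identical.
print(toy_collation_key(x) == toy_collation_key(y))                        # True
```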

This would be another motivation for Python to carefully
separate the three types of equality:

    is         identity-equal
    ==         value-equal
    <=>        magnitude-equal

We currently don't distinguish between the last two;
the operator "<=>" is my proposal for how to spell
"magnitude-equal", and in terms of outward behaviour
you can consider (a <=> b) to be (a <= b and a >= b).
I suspect we will find ourselves needing it if we do
rich comparisons anyway.
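
A minimal sketch of the three kinds of equality, assuming rich
comparison methods (__eq__, __le__, __ge__) are available.  The Word
class and the magnitude_equal() helper are hypothetical names; the
helper simply stands in for the proposed (a <=> b), and ordering by
case-folded text is just one arbitrary way to make magnitude differ
from value.

```python
class Word:
    """Hypothetical wrapper: exact text for ==, case-folded text for ordering."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return self.text == other.text

    def __hash__(self):
        return hash(self.text)

    def __le__(self, other):
        return self.text.casefold() <= other.text.casefold()

    def __ge__(self, other):
        return self.text.casefold() >= other.text.casefold()


def magnitude_equal(a, b):
    """Stand-in for the proposed (a <=> b)."""
    return a <= b and a >= b


p = Word("Python")
q = Word("Python")
r = Word("python")

print(p is p, p is q)            # True False   identity-equal
print(p == q, p == r)            # True False   value-equal
print(magnitude_equal(p, r))     # True         magnitude-equal
```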

(I don't know of any other useful kinds of equality,
but if you've run into this before, do pipe up...)


-- ?!ng