[Python-3000] String comparison

Wed Jun 6 17:57:45 CEST 2007

"Stephen J. Turnbull" <turnbull at sk.tsukuba.ac.jp> wrote:
> Rauli Ruohonen writes:
> 
>  > Strings are internal to Python. This is a whole separate issue from
>  > normalization of source code or its parts (such as identifiers).
> 
> Agreed.  But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison:
> 
>     if u"L\u00F6wis" == u"Lo\u0308wis":
>         print "Python is Unicode conforming in this respect."
>     else:
>         print "I guess it's time to start learning Ruby."
> 
> I think it's reasonable to be astonished if Python doesn't at least
> try to print "Python is Unicode conforming in this respect." for the
> above snippet by default.
> 
>  > It is up to Python to define what "==" means, just like it defines
>  > what "is" means.
> 
> You are of course correct.  However, if given that u prefix Python
> chooses to define == in a way that does not respect canonical
> equivalence, what's the point of having these things?  

Maybe I'm missing something, but it seems to me that there might be a
simple solution.  Don't normalize any identifiers or strings.

Hear me out for a moment.  People type what they want.  Isn't that the
whole point of PEP 3131? If they don't know what they want, then that is
as much a problem with display/representation as anything else that we
have discussed.  Any of the flagging methods could easily disable things
like u"o\u0308" for identifiers to force them to be in the "one true
form" to begin with.
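
As a rough sketch of what such a flag could check (the helper name
is_one_true_form is made up for illustration, and picking NFC as the
"one true form" is my assumption, not something anyone has settled on),
using the existing unicodedata module:

```python
import unicodedata

def is_one_true_form(identifier, form="NFC"):
    """Hypothetical check: True if the identifier is already normalized.

    NFC is an assumed default; the thread has not settled on a form.
    """
    return unicodedata.normalize(form, identifier) == identifier

# Precomposed o-umlaut (U+00F6) is already in NFC.
composed_ok = is_one_true_form("L\u00F6wis")
# 'o' followed by a combining diaeresis (U+0308) is not.
decomposed_ok = is_one_true_form("Lo\u0308wis")
```

A tool could refuse or warn whenever this returns False, and leave
everything else untouched.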

As for strings, I think we should opt for keeping it as simple as
possible.  Compare by code points.  To handle normalization issues, add
a normalization method that people call if they care about normalized
Unicode strings*.
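
To make that concrete, here is a minimal sketch.  There is no
normalization method on strings today, so this uses the existing
unicodedata.normalize() function instead; the helper name same_text and
the default of NFC are assumptions for illustration only:

```python
import unicodedata

composed = "L\u00F6wis"      # o-umlaut as a single code point
decomposed = "Lo\u0308wis"   # 'o' followed by a combining diaeresis

# Plain == compares code points, so the two spellings differ.
code_point_equal = (composed == decomposed)

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Callers who care about canonical equivalence ask for it explicitly.
canonically_equal = same_text(composed, decomposed)
```

The default comparison stays cheap and predictable, and only the code
that cares about canonical equivalence pays for normalization.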

If at some point we think that normalization should happen on
identifiers by default, all we need to do is call st.normalize() on
any string that is used for getattr.  Alternatively (or in addition),
a subclass of dict could make it happen automatically.

 - Josiah

* Or leave out normalization altogether in 3.0.  I haven't heard any
complaints about the lack of normalization in Python so far (though
maybe I'm not reading the right python-list messages), and Python has
had Unicode for what, almost 10 years now?