[Python-3000] String comparison

Jim Jewett jimjjewett at gmail.com
Thu Jun 7 03:15:57 CEST 2007


On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> On 6/6/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> >
> > > > about normalization of data strings.  The big issue is string literals.
> > > > I think I agree with Stephen here:

> > > >     u"L\u00F6wis" == u"Lo\u0308wis"

> > > > should be True (assuming he typed it correctly in the first place :-),
> > > > because they are the same Unicode string.

> > > So let me explain it. I see two different sequences of code points:
> > > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > > claim they are equivalent.

> > Your (conforming) editor can silently replace one with the other.

> No it cannot. We are talking about \u escapes, not about a string
> literal containing Unicode characters ("Löwis").

Ahh... my apologies.  I was interpreting the \u as a way of showing
the bytes in email.  I discarded the interpretation you are using
because that would require a sequence of 10 or 11 code points, rather
than the 5 or 6 you mentioned.
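
To make the two readings concrete, here is a minimal sketch, assuming
Python 3 string semantics (every literal is Unicode, and a raw string
leaves \u escapes unprocessed).  The variable names are illustrative
only; the raw strings stand in for the characters an editor sees in
the source file.

```python
# What sits in the source file: the escapes are still literal characters.
source_composed = r"L\u00F6wis"
source_decomposed = r"Lo\u0308wis"

# What the lexer produces: each \uXXXX escape becomes one code point.
lexed_composed = "L\u00F6wis"
lexed_decomposed = "Lo\u0308wis"

print(len(source_composed), len(source_decomposed))  # 10 11
print(len(lexed_composed), len(lexed_decomposed))    # 5 6
```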

Python lexes it into a shorter string (just as it lexes 1.0 into a
number) at a conceptually later time.  Those later strings should
compare equal according to Unicode, but I agree that you no longer
need to worry about editors introducing bugs.  (And I even agree that
this may be a valid case for ignoring the recommendation; if someone
has been explicit by writing out 6 characters to represent one, they
probably meant it.)
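
As a rough illustration of the difference between code-point equality
and canonical equivalence, the sketch below uses the standard
unicodedata module.  It assumes Python 3 semantics, and the helper
canonically_equal is a hypothetical name, not an existing or proposed
API.

```python
import unicodedata

composed = "L\u00F6wis"     # precomposed o-with-diaeresis
decomposed = "Lo\u0308wis"  # 'o' followed by combining diaeresis

def canonically_equal(a, b):
    # Hypothetical helper: compare after normalizing both sides to NFC.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                   # False: code points differ
print(canonically_equal(composed, decomposed))  # True: canonically equivalent
```

Plain == compares code point by code point, which is the behaviour
Guido describes; the normalizing comparison is what the Unicode
equivalence rules would give.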

-jJ

