[Python-3000] String comparison

Stephen J. Turnbull stephen at xemacs.org
Thu Jun 7 15:30:13 CEST 2007


Guido van Rossum writes:

 > No it cannot. We are talking about \u escapes, not about a string
 > literal containing Unicode characters ("Löwis").

Ah, good point.

I apologize for mistyping the example.  *I* *was* talking about a
string literal containing Unicode characters.  However, on my
terminal, you can't see the difference!  So I (ab)used the \u escapes
to make clear that in one case the representation used 5 characters
and in the other 6.
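
To spell it out with the escapes (a minimal sketch; these are the two
spellings I had in mind, written in Py3k str syntax):

    a = "L\u00F6wis"    # LATIN SMALL LETTER O WITH DIAERESIS: 5 code points
    b = "Lo\u0308wis"   # o followed by COMBINING DIAERESIS:   6 code points

    print(a)                # Löwis
    print(b)                # Löwis -- displays identically on most terminals
    print(len(a), len(b))   # 5 6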

 > > > I might be writing either literal with the expectation to get
 > > > exactly that sequence of code points,

This should be possible, agreed.  Couldn't rawstring read syntax be
given the right semantics?  And of course you've always got tuples of
integers.
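
For example (a sketch, using Py3k chr/ord), a tuple of integers pins
down the exact code point sequence in a way no display can blur:

    codepoints = (0x4C, 0xF6, 0x77, 0x69, 0x73)      # L, ö (precomposed), w, i, s
    s = "".join(chr(cp) for cp in codepoints)
    assert tuple(ord(c) for c in s) == codepoints    # round-trips exactly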

What bothers me about the "sequence of code points" way of thinking is
that len("Löwis") is nondeterministic: it depends on whether the ö
happens to be a precomposed character or an o followed by a combining
diaeresis, and the two are visually indistinguishable.  To my mind,
especially from the educational standpoint, but also from the point of
view of implementing a text editor or docutils, that's much more
horrible than Martin's point that len(a) + len(b) == len(a+b) could
fail if we do NFC normalization.  (NFD would work here.)
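
A sketch of Martin's point, using today's unicodedata module to stand
in for a hypothetical str type that normalized on concatenation:

    import unicodedata

    a = "o"
    b = "\u0308"                                   # COMBINING DIAERESIS

    # If concatenation normalized to NFC, the two pieces would fuse into
    # one precomposed ö, and the length invariant would break:
    nfc = unicodedata.normalize("NFC", a + b)
    print(len(a) + len(b), len(nfc))               # 2 1

    # NFD leaves the combining mark separate, so the count is preserved:
    nfd = unicodedata.normalize("NFD", a + b)
    print(len(a) + len(b), len(nfd))               # 2 2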

I'm not sure what happened, but after recent upgrades to Python and
docutils (presumably the latter) a bunch of Japanese reST documents of
mine broke.  I have no idea how to count the number of characters in a
line containing Japanese any more (even having fixed the tables by
trial and error, it's not obvious), but of course tables require being
able to do that exactly.  Normalization would guarantee TOOWTDI.

But IMO the right way to do normalization in such cases is in Python
itself.  One is *never* going to be able to keep up with all the
external libraries, and it seems very unlikely that many will be high
quality from this point of view.  So even if your own code does the
right thing, you have to wrap every external module you call.  Or you
can rewrite Python to normalize in the right places once, and then you
don't have to worry about it.  (Bugs, yes, but then you fix them in
the forked Python, and all your code benefits from the fix
automatically.)
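
A sketch of the kind of wrapper I mean (the decorator name and the
wrapped call below are only illustrations, not anything in the stdlib);
multiply it by every external entry point you call and the appeal of
doing it once in Python itself should be clear:

    import functools
    import unicodedata

    def returns_normalized(func, form="NFC"):
        """Normalize any str an external call hands back (sketch only)."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if isinstance(result, str):
                return unicodedata.normalize(form, result)
            return result
        return wrapper

    # Hypothetical usage: repeat for every module you depend on.
    # some_module.get_title = returns_normalized(some_module.get_title)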

 > Bytes are not code points. The unicode string type has always been
 > about code points, not characters.

I wish you had named it "widechar", then.  I think that a language
where len("Löwis") == len("Löwis") is an invariant is one honking good
idea!
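
Concretely (a sketch with today's unicodedata), the two spellings
compare unequal as code points even though they are the same text,
until you normalize both sides:

    import unicodedata

    a = "L\u00F6wis"       # precomposed ö
    b = "Lo\u0308wis"      # o + combining diaeresis

    print(a == b)          # False: compared as code point sequences
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))    # True: compared as text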

 > Have you ever even used the unicode string type in Python 2?

Yes.  On the Mac, I often have to run Unicode strings through NFD
normalization because some layers of Mac OS X normalize to NFD and
others don't normalize at all.  That means that file names in
particular tend to be different depending on whether I get them from
the OS or from the user.  But a test as simple as creating a file with
a name containing \u010D and trying to stat it can fail, AIUI because
stdio normalizes to NFD but the raw OS stat call doesn't.  This
particular test does work in Python; I'm not sure what the difference
is.
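
The test I mean is roughly this (a sketch; exactly where the
normalization happens on OS X is the part I'm unsure about):

    import os
    import unicodedata

    name = "test-\u010D.txt"          # precomposed U+010D (č)

    open(name, "w").close()           # create the file

    try:
        os.stat(name)                 # stat with the precomposed spelling
    except OSError:
        # On a filesystem that stores names in NFD, the decomposed
        # spelling may be the one that actually exists:
        os.stat(unicodedata.normalize("NFD", name))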

Granted, that's part of the plan and not serendipity; nonetheless, I
think the default case should be that text operations produce the
expected result in the text domain, even at the expense of array
invariants.  People who need arrays of code points have several ways
to get them, and the usual comparison operators will work on them as
desired.  Meanwhile, people who need operations on *text* still have
no straightforward way to get them, and no promise of one, as I read
your remarks.


