[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Thu Jun 7 05:32:47 CEST 2007


On 6/7/07, Bill Janssen <janssen at parc.com> wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.

This is false. When you access the contents of a string using the
*sequence* protocol, what you get is code points, not characters
(grapheme clusters). To get those, you have to use a regexp, as
outlined in UAX#29. You could normalize at the same time, so that a
bitwise comparison (rather than collation) compares graphemes the
way the user does. If you're going to do all that, then you might
as well implement your own type (which could even be provided by
the standard library).
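
To make the distinction concrete, here is a rough sketch in Python 3
syntax. It is not the UAX#29 algorithm: the helper simple_clusters is
a hypothetical name, and it assumes a cluster is just a base code point
followed by any code points whose general category starts with "M".
Real segmentation has more rules (Hangul syllables, ZWJ sequences and
so on).

```python
import unicodedata

def simple_clusters(text):
    """Approximate grapheme clusters after NFC normalization.

    Simplification (assumption): a cluster is one base code point plus
    any following mark code points (general category M*). This is only
    a subset of the UAX#29 rules.
    """
    normalized = unicodedata.normalize("NFC", text)
    clusters = []
    for code_point in normalized:
        is_mark = unicodedata.category(code_point).startswith("M")
        if clusters and is_mark:
            # Attach the combining mark to the preceding base.
            clusters[-1] += code_point
        else:
            clusters.append(code_point)
    return clusters

decomposed = "o\u0308k"    # o + COMBINING DIAERESIS + k
precomposed = "\u00f6k"    # precomposed o-with-diaeresis + k

# The sequence protocol yields code points, so the two differ...
code_points = list(decomposed)

# ...but after normalizing and clustering they compare equal bitwise.
same_for_user = simple_clusters(decomposed) == simple_clusters(precomposed)
```

A separate type wrapping such a list of clusters is roughly what
"implement your own type" would amount to.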

Note that normalization alone does not produce a sequence of
grapheme clusters, because there aren't precomposed characters for
everything - for full generality you just have to deal with
combining characters.
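
A small illustration of that point, using the standard unicodedata
module. The choice of "q" as a base letter is mine; I am assuming
Unicode has no precomposed q-with-diaeresis, which is the case as far
as I know.

```python
import unicodedata

# o + COMBINING DIAERESIS has a precomposed form (U+00F6),
# so NFC collapses it to a single code point.
with_precomposed = unicodedata.normalize("NFC", "o\u0308")

# q + COMBINING DIAERESIS has no precomposed form (assumption noted
# above), so NFC leaves two code points for one grapheme cluster.
without_precomposed = unicodedata.normalize("NFC", "q\u0308")

lengths = (len(with_precomposed), len(without_precomposed))
```

So even after NFC, len() still counts code points and the combining
mark has to be handled explicitly.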

> I also believe that the literal form '\u0308' should generate a compile
> error.  It's a valid Unicode codepoint, sure, but not a valid string.

Then you wouldn't even be able to iterate over or index strings anymore,
as that could produce such "invalid" strings, which would need to
generate exceptions if you really want to ban them. Or is there a point
in making people type 'o\u0308'[1] instead of '\u0308'?
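
In Python 3 syntax, a sketch of what plain indexing and iteration
already hand back:

```python
s = "o\u0308"

# Indexing returns a one-element string holding the lone combining mark.
second = s[1]

# Iteration produces the same "invalid" string as one of its items.
pieces = list(s)

# The literal and the indexed result are the same value.
same = second == "\u0308"
```

Banning the literal would therefore not keep such strings from existing
unless indexing and iteration raised exceptions as well.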

