[Python-3000] String comparison

Stephen J. Turnbull stephen at xemacs.org
Fri Jun 8 10:21:36 CEST 2007


Guido van Rossum writes:

 > If you want to have an abstraction that guarantees you'll never see
 > an unnormalized text string you should design a library for doing so.

OK.

 > (*) It looks like such a library will not have a way to talk about
 > "\u0308" at all, since it is considered unnormalized.

From the Unicode Standard, v4.0, p. 43: "In the Unicode Standard, all
sequences of character codes are permitted."  Since normalization only
applies to characters with decompositions, "\u0308" is indeed valid
Unicode, a one-character sequence in NFC.
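
A quick way to check this claim is with the standard library's
unicodedata module.  This is only a sketch using a current Python 3
interpreter; the variable names are mine and purely illustrative.

```python
import unicodedata

# U+0308 COMBINING DIAERESIS on its own, with no base character.
diaeresis = "\u0308"

# NFC leaves the lone combining mark untouched, so it is already in NFC.
nfc = unicodedata.normalize("NFC", diaeresis)
is_unchanged = (nfc == diaeresis)

# The character has no decomposition mapping of its own.
has_decomposition = unicodedata.decomposition(diaeresis) != ""

# Normalization only acts when the mark follows a base it can compose with.
composed = unicodedata.normalize("NFC", "u\u0308")
```

Here is_unchanged should be true and has_decomposition false, while
composed collapses to the single code point U+00FC.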

AFAIK, the only strings the Unicode standard absolutely prohibits
emitting are those containing code points guaranteed not to be
characters by the standard.  And normalization is simply an internal
technique that allows text operations to be implemented
code-point-wise without fear that emitting the results would produce
illegal sequences or other externally visible incompatibilities with
the standard.

So there's nothing "wrong by definition" about defining strings as
sequences of code points, and string operations in code-point-wise
fashion.  It just makes that library for Unicode more expensive to
design and operate, and will require auditing and reimplementation of
common libraries (including the standard library) by every program
that requires strict Unicode conformance.
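
To illustrate the cost, here is a rough sketch of what a conforming
program has to layer on top of code-point-wise strings.  The helper
canonically_equal is a hypothetical name of my own, not an existing
API, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (hypothetical helper)."""
    # Normalize both sides to the same form before the code-point comparison.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# Built-in == compares code point by code point, so these differ.
codepoint_equal = (precomposed == decomposed)

# The wrapper treats them as the same text.
equivalent = canonically_equal(precomposed, decomposed)
```

Every comparison, hash lookup, or search that must respect canonical
equivalence needs the same treatment, which is where the auditing
burden comes from.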


