[Python-3000] String comparison

Stephen J. Turnbull turnbull at sk.tsukuba.ac.jp
Wed Jun 13 06:28:23 CEST 2007


Rauli Ruohonen writes:

 > In my mind everything in a Python program is within a single
 > Unicode process,

Which is a *serious* mistake.  It is *precisely* the mistake that
leads to mixing UTF-16 and UCS-2 interpretations in the standard
library.  What you are saying is that if you write a 10-line script
that claims Unicode conformance, you are responsible for the Unicode-
correctness of all modules you call implicitly as well as that of the
Python interpreter.

This is what I mean by "Unicode conformance is not a goal of the
language."

Now, it's really not so bad.  If you look at what MAL and MvL are
doing (inter alia, it's their work I'm most familiar with), what you
will see is that they are gradually implementing conformant modules
here and there.  E.g., I am sure it is not MvL's laziness or inability
to come up with a reasonable spec himself that causes PEP 3131 to be a
profile of UAX #31.

 > Actually, I said that there's no way to always do the right thing as long
 > as they are mixed, but that was a too theoretical argument. Practically
 > speaking, there's little need to interpret surrogate pairs as two
 > code points instead of as one non-BMP code point.

Again, a mistake.  In the standard library, the question is not "do I
need this?", but "what happens if somebody else does it?"  They may
receive the same answer, but then again they may not.

For example, suppose you have a supplier-consumer pair sharing a
fixed-length buffer of 2-octet code units.  If it should happen that
the supplier uses the UCS-2 interpretation, then a surrogate pair may
get split when the buffer is full.  Will a UTF-16 consumer be prepared
for this?  Almost surely some will not, because that would imply
maintaining an internal buffer, which is stupidly inefficient if you
have an external buffer protocol.
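To make the failure mode concrete, here is a small sketch (not from the
original mail; `ucs2_chunks` and `decode_buffer` are illustrative names):
a UCS-2 supplier fills fixed-size buffers of 2-octet code units blindly,
so a surrogate pair can straddle a flush, and a strict UTF-16 consumer
that decodes each buffer independently chokes on the dangling surrogate.

```python
import struct

def ucs2_chunks(code_units, buffer_len):
    """Split 16-bit code units into fixed-size buffers, as a UCS-2
    supplier would: blindly, every buffer_len units, with no regard
    for surrogate pairs."""
    for i in range(0, len(code_units), buffer_len):
        yield code_units[i:i + buffer_len]

# U+10000 is the surrogate pair D800 DC00 in UTF-16.
units = [0x0041, 0x0042, 0x0043, 0xD800, 0xDC00]  # "ABC" + U+10000

chunks = list(ucs2_chunks(units, 4))
# The first buffer ends with a lone high surrogate:
assert chunks[0] == [0x0041, 0x0042, 0x0043, 0xD800]

def decode_buffer(units):
    """A naive UTF-16 consumer: decode one buffer in isolation."""
    data = b''.join(struct.pack('<H', u) for u in units)
    return data.decode('utf-16-le')  # strict: raises on a lone surrogate

try:
    decode_buffer(chunks[0])
except UnicodeDecodeError:
    print("surrogate pair split across buffers")
```

A consumer that handles this correctly must carry a trailing surrogate
over to the next buffer, i.e. keep exactly the internal state that the
external buffer protocol was supposed to make unnecessary.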

Note that a UTF-16 supplier feeding a UCS-2 consumer will have no
problems (unless the UCS-2 consumer can't handle "short reads", but
that's unlikely), and if you have a chain starting with a UTF-16
source, then none of the downstream UTF-16 processes have a problem.
The problem is, suppose somehow you get a UCS-2 source?  Whose
responsibility is it to detect that?
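One way to answer the responsibility question is an explicit check at
the boundary. This is a hypothetical validator sketch (the function name
is mine, not anything in the standard library): it accepts only streams
of 16-bit code units with no unpaired surrogate, i.e. well-formed
UTF-16, and so flags a bare UCS-2 source that emits stray surrogates.

```python
def is_well_formed_utf16(code_units):
    """Return True iff the 16-bit code units contain no unpaired
    surrogate: every high surrogate (D800-DBFF) is immediately
    followed by a low surrogate (DC00-DFFF), and no low surrogate
    appears on its own."""
    expect_low = False
    for u in code_units:
        if expect_low:
            if not 0xDC00 <= u <= 0xDFFF:
                return False          # high surrogate without its low half
            expect_low = False
        elif 0xD800 <= u <= 0xDBFF:
            expect_low = True         # high surrogate: a low must follow
        elif 0xDC00 <= u <= 0xDFFF:
            return False              # stray low surrogate
    return not expect_low             # stream must not end mid-pair

assert is_well_formed_utf16([0x0041, 0xD800, 0xDC00])
assert not is_well_formed_utf16([0x0041, 0xD800])   # split pair
assert not is_well_formed_utf16([0xDC00, 0x0041])   # stray low
```

Whether the supplier, the consumer, or some framework in between runs
this check is exactly the conformance question the mail is raising.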

 > Java and C# (and thus Jython and IronPython too) also sometimes use
 > UCS-2, sometimes UTF-16. As long as it works as you expect, there
 > isn't a problem, really.

That depends on how big a penalty you face if you break a promise of
conformance to your client.  Death, taxes, and Murphy's Law are
inescapable.

 > On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the
 > extension that surrogates work as in UTF-16), but you get the extra
 > complication that some equal strings don't compare equal, e.g.
 > u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in
 > practice, because you shouldn't have strings like u'\ud800\udc00' in the
 > first place.

But the Unicode standard itself gives (the equivalent of) u'\ud800' +
u'\udc00' as an example of the kind of thing you *should be able to
do*.  Because, you know, clients of the standard library *will* be
doing half-witted[1] things like that.
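A quick illustration of the anomaly quoted above, in modern Python 3
(which behaves like the UCS-4 build described in the mail): the pair of
surrogate code points and the astral code point they would encode do
not compare equal, and only an explicit round trip through UTF-16
recovers the single character.

```python
s_astral = '\U00010000'         # one code point, U+10000
s_pair = '\ud800' + '\udc00'    # two surrogate code points

assert len(s_astral) == 1
assert len(s_pair) == 2
assert s_astral != s_pair       # "equal" strings that don't compare equal

# Re-encoding the pair as UTF-16 bytes and decoding them joins the
# surrogates into U+10000 ('surrogatepass' lets the lone surrogates
# through the encoder):
recovered = s_pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert recovered == s_astral
```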


Footnotes: 
[1]  What I wanted to say was "Give me a break, already!" (いい加減にしろよ!) <wink>


