[Python-Dev] UTF-16 code point comparison

Fredrik Lundh <effbot@telia.com>
Thu, 27 Jul 2000 12:12:48 +0200


mal wrote:
> This really has nothing to do with being able to support
> surrogates or not (as Fredrik mentioned), it is the correct
> behaviour provided UTF-16 is used as encoding for UCS-4 values
> in Unicode literals which is what Python currently does.

Really?  I could have sworn that most parts of Python use
UCS-2, not UTF-16.  Built-ins like ord, unichr, len; slicing;
string methods; regular expressions, etc. all clearly assume
that a Py_UNICODE is a unicode code point.
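
As a rough illustration of why the two views differ, the sketch
below (present-day Python 3, not code from the 2.0 sources; the
helper name utf16_units is made up for this example) lists the
16-bit units that UTF-16 needs for a character outside the BMP.
Code that treats every 16-bit unit as one code point sees two
"characters" here.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

# U+10000 is the first code point beyond the 16-bit range.
units = utf16_units("\U00010000")
print([hex(u) for u in units])  # ['0xd800', '0xdc00']
print(len(units))               # 2 units for a single code point
```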

My point is that we shouldn't pretend we're supporting
UTF-16 if we don't do that throughout.

As far as I can tell, cmp() is the *only* unicode function
that thinks the internal storage is UTF-16.

Everything else assumes UCS-2.
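
The disagreement shows up as soon as one string holds a surrogate
pair and the other a BMP character above 0xDFFF.  A minimal sketch
in present-day Python 3 (the function names are hypothetical, and
this only models the two orderings, it is not the actual cmp()
code):

```python
def code_units(text):
    """Split text into the 16-bit units of its UTF-16 encoding."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big")
            for i in range(0, len(raw), 2)]

def cmp_ucs2(a, b):
    """Compare unit by unit, treating each 16-bit unit as a character."""
    ua, ub = code_units(a), code_units(b)
    return (ua > ub) - (ua < ub)

def cmp_code_points(a, b):
    """Compare by full code point, as a UTF-16 aware cmp() would."""
    pa, pb = [ord(c) for c in a], [ord(c) for c in b]
    return (pa > pb) - (pa < pb)

bmp = "\uff21"         # BMP character, single unit 0xFF21
astral = "\U00010000"  # stored as the units 0xD800 0xDC00

print(cmp_ucs2(bmp, astral))         # 1: 0xFF21 sorts after 0xD800
print(cmp_code_points(bmp, astral))  # -1: U+FF21 sorts before U+10000
```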

And for Python 2.0, it's surely easier to fix cmp() than to
fix everything else.

(this is the same problem as 8-bit/unicode comparisons, where
the current consensus is that it's better to raise an exception
if it looks like the programmer doesn't know what he was doing,
rather than pretend it's another encoding).

:::

To summarize, here are the "character encoding guidelines" for
Python 2.0:

    In Unicode context, 8-bit strings contain ASCII. Characters
    in the 0x80-0xFF range are invalid.  16-bit strings contain
    UCS-2.  Characters in the 0xD800-0xDFFF range are invalid.

    If you want to use any other encoding, use the codecs pro-
    vided by the Unicode subsystem.  If you need to use Unicode
    characters that cannot be represented as UCS-2, you cannot
    use Python 2.0's Unicode subsystem.
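
The two range rules above could be checked along these lines.  This
is only a sketch in present-day Python 3; the function names are
invented for the example, and the 16-bit string is modelled as a
plain list of integers rather than a real Py_UNICODE buffer.

```python
def valid_8bit(data):
    """True if every byte is ASCII, i.e. none is in 0x80-0xFF."""
    return all(b < 0x80 for b in data)

def valid_ucs2(units):
    """True if every unit fits in 16 bits and none is in 0xD800-0xDFFF."""
    return all(0 <= u <= 0xFFFF and not (0xD800 <= u <= 0xDFFF)
               for u in units)

print(valid_8bit(b"abc"))              # True
print(valid_8bit(b"caf\xe9"))          # False: 0xE9 is outside ASCII
print(valid_ucs2([0x41, 0x20AC]))      # True
print(valid_ucs2([0xD800, 0xDC00]))    # False: surrogate units
```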

Anything else is just a hack.

</F>