[Python-Dev] UTF-16 code point comparison

Guido van Rossum guido@beopen.com
Thu, 27 Jul 2000 08:41:25 -0500


> As far as I can tell, cmp() is the *only* unicode function
> that thinks the internal storage is UTF-16.
> 
> Everything else assumes UCS-2.
> 
> And for Python 2.0, it's surely easier to fix cmp() than to
> fix everything else.  

Agreed (I think).

> (this is the same problem as 8-bit/unicode comparisons, where
> the current consensus is that it's better to raise an exception
> if it looks like the programmer doesn't know what he was doing,
> rather than pretend it's another encoding).
> 
> :::
> 
> To summarize, here's the "character encoding guidelines" for
> Python 2.0:
> 
>     In Unicode context, 8-bit strings contain ASCII. Characters
>     in the 0x80-0xFF range are invalid.  16-bit strings contain
>     UCS-2.  Characters in the 0xD800-0xDFFF range are invalid.
> 
>     If you want to use any other encoding, use the codecs pro-
>     vided by the Unicode subsystem.  If you need to use Unicode
>     characters that cannot be represented as UCS-2, you cannot
>     use Python 2.0's Unicode subsystem.
> 
> Anything else is just a hack.

I wouldn't go so far as raising an exception when a comparison
involves 0xD800-0xDFFF; after all we don't raise an exception when an
ASCII string contains 0x80-0xFF either (except when converting to
Unicode).
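
To make the analogy concrete, here is a rough sketch in present-day
Python (an assumption on my part: it uses bytes objects and the
"ascii" codec as they exist in Python 3, not the 2.0 string types, and
try_ascii_decode is a made-up helper name).  Comparing byte strings
that contain 0x80-0xFF just works; only the conversion step complains.

```python
def try_ascii_decode(data):
    """Return data decoded as ASCII, or None if it holds bytes >= 0x80.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        # The error surfaces here, at conversion time, not on comparison.
        return None


plain = b"cafe"
high = b"caf\xe9"           # contains a byte in the 0x80-0xFF range

# Ordinary comparison never looks at "validity"; it compares byte values.
ordering_ok = high > plain

# Conversion is where the invalid byte is noticed.
decoded_plain = try_ascii_decode(plain)
decoded_high = try_ascii_decode(high)
```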

The invalidity of 0xD800-0xDFFF means that these aren't valid Unicode
code points; it doesn't mean that we should trap all attempts to use
these values.  That way, apps that need UTF-16 awareness can code it
themselves.
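
As a minimal sketch of what such app-level UTF-16 awareness might look
like (the function names are hypothetical, and representing a string
as a plain list of 16-bit integers is an assumption made for
illustration), an application could join surrogate pairs itself and
compare by code point.  Lone values in 0xD800-0xDFFF are passed
through rather than trapped.

```python
def utf16_code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is joined into one
    code point above 0xFFFF; anything else is yielded unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # Includes lone surrogates: no exception is raised for them.
            yield u
            i += 1


def compare_utf16(a, b):
    """Three-way comparison of two code unit sequences in code point order."""
    pa = list(utf16_code_points(a))
    pb = list(utf16_code_points(b))
    return (pa > pb) - (pa < pb)


bmp_last = [0xFFFF]             # U+FFFF, a single code unit
first_astral = [0xD800, 0xDC00]  # U+10000 as a surrogate pair

# Raw code unit order puts the surrogate pair first; code point order
# puts it last.
unit_order = (bmp_last > first_astral) - (bmp_last < first_astral)
point_order = compare_utf16(bmp_last, first_astral)
```

The two orderings only disagree when one side has a surrogate pair and
the other has a code unit in 0xE000-0xFFFF at the same position.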

Why?  Because I don't want checks that explicitly trap
0xD800-0xDFFF to proliferate throughout the code base.

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)