[Python-Dev] UTF-16 code point comparison

Fredrik Lundh <effbot@telia.com>
Thu, 27 Jul 2000 17:35:27 +0200


bill wrote:
> So use UCS-4 internal storage now. UTF-16 just seems like a handy internal
> storage mechanism to pick since Win32 and Java use it for their native
> string processing.

umm.  the Java docs I have access to don't mention surrogates
at all (they do point out that a character is 16-bit, and they don't
provide an \U escape).  on the other hand, MSDN says:

    Windows 2000 provides support for basic input, output, and
    simple sorting of surrogates. However, not all Windows 2000
    system components are surrogate compatible. Also, surrogates
    are not supported in Windows 95/98 or in Windows NT 4.0.

and then mentions all the usual problems with variable-width
encodings...
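
To make the "usual problems" concrete, here is a minimal sketch of what happens once a string contains a character outside the BMP. It assumes a modern Python 3 interpreter (not the 2.0-era one under discussion), and `utf16_units` is a hypothetical helper written for this illustration, not an existing API.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
s = "a\U00010400b"
units = utf16_units(s)

# Three code points, but four 16-bit code units.
print(len(s), len(units))

# Indexing by code unit no longer matches indexing by character:
# units[1] and units[2] are the two halves of one character.
print([hex(u) for u in units])
```

Any code that treats one 16-bit unit as one character (length, slicing, reversal, comparison) can split or misorder the pair at positions 1 and 2.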

> > after all, if variable-width internal storage had been easy to deal
> > with, we could have used UTF-8 from the start...  (and just like
> > the Tcl folks, we would have ended up rewriting the whole thing
> > in the next release ;-)
>
> Oh please, UTF-16 is substantially simpler to deal with than UTF-8.

in what way?  as in "one or two words" is simpler than "one, two,
three, four, five, or six bytes"?

or as in "nobody will notice anyway..." ;-)
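
For comparison, a small sketch that counts how many 16-bit words and how many bytes a single code point takes in each encoding. It assumes a modern Python 3 interpreter; `widths` is a hypothetical helper. Note that the UTF-8 codec in current Python follows the later, stricter definition of UTF-8 that stops at four bytes, so the five- and six-byte forms mentioned above cannot be produced here.

```python
def widths(ch):
    """Return (utf16_words, utf8_bytes) for a single code point (hypothetical helper)."""
    return len(ch.encode("utf-16-le")) // 2, len(ch.encode("utf-8"))

# ASCII, Latin-1 range, rest of the BMP, and one character beyond the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U00010400"]

for ch in samples:
    words, nbytes = widths(ch)
    print("U+%04X: %d word(s), %d byte(s)" % (ord(ch), words, nbytes))
```

Both encodings are variable-width; UTF-16 only hides it better, because the two-word case shows up only for characters outside the BMP.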

:::

if UCS-2/BMP was good enough for NT 4.0, Unicode 1.1, and Java 1.0,
it's surely good enough for Python 2.0 ;-)

(and if I understand things correctly, 2.1 isn't that far away...)

</F>