[Python-Dev] UTF-16 code point comparison

Bill Tutt billtut@microsoft.com
Thu, 27 Jul 2000 11:23:27 -0700


> umm.  the Java docs I have access to doesn't mention surrogates
> at all (they do point out that a character is 16-bit, and they don't
> provide an \U escape).  

When Java does add support for characters outside the BMP, it seems likely
that, for backward compatibility reasons, they'll do it by paying attention
to surrogates. Changing char to a 32-bit int would break too much of their
users' code.

> on the other hand, MSDN says:
>     Windows 2000 provides support for basic input, output, and
>     simple sorting of surrogates. However, not all Windows 2000
>     system components are surrogate compatible. Also, surrogates
>     are not supported in Windows 95/98 or in Windows NT 4.0.

> and then mentions all the usual problems with variable-width
> encodings...

Which means it supports UTF-16, and the support can only get better.


> > > after all, if variable-width internal storage had been easy to deal
> > > with, we could have used UTF-8 from the start...  (and just like
> > > the Tcl folks, we would have ended up rewriting the whole thing
> > > in the next release ;-)
> >
> > Oh please, UTF-16 is substantially simpler to deal with than UTF-8.

> in what way?  as in "one or two words" is simpler than "one, two,
> three, four, five, or six bytes"?

> or as in "nobody will notice anyway..." ;-)

As in it's very easy to determine, at any arbitrary position, which half of
the surrogate pair you're dealing with based on its 16-bit value.
You can't say that about UTF-8.
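
To make the point concrete, here is a minimal sketch in present-day Python
(not anything from the patch under discussion; the function names are made up
for illustration). It relies only on the fixed UTF-16 ranges: high (first)
surrogates are 0xD800-0xDBFF and low (second) surrogates are 0xDC00-0xDFFF.

```python
def surrogate_kind(unit):
    """Classify one 16-bit UTF-16 code unit (hypothetical helper)."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "high"  # first half of a surrogate pair
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"   # second half of a surrogate pair
    return "bmp"       # an ordinary BMP character, complete on its own


def combine(high, low):
    """Turn a high/low surrogate pair into the code point it encodes."""
    if surrogate_kind(high) != "high" or surrogate_kind(low) != "low":
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A single range check on the unit itself is enough; no scanning backward or
forward is needed to know where you are inside a character.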

> if UCS-2/BMP was good enough for NT 4.0, Unicode 1.1, and Java 1.0,
> it's surely good enough for Python 2.0 ;-)
>
> (and if I understand things correctly, 2.1 isn't that far away...)

They were, since UTF-16 didn't exist at the time. :)
I think we both want Python's unicode support to eventually support
surrogate range characters.
Let me see if I can reiterate what we're both trying to say.

What you're saying:
* We might switch to UCS-4 when we support this extra stuff because variable
length encodings are annoying.
My counter point:
	Switching to UCS-4 isn't backward compatible. If you want to avoid
variable-length encodings, then start with UCS-4 from the beginning. BUT
keep in mind that Windows and Java either are UTF-16 systems right now or
are likely to become so.
What I'm saying:
* Just use UTF-16 and be done with it. Windows is using it, and Java, if it
isn't already, is very likely to do so.
Your counter point:
	Supporting UTF-16 is complicated, and we don't want to do all of
this for Python 2.0. (Especially anything that involves writing a Unicode
string object that hides the surrogate as two 16-bit entities)
My response:
	This is true. I've never suggested we do all of the UTF-16 support
for Python 2.0. The UTF-16 code point order comparison patch was based on
something I noticed online, and I submitted it more for feedback and
comments than because I wanted or needed the patch to get in.
However, that doesn't mean bad things will happen if we allow the UTF-8/16
(en/de)coders to handle surrogate characters. The Python code can worry about
this variable-length encoding on its own. The core should still be able to
encode the Unicode string as UTF-8, however, so that it can be pickled.
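
For readers who haven't seen the issue the comparison patch addresses: plain
16-bit comparison sorts surrogates (0xD800-0xDFFF) before 0xE000-0xFFFF, even
though the characters they encode come after the whole BMP. The sketch below
is an illustration in present-day Python, not the submitted patch. It uses the
commonly described fix-up of remapping code units before comparing, and the
helper names are hypothetical.

```python
def utf16_units(text):
    """Return the UTF-16 code units of a str as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def codepoint_order_key(units):
    """Remap code units so list comparison follows code point order."""
    fixed = []
    for u in units:
        if u >= 0xE000:
            u -= 0x800    # 0xE000-0xFFFF moves down to 0xD800-0xF7FF
        elif u >= 0xD800:
            u += 0x2000   # surrogates move up to 0xF800-0xFFFF
        fixed.append(u)
    return fixed


a = utf16_units("\uff5e")       # one BMP code unit
b = utf16_units("\U00010000")   # one surrogate pair
raw_order = a < b                                          # False
fixed_order = codepoint_order_key(a) < codepoint_order_key(b)  # True
```

The remapping only swaps two ranges of code units, so the comparison stays a
single pass over 16-bit values with no need to decode pairs first.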

Did I get what you're saying right, and do you understand what I'm getting
at?

Bill






 -----Original Message-----
From: 	Fredrik Lundh [mailto:effbot@telia.com] 
Sent:	Thursday, July 27, 2000 8:35 AM
To:	Bill Tutt
Cc:	python-dev@python.org
Subject:	Re: [Python-Dev] UTF-16 code point comparison
