[Python-Dev] Internationalization Toolkit

Tim Peters tim_one@email.msn.com
Fri, 12 Nov 1999 00:42:32 -0500


[MAL]
> If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> signal failure of this assertion at Unicode object construction time
> via an exception. That way we are within the standard, can use
> reasonably fast code for Unicode manipulation and add those extra 1M
> characters at a later stage.

I think this is reasonable.
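
To make the proposal concrete, here is a minimal sketch of the construction-time check, written in present-day Python purely for illustration. The function name check_ucs2 and the choice of ValueError are hypothetical; nothing in this thread has settled on an API.

```python
def check_ucs2(units):
    """Accept a sequence of 16-bit code units only if it is plain UCS-2.

    Hypothetical helper: treats the input as UTF-16 and refuses anything
    that would need surrogate pairs, so every code unit is one character.
    """
    checked = []
    for position, unit in enumerate(units):
        if not 0 <= unit <= 0xFFFF:
            raise ValueError("code unit out of 16-bit range at %d" % position)
        # 0xD800..0xDFFF are the UTF-16 surrogates; seeing one means the
        # text is not representable under the "UTF-16 as UCS-2" assumption.
        if 0xD800 <= unit <= 0xDFFF:
            raise ValueError("surrogate code unit at %d" % position)
        checked.append(unit)
    return checked
```

With a check like this at the door, the rest of the code may assume fixed-width characters, and lifting the restriction later only touches the constructor.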

Using UTF-8 internally is also reasonable, and if it's being rejected on the
grounds of supposed slowness, that deserves a closer look (it's an ingenious
encoding scheme that works correctly with a surprising number of existing
8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
adding a simple finger (i.e., store along with the string an index+offset
pair identifying the most recent position indexed to -- since string
indexing is overwhelmingly sequential, this makes most indexing
constant-time; and UTF-8 can be scanned either forward or backward from a
random internal point because "the first byte" of each encoding is
recognizable as such).
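
A rough sketch of the finger idea follows, in present-day Python for illustration only. The class name FingeredUTF8 and its methods are hypothetical, and a real implementation would live in C; the point is just that the scan starts from the last position visited and can run in either direction, because continuation bytes always look like 10xxxxxx.

```python
class FingeredUTF8:
    """UTF-8 bytes plus a 'finger': the last (char index, byte offset) visited.

    Hypothetical illustration; negative indices are not supported.
    """

    def __init__(self, data):
        data.decode("utf-8")      # validate once; raises UnicodeDecodeError
        self.data = data
        self.finger_index = 0     # character index of the finger
        self.finger_offset = 0    # byte offset where that character starts

    @staticmethod
    def _is_lead(byte):
        # Every byte that is not 0b10xxxxxx starts a character.
        return (byte & 0xC0) != 0x80

    def _seek(self, index):
        data, n = self.data, len(self.data)
        i, off = self.finger_index, self.finger_offset
        while i < index:          # scan forward to the next lead byte
            off += 1
            while off < n and not self._is_lead(data[off]):
                off += 1
            if off >= n:
                raise IndexError("character index out of range")
            i += 1
        while i > index:          # scan backward to the previous lead byte
            off -= 1
            while not self._is_lead(data[off]):
                off -= 1
            i -= 1
        self.finger_index, self.finger_offset = i, off
        return off

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indices not supported in this sketch")
        start = self._seek(index)
        if start >= len(self.data):
            raise IndexError("character index out of range")
        end = start + 1
        while end < len(self.data) and not self._is_lead(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")


text = FingeredUTF8("a\u00e9\u20ac\U0001F600z".encode("utf-8"))
chars = [text[i] for i in range(5)]   # sequential: each step moves one char
```

Sequential access costs one short hop per index, whichever direction it goes; only a jump far from the finger pays for a long scan, and the cost is proportional to the distance jumped rather than to the position in the string.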

I expect either would work well.  It's at least curious that Perl and Tcl
both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
people here saying UCS-2 is the obviously better choice are all from the
Microsoft camp <wink>.  It's not obvious to me, but then neither do I claim
that UTF-8 is obviously better.