[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Fri, 12 Nov 1999 10:16:57 +0100


Tim Peters wrote:
> 
> [MAL]
> > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> > signal failure of this assertion at Unicode object construction time
> > via an exception. That way we are within the standard, can use
> > reasonably fast code for Unicode manipulation and add those extra 1M
> > characters at a later stage.
> 
> I think this is reasonable.
> 
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness, that deserves a closer look (it's an ingenious
> encoding scheme that works correctly with a surprising number of existing
> 8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
> adding a simple finger (i.e., store along with the string an index+offset
> pair identifying the most recent position indexed to -- since string
> indexing is overwhelmingly sequential, this makes most indexing
> constant-time; and UTF-8 can be scanned either forward or backward from a
> random internal point because "the first byte" of each encoding is
> recognizable as such).
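
To make the finger idea concrete, here is a minimal sketch in
present-day Python. The names (FingeredUTF8, _seek, finger) are
hypothetical and this is an illustration only, not code from either
proposal. It caches the last (character index, byte offset) pair and
walks forward or backward from there, relying on the fact that
continuation bytes always look like 10xxxxxx:

```python
class FingeredUTF8:
    """UTF-8 bytes plus a cached (char index, byte offset) 'finger'.

    Hypothetical illustration; assumes the data is valid UTF-8.
    """

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.finger = (0, 0)  # most recently indexed position

    @staticmethod
    def _is_lead(byte):
        # Continuation bytes are 0b10xxxxxx; any other byte starts a character.
        return (byte & 0xC0) != 0x80

    def _seek(self, index):
        """Move the finger to character `index` and return its byte offset."""
        data = self.data
        ci, off = self.finger
        while ci < index:  # scan forward, one character at a time
            off += 1
            while off < len(data) and not self._is_lead(data[off]):
                off += 1
            if off >= len(data):
                raise IndexError(index)
            ci += 1
        while ci > index:  # scan backward to the previous lead byte
            off -= 1
            while not self._is_lead(data[off]):
                off -= 1
            ci -= 1
        self.finger = (ci, off)
        return off

    def __getitem__(self, index):
        if index < 0:
            raise IndexError(index)
        start = self._seek(index)
        if start >= len(self.data):
            raise IndexError(index)
        end = start + 1
        while end < len(self.data) and not self._is_lead(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")
```

Sequential access such as s[0], s[1], s[2], ... then costs one short
scan per step; a jump to a distant index still costs time proportional
to the distance from the finger.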

Here are some arguments for using the proposed UTF-16 strategy instead:

· as long as surrogates are rejected, every character occupies exactly
  one 16-bit unit, so indexing is constant-time
· conversion to the platform-dependent wchar_t is fast: the APIs can
  either copy the buffer as-is (16-bit wchar_t) or widen each 2-byte
  unit to 4 bytes (32-bit wchar_t)
· UTF-8 needs 2 bytes for every accented Latin-1 character (e.g. u
  with two dots), and these are common in many non-English languages
· from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."
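
The fixed-width argument and the construction-time check can be
sketched as follows. This is present-day Python with hypothetical
helper names (check_ucs2, unit_at), not the proposed C implementation:
anything that would need a surrogate pair is refused up front, after
which character i always lives at byte offset 2*i.

```python
def check_ucs2(text):
    """Return the UTF-16-LE bytes of `text`, or raise ValueError if any
    character does not fit in a single 16-bit unit (hypothetical helper)."""
    for pos, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(
                f"position {pos}: U+{cp:04X} is not a single UCS-2 code unit"
            )
    return text.encode("utf-16-le")


def unit_at(buf, index):
    """Constant-time lookup: character `index` starts at byte 2*index."""
    return int.from_bytes(buf[2 * index:2 * index + 2], "little")


buf = check_ucs2("Gr\u00fc\u00dfe")        # "Gruesse" with u-umlaut and sharp s
code_point = unit_at(buf, 2)               # the u-umlaut, U+00FC

# Size of one accented Latin-1 character in each encoding:
utf8_len = len("\u00fc".encode("utf-8"))       # 2 bytes
utf16_len = len("\u00fc".encode("utf-16-le"))  # 2 bytes
latin1_len = len("\u00fc".encode("latin-1"))   # 1 byte
```

Note that for text which is mostly ASCII, UTF-8 is still the more
compact of the two; the size point above only concerns the non-ASCII
part of Latin-1.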

Besides, the Unicode object will also carry a buffer holding its
<default encoding> representation, which, if all goes well, will
always be UTF-8. RE engines etc. can then work directly on this
buffer.
 
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp <wink>.  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/