[Python-Dev] Internationalization Toolkit
M.-A. Lemburg
mal@lemburg.com
Fri, 12 Nov 1999 10:16:57 +0100
Tim Peters wrote:
>
> [MAL]
> > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> > signal failure of this assertion at Unicode object construction time
> > via an exception. That way we are within the standard, can use
> > reasonably fast code for Unicode manipulation and add those extra 1M
> > characters at a later stage.
>
> I think this is reasonable.
>
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness, that deserves a closer look (it's an ingenious
> encoding scheme that works correctly with a surprising number of existing
> 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by
> adding a simple finger (i.e., store along with the string an index+offset
> pair identifying the most recent position indexed to -- since string
> indexing is overwhelmingly sequential, this makes most indexing
> constant-time; and UTF-8 can be scanned either forward or backward from a
> random internal point because "the first byte" of each encoding is
> recognizable as such).
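The finger idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name FingeredUTF8 and its methods are hypothetical and not part of any proposal, and the string is assumed to be immutable.

```python
class FingeredUTF8:
    """UTF-8 bytes plus a 'finger': the last (char index, byte offset) used.

    Hypothetical illustration only; all names here are assumptions.
    """

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.finger = (0, 0)  # (character index, byte offset)

    @staticmethod
    def _is_lead(byte):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a character, which is what allows scanning in both directions.
        return (byte & 0xC0) != 0x80

    def _seek(self, index):
        """Return the byte offset of character `index`, moving the finger."""
        char_i, off = self.finger
        n = len(self.data)
        while char_i < index:            # scan forward from the finger
            if off >= n:
                raise IndexError(index)
            off += 1
            while off < n and not self._is_lead(self.data[off]):
                off += 1
            char_i += 1
        while char_i > index:            # scan backward from the finger
            off -= 1
            while not self._is_lead(self.data[off]):
                off -= 1
            char_i -= 1
        if off >= n:
            raise IndexError(index)
        self.finger = (char_i, off)      # only remember valid positions
        return off

    def __getitem__(self, index):
        if index < 0:
            raise IndexError(index)
        start = self._seek(index)
        end = start + 1
        while end < len(self.data) and not self._is_lead(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")


s = FingeredUTF8("a\u00fcb\u20ac")       # 1-, 2-, 1- and 3-byte characters
chars = [s[i] for i in range(4)]         # sequential access: one step each
```

Sequential indexing moves the finger by one character per access. A jump to a distant index still costs a scan proportional to the distance from the finger.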
Here are some arguments for using the proposed UTF-16 strategy instead:
· under the UCS-2 assumption every character occupies exactly two bytes,
  so indexing is a constant-time operation
· conversion APIs to platform-dependent wchar_t implementations are fast
  because they can either simply copy the content (2-byte wchar_t) or
  widen each 2-byte unit to 4 bytes (4-byte wchar_t)
· UTF-8 needs 2 bytes for all the accented Latin-1 characters (e.g. u
  with two dots), which are used in many non-English languages
· from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."
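A minimal sketch of the points above, assuming a hypothetical UCS2String class (the name, the ValueError and the to_wchar4 helper are illustrative assumptions, not the proposed API). It rejects anything outside UCS-2 at construction time, indexes by multiplying by two, and widens to 4-byte units by zero-extension:

```python
import struct


class UCS2String:
    """Hypothetical fixed-width string: UTF-16 storage restricted to UCS-2."""

    def __init__(self, text):
        for ch in text:
            cp = ord(ch)
            # Reject anything that would need (or is) a surrogate, so the
            # "one character == one 16-bit unit" assertion always holds.
            if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("U+%04X is outside UCS-2" % cp)
        self.units = text.encode("utf-16-le")

    def __len__(self):
        return len(self.units) // 2

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Constant time: character i always lives at byte offset 2*i.
        return self.units[2 * i:2 * i + 2].decode("utf-16-le")

    def to_wchar4(self):
        # Widening for a 4-byte wchar_t: zero-extend each 16-bit unit.
        n = len(self)
        values = struct.unpack("<%dH" % n, self.units)
        return struct.pack("<%dI" % n, *values)


s = UCS2String("M\u00fcnchen")
utf8_size = len("\u00fc".encode("utf-8"))       # u with two dots in UTF-8
latin1_size = len("\u00fc".encode("latin-1"))   # same character in Latin-1
```

The check in the constructor is what lets the rest of the code treat the data as fixed-width. Support for characters beyond the 16-bit range would have to relax that check and revisit the indexing.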
Besides, the Unicode object will have a buffer containing the
<default encoding> representation of the object, which, if all goes
well, will always hold the UTF-8 value. RE engines etc. can then directly
work with this buffer.
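One way to picture such a buffer, purely as an assumption-laden sketch (CachedUnicode and default_buffer are hypothetical names, and the caching policy is a guess at the intent): the object keeps its 16-bit data and lazily builds the default-encoding bytes once, which a byte-oriented RE engine can then search directly.

```python
import re


class CachedUnicode:
    """Hypothetical object: UTF-16 data plus a cached default-encoding buffer."""

    def __init__(self, text, default_encoding="utf-8"):
        self._units = text.encode("utf-16-le")
        self._default_encoding = default_encoding
        self._buffer = None               # filled in on first request

    def default_buffer(self):
        # Encode once, then hand out the same bytes object every time.
        if self._buffer is None:
            text = self._units.decode("utf-16-le")
            self._buffer = text.encode(self._default_encoding)
        return self._buffer


u = CachedUnicode("caf\u00e9 au lait")
match = re.search(rb"au", u.default_buffer())   # byte-level search
```

Note that offsets reported by such a search are byte offsets into the buffer, not character indexes into the Unicode object.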
> I expect either would work well. It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*? I don't. The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp <wink>. It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/