[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Wed, 10 Nov 1999 14:08:30 +0100


Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> >...
> > Well almost... it depends on the current value of <default encoding>.
> 
> Default encodings are kind of nasty when they can be altered. The same
> problem occurred with import hooks. Only one can be present at a time.
> This implies that modules, packages, subsystems, whatever, cannot set a
> default encoding because something else might depend on it having a
> different value. In the end, nobody uses the default encoding because it
> is unreliable, so you end up with extra implementation/semantics that
> aren't used/needed.

I know, but this is a little different: strings are used all the
time, whereas import hooks are rarely touched directly by the user.

E.g. people in Europe will probably prefer Latin-1 as the default
encoding, while people in Asia will use one of the common CJK encodings.

The <default encoding> decides what encoding to use for many typical
tasks: printing, str(u), "s" argument parsing, etc.

Note that the <default encoding> is not meant to be switched just
before individual operations. It is meant to be set once, at
thread creation time.
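
Just to illustrate the idea, here's a rough sketch of what a
per-thread default could look like. The helper names are purely
hypothetical and not part of the proposed API:

import threading

_state = threading.local()

def set_default_encoding(name):
    # Meant to be called once, at thread creation time.
    _state.encoding = name

def get_default_encoding():
    # Fall back to a conservative default if the thread never set one.
    return getattr(_state, "encoding", "ascii")

def coerce_to_string(u):
    # Printing, str(u) and "s" argument parsing would all funnel
    # through a conversion like this one.
    return u.encode(get_default_encoding())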

> [...]
> 
> > BTW, I'm still not too sure about the underlying internal format.
> > The problem here is that Unicode started out as a 2-byte fixed-length
> > representation (UCS2) but then shifted towards a 4-byte fixed-length
> > representation known as UCS4. Since having 4 bytes per character
> > is a hard sell to customers, UTF16 was created to stuff the UCS4
> > code points (this is what character entities are called in Unicode)
> > into 2 bytes... with a variable-length encoding.
> 
> History is basically irrelevant. What is the situation today? What is in
> use, and what are people planning for right now?
> 
> >...
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
> doing (as I recall).
> 
> Why go with a variable length format, when people seem to be doing fine
> with UCS-2?

The reason for UTF-16 is simply that it is identical to UCS-2
over large ranges, which makes optimizations (e.g. the UCS2 flag
I mentioned in an earlier post) feasible and effective. UTF-8
slows things down for CJK text, since the APIs would very often
have to scan the string to find the correct logical position in
the data.
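
To make that scanning cost concrete, here's a small sketch (pure
illustration, assuming well-formed UTF-8 data) of what finding the
n-th character in UTF-8 entails:

def utf8_byte_offset(data, char_index):
    # Find the byte offset of character `char_index` in UTF-8 `data`
    # (a bytes object). Every lookup has to walk the lead bytes from
    # the start, which makes random access an O(n) operation; CJK
    # text mostly uses 3-byte sequences, so the scan cost is real.
    offset = 0
    for _ in range(char_index):
        lead = data[offset]
        if lead < 0x80:
            offset += 1      # 1-byte (ASCII) sequence
        elif lead < 0xE0:
            offset += 2      # 2-byte sequence
        elif lead < 0xF0:
            offset += 3      # 3-byte sequence (most CJK characters)
        else:
            offset += 4      # 4-byte sequence
    return offset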
 
Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/):
"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text
using UCS-4 indices. However, while converting from a UCS-4 index to a
UTF-16 index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as UCS-4 characters results in a
10X degradation. Of course, the precise differences will depend on the
compiler, and there are some interesting optimizations that can be
performed, but it will always be slower on average. This kind of
performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. Low-level character indexing
is done at the common storage level, with higher-level mechanisms for
graphemes or words specifying their boundaries in terms of the storage
units. This provides efficiency at the low levels, and the required
functionality at the high levels.

Convenience APIs can be produced that take UCS-4 parameters for
common utilities: e.g. converting UCS-4 indices back and
forth, accessing character properties, etc. Outside of indexing, differences
between UCS-4 and UTF-16 are not as important. For most other APIs
outside of indexing, character values cannot really be considered
outside of their context--not when you are writing internationalized code.
For such operations as display, input, collation, editing, and even upper
and lowercasing, characters need to be considered in the context of a
string. That means that in any event you end up looking at more than one
character. In our experience, the incremental cost of doing surrogates is
pretty small.
"""

> Like I said in the other mail note: two large platforms out there are
> UCS-2 based. They seem to be doing quite well with that approach.
> 
> If people truly need UCS-4, then they can work with that on their own. One
> of the major reasons for putting Unicode into Python is to
> increase/simplify its ability to speak to the underlying platform. Hey!
> Guess what? That generally means UCS2.

All those formats are upward compatible (within certain ranges), and
the Python Unicode API will provide converters between its internal
format and the few common Unicode implementations, e.g. for the MS
compilers (16-bit UCS2 AFAIK) and GLIBC (32-bit UCS4).
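
As a rough sketch of what such a converter amounts to (the helper is
hypothetical, not the proposed API):

def to_platform_units(code_points, wchar_bits):
    # Convert UCS4 code points to platform wide-char units.
    # wchar_bits=32 corresponds to GLIBC's UCS4 wchar_t, wchar_bits=16
    # to the 16-bit type used by the MS compilers.
    if wchar_bits == 32:
        return list(code_points)            # UCS4: pass through unchanged
    units = []
    for cp in code_points:
        if cp <= 0xFFFF:
            units.append(cp)                # identical to UCS2 in this range
        else:
            cp -= 0x10000                   # needs a UTF-16 surrogate pair
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units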
 
> If we didn't need to speak to the OS with these Unicode values, then
> people can work with the values entirely in Python,
> PyUnicodeType-be-damned.
> 
> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.
> 
> Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
> vice-versa. But: it only does it from String to String -- you can't use
> Unicode objects anywhere in there.

See above.
 
> > Simply sticking to UCS2 is probably out of the question,
> > since Unicode 3.0 requires UCS4 and we are targeting
> > Unicode 3.0.
> 
> Oh? Who says?

From the FAQ:
"""
Q: What is UTF-16?

Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the addition
of over 14,500 composite characters for compatibility with legacy sets, it
became clear that 16-bits were not sufficient for the user community. Out
of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for
UTF-16, meaning that in practice the difference between UCS-2 and
UTF-16 is probably negligible; e.g. we could define the internal
format to be UTF-16 and raise an exception whenever the border
between UTF-16 and UCS-2 is crossed -- sort of as a political
compromise ;-).
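
In code, that compromise would amount to little more than a range
check, e.g. (sketch only):

def check_ucs2_range(code_points):
    # Keep UTF-16 as the internal format, but refuse anything that
    # would actually require a surrogate pair.
    for cp in code_points:
        if cp > 0xFFFF:
            raise ValueError("U+%06X is outside the UCS-2 range" % cp)
    return code_points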

But... I think HP has the last word on this one.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/