[Python-Dev] Internationalization Toolkit

Greg Stein gstein@lyra.org
Thu, 11 Nov 1999 02:46:55 -0800 (PST)


On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
>...
> Well almost... it depends on the current value of <default encoding>.

Default encodings are kind of nasty when they can be altered. The same
problem occurred with import hooks. Only one can be present at a time.
This implies that modules, packages, subsystems, whatever, cannot set a
default encoding because something else might depend on it having a
different value. In the end, nobody uses the default encoding because it
is unreliable, so you end up with extra implementation/semantics that
aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc., never
define an import hook?

I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want
something else, then do that explicitly.
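
A minimal sketch of what "do that explicitly" could look like. It uses
present-day Python spellings (str.encode / bytes.decode), which are an
assumption here and not the API being designed in this thread; the helper
name decode_utf8 is hypothetical.

```python
# Illustrative only: modern Python, not the 1999 proposal's API.
text = "caf\u00e9"

# The caller names the encoding at every conversion boundary,
# so nothing depends on a mutable process-wide default.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'


def decode_utf8(data: bytes) -> str:
    """Hypothetical helper: always decodes as UTF-8, never a global default."""
    return data.decode("utf-8")


# Round trip works when both sides agree on the encoding.
round_trip = decode_utf8(utf8_bytes)

# Bytes written as Latin-1 are rejected instead of silently misread.
try:
    decode_utf8(latin1_bytes)
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
```

With a fixed default, a mismatch such as the Latin-1 bytes above surfaces
at the call site that made the wrong assumption.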

>...
> Another problem is that Unicode types differ between platforms
> (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
> wchar_t). Depending on the internal format of Unicode objects
> this could mean calling different conversion APIs.

Exactly the reason to avoid wchar_t.
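
The platform difference is easy to observe. This is a small sketch that
assumes the ctypes module (a later addition to the standard library, not
something available when this was written); it reports the width of the C
library's wchar_t on whatever platform runs it.

```python
import ctypes

# Size of the platform's wchar_t in bytes: typically 2 with the Microsoft
# C runtime and 4 with glibc, which is the mismatch described above.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)
WCHAR_BITS = WCHAR_BYTES * 8

print("wchar_t is %d bits wide on this platform" % WCHAR_BITS)
```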

> BTW, I'm still not too sure about the underlying internal format.
> The problem here is that Unicode started out as a 2-byte fixed-length
> representation (UCS2) but then shifted towards a 4-byte fixed-length
> representation known as UCS4. Since having 4 bytes per character
> is a hard sell to customers, UTF16 was created to stuff the UCS4
> code points (this is how character entities are called in Unicode)
> into 2 bytes... with a variable length encoding.

History is basically irrelevant. What is the situation today? What is in
use, and what are people planning for right now?

>...
> The downside of using UTF16: it is a variable length format,
> so iterations over it will be slower than for UCS4.

Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
doing (as I recall).

Why go with a variable length format, when people seem to be doing fine
with UCS-2?
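
To make the fixed-versus-variable distinction concrete, here is a sketch
in present-day Python (an assumption; the "utf-16-le" codec name and the
helper utf16_units are not part of this proposal). A code point inside
the 16-bit range takes one 16-bit unit, exactly as in UCS-2; a code point
above U+FFFF takes a surrogate pair, which is what makes UTF-16 variable
length.

```python
import struct


def utf16_units(text):
    """Hypothetical helper: return the 16-bit code units UTF-16 uses for text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# U+20AC (euro sign) fits in 16 bits: one unit, same as UCS-2.
bmp_units = utf16_units("\u20ac")

# U+10000 does not fit in 16 bits: UTF-16 spends two units (a surrogate pair).
astral_units = utf16_units("\U00010000")

print([hex(u) for u in bmp_units])     # ['0x20ac']
print([hex(u) for u in astral_units])  # ['0xd800', '0xdc00']
```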

Like I said in the other mail note: two large platforms out there are
UCS-2 based. They seem to be doing quite well with that approach.

If people truly need UCS-4, then they can work with that on their own. One
of the major reasons for putting Unicode into Python is to
increase/simplify its ability to speak to the underlying platform. Hey!
Guess what? That generally means UCS2.

If we didn't need to speak to the OS with these Unicode values, then
people could work with the values entirely in Python,
PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big
platforms that have the same hole to dig out of *IF* it ever comes to
that. I posit that it won't be necessary; that the people needing UCS-4
can handle it entirely in Python.

Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
vice-versa. But: it only does it from String to String -- you can't use
Unicode objects anywhere in there.
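
A rough sketch of such a String-to-String converter, written in
present-day Python. Everything about it is an assumption made for
illustration: the function name ucs4_to_utf8, big-endian input, and the
modern Unicode range limit of U+10FFFF. It takes a byte string of 4-byte
code points and returns a UTF-8 byte string without building a Unicode
object along the way.

```python
import struct


def ucs4_to_utf8(data: bytes) -> bytes:
    """Hypothetical: big-endian UCS-4 byte string -> UTF-8 byte string."""
    if len(data) % 4:
        raise ValueError("UCS-4 input must be a multiple of 4 bytes")
    out = bytearray()
    for (cp,) in struct.iter_unpack(">I", data):
        # Assumption: accept only modern Unicode scalar values.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %#x" % cp)
        if cp < 0x80:            # 1 byte: 0xxxxxxx
            out.append(cp)
        elif cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


# Three code points: U+0041, U+20AC, U+10348.
sample = struct.pack(">3I", 0x41, 0x20AC, 0x10348)
converted = ucs4_to_utf8(sample)
```

The reverse direction would follow the same pattern, reading lead and
continuation bytes and packing each code point into 4 bytes.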

> Simply sticking to UCS2 is probably out of the question,
> since Unicode 3.0 requires UCS4 and we are targeting
> Unicode 3.0.

Oh? Who says?

Cheers,
-g

--
Greg Stein, http://www.lyra.org/