[Python-Dev] Pre-PEP: Python Character Model

Paul Prescod il8n-sig@python.org
Wed, 07 Feb 2001 12:49:15 -0800


Neil Hodgson wrote:
> 
> ...
> 
>    Matz: "We don't believe there can be any single character-encoding that
> encompasses all the world's languages.  We want to handle multiple encodings
> at the same time (if you want to)."
> 
>    The approach taken in the next version of Ruby is for all string and
> regex objects to have an encoding attribute and for there to be
> infrastructure to handle operations that combine encodings.

I think Python should support as many encodings as people invent.
Conceptually it doesn't cost me anything, but I'll leave the
implementation to you. :)

But an encoding is only a way of *representing a character in memory or
on disk*. Asking for Python to support multiple encodings in memory is
like asking for it to support both two's complement and one's complement
long integers. Multiple encodings can only be interesting as a
performance issue because the in-memory encoding is *transparent* to the
*Python programmer*.

We could support a thousand encodings internally but a Python programmer
should never know or care which one they are dealing with. Which leads
me to ask "what's the point"? Would the small performance gains be worth
it?
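
To make the transparency point concrete, here is a minimal sketch. It
assumes a present-day Python 3 interpreter, where str is a sequence of
Unicode characters, rather than the Python under discussion in this
thread. The sample text, the helper name round_trip and the choice of
three encodings are all illustrative assumptions. The same characters
are written out in three encodings; the byte counts differ, but the
string the programmer sees after decoding is identical each time.

```python
# Illustrative sketch (assumes Python 3): one abstract character
# sequence, several byte-level encodings of it.
text = "caf\u00e9 \u65e5\u672c\u8a9e"  # "cafe" with an accent, plus three CJK characters

# Hypothetical selection of encodings, chosen only for illustration.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]


def round_trip(s, encoding):
    """Encode s to bytes with the given codec, then decode it back."""
    return s.encode(encoding).decode(encoding)


# The size in bytes depends on the encoding chosen...
byte_lengths = {enc: len(s_bytes) for enc, s_bytes in
                ((enc, text.encode(enc)) for enc in encodings)}

# ...but the characters recovered do not.
round_trips = {enc: round_trip(text, enc) for enc in encodings}

for enc in encodings:
    print(enc, byte_lengths[enc], round_trips[enc] == text)
```

The length of the string in characters stays the same whichever
encoding was used for the bytes, which is the sense in which the
representation is invisible at the character level.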

>    One of the things that is needed in a project that tries to fulfill the
> needs of large character set users is to have some of those users involved
> in the process. When I first saw proposals to use Unicode in products at
> Reuters back in 1994, it looked to me (and the proposal originators) as if
> it could do everything anyone ever needed. It was only after strenuous and
> persistent argument from the Japanese and Hong Kong offices that it became
> apparent that Unicode just wasn't enough. A partial solution then was to
> include language IDs encoded in the Private Use Area. This was still being
> discussed when I left but while it went some way to satisfying needs, there
> was still some unhappiness.

I think that Unicode has changed quite a bit since 1994. Nevertheless,
language IDs are a fine solution. Unicode is not about distinguishing
between languages -- only characters. There is no better "non-Unicode"
solution that I've ever heard of.

>    If Python could cooperate with Ruby here, then not only could code be
> shared but Python would gain access to developers with large character set
> /needs/ and experience.

I don't see how we could meaningfully cooperate on such a core language
issue. We could of course share codecs but that has nothing to do with
Python's internal representation.

 Paul Prescod