[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

Andy Robinson andy@reportlab.com
Wed, 7 Feb 2001 23:06:12 -0000


> The last time we went around there was an anti-Unicode faction who
> argued that adding Unicode support was fine but making it
> the default would inconvenience Japanese users.
Whoops, I nearly missed the biggest debate of the year!

I guess the faction was Brian and I, and our concerns were
misunderstood.  We can lay this to rest forever now as the current
implementation and forward direction incorporate everything I
originally hoped for:

(1) Frequently you need to work with byte arrays, but still want a
rich set of string-like routines: search and replace, regexes etc.
This applies both to non-natural-language data and to the special
case of corrupt native encodings that need repair.  We loosely
defined the 'string interface' in UserString, so that other people
could define string-like types if they wished, and so that users
can expect to find the same methods and operations in both Unicode
and Byte Array types.

I'd be really happy one day to explicitly type
  x = ByteArray('some raw data')
as long as I had my old friends split, join, find etc.
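
For concreteness, a minimal sketch of such a type, built on the
UserString 'string interface' mentioned above (ByteArray here is
illustrative, not an existing API; the import path is the modern
one):

  from collections import UserString

  # Illustrative only: inherits the whole 'string interface'
  # (split, join, find, replace, ...) from UserString.
  class ByteArray(UserString):
      pass

  x = ByteArray('some raw data')
  print(x.split())                  # ['some', 'raw', 'data']
  print(x.find('raw'))              # 5
  print(x.replace('raw', 'fixed'))  # some fixed data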

(2) Japanese projects often need small extensions to codecs
to deal with user-defined characters.  Java and VB give you
some canned codecs but no way to extend them.  All the draft
Python Asian codecs involve 'open' code you can hack, and they
use simple dictionaries for mapping tables; so it will be really
easy to roll your own "Shift-JIS-plus" with 20 extra characters
mapping to a private use area.  This will be a huge win over
other languages.
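
As a sketch of how a "Shift-JIS-plus" could be rolled with today's
codecs module: the tables below are toy stand-ins (a real Shift-JIS
table has thousands of entries, and being multibyte it would need a
proper incremental codec rather than a single-byte charmap); the
point is only that the mapping tables are plain dictionaries you
can extend.

  import codecs

  # Toy decoding table: byte value -> Unicode code point.
  # ASCII passthrough stands in for the real Shift-JIS table.
  decode_table = {i: i for i in range(0x80)}

  # 20 user-defined characters mapped to the Private Use Area.
  for i in range(20):
      decode_table[0xF0 + i] = 0xE000 + i
  encode_table = {cp: byte for byte, cp in decode_table.items()}

  def _search(name):
      if name != 'shift-jis-plus':
          return None
      return codecs.CodecInfo(
          name='shift-jis-plus',
          encode=lambda s, errors='strict':
              codecs.charmap_encode(s, errors, encode_table),
          decode=lambda data, errors='strict':
              codecs.charmap_decode(data, errors, decode_table),
      )

  codecs.register(_search)
  print(b'\xf0 ok'.decode('shift-jis-plus'))   # '\ue000 ok'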

(3) The Unicode conversion was based on a more general notion of
'stream conversion filters' which work with bytes.  This leaves the
door open to writing, for example, a direct Shift-JIS-to-EUC filter
which gains nothing over going via Unicode on clean data, but is
much more robust in the face of user-defined characters and can
handle cleanup of misencoded data.  Some of us hope to provide
general machinery for fast handling of byte-stream filters, which
could be useful in image processing and crypto as well as in
encodings.  This might need an extended or different lookup function
(after all, neither end of the filter need be Unicode) but could be
cleanly layered on top of the codec mechanism we have built in.
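
A rough sketch of the shape such a byte-stream filter might take
(names here are illustrative, not an existing API):

  # Neither end need be Unicode: 'transform' is any bytes -> bytes
  # function (re-encoding, cleanup, decompression, decryption, ...).
  def filter_stream(reader, writer, transform, chunk_size=4096):
      while True:
          chunk = reader.read(chunk_size)
          if not chunk:
              break
          writer.write(transform(chunk))

  # NB: a real Shift-JIS-to-EUC transform would have to carry state
  # across chunk boundaries, as incremental codecs do, since a
  # multibyte sequence can be split between two chunks.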

(4) I agree 100% on being explicit whenever you do I/O or
conversion, and on generally using Unicode characters where
possible.  Defaults are evil.  But we needed a compatibility route
to get there.  Guido has said that in the long term there will be
Unicode strings and Byte Arrays.  That's the time to require an
explicit encoding argument to open().
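
In today's terms, being explicit at every boundary looks something
like this (Shift-JIS as the example encoding):

  raw = 'テスト'.encode('shift_jis')   # Unicode text -> Shift-JIS bytes
  text = raw.decode('shift_jis')       # bytes -> text; no guessed default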

> Similarly, we could improve socket objects so that they have
> different readtext/readbinary and writetext/writebinary without
> unifying the string objects.  There are lots of small changes we
> can make without breaking anything.  One I would like to see
> right now is a unification of chr() and unichr().

Here's a thought.  How about BinaryFile/BinarySocket/ByteArray,
which do not need an encoding, and File/Socket/String, which
require explicit encodings on opening.  We keep broad parity
between their methods.  That seems more straightforward to me than
having text/binary methods, and it also provides a cleaner upgrade
path for existing code.
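
As a sketch of how that split might look at the call sites (Python
3's open() eventually landed close to this; BinaryFile/File map
onto its binary and text modes):

  raw = open('data.bin', 'rb')          # no encoding needed: yields bytes
  text = open('data.txt', 'r',
              encoding='shift_jis')     # explicit encoding: yields text

  chunk = raw.read(16)                  # bytes
  line = text.readline()                # str (Unicode)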


- Andy