[I18n-sig] SIG charter and goals

Andy Robinson andy@robanal.demon.co.uk
Wed, 09 Feb 2000 02:10:34 GMT


On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:

>> 2. Encodings API and library:
>> --------------------------------
>>
>> We must deliver an encodings library which surpasses the features of
>> that in Java.  It should allow conversion between many common
>> encodings; access to Unicode character properties; and anything else
>> which makes encoding conversion more pleasant.  This should be
>> initially based on MAL's draft specification, although the spec may
>> be changed if we find good reason to.
>
>Note that Python will have built-in codec support. The details
>are described in the proposal paper (not the C API, though --
>that still lives in the .h files of the Unicode implementation).
>
>Note that I have had good experiences with the existing
>spec: it is very flexible, extensible and versatile. It also
>greatly reduces coding effort by providing working base classes.
>
I can't wait to try the code, and based on the spec I cannot foresee
any problems at the moment.  However, it was only discussed on the
Python-dev list, and Marc-Andre was not at IPC8, so I should try to
explain some background for everyone (and what my agenda as SIG
moderator is, too!)

1. HP joined the Python consortium and pushed for Unicode support last
year.  There was a detailed discussion on the Python-dev list (to
which I was invited because my day-job included some very messy
double-byte work in Python for a year).  Marc-Andre's proposal went
through about eight iterations, and he started to code it up under
contract to CNRI.  This is official work, and there is no question of
anybody else's Unicode modules being used - sorry!  Fredrik Lundh's
work on the Unicode regex engine is also under contract and
progressing rapidly.

2. MAL's document defines the API for 'codecs' - conversion filters -
but his task does not include delivering a package with all the
world's common encodings in it.  That is a necessity in the long run,
and both I (through ReportLab) and Digital Garage need to get at
least the Japanese encodings working quite soon.
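
For anyone who has not read the proposal yet, the heart of the
interface is pleasantly small.  Paraphrasing from my reading of the
draft - so the names below are mine, not gospel; check the proposal
paper for the real signatures - a codec boils down to roughly:

    class Codec:
        def encode(self, input, errors='strict'):
            # returns (converted string, length of input consumed)
            raise NotImplementedError
        def decode(self, input, errors='strict'):
            # returns (Unicode object, length of input consumed)
            raise NotImplementedError

On top of that the spec layers stream readers and writers, and those
are where the working base classes MAL mentions save you most of the
typing.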

(Marc-Andre, can you update us on what codecs you are providing, and
how they are implemented?  C or Python?)

3. At IPC8 we discussed (among other things) the delivery of the codec
package - both in the i18n forum and in the corridors as usual!  To do
what Java does, we eventually need codecs for 50+ common encodings,
all available and tested.  These will almost certainly not be in the
standard distribution, but there should eventually be a single,
certified, tested source for them, as this stuff has to be 100% right.
Quite a few of us urgently need good Japanese support.

The current spec does not say whether codecs should be in C or Python.
Guido expressed the hope that a few carefully chosen C routines could
allow us to write new filters in Python, but get most of the speed of
C - an idea I'd been drip-feeding him for some time :-)  I think that
is a proper task for this group, and one I hope to put a lot of work
into.  I'm personally hoping that we can do a sort of mini-mxTextTools
state machine with actions for lookups in single-byte mapping tables,
double-byte mapping tables and other things, so that new encodings can
be written and added easily, yet still run fast.  For example, all
single-byte encodings can be handled by a streaming version of
something like string.translate(), so adding a new one just becomes a
matter of adding a 256-element list to a file somewhere.  I believe
most of the double-byte ones can be reduced to a few kb with the right
functions as well.  I'll be ready to talk more about this shortly; the
single-byte case is sketched below.
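
To make that concrete, here is a toy sketch in Python.  The table
override and the function names are invented for illustration; a real
codec would define all 256 slots and do proper error handling:

    # Decoding table: index is the byte value, entry is the Unicode
    # character it maps to.  Start from Latin-1 and override slots
    # as needed (this override is made up for the example).
    decoding_table = map(unichr, range(256))
    decoding_table[0xA4] = u'\u20AC'

    # Build the reverse table for encoding.
    encoding_table = {}
    for i in range(256):
        encoding_table[decoding_table[i]] = chr(i)

    def decode(s):
        # One table lookup per byte - this loop is exactly the
        # part that a few generic C routines could do at speed.
        chars = []
        for c in s:
            chars.append(decoding_table[ord(c)])
        return u''.join(chars)

    def encode(u):
        chars = []
        for c in u:
            chars.append(encoding_table[c])  # KeyError = unmappable
        return ''.join(chars)

The only per-encoding piece is the 256-element table; everything else
is shared machinery.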

Guido also made it clear that while MAL's proposal is considered
pretty good, it is not set in stone yet. In particular, if the
double-byte specialists find that some minor tweaks would make their
lives better, he would consider it; we need a real-world test-drive
before 1.6, and this group is the place to do it.

Now for my own opinions on how things should be run henceforth.  Feel
free to differ!

I should point out that the inner circle of Python developers are NOT
experts in multi-byte data.  I feel strongly that we should seek out
the best expertise in the world, starting now.  This discussion will
not focus on the Unicode string implementation in the core, but on
what our encoding library lets you do at the application level.  Ken
Lunde, author of "CJKV Information Processing", is the acknowledged
world leader in this field, and has agreed to take part in a
discussion and review our proposals - I'll try to bring him in
shortly.  It would also be good to collar some people involved in the
Java i18n libraries and ask what they would do differently next time
around, and to talk to people who have worked with commercial tools
like Unilib and Rosette.  Then we won't just hope that Python has the
best i18n support; we'll know it.  Naturally this review needs to
happen fairly promptly, in March/April - maybe best to wait until we
can run the code.

I hope this helps a little.  If people have serious issues about where
things are heading, let's hear them now.

Best Regards,

Andy Robinson

p.s. One thing I would be very interested to hear is what people's
angles are - relevant experience, willingness to help out, needs for
solutions, etc.!