From pf@artcom-gmbh.de  Thu Feb  3 22:01:03 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Thu, 3 Feb 2000 23:01:03 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
Message-ID: 

This is a short list of some resources available now, which may help
people to i18n their Python applications:

* fintl.py -- A pure Python module for reading .mo files created by msgfmt
  URL    : 
  Author : Peter Funk  <-- That's me ;-)
  License: Pythonic
  DOC    : inline, no .tex available yet

* intl.so -- An interface to the (GNU) gettext C library.  Only useful
  for international applications if they are covered by the GPL and
  will run under Unix/Linux.
  URL    : 
  Author : Martin von Löwis
  License: GPL? (due to use of GNU gettext)
  DOC    : README available and an IPC article; no .tex yet

* GNU gettext -- Suite of utilities for i18n
  URL    : http://www.gnu.org/software/gettext/gettext.html
  License: GPL

* pygettext.py -- Barry Warsaw's reimplementation of gettext in pure Python
  URL    : ?  Oops, can't find it at the moment.  Look on Barry's home page.
  Author : Barry Warsaw
  License: Pythonic
  DOC    : inline doc strings

Further reading:
http://www.python.org/workshops/1997-10/proceedings/loewis.html

Regards from Germany, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From guido@python.org  Thu Feb  3 22:31:51 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 03 Feb 2000 17:31:51 -0500
Subject: [I18n-sig] Useful resources available now
In-Reply-To: Your message of "Thu, 03 Feb 2000 23:01:03 +0100."
References: 
Message-ID: <200002032231.RAA00550@eric.cnri.reston.va.us>

Great list.  Could someone turn it into HTML for easy pasting into the
sig's home page?
--Guido van Rossum (home page: http://www.python.org/~guido/)

From pf@artcom-gmbh.de  Thu Feb  3 23:13:02 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 4 Feb 2000 00:13:02 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
In-Reply-To: <200002032231.RAA00550@eric.cnri.reston.va.us> from Guido van Rossum at "Feb 3, 2000 5:31:51 pm"
Message-ID: 

[Guido]:
> Great list.  Could someone turn it into HTML for easy pasting into the
> sig's home page?

I think my list is still rather incomplete.  There was a thread on
comp.lang.python last year, where François Pinard (spelling?)
announced something.  Unfortunately I don't remember right now and
can't find it. :-(

Perhaps we should wait some days.  At least until the list maintainer
has had a chance to subscribe to this i18n-sig. ;-)

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From andy@robanal.demon.co.uk  Mon Feb  7 21:25:39 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 07 Feb 2000 21:25:39 GMT
Subject: [I18n-sig] SIG charter and goals
Message-ID: <38a03848.685217@post.demon.co.uk>

Apologies to everyone for taking so long to get started; I have been
on the road between IPC8 and today, and hampered by a broken-down
laptop.

I'd like to confirm with everyone what we discussed at IPC8, and try
to outline what I see as the SIG's charter.  If this is agreed, I will
put it up on a SIG home page this week.

I propose that the SIG's deliverables are:

1. Support addition of Unicode to the Python core for 1.6:
------------------------------------------------------------------
The key tasks are to add Unicode string support to the Python core
(MAL), and add a new Unicode regex engine (Fredrik).  These are both
well underway.  This group should assist with testing, and be the
primary forum for feedback on those features.

2.
Encodings API and library:
--------------------------------
We must deliver an encodings library which surpasses the features of
that in Java.  It should allow conversion between many common
encodings; access to Unicode character properties; and anything else
which makes encoding conversion more pleasant.  This should be
initially based on MAL's draft specification, although the spec may be
changed if we find good reason to.

There will be an inevitable initial focus on Japanese support due to
the key people involved.  However, if we can do that well, then other
encodings should be less of a problem.

3. Locales:
--------------
Implement a candidate module for the standard library offering support
for the world's date, time, money and number formats, and for time
zones.

4. Application Localization:
-----------------------------------
This group is the intended focal point for frameworks for localizing
both conventional applications and Python-powered web sites.  This
field is very large and varied, and we set no targets for delivering
'a solution'; however, we hope to generate discussion, how-tos and
references to examples of good and bad practice in this area.

5. Internationalizing Pythonwin and IDLE
-----------------------------------------------------
There are some current bugs/features in these environments which
seriously hamper use in double-byte environments.  We should try to
get these stamped out.

Opinions, anyone?  Have I missed any major topics?  Are there any best
left out of the SIG's charter?

- Andy

From alex@ank-sia.com  Tue Feb  8 13:47:07 2000
From: alex@ank-sia.com (alexander smishlajev)
Date: Tue, 08 Feb 2000 15:47:07 +0200
Subject: [I18n-sig] Useful resources available now
Message-ID: <38A01E5B.C8FFE9DE@turnhere.com>

hello Peter!

thanks for the list! i've seen all of this, but it is nice to have a
common summary. some additions:

* unicode.so -- A unicode string implementation for Python 1.5.
  URL    : http://www.pythonware.com/madscientist/
  Author : Fredrik Lundh
  License: Pythonic

* PyRecode -- wrapper around the librecode/Recode utility (a GPL tool
  for converting text from one character set to another).
  URL    : http://www.suxers.de/python/pyrecode.htm
  Author : Andreas Jung

* pynicode -- Unicode support and character set translation for Python.
  pure Python module suite started as a translation of Perl Unicode
  modules.
  URL    : http://sourceforge.net/project/?group_id=1825
  Author : alexander smishlajev
  License: MIT-style
  DOC    : inline

best wishes, alex.

From pf@artcom-gmbh.de  Tue Feb  8 13:52:51 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 8 Feb 2000 14:52:51 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
In-Reply-To: <38A01E5B.C8FFE9DE@turnhere.com> from alexander smishlajev at "Feb 8, 2000 3:47: 7 pm"
Message-ID: 

Hi alex!

> thanks for the list! i've seen all of this, but it is nice to have
> common summary. some additions:
[...]

I will add them to my list and convert it into HTML later this week.
Thank you for your additions.  I was not aware of them.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From mal@lemburg.com  Tue Feb  8 14:11:06 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 15:11:06 +0100
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com>
Message-ID: <38A023FA.F855BA17@lemburg.com>

alexander smishlajev wrote:
> 
> hello Peter!
> 
> thanks for the list! i've seen all of this, but it is nice to have
> common summary. some additions:
> 
> * unicode.so -- A unicode string implementation for Python 1.5.
>   URL    : http://www.pythonware.com/madscientist/
>   Author : Fredrik Lundh
>   License: Pythonic
> 
> * PyRecode -- wrapper around the librecode/Recode utility (a GPL tool
>   for converting text from one character set to another).
>   URL    : http://www.suxers.de/python/pyrecode.htm
>   Author : Andreas Jung
> 
> * pynicode -- Unicode support and character set translation for Python.
>   pure Python module suite started as a translation of Perl Unicode
>   modules.
>   URL    : http://sourceforge.net/project/?group_id=1825
>   Author : alexander smishlajev
>   License: MIT-style
>   DOC    : inline

FYI, Python 1.6 will have native Unicode support.  There's no need to
duplicate work in that area... better wait until the first versions
ship and then build on top of the existing implementation, IMHO
anyways ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com  Tue Feb  8 14:31:43 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 15:31:43 +0100
Subject: [I18n-sig] SIG charter and goals
References: <38a03848.685217@post.demon.co.uk>
Message-ID: <38A028CF.8188427D@lemburg.com>

Andy Robinson wrote:
> 
> Apologies to everyone for taking so long to get started; I have been
> on the road between IPC8 and today, and hampered by a broken-down
> laptop.
> 
> I'd like to confirm with everyone what we discussed at IPC8, and try
> to outline what I see as the SIG's charter.  If this is agreed, I will
> put it up on a SIG home page this week.
> 
> I propose that the SIG's deliverables are:
> 
> 1. Support addition of Unicode to the Python core for 1.6:
> ------------------------------------------------------------------
> The key tasks are to add Unicode string support to the Python core
> (MAL), and add a new Unicode regex engine (Fredrik).  These are both
> well underway.  This group should assist with testing, and be the
> primary forum for feedback on those features.

FYI, the Unicode stuff will go into the public CVS version sometime in
March.

> 2.
> Encodings API and library:
> --------------------------------
> 
> We must deliver an encodings library which surpasses the features of
> that in Java.  It should allow conversion between many common
> encodings; access to Unicode character properties; and anything else
> which makes encoding conversion more pleasant.  This should be
> initially based on MAL's draft specification, although the spec may
> be changed if we find good reason to.

Note that Python will have built-in codec support.  The details are
described in the proposal paper (not the C API though -- that still
lives in the .h files of the Unicode implementation).

Note that I have had good experiences with the existing spec: it is
very flexible, extensible and versatile.  It also greatly reduces
coding effort by providing working base classes.

> There will be an inevitable initial focus on Japanese support due to
> the key people involved.  However, if we can do that well, then other
> encodings should be less of a problem.
> 
> 3. Locales:
> --------------
> Implement a candidate module for the standard library offering support
> for the world's date, time, money and number formats, and for time
> zones.

Hmm, I'd suggest leaving this out of the core and providing it through
third-party extensions which are then shipped by some Python
distribution party.

> 4. Application Localization:
> -----------------------------------
> This group is the intended focal point for frameworks for localizing
> both conventional applications and Python-powered web sites.  This
> field is very large and varied, and we set no targets for delivering
> 'a solution'; however, we hope to generate discussion, how-tos and
> references to examples of good and bad practice in this area.
> 
> 5. Internationalizing Pythonwin and IDLE
> -----------------------------------------------------
> There are some current bugs/features in these environments which
> seriously hamper use in double-byte environments.
> We should try to get these stamped out.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From alex@ank-sia.com  Tue Feb  8 16:45:20 2000
From: alex@ank-sia.com (alexander smishlajev)
Date: Tue, 08 Feb 2000 18:45:20 +0200
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com> <38A023FA.F855BA17@lemburg.com>
Message-ID: <38A04820.A14681EC@turnhere.com>

"M.-A. Lemburg" wrote:
> 
> FYI, Python 1.6 will have native Unicode support.

yes.  unfortunately, i did not know about this at the time of
publishing pynicode.  now i see that i was reinventing the same things
that are listed in your proposal at
http://starship.skyport.net/~lemburg/unicode-proposal.txt
sorry for that.

by the way, don't you think that standard codecs should include _all_
iso8859 encodings?  MS Windows codepages?

> no need to duplicate work in that area... better wait until
> the first versions ship and then build on top of the
> existing implementation, IMHO anyways ;-)

i think that it would be nice to have a compatible (maybe less
functional) stand-alone module as a temporary solution until Python
1.6 is released.  as far as i remember, about half of that resource
list was published within the last half year.  today i have met
another one:
http://starship.python.net/crew/gherman/playground/calie/calie.py
IMHO such frequency of different modules appearing testifies that
charset conversion is badly needed, as soon as possible.

best wishes, alex.

From herzog@online.de  Tue Feb  8 18:00:58 2000
From: herzog@online.de (Bernhard Herzog)
Date: 08 Feb 2000 19:00:58 +0100
Subject: [I18n-sig] Useful resources available now
References: 
Message-ID: 

pf@artcom-gmbh.de (Peter Funk) writes:

> > thanks for the list! i've seen all of this, but it is nice to have
> > common summary. some additions:
> [...]
> 
> I will add them to my list and convert it into HTML later this
> week. Thank you for your additions. I was not aware of them.

François Pinard's po-utils haven't been mentioned yet, I think:
http://www.iro.umontreal.ca/contrib/po/po-utils/

They contain xpot, a replacement for xgettext that understands Python
syntax, and the po-mode for Emacs.

-- 
Bernhard Herzog   | Sketch, a drawing program for Unix
herzog@online.de  | http://sketch.sourceforge.net/

From mal@lemburg.com  Tue Feb  8 18:18:26 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 19:18:26 +0100
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com> <38A023FA.F855BA17@lemburg.com> <38A04820.A14681EC@turnhere.com>
Message-ID: <38A05DF2.1236DE1D@lemburg.com>

alexander smishlajev wrote:
> 
> "M.-A. Lemburg" wrote:
> > 
> > FYI, Python 1.6 will have native Unicode support.
> 
> yes. unfortunately, i did not know about this at the time of publishing
> pynicode. now i see that i was reinventing the same things that are
> listed in your proposal at
> http://starship.skyport.net/~lemburg/unicode-proposal.txt
> sorry for that.
> 
> by the way, don't you think that standard codecs should include _all_
> iso8859 encodings? MS Windows codepages?

Sure, but not in the core.  I have converted all mapping tables at
http://www.unicode.org to dictionary tables usable by Python.  Turns
out that this produces 4MB of static data... as a result I want to
include a generic mapping table codec which can use these tables and
then make the mapping tables downloadable separately.

> > no need to duplicate work in that area... better wait until
> > the first versions ship and then build on top of the
> > existing implementation, IMHO anyways ;-)
> 
> i think that it would be nice to have a compatible (maybe less
> functional) stand-alone module as a temporary solution until Python 1.6
> is released. as far as i remember, about a half of that resource list
> was published within last half of a year. today i have met another one:
> http://starship.python.net/crew/gherman/playground/calie/calie.py IMHO
> such frequency of different modules appearing testifies that charset
> conversion is badly needed, as soon as possible.

Hey, it's only a few more weeks until the CVS tree has the code
publicly available for everyone to download and test :-)

[If you can't wait, have your company join the Python Consortium to
get early access.
The more companies join, the faster Python will move towards full
business awareness.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:42 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:42 GMT
Subject: [I18n-sig] Locales module
In-Reply-To: <38A028CF.8188427D@lemburg.com>
References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com>
Message-ID: <38a4c7cc.9705336@post.demon.co.uk>

On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:

>> 3. Locales:
>> --------------
>> Implement a candidate module for the standard library offering support
>> for the world's date, time, money and number formats, and for time
>> zones.
>
>Hmm, I'd suggest to leave this out of the core and provide it
>through third party extensions which are then shipped by some
>Python distribution party.

This is definitely not an issue for the language core, and I agree it
should start out as something separate.  There was some discussion of
it going into the standard library in due course, and Guido did not
say 'no'!

Would anyone like to take this on?  It isn't really my field.  I guess
we should start by reviewing what other systems do and how well they
work.

- Andy

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:34 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:34 GMT
Subject: [I18n-sig] SIG charter and goals
In-Reply-To: <38A028CF.8188427D@lemburg.com>
References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com>
Message-ID: <38a6c8e8.9989175@post.demon.co.uk>

On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:

>> 2. Encodings API and library:
>> --------------------------------
>>
>> We must deliver an encodings library which surpasses the features of
>> that in Java.
>> It should allow conversion between many common
>> encodings; access to Unicode character properties; and anything else
>> which makes encoding conversion more pleasant.  This should be
>> initially based on MAL's draft specification, although the spec may
>> be changed if we find good reason to.
>
>Note that Python will have a builtin codec support.  The details
>are described in the proposal paper (not the C API though --
>that still lives in the .h files of the Unicode implementation).
>
>Note that I have made some good experience with the existing
>spec: it is very flexible, extendable and versatile.  It also
>greatly reduces coding efforts by providing working baseclasses.
>

I can't wait to try the code, and cannot foresee any problems at the
moment based on the spec.  However, it was only discussed on the
Python-dev list, and Marc-Andre was not at IPC8, so I should try to
explain some background for everyone (and what my agenda as SIG
moderator is too!)

1. HP joined the Python consortium and pushed for Unicode support last
year.  There was a detailed discussion on the Python-dev list (to
which I was invited because my day-job included some very messy
double-byte work in Python for a year).  Marc-Andre's proposal went
through about eight iterations, and he started to code it up under
contract to CNRI.  This is official work, and there is no question of
anybody else's Unicode modules being used - sorry!  Fredrik Lundh's
work on the Unicode regex engine is also under contract and
progressing rapidly.

2. MAL's document defines the API for 'codecs' - conversion filters -
but his task does not include delivering a package with all the
world's common encodings in it.  That is a necessity in the long run,
and both I (through ReportLab) and Digital Garage need to make at
least the Japanese encodings work quite soon.

(Marc-Andre, can you update us on what codecs you are providing, and
how they are implemented?  C or Python?)

3.
At IPC8 we discussed (among other things) the delivery of the codec
package - both in the i18n forum and in the corridors, as usual!  To
do what Java does, we eventually need codecs for 50+ common encodings,
all available and tested.  These will almost certainly not be in the
standard distribution, but there should eventually be a single,
certified, tested source for them, as this stuff has to be 100% right.
Quite a few of us urgently need good Japanese support.

The current spec does not say whether codecs should be in C or Python.
Guido expressed the hope that a few carefully chosen C routines could
allow us to write new filters in Python, but get most of the speed of
C - an idea I'd been drip-feeding to him for some time :-)  I think
that is a proper task for this group, and one I hope to put a lot of
work into.

I'm personally hoping that we can do a sort of mini-mxTextTools state
machine which has actions for lookups in single-byte mapping tables,
double-byte mapping tables and other things, so that new encodings can
be written and added easily, yet still run fast.  For example, all
single-byte encodings can be dealt with by a streaming version of
something like string.translate(), so adding a new one just becomes a
matter of adding a 256-element list to a file somewhere.  I believe
most of the double-byte ones can be reduced to a few kb with the right
functions as well.  I'll be ready to talk more about this shortly.

Guido also made it clear that while MAL's proposal is considered
pretty good, it is not set in stone yet.  In particular, if the
double-byte specialists find that some minor tweaks would make their
lives better, he would consider it; we need a real-world test-drive
before 1.6, and this group is the place to do it.

Now for my own opinions on how things should be run henceforth.  Feel
free to differ!  I should point out that the inner circle of Python
developers are NOT experts in multi-byte data.
I feel strongly that we should seek out the best expertise in the
world, starting now.  This discussion will not focus on Unicode string
implementation in the core, but on what our encoding library lets you
do at the application level.  Ken Lunde, author of "CJKV Information
Processing", is the acknowledged world leader in this field, and
agreed to take part in a discussion and review our proposals - I'll
try to bring him in shortly.  It would also be good to collar some
people involved in the Java i18n libraries and ask what they would do
differently next time around, and to talk to people who have worked
with commercial tools like Unilib and Rosette.  Then we won't just
hope that Python has the best i18n support, we'll know it.

Naturally this review needs to happen fairly promptly in March/April -
maybe best to wait until we can run the code.

I hope this helps a little.  If people have serious issues about where
things are heading, let's hear them now.

Best Regards,

Andy Robinson

p.s. one thing I would be very interested to hear is what people's
angles are - relevant experience, willingness to help out, needs for
solutions etc!

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:40 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:40 GMT
Subject: [I18n-sig] Re: I18n
In-Reply-To: <38A07876.6EEA0655@equi4.com>
References: <38A07876.6EEA0655@equi4.com>
Message-ID: <38a8c913.10032615@post.demon.co.uk>

On Tue, 08 Feb 2000 21:11:37 +0100, Jean-Claude Wippler wrote:

>I have an unrelated question: on developer day, someone in your i18n
>session described why Unicode would not be acceptable in countries such
>as Japan.  I mentioned this to Cameron, who wants to know more.  But I
>lost the name/url of that person, can you help out?

Jean-Claude,

I have taken the liberty of forwarding this paragraph to the new
i18n-sig, which contains the people who made that remark!  Please join
in to hear more...
Here is a very naive oversimplification, without most of the real
world mess:

There is a standard Japanese character set (Japan Industrial Standard
0208, or JIS-0208 for short) with 6879 characters, which has been more
or less unchanged since 1978.  They are defined in a logical 94x94
space (the 'kuten table'), with some holes in it.  This character set
is commonly encoded in three different ways, all of which aim for
backward compatibility with ASCII:

1. Shift-JIS is the native encoding on Windows and the Mac, and for
about half of the Japanese HTML on the internet.  It basically says
'if the first byte you read is less than 128, it is ASCII; if it is
above 128 and between (various values), it is the first half of a
kanji'.  There is also a phonetic syllabary called "half-width
katakana" encoded in the top half of the code page.

2. EUC-JP (Extended Unix Code - Japan) is the encoding on Unix, and
the other half of the web pages on the Internet :-)  It does something
similar; less than 128 is ASCII, and higher values are usually the
first half of a kanji.

3. JIS is an older encoding designed for mail and news.  It uses shift
sequences to indicate switching from double-byte to single-byte mode
and vice versa.

All three do not contain null bytes or control characters, so most
8-bit-safe software works fine with data in these encodings - you
might not be able to see Japanese in your English word processor, but
it will be preserved intact.  All three are very widely used, and are
the de facto encodings we have to deal with.  (Those of us in the IBM
world also have to cope with the DBCS-Host encoding, which is a can of
worms I won't afflict you with.)

Because they all derive from the 'kuten table', there are neat
algorithmic conversions between them which run very fast and need no
lookup tables.  It is a very common requirement in Japanese IT to
convert between these - for example, to convert a directory of HTML
files from EUC to Shift-JIS.
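To make "neat algorithmic conversion" concrete, here is a minimal sketch in present-day Python (anachronistic for this 2000 thread, but runnable).  The function name is invented, and half-width katakana, JIS X 0212 and the vendor extensions discussed below are deliberately left out - this handles ASCII plus plain JIS-0208 kanji only:

```python
def euc_to_sjis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to Shift-JIS by pure arithmetic on the
    kuten-derived code values (JIS X 0208 + ASCII only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # ASCII passes through unchanged
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xFE:             # two-byte JIS X 0208 character
            j1 = b - 0x80                   # strip the high bit: raw JIS values
            j2 = data[i + 1] - 0x80
            # fold two JIS rows into one Shift-JIS lead byte
            s1 = (j1 + 1) // 2 + (0x70 if j1 <= 0x5E else 0xB0)
            if j1 % 2:                      # odd JIS row
                s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
            else:                           # even JIS row
                s2 = j2 + 0x7E
            out += bytes((s1, s2))
            i += 2
        else:                               # SS2/SS3 (katakana, JIS X 0212): not handled
            raise ValueError("unsupported EUC-JP byte 0x%02x" % b)
    return bytes(out)
```

The reverse direction is the same arithmetic inverted, which is why these conversions need no lookup tables at all.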
If such a neat routine exists to go directly, we don't want to have
the overhead of going through Unicode.

Imagine we had a few higher-level functions on top of our encodings
API, such as convertString(data, input_encoding, output_encoding).
The default behaviour of such a function would be to go through
Unicode as a central point.  All we need for Japan is to say that if a
filter exists on your system which can go direct from EUC-JP to
Shift-JIS, use it rather than going through Unicode.  I am sure we can
accommodate this; MAL's spec defines a good API, and I think what we
need is a higher level on top of it.

The real world is messier than I have indicated, and there are
actually many corporate variations on the JIS-0208 character set - IBM
and Microsoft add an extra 360 characters, NEC adds about 94, and
companies always define their own 'user-defined characters'.  This is
where Unicode breaks down badly.  These additions are in well-known
locations in the 'kuten table', but the mappings to Unicode are not
standard.  So if you need to go outside the strict JIS-0208 character
set, you cannot trust Unicode to work as a 'central point'.  That's
when the direct filters are needed.

As an example of this, I worked all last year on a project where we
used the Microsoft character set (360 characters bigger than JIS-0208)
plus a small set of user-defined characters, but it all broke when we
had to serve web pages through Java's encoding libraries, which will
not handle the extras.

As a more general point, the business requirements of someone working
in this field are usually to "move data from A to B", where A and B
are not Unicode.  Unicode is a very useful tool which can sit in the
middle most of the time, and Unicode character properties solve many
problems in the CJKV world - but not all of them.
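The convertString idea above fits in a few lines (again in present-day Python for runnability; the DIRECT_FILTERS registry and its key convention are invented for illustration and are not part of MAL's spec):

```python
# Hypothetical registry of direct converters, keyed by
# (input_encoding, output_encoding) name pairs.
DIRECT_FILTERS = {}

def convertString(data: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Re-encode `data`, preferring a registered direct filter and
    falling back to a round trip through Unicode."""
    direct = DIRECT_FILTERS.get((input_encoding, output_encoding))
    if direct is not None:
        return direct(data)     # e.g. a fast kuten-arithmetic routine
    # default path: Unicode as the central point
    return data.decode(input_encoding).encode(output_encoding)
```

A Japanese installation would register its fast EUC-JP-to-Shift-JIS routine under ("euc_jp", "shift_jis"); every other pair would silently take the Unicode route.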
There are also some common cleanup operations one can perform on
Japanese - equivalent to capitalisation, but messier - which can be
done either in Unicode with character properties, or directly.
Sometimes they have to be done directly.  That is why we poor
double-byte people want to be able to take a look at the API when it
comes out, and maybe add a tweak or two - hopefully in a separate
layer over the top - and the right convenience functions to make life
easier.

Confused yet?  I could go on...  I will try to write up some decent
background documents over the course of this month.

By the way, if anyone has similar issues with other locales, let's
hear them!

- Andy

From pf@artcom-gmbh.de  Wed Feb  9 07:29:42 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 9 Feb 2000 08:29:42 +0100 (MET)
Subject: [I18n-sig] Locales module
In-Reply-To: <38a4c7cc.9705336@post.demon.co.uk> from Andy Robinson at "Feb 9, 2000 2:10:42 am"
Message-ID: 

Hi!

Andy Robinson:
> >> 3. Locales:
> >> --------------
> >> Implement a candidate module for the standard library offering support
> >> for the world's date, time, money and number formats, and for time
> >> zones.
> >
> >Hmm, I'd suggest to leave this out of the core and provide it
> >through third party extensions which are then shipped by some
> >Python distribution party.
> 
> This is definitely not an issue for the language core, and I agree it
> should start out as something separate.  There was some discussion of
> it going into the standard library in due course, and Guido did not
> say 'no'!
> 
> Would anyone like to take this on?  It isn't really my field.  I guess
> we should start by reviewing what other systems do and how well they
> work.

Please excuse my ignorance if I've missed something.  But you surely
know about 'locale.py', which already comes included with Python 1.5.2
and works very well for me together with time.strftime.
(At least under several flavours of Unix/Linux; it's currently missing
from Jack's ready-to-run MacPython package.)  What additional
functionality should the upcoming module provide?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From pf@artcom-gmbh.de  Wed Feb  9 08:03:22 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 9 Feb 2000 09:03:22 +0100 (MET)
Subject: more questions about japanese (was Re: [I18n-sig] Re: I18n)
In-Reply-To: <38a8c913.10032615@post.demon.co.uk> from Andy Robinson at "Feb 9, 2000 2:10:40 am"
Message-ID: 

Hi!

[Andy Robinson]:
> There is a standard Japanese character set (Japan Industrial Standard
> 0208, or JIS-0208 for short) with 6879 characters, which has been
> more or less unchanged since 1978.  They are defined in a logical
> 94x94 space (the 'kuten table'), with some holes in it.  This
> character set is commonly encoded in three different ways, all of
> which aim for backward-compatibility with ASCII:
[...]
> All three do not contain null bytes or control characters, so most
> 8-bit-safe software works fine with data in these encodings - you
> might not be able to see Japanese in your English word processor, but
> it will be preserved intact.  All three are very widely used, and are
> the de facto encodings we have to deal with.
[...]
> Confused yet?  I could go on...  I will try to write up some decent
> background documents over the course of this month.

First, let me thank you for your insightful elaboration!  And please
excuse my ignorance again, but the i18n work I was involved with in
the past was easily handled within an 8-bit clean ISO-8859-1 character
space (German, English, French, Italian, Spanish).  So I have some
more questions about the above:

1. I guess word processors exist which are able to deal with text
files containing strings in one of the encodings described above,
right?
So, is it possible to submit a .pot file as produced by GNU xgettext to a Japanese translator, and would he/she be able to fill in the empty msgstr "" lines with Japanese messages? 2. If the resulting .mo files from 'msgfmt' are used with an i18n'ed Python application, these strings will go unchanged through several layers of software, just like normal ASCII strings. There are some Japanese fonts coming with XFree86 and Linux, but I've never had a look at them. Would it be possible to choose such a font, and will this show the desired output on an X server running on Unix/Linux? 3. What about MacOS and WinXX? I guess these systems will automatically show the right characters if, in step 1, the translator has used a word processor on the same platform? Regards, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen) From mal@lemburg.com Wed Feb 9 09:34:44 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 09 Feb 2000 10:34:44 +0100 Subject: [I18n-sig] SIG charter and goals References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com> <38a6c8e8.9989175@post.demon.co.uk> Message-ID: <38A134B4.674C919A@lemburg.com> Andy Robinson wrote: > > On Tue, 08 Feb 2000 15:31:43 +0100, you wrote: > > >> 2. Encodings API and library: > >> -------------------------------- > >> > >> We must deliver an encodings library which surpasses the features of > >> that in Java. It should allow conversion between many common > >> encodings; access to Unicode character properties; and anything else > >> which makes encoding conversion more pleasant. This should be > >> initially based on MAL's draft specification, although the spec may > >> be changed if we find good reason to. > > > >Note that Python will have builtin codec support.
The details > >are described in the proposal paper (not the C API though -- > >that still lives in the .h files of the Unicode implementation). > > > >Note that I have made some good experience with the existing > >spec: it is very flexible, extendable and versatile. It also > >greatly reduces coding efforts by providing working baseclasses. > > > I can't wait to try the code, and cannot foresee any problems at the > moment based on the spec. However, it was only discussed on the > Python-dev list, and Marc-Andree was not at IPC8, so I should try to > explain some background for everyone, (and what my agenda as SIG > moderator is too!) > > 1. HP joined the Python consortium and pushed for Unicode support last > year. There was a detailed discussion on the Python-dev list (to > which I was invited because my day-job included some very messy > double-byte work in Python for a year). Marc-Andre's proposal went > through about eight iterations, and he started to code it up under > contract to CNRI. This is official work, and there is no question of > anybody else's Unicode modules being used - sorry! Fredrik Lundh's > work on the Unicode regex engine is also under contract and > progressing rapidly. > > 2. MAL's document defines the API for 'codecs' - conversion filters - > but his taks does not include delivering a package with all the > world's common encodings in it. That is a necessity in the long run, > and both I (through ReportLab) and Digital Garage need to make at > least the Japanese encodings work quite soon. > > (Marc-Andre, can you update us on what codecs you are providing, and > how they are implemented? C or Python? ) These codecs are currently included: raw_unicode_escape.py utf_16_be.py unicode_escape.py utf_16_le.py ascii.py unicode_internal.py utf_8.py latin_1.py utf_16.py If time permits there will also be a generic mapping codec API which knows what to do with Python mapping tables. I'm not sure how this will be done though... 
perhaps via a subpackage of encodings which holds any number of tablename.py modules which a special search function then finds and uses. You'd then write something like u = unicode(rawdata, 'mapping-pc850') and the search function would then scan the encodings.mapping package for a module pc850 and use its mapping table for the conversion. > 3. At IPC8 we discussed (among other things) the delivery of the codec > package - both in the i18n forum and in the corridors as usual! To do > what Java does, we eventually need codecs for 50+ common encodings, > all available and tested. These will almost certainly not be in the > standard distribution, but there should eventually be a single, > certified, tested source for them, as this stuff has to be 100% right. > Quite a few of us urgently need good Japanese support. > > The current spec does not say whether codecs should be in C or Python. It is designed to make both possible. I currently code the converters in C and the rest in Python, which works very well and reduces coding efforts to a minimum (the codec base classes are designed to provide everything needed to get the most out of a simple setup). > Guido expressed the hope that a few carefully chosen C routines could > allow us to write new filters in Python, but get most of the speed of > C - an idea I'd been drip-feeding to him for some time :-) I think > that is a proper task for this group, and one I hope to put a lot of > work in to. I'm personally hoping that we can do a sort of > mini-mxTextTools state machine which has actions for lookups in > single-byte mapping tables, double-byte mapping tables and other > things, so that new encodings can be written and added easily, yet > still run fast. For example, all single-byte encodings can be dealt > with by a streaming version of something like string.translate(), so > adding a new one just becomes a matter of adding a 256-element list to > a file somewhere. 
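The single-byte scheme Andy sketches above - a decoder driven by nothing more than a 256-element table - can be illustrated like this (in present-day Python; the names DECODING_TABLE and decode_single_byte are made up for illustration, and the table shown is simply Latin-1, where byte i maps to code point i):

```python
# A 256-entry decoding table: index = byte value, entry = Unicode character.
# This particular table is just Latin-1 (byte i -> code point i); a real
# codec would ship a different 256-element list per single-byte encoding.
DECODING_TABLE = [chr(i) for i in range(256)]

def decode_single_byte(data, table):
    """Decode a byte string via a 256-element lookup table."""
    return "".join(table[b] for b in data)

print(decode_single_byte(b"caf\xe9", DECODING_TABLE))  # café
```

A codec for, say, CP850 or KOI8-R would differ only in the 256 entries of the table - which is exactly why adding a new single-byte encoding can reduce to adding one list to a file somewhere.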
I believe most of the double-byte ones can be > reduced to a few kb with the right functions as well. I'll be ready > to talk more about this shortly. There will be a mapping-based translate function or method in the final release which you should be able to build upon. > Guido also made it clear that while MAL's proposal is considered > pretty good, it is not set in stone yet. In particular, if the > double-byte specialists find that some minor tweaks would make their > lives better, he would consider it; we need a real-world test-drive > before 1.6, and this group is the place to do it. Right :-) > Now for my own opinions on how things should be run henceforth. Feel > free to differ! > > I should point out that the inner circle of Python developers are NOT > experts in multi-byte data. I feel strongly that we should seek out > the best expertise in the world, starting now. This discussion will > not focus on Unicode string implementation in the core, but on what > our encoding library lets you do at the application level. Ken > Lunde, author of "CJKV Information Processing", is the acknowledged (what does the V stand for ?) > world leader in this field, and agreed to take part in a discussion > and review our proposals - I'll try to bring him in shortly. It would > also be good to collar some people involved in the Java i18n libraries > and ask what they would do differently next time around, and to talk > to people who have worked with commercial tools like Unilib and > Rosette. Then, we won't just hope that Python has the best i18n > support, we'll know it. Naturally this review needs to happen fairly > promptly in March/April - maybe best to wait until we can run the > code. > > I hope this helps a little. If people have serious issues about where > things are heading, let's hear them now. > > Best Regards, > > Andy Robinson > > p.s.
one thing I would be very interested to hear is what people's > angles are - relevant experience, willingness to help out, needs for > solutions etc! -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Feb 9 09:42:16 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 09 Feb 2000 10:42:16 +0100 Subject: [I18n-sig] Re: I18n References: <38A07876.6EEA0655@equi4.com> <38a8c913.10032615@post.demon.co.uk> Message-ID: <38A13678.CC917B1A@lemburg.com> Andy Robinson wrote: > > As a more general point, the business requirements of someone working > in this field are usually to "move data from A to B", where A and B > are not Unicode. Unicode is a very useful tool which can sit in the > middle most of the time, and Unicode character properties solve many > problems in the CJKV world - but not all of them. The converters could make use of the Unicode private code point areas. The Python implementation leaves these untouched. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Mon Feb 14 09:47:53 2000 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Mon, 14 Feb 2000 09:47:53 GMT Subject: [I18n-sig] Locales module In-Reply-To: References: Message-ID: <38aace0a.2761260@post.demon.co.uk> On Wed, 9 Feb 2000 08:29:42 +0100 (MET), you wrote: >Please excuse my ignorance if I've missed something, but surely you know >about 'locale.py', which already comes included with Python 1.5.2 >and works very well for me together with time.strftime. (at least under >several flavours of Unix/Linux; it's currently missing from Jack's >ready-to-run MacPython package). > >What additional functionality should the upcoming module provide? > >Regards, Peter Wow! No, I did not know about this at all.
I tested it and it works fine on Windows. I don't know about the POSIX API, but this is essentially a static database, so I presume one could dump the contents into a data structure which the Mac etc. could use if "from _locale import *" fails. (I suspect it could do with some more convenience stuff layered on top to do number and string formatting, and a bit more documentation - but non-critical). This was mentioned as a deficiency at IPC8, so we need to find out who said that and what they think is missing. Another attendee had serious questions about time zone handling, and I don't know what the issues are there either. Anyone have any feelings on this? - Andy From pf@artcom-gmbh.de Tue Feb 15 01:18:35 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Tue, 15 Feb 2000 02:18:35 +0100 (MET) Subject: [I18n-sig] Locales module In-Reply-To: <38aace0a.2761260@post.demon.co.uk> from Andy Robinson at "Feb 14, 2000 9:47:53 am" Message-ID: Hi! I wrote: [...] > >about 'locale.py', which already comes included with Python 1.5.2 > >and works very well for me together with time.strftime. Andy Robinson: > Wow! No, I did not know about this at all. I tested it and it works > fine on Windows. I don't know about the POSIX API, but this is > essentially a static database so I presume one could dump the contents > into a data structure which the Mac etc. used if "from _locale import > *" fails. (I suspect it could do with some more convenience stuff > layered on top to do number and string formatting, and a bit more > documentation - but non-critical). I don't know if this will work, since this stuff depends on some ANSI-C library features. I took a deeper look into the sources by Martin von Loewis, who has contributed locale.py and _localemodule.c. There is already some #ifdef macintosh in 'Modules/_localemodule.c'. So I really don't know why Jack Jansen's Python 1.5.2c1 binary distribution for the Mac doesn't contain the _locale module.
Maybe it was simply forgotten during the build, since it is disabled in Modules/Setup by default? I wonder whether it would make sense to fill 'locale.py' with dummy stubs that are put in place if an ImportError exception occurs due to a missing _locale builtin module? The following patch against a recent CVS version does that. But I am very unsure whether this behaviour is desired. Better i18n applications shouldn't depend on the availability of 'locale' and should contain their own fallback if importing locale fails. Regards, Peter -- Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

*** ../../../Python-CVS_10_02_00-orig/dist/src/Lib/locale.py	Sat Feb  5 10:45:31 2000
--- Lib/locale.py	Tue Feb 15 01:55:35 2000
***************
*** 1,9 ****
  """Support for number formatting using the current locale settings."""
  
  # Author: Martin von Loewis
  
- from _locale import *
  import string
  
  #perform the grouping from right to left
  def _group(s):
--- 1,36 ----
  """Support for number formatting using the current locale settings."""
  
  # Author: Martin von Loewis
+ # Fallback stubs added by Peter Funk
  
  import string
  
+ try:
+     from _locale import *
+ except ImportError:
+     # this may happen on MacOS or on Unices where the locale support
+     # in Modules/Setup wasn't uncommented during the build of python
+     # we add some dummy stubs here in order not to break any apps:
+     CHAR_MAX = 127
+     LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE, \
+         LC_MONETARY, LC_MESSAGES, LC_ALL = tuple(range(7))
+     def localeconv():
+         return {'grouping': [127], 'currency_symbol': '', 'n_sign_posn': 127,
+                 'p_cs_precedes': 127, 'n_cs_precedes': 127,
+                 'mon_grouping': [], 'n_sep_by_space': 127,
+                 'decimal_point': '.', 'negative_sign': '',
+                 'positive_sign': '', 'p_sep_by_space': 127,
+                 'int_curr_symbol': '', 'p_sign_posn': 127,
+                 'thousands_sep': '', 'mon_thousands_sep': '',
+                 'frac_digits': 127, 'mon_decimal_point': '',
+                 'int_frac_digits': 127}
+     def setlocale(category, arg):
+         if category == LC_ALL:
+             return \
+                 "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C"
+         else:
+             return 'C'
+     def strcoll(s1, s2): return cmp(s1, s2)
+     def strxfrm(s): return s
  
  #perform the grouping from right to left
  def _group(s):

From andy@robanal.demon.co.uk Thu Feb 17 09:43:15 2000 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Thu, 17 Feb 2000 09:43:15 GMT Subject: [I18n-sig] i18n talk at Monterey in July Message-ID: <38b3c2b0.4954644@post.demon.co.uk> The submissions deadline for the July Open Source conference in Monterey is tomorrow. I'd like to ensure that there is a slot for Python internationalisation (say 45 minutes), to show how to use the Unicode features and encodings library and explain some of the problems we are trying to solve. This could be great fun - we can do nice visuals with the Japanese stuff - and will be relevant as 1.6 will be out around then. I will do this myself if needed, but is anyone else willing to co-present and help prepare the talk? The programme itself won't be published in print for some time, so I guess dropping out later is allowed, but names must go on draft proposals today/tomorrow. (e.g. Marc-Andre, Brian, Cyrus?) Who's planning to be there, anyway? - Andy Robinson From mal@lemburg.com Thu Feb 17 11:05:15 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Feb 2000 12:05:15 +0100 Subject: [I18n-sig] i18n talk at Monterey in July References: <38b3c2b0.4954644@post.demon.co.uk> Message-ID: <38ABD5EB.404AC678@lemburg.com> Andy Robinson wrote: > > The submissions deadline for the July Open Source conference in > Monterey is tomorrow. I'd like to ensure that there is a slot for > Python internationalisation (say 45 minutes), to show how to use the > Unicode features and encodings library and explain some of the > problems we are trying to solve. This could be great fun - we can do > nice visuals with the Japanese stuff - and will be relevant as 1.6 > will be out around then.
> > I will do this myself if needed, but is anyone else willing to > co-present and help prepare the talk? > > The programme itself won't be published in print for some time, so I > guess dropping out later is allowed, but names must go on draft > proposals today/tomorrow. > > (e.g. Marc-Andre, Brian, Cyrus?) > > Who's planning to be there, anyway? I won't have time to spend on this, because I'm packed with work (also, I won't be online next week), sorry. Anyway, the Unicode code will go into CVS within the first two weeks in March, so you should be able to test and verify the new features really soon now :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/