From brian@garage.co.jp Fri Mar 10 07:59:01 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Fri, 10 Mar 2000 16:59:01 +0900 Subject: [I18n-sig] link: Lessons learned in internationalizing the ECMAScript standard Message-ID: <38C8AB455D.F6B6BRIAN@smtp.garage.co.jp> Hi - Python's obviously not JavaScript :-), but maybe there are some lessons which can be learned from this: http://www-4.ibm.com/software/developer/library/internationalization-support.html --Brian Hooper From mal@lemburg.com Fri Mar 10 09:00:58 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 10 Mar 2000 10:00:58 +0100 Subject: [I18n-sig] link: Lessons learned in internationalizing the ECMAScript standard References: <38C8AB455D.F6B6BRIAN@smtp.garage.co.jp> Message-ID: <38C8B9CA.8BF7AAF2@lemburg.com> Brian Takashi Hooper wrote: > > Hi - > > Python's obviously not JavaScript :-), but maybe there are some lessons > which can be learned from this: > > http://www-4.ibm.com/software/developer/library/internationalization-support.html The document makes some good points. I esp. like the sections about string operations w/r to i18n. Note that Python also uses UTF-16 as internal format, it does provide the combining character properties for all characters, but does not (in the core) have support to normalize strings. If someone needs this functionality a Unicode toolbox would be easy to write using the information from the Unicode database included in the core. BTW, the Python CVS version should include the Unicode patch RSN... I suppose, Guido is going to post an announcement about this too, so that the code can be put to some real world testing ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 10 10:06:04 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 10 Mar 2000 10:06:04 -0000 Subject: [I18n-sig] Draft SIG page up for review Message-ID: A month too late, I have placed a draft page up at http://www.reportlab.com/i18n/i18nsig.html Any quick inclusions/omissions/errors, before it goes up on python.org? Thanks, Andy Robinson From mal@lemburg.com Fri Mar 10 11:23:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 10 Mar 2000 12:23:21 +0100 Subject: [I18n-sig] Draft SIG page up for review References: Message-ID: <38C8DB29.7EC6D17F@lemburg.com> Andy Robinson wrote: > > A month too late, I have placed a draft page up at > http://www.reportlab.com/i18n/i18nsig.html > > Any quick inclusions/omissions/errors, before it goes up on python.org? Looks fine... except maybe that we will want to change the email address to i18n-sig-owner. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 10 13:53:59 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 10 Mar 2000 13:53:59 -0000 Subject: [I18n-sig] Draft SIG page up for review In-Reply-To: <38C8DB29.7EC6D17F@lemburg.com> Message-ID: > Looks fine... except maybe that we will want to change > the email address to i18n-sig-owner. 
> It goes into a template - the real thing is up there now http://www.python.org/sigs/i18n-sig/ - Andy From guido@python.org Sat Mar 11 00:20:01 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 10 Mar 2000 19:20:01 -0500 Subject: [I18n-sig] Unicode patches checked in Message-ID: <200003110020.TAA17777@eric.cnri.reston.va.us> I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python. This work was financially supported by Hewlett-Packard. Marc-Andre has done a tremendous amount of work, for which I cannot thank him enough. We're still awaiting some more things: Marc-Andre gave me documentation patches which will be reviewed by Fred Drake before they are checked in; Fredrik Lundh has developed a new regular expression which is Unicode-aware and which should be checked in real soon now. Also, the documentation is probably incomplete and will be updated, and of course there may be bugs -- this should be considered alpha software. However, I believe it is quite good already, otherwise I wouldn't have checked it in! I'd like to invite everyone with an interest in Unicode or Python 1.6 to check out this new Unicode-aware Python, so that we can ensure a robust code base by the time Python 1.6 is released (planned release date: June 1, 2000). The download links are below. Links: http://www.python.org/download/cvs.html Instructions on how to get access to the CVS version. (David Ascher is making nightly tarballs of the CVS version available at http://starship.python.net/crew/da/pythondists/) http://starship.python.net/crew/lemburg/unicode-proposal.txt The latest version of the specification on which the Marc has based his implementation. http://www.python.org/sigs/i18n-sig/ Home page of the i18n-sig (Internationalization SIG), which has lots of other links about this and related issues. http://www.python.org/search/search_bugs.html The Python Bugs List. Use this for all bug reports. Note that next Tuesday I'm going on a 10-day trip, with limited time to read email and no time to solve problems. The usual crowd will take care of urgent updates. See you at the Intel Computing Continuum Conference in San Francisco or at the Python Track at Software Development 2000 in San Jose! --Guido van Rossum (home page: http://www.python.org/~guido/) From shichang@icubed.com" I would love to test the Python 1.6 (Unicode support) in Chinese language aspect, but I don't know where I can get a copy of OS that supports Chinese. Anyone can point me a direction? -----Original Message----- From: Guido van Rossum [SMTP:guido@python.org] Sent: Saturday, March 11, 2000 12:20 AM To: Python mailing list; python-announce@python.org; python-dev@python.org; i18n-sig@python.org; string-sig@python.org Cc: Marc-Andre Lemburg Subject: Unicode patches checked in I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python. This work was financially supported by Hewlett-Packard. Marc-Andre has done a tremendous amount of work, for which I cannot thank him enough. We're still awaiting some more things: Marc-Andre gave me documentation patches which will be reviewed by Fred Drake before they are checked in; Fredrik Lundh has developed a new regular expression which is Unicode-aware and which should be checked in real soon now. Also, the documentation is probably incomplete and will be updated, and of course there may be bugs -- this should be considered alpha software. 
However, I believe it is quite good already, otherwise I wouldn't have checked it in! I'd like to invite everyone with an interest in Unicode or Python 1.6 to check out this new Unicode-aware Python, so that we can ensure a robust code base by the time Python 1.6 is released (planned release date: June 1, 2000). The download links are below. Links: http://www.python.org/download/cvs.html Instructions on how to get access to the CVS version. (David Ascher is making nightly tarballs of the CVS version available at http://starship.python.net/crew/da/pythondists/) http://starship.python.net/crew/lemburg/unicode-proposal.txt The latest version of the specification on which the Marc has based his implementation. http://www.python.org/sigs/i18n-sig/ Home page of the i18n-sig (Internationalization SIG), which has lots of other links about this and related issues. http://www.python.org/search/search_bugs.html The Python Bugs List. Use this for all bug reports. Note that next Tuesday I'm going on a 10-day trip, with limited time to read email and no time to solve problems. The usual crowd will take care of urgent updates. See you at the Intel Computing Continuum Conference in San Francisco or at the Python Track at Software Development 2000 in San Jose! --Guido van Rossum (home page: http://www.python.org/~guido/) -- http://www.python.org/mailman/listinfo/python-list From brian@garage.co.jp Mon Mar 13 12:05:50 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Mon, 13 Mar 2000 21:05:50 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions Message-ID: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> Hi there i18n-siggers - First of all, thank you very very much Marc-Andre (and Fredrik Lundh for the original implementation) for all your hard work, I checked out the CVS checkin yesterday and played with it a little, and took a print out of the source home with me. It seems really well thought out and organized. I scrutinized the code base thinking about issues for a CJK codec, and came up with a few questions: 1. Should the CJK ideograms also be included in the unicodehelpers numeric converters? From my perspective, I'd really like to see them go in, and think that it would make sense, too - any opinions? 2. Same as above with double-width alphanumeric characters - I assume these should probably also be included in the lowercase / uppercase helpers? Or will there be a way to add to these lists through the codec API (for those worried about data from unused codecs clogging up their character type helpers, maybe this would be a good option to have; I would by contrast like to be able to exclude all the extra Latin 1 stuff that I don't need, hmm.) 3. Same thing for whitespace - I think there are a number of double-width whitespace characters around also. 4. Are there any conventions for how non-standard codecs should be installed? Should they be added to Python's encodings directory, or should they just be added to site-packages or site-python like other third-party modules? 5. Are there any existing tools for converting from Unicode mapping files to a C source file that can be handily made into a dynamic library, or am I on my own there? Anyone who has any opinions on the above please chime in, I'm trying to start a discussion :-) ! Also, while I was reading the code, I found a few typos and spelling mistakes (for example the notoriously often misspelled 'occurrence'). 
While I doubt this is a very high priority, from watching the checkins list apparently Guido accepts spelling patches - so, I have a big context diff, who should I send it to? Thanks, -Brian Hooper From mal@lemburg.com Mon Mar 13 13:58:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 13 Mar 2000 14:58:24 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> Message-ID: <38CCF400.A7B64CDC@lemburg.com> Brian Takashi Hooper wrote: > > Hi there i18n-siggers - > > First of all, thank you very very much Marc-Andre (and Fredrik Lundh for > the original implementation) for all your hard work, I checked out the > CVS checkin yesterday and played with it a little, and took a print out > of the source home with me. It seems really well thought out and > organized. > > I scrutinized the code base thinking about issues for a CJK codec, and > came up with a few questions: > > 1. Should the CJK ideograms also be included in the unicodehelpers > numeric converters? From my perspective, I'd really like to see them go > in, and think that it would make sense, too - any opinions? > > 2. Same as above with double-width alphanumeric characters - I assume > these should probably also be included in the lowercase / uppercase > helpers? Or will there be a way to add to these lists through the codec > API (for those worried about data from unused codecs clogging up their > character type helpers, maybe this would be a good option to have; I > would by contrast like to be able to exclude all the extra Latin 1 stuff > that I don't need, hmm.) > > 3. Same thing for whitespace - I think there are a number of > double-width whitespace characters around also. I'm not sure I understand what you are intending here: the unicodectype.c file contains a switch statements which were deduced from the UnicodeData.txt file available at the Unicode.org FTP site. It contains all mappings which were defined in that files -- unless my parser omitted some. If you plan to add new mappings which are not part of the Unicode standard, I would suggest adding them to a separate module. E.g. you could extend the versions available through the unicodedata module. But beware: the Unicode methods only use the mappings defined in the unicodectype.c file. > 4. Are there any conventions for how non-standard codecs should be > installed? Should they be added to Python's encodings directory, or > should they just be added to site-packages or site-python like other > third-party modules? You can drop them anyplace you want... and then have them register a search function. The standard encodings package uses modules as codec basis but you could just as well provide other means of looking up and even creating codecs on-the-fly. Don't know what the standard installation method is... this hasn't been sorted out yet. My current thinking is to include all standard and small codecs in the standard dist and include the bigger ones in a separate Python add-on distribution (e.g. a tar file that gets untarred on top of an existing installation). A smart installer should ideally take care of this... > 5. Are there any existing tools for converting from Unicode mapping > files to a C source file that can be handily made into a dynamic > library, or am I on my own there? No, there is a tool to convert them to a Python source file though (Misc/gencodec.py). The created codecs will use the builtin generic mapping codec as basis for their work. 
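For concreteness, a module produced by gencodec.py for a simple one-byte character set has roughly the following shape (the exact generated output differs in detail, and the single mapping entry shown here is an invented placeholder, not real table data):

    import codecs

    class Codec(codecs.Codec):
        # encoding_map/decoding_map are plain dictionaries mapping byte
        # values to Unicode ordinals and back; the generic charmap codec
        # in the core does the actual per-character work.
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)
        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    class StreamWriter(Codec, codecs.StreamWriter):
        pass

    class StreamReader(Codec, codecs.StreamReader):
        pass

    def getregentry():
        # hook called by the standard encodings package search function
        return (Codec().encode, Codec().decode, StreamReader, StreamWriter)

    decoding_map = {
        0x00a4: 0x20ac,   # placeholder entry: byte value -> Unicode ordinal
    }
    encoding_map = {}
    for k, v in decoding_map.items():
        encoding_map[v] = k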
If mappings get huge (like the CJK ones), I would create a new parser though, which then generates extension modules to have the mapping available as static C data rather than as Python dictionary on the heap... gencodec.py should provide a good template for such a tool. > Anyone who has any opinions on the above please chime in, I'm trying to > start a discussion :-) ! > > Also, while I was reading the code, I found a few typos and spelling > mistakes (for example the notoriously often misspelled 'occurrence'). Ahem ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Mon Mar 13 14:42:41 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Mon, 13 Mar 2000 23:42:41 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CCF400.A7B64CDC@lemburg.com> References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> <38CCF400.A7B64CDC@lemburg.com> Message-ID: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> Hi again, On Mon, 13 Mar 2000 14:58:24 +0100 "M.-A. Lemburg" wrote: [snip] > I'm not sure I understand what you are intending here: the > unicodectype.c file contains a switch statements which were > deduced from the UnicodeData.txt file available at the > Unicode.org FTP site. It contains all mappings which were defined > in that files -- unless my parser omitted some. > > If you plan to add new mappings which are not part of the > Unicode standard, I would suggest adding them to a separate > module. E.g. you could extend the versions available through > the unicodedata module. But beware: the Unicode methods > only use the mappings defined in the unicodectype.c file. My mistake - I thought for some reason that double-width Latin characters, such as are used in Japanese, were part of the CJK ideogram code space that starts from \u3400, so I was expecting them to map to lower values in Unicode than they actually do (a double-width 'A', for example, is \uFF21. > > > 4. Are there any conventions for how non-standard codecs should be > > installed? Should they be added to Python's encodings directory, or > > should they just be added to site-packages or site-python like other > > third-party modules? > > You can drop them anyplace you want... and then have them > register a search function. The standard encodings package > uses modules as codec basis but you could just as well provide > other means of looking up and even creating codecs on-the-fly. > > Don't know what the standard installation method is... this > hasn't been sorted out yet. > > My current thinking is to include all standard and small > codecs in the standard dist and include the bigger ones > in a separate Python add-on distribution (e.g. a tar file > that gets untarred on top of an existing installation). > A smart installer should ideally take care of this... Maybe one using Distutils? I guess it would make the most sense if you run the install script with /usr/local/bin/python, for example, then the codecs would get installed in the proper place for that Python installation to use them... > > > 5. Are there any existing tools for converting from Unicode mapping > > files to a C source file that can be handily made into a dynamic > > library, or am I on my own there? > > No, there is a tool to convert them to a Python source file > though (Misc/gencodec.py). The created codecs will use the > builtin generic mapping codec as basis for their work. 
> > If mappings get huge (like the CJK ones), I would create a > new parser though, which then generates extension modules > to have the mapping available as static C data rather > than as Python dictionary on the heap... gencodec.py > should provide a good template for such a tool. You recommend in the unicode proposal that the mapping should probably be a buildable as a shared library, to allow multiple interpreter instances to share the table - for platforms which don't support this option, then, would it make sense to make the codec such that the mapping tables can be statically linked into the interpreter? Or, in such a case, do you think would it be better to try to set things up so that the mapping tables can be read from a file? --Brian From mal@lemburg.com Mon Mar 13 15:47:44 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 13 Mar 2000 16:47:44 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCD99EF9.16E2BRIAN@smtp.garage.co.jp> <38CCF400.A7B64CDC@lemburg.com> <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> Message-ID: <38CD0DA0.8DD4FC38@lemburg.com> Brian Takashi Hooper wrote: > > > I'm not sure I understand what you are intending here: the > > unicodectype.c file contains a switch statements which were > > deduced from the UnicodeData.txt file available at the > > Unicode.org FTP site. It contains all mappings which were defined > > in that files -- unless my parser omitted some. > > > > If you plan to add new mappings which are not part of the > > Unicode standard, I would suggest adding them to a separate > > module. E.g. you could extend the versions available through > > the unicodedata module. But beware: the Unicode methods > > only use the mappings defined in the unicodectype.c file. > My mistake - I thought for some reason that double-width Latin > characters, such as are used in Japanese, were part of the CJK ideogram > code space that starts from \u3400, so I was expecting them to map to > lower values in Unicode than they actually do (a double-width 'A', for > example, is \uFF21. Unicode is built upon ASCII -- I don't think that other encodings were taken into account during the ordinal assignment (not 100% sure though). You should be able to get at the numeric information of DBCS chars (this is what you're talking about, right ?) by first converting them to Unicode. > > > > > 4. Are there any conventions for how non-standard codecs should be > > > installed? Should they be added to Python's encodings directory, or > > > should they just be added to site-packages or site-python like other > > > third-party modules? > > > > You can drop them anyplace you want... and then have them > > register a search function. The standard encodings package > > uses modules as codec basis but you could just as well provide > > other means of looking up and even creating codecs on-the-fly. > > > > Don't know what the standard installation method is... this > > hasn't been sorted out yet. > > > > My current thinking is to include all standard and small > > codecs in the standard dist and include the bigger ones > > in a separate Python add-on distribution (e.g. a tar file > > that gets untarred on top of an existing installation). > > A smart installer should ideally take care of this... > Maybe one using Distutils? I guess it would make the most sense if you > run the install script with /usr/local/bin/python, for example, then the > codecs would get installed in the proper place for that Python > installation to use them... Right. 
distutils could be a solution on Unix -- the problem of using distutils is that you first have to have a working Python installation for it to work, so such an approach would only work in two steps: first Python core, then extended codecs package. > > > > > 5. Are there any existing tools for converting from Unicode mapping > > > files to a C source file that can be handily made into a dynamic > > > library, or am I on my own there? > > > > No, there is a tool to convert them to a Python source file > > though (Misc/gencodec.py). The created codecs will use the > > builtin generic mapping codec as basis for their work. > > > > If mappings get huge (like the CJK ones), I would create a > > new parser though, which then generates extension modules > > to have the mapping available as static C data rather > > than as Python dictionary on the heap... gencodec.py > > should provide a good template for such a tool. > You recommend in the unicode proposal that the mapping should probably > be a buildable as a shared library, to allow multiple interpreter > instances to share the table - for platforms which don't support this > option, then, would it make sense to make the codec such that the > mapping tables can be statically linked into the interpreter? Or, in > such a case, do you think would it be better to try to set things up so > that the mapping tables can be read from a file? Since memory mapped files are not supported by Python per default I would suggest letting the system linker take care of sharing the constant C data from a shared (or statically linked) extension module. Reading the information directly from a file would probably be too slow. Note that the module would only have to provide a simple __getitem__ interface compatible object which then fetches the data from the static C data. The rest can then be done in Python in the same way as the other mapping codecs do their job. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Tue Mar 14 08:10:47 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Tue, 14 Mar 2000 17:10:47 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CD0DA0.8DD4FC38@lemburg.com> References: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> <38CD0DA0.8DD4FC38@lemburg.com> Message-ID: <38CDF40712.82E2BRIAN@smtp.garage.co.jp> Hi! On Mon, 13 Mar 2000 16:47:44 +0100 "M.-A. Lemburg" wrote: [snip] > Unicode is built upon ASCII -- I don't think that other encodings > were taken into account during the ordinal assignment (not 100% > sure though). > > You should be able to get at the numeric information of DBCS > chars (this is what you're talking about, right ?) by first > converting them to Unicode. Yes - it looks like this is the case :-). > > > > > > > > 4. Are there any conventions for how non-standard codecs should be > > > > installed? Should they be added to Python's encodings directory, or > > > > should they just be added to site-packages or site-python like other > > > > third-party modules? > > > > > > You can drop them anyplace you want... and then have them > > > register a search function. The standard encodings package > > > uses modules as codec basis but you could just as well provide > > > other means of looking up and even creating codecs on-the-fly. > > > > > > Don't know what the standard installation method is... this > > > hasn't been sorted out yet. 
> > > > > > My current thinking is to include all standard and small > > > codecs in the standard dist and include the bigger ones > > > in a separate Python add-on distribution (e.g. a tar file > > > that gets untarred on top of an existing installation). > > > A smart installer should ideally take care of this... > > Maybe one using Distutils? I guess it would make the most sense if you > > run the install script with /usr/local/bin/python, for example, then the > > codecs would get installed in the proper place for that Python > > installation to use them... > > Right. distutils could be a solution on Unix -- the problem > of using distutils is that you first have to have a working > Python installation for it to work, so such an approach > would only work in two steps: first Python core, then extended > codecs package. I guess, then it would be nice to have something that could work in either case... Should encoding support be an option to ./configure, when you are first building Python? General question to everyone out there - should it be possible to intentionally build Python without Unicode support? > > > > > > > > 5. Are there any existing tools for converting from Unicode mapping > > > > files to a C source file that can be handily made into a dynamic > > > > library, or am I on my own there? > > > > > > No, there is a tool to convert them to a Python source file > > > though (Misc/gencodec.py). The created codecs will use the > > > builtin generic mapping codec as basis for their work. > > > > > > If mappings get huge (like the CJK ones), I would create a > > > new parser though, which then generates extension modules > > > to have the mapping available as static C data rather > > > than as Python dictionary on the heap... gencodec.py > > > should provide a good template for such a tool. > > You recommend in the unicode proposal that the mapping should probably > > be a buildable as a shared library, to allow multiple interpreter > > instances to share the table - for platforms which don't support this > > option, then, would it make sense to make the codec such that the > > mapping tables can be statically linked into the interpreter? Or, in > > such a case, do you think would it be better to try to set things up so > > that the mapping tables can be read from a file? > > Since memory mapped files are not supported by Python per > default I would suggest letting the system linker take care of > sharing the constant C data from a shared (or statically linked) > extension module. Reading the information directly from a file > would probably be too slow. > > Note that the module would only have to provide a simple > __getitem__ interface compatible object which then fetches > the data from the static C data. The rest can then be done > in Python in the same way as the other mapping codecs do their > job. Am I right in thinking that 'static C data' means something like static Py_UNICODE mapping[] = { ... }; ? Also, from a design standpoint do you (and anyone else on i18n) think it would be better to emphasize speed and / or memory efficiency by making specialized codecs for the different CJK encodings (for example, if a table such as the above is used, then in the case of a particular encoding, for example EUC, it may be possible to reduce the size of the table by introducing some EUC-specific casing into the encoder/decoder), or would it be better to try for a generalized implementation? 
We need something like codecs.charset_encode and codecs.charset_decode for CJK char sets - I was thinking that this might be best handled by a few separate C modules (for Japanese, one for SJIS, one for EUC, and one for JIS) that would in turn use similarly defined mapping modules, containing only one or more static conversion maps as arrays - in this sense I am leaning towards making tuned codecs for each encoding set. I want to try to make something that many people can use - does this sound like a reasonable approach, or am I on the wrong track here? --Brian From mal@lemburg.com Tue Mar 14 09:55:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 14 Mar 2000 10:55:24 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CCFE6129.16E7BRIAN@smtp.garage.co.jp> <38CD0DA0.8DD4FC38@lemburg.com> <38CDF40712.82E2BRIAN@smtp.garage.co.jp> Message-ID: <38CE0C8C.E0518D0D@lemburg.com> Brian Takashi Hooper wrote: > > Should encoding support be an option to ./configure, when you are first > building Python? General question to everyone out there - should it be > possible to intentionally build Python without Unicode support? How would you do this using configure ? As for the exclusion of Unicode: this is currently not planned. Doing this would cause the code to become very inelegant due to the many #ifdefs this introduces (the problem here being that Unicode support is tightly integrated into the interpreter in many places). > [Tools for creating codecs from mappings] > > > Note that the module would only have to provide a simple > > __getitem__ interface compatible object which then fetches > > the data from the static C data. The rest can then be done > > in Python in the same way as the other mapping codecs do their > > job. > Am I right in thinking that 'static C data' means something like > > static Py_UNICODE mapping[] = { ... }; Right. > ? Also, from a design standpoint do you (and anyone else on i18n) think > it would be better to emphasize speed and / or memory efficiency by > making specialized codecs for the different CJK encodings (for example, > if a table such as the above is used, then in the case of a particular > encoding, for example EUC, it may be possible to reduce the size of the > table by introducing some EUC-specific casing into the encoder/decoder), > or would it be better to try for a generalized implementation? How about a lib of common functions needed for CJK and then a few small extra modules for each of the specific codecs. Fast encoders/decoder should be done in C, the whole class business in Python. > We need > something like codecs.charset_encode and codecs.charset_decode for CJK > char sets - I was thinking that this might be best handled by a few > separate C modules (for Japanese, one for SJIS, one for EUC, and one for > JIS) that would in turn use similarly defined mapping modules, > containing only one or more static conversion maps as arrays - in this > sense I am leaning towards making tuned codecs for each encoding set. Andy mentioned that it should be possible to write codecs which do a couple of smaller switches and implement the other mappings using some more intelligent logic. The example I gave above has to be seen in the light of using the generic mapping codec -- which probably is not very much use in a multi-byte encoding world since it currently only supports 1-1 mappings. I'd suggest going Andy's way for the CJK codecs... Andy ? 
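To make the "fast C encoders/decoders, class business in Python" split concrete, a minimal sketch of the Python-side glue for one encoding, assuming a hypothetical C extension module _eucjp that holds the tables and fast loops and exposes encode()/decode() functions (the module and function names are invented for illustration, not an existing API):

    import codecs
    import _eucjp   # hypothetical C extension: static tables plus fast loops

    class Codec(codecs.Codec):
        def encode(self, input, errors='strict'):
            # the C function is assumed to return (encoded string, length consumed)
            return _eucjp.encode(input, errors)
        def decode(self, input, errors='strict'):
            return _eucjp.decode(input, errors)

    # StreamReader/StreamWriter subclasses and a getregentry() hook would
    # follow the same pattern as in the gencodec-style module sketched earlier.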
> I want to try to make something that many people can use - does this > sound like a reasonable approach, or am I on the wrong track here? Don't think so :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Wed Mar 15 08:51:49 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 15 Mar 2000 17:51:49 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions Message-ID: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> Hi, > Andy mentioned that it should be possible to write codecs > which do a couple of smaller switches and implement the other > mappings using some more intelligent logic. > > The example I gave above has to be seen in the light of using the > generic mapping codec -- which probably is not very much use in a > multi-byte encoding world since it currently only supports > 1-1 mappings. > > I'd suggest going Andy's way for the CJK codecs... Andy ? I like the idea of an encoding/decoding state machine, and have started thinking about how this would work in the breakdown for the CJKV codecs - what I've got is kind of like this: The top level class interfaces, and the StreamReader/Writer classes as well, will be in Python - I think we can probably group these generally into modal and non-modal encoding schemes (ISO-2022-JP being an example of the first, and EUC being an example of the second), the difference between the two being largely a matter of how streams are handled. (Note: Andy, please pipe in if I'm misrepresenting your idea, or even if I'm not, I'd like to know what you think about all this!) For the encoders/decoders I like Andy's idea of trying to generalize out a kind of 'mini-language' ala mxTextTools for specifying encoding/decoding logic separately and then just have a generalized engine that can generically handle multi-byte mapping tasks. So, the main task then is to come up with a generalization that can encompass all of the manipulations which might be necessary in order to specify the behavior of the mapping machine: 1. one thing it should definitely be able to do is specify a byte offset for data in a static table. So, for example, if I have something like: static Py_UNICODE euc2unicode[] = { 0x3000, 0x3001, ... }; I should know to start indexing from (adding 0x8080 to the first JIS 0208 character, 0x2121) 0xa1a1, that is, EUC 0xa1a2 should be converted by looking up euc2unicode[1] => 0x3001 in Unicode. 2. another thing that it would be good to be able to do, I think, is to be able to somehow specify which map to look in. So, a character set should be able to be stored in multiple, non-contiguous static arrays; again using the example of EUC, the code set 2 zone (stuff that begins with 8e) should refer to a different mapping table than the code set 1 stuff (the regular JIS 0208 zone for EUC-JP). So, the encoder would be able to say -> OK, for a character in this range, I should look up the value at this offset into this mapping table. For EUC-JP, this would look like:

    first character    look in table           at offset
    0x21-7e            JIS-Roman->Unicode      - 0x21
    0xa1-fe            JIS 0208->Unicode       - 0x8080
    0x8e               HW Katakana->Unicode    - 0x8e00 (from JIS-Roman)
    0x8f               JIS 0212->Unicode       - 0x8080 (lookup w/ second & third bytes)

Actually, looking at this a little more, probably there should be a way of calculating the map index given some info about the dimensions of the map, i.e.
it should be possible to set more than one offset, so that instead of having to have a table with a lot of extra placeholding space in it, then we know that if we have a 94x94 matrix (pretty common in the Japanese encodings, as you know), then we can store all the data in a 5590-element array and just index it according to our chosen offsets. 3. coming back from Unicode I'm wondering a little about this, since when we're coming back from Unicode basically we have no choice (that I can think of) but to have 2^16 * (max number of bytes in target encoding), with placeholders where there is no mapping. So, for something like EUC-TW, which has a maximum of 4 bytes per character, we need an encoding map 256K in size... is there a better way, that doesn't waste so much space? 'Course, I would hope that the Taiwanese would put enough memory in their machines (since memory's pretty cheap there). I guess the encoder/decoder should also know about how to do modal encodings - I guess this is easier though if we can assume we have the whole string, or some convenient chunk of it, to do encoding on. Or maybe modal and non-modal encoders/decoders should be separately implemented (possibly, sharing utility functions)? I still have to look at more examples of asian encodings and especially the ISO-2022 style ones, and vendor encodings, to get a better idea of what manipulations they should do. I was also thinking that the maps, to keep them separate from the encoders/decoders themselves, would be degenerate Python modules that would return void * pointers to the mapping tables via PyCObjects... this seemed to me to be a good way to do maps which will primarily be accessed by other C modules, rather than by Python... does this seem like an OK thing? Awaiting further enlightenment, --Brian From mal@lemburg.com Wed Mar 15 14:36:34 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 15 Mar 2000 15:36:34 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> Message-ID: <38CF9FF2.5E558813@lemburg.com> Just a few comments about the design (don't have any knowledge about Asian encodings): 1. Keep large mapping tables in single automatically generated C modules that export a lookup object (ones that define __getitem__). These could also be generated using some perfect hash table generator, BTW, to reduce memory consumption. 2. Write small special encoders/decoders that take the lookup table objects as argument. 3. Glue both together using Python code -- forget about the PyCObject idea :-) ... it causes too many problems when the import fails. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Wed Mar 15 15:07:44 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 16 Mar 2000 00:07:44 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CF9FF2.5E558813@lemburg.com> References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> Message-ID: <38CFA74044.B752BRIAN@smtp.garage.co.jp> Thanks, this is great advice, and the kind of feedback I have been looking for! Especially about not using PyCObject, which seemed like the thing to do but I have to admit some naivete about its proper use. I'll try thinking about this a bit more, along the lines you suggest. --Brian On Wed, 15 Mar 2000 15:36:34 +0100 "M.-A. 
Lemburg" wrote: > Just a few comments about the design (don't have any knowledge > about Asian encodings): > > 1. Keep large mapping tables in single automatically generated C > modules that export a lookup object (ones that define __getitem__). > These could also be generated using some perfect hash table > generator, BTW, to reduce memory consumption. > > 2. Write small special encoders/decoders that take the lookup > table objects as argument. > > 3. Glue both together using Python code -- forget about the > PyCObject idea :-) ... it causes too many problems when the import > fails. > > -- > Marc-Andre Lemburg > ______________________________________________________________________ > Business: http://www.lemburg.com/ > Python Pages: http://www.lemburg.com/python/ > > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig > From chris@ccbs.ntu.edu.tw Thu Mar 16 04:14:08 2000 From: chris@ccbs.ntu.edu.tw (Christian Wittern) Date: Thu, 16 Mar 2000 12:14:08 +0800 Subject: [I18n-sig] CJK codecs etc Message-ID: Hi everybody, I have some comments about CJK codecs, which are more from a user than a programmers perspective. 1.) Please provide a (configurable?) fallback for failed conversions. This is of course especially needed for conversions out of Unicode. What I have in mind is, for example, provide the Unicode codepoint as entity (&U-4e00;) or Java escape or some such, depending on the users choice. Don't just give a '?', what M$'s braindead conversion routines do and thus regularily drive me nuts. 2.) On the same topic, there are some fairly frequently codepoints that map to different codepoints in Japanese and Taiwans encoding, although this is in most cases not expected. These codepoints should have been eliminated by Unicodes unification rules, but crept in via the source-encoding separation rule -- not a very good decision in my opinion. I have a list of some such characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, there should be a way for the user to influence the conversion by providing a list of his choice (with his modifications) to the codec, to overlay the predefined values. 3.) The nasty problem of user defined characters. I think there should be a default mapping of the user defined area in DBCS encodings to the Unicode code range for user characters. Microsoft uses fixed sequential tables and I think that is a good idea, since it is pretty straightforward. In big5 for example, the area of user defined characters starts at Fa40, Fa41 ..., which gets mapped to Unicode E000, E001, .. There should also be an option to use some kind of entity reference instead. 4.) I developped years ago the habit of using entity references for any characters not representable in the given characterset used by the system. I have seen this becoming more widespread in the user communities I work with. It would be very useful for us, if the Unicode conversion routines in Python could be told to tread some arbitray entity references (we use things like &M24501; for the characters assigned by the Mojikyo Font Institute (see www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS encoding). I realize that this is a rather specialised usage, but it would be great and very helpful to have some hook in the system to treat this stuff just like any other character. Any comments? All the best, Christian Dr. 
Christian Wittern Chung-Hwa Institute of Buddhist Studies 276, Kuang Ming Road, Peitou 112 Taipei, TAIWAN Tel. +886-2-2892-6111#65, Email chris@ccbs.ntu.edu.tw From brian@garage.co.jp Thu Mar 16 06:49:23 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 16 Mar 2000 15:49:23 +0900 Subject: [I18n-sig] thinking of CJK codec, some questions In-Reply-To: <38CF9FF2.5E558813@lemburg.com> References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> Message-ID: <38D083F3254.189CBRIAN@smtp.garage.co.jp> On Wed, 15 Mar 2000 15:36:34 +0100 "M.-A. Lemburg" wrote: > Just a few comments about the design (don't have any knowledge > about Asian encodings): > > 1. Keep large mapping tables in single automatically generated C > modules that export a lookup object (ones that define __getitem__). > These could also be generated using some perfect hash table > generator, BTW, to reduce memory consumption. After researching perfect hash tables a little and thinking about it a little more, a question: I think this could work well for the decoding maps, but for encoding (from Unicode to a legacy encoding), wouldn't I have to be able to detect misses in my hash lookup? For example, if I had a string in Unicode that I was trying to convert to EUC-JP, and I looked up a Unicode character that has no mapping to EUC-JP, with a regular hash I my lookup will still succeed and I'll get back an EUC character anyway, but the wrong one... The only way I could think of to avoid this would be to store the key as part of the value (or alternately some kind of unique checksum), and then after lookup compare the original key to the key that was looked up in the table; if they are the same, then I've got a valid mapping, and if they are different than my lookup failed, and I should return some kind of sentinel value (0xFFFF or something?). Since the Unicode keys are all two bytes apiece, and for some of the largest CJK encoding standards the values are a max of 4 bytes long (e.g. EUC-TW), I then need to define my mapping table as containing values 8 bytes each in length, right? (assuming that I should keep the values of the array aligned along machine words) Is this complication worth the space savings, I wonder? I think a table built this way might be a little smaller than an unhashed plain old table, since in a three- or four-byte encoding there are generally always a lot fewer mapped values than there are spaces in the available plane... maybe I should go with mappings as simple array first and then figure out how to make them smaller if it seems to matter a lot to people? This is a pretty much a speed vs. space issue. Opinions? Does anyone have a cleverer way to detect the validity of a hash lookup? --Brian From mal@lemburg.com Thu Mar 16 10:21:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 16 Mar 2000 11:21:29 +0100 Subject: [I18n-sig] thinking of CJK codec, some questions References: <38CF4F25F.B74EBRIAN@smtp.garage.co.jp> <38CF9FF2.5E558813@lemburg.com> <38D083F3254.189CBRIAN@smtp.garage.co.jp> Message-ID: <38D0B5A9.1A4A66E9@lemburg.com> Brian Takashi Hooper wrote: > > On Wed, 15 Mar 2000 15:36:34 +0100 > "M.-A. Lemburg" wrote: > > > Just a few comments about the design (don't have any knowledge > > about Asian encodings): > > > > 1. Keep large mapping tables in single automatically generated C > > modules that export a lookup object (ones that define __getitem__). > > These could also be generated using some perfect hash table > > generator, BTW, to reduce memory consumption. 
> After researching perfect hash tables a little and thinking about it a > little more, a question: I think this could work well for the decoding > maps, but for encoding (from Unicode to a legacy encoding), wouldn't I > have to be able to detect misses in my hash lookup? I'd suggest using the same technique as Python: lookup the hash(key) value and then compare the found entry (key,value) with the looked up key. Since we are lucky, you can use the identity function as hash function... keys still are Unicode ordinals, but now you also store them in the mapping result (some redundance, but better than putting them together with the keys). > For example, if I > had a string in Unicode that I was trying to convert to EUC-JP, and I > looked up a Unicode character that has no mapping to EUC-JP, with a > regular hash I my lookup will still succeed and I'll get back an EUC > character anyway, but the wrong one... The only way I could think of to > avoid this would be to store the key as part of the value (or > alternately some kind of unique checksum), and then after lookup compare > the original key to the key that was looked up in the table; if they are > the same, then I've got a valid mapping, and if they are different than > my lookup failed, and I should return some kind of sentinel value > (0xFFFF or something?). Since the Unicode keys are all two bytes > apiece, and for some of the largest CJK encoding standards the values > are a max of 4 bytes long (e.g. EUC-TW), I then need to define my > mapping table as containing values 8 bytes each in length, right? See above. > (assuming that I should keep the values of the array aligned along > machine words) Is this complication worth the space savings, I wonder? > I think a table built this way might be a little smaller than an > unhashed plain old table, since in a three- or four-byte encoding there > are generally always a lot fewer mapped values than there are spaces in > the available plane... maybe I should go with mappings as simple array > first and then figure out how to make them smaller if it seems to matter > a lot to people? This is a pretty much a speed vs. space issue. This is probably the way to go. You can always exchange the mapping table against some other technique as long as the interface stays the same. I think what more important now, is focussing on the encoders and decoders... > Opinions? Does anyone have a cleverer way to detect the validity of a > hash lookup? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Mar 16 10:35:04 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 16 Mar 2000 11:35:04 +0100 Subject: [I18n-sig] CJK codecs etc References: Message-ID: <38D0B8D8.5AC5C59C@lemburg.com> Christian Wittern wrote: > > Hi everybody, > > I have some comments about CJK codecs, which are more from a user than a > programmers perspective. > > 1.) Please provide a (configurable?) fallback for failed conversions. This > is of course especially needed for conversions out of Unicode. What I have > in mind is, for example, provide the Unicode codepoint as entity (&U-4e00;) > or Java escape or some such, depending on the users choice. Don't just give > a '?', what M$'s braindead conversion routines do and thus regularily drive > me nuts. Please read the Misc/unicode.txt file. There are different error handling techniques available... 
'strict' (raise an error), 'ignore' (ignore the failed mapping), 'replace' (replace the failed mapping by some codec specific replacement char, e.g. '?'). The error argument is codec specific -- the above values must work though. > 2.) On the same topic, there are some fairly frequently codepoints that map > to different codepoints in Japanese and Taiwans encoding, although this is > in most cases not expected. These codepoints should have been eliminated by > Unicodes unification rules, but crept in via the source-encoding separation > rule -- not a very good decision in my opinion. I have a list of some such > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > there should be a way for the user to influence the conversion by providing > a list of his choice (with his modifications) to the codec, to overlay the > predefined values. Everybody can write their own codecs... so no comment on this one ;-) > 3.) The nasty problem of user defined characters. I think there should be a > default mapping of the user defined area in DBCS encodings to the Unicode > code range for user characters. Microsoft uses fixed sequential tables and I > think that is a good idea, since it is pretty straightforward. In big5 for > example, the area of user defined characters starts at Fa40, Fa41 ..., which > gets mapped to Unicode E000, E001, .. There should also be an option to use > some kind of entity reference instead. The core Python Unicode implementation doesn't touch these private code areas at all. This issue is left to the codecs. Since they are probably of some importance to the Asian world due to the many corporate char sets, I guess the Asian codecs should provide some kind of logic to handle these areas as special cases... perhaps by passing an extra mapping table to the codec. > 4.) I developped years ago the habit of using entity references for any > characters not representable in the given characterset used by the system. I > have seen this becoming more widespread in the user communities I work with. > It would be very useful for us, if the Unicode conversion routines in Python > could be told to tread some arbitray entity references (we use things like > &M24501; for the characters assigned by the Mojikyo Font Institute (see > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > encoding). I realize that this is a rather specialised usage, but it would > be great and very helpful to have some hook in the system to treat this > stuff just like any other character. Hmm, sounds like some kind of SGML entity codec could solve this aspect... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From chris@ccbs.ntu.edu.tw Fri Mar 17 07:01:14 2000 From: chris@ccbs.ntu.edu.tw (Christian Wittern) Date: Fri, 17 Mar 2000 15:01:14 +0800 Subject: [I18n-sig] CJK codecs etc In-Reply-To: <38D0B8D8.5AC5C59C@lemburg.com> Message-ID: Marc-Andre Lemburg wrote: > Christian Wittern wrote: > > > > > > 1.) Please provide a (configurable?) fallback for failed > conversions. This > > is of course especially needed for conversions out of Unicode. > What I have > > in mind is, for example, provide the Unicode codepoint as > entity (&U-4e00;) > > or Java escape or some such, depending on the users choice. > Don't just give > > a '?', what M$'s braindead conversion routines do and thus > regularily drive > > me nuts. > > Please read the Misc/unicode.txt file. 
There are different error > handling techniques available... 'strict' (raise an error), > 'ignore' (ignore the failed mapping), 'replace' (replace the > failed mapping by some codec specific replacement char, e.g. '?'). Err. If you read my comment above, this is exactly what I *don't* want to see, since this is of no help at all. What I want to have is a fallback mechanism, that preserves the information contained in the file (or maps it to some other second best match). Simple raising an error or putting in some default char is not helpful to the user at all!!! Christian > > The error argument is codec specific -- the above values must > work though. > > > 2.) On the same topic, there are some fairly frequently > codepoints that map > > to different codepoints in Japanese and Taiwans encoding, > although this is > > in most cases not expected. These codepoints should have been > eliminated by > > Unicodes unification rules, but crept in via the > source-encoding separation > > rule -- not a very good decision in my opinion. I have a list > of some such > > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > > there should be a way for the user to influence the conversion > by providing > > a list of his choice (with his modifications) to the codec, to > overlay the > > predefined values. > > Everybody can write their own codecs... so no comment on this one ;-) > > > 3.) The nasty problem of user defined characters. I think there > should be a > > default mapping of the user defined area in DBCS encodings to > the Unicode > > code range for user characters. Microsoft uses fixed sequential > tables and I > > think that is a good idea, since it is pretty straightforward. > In big5 for > > example, the area of user defined characters starts at Fa40, > Fa41 ..., which > > gets mapped to Unicode E000, E001, .. There should also be an > option to use > > some kind of entity reference instead. > > The core Python Unicode implementation doesn't touch these > private code areas at all. This issue is left to the codecs. > > Since they are probably of some importance to the Asian world > due to the many corporate char sets, I guess the Asian codecs > should provide some kind of logic to handle these areas as > special cases... perhaps by passing an extra mapping table > to the codec. That would solve the above point 2 as well and is all I have in mind here: Leave some hook that the user can pass some overlayed extra mapping table, without having to write a codec of his own. ALthough I realize the latter is possible, I don't think it is practicle and maybe not even desirable. I don't want to design a different car from scratch, just because I don't like the color:-) > > > 4.) I developped years ago the habit of using entity references for any > > characters not representable in the given characterset used by > the system. I > > have seen this becoming more widespread in the user communities > I work with. > > It would be very useful for us, if the Unicode conversion > routines in Python > > could be told to tread some arbitray entity references (we use > things like > > &M24501; for the characters assigned by the Mojikyo Font Institute (see > > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > > encoding). I realize that this is a rather specialised usage, > but it would > > be great and very helpful to have some hook in the system to treat this > > stuff just like any other character. 
> > Hmm, sounds like some kind of SGML entity codec could solve this > aspect... Right, but how would that be integrated with the other codecs? Christian Wittern, Taipei From mal@lemburg.com Fri Mar 17 08:40:49 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 17 Mar 2000 09:40:49 +0100 Subject: [I18n-sig] CJK codecs etc References: Message-ID: <38D1EF91.6B78B027@lemburg.com> Christian Wittern wrote: > > Marc-Andre Lemburg wrote: > > > Christian Wittern wrote: > > > > > > > > > 1.) Please provide a (configurable?) fallback for failed > > conversions. This > > > is of course especially needed for conversions out of Unicode. > > What I have > > > in mind is, for example, provide the Unicode codepoint as > > entity (&U-4e00;) > > > or Java escape or some such, depending on the users choice. > > Don't just give > > > a '?', what M$'s braindead conversion routines do and thus > > regularily drive > > > me nuts. > > > > Please read the Misc/unicode.txt file. There are different error > > handling techniques available... 'strict' (raise an error), > > 'ignore' (ignore the failed mapping), 'replace' (replace the > > failed mapping by some codec specific replacement char, e.g. '?'). > > Err. If you read my comment above, this is exactly what I *don't* want to > see, since this is of no help at all. What I want to have is a fallback > mechanism, that preserves the information contained in the file (or maps it > to some other second best match). Simple raising an error or putting in some > default char is not helpful to the user at all!!! Codecs may provide more than these three error handling modes -- the only requirement is that at least these three are defined. Note that 'replace' and 'ignore' do have their value when it comes to writing code that puts more priority on working without errors than 100% percent correct output. > > The error argument is codec specific -- the above values must > > work though. > > > > > 2.) On the same topic, there are some fairly frequently > > codepoints that map > > > to different codepoints in Japanese and Taiwans encoding, > > although this is > > > in most cases not expected. These codepoints should have been > > eliminated by > > > Unicodes unification rules, but crept in via the > > source-encoding separation > > > rule -- not a very good decision in my opinion. I have a list > > of some such > > > characters at http://www.chibs.edu.tw/~chris/smart/cjkconv.htm, Ideally, > > > there should be a way for the user to influence the conversion > > by providing > > > a list of his choice (with his modifications) to the codec, to > > overlay the > > > predefined values. > > > > Everybody can write their own codecs... so no comment on this one ;-) > > > > > 3.) The nasty problem of user defined characters. I think there > > should be a > > > default mapping of the user defined area in DBCS encodings to > > the Unicode > > > code range for user characters. Microsoft uses fixed sequential > > tables and I > > > think that is a good idea, since it is pretty straightforward. > > In big5 for > > > example, the area of user defined characters starts at Fa40, > > Fa41 ..., which > > > gets mapped to Unicode E000, E001, .. There should also be an > > option to use > > > some kind of entity reference instead. > > > > The core Python Unicode implementation doesn't touch these > > private code areas at all. This issue is left to the codecs. 
> > > > Since they are probably of some importance to the Asian world > > due to the many corporate char sets, I guess the Asian codecs > > should provide some kind of logic to handle these areas as > > special cases... perhaps by passing an extra mapping table > > to the codec. > > That would solve the above point 2 as well and is all I have in mind here: > Leave some hook that the user can pass some overlayed extra mapping table, > without having to write a codec of his own. ALthough I realize the latter is > possible, I don't think it is practicle and maybe not even desirable. I > don't want to design a different car from scratch, just because I don't like > the color:-) I think we are starting to pile up some good comments on what the Asian codecs should look like... perhaps its time for someone to jump in and write a proposal as basis for further discussion. (I don't have time for this and not even enough knowledge about the complexity of the Asian encodings, so I'll leave this to one of you...) > > > > > 4.) I developped years ago the habit of using entity references for any > > > characters not representable in the given characterset used by > > the system. I > > > have seen this becoming more widespread in the user communities > > I work with. > > > It would be very useful for us, if the Unicode conversion > > routines in Python > > > could be told to tread some arbitray entity references (we use > > things like > > > &M24501; for the characters assigned by the Mojikyo Font Institute (see > > > www.mojikyo.gr.jp) and &C4-4e21; for characters in the Taiwanese CNS > > > encoding). I realize that this is a rather specialised usage, > > but it would > > > be great and very helpful to have some hook in the system to treat this > > > stuff just like any other character. > > > > Hmm, sounds like some kind of SGML entity codec could solve this > > aspect... > > Right, but how would that be integrated with the other codecs? Codecs are stackable :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Tue Mar 21 03:27:52 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Tue, 21 Mar 2000 12:27:52 +0900 Subject: [I18n-sig] iconv encoding/decoding Message-ID: <38D6EC3839E.18B2BRIAN@smtp.garage.co.jp> Hi all - Have others looked at the double-byte codec implementations for iconv, in glibc 2? This implementation uses customized lookup code for each encoding - I don't think it's possible to make something that will run much faster than this. However, maybe we would be better off making a state-machine based implementation that can be programmed and customized from Python. Looking at the iconv implementation should give us a good idea of what atomic actions are possible. There are also scripts for automating the table creation from the mapping tables at Unicode.org, and test data sets. Does Python's license allow us to borrow pieces from GPL'd software? --Brian (Thanks Ted for the reference) From mal@lemburg.com Tue Mar 21 09:25:23 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 21 Mar 2000 10:25:23 +0100 Subject: [I18n-sig] iconv encoding/decoding References: <38D6EC3839E.18B2BRIAN@smtp.garage.co.jp> Message-ID: <38D74003.C4D54838@lemburg.com> Brian Takashi Hooper wrote: > > Hi all - > > Have others looked at the double-byte codec implementations for iconv, > in glibc 2? 
> > This implementation uses customized lookup code for each encoding - I > don't think it's possible to make something that will run much faster > than this. However, maybe we would be better off making a state-machine > based implementation that can be programmed and customized from Python. > Looking at the iconv implementation should give us a good idea of what > atomic actions are possible. > > There are also scripts for automating the table creation from the > mapping tables at Unicode.org, and test data sets. Does Python's > license allow us to borrow pieces from GPL'd software? It does, but nothing GPLed can go into the core distribution and have such an important piece of software under GPL would harm the useability of these codecs in commercial apps. Borrowing a few ideas is allowed though :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Tue Mar 21 17:12:21 2000 From: andy@reportlab.com (Andy Robinson) Date: Tue, 21 Mar 2000 17:12:21 -0000 Subject: [I18n-sig] Asian Encodings Message-ID: I've been on vacation since The Big Patch - sorry about the lousy timing. I hope to get up a friendly tutorial on using the Unicode features shortly. In the meantime, some thoughts on the codecs and recent conversations: >1. Should the CJK ideograms also be included in the unicodehelpers >numeric converters? From my perspective, I'd really like to see them go >in, and think that it would make sense, too - any opinions? >2. Same as above with double-width alphanumeric characters - I assume >these should probably also be included in the lowercase / uppercase >helpers? Or will there be a way to add to these lists through the codec >API (for those worried about data from unused codecs clogging up their >character type helpers, maybe this would be a good option to have; I >would by contrast like to be able to exclude all the extra Latin 1 stuff >that I don't need, hmm.) >3. Same thing for whitespace - I think there are a number of >double-width whitespace characters around also. We have to be really careful about what goes in the Python core, and what is implemented as helper layers on top, with a preference for the latter where possible. If we have access to the character properties database, we could write some helper libraries which give the full range of isKatakana, isNumeric etc. in some dynamic way, without needing them hardcoded into the core; what we are really asking is 'does a character have a property'. I haven't checked the API for this yet, but if it is not there then we need it. >Don't know what the standard installation method is... this >hasn't been sorted out yet. I'm keen to sort this out, so we can start playing with codecs. Here's a bunch of ideas I'd like to float. From now on, please assume I am discussing some kind of CJK add-on package and not the Python core; it may benefit from some helper functions in the core, but is not for everybody. Character Sets and Encodings ---------------------------- Ken Lunde suggests that we should explicitly model Character Sets as distinct from Encodings; for example, Shift-JIS is an encoding which includes three character sets, (ASCII, JIS0208 Kanji and the Half width katakana). I tried to do this last year, but was not exactly sure of the point; AFAIK it is only useful if you want to reason about whether certain texts can survive certain round trips. 
Can anyone see a need to do this kind of thing? Bypassing Unicode ----------------- At some level, it should be possible to write and 'register' a codec which goes straight from, say, EUC to Shift_JIS without Unicode in the middle, using our codec machine. We need to figure out how this will be accessed; what is the clean way for a user to request the codec, without complicating or affecting anything in the present implementation. The present conventions of StreamWriters, StreamRecoders etc. are really useful, with or without Unicode. Can we overload to do codecs.lookup(sourceEncoding, destEncoding)? Or should it be something totally separate? Codecs State Machine -------------------- As you know I suggested an mxTextTools-inspired mini-language for doing stream transformations. I've never written this kind of thing before, but think it could be quite useful - I bet it could do data compression and image manipulation too. However, I have no experience designing languages. It seems to me that we should be able to convert data faster than we can rad/write to disk, but beyond that we need flexibility more than speed. Now what actions does it need? Should we steam straight in, or prototype it in Python? - what types? it cannot be as flexible as Python, or it will be no faster. Presumably most of the functions are statically typed, and we only need bytes/character, integers and booleans - what events when initialized ? construct mapping tables? - read n bytes from input into a string buffer - write n bytes from a string buffer to output - look up 1/2/n bytes in a mapping - full set of math and bit operators routines One good suggestion I had from Aaron Watters was that by treating it as a language, one could have a code-generation option as well as a runtime; we might be able to create C code for specific encodings on demand. Mapping tables: --------------- For CJKV stuff I strongly favour mapping tables which are built at run time. Mapping tables would be some of the possible inputs to our mini-language; we would be able to write routines saying 'until byte pattern x encountered do (read 2 bytes, look it up in a table, write the values found)', but with user-supplied mapping tables. These are currently implemented as dictionaries, but there are many contiguous ranges and a compact representation is possible. I did this last year for a client and it worked pretty well. Even the big CJKV ones come down to about 80 contiguous ranges. Conceptually, let's imagine that bytes 1 to 5 in source encoding map to 100-105 in destination; 6-10 map to 200-205; and 11-15 map to 300-305. Then we can create a 'compact map' structure like this... [(1, 5, 100), (6, 10, 200), (11, 15, 300)] ...and a routine which can expand it to a dictionary {1:100, 2:101 .... 15:305}. One can also write routines to invert maps, check if they represent a round trip and so on. The attraction is that the definitions can be in literal python modules, and look quite like the standards documents that create them. Furthermore, a lot of Japanese corporate encodings go like "Start with strict JIS-0208, and add these extra 17 characters..." - so one module could define all the variants for Japanese very cleanly and readably. I think this is a good way to tackle user-defined characters - tell them what to hack to add theirt 50 new characters and create an encoding with a new name. If this sounds sensible, I'll try to start on it. 
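To make the compact-map idea concrete, a minimal sketch, reading each tuple as (source_start, source_end, destination_start); note that under this reading the last range expands to 11:300 ... 15:304:

def expand_map(compact):
    # Expand [(src_start, src_end, dst_start), ...] into a plain
    # dictionary that a codec could use directly.
    full = {}
    for src_start, src_end, dst_start in compact:
        for offset in range(src_end - src_start + 1):
            full[src_start + offset] = dst_start + offset
    return full

def invert_map(full):
    # Only a true inverse if the forward map is 1-1, i.e. round-trip safe.
    inverse = {}
    for src, dst in full.items():
        inverse[dst] = src
    return inverse

compact = [(1, 5, 100), (6, 10, 200), (11, 15, 300)]
full = expand_map(compact)
assert full[2] == 101 and full[15] == 304
assert invert_map(invert_map(full)) == full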
Test Harness ------------ A digression here, but perhaps we should build a web interface to convert arbitrary files and output as HTML, so everyone can test the output of the codecs as we write them. Is this useful? That's enough rambling for one day... Thanks, Andy From brian@garage.co.jp Wed Mar 22 01:53:48 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 22 Mar 2000 10:53:48 +0900 Subject: [I18n-sig] Asian Encodings In-Reply-To: References: Message-ID: <38D827AC273.18CBBRIAN@smtp.garage.co.jp> Hi Andy, welcome back, On Tue, 21 Mar 2000 17:12:21 -0000 "Andy Robinson" wrote: [snip] > Character Sets and Encodings > ---------------------------- > Ken Lunde suggests that we should explicitly model Character Sets as > distinct from Encodings; for example, Shift-JIS is an encoding which > includes three character sets, (ASCII, JIS0208 Kanji and the Half width > katakana). I tried to do this last year, but was not exactly sure of the > point; AFAIK it is only useful if you want to reason about whether certain > texts can survive certain round trips. Can anyone see a need to do this > kind of thing? One complication that kind of arises from this is, if you've had a look at the mappings which are available on Unicode.org, some of them are encoding maps and some of them are character set maps. Which of course by itself is not such a huge chore but makes automatically generating maps somewhat less trivial than if you ignore such considerations. [snip] > Mapping tables: > --------------- > For CJKV stuff I strongly favour mapping tables which are built at run time. > Mapping tables would be some of the possible inputs to our mini-language; we > would be able to write routines saying 'until byte pattern x encountered do > (read 2 bytes, look it up in a table, write the values found)', but with > user-supplied mapping tables. > > These are currently implemented as dictionaries, but there are many > contiguous ranges and a compact representation is possible. I did this last > year for a client and it worked pretty well. Even the big CJKV ones come > down to about 80 contiguous ranges. Conceptually, let's imagine that bytes > 1 to 5 in source encoding map to 100-105 in destination; 6-10 map to > 200-205; and 11-15 map to 300-305. Then we can create a 'compact map' > structure like this... > [(1, 5, 100), > (6, 10, 200), > (11, 15, 300)] > ...and a routine which can expand it to a dictionary {1:100, 2:101 .... > 15:305}. This is similar to the way a bunch of the codecs for glibc's iconv work - there is an index mapping table which consists of start and end ranges, and an index, which allows a lookup function to index properly into a big static array. iconv, as I posted earlier, is one place that it might be good to get ideas, both for ideas on what kinds of operations the codec machine should be able to do and data storage. How about making the interface to mappings simply __getitem__, as suggested earlier on this list by Marc-Andre? I think that might be the best way to ensure that we have lots of different options for what we can use as mappings. The Java i18n classes are also worth a look - they do everything as an inheritance hierarchy, with the logic for doing the conversion kind of bundled together with the maps themselves - everything inherits from either ByteToCharConverter or CharToByteConverter, and then defines a convert routine to do conversion. 
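Picking up the __getitem__ suggestion above, here is one possible shape for a mapping object built directly on compact (start, end, destination-start) ranges, so the full dictionary never needs to be expanded. The class and method names are illustrative only, not an agreed interface:

import bisect

class CompactMap:
    def __init__(self, ranges):
        # ranges must be sorted and non-overlapping:
        # [(src_start, src_end, dst_start), ...]
        self.ranges = ranges
        self.starts = map(lambda r: r[0], ranges)

    def __getitem__(self, code):
        i = bisect.bisect(self.starts, code) - 1
        if i >= 0:
            src_start, src_end, dst_start = self.ranges[i]
            if code <= src_end:
                return dst_start + (code - src_start)
        raise KeyError(code)

m = CompactMap([(1, 5, 100), (6, 10, 200), (11, 15, 300)])
print m[7]    # -> 201

Anything exposing __getitem__ like this could then be handed to the conversion loop in place of a real dictionary.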
The inheritance relationships are kind of weird, I think - like, ByteToCharEUC_JP inherits from ByteToCharJIS0208, and contains ByteToCharJIS0201 and ByteToCharJIS0212 instances as class members. I like how the codecs return their max character width - this can sometimes be more than two bytes for some asian languages and helps to know for purposes of calculating memory allocation when going from Unicode back to a legacy encoding, for example. (If anyone's interested, I have decompiled copies of i18n.jar which I can put up someplace for people to look at). > One can also write routines to invert maps, check if they represent a round > trip and so on. The attraction is that the definitions can be in literal > python modules, and look quite like the standards documents that create > them. Furthermore, a lot of Japanese corporate encodings go like "Start > with strict JIS-0208, and add these extra 17 characters..." - so one module > could define all the variants for Japanese very cleanly and readably. I > think this is a good way to tackle user-defined characters - tell them what > to hack to add theirt 50 new characters and create an encoding with a new > name. If this sounds sensible, I'll try to start on it. > > > Test Harness > ------------ > A digression here, but perhaps we should build a web interface to convert > arbitrary files and output as HTML, so everyone can test the output of the > codecs as we write them. Is this useful? > > That's enough rambling for one day... > > Thanks, > > Andy > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig > From brian@garage.co.jp Wed Mar 22 02:17:43 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Wed, 22 Mar 2000 11:17:43 +0900 Subject: [I18n-sig] Asian Encodings In-Reply-To: References: Message-ID: <38D82D47136.18CCBRIAN@smtp.garage.co.jp> Hi again, One other thing I forgot to mention, is that we'll have to start thinking about (canonical) normalization, at least on a rudimentary level, for Asian encodings - one specific example I can think of is in Japanese with half-width katakana characters, there are a few diacritical marks (dakuten) which are represented themselves as separate characters - most encoding packages I've seen special case on these and turn them into their corresponding canonical representations. Without normalization, searches and processing for these characters become a bit of pain. So, one other goal of creating the East Asian codecs should also be to add some normalization support to the existing framework... other Unicode packages / implementations mostly use normalization form C for everything. Those that aren't familiar with Unicode Normalization Forms, here's the technical report, which is a good reference: http://www.unicode.org/unicode/reports/tr15/tr15-18.html --Brian From andy@reportlab.com Thu Mar 23 11:51:49 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 11:51:49 -0000 Subject: [I18n-sig] Codec Language In-Reply-To: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> Message-ID: On the subject of a mini-language for dealing with Asian codecs...I'm fooling around with something in pure Python - a toy interpreter for a basic FSM - I'll try to post something up after the weekend. In the meantime, we should certainly list the actions we need to be able to perform at a conceptual level: 1. Data structures/types for bytes, strings, numbers and mapping tables 2. 
Read n bytes into designated buffers from input 3. Write contents of designated buffers to output 4. Look up contents of a buffer in a mapping table, and do somethign with the output (how to deal with failed lookups?) 5. Do math, string concenatenation, bit operations 6. Wide range of pattern-matching tests on short strings and bytes - byte in range, byte in set etc. mxTextTools gives loads of examples. Please pitch in with any suggested operations you think we need. The real issue seems to be, can we do it with an FSM that is not hideously complex to program? Or do we need a non-finite language in which infinite loops etc. are possible? The latter is easier to write things in, but may not be as safe or as fast. - Andy From mal@lemburg.com Thu Mar 23 12:11:00 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 23 Mar 2000 13:11:00 +0100 Subject: [I18n-sig] Kanji codec sample Message-ID: <38DA09D4.490C92B5@lemburg.com> Just thought this might be of interest to you. There is a sample implementation on the ftp.unicode.org site: ftp://ftp.unicode.org/Public/PROGRAMS/KANJIMAP/ Perhaps this could be used to get a quick start or at least some ideas about how Asian codecs could work... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From brian@garage.co.jp Thu Mar 23 13:32:12 2000 From: brian@garage.co.jp (Brian Takashi Hooper) Date: Thu, 23 Mar 2000 22:32:12 +0900 Subject: [I18n-sig] Codec Language In-Reply-To: References: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> Message-ID: <38DA1CDC1FD.DED7BRIAN@smtp.garage.co.jp> Hi Andy, On Thu, 23 Mar 2000 11:51:49 -0000 "Andy Robinson" wrote: > On the subject of a mini-language for dealing with Asian codecs...I'm > fooling around with something in pure Python - a toy interpreter for a basic > FSM - I'll try to post something up after the weekend. In the meantime, we > should certainly list the actions we need to be able to perform at a > conceptual level: > > > 1. Data structures/types for bytes, strings, numbers and mapping tables > 2. Read n bytes into designated buffers from input > 3. Write contents of designated buffers to output > 4. Look up contents of a buffer in a mapping table, and do somethign with > the output (how to deal with failed lookups?) > 5. Do math, string concenatenation, bit operations > 6. Wide range of pattern-matching tests on short strings and bytes - byte in > range, byte in set etc. mxTextTools gives loads of examples. I'd been thinking along these lines too; from the encodings that I've surveyed currently, which I think includes most of the major ones for which there are unicode.org mappings available, the above should probably be sufficient to do the job. It also seems like with a scheme that allows a single codec to use multiple maps, it should be possible to do any of the asian codecs with only a two-byte key and four-byte value. The four-byte value would include the key that mapped to it, plus the value itself (which, as far as I've gathered, could always be two bytes), so that misses could be detected. The reason two bytes is enough is that even though there are extensions to many encodings which allow them to use more space outside the BMP, those added spaces are always mapped as contiguous planes, and never (at least in any of the encodings that I know of) larger than what can be mapped on a 2-byte grid. > > Please pitch in with any suggested operations you think we need. 
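For the record, a toy pure-Python pass at the kind of conversion loop being discussed: read one or two bytes, look the value up through __getitem__, write the result, and fall back according to an errors flag. The lead-byte test is only indicative (roughly Shift-JIS-like) and speed is beside the point here; a real codec would take both the byte ranges and the mapping from the encoding definition:

def decode_dbcs(data, mapping, errors='strict'):
    # 'mapping' can be any object supporting __getitem__ that returns
    # Unicode strings; keys are (lead << 8) | trail for two-byte
    # sequences and the plain byte value otherwise.
    result = []
    i, n = 0, len(data)
    while i < n:
        lead = ord(data[i])
        if 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC:
            if i + 1 >= n:
                raise UnicodeError('truncated two-byte sequence')
            key = (lead << 8) | ord(data[i + 1])
            i = i + 2
        else:
            key = lead
            i = i + 1
        try:
            result.append(mapping[key])
        except KeyError:
            if errors == 'replace':
                result.append(u'\uFFFD')
            elif errors == 'ignore':
                pass
            else:
                raise UnicodeError('unmapped code 0x%X' % key)
    return u''.join(result)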
> > The real issue seems to be, can we do it with an FSM that is not hideously > complex to program? Or do we need a non-finite language in which infinite > loops etc. are possible? The latter is easier to write things in, but may > not be as safe or as fast. Allowing for both algorithmic and mapping codecs within the same implementation might confuse matters somewhat... what about separating things into mapping codecs (which will handle all the Unicode stuff), and a separate machine (or possibly extension to the mapping machine) that can do algorithmic transformations? This would whittle down the immediate problem to developing the mapping machine, which as far as I can tell should only have to support reading, writing, lookup, and comparison, at least for doing Unicode conversions. How does this sound? Also, I think another thing on our agenda should be to list up a preliminary list of encodings/character sets we're going to support from the beginning - this will also help to narrow the scope of the problem somewhat. There may eventually be other encodings which we'll want to support by adding some extra functionality to the machine; but in general, I don't think that there's any harm in making something that's really simple to do what we want to do now... If this sounds like a good idea then I'll draw up a preliminary list from the Unicode site, and then we can take a look at implementations (iconv, Java, and the KANJIMAP link Marc-Andre just posted, for example) to help figure out the FSM instruction set. What do you all think? --Brian From andy@reportlab.com Thu Mar 23 22:14:19 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 22:14:19 -0000 Subject: [I18n-sig] More grief on Windows Message-ID: <002201bf9515$75b4fab0$01ac2ac0@boulder> I've built the Unicode-aware Python on Windows, with a proper encodings library. The moment I try to look up a codec, python crashes... C:\users>python Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> unicode('hello','ascii') !!!! Application Error at this point ...try again... C:\users>python Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> unicode('hello') u'hello' >>> unicode('hello','utf-8') u'hello' >>> import codecs >>> codecs.lookup('ascii') !!!! Application Error at this point This happens on two different machines, building using VC++ and the standard workspace, both with a full CVS tree and no other Pythons lurking. I stepped through in Pythonwin, and found that __init__.py is called, and the 'ascii' module is loaded on demand correctly; immediately after this, it crashes. I don't have the skills to debug the C - yet. Is anyone else able to run the above snippets on Windows, or is it me? Thanks very much, Andy Robinson From mal@lemburg.com Thu Mar 23 22:48:10 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 23 Mar 2000 23:48:10 +0100 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> Message-ID: <38DA9F2A.D174B26A@lemburg.com> Andy Robinson wrote: > > I've built the Unicode-aware Python on Windows, with a proper encodings > library. > The moment I try to look up a codec, python crashes... > > C:\users>python > Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > >>> unicode('hello','ascii') > !!!! 
Application Error at this point I can reproduce this on Linux too... I'll look into this and send a patch. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Thu Mar 23 23:16:55 2000 From: andy@reportlab.com (Andy Robinson) Date: Thu, 23 Mar 2000 23:16:55 -0000 Subject: [I18n-sig] Codec Language References: <38D9DC5E103.DED4BRIAN@smtp.garage.co.jp> <38DA1CDC1FD.DED7BRIAN@smtp.garage.co.jp> Message-ID: <000701bf951d$e1cc5c40$01ac2ac0@boulder> > Allowing for both algorithmic and mapping codecs within the same > implementation might confuse matters somewhat... what about separating > things into mapping codecs (which will handle all the Unicode stuff), > and a separate machine (or possibly extension to the mapping machine) > that can do algorithmic transformations? This would whittle down the > immediate problem to developing the mapping machine, which as far as I > can tell should only have to support reading, writing, lookup, and > comparison, at least for doing Unicode conversions. How does this > sound? I've been thinking hard what to do next, and actually I think the highest priorities are (a) build some kind if cgi test harness (maybe on Starship?), on which we can stash all manner of input files, and a front end which lets you specify input (file or a text field), say what encoding it is oiin, and say what encoding you want to see it in. Then, just using web browsers, we can actually see the results of type conversions, and can accumulate test files with subtle combinations of text. (b) write some pure Python Asian codecs, no matter how slow, using simple dictionaries for the mapping tables. This gives us a benchmark, documents the algorithms and features we are going to need, and lets people other than you and I see what features are needed in a faster codec machine. We should be able to move on that pretty fast. What do you think? BTW, I have often used uniconv.exe, a free utility from BasisTech - it is a command line program to do encoding conversion and character normalization transformations. Another really good test target would be to write a uniconv.py and a harness to run them both - when they give the same output for all encodings, we know we've done a good job. - Andy From mal@lemburg.com Thu Mar 23 23:21:31 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 24 Mar 2000 00:21:31 +0100 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> <38DA9F2A.D174B26A@lemburg.com> Message-ID: <38DAA6FB.63A036D5@lemburg.com> "M.-A. Lemburg" wrote: > > Andy Robinson wrote: > > > > I've built the Unicode-aware Python on Windows, with a proper encodings > > library. > > The moment I try to look up a codec, python crashes... > > > > C:\users>python > > Python 1.5.2+ (#0, Mar 23 2000, 15:31:41) [MSC 32 bit (Intel)] on win32 > > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > > >>> unicode('hello','ascii') > > !!!! Application Error at this point > > I can reproduce this on Linux too... I'll look into this and > send a patch. Here it is: --- CVS-Python/Python/codecs.c Fri Mar 24 00:02:04 2000 +++ Python+Unicode/Python/codecs.c Fri Mar 24 00:01:49 2000 @@ -91,11 +91,11 @@ PyObject *lowercasestring(const char *st If no codec is found, a KeyError is set and NULL returned. 
*/ PyObject *_PyCodec_Lookup(const char *encoding) { - PyObject *result, *args = NULL, *v = NULL; + PyObject *result, *args = NULL, *v; int i, len; if (_PyCodec_SearchCache == NULL || _PyCodec_SearchPath == NULL) { PyErr_SetString(PyExc_SystemError, "codec module not properly initialized"); @@ -117,27 +117,26 @@ PyObject *_PyCodec_Lookup(const char *en Py_DECREF(v); return result; } /* Next, scan the search functions in order of registration */ - len = PyList_Size(_PyCodec_SearchPath); - if (len < 0) - goto onError; - args = PyTuple_New(1); if (args == NULL) goto onError; PyTuple_SET_ITEM(args,0,v); - v = NULL; + + len = PyList_Size(_PyCodec_SearchPath); + if (len < 0) + goto onError; for (i = 0; i < len; i++) { PyObject *func; func = PyList_GetItem(_PyCodec_SearchPath, i); if (func == NULL) goto onError; - result = PyEval_CallObject(func,args); + result = PyEval_CallObject(func, args); if (result == NULL) goto onError; if (result == Py_None) { Py_DECREF(result); continue; @@ -161,11 +160,10 @@ PyObject *_PyCodec_Lookup(const char *en PyDict_SetItem(_PyCodec_SearchCache, v, result); Py_DECREF(args); return result; onError: - Py_XDECREF(v); Py_XDECREF(args); return NULL; } static -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Mar 24 09:53:21 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 24 Mar 2000 09:53:21 -0000 Subject: [I18n-sig] More grief on Windows References: <002201bf9515$75b4fab0$01ac2ac0@boulder> <38DA9F2A.D174B26A@lemburg.com> <38DAA6FB.63A036D5@lemburg.com> Message-ID: <000e01bf9576$ca342dc0$01ac2ac0@boulder> > "M.-A. Lemburg" wrote: > > I can reproduce this on Linux too... I'll look into this and > > send a patch. > > Here it is: Yup, that works for me. Thanks for the fast response. - Andy From mal@lemburg.com Fri Mar 31 22:15:53 2000 From: mal@lemburg.com (M.-A.
Lemburg) Date: Sat, 01 Apr 2000 00:15:53 +0200 Subject: [I18n-sig] Test Suite for the Unicode codecs Message-ID: <38E52399.19220D0@lemburg.com> I would like to add some more testing to the mapping codecs in the Python encodings package. Right now I can only test for round-trips of lower character ordinal ranges, and even those tests fail for a couple of encodings. Does anyone have access to some reference test suite for these mappings? The mapping codec is probably not the cause of these errors. Perhaps the maps themselves aren't of high enough quality, or maybe some mappings just cannot provide round-trip safety... Here are my findings in the form of a Python test script with comments. The tests first translate an encoded string into Unicode and then translate it back. Some encodings have undefined mappings even in the lower ranges, and others seem to be 1-n rather than 1-1.

print 'Testing standard mapping codecs...',

print '0-127...',
s = ''.join(map(chr, range(128)))
for encoding in (
    'cp037', 'cp1026', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850',
    'cp852', 'cp855', 'cp860', 'cp861', 'cp862', 'cp863', 'cp865',
    'cp866',
    'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_2',
    'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
    'iso8859_9',
    'koi8_r', 'latin_1', 'mac_cyrillic', 'mac_latin2',
    'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
    'cp1256', 'cp1257', 'cp1258',
    'cp856', 'cp857', 'cp864', 'cp869', 'cp874',
    'mac_greek', 'mac_iceland','mac_roman', 'mac_turkish',
    'cp1006', 'cp875', 'iso8859_8',

    ### These have undefined mappings:
    #'cp424',
    ):
    try:
        assert unicode(s,encoding).encode(encoding) == s
    except AssertionError:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError,why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

print '128-255...',
s = ''.join(map(chr, range(128,256)))
for encoding in (
    'cp037', 'cp1026', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850',
    'cp852', 'cp855', 'cp860', 'cp861', 'cp862', 'cp863', 'cp865',
    'cp866',
    'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_2',
    'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
    'iso8859_9',
    'koi8_r', 'latin_1', 'mac_cyrillic', 'mac_latin2',

    ### These have undefined mappings:
    #'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
    #'cp1256', 'cp1257', 'cp1258',
    #'cp424', 'cp856', 'cp857', 'cp864', 'cp869', 'cp874',
    #'mac_greek', 'mac_iceland','mac_roman', 'mac_turkish',

    ### These fail the round-trip:
    #'cp1006', 'cp875', 'iso8859_8',
    ):
    try:
        assert unicode(s,encoding).encode(encoding) == s
    except AssertionError:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError,why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

print 'done.'

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
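A small follow-on helper (not part of the script above, and the name is made up) that reports exactly which byte values break the round trip for one encoding, which helps separate genuinely undefined positions from 1-n mappings:

def roundtrip_failures(encoding, lo=128, hi=256):
    failures = []
    for code in range(lo, hi):
        ch = chr(code)
        try:
            back = unicode(ch, encoding).encode(encoding)
        except ValueError:
            failures.append((code, 'undefined'))
            continue
        if back != ch:
            failures.append((code, 'comes back as ' + repr(back)))
    return failures

# e.g. one of the codecs reported as failing above:
for code, why in roundtrip_failures('cp875'):
    print '0x%02X: %s' % (code, why)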