[I18n-sig] thinking of CJK codec, some questions

Brian Takashi Hooper brian@garage.co.jp
Tue, 14 Mar 2000 17:10:47 +0900


Hi!

On Mon, 13 Mar 2000 16:47:44 +0100
"M.-A. Lemburg" <mal@lemburg.com> wrote:

[snip]

> Unicode is built upon ASCII -- I don't think that other encodings
> were taken into account during the ordinal assignment (not 100%
> sure though).
> 
> You should be able to get at the numeric information of DBCS
> chars (this is what you're talking about, right ?) by first
> converting them to Unicode.
Yes - it looks like this is the case :-).
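
For example, assuming a 'shift_jis' codec were registered (which is of
course exactly what doesn't exist yet), something like

s = '\x93\xfa\x96\x7b'        # "Nihon" in Shift-JIS
u = unicode(s, 'shift_jis')   # needs a registered shift_jis codec
print ord(u[0])               # 26085, i.e. 0x65E5 for the first kanji

is what I had in mind.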

> 
> > >
> > > > 4. Are there any conventions for how non-standard codecs should be
> > > > installed?  Should they be added to Python's encodings directory, or
> > > > should they just be added to site-packages or site-python like other
> > > > third-party modules?
> > >
> > > You can drop them anyplace you want... and then have them
> > > register a search function. The standard encodings package
> > > uses modules as codec basis but you could just as well provide
> > > other means of looking up and even creating codecs on-the-fly.
> > >
> > > Don't know what the standard installation method is... this
> > > hasn't been sorted out yet.
> > >
> > > My current thinking is to include all standard and small
> > > codecs in the standard dist and include the bigger ones
> > > in a separate Python add-on distribution (e.g. a tar file
> > > that gets untarred on top of an existing installation).
> > > A smart installer should ideally take care of this...
> > Maybe one using Distutils?  I guess it would make the most sense if,
> > when you run the install script with /usr/local/bin/python, for
> > example, the codecs got installed in the proper place for that Python
> > installation to use them...
> 
> Right. distutils could be a solution on Unix -- the problem
> of using distutils is that you first have to have a working
> Python installation for it to work, so such an approach 
> would only work in two steps: first Python core, then extended
> codecs package.
I guess it would be nice, then, to have something that could work in
either case...
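
For instance (just a sketch - the package name 'japanesecodecs' and
everything in it is made up), a codec package installed anywhere on
sys.path could announce itself through the registry with the search
function interface you describe above:

import codecs

def search(encoding):
    # return the (encoder, decoder, stream_reader, stream_writer)
    # 4-tuple for encodings we provide, None for everything else
    if encoding == 'euc-jp':
        from japanesecodecs import euc_jp  # hypothetical module
        return (euc_jp.encode, euc_jp.decode,
                euc_jp.StreamReader, euc_jp.StreamWriter)
    return None

codecs.register(search)

Then it wouldn't matter much whether the package got there via
distutils, via a tarball untarred on top of the distribution, or as
part of the core.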

Should encoding support be an option to ./configure when you are first
building Python?  General question to everyone out there - should it be
possible to intentionally build Python without Unicode support?

>  
> > >
> > > > 5. Are there any existing tools for converting from Unicode mapping
> > > > files to a C source file that can be handily made into a dynamic
> > > > library, or am I on my own there?
> > >
> > > No, there is a tool to convert them to a Python source file
> > > though (Misc/gencodec.py). The created codecs will use the
> > > builtin generic mapping codec as basis for their work.
> > >
> > > If mappings get huge (like the CJK ones), I would create a
> > > new parser though, which then generates extension modules
> > > to have the mapping available as static C data rather
> > > than as Python dictionary on the heap... gencodec.py
> > > should provide a good template for such a tool.
> > You recommend in the unicode proposal that the mapping should probably
> > be buildable as a shared library, to allow multiple interpreter
> > instances to share the table - for platforms which don't support this
> > option, then, would it make sense to make the codec such that the
> > mapping tables can be statically linked into the interpreter? Or, in
> > such a case, do you think it would be better to try to set things up so
> > that the mapping tables can be read from a file?
> 
> Since memory mapped files are not supported by Python by default
> I would suggest letting the system linker take care of
> sharing the constant C data from a shared (or statically linked)
> extension module. Reading the information directly from a file
> would probably be too slow.
> 
> Note that the module would only have to provide a simple
> __getitem__ interface compatible object which then fetches
> the data from the static C data. The rest can then be done
> in Python in the same way as the other mapping codecs do their
> job.
Am I right in thinking that 'static C data' means something like

static Py_UNICODE mapping[] = { ... };

?

Also, from a design standpoint, do you (and anyone else on i18n) think
it would be better to emphasize speed and/or memory efficiency by
making specialized codecs for the different CJK encodings, or to aim
for a generalized implementation?  For example, if a table such as the
above is used, then for a particular encoding like EUC it may be
possible to reduce the size of the table by introducing some
EUC-specific special-casing into the encoder/decoder.  We need
something like codecs.charset_encode and codecs.charset_decode for CJK
character sets - I was thinking that this might be best handled by a
few separate C modules (for Japanese: one for SJIS, one for EUC, and
one for JIS), each using similarly defined mapping modules that contain
only one or more static conversion maps as arrays - in this sense I am
leaning towards making tuned codecs for each encoding set.  A rough
(untested) sketch of what I have in mind follows below.
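
Here is the kind of thing I'm imagining for the EUC decoder, just as a
sketch ('table' stands in for the __getitem__-compatible object that
the C extension module would expose; error handling other than 'strict'
is left out, and the SS2/SS3 sequences for half-width kana and JIS X
0212 are ignored):

def decode(input, table, errors='strict'):
    # 'table' maps a packed (lead << 8 | trail) EUC code to the
    # corresponding Unicode character; in the real module it would be
    # backed by static C data rather than a Python dictionary.
    result = []
    i = 0
    n = len(input)
    while i < n:
        c = ord(input[i])
        if c < 0x80:
            # ASCII range passes through unchanged
            result.append(unichr(c))
            i = i + 1
        elif 0xA1 <= c <= 0xFE and i + 1 < n:
            # the EUC-specific part: a two-byte JIS X 0208 sequence
            c2 = ord(input[i + 1])
            result.append(table[(c << 8) | c2])
            i = i + 2
        else:
            raise UnicodeError("malformed EUC-JP input")
    return (u"".join(result), n)

For example, with a plain dictionary standing in for the C table,
decode('A\xb0\xa1', {0xB0A1: u'\u4e9c'}) should give (u'A\u4e9c', 3).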

I want to try to make something that many people can use - does this
sound like a reasonable approach, or am I on the wrong track here?

--Brian