[I18n-sig] thinking of CJK codec, some questions

M.-A. Lemburg mal@lemburg.com
Mon, 13 Mar 2000 16:47:44 +0100


Brian Takashi Hooper wrote:
> 
> > I'm not sure I understand what you are intending here: the
> > unicodectype.c file contains switch statements which were
> > deduced from the UnicodeData.txt file available at the
> > Unicode.org FTP site. It contains all mappings which were defined
> > in that file -- unless my parser omitted some.
> >
> > If you plan to add new mappings which are not part of the
> > Unicode standard, I would suggest adding them to a separate
> > module. E.g. you could extend the versions available through
> > the unicodedata module. But beware: the Unicode methods
> > only use the mappings defined in the unicodectype.c file.
> My mistake - I thought for some reason that double-width Latin
> characters, such as are used in Japanese, were part of the CJK ideogram
> code space that starts from \u3400, so I was expecting them to map to
> lower values in Unicode than they actually do (a double-width 'A', for
> example, is \uFF21).

Unicode is built upon ASCII -- I don't think that other encodings
were taken into account during the ordinal assignment (not 100%
sure though).

You should be able to get at the numeric information of DBCS
chars (this is what you're talking about, right?) by first
converting them to Unicode.
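
As a rough illustration of that conversion step: the sketch below
assumes a current Python 3 with a Shift_JIS codec available, and the
encoding and sample characters are picked only for the example. Once
the bytes are decoded, the unicodedata module can be asked about them.

```python
import unicodedata

# Fullwidth 'A' and fullwidth digit one, produced as DBCS bytes.
# Shift_JIS is an assumption made for this example only.
dbcs_bytes = "\uFF21\uFF11".encode("shift_jis")

# Step 1: convert the DBCS bytes to Unicode.
text = dbcs_bytes.decode("shift_jis")
wide_a, wide_one = text

# Step 2: query the Unicode character database.
name = unicodedata.name(wide_a)               # official character name
value = unicodedata.decimal(wide_one)         # decimal digit value
folded = unicodedata.normalize("NFKC", text)  # compatibility-folded form
```

The same lookups should work for any DBCS encoding for which a codec
is installed, since the properties are attached to the Unicode
ordinals and not to the byte sequences.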

> >
> > > 4. Are there any conventions for how non-standard codecs should be
> > > installed?  Should they be added to Python's encodings directory, or
> > > should they just be added to site-packages or site-python like other
> > > third-party modules?
> >
> > You can drop them anyplace you want... and then have them
> > register a search function. The standard encodings package
> > uses modules as codec basis but you could just as well provide
> > other means of looking up and even creating codecs on-the-fly.
> >
> > Don't know what the standard installation method is... this
> > hasn't been sorted out yet.
> >
> > My current thinking is to include all standard and small
> > codecs in the standard dist and include the bigger ones
> > in a separate Python add-on distribution (e.g. a tar file
> > that gets untarred on top of an existing installation).
> > A smart installer should ideally take care of this...
> Maybe one using Distutils?  I guess it would make the most sense if you
> run the install script with /usr/local/bin/python, for example, then the
> codecs would get installed in the proper place for that Python
> installation to use them...

Right. distutils could be a solution on Unix -- the problem
with distutils is that it needs a working Python installation
before it can run, so such an approach would only work in two
steps: first the Python core, then the extended codecs package.
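
Wherever the codec modules end up being installed, the registration
of a search function mentioned in the quoted text above could look
roughly like the following. This is only a sketch against a current
Python 3: the codec name "toyfullwidth" and its tiny mapping are
hypothetical and exist purely for the example.

```python
import codecs

# Hypothetical toy mapping: the bytes for ASCII 'A'..'Z' decode to the
# fullwidth letters U+FF21..U+FF3A, and encode back again.
DECODING_MAP = {b: 0xFF21 + (b - 0x41) for b in range(0x41, 0x5B)}
ENCODING_MAP = {u: b for b, u in DECODING_MAP.items()}

def toy_encode(text, errors="strict"):
    # Reuse the builtin generic mapping codec for the actual work.
    return codecs.charmap_encode(text, errors, ENCODING_MAP)

def toy_decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_MAP)

def search(name):
    # Called with the normalized encoding name; return None for
    # names this search function does not handle.
    if name == "toyfullwidth":
        return codecs.CodecInfo(toy_encode, toy_decode, name="toyfullwidth")
    return None

codecs.register(search)

decoded = b"ABC".decode("toyfullwidth")
encoded = "\uFF21\uFF22".encode("toyfullwidth")
```

A third-party package would typically call codecs.register() when it
is imported, so the location of the module on disk does not matter to
the lookup.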
 
> >
> > > 5. Are there any existing tools for converting from Unicode mapping
> > > files to a C source file that can be handily made into a dynamic
> > > library, or am I on my own there?
> >
> > No, there is a tool to convert them to a Python source file
> > though (Misc/gencodec.py). The created codecs will use the
> > builtin generic mapping codec as basis for their work.
> >
> > If mappings get huge (like the CJK ones), I would create a
> > new parser though, which then generates extension modules
> > to have the mapping available as static C data rather
> > than as Python dictionary on the heap... gencodec.py
> > should provide a good template for such a tool.
> You recommend in the unicode proposal that the mapping should probably
> be buildable as a shared library, to allow multiple interpreter
> instances to share the table - for platforms which don't support this
> option, then, would it make sense to make the codec such that the
> mapping tables can be statically linked into the interpreter? Or, in
> such a case, do you think it would be better to try to set things up so
> that the mapping tables can be read from a file?

Since memory mapped files are not supported by Python by
default, I would suggest letting the system linker take care of
sharing the constant C data from a shared (or statically linked)
extension module. Reading the information directly from a file
would probably be too slow.

Note that the module would only have to provide a simple object
compatible with the __getitem__ interface, which then fetches
the data from the static C data. The rest can then be done
in Python in the same way as the other mapping codecs do their
job.
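
To make the shape of such an object concrete, here is a pure-Python
stand-in. In a real codec the table would live as static C data in an
extension module; the class name StaticTable and the array-backed
storage are assumptions made for this sketch, which targets a current
Python 3.

```python
import codecs
from array import array

class StaticTable:
    """Stand-in for an extension type exposing a table via __getitem__."""

    def __init__(self, first_byte, code_points):
        self._first = first_byte
        # Compact storage of 16-bit ordinals, playing the role of the
        # static C array.
        self._data = array("H", code_points)

    def __getitem__(self, byte):
        index = byte - self._first
        if 0 <= index < len(self._data):
            return self._data[index]
        # A LookupError tells the mapping codec the byte is undefined.
        raise KeyError(byte)

# Hypothetical table: bytes 0x41..0x5A map to U+FF21..U+FF3A.
table = StaticTable(0x41, range(0xFF21, 0xFF3B))

# The generic mapping codec only needs the __getitem__ lookups.
text, consumed = codecs.charmap_decode(b"AB", "strict", table)
```

Only __getitem__ is implemented here, which is the point: the generic
mapping machinery does the iteration and error handling, and the
table object just answers individual lookups.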

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/