[I18n-sig] thinking of CJK codec, some questions

Brian Takashi Hooper brian@garage.co.jp
Mon, 13 Mar 2000 23:42:41 +0900


Hi again,

On Mon, 13 Mar 2000 14:58:24 +0100
"M.-A. Lemburg" <mal@lemburg.com> wrote:

[snip]

> I'm not sure I understand what you are intending here: the
> unicodectype.c file contains switch statements which were
> deduced from the UnicodeData.txt file available at the
> Unicode.org FTP site. It contains all mappings which were defined
> in that file -- unless my parser omitted some.
>  
> If you plan to add new mappings which are not part of the
> Unicode standard, I would suggest adding them to a separate
> module. E.g. you could extend the versions available through
> the unicodedata module. But beware: the Unicode methods
> only use the mappings defined in the unicodectype.c file.
My mistake - I thought for some reason that double-width Latin
characters, such as are used in Japanese, were part of the CJK ideogram
code space that starts at \u3400, so I was expecting them to map to
lower values in Unicode than they actually do (a double-width 'A', for
example, is \uFF21).
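
(Just to record the arithmetic: the fullwidth ASCII variants
U+FF01-U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII
counterparts U+0021-U+007E, so folding them back is a one-liner:)

# Fullwidth ASCII variants (U+FF01-U+FF5E) are at a fixed offset
# of 0xFEE0 from ASCII (U+0021-U+007E).
def halfwidth(ch):
    if u'\uFF01' <= ch <= u'\uFF5E':
        return unichr(ord(ch) - 0xFEE0)
    return ch

print halfwidth(u'\uFF21')   # -> u'A'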

> 
> > 4. Are there any conventions for how non-standard codecs should be
> > installed?  Should they be added to Python's encodings directory, or
> > should they just be added to site-packages or site-python like other
> > third-party modules?
> 
> You can drop them anyplace you want... and then have them
> register a search function. The standard encodings package
> uses modules as codec basis but you could just as well provide
> other means of looking up and even creating codecs on-the-fly.
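
Ah, I see - so a codec package could register itself at import time,
something like this (module and encoding names here are made up, just
a sketch of the search-function API):

import codecs
from mycjkcodecs import shift_jis   # hypothetical codec module

def search(encoding):
    if encoding == 'x-shift-jis':   # hypothetical encoding name
        return (shift_jis.encode, shift_jis.decode,
                shift_jis.StreamReader, shift_jis.StreamWriter)
    return None                     # let other search functions try

codecs.register(search)
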
> 
> Don't know what the standard installation method is... this
> hasn't been sorted out yet.
> 
> My current thinking is to include all standard and small
> codecs in the standard dist and include the bigger ones
> in a separate Python add-on distribution (e.g. a tar file
> that gets untarred on top of an existing installation).
> A smart installer should ideally take care of this...
Maybe one using Distutils?  I guess it would make the most sense if,
when you run the install script with /usr/local/bin/python, for
example, the codecs then get installed in the proper place for that
Python installation to use them...
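
A minimal setup.py for that might look like (the package name is
invented):

# setup.py - Distutils sketch; 'cjkcodecs' is a made-up name
from distutils.core import setup

setup(name='cjkcodecs',
      version='0.1',
      description='Mapping codecs for the CJK encodings',
      packages=['cjkcodecs'])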

> 
> > 5. Are there any existing tools for converting from Unicode mapping
> > files to a C source file that can be handily made into a dynamic
> > library, or am I on my own there?
> 
> No, there is a tool to convert them to a Python source file
> though (Misc/gencodec.py). The created codecs will use the
> builtin generic mapping codec as basis for their work.
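
(If I'm reading the generated modules right, they boil down to a
mapping dictionary handed to the builtin charmap codec - roughly this
shape, with a fabricated one-entry map:)

import codecs

decoding_map = {
    0x0041: 0xFF21,   # fabricated entry, not from a real mapping file
}
encoding_map = {}
for k, v in decoding_map.items():
    encoding_map[v] = k

class Codec(codecs.Codec):
    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)
    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)
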
>  
> If mappings get huge (like the CJK ones), I would create a
> new parser though, which then generates extension modules
> to have the mapping available as static C data rather
> than as Python dictionary on the heap... gencodec.py
> should provide a good template for such a tool.
You recommend in the unicode proposal that the mapping should probably
be buildable as a shared library, to allow multiple interpreter
instances to share the table - for platforms which don't support this
option, then, would it make sense to make the codec such that the
mapping tables can be statically linked into the interpreter? Or, in
such a case, do you think it would be better to try to set things up so
that the mapping tables can be read from a file?
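
(For the read-from-a-file variant I'm imagining something like this at
codec load time - the binary format is invented, just pairs of
big-endian 16-bit values:)

import struct

def load_mapping(path):
    # Invented format: each 4-byte record holds two big-endian
    # 16-bit values: (source code point, Unicode code point).
    mapping = {}
    data = open(path, 'rb').read()
    for i in range(0, len(data), 4):
        src, dst = struct.unpack('>HH', data[i:i+4])
        mapping[src] = dst
    return mapping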

--Brian