[I18n-sig] thinking of CJK codec, some questions

M.-A. Lemburg mal@lemburg.com
Mon, 13 Mar 2000 14:58:24 +0100


Brian Takashi Hooper wrote:
> 
> Hi there i18n-siggers -
> 
> First of all, thank you very very much Marc-Andre (and Fredrik Lundh for
> the original implementation) for all your hard work, I checked out the
> CVS checkin yesterday and played with it a little, and took a print out
> of the source home with me.  It seems really well thought out and
> organized.
>
> I scrutinized the code base thinking about issues for a CJK codec, and
> came up with a few questions:
> 
> 1. Should the CJK ideograms also be included in the unicodehelpers
> numeric converters?  From my perspective, I'd really like to see them go
> in, and think that it would make sense, too - any opinions?
> 
> 2. Same as above with double-width alphanumeric characters - I assume
> these should probably also be included in the lowercase / uppercase
> helpers?  Or will there be a way to add to these lists through the codec
> API (for those worried about data from unused codecs clogging up their
> character type helpers, maybe this would be a good option to have; I
> would by contrast like to be able to exclude all the extra Latin 1 stuff
> that I don't need, hmm.)
> 
> 3. Same thing for whitespace - I think there are a number of
> double-width whitespace characters around also.

I'm not sure I understand what you are intending here: the
unicodectype.c file contains switch statements that were
deduced from the UnicodeData.txt file available at the
Unicode.org FTP site. It contains all mappings defined
in that file -- unless my parser omitted some.
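To make the above concrete, here is a minimal sketch of how such a parser might pull case mappings out of UnicodeData.txt. It is not the actual tool used to generate unicodectype.c; it only relies on the documented file format, where semicolon-separated fields put the uppercase mapping at index 12 and the lowercase mapping at index 13:

```python
# Sketch: extract simple case mappings from one UnicodeData.txt line.
# Field layout per the Unicode database format: 0 = code point (hex),
# 12 = simple uppercase mapping, 13 = simple lowercase mapping.
def parse_line(line):
    fields = line.strip().split(";")
    code = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else None
    lower = int(fields[13], 16) if fields[13] else None
    return code, upper, lower

# Example using the real database line for LATIN SMALL LETTER A:
line = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
print(parse_line(line))  # (97, 65, None)
```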
 
If you plan to add new mappings which are not part of the
Unicode standard, I would suggest adding them to a separate
module. E.g. you could extend the versions available through
the unicodedata module. But beware: the Unicode methods
only use the mappings defined in the unicodectype.c file.
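A separate module along those lines could simply layer extra mappings on top of the standard ones. The sketch below uses today's Python for illustration, and the one-entry table is invented, not part of any standard:

```python
# Hypothetical add-on module: extra case mappings consulted before
# falling back to the standard ones.  The table entry is invented
# for illustration only.
EXTRA_UPPER = {
    "\uFF41": "\uFF21",  # FULLWIDTH LATIN SMALL LETTER A -> CAPITAL
}

def upper(text):
    # Use the extra table where it applies, str.upper() otherwise.
    return "".join(EXTRA_UPPER.get(ch, ch.upper()) for ch in text)
```

As noted above, the string methods themselves would still see only the unicodectype.c mappings; callers would have to use this module explicitly.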

> 4. Are there any conventions for how non-standard codecs should be
> installed?  Should they be added to Python's encodings directory, or
> should they just be added to site-packages or site-python like other
> third-party modules?

You can drop them anywhere you want... and then have them
register a search function. The standard encodings package
uses modules as its codec basis, but you could just as well
provide other means of looking up and even creating codecs
on the fly.
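A search function along those lines can be sketched as follows. This uses the modern codecs.CodecInfo interface for illustration, and the codec name "myrot13" is invented; here it just reuses the stdlib rot_13 transform under a new name:

```python
import codecs

# Sketch of a codec search function.  "myrot13" is a made-up name;
# for illustration it simply borrows the stdlib rot_13 codec.
def _search(name):
    if name == "myrot13":
        info = codecs.lookup("rot_13")
        return codecs.CodecInfo(
            encode=info.encode,
            decode=info.decode,
            name="myrot13",
        )
    return None  # unknown name: let other search functions try

codecs.register(_search)
```

Once registered, the codec is reachable through the normal lookup machinery, e.g. codecs.encode("hello", "myrot13").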

Don't know what the standard installation method is... this
hasn't been sorted out yet.

My current thinking is to include all standard and small
codecs in the standard dist and include the bigger ones
in a separate Python add-on distribution (e.g. a tar file
that gets untarred on top of an existing installation).
A smart installer should ideally take care of this...

> 5. Are there any existing tools for converting from Unicode mapping
> files to a C source file that can be handily made into a dynamic
> library, or am I on my own there?

No, but there is a tool to convert them to a Python source
file (Misc/gencodec.py). The generated codecs use the
builtin generic mapping codec as the basis for their work.
 
If the mappings get huge (like the CJK ones), I would write a
new parser though, one which generates extension modules
so that the mapping is available as static C data rather
than as a Python dictionary on the heap... gencodec.py
should provide a good template for such a tool.
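Stripped down, a gencodec.py-style mapping codec boils down to a decoding table fed to the builtin charmap routines. The two-entry table below is invented for illustration; a generated codec would carry the full mapping:

```python
import codecs

# Sketch of a mapping codec built on the generic charmap machinery.
# The table is invented: it maps bytes 'a'/'b' to Greek ALPHA/BETA.
decoding_map = {0x61: 0x0391, 0x62: 0x0392}
encoding_map = {v: k for k, v in decoding_map.items()}

def decode(data, errors="strict"):
    # Returns (unicode string, number of bytes consumed).
    return codecs.charmap_decode(data, errors, decoding_map)

def encode(text, errors="strict"):
    # Returns (bytes, number of characters consumed).
    return codecs.charmap_encode(text, errors, encoding_map)
```

For a CJK-sized table, this dictionary is exactly the part one would want generated as static C data instead.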

> Anyone who has any opinions on the above please chime in, I'm trying to
> start a discussion :-) !
> 
> Also, while I was reading the code, I found a few typos and spelling
> mistakes (for example the notoriously often misspelled 'occurrence').

Ahem ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/