[I18n-sig] thinking of CJK codec, some questions

Brian Takashi Hooper brian@garage.co.jp
Thu, 16 Mar 2000 15:49:23 +0900


On Wed, 15 Mar 2000 15:36:34 +0100
"M.-A. Lemburg" <mal@lemburg.com> wrote:

> Just a few comments about the design (don't have any knowledge
> about Asian encodings):
> 
> 1. Keep large mapping tables in single automatically generated C
>    modules that export a lookup object (ones that define __getitem__).
>    These could also be generated using some perfect hash table
>    generator, BTW, to reduce memory consumption.
After researching perfect hash tables a little and thinking about it
some more, a question: I think this could work well for the decoding
maps, but for encoding (from Unicode to a legacy encoding), wouldn't I
have to be able to detect misses in my hash lookup?  For example, if I
had a Unicode string that I was trying to convert to EUC-JP, and I
looked up a Unicode character that has no mapping to EUC-JP, a regular
hash lookup would still succeed and I'd get back an EUC character
anyway, just the wrong one...

The only way I could think of to avoid this would be to store the key
as part of the value (or alternately some kind of unique checksum),
and then after lookup compare the original key to the key stored in
the table: if they are the same, I've got a valid mapping, and if they
are different, then my lookup failed and I should return some kind of
sentinel value (0xFFFF or something?).

Since the Unicode keys are all two bytes apiece, and for some of the
largest CJK encoding standards the values are at most 4 bytes long
(e.g. EUC-TW), I would then need to define my mapping table with
entries 8 bytes each in length, right?  (assuming that I should keep
the entries of the array aligned along machine words)  Is this
complication worth the space savings, I wonder?  I think a table built
this way might still be a little smaller than an unhashed plain old
table, since in a three- or four-byte encoding there are generally far
fewer mapped values than there are slots in the available plane...
Maybe I should implement the mappings as simple arrays first and then
figure out how to make them smaller if it seems to matter a lot to
people?  This is pretty much a speed vs. space issue.
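
Concretely, here's a minimal C sketch of the key-verification idea I
have in mind, assuming some offline-generated perfect hash function;
enc_hash, enc_table, and TABLE_SIZE are all made-up names, and the
entry layout just follows the 2-byte-key / 4-byte-value / 8-byte
alignment arithmetic above:

#include <stdint.h>

#define TABLE_SIZE 8192          /* hypothetical table size */
#define SENTINEL   0xFFFFFFFFu   /* "no mapping" marker */

/* One table entry: the Unicode key is stored alongside the value so
   that a miss can be detected after the lookup.
   2-byte key + 2 bytes padding + 4-byte value = 8 bytes per entry. */
typedef struct {
    uint16_t key;       /* Unicode code point (BMP, 2 bytes) */
    uint16_t pad;       /* keeps each entry 8-byte aligned */
    uint32_t value;     /* encoded bytes, up to 4 (e.g. EUC-TW) */
} enc_entry;

/* hypothetical perfect hash function, generated offline */
extern uint32_t enc_hash(uint16_t ucs2);

/* hypothetical generated mapping table */
extern const enc_entry enc_table[TABLE_SIZE];

/* Returns the encoded value, or SENTINEL if the code point has no
   mapping.  The perfect hash is collision-free for the mapped keys,
   but an unmapped key still hashes to *some* slot, so we compare the
   stored key against the one we looked up. */
static uint32_t
enc_lookup(uint16_t ucs2)
{
    const enc_entry *e = &enc_table[enc_hash(ucs2) % TABLE_SIZE];
    return (e->key == ucs2) ? e->value : SENTINEL;
}

On a hit the caller would unpack up to four encoded bytes from the
returned word; on SENTINEL it would fall through to the codec's error
handling.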

Opinions?  Does anyone have a cleverer way to detect the validity of a
hash lookup?

--Brian