[I18n-sig] Codec Language

Brian Takashi Hooper brian@garage.co.jp
Thu, 23 Mar 2000 22:32:12 +0900


Hi Andy, 

On Thu, 23 Mar 2000 11:51:49 -0000
"Andy Robinson" <andy@reportlab.com> wrote:

> On the subject of a mini-language for dealing with Asian codecs...I'm
> fooling around with something in pure Python - a toy interpreter for a basic
> FSM - I'll try to post something up after the weekend.  In the meantime, we
> should certainly list the actions we need to be able to perform at a
> conceptual level:
> 
> 
> 1. Data structures/types for bytes, strings, numbers and mapping tables
> 2. Read n bytes into designated buffers from input
> 3. Write contents of designated buffers to output
> 4. Look up contents of a buffer in a mapping table, and do something with
> the output (how to deal with failed lookups?)
> 5. Do math, string concatenation, bit operations
> 6. Wide range of pattern-matching tests on short strings and bytes - byte in
> range, byte in set etc.  mxTextTools gives loads of examples.
I'd been thinking along these lines too.  Judging from the encodings
I've surveyed so far, which I think include most of the major ones for
which there are unicode.org mappings available, the above should
probably be sufficient to do the job.

It also seems that, with a scheme that allows a single codec to use
multiple maps, it should be possible to do any of the Asian codecs with
only a two-byte key and a four-byte table entry.  The four-byte entry
would hold the key that maps to it plus the value itself (which, as far
as I've gathered, can always be two bytes), so that misses can be
detected by comparing the stored key.  Two bytes is enough because, even
though many encodings have extensions that let them use more space
outside the BMP, those added spaces are always mapped as contiguous
planes, and are never (at least in any of the encodings that I know of)
larger than what can be mapped on a 2-byte grid.
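
A minimal Python sketch of such a table, to make the miss detection
concrete.  The slot layout (big-endian key followed by value), the
modulo hash with linear probing, and the names build_table and lookup
are all assumptions made for illustration, not part of any existing
codec; the sketch also assumes key 0x0000 is never mapped, so an
all-zero slot can mean "empty".

```python
import struct

# Assumption: key 0x0000 is never mapped, so an all-zero slot means "empty".
EMPTY = b"\x00\x00\x00\x00"


def build_table(pairs, size):
    """Pack (key, value) pairs into a flat table of 4-byte slots.

    Each slot holds the 2-byte key followed by the 2-byte value.
    """
    pairs = list(pairs)
    assert len(pairs) < size, "table needs at least one free slot"
    slots = [EMPTY] * size
    for key, value in pairs:
        i = key % size
        while slots[i] != EMPTY:      # linear probing on collision
            i = (i + 1) % size
        slots[i] = struct.pack(">HH", key, value)
    return b"".join(slots)


def lookup(table, key):
    """Return the 2-byte value for key, or None on a miss."""
    size = len(table) // 4
    i = key % size
    for _ in range(size):
        slot = table[4 * i:4 * i + 4]
        if slot == EMPTY:
            return None               # empty slot: key was never stored
        stored_key, value = struct.unpack(">HH", slot)
        if stored_key == key:         # stored key confirms a real hit
            return value
        i = (i + 1) % size
    return None


# Two Shift_JIS hiragana entries: 0x82A0 -> U+3042, 0x82A2 -> U+3044.
table = build_table([(0x82A0, 0x3042), (0x82A2, 0x3044)], size=8)
hit = lookup(table, 0x82A0)
miss = lookup(table, 0x82A1)
```

A codec that needs more than one 2-byte grid would simply carry
several such tables and pick one before the lookup.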

> 
> Please pitch in with any suggested operations you think we need.
> 
> The real issue seems to be, can we do it with an FSM that is not hideously
> complex to program?  Or do we need a non-finite language in which infinite
> loops etc. are possible?  The latter is easier to write things in, but may
> not be as safe or as fast.
Allowing for both algorithmic and mapping codecs within the same
implementation might confuse matters somewhat... what about separating
things into mapping codecs (which will handle all the Unicode stuff),
and a separate machine (or possibly extension to the mapping machine)
that can do algorithmic transformations?  This would whittle down the
immediate problem to developing the mapping machine, which as far as I
can tell should only have to support reading, writing, lookup, and
comparison, at least for doing Unicode conversions.  How does this
sound?
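
To give a feel for how small such a mapping machine could be, here is a
toy interpreter in Python that supports only reading, writing, lookup,
and comparison.  The instruction names ("read", "jump_if_ge", "literal",
"lookup", "write", "jump"), the run function, and the one-entry "sjis"
table are hypothetical, chosen only for this sketch.  Treating every
byte >= 0x80 as the first byte of a two-byte character is a deliberate
simplification.

```python
def run(program, tables, data):
    """Run a tiny mapping program over the bytes in data; return a str."""
    out = []      # decoded characters
    buf = b""     # current input buffer
    val = 0       # code point produced by "literal" or "lookup"
    pos = 0       # read position in data
    pc = 0        # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "read":
            n = args[0]
            if pos + n > len(data):
                break                     # input exhausted or truncated
            buf += data[pos:pos + n]
            pos += n
        elif op == "jump_if_ge":          # compare first buffered byte
            bound, target = args
            if buf[0] >= bound:
                pc = target
        elif op == "literal":             # pass the buffer through as-is
            val = int.from_bytes(buf, "big")
        elif op == "lookup":              # a failed lookup yields U+FFFD
            val = tables[args[0]].get(buf, 0xFFFD)
        elif op == "write":
            out.append(chr(val))
            buf = b""
        elif op == "jump":
            pc = args[0]
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return "".join(out)


PROGRAM = [
    ("read", 1),                # 0: read one byte
    ("jump_if_ge", 0x80, 5),    # 1: high bit set -> two-byte character
    ("literal",),               # 2: ASCII maps to itself
    ("write",),                 # 3
    ("jump", 0),                # 4
    ("read", 1),                # 5: read the second byte
    ("lookup", "sjis"),         # 6: map the two-byte key
    ("write",),                 # 7
    ("jump", 0),                # 8
]

TABLES = {"sjis": {b"\x82\xa0": 0x3042}}
decoded = run(PROGRAM, TABLES, b"A\x82\xa0B")
```

Every pass through the loop consumes at least one input byte, so the
program always terminates even though it contains backward jumps.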

Also, I think another item on our agenda should be to draw up a
preliminary list of encodings/character sets we're going to support from
the beginning - this will also help to narrow the scope of the problem
somewhat.  There may eventually be other encodings which we'll want to
support by adding some extra functionality to the machine; but in
general, I don't think that there's any harm in making something that's
really simple to do what we want to do now...  If this sounds like a
good idea then I'll draw up a preliminary list from the Unicode site,
and then we can take a look at implementations (iconv, Java, and the
KANJIMAP link Marc-Andre just posted, for example) to help figure out
the FSM instruction set.

What do you all think?

--Brian