replacing Chinese chars with their spellings

Wed Apr 24 21:34:29 EDT 2002

Dan Jacobson <jidanni at deadspam.com> wrote in message news:<m2g01lfx7i.fsf at jidanni.org>...
> Before I start learning python, here's what I want to do: I have a
> table of Hakka Chinese words and their pronunciations.  I scan a file
> and replace any Hakka ["big5" 2, 4, 6 ... byte long strings] there
> with their pronunciations.  If it were just one character [two byte]
> words I would use the "c2t" program.  Is there a template that munches
> forth in a file and replaces the longest match in a database
> before moving on?

I doubt there is such a "template". What you would need to do is
something like this:

1. Set up Python dictionaries, one for each possible Hakka word length. So
if the longest Hakka word in your table is, say, 5 Chinese characters,
then you would need 5 dictionaries.
2. Read your whole file into a string.
3. Isolate runs of Chinese "vocabulary" characters (runs being
separated by punctuation etc).
4. For each run:
   Set pointer to start of run
   For each pointer position inside run:
      For n in (5,4,3,2,1):
         Look up the next n characters in the n-char dict;
         if found, substitute, move your input pointer on
         by n*2 bytes, and skip the remaining shorter lengths
      If all five lookups fail, it means you don't have
      a 1-char dictionary entry for a single Chinese character;
      put out whatever error message or substitution
      you consider appropriate, and move on to the next
      character in the run.
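
A minimal Python sketch of steps 1 to 4, under some assumptions that
are mine and not part of your problem statement: the file has already
been decoded from big5 into a Unicode string (for example with
open(path, encoding="big5")), so the pointer advances by n characters
rather than n*2 bytes; a "run" is any stretch of characters in the
basic CJK ideograph block; and spellings are joined with spaces. The
names TABLE, build_dicts, transliterate_run and replace_text are
hypothetical, and the table entries are placeholders, not real Hakka
data.

```python
import re

# Hypothetical table: the spellings are placeholders, not real Hakka data.
TABLE = {
    "大": "x1",
    "家": "x2",
    "大家": "y12",
}

# A run is one or more characters from the basic CJK Unified Ideographs
# block (an assumption about what counts as a "vocabulary" character).
RUN = re.compile("[\u4e00-\u9fff]+")


def build_dicts(table):
    """Step 1: split the table into one dict per word length."""
    by_len = {}
    for word, spelling in table.items():
        by_len.setdefault(len(word), {})[word] = spelling
    return by_len


def transliterate_run(run, by_len, unknown="?"):
    """Step 4: greedy longest-match replacement inside a single run."""
    lengths = sorted(by_len, reverse=True)  # try the longest words first
    out = []
    pos = 0
    while pos < len(run):
        for n in lengths:
            spelling = by_len[n].get(run[pos:pos + n])
            if spelling is not None:
                out.append(spelling)
                pos += n  # skip past the matched word
                break
        else:
            # No dict matched, not even the 1-char one.
            out.append(unknown)
            pos += 1
    return " ".join(out)


def replace_text(text, table):
    """Steps 2 and 3: find each run in the text and replace it."""
    by_len = build_dicts(table)
    return RUN.sub(lambda m: transliterate_run(m.group(0), by_len), text)
```

With the placeholder table, replace_text("大家 and 家大", TABLE) gives
"y12 and x2 x1": the 2-char entry wins where it applies, and the
1-char entries are used otherwise.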

Warning: this is a data processing approach, not a computational
linguistics approach. It is quite possible that it might produce
howlers of the sort that early machine translation efforts were
alleged to have produced:

"The spirit is willing but the flesh is weak" -> Russian equivalent of
"The vodka is fine but the meat has gone rotten".

Presumably the point of having a multi-character pronunciation table
is that it is possible that pronounce("xy") can be != pronounce("x") +
pronounce("y"). With careful thought, you may be able to remove
redundant entries from your more-than-one-char dicts, so that they
contain only the necessary exception cases -- but do try the basic
approach first.
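
One possible way to look for such redundant entries, as a sketch
only: flag every multi-character word whose spelling is exactly what
the shorter entries would already produce. The names spell and
redundant_candidates are hypothetical, the table is assumed to be a
single dict mapping words to spellings, and spellings are assumed to
be joined with spaces. It only reports candidates and deletes
nothing, because dropping an entry can change how neighbouring text
is split up by the longest-match scan.

```python
def spell(word, table, max_len, sep=" "):
    """Greedy longest-match spelling using entries of at most max_len chars."""
    out = []
    pos = 0
    while pos < len(word):
        for n in range(min(max_len, len(word) - pos), 0, -1):
            piece = word[pos:pos + n]
            if piece in table:
                out.append(table[piece])
                pos += n
                break
        else:
            return None  # some character has no usable entry
    return sep.join(out)


def redundant_candidates(table):
    """Multi-char words whose spelling equals what shorter entries give."""
    return sorted(
        word
        for word, spelling in table.items()
        if len(word) > 1 and spell(word, table, len(word) - 1) == spelling
    )
```

An entry such as "大家" with the placeholder spelling "x1 x2" would be
flagged if "大" and "家" already map to "x1" and "x2"; an entry with
any other spelling is a genuine exception and is left alone.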

As well as pronunciation dictionaries, might you not also need to
incorporate rules for such things as tone sandhi?