replacing Chinese chars with their spellings

John Machin sjmachin at lexicon.net
Wed Apr 24 21:34:29 EDT 2002


Dan Jacobson <jidanni at deadspam.com> wrote in message news:<m2g01lfx7i.fsf at jidanni.org>...
> Before I start learning python, here's what I want to do: I have a
> table of Hakka Chinese words and their pronunciations.  I scan a file
> and replace any Hakka ["big5" 2, 4, 6 ... byte long strings] there
> with their pronunciations.  If it were just one character [two byte]
> words I would use the "c2t" program.  Is there a template that munches
> forth in a file and replaces the longest match in a database
> before moving on?

I doubt there is such a "template". What you would need to do is
something like this:

1. Set up Python dictionaries, one for each possible Hakka word length.
So if the longest Hakka word in your table is, say, 5 Chinese characters,
then you would need 5 dictionaries.
2. Read your whole file into a string.
3. Isolate runs of Chinese "vocabulary" characters (runs being
separated by punctuation etc).
4. For each run:
   Set pointer to start of run
   While pointer is inside run:
      For n in (5,4,3,2,1):
         Look up the next n characters in the n-char dict;
         if found, substitute, move your input pointer on
         by n*2 bytes, and stop trying shorter lengths
      If all five lookups fail, it means you don't have
      a 1-char dictionary entry for that Chinese character;
      put out whatever error message or substitution
      you consider appropriate, and move on to the next
      character in the run.
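
The steps above might be sketched in Python roughly as follows. This is
only an illustration under several assumptions: the text has already been
decoded from big5 into a Python 3 str, so the pointer advances by
characters rather than by n*2 bytes; a "run" is approximated by the regex
range for CJK unified ideographs; the function names (build_dicts,
transliterate_run, transliterate) and the "?" marker for unknown
characters are made up for the example.

```python
import re

# Assumption: a "run" is a maximal stretch of CJK unified ideographs.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def build_dicts(table):
    """Split a {word: pronunciation} table into one dict per word length."""
    by_len = {}
    for word, pron in table.items():
        by_len.setdefault(len(word), {})[word] = pron
    return by_len


def transliterate_run(run, by_len, unknown="?"):
    """Greedy longest-match replacement over one run of characters."""
    lengths = sorted(by_len, reverse=True)  # e.g. [5, 4, 3, 2, 1]
    out = []
    pos = 0
    while pos < len(run):
        for n in lengths:
            chunk = run[pos:pos + n]
            if len(chunk) == n and chunk in by_len[n]:
                out.append(by_len[n][chunk])
                pos += n
                break  # longest match found; do not try shorter lengths
        else:
            # Every lookup failed: no entry even for this single character.
            out.append(unknown)
            pos += 1
    return " ".join(out)


def transliterate(text, by_len, unknown="?"):
    """Replace every run in text, leaving punctuation etc. untouched."""
    return CJK_RUN.sub(
        lambda m: transliterate_run(m.group(0), by_len, unknown), text
    )
```

With a placeholder table (not real Hakka data) such as
{"甲": "a", "乙": "b", "甲乙": "AB"}, calling
transliterate("甲乙甲, 丙乙!", build_dicts(table)) gives "AB a, ? b!".
A big5 file could be read with open(path, encoding="big5") before being
passed in.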

Warning: this is a data processing approach, not a computational
linguistics approach. It is quite possible that it might produce
howlers of the sort that early machine translation efforts were
alleged to do:

"The spirit is willing but the flesh is weak" -> Russian equivalent of
"The vodka is fine but the meat has gone rotten".

Presumably the point of having a multi-character pronunciation table
is that it is possible that pronounce("xy") can be != pronounce("x") +
pronounce("y"). With careful thought, you may be able to remove
redundant entries from your more-than-one-char dicts, so that they
contain only the necessary exception cases -- but do try the basic
approach first.
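
One possible way to find such redundant entries is sketched below. It is
an assumption-laden illustration, not a tested tool: it keeps everything
in a single dict rather than one per length, assumes pronunciations are
joined with a single space, and uses the hypothetical names greedy and
prune. A multi-char word is dropped when the entries already kept
reproduce its listed pronunciation.

```python
def greedy(word, table, sep=" "):
    """Longest-match lookup of word in table; None if a character is missing."""
    longest = max(map(len, table), default=0)
    out, pos = [], 0
    while pos < len(word):
        for n in range(min(longest, len(word) - pos), 0, -1):
            pron = table.get(word[pos:pos + n])
            if pron is not None:
                out.append(pron)
                pos += n
                break
        else:
            return None  # no entry covers the character at pos
    return sep.join(out)


def prune(table, sep=" "):
    """Keep 1-char entries plus only those longer entries that are exceptions."""
    kept = {w: p for w, p in table.items() if len(w) == 1}
    # Shorter words first, so longer words are checked against kept exceptions.
    for word in sorted((w for w in table if len(w) > 1), key=len):
        if greedy(word, kept, sep) != table[word]:
            kept[word] = table[word]
    return kept
```

Note that this checks each word in isolation only. Dropping an entry can
change how a longer run is segmented when words overlap (with "xy" gone,
"xyz" may be split as "x" + "yz"), so the pruned table should be compared
against the full one on real text.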

As well as pronunciation dictionaries, might you not also need to
incorporate rules for such things as tone sandhi?


