replacing Chinese chars with their spellings
John Machin
sjmachin at lexicon.net
Thu Apr 25 06:12:21 EDT 2002
Boudewijn Rempt <boud at valdyas.org> wrote in message news:<3cc7967a$0$37911$e4fe514c at dreader3.news.xs4all.nl>...
> John Machin wrote:
> >
> > Presumably the point of having a multi-character pronunciation table
> > is that it is possible that pronounce("xy") can be != pronounce("x") +
> > pronounce("y"). With careful thought, you may be able to remove
> > redundant entries from your more-than-one-char dicts, so that they
> > contain only the necessary exception cases -- but do try the basic
> > approach first.
> >
>
> Isn't big-5 a variable length encoding? I thought that was his
> problem, not translating two or more character words.
Big5 is a one-or-two-byte encoding. A byte in the range 0-127 is
more-or-less ASCII; a byte in the range 128-255 (in practice a
narrower range) is the first byte of a two-byte Chinese character.
So it's variable-length only to that extent.
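To make that concrete, here is a rough sketch of walking a Big5 byte string one character at a time. The function name split_big5 is made up for illustration, and the rule "any byte >= 0x80 starts a two-byte character" is a simplifying assumption; real Big5 variants use a narrower lead-byte range. Note that the second byte of a pair may itself fall in the ASCII range, which is why you must scan from the start rather than test bytes in isolation.

```python
def split_big5(data: bytes) -> list[bytes]:
    """Split a Big5 byte string into 1-byte and 2-byte units.

    Assumption (simplified): any byte >= 0x80 is the lead byte of a
    two-byte character; everything else is a single ASCII-like byte.
    """
    units = []
    i = 0
    while i < len(data):
        # Take two bytes after a lead byte, unless the data is truncated.
        width = 2 if data[i] >= 0x80 and i + 1 < len(data) else 1
        units.append(data[i:i + width])
        i += width
    return units
```

Each two-byte unit can then be looked up in a single-character pronunciation table, while one-byte units pass through unchanged.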
Contrary to popular mythology, Chinese words can have more than one
syllable. As the OP said:
> ["big5" 2, 4, 6 ... byte long strings] there
> with their pronunciations. If it were just one character [two byte]
> words I would use the "c2t" program.
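One way to handle multi-character words is a greedy longest-match lookup: at each position, try the longest key in the table first and fall back to shorter ones. The following is only a sketch under assumptions: it works on decoded (Unicode) strings rather than raw Big5 bytes, the names transliterate and SAMPLE_TABLE are hypothetical, and the three-entry table is purely illustrative, not the OP's data.

```python
# Illustrative table only: the two-character word is pronounced
# differently from the sum of its single-character entries.
SAMPLE_TABLE = {
    "银": "yin",
    "行": "xing",
    "银行": "yinhang",
}


def transliterate(text, table, max_len=None):
    """Replace characters with spellings, preferring the longest match."""
    if max_len is None:
        max_len = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No entry of any length: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return " ".join(out)
```

With this table, transliterate("银行", SAMPLE_TABLE) uses the two-character entry, while transliterate("行", SAMPLE_TABLE) falls back to the single-character one. Greedy matching can pick the wrong segmentation for ambiguous text, so treat it as a starting point.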
More information about the Python-list
mailing list