[Spambayes] dealing with non-english data.
Guido van Rossum
guido@python.org
Sun, 22 Sep 2002 23:53:27 -0400
> Hm. So I've just about finished going through the new messages I've dumped
> into my corpus, and I'm trying to narrow down the fp's and fn's. There's a
> _lot_ of stuff in these mailboxes that is non-English and non-ASCII. At the
> moment, the tokenizer doesn't do a fabulous job on this stuff. I'm wondering
> about doing conversion into the given character set, or else tagging the
> words with the character set (if it's non-English).
>
> Unfortunately my knowledge of character set issues is up there with my
> knowledge of high-altitude yak milking, but I'd love to know if we've got
> anyone on this list who knows more about this - for instance, tokenizing
> KOI8-R or EUC-KR...
Me neither. But here's something any schmuck with a recent Python
version can try: use the regular expression \w+ compiled with the re.U
flag to find maximal strings of word characters according to the
Unicode locale. This should return strings of characters for each
of which u.isalnum() or u == '_' holds. Then all we need to assume in
addition is that the Unicode standard defines letter-ness in a useful
way for Korean and Chinese...
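A minimal sketch of that suggestion (the sample strings here are just
illustrative; in Python 3 the re.UNICODE flag is the default for str
patterns, but it is spelled out to match the idea):

```python
import re

# Tokenize by finding maximal runs of "word characters" as the
# Unicode standard defines them, i.e. characters for which
# c.isalnum() is true, plus the underscore.
word_re = re.compile(r'\w+', re.UNICODE)

def tokenize(text):
    """Return maximal runs of Unicode word characters in text."""
    return word_re.findall(text)

# Works the same on Latin, Cyrillic, and Hangul text:
tokens = tokenize('hello_world привет мир 안녕하세요')
print(tokens)
# → ['hello_world', 'привет', 'мир', '안녕하세요']
```

Every character of every token returned this way satisfies
c.isalnum() or c == '_', which is exactly the property relied on
above.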
--Guido van Rossum (home page: http://www.python.org/~guido/)