[Spambayes] dealing with non-english data.

Guido van Rossum guido@python.org
Sun, 22 Sep 2002 23:53:27 -0400


> Hm. So I've just about finished going through the new messages I've dumped
> into my corpus, and I'm trying to narrow down the fp's and fn's. There's a
> _lot_ of stuff in these mailboxes that is non-English and non-ASCII. At the
> moment, the tokenizer doesn't do a fabulous job on this stuff. I'm wondering
> about doing conversion into the given character set, or else tagging the 
> words with the character set (if it's non-English).
> 
> Unfortunately my knowledge of character set issues is up there with my
> knowledge of high-altitude yak milking, but I'd love to know if we've got
> anyone on this list who knows more about this - for instance, tokenizing
> koi-8r, or euc-kr...

Me neither.  But here's something any schmuck with a recent Python
version can try: use the regular expression \w+ compiled with the re.U
flag to find maximal strings of word characters according to the
Unicode character database.  This should return strings of characters u
for each of which u.isalnum() or u == '_' holds.  Then all we need to assume in
addition is that the Unicode standard defines letter-ness in a useful
way for Korean and Chinese...
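
A minimal sketch of that idea, assuming the message body has already been
decoded to a unicode object (the tokenize_unicode name and the sample text
are made up here for illustration, not part of the spambayes tokenizer):

import re

# With re.U, \w matches word characters according to the Unicode
# character database, so \w+ picks out maximal runs of them.
word_re = re.compile(r"\w+", re.U)

def tokenize_unicode(text):
    """Return the maximal runs of Unicode word characters in text."""
    return word_re.findall(text)

# Mixed ASCII and Korean (hangul) sample; every character here satisfies
# u.isalnum() or u == '_', so each run comes back as a single token.
sample = u"spam \uc548\ub155\ud558\uc138\uc694 ham_42"
print tokenize_unicode(sample)
# -> [u'spam', u'\uc548\ub155\ud558\uc138\uc694', u'ham_42']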

--Guido van Rossum (home page: http://www.python.org/~guido/)