[spambayes-dev] effective tokenizer for wiki text

Matt Good matt at matt-good.net
Mon Oct 30 22:38:05 CET 2006


The Trac[1] project has resurrected work on a SpamBayes plugin for
filtering Wiki and ticket edits after finding the current Akismet system
to be unreliable.  Tony Meyer added some comments[2] to the Wiki
suggesting that we write a custom tokenizer instead of using the
built-in email-centric tokenizer.

Are there examples from others who have written custom tokenizers that
might be helpful? Or do you have any hints on what to take into account
when writing an effective tokenizer for Wiki text?
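
As a rough illustration of what such a tokenizer might look like, here
is a minimal Python sketch. Everything in it is a hypothetical choice
made for illustration and is not part of SpamBayes or Trac: the function
name tokenize_wiki, the "author:", "url:", "url-count:" and "long:"
token prefixes, the regular expressions, and the length thresholds. It
assumes only that the classifier can be trained on an iterable of string
tokens.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://([^/\s\]]+)(/[^\s\]]*)?", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize_wiki(text, author=None, max_word_len=20):
    """Yield string tokens for a wiki or ticket edit.

    Assumption: the classifier accepts any iterable of strings, so
    non-word evidence (author, link hosts, link count) is encoded as
    prefixed tokens.
    """
    if author:
        yield "author:" + author.lower()

    # Link hosts are often stronger evidence than the surrounding words.
    hosts = [m.group(1).lower() for m in URL_RE.finditer(text)]
    for host in hosts:
        yield "url:" + host
    # Cap the count so "many links" collapses into a single bucket.
    yield "url-count:%d" % min(len(hosts), 5)

    # Tokenize the remaining text with the URLs removed.
    stripped = URL_RE.sub(" ", text)
    for word in WORD_RE.findall(stripped):
        word = word.lower()
        if len(word) < 3:
            continue  # very short words carry little signal
        if len(word) > max_word_len:
            # Summarize long junk strings instead of keeping them whole.
            yield "long:%d" % (len(word) // 10 * 10)
        else:
            yield word


if __name__ == "__main__":
    sample = "Buy cheap pills at http://spam.example.com/buy now"
    print(list(tokenize_wiki(sample, author="Anon")))
```

Keeping link hosts and the link count as separate tokens is one possible
design; whether they actually help would have to be checked against real
Wiki spam.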

-- Matt Good

[1] http://trac.edgewall.org
[2] http://trac.edgewall.org/wiki/SpamFilter#Bayes


