[Spambayes] using SpamBayes for Wiki filtering

Matthew Good trac at matt-good.net
Sun Nov 27 02:25:18 CET 2005


On Sun, 2005-11-27 at 12:16 +1300, Tony Meyer wrote:
> The tokenizer is designed to tokenize email, but you could certainly
> write your own tokenizer (or subclass the existing one) designed to
> tokenize wiki pages.  Once you've got tokens, you can use the
> existing classifier and storage classes (classifier.py and
> storage.py).
> 
> However, you might find that the email tokenizer does reasonably well
> on wiki pages; email and web text are not particularly different.  It
> would be worth trying that first.

Yeah, a lot of the wiki-formatting constructs should be pretty close to
what people use in plain-text emails.  I may also see how the tokenizer
behaves if I convert the page to HTML first instead of feeding it the
raw Wiki text.
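
If the email tokenizer turns out to be a poor fit, a wiki-specific one
could be quite small.  Here is a minimal sketch of the idea, not the
SpamBayes tokenizer: the function name tokenize_wiki, the "url:" token
prefix and the three-character minimum word length are all assumptions
made up for illustration.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://[^\s\]\)>'\"]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9]{3,}")


def tokenize_wiki(text):
    """Yield lowercase tokens from raw wiki text.

    URLs are emitted whole, with a "url:" prefix, because link targets
    tend to be a strong signal in wiki spam. Everything else is split
    into alphanumeric words of at least three characters, which also
    drops markup characters such as ''', [[ and ==.
    """
    for url in URL_RE.findall(text):
        yield "url:" + url.lower()
    # Blank out the URLs so their pieces are not counted again as words.
    remainder = URL_RE.sub(" ", text)
    for word in WORD_RE.findall(remainder):
        yield word.lower()
```

The resulting token stream is the kind of thing that would then be
handed to the classifier for training or scoring.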

I started writing a subclass of the SQLClassifier in order to store the
statistics in the Trac db, which is pretty straightforward.
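
To give a rough idea of what storing the statistics in a database
involves, here is a self-contained sketch using sqlite3.  It does not
use SQLClassifier or the Trac db API; the table name token_counts, its
columns, and the helper functions open_store, learn and counts are
hypothetical and only illustrate keeping per-token spam/ham counts in
SQL.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and if necessary create) the hypothetical token table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token_counts ("
        "word TEXT PRIMARY KEY, "
        "nspam INTEGER NOT NULL, "
        "nham INTEGER NOT NULL)"
    )
    return conn


def learn(conn, tokens, is_spam):
    """Record one trained page: bump the spam or ham count per token."""
    # The column name comes from this fixed choice, never from user input.
    column = "nspam" if is_spam else "nham"
    with conn:  # commit on success, roll back on error
        # Each distinct token counts once per page.
        for word in set(tokens):
            conn.execute(
                "INSERT OR IGNORE INTO token_counts VALUES (?, 0, 0)",
                (word,),
            )
            conn.execute(
                "UPDATE token_counts SET %s = %s + 1 WHERE word = ?"
                % (column, column),
                (word,),
            )


def counts(conn, word):
    """Return (nspam, nham) for a token, or (0, 0) if it is unknown."""
    row = conn.execute(
        "SELECT nspam, nham FROM token_counts WHERE word = ?", (word,)
    ).fetchone()
    return row if row else (0, 0)
```

A real subclass would also need to track the total number of spam and
ham pages trained, since the classifier needs those totals to turn the
counts into probabilities.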

> > Are there any other projects using SpamBayes like this that I can
> > use as an example?
> 
> There's a plug-in for a web proxy to use SpamBayes for web filtering
> in the contrib/ directory.  If you google through the archives of
> this list (or maybe spambayes-dev?) there's an example of Skip using
> SpamBayes for music classification, IIRC.  I've used the classifier &
> storage for classification of lines of dialogue in a scripted
> performance.

Thanks, I'll check that out.  It looks like using SpamBayes should work
out OK.  The harder part will probably be adding appropriate hooks to
the Trac UI for training the classifier (and presumably making this
extensible, since we'll want to allow various spam-prevention plugins).

-- 
Matthew Good <trac at matt-good.net>
