[Spambayes] How do you classify text?

Wed Apr 23 11:15:37 EDT 2003

4/23/2003 4:37:16 AM, Miguel Sevillano <msevilla at gts.tsc.uvigo.es> wrote:

>   Hello,
>
>   I'm working on a project that must classify a paragraph as one of 
>N subjects. I would like to know exactly how you take a paragraph and 
>classify it; how do you train the filter?
>
>   I would like to apply Bayesian rules to determine which of N 
>different subjects a paragraph is talking about.

Spambayes classifies into at most three buckets: positive, negative, and 
unsure.  To apply this to n subjects, you'd run one filter per subject, in 
sequence.  For classifications c(1)...c(n), first apply the filter for c(1) 
and remove everything it classifies as positive from your input set.  Then 
filter what remains for c(2), again removing the positives, and so on through 
c(n).  Anything still left after the final c(n) filter was classified as 
negative or unsure by every filter.  Each filter needs its own bayesian 
classification database (PersistentClassifier in spambayes) and has to be 
trained separately via the learn() method: feed it known positives for its 
subject, plus known negatives so it has something to compare them against.  
To classify, call the spamprob method on a particular classifier, passing it 
text that has already been tokenized by our tokenizer.  You can see a clear 
example of this training and filtering in the imapfilter.
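
To make the cascade concrete, here is a rough, self-contained sketch of the 
idea.  It does not use the spambayes code at all: tokenize(), 
BinaryClassifier, score(), train_filters(), classify() and the hi/lo cutoffs 
are all made-up names for illustration, and the scoring is a plain toy naive 
Bayes rather than the combining scheme spambayes really uses.  It also 
assumes that the negatives for each subject are simply the training 
paragraphs of the other subjects.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BinaryClassifier:
    """Toy two-class naive Bayes scorer (NOT the spambayes algorithm)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def learn(self, tokens, is_positive):
        # Count each token once per training paragraph.
        self.counts[is_positive].update(set(tokens))
        self.docs[is_positive] += 1

    def score(self, tokens):
        """Return a value in [0, 1]; higher means more likely positive."""
        log_odds = 0.0
        for tok in set(tokens):
            pos = self.counts[True][tok]
            neg = self.counts[False][tok]
            if pos == 0 and neg == 0:
                continue  # never-seen tokens carry no evidence
            # Add-one smoothing keeps both probabilities away from 0 and 1.
            p = (pos + 1) / (self.docs[True] + 2)
            q = (neg + 1) / (self.docs[False] + 2)
            log_odds += math.log(p / q)
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_filters(training):
    """Build one separately trained classifier per subject."""
    filters = []
    for subject in training:
        clf = BinaryClassifier()
        for other, paragraphs in training.items():
            for text in paragraphs:
                clf.learn(tokenize(text), is_positive=(other == subject))
        filters.append((subject, clf))
    return filters


def classify(text, filters, hi=0.9, lo=0.1):
    """Run the filters in order; the first positive claims the paragraph."""
    tokens = tokenize(text)
    unsure = False
    for subject, clf in filters:
        s = clf.score(tokens)
        if s >= hi:
            return subject      # positive: removed from the input set
        if s > lo:
            unsure = True       # neither clearly positive nor negative
    return "unsure" if unsure else "none"


# Tiny made-up training set, for illustration only.
training = {
    "cooking": ["simmer the garlic and onion in olive oil",
                "bake the bread in a hot oven"],
    "astronomy": ["the telescope resolved a distant galaxy",
                  "the planet orbits a red dwarf star"],
    "cycling": ["she shifted gears on the steep climb",
                "the peloton sprinted to the finish line"],
}
filters = train_filters(training)

for paragraph in ["bake garlic bread in the oven",
                  "the telescope tracked a distant star",
                  "quarterly revenue exceeded forecasts"]:
    print(paragraph, "->", classify(paragraph, filters))
```

Note that the order of the filters matters in this scheme: a paragraph that 
two filters would both accept goes to whichever subject is tested first.  If 
that bothers you, you could instead score against every filter and take the 
highest score, at the cost of always running all n of them.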

If you don't currently know Python, you might want to get yourself a Python 
primer and read it, as there is a bit of advanced Python in this code.  By 
and large the code is quite readable, though, so check it out and have a 
peek.  Again, start at the imapfilter, and don't get hung up on the 
imap-ness...

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.