[spambayes-dev] The naive bayes classifier algorithm in spambayes
doesn't take in frequency?
Austine Jane
janeaustine50 at hotmail.com
Tue Aug 31 14:54:59 CEST 2004
Hello.
I have a question about the naive Bayes classifier algorithm used in spambayes.
I suppose that if word1 appeared in three ham mails, the probability of word1
being in a ham mail would be greater than if it had appeared in only one ham
mail:
(using spambayes-1.0rc2)
>>>from spambayes import storage
>>>c=storage.DBDictClassifier('test.db')
>>>def tok(s): return s.split()
>>>c.learn(tok('word1'),is_spam=False)
>>>c.spamprob(tok('word1'))
0.15517241379310343
>>>c.learn(tok('word1'),False)
>>>c.spamprob(tok('word1'))
0.091836734693877542
>>>c.learn(tok('word1'),False)
>>>c.spamprob(tok('word1'))
0.065217391304347783
As you see the spam probability declines. So far so good.
>>>c.learn(tok('word1'),True)
Now word1 has also appeared in one spam mail, although it appeared in three
ham mails before.
>>>c.spamprob(tok('word1'))
0.5
Hm... That doesn't seem right.
>>>c.learn(tok('word1'),False)
>>>c.spamprob(tok('word1'))
0.5
It stays the same.
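If I read the classifier correctly, the numbers above follow from Gary Robinson's per-word probability f(w), which is driven by the *fraction of messages* in each class that contain the word, not by raw occurrence counts. A minimal sketch of my reading (assuming the default option values unknown_word_strength s = 0.45 and unknown_word_prob x = 0.5):

```python
def spamprob_single(hamcount, nham, spamcount, nspam, s=0.45, x=0.5):
    """Sketch of Gary Robinson's f(w) for a single word, as I read
    spambayes' classifier; s and x are assumed to be the default
    unknown_word_strength and unknown_word_prob options."""
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    if hamratio + spamratio == 0.0:
        return x                      # word never seen in training at all
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount          # counts messages, not occurrences
    return (s * x + n * prob) / (s + n)

# Replaying the session above:
print(spamprob_single(1, 1, 0, 0))        # one ham          -> ~0.1552
print(spamprob_single(3, 3, 0, 0))        # three ham        -> ~0.0652
print(spamprob_single(3, 3, 1, 1))        # 3 ham + 1 spam   -> 0.5
print(spamprob_single(1, 1, 1000, 1000))  # 1 ham, 1000 spam -> still 0.5
```

With three ham and one spam in the database, hamratio and spamratio are both 1.0, so f(w) comes out to exactly 0.5, and it stays 0.5 however lopsided the message counts get while both ratios remain 1.0.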
This doesn't seem intuitive. For example, suppose word1 occurred in 1000 spam
mails and in 1 ham mail. What is the probability that a mail containing word1
is spam? Half and half? Doesn't the classifier take the number of occurrences
into account (it does seem to count the number of distinct tokens)? It looks
as if the number of occurrences and the number of distinct tokens are
conflated in spambayes' classifier.
Machine Learning by Tom Mitchell (esp. page 183 and
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html; see also
the p.s.) gives a formula that produces quite different results from spambayes'
classifier. It always takes the number of occurrences into account, and is
hence more intuitive.
Am I missing something big? Thanks in advance,
Jane
-------------
p.s.
The formula in Tom Mitchell's book is:
Vocabulary is the set of all distinct words and other tokens occurring in any
text document from Examples.
For each target value v_j in V do
  * docs_j is the subset of documents from Examples for which the target
    value is v_j
  * P(v_j) = |docs_j| / |Examples|
  * Text_j is a single document created by concatenating all members of
    docs_j
  * n is the total number of distinct word positions in Text_j
  * for each word w_k in Vocabulary:
      * n_k is the number of times word w_k occurs in Text_j
      * P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)
As you can see, it uses an m-estimate.
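For comparison, here is a small sketch of the P(w_k|v_j) estimate above, i.e. a plain multinomial model with the +1 smoothing from the formula (the document lists and the two-word vocabulary are made up for illustration):

```python
from collections import Counter

def mitchell_word_probs(docs_j, vocabulary):
    """P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|), where n_k counts
    every occurrence of w_k in the concatenation of docs_j."""
    counts = Counter(w for doc in docs_j for w in doc)
    n = sum(counts.values())          # total word positions in Text_j
    return {w: (counts[w] + 1) / (n + len(vocabulary))
            for w in vocabulary}

vocab = {'word1', 'word2'}
spam_docs = [['word1']] * 1000        # word1 appears in 1000 spam mails
ham_docs = [['word1']]                # word1 appears in 1 ham mail
print(mitchell_word_probs(spam_docs, vocab)['word1'])  # 1001/1002 ~ 0.999
print(mitchell_word_probs(ham_docs, vocab)['word1'])   # 2/3       ~ 0.667
```

Under this estimate the two classes assign very different likelihoods to word1, so occurrence counts matter, in contrast to the flat 0.5 in the session above.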