[Spambayes] Suggestion

Sat Nov 18 02:24:56 CET 2006

More years ago than I care to remember, I did a Master's thesis on
incorporating time-dependent query terms in search queries used for
searching "News" feeds. Part of the thesis involved implementing a test
system.

One of the many processing steps was removing (or ignoring) punctuation
before searching for search tokens. I draw your attention to the following
extract from a Spam Clues report:

'beneficiary'                       0.844828            0      1
'beneficiary.'                      0.844828            0      1

I would argue that there is no difference between these two tokens, and
that including the punctuation adds nothing to the process; in this
instance it is likely to give the token a lower score than may be
appropriate.
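
A minimal sketch of the kind of normalisation described above, assuming
tokens are plain strings and that only leading and trailing punctuation
should be dropped. The function name is hypothetical and this is not how
the SpamBayes tokenizer is actually written:

```python
import string


def strip_edge_punctuation(token):
    """Drop punctuation at the start and end of a token.

    Internal punctuation (e.g. the dots in '7.5.430') is left alone.
    """
    return token.strip(string.punctuation)


# With this step, both clues from the report map to the same token.
tokens = ["beneficiary", "beneficiary."]
normalized = [strip_edge_punctuation(t) for t in tokens]
```

Note that a token made up entirely of punctuation becomes an empty string
under this approach, so a caller would probably want to discard empty
results.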

I further draw your attention to the following extracts from the same Spam
Clues report:

'+31633775038'                      0.844828            0      1
'30%'                               0.844828            0      1
'65%to'                             0.844828            0      1
'7.5.430'                           0.867197            4      2
'17/11/2006'                        0.909938            1      2
'268.14.7/537'                      0.909938            1      2
'5:56'                              0.909938            1      2

While strings of numbers such as TCP/IP addresses may be useful in
differentiating spam from ham, numbers, digits and currency amounts are
generally not good choices for tokens. In particular, the date '17/11/2006'
and time '5:56' tokens above can normally be considered random and are
unlikely to be of any use in classifying spam/ham.
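
One way such a filter might look, as a rough sketch only. The patterns and
the function name are assumptions chosen to match the examples above
(dates, times and percentages); dotted strings that could be addresses or
version numbers are deliberately left alone:

```python
import re

# Hypothetical patterns for tokens that change from message to message.
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")   # e.g. 17/11/2006
TIME_RE = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")    # e.g. 5:56
PERCENT_RE = re.compile(r"^\d+(\.\d+)?%$")           # e.g. 30%


def is_volatile_numeric(token):
    """Return True for date, time and percentage tokens."""
    return any(p.match(token) for p in (DATE_RE, TIME_RE, PERCENT_RE))


tokens = ["17/11/2006", "5:56", "30%", "7.5.430", "beneficiary"]
kept = [t for t in tokens if not is_volatile_numeric(t)]
```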

I also used a stop list of words that are so common they are useless to
index or to use in search engines or other search indexes. Below are a
number of words which I believe are not appropriate tokens for
differentiating between spam and ham emails:

'under'                             0.814607            3      1
'its'                               0.862812            1      1
'us.'                               0.862812            1      1
'our'                               0.611666           16      2
'when'                              0.637817            7      1
'that'                              0.664752           19      3
'all'                               0.674394           12      2
'around'                            0.739628            4      1
'it,'                               0.848794            1      1
'up,'                               0.848794            1      1
'p.m.'                              0.813589            7      2
'does'                              0.814607            3      1
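
A stop list of this kind could be applied roughly as follows. This is a
sketch under stated assumptions: the word set is just the examples listed
above, the names are hypothetical, and edge punctuation is stripped before
the lookup so that 'it,' and 'us.' are caught as well:

```python
import string

# Illustrative stop list built from the clues above, not a complete one.
# "p.m" covers the clue 'p.m.' once its trailing dot has been stripped.
STOP_WORDS = frozenset([
    "under", "its", "us", "our", "when", "that",
    "all", "around", "it", "up", "p.m", "does",
])


def remove_stop_words(tokens):
    """Return the tokens whose bare, lower-cased form is not a stop word."""
    kept = []
    for token in tokens:
        bare = token.strip(string.punctuation).lower()
        if bare and bare not in STOP_WORDS:
            kept.append(token)
    return kept


sample = ["our", "beneficiary", "it,", "us.", "p.m.", "pharmacy"]
filtered = remove_stop_words(sample)
```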

Generally I find the current version of SpamBayes to be a very useful tool,
but I would like the ability to permanently set the value of a token. For
example, I'd like to be able to set the token 'pharmacy' to value 1.0 to
ensure that all emails containing it are classified as spam; likewise, I'd
like to set certain terms to value 0.0 so that emails containing them are
always classified as ham.
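
To illustrate the request, here is a sketch of how pinned token values
might behave. Nothing here is an existing SpamBayes option: the PINNED
table, both function names, the example ham token and the 0.5 default for
unknown tokens are all assumptions made for the example:

```python
# Hypothetical user-supplied overrides: 1.0 forces spam, 0.0 forces ham.
PINNED = {"pharmacy": 1.0, "spambayes": 0.0}


def token_probability(token, learned):
    """Use the pinned value if there is one, else the learned value."""
    if token in PINNED:
        return PINNED[token]
    return learned.get(token, 0.5)  # assumed default for unseen tokens


def forced_verdict(tokens):
    """Return 'spam' or 'ham' when pinned tokens decide the outcome.

    Returns None when no pinned token is present, or when spam-pinned
    and ham-pinned tokens both occur, so normal scoring can take over.
    """
    has_spam = any(PINNED.get(t) == 1.0 for t in tokens)
    has_ham = any(PINNED.get(t) == 0.0 for t in tokens)
    if has_spam and not has_ham:
        return "spam"
    if has_ham and not has_spam:
        return "ham"
    return None
```

Checking the pinned tokens before the usual scoring is one possible
design. Simply feeding 1.0 or 0.0 into the combined score would not on its
own guarantee the outcome asked for, since other tokens in the message
still contribute.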

Keep up the good work and I hope that my suggestions are worthwhile.

Regards
A.J. O'Neill
M. App. Sc.
M.B. Computing
Grad. Dip. K.B.S. 
