[spambayes-bugs] [ spambayes-Patches-917637 ] snagging some types of word salad

Tue Mar 16 16:55:18 EST 2004

Patches item #917637, was opened at 2004-03-16 15:55
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=917637&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Nobody/Anonymous (nobody)
Summary: snagging some types of word salad

Initial Comment:
Based on a comment on the procmail mailing list, I 
implemented the attached patch to try to detect some 
types of word salad - the kind that contains random gibberish 
(not random words).  Judging by my current training 
database, both tokens it generates are fairly spammy:

% spamcounts -d ~/tmp/tte.db -r 'long cons word'
db: /Users/skip/tmp/tte.db
token,nspam,nham,spam prob
long cons word,31,7,0.801780167082
subject:long cons word,10,0,0.978468899522

I don't have much problem with word salad but some folks 
seem to.  I think it's more of a training problem than a 
tokenizing problem, but I thought I'd save this patch for 
posterity (and delete it from my source) in case others want 
to investigate it.
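
The patch itself is only attached to the tracker item and is not 
reproduced in this message.  As a rough sketch of the kind of check 
that could emit a "long cons word" token, the following assumes 
that "gibberish" means a word containing a long run of consecutive 
consonants.  The threshold of 5, the treatment of "y" as a vowel, 
and the function name gibberish_tokens are all illustrative 
assumptions, not taken from the patch:

```python
import re

# Assumption: a "long cons word" is any word with 5 or more consecutive
# consonants.  Both the threshold and the consonant set are guesses.
LONG_CONS_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)


def gibberish_tokens(text, prefix=""):
    """Yield a single synthetic token if any word has a long consonant run.

    `prefix` lets a caller mark where the text came from, e.g. "subject:".
    """
    for word in text.split():
        if LONG_CONS_RUN.search(word):
            yield prefix + "long cons word"
            return  # one token per text is enough for this sketch


body_tokens = list(gibberish_tokens("cheap xkqzvtrw pills"))
subject_tokens = list(gibberish_tokens("qwrtpsdf", prefix="subject:"))
```

Ordinary English words such as "strengths" also contain five 
consonants in a row, so a check like this would be expected to fire 
on some ham as well.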

The other kind of word salad (random words) might best be 
detected by the classifier by keeping track of runs of 
"natural" tokens (those which don't contain 
whitespace or prefixes like "subject:") generated by the 
tokenizer which aren't in the training database.  Spam with 
such word salad will probably have fairly long runs of such 
words while in ham such runs will probably be broken up 
frequently by common words.  Just a thought.
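
A minimal sketch of that run-counting idea, which has not been 
tried against real mail.  It assumes the training database can be 
queried with `in` (a plain set stands in for it here) and that any 
token containing ":" carries a prefix.  The names is_natural and 
longest_unknown_run are hypothetical:

```python
def is_natural(token):
    """Assumption: natural tokens have no whitespace and no "prefix:" part."""
    return ":" not in token and not any(ch.isspace() for ch in token)


def longest_unknown_run(tokens, known):
    """Return the longest run of natural tokens that are absent from `known`.

    Non-natural (synthetic) tokens are skipped, so they neither extend
    nor break a run.  A natural token found in `known` resets the run.
    """
    longest = current = 0
    for tok in tokens:
        if not is_natural(tok):
            continue
        if tok in known:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Stand-in for the training database; a real one would be much larger.
known = {"the", "meeting", "is", "at", "noon"}
tokens = ["the", "zephyr", "quartz", "subject:long cons word",
          "mollusk", "meeting", "gazebo"]
run = longest_unknown_run(tokens, known)
```

A classifier could compare the result against some cutoff, or turn 
it into a synthetic token of its own.  Either choice would need 
testing to see whether it separates spam from ham.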

----------------------------------------------------------------------
