[spambayes-bugs] [ spambayes-Bugs-800392 ] Filtered "known-spam" emails don't get added to database

SourceForge.net noreply at sourceforge.net
Sun May 9 14:39:46 EDT 2004


Bugs item #800392, was opened at 2003-09-04 11:36
Message generated for change (Settings changed) made by grab_rat
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=800392&group_id=61702

Category: Outlook
Group: None
Status: Deleted
>Resolution: Rejected
Priority: 5
Submitted By: Graham Bartlett (grab_rat)
Assigned to: Nobody/Anonymous (nobody)
Summary: Filtered "known-spam" emails don't get added to database

Initial Comment:
I don't know if this is a bug or a "feature" - it might 
belong in RFEs.  Anyway...

When an email gets recognised by the filter as spam, it 
gets moved to the "known-spam" folder.  However the 
filter does not seem to train on this email as spam.  I 
don't know why the filter doesn't train on emails it 
moves itself, when it *does* train on email that I move 
manually. This has two main effects.

Firstly, the filter will not "reinforce" itself against words 
which are almost certainly spam.  For instance, the 
word "girls" has counts of only 0 ham, 2 spam, when in fact 
the word would be very unlikely to come up in my emails 
but makes a regular appearance in my spam.  This 
means that some words get scored abnormally low.  I 
don't use an "undecided" folder so I don't know how well 
the filter detects "known-ham" emails, but I would guess 
it would have a similar problem scoring ham emails.
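
To make the scoring concern concrete, here is a minimal sketch of how a token's spam probability might depend on its counts. It assumes a Robinson-style smoothed estimate. The function name and the constants S and X are illustrative assumptions, not values read from any particular install.

```python
# Hypothetical sketch: smoothed per-token spam probability.
# S (weight given to the prior) and X (prior for an unseen token)
# are assumed values chosen for illustration.
S = 0.45
X = 0.5


def token_spam_prob(ham_count, spam_count, n_ham, n_spam):
    """Return a smoothed estimate that a token indicates spam."""
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    spam_ratio = spam_count / n_spam if n_spam else 0.0
    total = ham_ratio + spam_ratio
    if total == 0:
        return X  # never seen: fall back to the prior
    raw = spam_ratio / total
    n = ham_count + spam_count
    # With few observations the estimate stays pulled towards X.
    return (S * X + n * raw) / (S + n)


few = token_spam_prob(0, 2, n_ham=100, n_spam=100)
many = token_spam_prob(0, 20, n_ham=100, n_spam=100)
print(round(few, 3), round(many, 3))
```

Under these assumptions, a word seen in only two trained spams scores noticeably closer to neutral than the same word seen in twenty, even though it never appears in ham.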

Secondly, the filter will not detect new words appearing 
in spams.  If an email is detected as spam, all words 
appearing in it should be trained on, otherwise when 
spams come in featuring the new words alone, they will 
not be recognised as such.  A classic example here 
would be the emails selling diet supplements - if I train 
my filter to see "glucosamine" and "vitamin" as spam, and 
then I receive a spam featuring those two words and 
also "echinacea", the filter should learn that "echinacea" 
is also likely to be connected to spams.  When I next 
get an email selling only echinacea, it'll then be correctly 
detected as spam.
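
The echinacea example can be sketched with a toy token store. This is only an illustration of the behaviour being asked for. The names spam_counts and train_spam are hypothetical and do not come from the plug-in.

```python
from collections import Counter

# Hypothetical token store: how many trained spams each word appeared in.
spam_counts = Counter()


def train_spam(tokens):
    """Count each distinct token once per trained spam message."""
    spam_counts.update(set(tokens))


# Manually trained spams teach the filter the first two words.
train_spam(["glucosamine", "vitamin"])
train_spam(["glucosamine", "vitamin"])

# A new spam is caught on the strength of the known words.
caught = ["glucosamine", "vitamin", "echinacea"]
before = spam_counts["echinacea"]

# The requested behaviour: also train on the message the filter caught.
train_spam(caught)
after = spam_counts["echinacea"]
print(before, after)
```

Without the last training step, "echinacea" would stay at a count of zero and a later message containing only that word would carry no spam evidence.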


----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-02-17 03:14

Message:

The main reason is that doing this (along with training on all 
non-spam, which presumably is also wanted) would amount to 
using the "train on everything" regime.  Testing has shown 
that this is not a particularly effective technique.

Training only on mistakes (false positives, false negatives, 
unsures) appears to be a much more effective technique 
(despite what common sense might indicate).  As such, this is 
the technique that the interface makes simplest.
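
As a rough sketch of the "train on mistakes" rule described above: a message is trained only when the filter's verdict disagrees with the user's judgement, and an unsure verdict always counts as a disagreement. The cutoff values and function names are assumptions for illustration.

```python
# Assumed score thresholds, for illustration only.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a spam score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(score, actual):
    """Train only when the filter's verdict differs from the truth.

    `actual` is the user's judgement: 'ham' or 'spam'. An 'unsure'
    verdict never matches either, so unsures are always trained.
    """
    return classify(score) != actual


messages = [(0.99, "spam"), (0.50, "spam"), (0.95, "ham"), (0.01, "ham")]
to_train = [m for m in messages if should_train(*m)]
print(to_train)
```

Here only the unsure spam and the false positive are selected. The two messages the filter already got right are left out, which is the difference from "train on everything".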

----------------------------------------------------------------------

Comment By: Graham Bartlett (grab_rat)
Date: 2003-09-04 11:40

Message:

One comment to follow up.  I know it's possible to collect a 
bunch of spam in the "known-spam" folder and then train on it 
all.  However (a) this is inconvenient; (b) if it's known that 
this needs doing then you may as well do it as soon as the 
spam is detected; and (c) this will count spams twice if 
they've made it through the filter and been moved manually 
by me, instead of being moved automatically by the filter.
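
Point (c), counting a spam twice, could in principle be avoided by remembering which messages have already been trained. The sketch below assumes each message has a stable identifier. The names trained_ids and train_spam_once are hypothetical, and this is not a description of how the plug-in stores its training state.

```python
from collections import Counter

spam_counts = Counter()
trained_ids = set()  # hypothetical record of messages already trained


def train_spam_once(msg_id, tokens):
    """Train on a message unless its id has been trained before."""
    if msg_id in trained_ids:
        return False
    trained_ids.add(msg_id)
    spam_counts.update(set(tokens))
    return True


# Trained when moved manually, then seen again in a bulk folder train.
first = train_spam_once("msg-1", ["vitamin", "offer"])
second = train_spam_once("msg-1", ["vitamin", "offer"])
print(first, second, spam_counts["vitamin"])
```

The second call is a no-op, so a bulk train over the "known-spam" folder would not inflate the counts for messages that were already trained by hand.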

----------------------------------------------------------------------
