[Spambayes] Tony Meyer - Training question
Erik Brown
kirebrow at yahoo.com
Sun Sep 18 10:23:40 CEST 2005
I forgot to mention they must train on all false negatives and false
positives as well.
Erik Brown
-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of Hely Holdings Pty Ltd (Sales Dept.)
Sent: Sunday, September 18, 2005 1:32 AM
To: spambayes at python.org
Subject: [Spambayes] Tony Meyer - Training question
Hi Tony.
Back in August 2004 you kindly critiqued a spam chapter for me
from my security book "The Hacker's Nightmare".
I am gearing up for a new edition of THN and will be expanding
the spam section a fair bit in the process. I deal only with the
Outlook plug-in.
At this time I would like to know if you have changed your
opinion on training since then. Here's what you said in a message
to me on August 10, 2004 after reading my draft chapter.
---------- BEGIN QUOTE ----------
Training is a difficult issue to write about. The problem is
that not enough is yet known about the best ways to train, and
that the Outlook plug-in really only facilitates a couple of
different methods. However, it is almost certain that 'train on
everything' is a bad idea, that smaller databases are generally
better than large ones, and that imbalances are bad.
These are not hard rules. Your training described has a huge
imbalance, and is a pretty large database, and is (at least
initially) train-on-everything, and yet I presume you have had
good results or you wouldn't be writing this. In general, though,
based on both testing and feedback from users, the above is true.
I believe that the best training method to recommend to people
using the plug-in is:
* Don't do *any* initial training. (Everything will now end up
in the 'unsure' folder.)
* Train on *everything* that ends up in the 'unsure' folder. At
first, this will be a lot of mail, but it will rapidly reduce.
* Train on *all* mistakes (at first, there may be some false
positives/false negatives, but these will even more rapidly
reduce).
Once 10-20 mails of each type have been trained, the system
should be very accurate.
---------- END QUOTE ----------
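The regime quoted above (no initial training, train on everything unsure, train on all mistakes) can be sketched as a small decision rule. This is only an illustration, not code from the plug-in: the function name, the cutoff values of 0.20 and 0.90, and the way the boundaries are handled are all assumptions made for the example.

```python
def should_train(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Decide whether to train on a message under the quoted regime.

    score       -- spam probability from the classifier, 0.0 to 1.0
    is_spam     -- what the user says the message really is
    ham_cutoff, spam_cutoff -- assumed thresholds; adjust to your own setup
    """
    # Anything between the two cutoffs counts as 'unsure': always train on it.
    if ham_cutoff <= score < spam_cutoff:
        return True
    # Outside the unsure band the classifier made a firm call.
    classified_spam = score >= spam_cutoff
    # Train only when that call was wrong (false positive or false negative).
    return classified_spam != is_spam


# An unsure message is trained; a correctly filed spam is not.
print(should_train(0.55, is_spam=True))
print(should_train(0.97, is_spam=True))
```

With an empty database every message scores in the unsure band, so the first rule alone drives the early training, which matches the "everything will now end up in the 'unsure' folder" step.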
For my target audience I need to make all explanations and
instructions as simple as possible. If I started describing
techniques like Seth Goodman's "Recursive Training Set Selection
For Outlook" I'd have them throwing up out of fear and confusion.
I basically distilled your advice down to "do no pre-training at
all - train only on the UNSURE folder".
While that seems to work fine and has been well received, it was
after all a year and several releases ago.
Where do you stand on training these days, for people who simply
will not or cannot follow a complicated set of instructions?
Best regards,
- Bill H.
--
We take security very seriously. All outgoing mail is
certified Virus Free. To boost YOUR security visit
The Hacker's Nightmare: http://HackersNightmare.com.
_______________________________________________
Spambayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html