[Spambayes] Tony Meyer - Training question

Erik Brown kirebrow at yahoo.com
Sun Sep 18 10:17:17 CEST 2005


Bill,

I would say that the training advice has not changed.  Since you last asked
Tony about training, no major modifications have been made (or needed).  There
have been a lot of tweaks to catch more spam, and they are enabled by modifying
the default_bayes_customize.ini file.

IMHO, I would tell your users to replace the default_bayes_customize.ini
with my settings below *smile*, as I found that the more tokens, the better
for my mail stream.

All you have to say is, "train only on unsures with no initial training".
This way they will only train on current messages and with a buffed .ini
file, that is all they would ever need.  With the below settings I am able
to distinguish PayPal phishing scams in 3 trains...

-----------------------------------------

[Classifier]

x-use_bigrams: True
max_discriminators: 150

[Tokenizer]

replace_nonascii_chars: True
record_header_absence: True
x-fancy_url_recognition: True
x-pick_apart_urls: True
x-reduce_habeas_headers: True
x-search_for_habeas_headers: True
basic_header_tokenize: True
basic_header_skip: date x-.* domainkey-signature
check_octets: True
octet_prefix_size: 5
mine_received_headers: True
address_headers: from sender reply-to errors-to
generate_long_skips: True
summarize_email_prefixes: True
summarize_email_suffixes: True
skip_max_word_size: 50


[URLRetriever]

x-cache_directory: url-cache
x-cache_expiry_days: 31
x-only_slurp_base: True
x-slurp_urls: True
x-web_prefix:web:

-----------------------------------------
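As a quick sanity check before handing a settings file like the one above to
users, the values can be read back and inspected.  The following is only a
minimal sketch: it uses Python's standard configparser module, not SpamBayes'
own option loader, and it embeds a fragment of the settings as a string
instead of assuming where default_bayes_customize.ini lives on disk.  The
names SETTINGS and load_settings are made up for this illustration.

```python
import configparser

# A fragment of the settings quoted above, embedded as a string so the
# sketch is self-contained (assumption: a real file would be read from disk).
SETTINGS = """
[Classifier]
x-use_bigrams: True
max_discriminators: 150

[Tokenizer]
basic_header_skip: date x-.* domainkey-signature
octet_prefix_size: 5

[URLRetriever]
x-cache_expiry_days: 31
x-web_prefix:web:
"""


def load_settings(text):
    """Parse INI text with the stdlib parser (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in values literal.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    return cfg


cfg = load_settings(SETTINGS)
use_bigrams = cfg.getboolean("Classifier", "x-use_bigrams")
max_disc = cfg.getint("Classifier", "max_discriminators")
skip_patterns = cfg.get("Tokenizer", "basic_header_skip").split()
web_prefix = cfg.get("URLRetriever", "x-web_prefix")

print(use_bigrams, max_disc, skip_patterns, web_prefix)
```

The parser splits each line at the first ":" so a value that itself ends in a
colon, such as the web prefix, survives intact.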

Erik Brown

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of Hely Holdings Pty Ltd (Sales Dept.)
Sent: Sunday, September 18, 2005 1:32 AM
To: spambayes at python.org
Subject: [Spambayes] Tony Meyer - Training question


Hi Tony.

Back in August 2004 you kindly critiqued a spam chapter for me
from my security book "The Hacker's Nightmare".

I am gearing up for a new edition of THN and will be expanding
the spam section a fair bit in the process. I deal only with the
Outlook plug-in.

At this time I would like to know if you have changed your
opinion on training since then. Here's what you said in a message
to me on August 10, 2004 after reading my draft chapter.

---------- BEGIN QUOTE ----------

Training is a difficult issue to write about.  The problem is
that not enough is yet known about the best ways to train, and
that the Outlook plug-in really only facilitates a couple of
different methods.  However, it is almost certain that 'train on
everything' is a bad idea, that smaller databases are generally
better than large ones, and that imbalances are bad.

These are not hard rules.  Your training described has a huge
imbalance, and is a pretty large database, and is (at least
initially) train-on-everything, and yet I presume you have had
good results or you wouldn't be writing this. In general, though,
based on both testing and feedback from users, the above is true.

I believe that the best training method to recommend to people
using the plug-in is:

 * Don't do *any* initial training. (Everything will now end up
in the 'unsure' folder.)
 * Train on *everything* that ends up in the 'unsure' folder.  At
first, this will be a lot of mail, but it will rapidly reduce.
 * Train on *all* mistakes (at first, there may be some false
positives/false negatives, but these will even more rapidly
reduce).

Once 10-20 mails of each type have been trained, the system
should be very accurate.

---------- END QUOTE ----------
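
The regime quoted above (no initial training, then train on every unsure and
every mistake) can be sketched as a simple loop.  This is an illustration
only: ToyClassifier is a made-up, tiny naive-Bayes word scorer, not the
SpamBayes classifier, and the two cutoff values are assumptions chosen for
the example.

```python
import math
from collections import Counter

# Illustrative thresholds (assumption): scores between them count as unsure.
HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90


class ToyClassifier:
    """Tiny naive-Bayes word scorer; a stand-in, not the SpamBayes classifier."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per trained message.
        self.counts[label].update(set(text.lower().split()))
        self.trained[label] += 1

    def score(self, text):
        # Sum per-word log odds with add-one smoothing.  With no training
        # at all, every word contributes 0, so the score is exactly 0.5.
        log_odds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.trained["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.trained["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def classify(score):
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def process(clf, stream):
    """Train only on unsures and mistakes; return how many were trained."""
    trained = 0
    for text, truth in stream:
        if classify(clf.score(text)) != truth:  # unsure or misclassified
            clf.train(text, truth)
            trained += 1
    return trained


if __name__ == "__main__":
    clf = ToyClassifier()
    mail = [("cheap pills buy now", "spam"),
            ("meeting agenda for monday", "ham")] * 10
    print("trained on first pass:", process(clf, mail))
    print("trained on second pass:", process(clf, mail))
```

An untrained classifier scores everything at 0.5, so all mail starts out as
unsure, and the amount of training needed drops as the loop proceeds.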

For my target audience I need to make all explanations and
instructions as simple as possible. If I started describing
techniques like Seth Goodman's "Recursive Training Set Selection
For Outlook" I'd have them throwing up out of fear and confusion.

I basically distilled your advice down to "do no pre-training at
all - train only on the UNSURE folder".

While that seems to work fine and has been well received, it was
after all a year and several releases ago.

Where do you stand on training these days, for people who simply
will not or cannot follow a complicated set of instructions?

Best regards,
 - Bill H.

--
We take security very seriously. All outgoing mail is
certified Virus Free. To boost YOUR security visit
The Hacker's Nightmare: http://HackersNightmare.com.


