[Spambayes] Mass Distribution for Training Set
Anthony Baxter
anthony at interlink.com.au
Mon May 24 23:37:10 EDT 2004
Bahman Lashgari wrote:
> Hello!
>
> We are considering providing this plug-in to the entire office. However,
> it is an extra overhead of teaching people how to run training sets and
> they may not have enough emails for the spam category to build a good
> and updated set. Our question is this: can we configure one training
> file and load the same training file on all machines as default set? In
> this case, for example, the training file would be training.file and we
> could copy and paste to all workstations. How would this work? Your
> input is very much appreciated. Thank you.
Bear in mind that individual preferences may vary as to what's spam
and ham - having said that, if you've got a "work email is for work"
policy, that should be less of a problem. Selecting the correct
training set will be a bit tricky - you want something that's
typical of everyone's email.
You may find it appropriate to make a couple of different training
databases if you have distinct groups of users with distinct types
of email. For example, a finance department would probably deal with
messages containing terms like 'credit cards', 'cheapest' and 'payment',
while an engineering team would not.
I'd recommend quite a small initial training set - say about 30-40
of each (spam/ham). That way, if it _is_ sub-optimal for some users,
it won't be too hard for their own training to overcome the default
training. As for selecting the messages for the initial training
set - I'd start with an empty database, pick a couple of messages to
train on, then, from your test set, train on the messages that are
furthest from being correctly scored - that is, pick the lowest
scoring spams and the highest scoring hams. Don't bother training on
messages that are already being scored perfectly (1.0/100% for a spam,
0.0/0% for a ham).
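To make the selection step concrete, here's a rough sketch in Python.
It is not SpamBayes code: score_fn is a hypothetical stand-in for
however your setup gets a spam score between 0.0 and 1.0 for a message,
and the function names and the per_class default of 35 are just
illustrative choices within the 30-40 range above.

```python
def misclassification(score, is_spam):
    """Distance of a score (0.0 = ham, 1.0 = spam) from the correct end."""
    return 1.0 - score if is_spam else score


def pick_training_candidates(messages, score_fn, per_class=35):
    """Return the worst-scored spams and hams, up to per_class of each.

    messages: iterable of (text, is_spam) pairs from the test set.
    score_fn: hypothetical callable mapping text to a spam score
              in the range 0.0 to 1.0 (an assumption, not a real API).
    """
    spams, hams = [], []
    for text, is_spam in messages:
        error = misclassification(score_fn(text), is_spam)
        if error == 0.0:
            continue  # already scored perfectly, so skip it
        (spams if is_spam else hams).append((error, text))
    # Largest error first: lowest-scoring spams, highest-scoring hams.
    spams.sort(key=lambda pair: pair[0], reverse=True)
    hams.sort(key=lambda pair: pair[0], reverse=True)
    return ([text for _, text in spams[:per_class]],
            [text for _, text in hams[:per_class]])
```

In practice you'd probably train on a few of the returned messages,
rescore the test set and call it again, rather than taking all of
them in one pass, since each round of training changes the scores.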
Hope this helps!
Anthony
--
Anthony Baxter <anthony at interlink.com.au>
It's never too late to have a happy childhood.