[Spambayes] Re: There Can Be Only One

Harald Koch chk@pobox.com
Wed, 25 Sep 2002 00:51:59 -0400


> To avoid distractions, there's only one kind of test run I'll look at for
> this:  a 10-fold cross-validation run with exactly 200 ham and 200 spam in
> each set.  If you don't have at least 2,000 ham and 2,000 spam, you can't
> run this test, and reporting results won't help.

I don't know about you, but I receive about 250 spam a *month*. It would
take me 8 months to collect 2000 spams. However, the spam I was
receiving eight months ago is generally quite different from the spam
I'm receiving now, so even if I *did* have an archive going back that
far it wouldn't be terribly useful.

One of the purported strengths of Paul's original idea was that it was
*adaptive*.  Based on my recent reading of this list, I believe two
important facts about spam in the wild are being downplayed:

- mail headers. I understand *why* you're downplaying them, but the
  discriminators on *my* spam vs. ham *do* often come from the headers.
  Spammers seem to all find the same open relays at the same time ;-)

- *time*. my spam is self-similar over short periods of time, and (with
  some exceptions) changes and evolves over longer periods. Token
  statistics collected eight months ago wouldn't eliminate much of my
  current spam. Heck, if I train the classifier on only the old spam and
  then run it against all of the new spam, the f-n rate is abysmal.
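
To make the old-vs-new experiment concrete, here is a rough Python sketch of
how one might measure it. Everything in it is an assumption for illustration:
the TinyBayes class is a toy add-one-smoothed token classifier (not Paul's
exact scoring and not my Perl script), the function name
fn_rate_on_newer_spam is made up, and the sample messages are invented.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.nmsgs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.nmsgs[label] += 1

    def is_spam(self, text):
        score = 0.0  # summed log-odds of spam vs. ham
        for tok in set(text.lower().split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.nmsgs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score > 0


def fn_rate_on_newer_spam(messages, cutoff):
    """Train on messages dated before `cutoff`; return the fraction of
    spam dated at or after `cutoff` that gets classified as ham.

    `messages` is an iterable of (timestamp, label, text) tuples, where
    label is "spam" or "ham" and timestamps are any comparable values.
    """
    clf = TinyBayes()
    newer_spam = []
    for when, label, text in messages:
        if when < cutoff:
            clf.train(text, label)
        elif label == "spam":
            newer_spam.append(text)
    if not newer_spam:
        return 0.0
    missed = sum(1 for text in newer_spam if not clf.is_spam(text))
    return missed / len(newer_spam)


# Invented data: day numbers as timestamps, cutoff at day 100.
sample = [
    (1, "spam", "cheap pills online"),
    (2, "spam", "cheap pills now"),
    (3, "ham", "meeting notes attached"),
    (4, "ham", "lunch meeting tomorrow"),
    (200, "spam", "cheap pills today"),         # resembles the old spam
    (201, "spam", "refinance mortgage rates"),  # new vocabulary, slips through
]
print(fn_rate_on_newer_spam(sample, cutoff=100))  # 0.5
```

Splitting by date rather than at random is the whole point: a random split
mixes old and new spam in both halves and hides the drift.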

I'm running a very simple Perl version of the algorithm right now. The
thing that most aggressively lowers my f-n rate is my daily inbox cull;
I feed f-n spam into the classifier whenever I find it.
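
For illustration, the daily cull amounts to something like the Python sketch
below. This is not my Perl code; TokenStats, daily_cull and the
user_says_spam callback are hypothetical names, and the scoring is a
deliberately crude average of per-token spam ratios.

```python
from collections import Counter


class TokenStats:
    """Toy per-token spam/ham counts (illustrative only)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def spam_score(self, text):
        # Average smoothed spam ratio over the message's distinct tokens;
        # unseen tokens contribute a neutral 0.5.
        tokens = set(text.lower().split())
        if not tokens:
            return 0.5
        ratios = [
            (self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
            for t in tokens
        ]
        return sum(ratios) / len(ratios)


def daily_cull(stats, inbox, user_says_spam):
    """Train on every inbox message the user flags as spam.

    `inbox` holds messages the filter let through; `user_says_spam` is a
    callback standing in for the manual judgment. Returns the number of
    false negatives fed back into the statistics.
    """
    fed_back = 0
    for text in inbox:
        if user_says_spam(text):
            stats.train(text, is_spam=True)
            fed_back += 1
    return fed_back
```

The design choice is that only the misses get retrained on, so the token
statistics keep tracking whatever the spammers changed most recently.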

> Q. This is a ham/spam ratio of 1.  Is that realistic?
> A. We can't test everything at once.

It can be, especially at certain times of the day; I get most of my spam
at night and most of my regular email during the day (North American time).

> Q. 1800 each of ham & spam is a very large training set.
>    Wouldn't it be better to use less training data?
> A. We can't test everything at once.

<laughter>

-- 
Harald Koch     <chk@pobox.com>

"It takes a child to raze a village."
		-Michael T. Fry