[Spambayes] Anyone else seeing increasing error rates over time?

Bill Yerazunis wsy at merl.com
Wed Dec 4 15:36:26 2002


   Over the past few days, I've been seeing an increase in FNs and
   Unsures. I initially trained on my inbox and spam folders (386 ham,
   999 spam), and since then I've trained on errors only. I'm now at
   391 ham and 1011 spam. Initially, I was getting no errors, and 1 or
   2 unsures per day. Now, I'm starting to get at least 1 FN per day,
   and a slight increase in the unsure rate.

   It's far too early to tell, but could this be related to Tim's code
   to handle unbalanced training sets? As time goes on, the spam:ham
   ratio will increase (as FNs happen more often than FPs) and so the
   impact of spam clues will be lessened (by Tim's code). I'll keep
   monitoring this, but my "real life" mail is definitely unbalanced
   (home is massively biased in favour of spam, work massively biased
   in favour of ham, but I pre-filter mailing lists which muddies the
   water badly).

   I dunno. Do the testing gurus round here have any idea whether this
   type of hypothesis could be tested in practice?

I'm seeing an increase in error rates as well.  I'm starting to think
of it as "evolution in action"; that is, it's actually an indication
of how fast spam mutates.  The errors are new kinds of spam, or at
least new topics or a new style, and not simple misclassifications in
the classic sense.

Looking at the statistics on CRM114, as of today (with the run 
starting Nov 1):

Week 1 - zero errors
Week 2 - zero errors
Week 3 - two errors
Week 4 - two errors
Week 5 - four errors, and it's only Wednesday!

As of the start of week 5, I'm back to Train On Errors on-the-fly, and
I'll let you know if that helps or not.  

It's too early to really have any assurance that this is the case, but
I'll hypothesize that this shows that spam has a measurable nonzero
mutation rate, and that mutation rate can be approximated by:

                                                 kT
    Total really new spams seen = Spams seen * (e   - 1)

where T is the elapsed time in days since training stopped, and k is
an empirical constant with a value of roughly 0.0001 per day.

Paul Moore: see if this predicts your increase in errors.  If you get
100 spams a day, and it's been 5 days since you last trained, this
rule predicts about a 1/4 chance of a really new spam by the 5th
day... but about 4 of them by the 20th day.
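
For anyone who wants to plug in their own numbers, here is a rough
sketch of that arithmetic.  It assumes a constant arrival rate, so
that "Spams seen" is simply rate * T, and it takes k as 0.0001 per
day; the function name predicted_new_spams is made up for
illustration and is not part of CRM114 or Spambayes.

```python
import math

K = 0.0001  # rough empirical constant from the formula, assumed per day


def predicted_new_spams(spams_per_day, days_since_training, k=K):
    """Spams seen * (e^(kT) - 1), assuming Spams seen = rate * T."""
    spams_seen = spams_per_day * days_since_training
    # expm1(x) computes e^x - 1 accurately for small x
    return spams_seen * math.expm1(k * days_since_training)


for days in (5, 20):
    print(days, round(predicted_new_spams(100, days), 3))
# prints:
# 5 0.25
# 20 4.004
```

Since kT is small here, e^(kT) - 1 is close to kT, so the prediction
grows roughly as k * rate * T^2; that is why 4x the elapsed time gives
about 16x the really new spams.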

HUGE SCREAMING CAVEAT: This equation is pure smoke and mirrors, as I
have far too little data to get an error bar that isn't the entire
plotting area; a case of "torturing the data until it confesses"
sufficient to warrant investigation by the Hague Tribunal.

n.b.:

The spams I've seen come through since the start of the November run
are in general really new, and either

 1) written so well that they even fool me into reading for a page
    or two until I figure out that they're spams (or have me laughing
    so hard that I keep reading anyway)

or

 2) written so tersely that it takes some background research to
    figure out that they're spams.

The exception was the first occurrence of "Barnyard Teen" spam (you
figure it out...)

And gee, just when I thought things had settled down enough that I
could sit back and make CRM114 truly 8-bit clean and wchar-safe...

      -Bill Yerazunis




