[spambayes-dev] More obvious logarithmic expiration data

T. Alexander Popiel popiel at wolfskeep.com
Mon Jun 9 13:11:29 EDT 2003


In message:  <1054952982.19.1541 at sake.mondoinfo.com>
             Matthew Dixon Cowles <matt at mondoinfo.com> writes:
>I mentioned a while ago that I'd do a little more work based on the
>statistics that I had collected that showed that tokens that figured
>in scoring were likely to be used for scoring again soon.
>
>I instrumented classifier.py and hammie.py to compute several scores
>and log them when computing a score. Each time SpamBayes computes a
>score, it also computes scores using only tokens that had been used
>in scoring in the previous 24 hours, the previous week, the previous
>two weeks, and the previous 30 days.
>
>Here are some results:
>
>2587 sets of scores processed
>Number of scores that differ from actual score
>by 0.00           6885
>by 0.01 or less    633
>by 0.10 or less    179
>by 0.20 or less     32
>by more than 0.20   32

Are these numbers comparing the within-24-hours scores to the
within-30-days scores, or the within-7-days scores to the
within-30-days scores (given that later you say you're comparing
against the 30-days number, not the actual score), or some
combination of both?

>Also encouragingly, the score changes that happen don't seem to move
>the scores out of the standard 0.0-0.2 and 0.9-1.0 categories much:
>
>                         Moved out of spam   Moved out of ham
>Restricted to one day                   13                  8
>Restricted to one week                   3                  0
>Restricted to two weeks                  2                  0

This is much clearer data.

>If I were cleverer, I'd have guessed all this from the number of
>posts in which people have said that they've trained SpamBayes on
>only a couple of hundred emails and that it's already working well
>for them. But then I wouldn't have the fabulous collection of
>ambiguous and invalid data that came before looking at how often
>tokens are used for scoring <wink>.

Yep.  Empiricism beats clever theory, here.

>Judging from this data, I could relatively painlessly use a database
>that contains only those tokens that have figured in scoring in the
>last ten days or so. That's about 11% of the 273487 tokens in my
>database.

Nifty.  Do you have any provision for retaining (or desire to retain)
words that were used a lot, but suddenly go through an N+1 day dry
spell where they aren't used at all?
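
Concretely, I'm imagining an expiration pass along these lines.
The field names and thresholds here are invented for illustration;
the use_count clause is one way to keep a heavily-used word alive
through a dry spell:

    import time

    DAY = 24 * 60 * 60

    def prune(db, max_age_days=10, keep_if_used_at_least=50,
              now=None):
        """Drop tokens that haven't figured in scoring recently,
        but retain heavily-used tokens through a temporary dry
        spell.  db maps token -> record; last_used and use_count
        are hypothetical fields, not the real SpamBayes schema.
        """
        if now is None:
            now = time.time()
        cutoff = now - max_age_days * DAY
        doomed = [tok for tok, rec in db.items()
                  if rec["last_used"] < cutoff
                  and rec["use_count"] < keep_if_used_at_least]
        for tok in doomed:
            del db[tok]
        return len(doomed)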

>You'd need to bootstrap the process, presumably by counting a token
>as used when it's first trained on. Waiting for a token to be used
>before making it eligible for use has a certain theoretical elegance
>but results might suffer <wink>.

Yeah, that's likely the best bootstrap.  Alternately, you could
base it on words appearing rather than being used.  That has some
value in not continually dropping noise words like 'the',
relearning that they're worthless, then not using them because
they're within the .4-.6 exclusion range, then dropping them,
then relearning them, etc...
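
In code, the bootstrap might look something like this, again with
made-up names.  Stamping at training time makes a new token
immediately eligible, and only clues that actually contribute
refresh the stamp afterward:

    import time

    def note_trained(db, tokens, now=None):
        """A freshly trained token counts as 'used', so it is
        immediately eligible and survives the first expiration
        pass.  (Hypothetical record layout.)"""
        if now is None:
            now = time.time()
        for tok in tokens:
            rec = db.setdefault(tok, {"use_count": 0,
                                      "last_used": now})
            rec["last_used"] = now

    def note_used_in_scoring(db, tokens, now=None):
        """Refresh the stamp only for tokens that supplied real
        clues, i.e. fell outside the .4-.6 exclusion range.
        Stamping on mere appearance instead would avoid the
        drop/relearn churn for noise words."""
        if now is None:
            now = time.time()
        for tok in tokens:
            if tok in db:
                db[tok]["use_count"] = db[tok]["use_count"] + 1
                db[tok]["last_used"] = now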

>And, of course, it's not really time that counts but rather the
>number of emails seen.

I'm not so convinced of this.  One of the things we're dealing
with is spam mutation rate, which I believe is independent of
how much mail any one person receives.
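
Either clock drops into the same machinery, for what it's worth;
a message counter could simply replace time.time() in the
sketches above (illustrative only):

    class MessageClock:
        """An expiration clock measured in messages scored rather
        than seconds."""
        def __init__(self):
            self.count = 0

        def tick(self):  # call once per message scored
            self.count = self.count + 1

        def now(self):
            return self.count

    # e.g. expire tokens not used within the last 5000 messages,
    # using a cutoff of clock.now() - 5000 in the prune pass.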

>Ironically, I started collecting these statistics when I was using a
>laptop with a tiny hard disk. Now, with 60G at my disposal, the 23M
>that my database takes up is pretty trifling.

Indeed.  I've got mine capped at about 21M by only considering
mail from the last 4 months... but it wouldn't significantly
hurt my disk usage (out of 30-some gig) if I didn't bother.
I have far more space than this consumed by keeping archival
copies of PennMUSH patch releases going back a decade...

- Alex


