[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Thu Dec 18 18:41:04 EST 2003


Tim,

Thanks for taking the time to construct such a complete set of answers.  I
learned a lot from it and I assume other list readers did as well.

> > [Seth Goodman]
> > If we do, we could eventually have none of the tokens from a trained
> > message present but its message count will still be there.  Unless we
> > implement your token cross-reference as explained below, the message
> > counts will eventually not be correct if we expire enough tokens.
>
> [Tim Peters]
> I want to do expiration "correctly".  But even if all the tokens from a
> message expire when the total message count is N, it still doesn't change
> that counts on tokens that remain were in fact derived from N
> messages, and
> so N remains the best possible thing to feed into the spamprob guesses.

Not really.  If you decrement all the token counts from a trained message,
the database is in exactly the state it was in before you trained on that
message (ignoring messages trained in between).  At that point the trained
message count was N-1, so N-1, not N, is the best thing to feed into the
probability calculation.  The message count will keep increasing as you
train new messages, but the token database will eventually level off, which
means the trained message counts will become increasingly too large as time
goes on.
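
To make that concrete, here's a minimal toy sketch of training and
untraining a spam message (the class and names are hypothetical, not the
actual SpamBayes classifier); untraining restores the token counts and, in
the same stroke, should take the message count back to N-1:

    # Toy sketch only -- hypothetical names, not real SpamBayes code.
    class ToyBayes:
        def __init__(self):
            self.spamcount = {}   # token -> number of spam messages containing it
            self.nspam = 0        # N: number of spam messages trained

        def train_spam(self, tokens):
            for tok in set(tokens):
                self.spamcount[tok] = self.spamcount.get(tok, 0) + 1
            self.nspam += 1

        def untrain_spam(self, tokens):
            # Undo train_spam(): the database ends up exactly as if the
            # message had never been trained, so the message count should
            # drop to N-1 along with the token counts.
            for tok in set(tokens):
                self.spamcount[tok] -= 1
                if self.spamcount[tok] == 0:
                    del self.spamcount[tok]
            self.nspam -= 1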

If you only expire hapaxes, perhaps the incorrect message count is a
technicality and won't have a significant effect on the spam probabilities.
But unless you expire non-hapaxes as well, the token database can't track a
changing message stream very well.  Once you start expiring non-hapax tokens
(is there a name for these?), my guess is that you can no longer ignore the
incorrect message count issue.  So how _do_ you do expiration "correctly" if
not by whole messages?

> >> [Tim Peters]
> >> ...
> >> There's another bullet we haven't bitten yet, saving a map of
> >> message id to an explicit list of all tokens produced by that
> >> message (Skip wants the inverse of that mapping for diagnostic
> >> purposes too).  Given that, training and untraining of individual
> >> messages could proceed smoothly despite intervening changes in
> >> tokenization details; expiring entire messages would be
> >> straightforward; and when expiring an individual feature, it would
> >> be enough to remove that feature from each msg->[feature] list it's
> >> in (then untraining on a msg later wouldn't *try* to decrement the
> >> per-feature count of any feature that had previously been expired
> >> individually and appeared in the msg at the time).
>
> > [Seth Goodman]
> > This definitely works.  But why bother tracking, cross-referencing and
> > expiring individual tokens when we can just expire whole messages,
> > which is a lot simpler?
>
> [Tim Peters]
> I doubt that it's simpler at all, and you earlier today sketched quite an
> elaborate scheme for expiring different messages at different
> rates.  That's
> got its share of tuning parameters (aka wild-ass guesses <wink>)
> too, showed
> every sign of being just the beginning of its brand of
> complication, and has
> no testing or experience to support it.  We know a lot about the real-life
> effects of hapaxes now.

Offhand, adding a single timestamp per message at training time sounds
easier than tracking the last time seen for every token in the database.  As
for the "elaborate" scheme I suggested for variable expiration times, all
that's involved is changing the message timestamp before storing it.  Since
you don't have anything like that now, you can just ignore that idea and the
extra parameter that goes with it.  BTW, that parameter value is not just a
wild-ass guess, it's a SWAG (sophisticated wild-ass guess), and I don't like
them any better than you do :)

Either way, rather than frequently searching for expired tokens (in a very
long list), you would only do token expiration when you have to train a new
message.  At that point, you find the oldest trained message (from a much
shorter list) and untrain it.  The extra complication is storing the token
list with each message ID plus its training timestamp.  That doesn't sound
big compared to cross-referencing every token to every message it appeared
in.  They're certainly not mutually exclusive and you later made a good
argument for having this extra information anyway.
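
Roughly, the bookkeeping I have in mind looks like this (the store and its
names are made up for illustration, and I'm assuming a classifier object
with learn()/unlearn() methods along the lines SpamBayes already has): each
trained message keeps its timestamp and the token list it produced, and
training a new message first untrains the oldest one once the store is full:

    import time

    # Illustrative only: msg_id -> (timestamp, token list, is_spam);
    # not the real SpamBayes database layout.
    class MessageStore:
        def __init__(self, classifier, max_messages=5000):
            self.classifier = classifier
            self.max_messages = max_messages
            self.trained = {}   # msg_id -> (timestamp, [tokens], is_spam)

        def train(self, msg_id, tokens, is_spam):
            if len(self.trained) >= self.max_messages:
                # Expire whole messages, oldest first, instead of scanning
                # every token in the database for a last-seen time.
                oldest_id = min(self.trained, key=lambda m: self.trained[m][0])
                _, old_tokens, old_is_spam = self.trained.pop(oldest_id)
                self.classifier.unlearn(old_tokens, old_is_spam)
            # The variable-expiration idea would just bias this timestamp.
            self.trained[msg_id] = (time.time(), list(tokens), is_spam)
            self.classifier.learn(tokens, is_spam)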


> [Tim Peters]
> BTW, the single worst thing you can do with a system of this type
> is train a
> message into the wrong category.  Everyone does it eventually, and some
> people can't seem to help but doing it often.  Maybe that's a UI
> problem at
> heart -- I don't know, because I seem to be unusually resistant
> to it.  It's

I agree completely.  This was an important motivation for expiring a whole
message at a time.  Training mistakes would eventually drop out of the
database without user intervention.  Not that a tool to help track down
training mistakes wouldn't be great, but a "casual" user could still make
occasional mistakes and the system would recover by itself.


> [Tim Peters]
> happened to me too, though, and it can be hard to recover.  One
> sterling use
> for a feature -> msg_ids map is, as Skip noted, a way to find out
> *why* your
> latest spam was a false negative:  look at the low-scoring features, then
> look at the messages with those features that were trained on as
> ham.  This
> has an excellent shot at pinpointing mis-trained messages.
> That's difficult
> at best now, and is a real problem for some people.  I've got gigabytes of
> unused disk space myself <wink>.

No argument there, it's a great feature for problem-solving.
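
For what it's worth, here's the shape of the query that map would make
possible (the names and data layout are made up, not anything in SpamBayes
today): take the lowest-scoring features of a false negative and list the
ham-trained messages they came from:

    # Hypothetical diagnostic lookup, for illustration only.
    # feature_to_msgids: feature -> set of msg_ids it was trained from
    # msg_labels:        msg_id -> "ham" or "spam"
    def suspect_ham_trainings(low_scoring_features, feature_to_msgids, msg_labels):
        """List ham-trained msg_ids containing the low-scoring features."""
        suspects = {}
        for feature in low_scoring_features:
            for msg_id in feature_to_msgids.get(feature, ()):
                if msg_labels.get(msg_id) == "ham":
                    suspects.setdefault(msg_id, []).append(feature)
        # Messages sharing many low-scoring features are the likeliest
        # mis-trained candidates.
        return sorted(suspects.items(), key=lambda kv: -len(kv[1]))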


> [Tim Peters]
> Evolution of this system would also be served by saving an
> explicit msg_id ->
> features map.  When we change tokenization to get a small win,
> sometimes the
> tokens originally added to a database by training on message M
> can no longer
> be reconstructed by re-tokenizing M (the tokenizer has changed!  if it
> always returned exactly what it returned before the change, there wasn't
> much point to the change <wink>).  Blindly untraining anyway can violate
> database invariants then, eventually manifesting as assertion
> errors and the
> need to retrain from scratch.  The only clear and simple way to
> prevent this
> is to save a map from msg_id to the tokens it originally produced.  Then
> untraining simply walks that list, and nothing can go wrong as a result.

I agree completely and that's why I suggested saving the token list with
each message.  Your feature_ID scheme makes it practical.


> [Tim Peters]
> That's a bit subtle, so takes some long-term experience to appreciate at a
> gut level.  Of more immediate concern to most users is that only the
> obsessed *want* to save their spam.  Most people want to throw spam away
> ASAP.  But, if they do that, we currently have no way to expire any spam
> they ever trained on.  Moving toward saving msg_ids <-> features
> maps solves
> that too, and with suitable reuse of little integers for feature ids can
> store the relevant bits about trained messages in less space than it takes
> to save the original messages.  Note that hapaxes would waste the most
> resource in this context too.

Sounds like _you're_ arguing for expiration of whole messages :)  I know
you're not arguing that, but if there were bidirectional msg_id <->
feature_ID maps, it would be fairly easy to expire whole messages.  That
would obviate the need to track the last time seen for every token.  In any
case, I hope you move in the direction of saving such maps, since they add so
much flexibility.
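
To sketch what I mean (made-up names again, not the actual implementation):
intern each feature string to a small integer, keep msg_id -> [feature_id]
and feature_id -> set(msg_id) maps, and expiring a whole message is just a
walk over its own list:

    # Illustrative bidirectional maps with interned feature strings;
    # not the SpamBayes storage layer.
    class FeatureMaps:
        def __init__(self):
            self.feature_to_id = {}    # feature string -> small int
            self.id_to_feature = []    # small int -> feature string
            self.msg_to_features = {}  # msg_id -> [feature_id, ...]
            self.feature_to_msgs = {}  # feature_id -> set of msg_ids

        def intern(self, feature):
            fid = self.feature_to_id.get(feature)
            if fid is None:
                fid = len(self.id_to_feature)
                self.feature_to_id[feature] = fid
                self.id_to_feature.append(feature)
            return fid

        def add_message(self, msg_id, features):
            fids = [self.intern(f) for f in set(features)]
            self.msg_to_features[msg_id] = fids
            for fid in fids:
                self.feature_to_msgs.setdefault(fid, set()).add(msg_id)

        def expire_message(self, msg_id):
            # Whole-message expiry: no per-token timestamps needed.
            for fid in self.msg_to_features.pop(msg_id, []):
                self.feature_to_msgs[fid].discard(msg_id)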


> [Tim Peters]
> We're not going to abandon plain strings, because they're far too
> useful and
> loved in various reports intended for human consumption.  Adding
> feature_id
> <-> feature_string maps would allow for effective compression of message
> storage.

All your arguments on this point make lots of sense.  I'm a little surprised
that you had significant collisions mapping perhaps 100K items (my guess)
into a 32-bit space.  I think that is rather dependent on the hash used, but
that's what you saw.  Since you need the cleartext anyway, your feature_ID
concept is far superior.  Thanks for educating me.
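
As a sanity check on my surprise (assuming an ideal 32-bit hash, and my
guessed 100K figure), the birthday-paradox estimate predicts only about one
colliding pair at that size, so anything "significant" does point at the
particular hash:

    # Back-of-the-envelope birthday estimate for an ideal 32-bit hash.
    # n = 100000 is only my guess at the number of distinct tokens.
    n = 100000
    buckets = 2 ** 32
    expected_colliding_pairs = n * (n - 1) / (2.0 * buckets)
    print(expected_colliding_pairs)   # ~1.16: about one collision expected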

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



