[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Mon Dec 22 13:40:40 EST 2003


>     >> [Tim Peters]
>     >> I don't want to expire a hapax if it's been used recently in
>     >> *scoring*.  Message times can't distinguish used from unused
>     >> features.  If you're doing train-on-everything (with or without
>     >> whole-msg expiration), a hapax used in scoring becomes a non-hapax
>     >> the first time it's used in scoring.  For
>
>     Seth> But for really unusual messages of the type you were concerned
>     Seth> about, this may only happen once a year, or so, which is too long
>     Seth> for a hapax-expiration scheme.
>
> [Skip Montanaro]
> Under the heading of "practicality beats purity"...
>
> If you know a given type of message is ham but is seen infrequently, train
> on it twice.  That makes sure none of its tokens are hapaxes, so they are
> never candidates for deletion.

Great point.  That solves the problem for hapax expiration and unusual
messages.
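
To make the counting concrete, here is a minimal sketch in Python.  It is
not SpamBayes code: `train`, `expire_hapaxes` and the token lists are
hypothetical, and it assumes a hapax is simply a token whose training count
is exactly 1.  It also ignores the age test a real expiration scheme would
apply before deleting anything.

```python
from collections import Counter

def train(counts, tokens):
    """One training pass: each distinct token in the message counts once."""
    counts.update(set(tokens))

def expire_hapaxes(counts):
    """Delete every token that has been trained exactly once (a hapax)."""
    for tok in [t for t, n in counts.items() if n == 1]:
        del counts[tok]

ham_counts = Counter()
rare_ham = ["annual", "shareholder", "proxy", "statement"]  # seen once a year
other_ham = ["lunch", "meeting"]                            # trained only once

train(ham_counts, rare_ham)
train(ham_counts, rare_ham)    # second pass lifts every count to 2
train(ham_counts, other_ham)   # these tokens stay hapaxes

expire_hapaxes(ham_counts)
survivors = sorted(ham_counts)
```

After expiration, only the tokens of the twice-trained message remain; the
tokens trained once are gone.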

> [Skip Montanaro]
> Hmmm...  That violates my "never train on a message twice" dictum.

Since you're thinking pragmatically, don't worry about the dictum.
Presumably, you would only do this rarely, i.e. on messages the likes of
which you only expect a couple times a year.  For the Outlook version, you
would have to make a copy of the message and train on that, but it would
still solve the problem.  Just out of curiosity, does the proxy version of
SpamBayes have the same protection as the Outlook version against training
on the same msg_id twice?
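
For illustration only, this is the kind of protection I mean, sketched in
Python.  `ToyTrainer` and the id strings are hypothetical, and I am assuming
the guard is nothing more than a set of already-trained message ids; the
real Outlook and proxy code may work quite differently.

```python
from collections import Counter

class ToyTrainer:
    """Hypothetical trainer that refuses to train the same msg_id twice."""

    def __init__(self):
        self.counts = Counter()
        self.trained_ids = set()

    def train(self, msg_id, tokens):
        if msg_id in self.trained_ids:
            return False              # already trained: ignore the request
        self.trained_ids.add(msg_id)
        self.counts.update(set(tokens))
        return True

trainer = ToyTrainer()
tokens = ["annual", "shareholder"]

first = trainer.train("msg-1", tokens)        # accepted
repeat = trainer.train("msg-1", tokens)       # rejected, same id
copy = trainer.train("msg-1-copy", tokens)    # accepted, the copy has a new id
```

Under that assumption, the copy gets through because its id differs, and
each token ends up with a count of 2.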

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



