[Spambayes] Re: Detecting hashbusters (was: Wired: Random Acts of Spamness)

Wed Jan 14 07:11:22 EST 2004

>>> Skip Montanaro wrote
> The question then becomes, how do you detect "the presence of white noise"?
> It seems obvious that a hash buster will contain a fairly long string of
> words not seen before.  The problem seems to be that in a newly minted
> database that might be inaccurate.  It also suggests a stronger coupling
> between the tokenizer and the classifier, since either the tokenizer needs
> access to the training database to consider what is and isn't known or the
> classifier needs to generate some synthetic features based upon the stream
> of tokens coming back from the tokenizer.

Surely this would be better in the classifier - if a message has more than
a certain number of unknown tokens, handle the unknown word probability
differently. Or else synthesise a token based on the proportion of tokens
with known vs unknown spamprobs and append it to the token stream.
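
Something like the following is roughly what I mean by the second option.
It's only a sketch and hasn't been run against SpamBayes itself: the
function name, the "unknown-ratio:N" token format, the bucket count and
the minimum-size cutoff are all made up for illustration, and `known`
stands in for whatever lookup the classifier really uses to decide
whether it has seen a token before.

```python
def with_unknown_ratio_token(tokens, known, buckets=10, min_tokens=20):
    """Pass tokens through unchanged, then append one synthetic token.

    Hypothetical helper, not part of SpamBayes. `known` is any container
    supporting `in` (a set, or a dict keyed by token).
    """
    seen = set()
    unknown = 0
    for tok in tokens:
        if tok not in seen:
            # Count each distinct token once, so a repeated word
            # doesn't skew the ratio.
            seen.add(tok)
            if tok not in known:
                unknown += 1
        yield tok
    total = len(seen)
    if total >= min_tokens:
        # Bucket the unknown fraction so the classifier sees a small,
        # fixed set of synthetic tokens instead of one per percentage.
        bucket = min(unknown * buckets // total, buckets - 1)
        yield "unknown-ratio:%d" % bucket
```

The cutoff is there so that very short messages, where the ratio is mostly
noise, don't get a synthetic token at all. With a newly minted database
nearly everything lands in the top bucket for ham and spam alike, so the
token should carry little weight until training catches up.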

I'm not sure if it's worthwhile - I'm finding that smalldb+nonedge isn't
having any problems at all. The rate of training has dropped to about 2
messages a day, as well.

Anthony