[spambayes-dev] "approximately" the same size
Skip Montanaro
skip at pobox.com
Mon Jan 24 16:29:02 CET 2005
Kenny> Mathematically, the total number of tokens should have no effect
Kenny> on the probabilities. We only count a token once per message,
Kenny> and we divide the number of messages that have contained the
Kenny> token by the total number of messages. The total number of
Kenny> tokens never figures into the calculation at all.
Still, it seems to me the number of unique tokens seen (and the overlap
between those seen in ham and those in spam) must have some effect on the
effectiveness of the algorithm. The more disjoint the sets of tokens
appearing in hams and spams are, the easier it should be to distinguish ham
from spam. If there are 1000 tokens that appear in ham and only 100 that
appear in spam, isn't it more likely that the intersection of the two sets
approximates the set of spam tokens?
Kenny> It would be interesting to know, though, if this type of
Kenny> imbalance might skew the selection of the significant tokens that
Kenny> figure into the calculation of the final score. If there are
Kenny> significantly more ham tokens in the training, is it more likely
Kenny> that the 150 significant tokens chosen will also have a higher
Kenny> percentage of ham tokens?
That's sort of what I was thinking (though my thought was not as
well-formed).
So, getting back to the original problem. Assume I have tried hard to
maintain a nearly 1:1 ham:spam ratio. Given that most hams are much larger
than most spams, there will be many more tokens found in hams than tokens
found in spams. Most tokens seen in spams will also have been seen in some
hams, thus lessening their effectiveness.
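A toy example (my sketch, not SpamBayes internals) of how that overlap dilutes a clue: a token unique to spam scores 1.0, while the same token, once it has also shown up in many large hams, scores close to 0.5 and is a much weaker clue when the significant tokens are chosen by distance from 0.5.

```python
def spamprob(spam_count, ham_count, nspam, nham):
    # Simplified Graham-style ratio of message frequencies.
    spamratio = spam_count / nspam
    hamratio = ham_count / nham
    return spamratio / (spamratio + hamratio)

nspam = nham = 100  # a balanced 1:1 training set, as in the scenario above

# Token unique to spam: maximally indicative.
print(spamprob(60, 0, nspam, nham))    # 1.0

# Same token also seen in many hams: diluted toward 0.5.
print(spamprob(60, 55, nspam, nham))   # ~0.52
```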
A corollary thought: Given H and S, the sets of ham and spam tokens,
respectively, what would the effect be of simply deleting their intersection
from the database?
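Mechanically, the corollary is easy to state. Here is a minimal sketch, assuming a toy database shaped as a dict mapping token to (ham_count, spam_count); the real SpamBayes store is different, and whether pruning actually helps is exactly the open question.

```python
def prune_intersection(db, ham_tokens, spam_tokens):
    """Drop every token in H & S from a toy token database."""
    shared = ham_tokens & spam_tokens
    return {tok: counts for tok, counts in db.items() if tok not in shared}

H = {"meeting", "lunch", "free", "offer"}   # tokens seen in ham
S = {"free", "offer", "v1agra"}             # tokens seen in spam
db = {tok: (1, 1) for tok in H | S}

pruned = prune_intersection(db, H, S)
print(sorted(pruned))  # only the tokens unique to one corpus survive
```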
Skip