[Spambayes] statistical comparison of environment?

T. Alexander Popiel popiel at wolfskeep.com
Wed Mar 5 20:22:00 EST 2003


In message:  <15974.46301.300535.819582 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:
>
>Accordingly, when considering potential improvements (improved tokenizing
>tricks, for example), perhaps what we should be doing is disabling much of
>the current capability and then testing a new change against such a
>"crippled" system.

This seems like a reasonable strategy.  There are already options to
control some of the header parsing; I suspect more options could be
put in to disable various other aspects of the tokenizer.  I'm not
sure how much the folks who are just trying to use the system will
like all the extra options, though...

>What I don't know is how to measure the independence of two different
>"improvements".

The simple solution for that seems to me to be doing four runs, one for
each combination of the two options on and off.  If the two are
independent, then the run with both on should be better than either
run with just one on, and the run with neither on should be worse
than all the others.  If they're really independent, then there should
be a nice mathematical relation between the improvements from none to
either and from either to both... but I'm forgetting what that
math is at the moment, and I doubt that anything is perfectly
independent anyway.
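
A rough sketch of the four-run comparison, just to make the bookkeeping
concrete.  Here run_test is a made-up callable (not an existing spambayes
function) that takes the two on/off flags and returns an accuracy between
0 and 1; how it actually trains and scores is left open.

```python
from itertools import product


def four_runs(run_test):
    """Call run_test once per on/off combination of two options.

    run_test is a hypothetical callable taking (a_on, b_on) and
    returning an accuracy between 0 and 1.
    """
    return {(a_on, b_on): run_test(a_on, b_on)
            for a_on, b_on in product((False, True), repeat=2)}


def ordering_holds(results):
    """True if both-on beats each single option and neither-on is worst."""
    neither = results[(False, False)]
    only_a = results[(True, False)]
    only_b = results[(False, True)]
    both = results[(True, True)]
    return both > max(only_a, only_b) and neither < min(only_a, only_b)
```

The ordering check is only the weak test described above; it says the two
options don't get in each other's way, not that they are independent in
any strict sense.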
 
>Suppose for the sake of argument that this base system I talk about is 80%
>effective at properly distinguishing ham from spam.  Suppose improvement A
>takes that to 83% and applied independently to the base system, improvement
>B takes that to 85%.  How do you tell how independent A and B are from one
>another?

By doing a run with both A and B, and seeing whether it comes out at
about 87%.
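
One way to arrive at a figure near 87% is to assume that each improvement
removes a fixed fraction of whatever errors remain.  That model is an
assumption for illustration, not an established result, and the function
name is made up.  With a 20% base error rate, A leaves 17/20 of the errors
and B leaves 15/20, so both together would leave 20% * 0.85 * 0.75 =
12.75%.  (Simply adding the two gains would predict 88% instead.)

```python
def expected_both(base, with_a, with_b):
    """Accuracy expected with both options on.

    Assumes (hypothetically) that each option independently removes a
    fixed fraction of the remaining errors, so the error factors multiply.
    """
    base_err = 1.0 - base
    a_factor = (1.0 - with_a) / base_err   # share of errors left by A alone
    b_factor = (1.0 - with_b) / base_err   # share of errors left by B alone
    return 1.0 - base_err * a_factor * b_factor


# Numbers from the example above: base 80%, A alone 83%, B alone 85%.
print(round(expected_both(0.80, 0.83, 0.85), 4))
```

A measured result well below that would suggest A and B are catching
mostly the same messages.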

>(The more independent two improvements are, the harder it seems it would
>be for a spammer to hit two birds with one stone when trying to defeat
>spambayes.)

Aye.  The problem, of course, is that we could start making spambayes
so tricked-out that it'd be as slow as SpamAssassin. ;-)

- Alex


