[Spambayes] statistical comparison of environment?

Wed Mar 5 19:23:27 EST 2003

Skip Montanaro wrote:
> By reducing the effectiveness of the system for testing, I think we'd have a
> better idea how effective a new idea might be.  What I don't know is how to
> measure the independence of two different "improvements".  (The more
> independent two improvements are, the harder it seems it would be for a
> spammer to hit two birds with one stone when trying to defeat spambayes.)
> Suppose for the sake of argument that this base system I talk about is 80%
> effective at properly distinguishing ham from spam.  Suppose improvement A
> takes that to 83% and applied independently to the base system, improvement
> B takes that to 85%.  How do you tell how independent A and B are from one
> another?

how about measuring each of the methodologies individually (at least those that have relevance; it seems that time is not one such approach), then looking for those that are most complementary? for example, suppose you had a simple matrix with message_id along the vertical axis and methodology across the horizontal axis (plus one column for the 'true nature' of each message), and then checked to see which combination of methodologies was the most accurate?

of course, there may be some level of combinatorial explosion in doing it this way, but it would speak to the independence issue, wouldn't it?
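
a rough sketch of that matrix in Python, purely as an illustration. the message ids, the methodology names ("A", "B", "C") and every verdict below are made up, and combining methodologies by majority vote (ties counted as ham) is my own assumption, not something spambayes does. with n methodologies there are 2**n - 1 non-empty combinations, which is where the explosion comes from.

```python
from itertools import combinations

# Hypothetical data: True means "spam". TRUTH is the 'true nature' column;
# VERDICTS holds one column per methodology, keyed by message_id.
TRUTH = {"m1": True, "m2": True, "m3": False,
         "m4": False, "m5": True, "m6": False}
VERDICTS = {
    "A": {"m1": True, "m2": True, "m3": False,
          "m4": True, "m5": False, "m6": False},
    "B": {"m1": True, "m2": False, "m3": False,
          "m4": False, "m5": True, "m6": True},
    "C": {"m1": True, "m2": True, "m3": False,
          "m4": True, "m5": True, "m6": False},
}


def errors(method):
    """Message ids that a single methodology misclassifies."""
    return {m for m, truth in TRUTH.items() if VERDICTS[method][m] != truth}


def shared_errors(a, b):
    """Message ids that both methodologies get wrong.

    Few shared errors suggests the two are complementary.
    """
    return errors(a) & errors(b)


def combo_accuracy(methods):
    """Accuracy of a combination under an assumed majority vote.

    A message is called spam only if more than half of the methodologies
    say spam, so a tie counts as ham.
    """
    hits = 0
    for m, truth in TRUTH.items():
        votes = sum(VERDICTS[name][m] for name in methods)
        guess = votes * 2 > len(methods)
        hits += guess == truth
    return hits / len(TRUTH)


def rank_combinations(names):
    """Every non-empty combination, most accurate (then smallest) first."""
    results = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            results.append((combo_accuracy(combo), combo))
    return sorted(results, key=lambda r: (-r[0], len(r[1]), r[1]))


if __name__ == "__main__":
    for accuracy, combo in rank_combinations(sorted(VERDICTS)):
        print(f"{'+'.join(combo):6s} {accuracy:.2f}")
    for a, b in combinations(sorted(VERDICTS), 2):
        print(a, b, "shared errors:", sorted(shared_errors(a, b)))
```

the shared-error sets are the part that speaks to independence: two methodologies that rarely fail on the same message_id are covering for each other, whatever their individual scores.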

b