[spambayes-dev] How much tokenizer improvement is enough to justify a change?

Rob W.W. Hooft rob at hooft.net
Mon Aug 4 10:06:58 EDT 2003


Sean True wrote:
> Delayed-Standard Cost: $169.6000
> Delayed-Flex Cost: $318.2508
> Delayed-Flex**2 Cost: $232.0352
> 
> After change (breaking up compound words > maxwordlen into smaller words)
> Delayed-Standard Cost: $157.6000
> Delayed-Flex Cost: $365.5402
> Delayed-Flex**2 Cost: $237.8641

This merges nicely with the more-than-three-bins thread: the two /flex/
costs are cost functions that use a continuous function to describe the
penalty given to a message. For a message in its own zone (a spam in the
spam zone, or a ham in the ham zone) the penalty is 0.0, and it rises
smoothly (linearly for Flex, quadratically for Flex**2) once the message
gets out into the unsure zone.
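
To make the idea concrete, here is a minimal sketch of such a penalty
in Python. It is not the actual spambayes code: the function names, the
cutoffs (0.20 and 0.90) and the unit weight are assumptions chosen only
for illustration.

```python
def flex_penalty(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
                 power=1, weight=1.0):
    """Hypothetical flex-style penalty for one message.

    Returns 0.0 while the message scores inside its own zone, and a
    value that grows with the distance from that zone otherwise.
    power=1 gives a linear penalty, power=2 a quadratic one.
    """
    if is_spam:
        # A spam is in its own zone at or above spam_cutoff.
        distance = max(0.0, spam_cutoff - score)
    else:
        # A ham is in its own zone at or below ham_cutoff.
        distance = max(0.0, score - ham_cutoff)
    return weight * distance ** power


def total_flex_cost(messages, **kwargs):
    """Sum the penalty over (score, is_spam) pairs."""
    return sum(flex_penalty(score, is_spam, **kwargs)
               for score, is_spam in messages)


if __name__ == "__main__":
    # Made-up scores, purely illustrative.
    sample = [(0.05, False), (0.50, False), (0.95, True), (0.70, True)]
    print("linear   :", total_flex_cost(sample, power=1))
    print("quadratic:", total_flex_cost(sample, power=2))
```

Unlike a cost based on cutoffs alone, this kind of penalty keeps
changing as a message drifts further from its own zone, even when it
never crosses another cutoff.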

What you see here is that even though the cutoff-price for your data
set is going down (Delayed-Standard drops from $169.60 to $157.60), the
average amount by which messages are outside of their own zone is going
up (both /flex/ costs rise). That is a vote against making more than
three bins.
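
The size of each shift can be worked out from the figures quoted above.
A small Python sketch, using only those six numbers (the dictionary and
function names are made up for this example):

```python
# Costs quoted in Sean's message, before and after the tokenizer change.
before = {
    "Delayed-Standard": 169.6000,
    "Delayed-Flex": 318.2508,
    "Delayed-Flex**2": 232.0352,
}
after = {
    "Delayed-Standard": 157.6000,
    "Delayed-Flex": 365.5402,
    "Delayed-Flex**2": 237.8641,
}


def cost_changes(before, after):
    """Return {name: (absolute change, percent change)} per cost."""
    changes = {}
    for name, old in before.items():
        delta = after[name] - old
        changes[name] = (delta, 100.0 * delta / old)
    return changes


if __name__ == "__main__":
    for name, (delta, pct) in cost_changes(before, after).items():
        print(f"{name:18s} {delta:+9.4f}  ({pct:+.1f}%)")
```

The Standard cost comes out negative (an improvement) while both
/flex/ costs come out positive, the linear one by the larger margin.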

I tried long ago to use these /flex/es to optimize the cutoffs and
other parameters of spambayes, but that failed miserably. Therefore I'm
not sure what this all means.

Rob

-- 
Rob W.W. Hooft  ||  rob at hooft.net  ||  http://www.hooft.net/people/rob/



