[spambayes-dev] How much tokenizer improvement is enough to justify a change?
Rob W.W. Hooft
rob at hooft.net
Mon Aug 4 10:06:58 EDT 2003
Sean True wrote:
> Delayed-Standard Cost: $169.6000
> Delayed-Flex Cost: $318.2508
> Delayed-Flex**2 Cost: $232.0352
>
> After change (breaking up compound words > maxwordlen into smaller words)
> Delayed-Standard Cost: $157.6000
> Delayed-Flex Cost: $365.5402
> Delayed-Flex**2 Cost: $237.8641
This merges nicely with the more-than-three-bins thread: the two /flex/
costs use a continuous function to describe the penalty given to a
message. For a message in its own zone (a spam in the spam zone, or a
ham in the ham zone) the penalty is 0.0, and it rises smoothly
(linearly or quadratically) once the message moves out into the unsure
zone.
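As a rough illustration of such a continuous penalty, here is a minimal
Python sketch. The names flex_penalty and total_cost, the cutoffs 0.2
and 0.9, and the two weights are assumptions made up for this example;
none of it is taken from the spambayes source, and the real cost
counters may differ in detail.

```python
def flex_penalty(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9, power=1):
    """Return how far (0.0 to 1.0) a message has drifted out of its own zone.

    power=1 gives the linear variant, power=2 the quadratic one.
    All defaults are illustrative assumptions.
    """
    # Position of the score across the unsure band, clamped to [0, 1].
    pos = (score - ham_cutoff) / (spam_cutoff - ham_cutoff)
    pos = min(1.0, max(0.0, pos))
    # A ham is penalized for drifting upward, a spam for drifting downward.
    drift = (1.0 - pos) if is_spam else pos
    return drift ** power


def total_cost(scored, fp_weight=10.0, fn_weight=1.0, **kwargs):
    """Sum the weighted penalties over (score, is_spam) pairs.

    fp_weight and fn_weight are hypothetical dollar weights.
    """
    total = 0.0
    for score, is_spam in scored:
        weight = fn_weight if is_spam else fp_weight
        total += weight * flex_penalty(score, is_spam, **kwargs)
    return total
```

With these definitions a ham scoring 0.1 or a spam scoring 0.95 costs
nothing, while a message in the middle of the unsure band contributes
half of its weight in the linear variant and a quarter in the quadratic
one.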
What you see here is that even though the cutoff-based (standard) cost
for your data set is going down, the average amount by which messages
are outside of their own zone is going up. I read that as a vote
against making more than three bins.
I tried long ago to use these /flex/ costs to optimize the cutoffs and
other parameters of spambayes, but that failed miserably, so I'm not
sure what this all means.
Rob
--
Rob W.W. Hooft || rob at hooft.net || http://www.hooft.net/people/rob/