[spambayes-dev] Re: [Spambayes] Database cleaning?

T. Alexander Popiel popiel at wolfskeep.com
Mon Jun 2 15:05:10 EDT 2003


In message:  <1054582419.53.613 at sake.mondoinfo.com>
             Matthew Dixon Cowles <matt at mondoinfo.com> writes:
>
>> Another thing that would be interesting to plot would be a
>> histogram of the average frequency each token gets used at... which
>> might give us some idea of how large a DB is actually useful.
>
>I'd be glad to poke at the data in a different way, but it's not
>clear to me how that's different from what I've done. Can you tell me
>a little more specifically what you mean?

If I'm reading your histogram right, you're plotting, for each
usage, how long it had been since that token was last used.  Thus,
a single token that gets used frequently will contribute multiple
times to the histogram.

What I'm suggesting is having each token keep track of its usage
frequency, and then building a histogram of tokens by frequency,
with each token contributing only once to the chart.  This would
give an idea of what percentage of tokens are used a lot, as opposed
to what you've got now (which says that for tokens that are used,
most will be used again soon).
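
A minimal sketch of the per-token histogram, assuming the usage data
is available as a flat list of tokens, one entry per use.  The names
usage_log, usage_counts, per_token_histogram and
fraction_used_at_least are hypothetical, not part of spambayes, and
raw use counts stand in for frequency here.  Tokens that sit in the
database but never appear in the log would have to be added
separately with a count of zero.

```python
from collections import Counter


def usage_counts(usage_log):
    """Count how many times each token appears in the usage log."""
    return Counter(usage_log)


def per_token_histogram(usage_log):
    """Map a use count to the number of distinct tokens with that count.

    Each token contributes exactly once, however often it was used.
    """
    return Counter(usage_counts(usage_log).values())


def fraction_used_at_least(usage_log, threshold):
    """Fraction of distinct logged tokens used at least `threshold` times."""
    counts = usage_counts(usage_log)
    if not counts:
        return 0.0
    heavy = sum(1 for n in counts.values() if n >= threshold)
    return heavy / len(counts)


# Hypothetical usage log: one entry per time a token was used.
usage_log = ["free", "money", "free", "python", "free", "money", "list"]
histogram = per_token_histogram(usage_log)
for uses in sorted(histogram):
    print(uses, histogram[uses])
```

Here "free" is used three times but lands in only one bucket, so the
chart shows how many tokens are heavily used rather than how many
uses come from heavily used tokens.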

- Alex


