[spambayes-dev] Re: [Spambayes] Database cleaning?

Matthew Dixon Cowles matt at mondoinfo.com
Mon Jun 2 21:36:02 EDT 2003


> What I'm suggesting is having each token keep track of its usage
> frequency, and then building a histogram of token vs. frequency,
> with each token only contributing once to the chart.  This would
> give an idea of what percentage of tokens are used a lot, as opposed
> to what you've got now (which says that for tokens that are used,
> most will be used again soon).

Here you go. Though this one doesn't seem to be worth a histogram:

Over 30.0 days, 63209 tokens were used in scoring a total of 1107800
times
Largest number of uses 11144, smallest 1

      0-500 uses 62929
   500-1000 uses 145
  1000-1500 uses 36
  1500-2000 uses 26
  2000-2500 uses 27
  2500-3000 uses 10
  3000-3500 uses 3
  3500-4000 uses 3
  4000-4500 uses 18
  4500-5000 uses 4
  5000-5500 uses 2
  5500-6000 uses 1
  6000-6500 uses 2
  6500-7000 uses 1
  7000-7500 uses 0
  7500-8000 uses 1
  8000-8500 uses 0
  8500-9000 uses 0
  9000-9500 uses 0
 9500-10000 uses 0
10000-10500 uses 0
10500-11000 uses 0
11000-11500 uses 1


The token that was used 11144 times was "content-type:text/plain",
and the next most commonly used one was "subject:: ".
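
For anyone who wants to produce the same kind of table from their own
database, here is a minimal sketch of the binning step. It is not the
script that generated the numbers above. It assumes you already have a
dict mapping each token to the number of times it was used in scoring;
the names usage_histogram, format_histogram and use_counts are made up
for illustration. It also assumes half-open bins, so a token used
exactly 500 times lands in the 500-1000 row; the output above does not
say which way the boundaries fall.

```python
from collections import Counter


def usage_histogram(use_counts, bin_width=500):
    """Count tokens per usage bin; each token contributes exactly once.

    use_counts: hypothetical dict of token -> times used in scoring.
    Returns a dict of bin lower bound -> number of tokens, including
    empty bins up to the one holding the most-used token.
    """
    if not use_counts:
        return {}
    # Assumption: bins are half-open, [lo, lo + bin_width).
    bins = Counter((n // bin_width) * bin_width for n in use_counts.values())
    top = max(bins)
    return {lo: bins.get(lo, 0) for lo in range(0, top + bin_width, bin_width)}


def format_histogram(hist, bin_width=500):
    """Render rows in the same layout as the table in this message."""
    rows = []
    for lo, count in hist.items():
        label = f"{lo}-{lo + bin_width}"
        rows.append(f"{label:>11} uses {count}")
    return rows


if __name__ == "__main__":
    # Made-up counts, only to show the shape of the output.
    sample = {"a": 1, "b": 499, "c": 500, "d": 1200}
    for row in format_histogram(usage_histogram(sample)):
        print(row)
```

With a distribution as skewed as this one, fixed-width bins put nearly
everything in the first row, so a smaller bin_width or logarithmic bins
would probably say more about the low end.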

Regards,
Matt



