[Spambayes] Frequency distribution for wordinfo counts?

Mon Feb 23 21:28:16 EST 2004

[Training to exhaustion]
> Seems to work pretty well.  Here's a run I did just now:
> 
>     % python ~/tmp/spambayes/contrib/tte.py -g newham.clean.save -s newspam.clean.save -d tte.db
>     round:  1, msgs:  770, ham misses: 196, spam misses: 244, 67.7s
>     round:  2, msgs:  770, ham misses:  33, spam misses:  55, 49.4s
>     round:  3, msgs:  770, ham misses:   8, spam misses:   5, 33.1s
>     round:  4, msgs:  770, ham misses:   0, spam misses:   0, 28.6s
>     1 untrained spams
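
For anyone following along who has not seen the idea before, the loop
behind output like the above can be sketched roughly as follows.  This is
not the tte.py code: ToyClassifier, its scoring rule, and the cutoff
values are all made up for illustration.  The only point is the shape of
the loop: score every message, train only on the ones that score wrong,
and repeat until a round has no misses.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier: per-class token counts."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        unique = set(tokens)
        s = sum(self.spam[t] for t in unique)
        h = sum(self.ham[t] for t in unique)
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, hams, spams,
                        ham_cutoff=0.2, spam_cutoff=0.8, max_rounds=20):
    """Train only on misclassified messages until a round has no misses.

    Returns a list of (ham_misses, spam_misses) tuples, one per round.
    The cutoffs and max_rounds are assumed values, not SpamBayes defaults.
    """
    rounds = []
    for _ in range(max_rounds):
        ham_misses = spam_misses = 0
        for msg in hams:
            if clf.score(msg) > ham_cutoff:   # ham not scored as ham
                clf.learn(msg, is_spam=False)
                ham_misses += 1
        for msg in spams:
            if clf.score(msg) < spam_cutoff:  # spam not scored as spam
                clf.learn(msg, is_spam=True)
                spam_misses += 1
        rounds.append((ham_misses, spam_misses))
        if ham_misses == 0 and spam_misses == 0:
            break
    return rounds


if __name__ == "__main__":
    hams = [["meeting", "lunch"], ["project", "lunch"]]
    spams = [["viagra", "cheap"], ["cheap", "pills"]]
    for n, (hm, sm) in enumerate(
            train_to_exhaustion(ToyClassifier(), hams, spams), start=1):
        print(f"round: {n}, ham misses: {hm}, spam misses: {sm}")
```

Messages that never score wrong are never trained on in this sketch,
which is one way to see why the resulting database can end up smaller.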

How did these 770 messages get selected?  Is this a batch of recently
arrived mail, or some sort of pre-selected training collection?  Did tte.db
exist before this?

> Adding up the last column indicates a total run time of about 
> three minutes. I can live with that.
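
The three-minute figure can be checked straight from the quoted round
lines.  A small sketch, with the regular expression being my own choice
for pulling the trailing timings out of that output:

```python
import re

# Round lines copied from the quoted run above.
log = """\
round:  1, msgs:  770, ham misses: 196, spam misses: 244, 67.7s
round:  2, msgs:  770, ham misses:  33, spam misses:  55, 49.4s
round:  3, msgs:  770, ham misses:   8, spam misses:   5, 33.1s
round:  4, msgs:  770, ham misses:   0, spam misses:   0, 28.6s
"""

# Grab the number that precedes the trailing "s" on each line.
times = [float(t) for t in re.findall(r"([\d.]+)s$", log, flags=re.MULTILINE)]
total = sum(times)
print(f"{total:.1f}s total, about {total / 60:.2f} minutes")
```

That comes to 178.8 seconds, so just under three minutes.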

How often do you tend to run this?

[...]
> The database thus winds up smaller than it would be with a 
> more usual training approach.

Slightly larger than with mistake-based training, though (541 instead of
440), but presumably more accurate as well.

=Tony Meyer