[Spambayes] Mining the headers
Tim Peters
tim.one@comcast.net
Mon Oct 28 01:20:32 2002
> Tim> Skip, I think there's a bug in the extract_dow code.
[Skip]
> Thanks for catching it.
You're welcome <wink>.
> for: ... else: isn't a construct I use often, so it's not entirely
> surprising that I muffed it.
It was just one "break" away from perfection -- a loop needs an early exit
else "else:" is a bug (hmm -- I wonder whether PyChecker knows that rule!).
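To make that rule concrete, here is a minimal sketch of a for/else search with and without the "break". The names DAYS, dow_token and dow_token_buggy are hypothetical and are not the Spambayes extract_dow code; they only illustrate why an "else:" on a loop with no early exit always runs.

```python
# Hypothetical illustration only; this is not the Spambayes extract_dow code.
DAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")

def dow_token(date_header):
    """Return 'dow:N' for the first day name found, else 'dow:invalid'."""
    text = date_header.lower()
    for i, day in enumerate(DAYS):
        if day in text:
            token = "dow:%d" % i
            break  # early exit: the else clause below is skipped
    else:
        # Runs only when the loop ran to completion without a break.
        token = "dow:invalid"
    return token

def dow_token_buggy(date_header):
    """Same search with the break missing, so the else clause always runs."""
    text = date_header.lower()
    token = "dow:invalid"
    for i, day in enumerate(DAYS):
        if day in text:
            token = "dow:%d" % i
            # no break here, so the loop always runs to completion
    else:
        token = "dow:invalid"  # overwrites any match found above
    return token
```

With a header such as "Mon, 28 Oct 2002 01:20:32", the first version yields 'dow:0', while the second yields 'dow:invalid' for every input.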
> How did you generate the table of tokens in your note?
>
> Tim>                  #ham  #spam  spamprob
> Tim>   'dow:0'           2      7  0.890542594688
> Tim>   'dow:1'           3      7  0.854937008074
> Tim>   'dow:2'         725     71  0.220827483069
> Tim>   'dow:3'        1038    261  0.420993872704
> Tim>   'dow:4'         845    234  0.444677806501
> Tim>   'dow:5'         126    196  0.81766035841
> Tim>   'dow:6'           0    137  0.998363041106
> Tim>   'dow:invalid'  2741    946  0.499472081328
>
> The only tokens I've ever seen are in the summaries.
I do that mostly by hand. Here's a little Python program I didn't bother to
check in:
"""
import cPickle as pickle
#f = file('outlook2000/default_bayes_database.pck', 'rb')
#f = file('fat.pik', 'rb')
f = file('class1.pik', 'rb')
c = pickle.load(f)
f.close()
w = c.wordinfo
def root(prefix):
    for k, r in w.iteritems():
        if k.startswith(prefix):
            print `k`, r.hamcount, r.spamcount, r.spamprob
"""
Run that via, e.g.,
    python -i pik.py
It then loads the trained classifier pickle of your choice into 'c', its
wordinfo dict into 'w', and leaves you in an interactive session where you
can play around. The utility root() function prints
    token hamcount spamcount spamprob
for every token beginning with a given string. So, in this case, I did
    root('dow:')
and pasted a screen scrape into the email.
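For anyone who wants to try the same prefix lookup without a trained pickle at hand, here is a rough sketch in present-day Python. It assumes only what the script above reads: a dict mapping token strings to records with hamcount, spamcount and spamprob attributes. The WordInfo namedtuple is a stand-in for the classifier's real record class, and this root() takes the dict as an argument and returns sorted rows rather than printing, which differs from the original helper.

```python
from collections import namedtuple

# Stand-in record type (an assumption); only the three attribute names
# come from the script in this message.
WordInfo = namedtuple("WordInfo", "hamcount spamcount spamprob")

def root(wordinfo, prefix):
    """Return sorted (token, hamcount, spamcount, spamprob) rows for
    every token that starts with prefix."""
    return sorted(
        (k, r.hamcount, r.spamcount, r.spamprob)
        for k, r in wordinfo.items()
        if k.startswith(prefix)
    )

# Two rows copied from the table quoted earlier, plus one made-up token
# that should not match the 'dow:' prefix.
w = {
    "dow:5": WordInfo(126, 196, 0.81766035841),
    "dow:6": WordInfo(0, 137, 0.998363041106),
    "made-up-token": WordInfo(1, 1, 0.5),  # illustrative values only
}

for row in root(w, "dow:"):
    print("%r %d %d %s" % row)
```

Returning rows instead of printing makes the helper easy to sort or filter further in the interactive session.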
Note that the option
    [TestDriver]
    save_trained_pickles: True
will leave behind classifier pickles for each classifier trained during a
test run. So there's not much to it! Spend a few minutes studying the
classes in Classifier: their instance data members are very simple (esp.
since we got rid of a ton of combining schemes), and the whole thing will
make a lot more sense to you then. The classifier's data structures are
very easy to rummage around in, and there are very few of them.
perfection-is-reached-when-there's-nothing-left-to-throw-away-ly y'rs - tim