[Spambayes] Mining the headers

Tim Peters tim.one@comcast.net
Mon Oct 28 01:20:32 2002


>     Tim> Skip, I think there's a bug in the extract_dow code.

[Skip]
> Thanks for catching it.

You're welcome <wink>.

>  for: ... else: isn't a construct I use often, so it's not entirely
> surprising that I muffed it.

It was just one "break" away from perfection -- a loop needs an early exit
or its "else:" clause is a bug (hmm -- I wonder whether PyChecker knows that rule!).
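
To make the rule concrete, here is a minimal sketch in present-day Python. It is not the extract_dow code; the function names are made up for illustration. The "else:" clause of a for loop runs only when the loop finishes without hitting "break", so a loop with no "break" runs its "else:" every time.

```python
def first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for v in values:
        if v < 0:
            found = v
            break              # early exit: the else clause is skipped
    else:
        # Reached only when the loop ran to completion without "break".
        found = None
    return found


def buggy_first_negative(values):
    """Same idea with the "break" forgotten (hypothetical broken variant)."""
    found = None
    for v in values:
        if v < 0:
            found = v          # no "break" here ...
    else:
        found = None           # ... so this always runs and clobbers the result
    return found


print(first_negative([3, -1, 4]))        # -1
print(first_negative([3, 1, 4]))         # None
print(buggy_first_negative([3, -1, 4]))  # None, although -1 is in the list
```

In the buggy variant the "else:" suite is effectively unconditional code after the loop, which is why a missing "break" is almost always a mistake there.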

> How did you generate the table of tokens in your note?
>
>     Tim>               #ham  #spam        spamprob
>     Tim> 'dow:0'          2      7  0.890542594688
>     Tim> 'dow:1'          3      7  0.854937008074
>     Tim> 'dow:2'        725     71  0.220827483069
>     Tim> 'dow:3'       1038    261  0.420993872704
>     Tim> 'dow:4'        845    234  0.444677806501
>     Tim> 'dow:5'        126    196  0.81766035841
>     Tim> 'dow:6'          0    137  0.998363041106
>     Tim> 'dow:invalid' 2741    946  0.499472081328
>
> The only tokens I've ever seen are in the summaries.

I do that mostly by hand.  Here's a little Python program I didn't bother to
check in:

"""
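# Python 2 script: load a trained classifier pickle for interactive poking.
import cPickle as pickle

# Pick whichever pickle you want to inspect.
#f = file('outlook2000/default_bayes_database.pck', 'rb')
#f = file('fat.pik', 'rb')
f = file('class1.pik', 'rb')

c = pickle.load(f)   # the trained classifier
f.close()
w = c.wordinfo       # the classifier's wordinfo dict, keyed by token

def root(prefix):
    # Print token, hamcount, spamcount, spamprob for each matching token.
    for k, r in w.iteritems():
        if k.startswith(prefix):
            print `k`, r.hamcount, r.spamcount, r.spamprob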
"""

Run that via, e.g.,

    python -i pik.py

It then loads the trained classifier pickle of your choice into 'c', its
wordinfo dict into 'w', and leaves you in an interactive session where you
can play around.  The utility root() function prints

    token  hamcount  spamcount  spamprob

for every token beginning with a given string.  So, in this case, I did

    root('dow:')

and pasted a screen scrape into the email.
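
For anyone who wants to try the same idea without a trained pickle on hand, here is a rough Python 3 sketch. It assumes only what the script above shows: a dict keyed by token whose values have hamcount, spamcount, and spamprob attributes. The WordInfo namedtuple and the root_rows() helper are hypothetical stand-ins, not part of the spambayes code, and the last sample token is made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the classifier's per-token record.
WordInfo = namedtuple("WordInfo", "hamcount spamcount spamprob")


def root_rows(wordinfo, prefix):
    """Return (token, hamcount, spamcount, spamprob) rows, sorted by token,
    for every token that starts with prefix."""
    return [(tok, r.hamcount, r.spamcount, r.spamprob)
            for tok, r in sorted(wordinfo.items())
            if tok.startswith(prefix)]


def root(wordinfo, prefix):
    """Print one aligned line per matching token."""
    for tok, ham, spam, prob in root_rows(wordinfo, prefix):
        print(f"{tok!r:<14} {ham:>6} {spam:>6}  {prob}")


# Two rows copied from the table quoted earlier, plus one made-up token.
w = {
    "dow:0": WordInfo(2, 7, 0.890542594688),
    "dow:6": WordInfo(0, 137, 0.998363041106),
    "subject:hello": WordInfo(1, 1, 0.5),   # made up for illustration
}

root(w, "dow:")
```

Unlike the original, this version takes the dict as an argument and sorts the tokens, so the output order is stable from run to run.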

Note that the option

    [TestDriver]
    save_trained_pickles: True

will leave behind classifier pickles for each classifier trained during a
test run.  So there's not much to it!  Spend a few minutes studying the
classes in Classifier:  their instance data members are very simple (esp.
since we got rid of a ton of combining schemes), and the whole thing will
make a lot more sense to you then.  The classifier's data structures are
very easy to rummage around in, and there are very few of them.

perfection-is-reached-when-there's-nothing-left-to-throw-away-ly y'rs  - tim