[Spambayes] progress on POP+VM+ZODB deployment

Jeremy Hylton jeremy@alum.mit.edu
Fri Oct 25 23:27:42 2002


>>>>> "DS" == Derek Simkowiak <dereks@itsite.com> writes:

  DS> 	Where did you get your initial training corpses... carpals...
  DS> um, collections of email?  Just personal stuff lying around?

I started with a few messages from my existing VM folders.  I've also
got two training folders that I just created.  I'm adding any message
that wasn't classified correctly to the matching training folder.  For example,
if a ham comes in and its score isn't < 0.10, I'm training on it.
Same for spam, but the min score is 0.95.  I've got some new key
bindings that automatically save messages in the appropriate folder.
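
The train-on-error rule described above could be sketched roughly like
this.  It is only an illustration, not the actual VM key bindings or
any Spambayes code: the function names and the (msg_id, score, is_spam)
tuples are hypothetical, and only the 0.10 and 0.95 cutoffs come from
the description.

```python
# Cutoffs taken from the text: ham should score below 0.10,
# spam should score at least 0.95.
HAM_CUTOFF = 0.10
SPAM_CUTOFF = 0.95


def needs_training(score, is_spam):
    """Return True if a hand-labelled message scored outside its expected range."""
    if is_spam:
        return score < SPAM_CUTOFF
    return score >= HAM_CUTOFF


def select_for_training(scored_messages):
    """Split badly scored messages into ham and spam training lists.

    scored_messages is an iterable of (msg_id, score, is_spam) tuples,
    where is_spam is the human judgment, not the classifier's.
    """
    ham_folder, spam_folder = [], []
    for msg_id, score, is_spam in scored_messages:
        if needs_training(score, is_spam):
            (spam_folder if is_spam else ham_folder).append(msg_id)
    return ham_folder, spam_folder
```

With this rule, a ham that scores 0.30 and a spam that scores 0.80 are
both saved for training, while a ham at 0.02 or a spam at 0.99 is left
alone.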

  DS> I am still after a nice "real world" hammie.db.  (I'll buy a
  DS> pizza for the first person to send me a good .db file, just
  DS> include your address, topping list, and the phone number of
  DS> your favorite local pizza joint in a private email to me.)

I don't think you want someone else's database.  Their ham might be
your spam, or vice versa.  Tim has mentioned a couple of times the
example of Guido's email about hotels.  Guido gets a non-trivial
amount of email about hotels for conferences.  He would have to train
his classifier to recognize messages about hotels as ham, but that
probably makes it more likely he'll get spams advertising discount
hotels.  The details of what exactly your ham looks like are pretty
personal.  The spam is easy to collect, unless you don't get much
spam.  And if you don't get much spam, it's hardly a problem.

  DS> Not having a nice .db to start out with seems like a pretty
  DS> heavy barrier for [potential] new users.  We need to go
  DS> searching through undocumented code just to figure out how to
  DS> play with it.

I agree that there are a lot of problems to be solved before potential
new users can try things out.  I think an initial training database is
a pretty minor problem.  I just spent an entire day getting the POP
proxies hooked up to a training database, and I still have a
bubble-gum-and-baling-wire solution.

Jeremy