[Spambayes] progress on POP+VM+ZODB deployment

T. Alexander Popiel popiel@wolfskeep.com
Fri Oct 25 23:21:38 2002


In message:  <Pine.LNX.4.33L2.0210251448570.5611-100000@dev.itsite.com>
             Derek Simkowiak <dereks@itsite.com> writes:
>
>	Where did you get your initial training corpses... carpals...
>um, collections of email?  Just personal stuff lying around?

I personally get my corpora by adding a procmail entry to save
all my incoming email to a folder that I never touch, before
doing any other filing on it.  Then, as I process my mail, I
move any spam I get into a spam folder.  The spam folder acts
as my spam corpus, and everything in the untouched folder that
is not also in the spam folder acts as my ham corpus.  Do this
for about a month, and you should have some decent-sized
corpora.  (It took me about a month and
a half to get above the 2000 ham and 2000 spam limit that
Tim set for doing algorithm shootouts. :-) )
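
For anyone who wants to automate the "everything minus spam"
step, here is a rough sketch in Python.  It assumes both folders
are mbox files and that messages can be matched by their
Message-ID header; the function names and file paths are made
up for illustration and are not part of the spambayes code.

```python
import mailbox


def message_ids(mbox_path):
    """Return the set of Message-ID header values found in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return {msg["Message-ID"] for msg in box if msg["Message-ID"]}
    finally:
        box.close()


def build_ham_corpus(all_mail_path, spam_path, ham_path):
    """Copy every message in the untouched folder that is not in the
    spam folder into a new ham mbox.  Returns the number copied.

    Assumption: a message counts as spam if its Message-ID appears in
    the spam folder.  Messages with no Message-ID are kept as ham.
    """
    spam_ids = message_ids(spam_path)
    all_mail = mailbox.mbox(all_mail_path, create=False)
    ham = mailbox.mbox(ham_path)  # created if it does not exist
    kept = 0
    try:
        for msg in all_mail:
            if msg["Message-ID"] not in spam_ids:
                ham.add(msg)
                kept += 1
        ham.flush()
    finally:
        ham.close()
        all_mail.close()
    return kept
```

Calling build_ham_corpus("all-mail", "spam", "ham") with your own
paths leaves the two original folders untouched and writes the
difference to a third file.  Matching on Message-ID is only a
heuristic; spam with missing or duplicated IDs would need a
different key, such as a hash of the body.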

>	I am still after a nice "real world" hammie.db.  (I'll buy a pizza
>for the first person to send me a good .db file, just include your
>address, topping list, and the phone number of your favorite local pizza
>joint in a private email to me.)

I think sharing dbs is actually a very _BAD_ idea.  Sure, it
saves some initial effort, but it encourages a tendency to just
take the stock db and never retrain.  One of the things I like
most about this system is how easily and automatically it
customizes itself to your personal mail patterns... which means
that spammers will have a harder time defeating it (since there's
no single widespread db to defeat).

>	Not having a nice .db to start out with seems like a pretty heavy
>barrier for [potential] new users.  We need to go searching through
>undocumented code just to figure out how to play with it.

I agree that the documentation needs to be improved, if this is
to be used by anyone other than researchers.  I don't think that
providing a starter db is the right way to make up for the lack
of documentation. :-)

- Alex