[Spambayes] progress on POP+VM+ZODB deployment
T. Alexander Popiel
popiel@wolfskeep.com
Fri Oct 25 23:21:38 2002
In message: <Pine.LNX.4.33L2.0210251448570.5611-100000@dev.itsite.com>
Derek Simkowiak <dereks@itsite.com> writes:
>
> Where did you get your initial training corpses... carpals...
>um, collections of email? Just personal stuff lying around?
I personally get my corpora by adding a procmail entry to save
all my incoming email to a folder that I never touch, before
doing any other filing on it. Then, as I process my mail, I
move any spam I get into a spam folder. The spam folder acts
as my spam corpus, and everything in the untouched folder
minus the spam acts as my ham corpus. Do this for about a
month, and you should have some decent-sized corpora. (It took
me about a month and a half to get above the 2000 ham and 2000
spam minimum that Tim set for doing algorithm shootouts. :-) )
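
The "everything minus spam" step can be sketched roughly as follows. This is only an illustration, assuming both folders are mbox files and that messages can be matched by their Message-ID header. The function names (message_key, derive_ham, derive_ham_from_mbox) are made up for this example and are not part of spambayes.

```python
import mailbox
from email.message import Message
from typing import Iterable, List


def message_key(msg: Message) -> str:
    """Return the Message-ID header with surrounding whitespace removed."""
    return str(msg.get("Message-ID", "")).strip()


def derive_ham(all_mail: Iterable[Message], spam: Iterable[Message]) -> List[Message]:
    """Return the messages in all_mail whose Message-ID is not in spam."""
    spam_ids = {message_key(m) for m in spam}
    # Assumption: a message with no Message-ID cannot be matched against
    # the spam folder, so it is kept as ham and should be reviewed by hand.
    spam_ids.discard("")
    return [m for m in all_mail if message_key(m) not in spam_ids]


def derive_ham_from_mbox(all_path: str, spam_path: str) -> List[Message]:
    """Same as derive_ham, but reading two existing mbox files."""
    # create=False makes a mistyped path raise instead of creating a file.
    all_box = mailbox.mbox(all_path, create=False)
    spam_box = mailbox.mbox(spam_path, create=False)
    try:
        return derive_ham(all_box, spam_box)
    finally:
        all_box.close()
        spam_box.close()
```

Matching on Message-ID is a simplification: spam sometimes carries missing or duplicated IDs, so a hash of the full message text may be a sturdier key.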
> I am still after a nice "real world" hammie.db. (I'll buy a pizza
>for the first person to send me a good .db file, just include your
>address, topping list, and the phone number of your favorite local pizza
>joint in a private email to me.)
I think sharing dbs is actually a very _BAD_ idea. Sure, it
saves some initial effort, but it encourages a tendency to just
take the stock db and never retrain. One of the things I like
most about this system is how easily and automatically it
customizes itself to your personal mail patterns... which means
that spammers will have a harder time defeating it (since there's
no single widespread db to defeat).
> Not having a nice .db to start out with seems like a pretty heavy
>barrier for [potential] new users. We need to go searching through
>undocumented code just to figure out how to play with it.
I agree that the documentation needs to be improved, if this is
to be used by anyone other than researchers. I don't think that
providing a starter db is the right way to make up for the lack
of documentation. :-)
- Alex