[Spambayes] progress on POP+VM+ZODB deployment

Derek Simkowiak dereks@itsite.com
Sat Oct 26 00:06:26 2002


> I don't think you want someone else's database.  Their ham might be
> your spam, or vice versa.

	A couple of people have mentioned this, and while I see the point,
I disagree.  Let me explain why.

	The differences between one person's ham and another individual's
spam (such as the hotel conference-info example) are far less significant
than the difference between one person's ham and everyone's spam.  That
is, the strongest indicators like "color=#FF0000" and porn-type swearwords
are not likely to appear in anyone's ham.  At least, not nearly as
frequently as they appear in most of the spams that are out there.

	I take it for granted that a general starter.db file will not be
very accurate for my particular needs.  But I should be able to set a
fairly high cutoff value and get 80% to 90% of real-world spams correctly
flagged right out of the gate -- that's head and shoulders above having
nothing at all, when trying to learn how this stuff works.
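
	To make the cutoff idea concrete, here is a rough sketch of what I
have in mind.  It is not the Spambayes code: the token probabilities are
made up, the names (STARTER_DB, spam_score, is_spam) are hypothetical, and
the combining rule is the simple Graham-style naive Bayes product, which
may differ from what the project actually uses.

```python
from math import prod

# Hypothetical per-token spam probabilities, standing in for a shared
# starter database. These numbers are invented for illustration only.
STARTER_DB = {
    "color=#ff0000": 0.99,
    "viagra": 0.98,
    "conference": 0.40,
    "hotel": 0.45,
    "meeting": 0.10,
}

# Tokens the starter database has never seen are treated as neutral.
UNKNOWN = 0.5


def spam_score(tokens, db=STARTER_DB):
    """Combine per-token probabilities with a Graham-style product rule."""
    probs = [db.get(token.lower(), UNKNOWN) for token in tokens]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def is_spam(tokens, cutoff=0.95):
    """Flag a message only when its score reaches a fairly high cutoff."""
    return spam_score(tokens) >= cutoff


# Strong, widely shared indicators push the score over the cutoff.
print(is_spam(["color=#FF0000", "viagra"]))
# Ambiguous tokens (one person's ham, another's spam) stay under it.
print(is_spam(["hotel", "conference"]))
```

	The point of the high cutoff is that only the strong, widely shared
indicators can push a message over it, so borrowed data that is wrong
about my hotel mail does little harm.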

	But most importantly, training a starter.db for my specialized
needs is far easier as "step two" than creating a .db from scratch is as
"step one".  And that is why I'm asking for a .db file.


> I just spent an entire day getting the POP proxies hooked up to a
> training database, and I still have a bubble-gum-and-bailing-wire
> solution.

	I just used the Postfix-with-SpamAssassin instructions and
replaced SpamAssassin with hammie.py in filter mode.  For my needs,
finding a nice "real world" starter corpus is what's holding me back.
I'm not looking for a "documentation substitute".  I'm just looking for
something that will (a) tell me if I've installed the software correctly,
and (b) correctly identify more than 80% of the spams that I feed it.
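
	Checking (b) only takes a pile of known spam and a count of how
many messages get flagged.  A minimal sketch, assuming a hypothetical
classify() callable that returns True for spam (a stand-in, not the
hammie.py interface):

```python
def catch_rate(spam_messages, classify):
    """Return the fraction of known-spam messages that classify() flags."""
    if not spam_messages:
        raise ValueError("need at least one known-spam message")
    caught = sum(1 for message in spam_messages if classify(message))
    return caught / len(spam_messages)


def toy_classify(message):
    """Stand-in classifier for illustration: flags one obvious indicator."""
    return "color=#ff0000" in message.lower()


known_spam = [
    "<font color=#FF0000>BUY NOW</font>",
    "<font COLOR=#ff0000>free offer</font>",
    "color=#FF0000 act today",
    "<b color=#FF0000>limited time</b>",
    "plain text pitch with no markup",
]

rate = catch_rate(known_spam, toy_classify)
print("caught %.0f%% of known spam" % (rate * 100))
print("meets the 80% bar:", rate >= 0.80)
```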

	So again, with full recognition that whatever somebody else has
won't be tailored to my email lifestyle, I ask for the .db -- just to save
me a few hours of ramp-up time.  Once I've had a chance to dink around,
and try out the software, I will know if I want to take the time necessary
to collect, organize, and manually filter a highly-customized training
corpus for my personalized needs.

	The pizza offer still stands :)


Thanks,
Derek Simkowiak