[Spambayes] progress on POP+VM+ZODB deployment
Derek Simkowiak
dereks@itsite.com
Sat Oct 26 00:06:26 2002
> I don't think you want someone else's database. Their ham might be
> your spam, or vice versa.
A couple of people have mentioned this, and while I see the point,
I disagree. Let me explain why.
The differences between one person's ham and another individual's
spam (such as the hotel conference-info example) are far less significant
than the difference between one person's ham and everyone's spam. That
is, the strongest indicators, like "color=#FF0000" and porn-type swearwords,
are not likely to appear in anyone's ham. At least, not nearly as
frequently as they are found in most of the spams that are out there.
I take it for granted that a general starter.db file will not be
very accurate for my particular needs. But I should be able to set a
fairly high cutoff value and get 80% to 90% of real-world spams correctly
flagged right out of the gate -- that's head and shoulders above having
nothing at all, when trying to learn how this stuff works.
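To make the cutoff idea concrete, here is a toy sketch. It is not the Spambayes scoring algorithm: the STARTER_DB table, its probabilities, and the simple averaging in score() are all made up for illustration. It only shows why a high cutoff on a generic database can still catch the blatant spams while leaving borderline mail (like the hotel conference example) alone.

```python
# Toy illustration only: NOT the Spambayes scoring algorithm.
# STARTER_DB and its probabilities are invented stand-ins for a
# generic, shared starter database.
STARTER_DB = {
    "color=#ff0000": 0.99,  # strong spam indicator for nearly everyone
    "viagra": 0.98,
    "free": 0.80,
    "hotel": 0.45,          # ambiguous: one person's ham, another's spam
    "conference": 0.40,
    "meeting": 0.10,
}
UNKNOWN = 0.5  # neutral score for tokens the database has never seen


def score(tokens, db=STARTER_DB):
    """Average the per-token spam probabilities (a deliberate simplification)."""
    probs = [db.get(token.lower(), UNKNOWN) for token in tokens]
    if not probs:
        return UNKNOWN
    return sum(probs) / len(probs)


def is_spam(tokens, cutoff=0.9):
    """Flag a message only when its score reaches the (high) cutoff."""
    return score(tokens) >= cutoff
```

With these invented numbers, a message containing "color=#FF0000" and "viagra" is flagged, while "hotel conference free" scores near the middle and passes through as unsure or ham.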
But most importantly, training a starter.db for my specialized
needs is far easier as "step two" than creating a .db from scratch is as
"step one". And that is why I'm asking for a .db file.
> I just spent an entire day getting the POP proxies hooked up to a
> training database, and I still have a bubble-gum-and-baling-wire
> solution.
I just used the Postfix-with-SpamAssassin instructions and
replaced SpamAssassin with hammie.py in filter mode. For my needs,
finding a nice "real world" starter corpus is what's holding me back.
I'm not looking for a "documentation substitute". I'm just looking for
something that will (a) tell me if I've installed the software correctly,
and (b) correctly identify more than 80% of the spams that I feed it.
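Criterion (b) can be checked with a few lines of code. This is a minimal sketch, assuming some classify(message) callable that returns True for spam; catch_rate and classify are hypothetical names for illustration, not part of hammie.py or any Spambayes API.

```python
def catch_rate(classify, spam_messages):
    """Return the fraction of known-spam messages that classify() flags.

    classify is any callable taking one message and returning True for
    spam; it is a hypothetical stand-in for whatever filter is installed.
    """
    if not spam_messages:
        raise ValueError("need at least one spam message to measure")
    caught = sum(1 for message in spam_messages if classify(message))
    return caught / len(spam_messages)
```

Feeding it a folder of hand-verified spams and comparing the result against 0.8 would tell me whether the starter database clears the bar.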
So again, with full recognition that whatever somebody else has
won't be tailored to my email lifestyle, I ask for the .db -- just to save
me a few hours of ramp-up time. Once I've had a chance to dink around,
and try out the software, I will know if I want to take the time necessary
to collect, organize, and manually filter a highly-customized training
corpus for my personalized needs.
The pizza offer still stands :)
Thanks,
Derek Simkowiak