[Spambayes] An alternate use

T. Alexander Popiel popiel@wolfskeep.com
Sat Nov 2 06:29:39 2002


In message:  <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>>
>> 1. Based on recent reports, spambayes works better when given full
>>    data about everything that comes through, not just the mistakes.
>>    This is predicted by the theory, too.
>
>I'd say "representative data" more than "full data".  A random slice of real
>life, consistently applied, should be enough.

Granted.

>> 4. We want a large penetration into the mail-reading populace,
>>    to better force the spammers to change tactics.
>
>Heh.  It's still an irony of this project that I've never particularly
>minded getting 100 spam per day <wink>.

Whereas my disgust with getting 70 spam per day (out of about 100
messages total) is one of the major things that prompted me to
actually try Graham's algorithm. ;-)

>> So, what I propose is that we specifically target mailing list
>> managers (mailman and ecartis being the two obvious first
>> targets) for spambayes integration.  I see two main modes for
>> this: just adding headers for the less intrusive, and actually
>> rejecting or forcing moderation for the heavily policed.
>
>That's actually what started this project:  Barry Warsaw is GNU Mailman's
>author, and he asked me to look into adapting Graham's scheme for
>incorporation into Mailman.  Barry has been pretty much missing in action
>here since then, but I expect him to take it up again.

Heh.  Glad to hear I'm not the only one thinking like this.
I don't claim to have new ideas... recycled ideas are easier. ;-)

>> Training is easily accomplished by taking the list archives
>> as a ham corpus and one of the spam collections floating
>> around as a spam corpus.
>
>That's exactly what I did, and it was anything but easy.  Mixed-source
>corpora create a world of problems, and Mailman archives in particular save
>*all* the Mailman distortions introduced into the headers.

Blech.  You're right... I just forgot about the troubles you had.
Ecartis taints its archives in much the same way.

>> In the case of adding headers, we'll want to avoid collisions
>> with personal use of spambayes, too.  I suggest tagging the
>> X-Spambayes-Disposition header (or whatever we call it) with
>> some identifier for which classifier generated the rating,
>> so that multiple X-Spambayes-Disposition lines are distinguishable.
>> Something like:
>>
>>   X-Spambayes-Disposition: Spam by spambayes@python.org
>>   X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
>>
>> Personal classifiers could leave off the 'by' section.
>>
>> Heck, make it so that X-Spambayes-Disposition lines are turned
>> into words similar to the mailer lines, and then personal
>> classifiers can use the judgements of list classifiers as clues.
>
>Easy to spoof, and I'm sure spammers would pick up on that quickly.

Yes, it would be easy to spoof, unless compared with routing
information... but doing that sort of comparison is beyond
the sorting rule capabilities of something like Outlook (and
Outlook is sadly one of the best GUI tools in that arena).
I'm not even sure procmail is up to the task without help
from a custom program.

On the other hand, we could build the smarts for it into
spambayes itself, for use in the personal classifier figuring
out when to trust the apparent list classifier... perhaps
I'll look into routing analysis for my next algorithmic
experiment.
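
As a rough sketch of the "headers as words" idea: nothing like this
exists in spambayes today, and the function name, the token format,
and the naive "does the classifier's domain appear in a Received
line" check are all assumptions made for illustration.

```python
import re
from email import message_from_string

HEADER = "X-Spambayes-Disposition"

# "<verdict>" optionally followed by "by <identifier>", as in the
# examples quoted above.
_PATTERN = re.compile(r"^\s*(\w+)(?:\s+by\s+(\S+))?\s*$", re.IGNORECASE)


def disposition_tokens(msg):
    """Return one hypothetical classifier token per disposition header.

    A header carrying a 'by' identifier is marked 'routed' only if the
    identifier's domain shows up somewhere in the Received headers;
    otherwise it is marked 'unrouted'.  Headers without a 'by' part
    are treated as coming from a personal classifier.
    """
    received = " ".join(msg.get_all("Received", [])).lower()
    tokens = []
    for value in msg.get_all(HEADER, []):
        match = _PATTERN.match(value)
        if match is None:
            continue  # not in the expected shape; ignore it
        verdict = match.group(1).lower()
        ident = match.group(2)
        if ident is None:
            tokens.append("disposition:%s:personal" % verdict)
            continue
        ident = ident.lower()
        domain = ident.rpartition("@")[2]
        status = "routed" if domain in received else "unrouted"
        tokens.append("disposition:%s:%s:%s" % (verdict, ident, status))
    return tokens
```

The routed/unrouted split lets the classifier learn separate
probabilities for the two cases instead of hard-coding trust.  Note
that Received lines can be forged too; a real version would probably
have to consider only the lines added by relays the user controls.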

>One idea we kicked around was to add a
>
>    If this looks like spam, click here:  http://yadda.yadda.yorg/abc?=etc
>
>line at the bottom of each mailing-list msg.  An automated system on the
>server would collect and organize votes.  There's no intention that users
>get to vote on what *is* spam; the real point is more devious:  a msg that
>*nobody* claims is spam almost certainly isn't spam, so it's really most
>valuable as a way to identify ham.  That is, if nobody claims msg X is spam
>within a few days, it's almost certainly the case that X is safe to add to
>the ham training.  That seems so certain that it could be automated.  Msgs
>that got "several" spam votes would be brought to the list admin's
>attention, for human judgment about whether to classify them as errors.
>Automating *that* part gets too close to censorship-by-vocal-minority for my
>tastes, so if Barry implemented that part I'd kill him <wink>.

Interesting, as a ham indicator.  Way too corruptible as a spam
indicator, I agree.
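
The vote handling described above could be sketched roughly like
this.  The three-day quiet period, the threshold of three votes, and
all of the names are assumptions picked for illustration, not anything
Mailman or spambayes implements.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(days=3)  # assumed value for "a few days"
REVIEW_THRESHOLD = 3              # assumed value for "several" votes


def triage(posted_at, spam_votes, now):
    """Decide what to do with a list message given its spam votes.

    Returns "review" when enough votes have arrived for the list admin
    to take a look, "ham" when nobody has complained for the whole
    quiet period, and "wait" otherwise.  It never returns "spam":
    votes alone do not classify anything as spam.
    """
    if spam_votes >= REVIEW_THRESHOLD:
        return "review"
    if spam_votes == 0 and now - posted_at >= QUIET_PERIOD:
        return "ham"
    # Too early to tell, or a stray vote or two: leave the message
    # out of training rather than guess.
    return "wait"
```

Only the "ham" branch feeds training automatically; everything
spam-ward goes through a human.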

- Alex