[Tracker-discuss] spam auditor checked in

Wed Jul 25 18:40:18 CEST 2007

    Erik> skip at pobox.com wrote:
    Erik> *) An attribute, 'spambayes_score', is added to the file and msg
    Erik> classes (in schema.py). Guess what this attribute will
    Erik> hold.. :-). A boolean attribute 'spambayes_misclassified' should
    Erik> also be added.
    >> 
    >> When do you know it's been misclassified?  My thought would be that
    >> you have to save all submissions which score as spam for some period
    >> of time, probably with some unique identifier (an incrementing
    >> counter would be sufficient).  That unique identifier has to
    >> propagate to the SpamBayes server.  Later on, if you determine that a
    >> submission was misclassified, you use that unique id to retrieve the
    >> info you saved and pump it into the tracker.
    >> 
    Erik> My idea was to set it to False for all file/msg instances that
    Erik> have been successfully classified, and then add a button that
    Erik> allows ordinary users to tag the file/msg as misclassified, which
    Erik> would allow a coordinator to visit the message and press either a
    Erik> 'mark as spam' or a 'mark as ham' button. The former would set
    Erik> spambayes_score to 1.0 and submit the message for training as
    Erik> spam. The latter would set spambayes_score to 0.0 and submit the
    Erik> message for training as ham. Both would clear the
    Erik> spambayes_misclassified flag (set it to False).

    Erik> Does this sound reasonable to you?

It would work, but then you'd wind up exposing spam to the search engine
spiders for some period of time (maybe days if it's in a lightly visited
corner of the tracker).  That might be all the spammer needs (presuming he's
trying to leverage the tracker to boost search engine ranking).
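
For reference, the flag handling Erik describes could be sketched roughly as
below.  This is plain Python, not Roundup code: the Message class and the
train callback are hypothetical stand-ins, and only the attribute names
spambayes_score and spambayes_misclassified come from the proposal.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Hypothetical stand-in for a tracker file/msg item."""
    content: str
    spambayes_score: float = 0.0
    spambayes_misclassified: bool = False


def flag_misclassified(msg: Message) -> None:
    """An ordinary user tags the item so a coordinator will review it."""
    msg.spambayes_misclassified = True


def mark(msg: Message, is_spam: bool,
         train: Callable[[str, bool], None]) -> None:
    """A coordinator decides: pin the score, train, then clear the flag."""
    msg.spambayes_score = 1.0 if is_spam else 0.0
    train(msg.content, is_spam)  # assumed hook into the SpamBayes server
    msg.spambayes_misclassified = False
```

Note that nothing in this sketch hides the item while it waits for review,
which is the exposure problem described above.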

    >> I would hide all submissions which score as spam, whether anonymous
    >> or known.  Only admins should be able to see spam submissions.
    >> 
    Erik> Yeah, that's probably the best way to do it.  This is quite a lot
    Erik> of work, of course, especially if you're new to roundup. Let me
    Erik> think about this to see if we can come up with something
    Erik> simpler.
    >> 
    >> Yeah, that's pretty much beyond my capability.  I simply don't have
    >> the time to become a Roundup expert.
    >> 
    Erik> Well, I'll see if I can find the time to do some of the
    Erik> work. Depends a bit on the weather.. :-).  I'll be very happy if
    Erik> you can contribute some of your knowledge by inspecting my
    Erik> code and answering my questions.

That I can do.  Just let me know any time you have something you want me to
look at.

    Erik> It's been a while since I did anti-spam stuff. Fiddled a lot with
    Erik> SMTP filters and spamassassin some years ago. This feature wakes
    Erik> up some of the interest I had in the subject.

    Erik> On the matter of training - will spambayes work best if it gets
    Erik> trained on about the same amount of spam messages as ham messages?
    Erik> That is, if we're training it on 5 spam messages, should we make
    Erik> sure we also train it on 5 ham messages?

Generally, yes.  Relatively equal amounts are best, though a 3:1 ratio isn't
that big a deal.  In my experience with this type of usage (I implemented
this for the Mojam and Musi-Cal web servers a couple years ago) it's
extremely accurate.  I never needed more than 15-20 hams or spams total in
my training database.  The synthetic tokens you can generate will be
extremely helpful in discriminating ham from spam.  In the Mojam
application, spammers were hitting our concert submission form.  They were
obviously entering complete garbage for the city/state/country.
Consequently, whether or not I could find the city in my lat/long database
was an exceedingly good indicator of spamminess (or user typos - which was
a nice side benefit).
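
A minimal sketch of the synthetic token idea, assuming a submission is a
dict with "text" and "city" keys and that known_cities is a set of
lowercased names from the lat/long database.  The function names and the
"synthetic:" token spelling are made up for illustration; this is not the
Mojam code and it uses no SpamBayes API.

```python
def synthetic_tokens(submission, known_cities):
    """Yield extra tokens that describe the submission as a whole."""
    city = submission.get("city", "").strip().lower()
    # Assumed token names; any consistent string that cannot collide
    # with an ordinary word would do.
    if city in known_cities:
        yield "synthetic:city-found"
    else:
        yield "synthetic:city-missing"


def tokenize(submission, known_cities):
    """Return the word tokens followed by the synthetic ones."""
    tokens = submission.get("text", "").lower().split()
    tokens.extend(synthetic_tokens(submission, known_cities))
    return tokens
```

The classifier then learns for itself how strongly a token like
"synthetic:city-missing" correlates with spam, the same way it does for any
ordinary word.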

Skip