Spambayes + HTTP proxy server

Sun Feb 2 16:23:36 EST 2003

"Skip Montanaro" <skip at pobox.com> wrote in message
news:mailman.1044210485.12265.python-list at python.org...
>
> Sorry for the too quick post.  In rearranging things I lost the spam
> return.
> Just to be sure it was actually filtering something, I searched for "sex"
> at Google.  It let that page in, allowed the safersex and SEX.ETC pages
> through, but blocked HBO's Sex and the City and janesguide.  Note that
> this is using my current hammie.db file, which has only been trained on
> my ham and spam email collections.  I don't expect it to necessarily do
> a very good job with web pages given no training.
>
> Skip
>
>     import os
>
>     from proxy3_filter import *
>     import proxy3_options
>
>     from spambayes import hammie, Options, mboxutils
>     dbf = os.path.expanduser(
>         Options.options.hammiefilter_persistent_storage_file)
>
>     class SpambayesFilter(BufferAllFilter):
>         hammie = hammie.open(dbf, 1, 'r')
>
>         def filter(self, s):
>             if self.reply.split()[1] == '200':
>                 prob = self.hammie.score(
>                     "%s\r\n%s" % (self.serverheaders, s))
>                 print "|  prob: %.5f" % prob
>                 if prob >= Options.options.spam_cutoff:
>                     print self.serverheaders
>                     print "text:", s[0:40], "...", s[-40:]
>                     return "not authorized"
>             return s
>
>     from proxy3_util import *
>
>     register_filter('*/*', 'text/html', SpambayesFilter)
>

This looks great - I'm giving this a go now.

I think that, as you say, the key now is to train on a corpus of web pages
rather than on spam/ham email. I notice that Spambayes has a proxy server
which can be used for easy training. I'll take a look at this and see if it
can be used to train on web pages too.
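
To make the idea of training on web pages concrete, here is a rough sketch
of what such training might look like. It is NOT the Spambayes tokenizer or
scoring code: PageClassifier, tokenize and the sample pages are all
hypothetical stand-ins, the tag stripping is deliberately crude, and both
classes are assumed equally likely up front. It is written for Python 3,
unlike the filter quoted above.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(html):
    """Crudely strip tags and return the set of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(TOKEN_RE.findall(text.lower()))


class PageClassifier:
    """Toy per-token Bayes scorer (hypothetical, not the Spambayes API)."""

    def __init__(self):
        # Number of pages of each class in which a token appeared.
        self.counts = {True: Counter(), False: Counter()}
        # Number of pages trained for each class.
        self.pages = {True: 0, False: 0}

    def train(self, html, is_blocked):
        self.pages[is_blocked] += 1
        self.counts[is_blocked].update(tokenize(html))

    def score(self, html):
        """Return a value in (0, 1); higher means more like blocked pages."""
        log_odds = 0.0
        for tok in tokenize(html):
            # Add-one smoothing so unseen tokens never give a zero ratio.
            p_bad = (self.counts[True][tok] + 1) / (self.pages[True] + 2)
            p_ok = (self.counts[False][tok] + 1) / (self.pages[False] + 2)
            log_odds += math.log(p_bad / p_ok)
        # Clamp so that exp() cannot overflow on very long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    clf = PageClassifier()
    # Made-up training pages, labelled by hand.
    clf.train("<p>casino poker jackpot</p>", True)
    clf.train("<p>poker bonus casino</p>", True)
    clf.train("<p>python proxy filter</p>", False)
    clf.train("<p>python mailing list</p>", False)
    print("%.3f" % clf.score("<b>casino jackpot</b>"))
    print("%.3f" % clf.score("<i>python proxy</i>"))
```

In the filter above, the equivalent of clf.score() would be compared
against the spam cutoff, and the training pages would come from pages the
user has explicitly marked as wanted or unwanted.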