Spambayes + HTTP proxy server
Paul Paterson
hamonlypaulpaterson at houston.rr.com
Sun Feb 2 16:23:36 EST 2003
"Skip Montanaro" <skip at pobox.com> wrote in message
news:mailman.1044210485.12265.python-list at python.org...
>
> Sorry for the too-quick post. In rearranging things I lost the spam return.
> Just to be sure it was actually filtering something, I searched for "sex" at
> Google. It let that page in, allowed the safersex and SEX.ETC pages
> through, but blocked HBO's Sex and the City and janesguide. Note that this
> is using my current hammie.db file, which has only been trained on my ham
> and spam email collections. I don't expect it to necessarily do a very good
> job with web pages given no training.
>
> Skip
>
> import os
>
> from proxy3_filter import *
> import proxy3_options
>
> from spambayes import hammie, Options, mboxutils
> dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)
>
> class SpambayesFilter(BufferAllFilter):
>     hammie = hammie.open(dbf, 1, 'r')
>
>     def filter(self, s):
>         if self.reply.split()[1] == '200':
>             prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
>             print "| prob: %.5f" % prob
>             if prob >= Options.options.spam_cutoff:
>                 print self.serverheaders
>                 print "text:", s[0:40], "...", s[-40:]
>                 return "not authorized"
>         return s
>
> from proxy3_util import *
>
> register_filter('*/*', 'text/html', SpambayesFilter)
>
This looks great - I'm giving this a go now.
I think that, as you say, the key now is to train on a corpus of web pages
rather than spam/ham. I notice that Spambayes has a proxy server which can
be used for easy training. I'll take a look at this and see if it can be
used to train on web pages too.
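
To make the "train on web pages" idea concrete, here is a minimal sketch. It is not Spambayes and does not use its scoring algorithm; it is a toy naive Bayes classifier built only on the standard library, and the names tokenize, PageClassifier, train, score and CUTOFF are all hypothetical. It only illustrates the shape of the approach: feed it pages labelled "block" or "allow", then compare a new page's score against a cutoff, as the filter above does with spam_cutoff.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")
CUTOFF = 0.9  # assumed threshold, playing the role of spam_cutoff above


def tokenize(html):
    """Crudely strip tags and return the set of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(TOKEN_RE.findall(text.lower()))


class PageClassifier(object):
    """Toy naive Bayes page classifier (hypothetical, not Spambayes)."""

    def __init__(self):
        # For each label, how many training pages contained each token.
        self.counts = {True: Counter(), False: Counter()}
        # Number of training pages seen for each label.
        self.pages = {True: 0, False: 0}

    def train(self, html, is_blocked):
        """Record one labelled page (is_blocked=True means 'block it')."""
        self.pages[is_blocked] += 1
        self.counts[is_blocked].update(tokenize(html))

    def score(self, html):
        """Return an estimate of P(blocked | tokens), between 0 and 1."""
        bad, ok = self.pages[True], self.pages[False]
        log_odds = math.log((bad + 1.0) / (ok + 1.0))
        for tok in tokenize(html):
            # Add-one smoothing so unseen tokens never give a zero.
            p_bad = (self.counts[True][tok] + 1.0) / (bad + 2.0)
            p_ok = (self.counts[False][tok] + 1.0) / (ok + 2.0)
            log_odds += math.log(p_bad / p_ok)
        # Clamp so math.exp cannot overflow on long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = PageClassifier()
clf.train("<p>free casino poker bonus</p>", True)
clf.train("<p>python mailing list archive</p>", False)
should_block = clf.score("<b>casino bonus</b>") >= CUTOFF
```

With only a couple of training pages the scores stay well away from 0 and 1, so a realistic cutoff needs a much larger corpus of both kinds of page.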