Bayesian kids content filtering in Python?
Paul Paterson
paulpaterson at users.sourceforge.net
Fri Aug 29 00:47:28 EDT 2003
"Gregory (Grisha) Trubetskoy" <grisha at ispol.com> wrote in message
news:20030828161409.V40715 at onyx.ispol.com...
>
> I've been snooping around the web for open source kids filtering software.
> Something that runs as an http proxy on my home firewall and blocks
> certain pages based on content.
>
> It occurred to me that this might be an interesting project to be done in
> Python, probably using the same training and scoring mechanism that
> spambayes uses.
>
> Anyway - I wonder if anyone has already tried something like this?
As Rene points out in his response, after some great advice and discussion
from Skip, I gave this a try. It works very well. I added a module to a proxy
server (http://theory.stanford.edu/~amitp/proxy.html) and then 'trained'
Spambayes on top of it by visiting sites that I wanted to allow (news sites)
and then ones I wanted to block (sports sites - just to test!). After a
relatively short training period (20-40 sites/pages) it started to pick up
the characteristics of positive and negative sites. It was then easy to get
it to block the negative sites. Although there were still quite a few false
positives, I imagine that with a wider training set it would have been very
accurate (based on the reported accuracy of Spambayes).
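To make the train-then-score idea concrete, here is a minimal, self-contained
sketch in Python 3. It is a toy naive Bayes scorer, not the Spambayes
algorithm (which tokenizes and combines evidence quite differently); the
names TinyBayes, tokenize and the sample page texts are made up for
illustration only.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(TOKEN_RE.findall(text.lower()))


class TinyBayes:
    """Toy two-class naive Bayes scorer (a stand-in, NOT Spambayes)."""

    def __init__(self):
        # Per-class count of pages containing each token; key = is_blocked.
        self.counts = {True: Counter(), False: Counter()}
        self.pages = {True: 0, False: 0}

    def train(self, text, is_blocked):
        """Record one example page as blocked (True) or allowed (False)."""
        self.pages[is_blocked] += 1
        self.counts[is_blocked].update(tokenize(text))

    def score(self, text):
        """Return an estimate of P(blocked | text); 0.5 until both classes are seen."""
        if not self.pages[True] or not self.pages[False]:
            return 0.5
        log_odds = math.log(self.pages[True] / self.pages[False])
        for tok in tokenize(text):
            # Add-one smoothing keeps unseen tokens from producing zeros.
            p_blocked = (self.counts[True][tok] + 1) / (self.pages[True] + 2)
            p_allowed = (self.counts[False][tok] + 1) / (self.pages[False] + 2)
            log_odds += math.log(p_blocked / p_allowed)
        # Clamp so math.exp cannot overflow on very long pages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative training data: one allowed page, one blocked page.
clf = TinyBayes()
clf.train("election results parliament budget vote", is_blocked=False)
clf.train("football league goal match striker", is_blocked=True)

sports_score = clf.score("late goal wins the football match")
news_score = clf.score("parliament passes budget vote")
```

With only two training pages the scores are crude, but the sports-like text
lands above 0.5 and the news-like text below it, which is the behaviour a
cutoff-based block relies on.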
Unfortunately, I didn't carry the work through much beyond the initial proof
of concept, but I have copied the code I ended up with below. It certainly
seems to work and has application both for parental filtering and other
kinds of content management.
Paul
----
[mod_spambayes.py - place in your proxy folder and amend the proxy.conf
file to point to this module]
print "Importing Spambayes filter"

import os
from proxy3_filter import *
import proxy3_options

#
# Find ham/spam database and suitable folders for archiving
from spambayes import hammie, Options, mboxutils
dbf = os.path.expanduser(
    Options.options.hammiefilter_persistent_storage_file)
#
hamfolder = os.path.join(os.path.split(dbf)[0], ".hampages")
spamfolder = os.path.join(os.path.split(dbf)[0], ".spampages")

import time
def getTempName(ham, spam, ok):
    """Return a temporary file name to archive this file"""
    if ok:
        direc = ham
    else:
        direc = spam
    return os.path.join(direc, "arc%d" % time.time())

print "Using db file: %s\nham folder: %s\nspam folder: %s" % (dbf,
    hamfolder, spamfolder)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')
    am_learning = 0 # set to 1 when learning
    is_ok = 0 # set to 1 if visited page is ok for viewing
    prevent_access = 0 # set to 1 to block access to dubious pages
    archive_files = 0 # set to 1 to archive files for later training

    def filter(self, s):
        # Only look at successful (HTTP 200) replies
        if self.reply.split()[1] == '200':
            # Treat the server headers plus the page body as one message
            msg = "%s\r\n%s" % (self.serverheaders, s)
            if self.am_learning:
                # Train on this page: "spam" means a page to block
                old = self.hammie.score(msg)
                self.hammie.train(msg, not self.is_ok)
                new = self.hammie.score(msg)
                self.hammie.store()
                print "Learned! was=%.5f, now=%.5f" % (old, new)
                if self.archive_files:
                    temp_name = getTempName(hamfolder, spamfolder,
                                            self.is_ok)
                    print "Writing %s" % temp_name
                    # Open before the try so f always exists in finally
                    f = open(temp_name, "w")
                    try:
                        f.write(msg)
                    finally:
                        f.close()
            else:
                # Not learning: score the page and block it if required
                prob = self.hammie.score(msg)
                print "| prob: %.5f" % prob
                if (prob >= Options.options.spam_cutoff and
                        self.prevent_access):
                    print self.serverheaders
                    print "text:", s[0:40], "...", s[-40:]
                    return "not authorized"
        return s

from proxy3_util import *
register_filter('*/*', 'text/html', SpambayesFilter)
print "Spambayes filter installed!"
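
One thing to watch in the archiving step: getTempName builds the file name
from "arc%d" % time.time(), which is truncated to whole seconds, so two
pages archived within the same second would share a name. The module also
assumes the .hampages and .spampages folders already exist. A possible
variation, sketched here in Python 3 and not part of the module above, lets
the standard tempfile module pick a unique name; archive_page is a
hypothetical helper name.

```python
import os
import tempfile


def archive_page(folder, msg):
    """Write msg to a new, uniquely named file in folder; return its path."""
    # Create the archive folder on first use instead of assuming it exists.
    os.makedirs(folder, exist_ok=True)
    # mkstemp creates the file itself, so the name cannot collide.
    fd, path = tempfile.mkstemp(prefix="arc", dir=folder)
    # newline="" keeps the "\r\n" header separator exactly as written.
    with os.fdopen(fd, "w", newline="") as f:
        f.write(msg)
    return path
```

Calling archive_page(hamfolder, msg) or archive_page(spamfolder, msg) would
then take the place of the getTempName/open/write sequence, at the cost of
file names that no longer encode the time of the visit.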