Spambayes + HTTP proxy server

Paul Paterson hamonlypaulpaterson at houston.rr.com
Mon Feb 3 01:35:52 EST 2003


"Skip Montanaro" <skip at pobox.com> wrote in message
news:mailman.1044238025.4542.python-list at python.org...
>
>     Paul> I think that, as you say, the key now is to train on a corpus of
>     Paul> web pages rather than spam/ham. I notice that Spambayes has a
>     Paul> proxy server which can be used for easy training. I'll take a
>     Paul> look at this and see if it can be used to train on web pages too.
>
> Yes, you can use pop3proxy.  You might be able to fudge the proxytee.py
> script with something like:
>
>     httpget some-url | proxytee.py
>
> I doubt there's anything in proxytee which is email-specific.  See what
> happens.
>

As a quick hack I put the teaching code inside the proxy filter and then
surfed for a bit to give it some examples of good pages (news and such) and
"bad" pages (sports!). It was very quickly able to spot the sports pages,
even on new sites, and it was able to pick out sport sections from the news
sections on an individual site.
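
For anyone who wants to play with the idea without the proxy, here is a
rough, self-contained sketch of the kind of train-then-score loop I mean.
It is not the Spambayes code and does not use its API: PageClassifier,
tokenize and the "good"/"bad" labels are names made up for illustration,
and the scoring is plain naive Bayes with equal priors rather than the
combining scheme Spambayes actually uses.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(TOKEN_RE.findall(text.lower()))


class PageClassifier:
    """Tiny two-class naive Bayes over word presence (hypothetical helper)."""

    def __init__(self):
        # Per label: number of trained pages containing each token.
        self.counts = {"good": Counter(), "bad": Counter()}
        # Per label: number of pages trained.
        self.pages = {"good": 0, "bad": 0}

    def train(self, text, label):
        """Record one example page under the label 'good' or 'bad'."""
        self.pages[label] += 1
        self.counts[label].update(tokenize(text))

    def score(self, text):
        """Return a value in (0, 1); closer to 1 means more like 'bad'."""
        log_odds = 0.0  # equal priors assumed, so we start at even odds
        for tok in tokenize(text):
            # Add-one smoothing so an unseen token never gives log(0).
            p_bad = (self.counts["bad"][tok] + 1) / (self.pages["bad"] + 2)
            p_good = (self.counts["good"][tok] + 1) / (self.pages["good"] + 2)
            log_odds += math.log(p_bad) - math.log(p_good)
        # Clamp before exponentiating to avoid overflow on long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up training snippets standing in for pages seen while surfing.
clf = PageClassifier()
clf.train("parliament election budget vote minister", "good")
clf.train("market shares economy bank rates", "good")
clf.train("football match goal striker league", "bad")
clf.train("cricket innings wicket score league", "bad")

print(clf.score("striker scores late goal in league match"))
print(clf.score("minister announces budget vote"))
```

In the proxy, the train() call would sit wherever the filter sees the page
body, and score() would decide whether to flag the page. A real version
would also want to strip the HTML markup before tokenizing.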

I'll try to build a more rigorous test with a larger corpus - it looks
promising so far.






