[spambayes-dev] correlated clues
Toby Dickenson
tdickenson at geminidataloggers.com
Fri Jul 2 15:03:21 EDT 2004
On Thursday 01 July 2004 13:08, Toby Dickenson wrote:
> I have a small database of list-id (etc)
> headers. If that header is present, it inserts a list-id token, and inhibits
> all the tokens from a list-dependent set.
Attached is a proof-of-concept:
1. A patch to tokenizer.py, which uses this secondary database to detect list
posts, suppress the relevant tokens for that list, and insert the list id
token. The secondary database is stored in a directory of small files; you
will need to hack the source to provide your directory name.
2. A tool to generate that secondary database. You will need to hack the
source to give it the same directory name as above (which should probably
start out empty before you run this tool). You will also need to give it a
file containing a list of paths to mailboxes, one path per line. It scans
every mail in each of those mailboxes looking for list posts, and calculates
the intersection of their tokens.
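
For anyone who wants the idea without opening the attachments, here is a
minimal sketch of both steps. It is not the attached code: toy_tokens is a
hypothetical whitespace splitter standing in for the real SpamBayes tokenizer,
the "list-id:" token prefix is made up for illustration, only the List-Id
header is consulted, and the secondary database is a plain in-memory dict
rather than a directory of small files.

```python
import email
from email.message import Message
from typing import Dict, Iterable, Iterator, List, Optional, Set


def list_id_of(msg: Message) -> Optional[str]:
    """Return a normalised List-Id header value, or None if absent."""
    value = msg.get("List-Id")
    return value.strip().lower() if value else None


def toy_tokens(msg: Message) -> List[str]:
    """Hypothetical stand-in tokenizer: lowercased words of subject + body."""
    text = str(msg.get("Subject", "")) + " " + str(msg.get_payload())
    return text.lower().split()


def build_common_tokens(messages: Iterable[Message]) -> Dict[str, Set[str]]:
    """Step 2: per list id, intersect the token sets of all its posts."""
    common: Dict[str, Set[str]] = {}
    for msg in messages:
        lid = list_id_of(msg)
        if lid is None:
            continue  # not a list post
        tokens = set(toy_tokens(msg))
        if lid in common:
            common[lid] &= tokens  # keep only tokens seen in every post
        else:
            common[lid] = tokens
    return common


def tokenize_with_suppression(
    msg: Message, common: Dict[str, Set[str]]
) -> Iterator[str]:
    """Step 1: emit one list-id token and drop that list's common tokens."""
    lid = list_id_of(msg)
    if lid is None or lid not in common:
        # Unknown list or ordinary mail: tokens pass through unchanged.
        yield from toy_tokens(msg)
        return
    yield "list-id:" + lid
    for token in toy_tokens(msg):
        if token not in common[lid]:
            yield token
```

With two posts from the same list, the subject tag and the footer words end up
in the intersection, so a later post from that list yields the single list-id
token plus only the words that vary from message to message.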
--
Toby Dickenson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: commontokens.py
Type: application/x-python
Size: 2732 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/commontokens.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tokenizer.diff
Type: text/x-diff
Size: 2054 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/tokenizer.bin