[spambayes-dev] correlated clues

Toby Dickenson tdickenson at geminidataloggers.com
Fri Jul 2 15:03:21 EDT 2004


On Thursday 01 July 2004 13:08, Toby Dickenson wrote:
> I have a small database of list-id (etc) 
> headers. If that header is present, it inserts a list-id token, and inhibits 
> all the tokens from a list-dependent set.

Attached is a proof-of-concept:

1. A patch to tokenizer.py, which uses this secondary database to detect a 
list post, suppress the relevant tokens for that list, and insert the list-id 
token. The secondary database is stored in a directory of small files; you 
will need to hack the source to provide your directory name.

2. A tool to generate that secondary database. You will need to hack the 
source to give it the same directory name as above (the directory should 
probably be empty before you run this tool). You will also need to give it a 
file containing a list of paths to mailboxes, one path per line. It scans 
every message in each of those mailboxes looking for list posts, and, for each 
list, calculates the intersection of the tokens of its posts.
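
For anyone who would rather not open the attachments, the idea behind both 
pieces can be sketched roughly as below. This is not the attached code: the 
function names, the in-memory dict standing in for the directory of small 
files, and the "list-id:" token prefix are all assumptions made for 
illustration.

```python
def build_database(posts):
    """Build {list_id: set of tokens common to every post seen on that list}.

    `posts` is an iterable of (list_id, tokens) pairs. A plain dict is used
    here as a stand-in for the on-disk secondary database (an assumption).
    """
    db = {}
    for list_id, tokens in posts:
        tokens = set(tokens)
        if list_id in db:
            # Keep only the tokens that every post so far has shared.
            db[list_id] &= tokens
        else:
            db[list_id] = tokens
    return db


def filter_tokens(list_id, tokens, db):
    """Yield one list-id token, then the tokens not common to that list.

    Messages whose list id is not in the database pass through unchanged.
    The "list-id:" prefix is a hypothetical token format.
    """
    common = db.get(list_id)
    if common is None:
        yield from tokens
        return
    yield "list-id:" + list_id
    for tok in tokens:
        if tok not in common:
            yield tok


# Two posts from the same (made-up) list share the tokens "x" and "y".
db = build_database([
    ("a.example", ["x", "y", "z"]),
    ("a.example", ["x", "y", "w"]),
])
filtered = list(filter_tokens("a.example", ["x", "q", "y"], db))
```

The point of the suppression is that tokens present in every post to a list 
are perfectly correlated with each other, so they collapse into the single 
list-id clue instead of each counting as separate evidence.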



-- 
Toby Dickenson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: commontokens.py
Type: application/x-python
Size: 2732 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/commontokens.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tokenizer.diff
Type: text/x-diff
Size: 2054 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/tokenizer.bin

