[spambayes-dev] 1070 spam, 1 false positive

Tue Jun 24 12:52:26 EDT 2003

On 22 June 2003, Tim Peters said:
> > basic_header_tokenize: True
> 
> That's a dangerous one -- although I think you've already figured out why
> the hard way.
> 
> > basic_header_skip: received envelope-to delivered-to delivery-date
> > x-spam-flag x-spam-status content-type list-*
> 
> The problem is that any random header line can yield a misleading clue by
> accident, and there may be no end of adding to this list.

The thing is, every header on that list is there for a very good reason.
But I can see your point: every future addition will *also* have a very
good reason behind it, so the list never stops growing.  Hmmm.  I guess I
should try it without basic_header_tokenize at all and see how it does.
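
To make the trade-off concrete, here is a minimal sketch of what "tokenize
every header except the ones on a skip list" might look like.  It is not
the SpamBayes tokenizer: header_tokens, WORD_RE and the SAMPLE message are
hypothetical names made up for illustration, and treating "list-*" as a
shell-style wildcard is an assumption.  Only the skip list itself is
copied from the option quoted above.

```python
import re
from email import message_from_string
from fnmatch import fnmatchcase

# Skip list copied from the quoted option; "list-*" is assumed to be a
# shell-style wildcard over lowercased header names.
SKIP = ("received envelope-to delivered-to delivery-date "
        "x-spam-flag x-spam-status content-type list-*").split()

# Crude word splitter (an assumption, not the real tokenizer's rule).
WORD_RE = re.compile(r"\w+")


def header_tokens(msg, skip=SKIP):
    """Yield 'name:word' tokens for every header not matched by skip."""
    for name, value in msg.items():
        name = name.lower()
        if any(fnmatchcase(name, pattern) for pattern in skip):
            continue  # header is on the skip list
        for word in WORD_RE.findall(str(value)):
            yield "%s:%s" % (name, word)


# Made-up message, used only to exercise the sketch.
SAMPLE = """\
Received: from mail.example.com by mx.example.org; Tue, 24 Jun 2003 12:52:30 -0400
List-Id: <spambayes-dev.example.org>
Date: Tue, 24 Jun 2003 12:52:26 -0400
Subject: hello

body
"""

tokens = set(header_tokens(message_from_string(SAMPLE)))
```

In this sample the Received: and List-Id: headers are skipped, but every
header that is not on the list, Date: included, contributes tokens.  That
is the open-ended part: anything nobody thought to list gets tokenized.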

> >         'date:2003': 0.663
> >         'date:Jun': 0.681
> 
> Any idea where those came from?  They have the form of synthesized tokens
> (keyword colon stuff), but I don't recall anything in the tokenizer that
> synthesizes tokens with keyword "date".

Beats me.  In my "default" corpus (right now: 418 ham, 583 spam, roughly
half of both from June 2003), these tokens are unsurprisingly quite
common:

>>> h = hammie.open("db/default.db", usedb=True)
>>> h.bayes.db["date:2003"]
(283, 192)
>>> h.bayes.db['date:Jun']
(317, 193)

So *some* bit of code in there is tokenizing the "Date:" header.  Seems
like a good idea to me, since junk mail often has non-RFC-conformant
date headers.
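
As a rough illustration of that idea, the sketch below turns "does the
Date: header parse at all?" into a single synthesized token.  The function
name date_tokens and the token strings "date:parseable" and
"date:unparseable" are hypothetical, not anything SpamBayes generates.
Note too that email.utils.parsedate_tz is quite lenient, so this only
catches grossly malformed dates, not every deviation from the RFC.

```python
from email.utils import parsedate_tz


def date_tokens(date_header):
    """Return one hypothetical token saying whether a Date: value parses."""
    # parsedate_tz returns None when it cannot make sense of the value;
    # a missing or empty header is treated the same way here.
    if not date_header or parsedate_tz(date_header) is None:
        return ["date:unparseable"]
    return ["date:parseable"]
```

For example, date_tokens("Tue, 24 Jun 2003 12:52:26 -0400") gives
["date:parseable"], while date_tokens("yesterday afternoon") gives
["date:unparseable"].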

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
All of science is either physics or stamp collecting.