[spambayes-dev] 1070 spam, 1 false positive
Greg Ward
gward at python.net
Tue Jun 24 12:52:26 EDT 2003
On 22 June 2003, Tim Peters said:
> > basic_header_tokenize: True
>
> That's a dangerous one -- although I think you've already figured out why
> the hard way.
>
> > basic_header_skip: received envelope-to delivered-to delivery-date
> > x-spam-flag x-spam-status content-type list-*
>
> The problem is that any random header line can yield a misleading clue by
> accident, and there may be no end of adding to this list.

The thing is, every header on that list is there for a very good reason.
But I can see your point: every addition *also* has a very good reason
for it. Hmmm. I guess I should try it without basic_header_tokenize at
all and see how it does.
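
To make the skip list above concrete, here is a minimal sketch of how header names might be checked against it. The function name `is_skipped` and the use of shell-style wildcard matching for `list-*` are assumptions made for illustration; this is not taken from the SpamBayes source.

```python
from fnmatch import fnmatchcase

# Skip list copied from the quoted option. Treating each entry as a
# shell-style wildcard pattern is an assumption made for this sketch.
SKIP = ("received envelope-to delivered-to delivery-date "
        "x-spam-flag x-spam-status content-type list-*").split()

def is_skipped(header_name, patterns=SKIP):
    """Return True if header_name matches any skip pattern, ignoring case."""
    name = header_name.lower()
    return any(fnmatchcase(name, pat) for pat in patterns)

for header in ("Received", "List-Unsubscribe", "Date", "X-Mailer"):
    print(header, is_skipped(header))
```

Any header the list does not name, such as Date or X-Mailer here, would still be tokenized, which is the open-ended problem Tim describes.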
> > 'date:2003': 0.663
> > 'date:Jun': 0.681
>
> Any idea where those came from? They have the form of synthesized tokens
> (keyword colon stuff), but I don't recall anything in the tokenizer that
> synthesizes tokens with keyword "date".

Beats me. In my "default" corpus (right now: 418 ham, 583 spam, roughly
half of both from June 2003), these tokens are unsurprisingly quite
common:
>>> h = hammie.open("db/default.db", usedb=True)
>>> h.bayes.db["date:2003"]
(283, 192)
>>> h.bayes.db['date:Jun']
(317, 193)
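
As a rough guide to what counts like these mean, the sketch below turns a pair of counts into a per-token spam probability using a Robinson-style estimate. Several things here are assumptions: that the tuple is (spam count, ham count), that the prior is 0.5 with strength 0.45, and the function name `spam_prob` itself. The result may well differ from the scores quoted earlier if those came from a different training set.

```python
def spam_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Robinson-style spam probability for one token.

    spamcount/hamcount: messages of each kind containing the token.
    nspam/nham: total trained messages of each kind.
    s, x: strength and value of the prior (assumed values).
    """
    n = spamcount + hamcount
    if n == 0:
        return x  # never-seen token: fall back to the prior
    spam_ratio = spamcount / nspam
    ham_ratio = hamcount / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)

# Counts from the session above, ASSUMING (spam, ham) order.
p_2003 = spam_prob(283, 192, nspam=583, nham=418)
p_jun = spam_prob(317, 193, nspam=583, nham=418)
print(round(p_2003, 3), round(p_jun, 3))
```

If the tuple order is actually (ham, spam), swap the first two arguments; the estimate then moves to the other side of 0.5.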
So *some* bit of code in there is tokenizing the "Date:" header. Seems
like a good idea to me, since junk mail often has non-RFC-conformant
date headers.
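
For illustration, here is a minimal sketch of how a generic header tokenizer could emit tokens shaped like 'date:2003' and 'date:Jun'. The function `header_tokens`, the word-splitting regex, and the sample Date value are all hypothetical; this only shows the token shape and is not a claim about which code in the tokenizer produced them.

```python
import re

def header_tokens(name, value):
    """Yield 'name:word' tokens, one per alphanumeric run in value."""
    prefix = name.lower()
    for word in re.findall(r"[A-Za-z0-9]+", value):
        yield "%s:%s" % (prefix, word)

# Sample header value, made up for this example.
tokens = list(header_tokens("Date", "Tue, 24 Jun 2003 12:52:26 -0400"))
print(tokens)
```

A malformed Date header would yield different words under the same scheme, which is how such tokens could pick up non-conformant dates as clues.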
Greg
--
Greg Ward <gward at python.net> http://www.gerg.ca/
All of science is either physics or stamp collecting.