[Spambayes] A couple of small tokenizer experiments.

Anthony Baxter anthony@interlink.com.au
Tue Nov 12 01:36:28 2002


>>> Tim Peters 
> Can you try this again replacing "break" with "continue"?  I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period.  Spam could easily
> exploit that.

Woopsie. I knew that :)
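
For concreteness, here is a minimal sketch of the difference. The URL regexp, the is_mailman_url() test and the "url:" token prefix are all made up for illustration; none of this is the actual tokenizer code.

```python
import re

# Hypothetical, simplified URL pattern; the real tokenizer's regexp differs.
URL_RE = re.compile(r"https?://\S+")


def is_mailman_url(url):
    # Assumption: anything with /mailman/ in it is list boilerplate.
    return "/mailman/" in url


def url_tokens(text, skip_with_break=False):
    tokens = []
    for url in URL_RE.findall(text):
        if is_mailman_url(url):
            if skip_with_break:
                break      # the bug: gives up on *every* later URL
            continue       # the fix: skip only this URL, keep scanning
        tokens.append("url:" + url)
    return tokens


body = ("http://mail.example.org/mailman/listinfo/foo "
        "http://spam.example.com/buy-now")
```

With "break", a spammer only needs to put one Mailman-looking URL first and the spammy one after it never gets tokenized.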


> >> ham:spam:  11192:1826
> >>                   11192:1826
> 
> You realize you've got a very high ratio of ham to spam, right?

*nod* It's my full personal test corpus. There's another 600 spam 
that haven't been dropped in. I'm re-running tests at the moment
with smaller amounts.

> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora.  It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine.  I would rather see
> the code added to the loop currently cracking "from" lines:

I've done this now, and am testing it before checking it in.
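
Roughly what folding To: and Cc: into the same loop as From: could look like. This is a sketch only: the function name and the "field:name:" / "field:addr:" token formats are assumptions for illustration, not the code being tested.

```python
from email import message_from_string
from email.utils import getaddresses


def address_header_tokens(msg, fields=("from", "to", "cc")):
    """Yield name and address tokens for each address header in fields."""
    for field in fields:
        # get_all() returns every occurrence of the header, or [] if absent.
        for name, addr in getaddresses(msg.get_all(field, [])):
            if name:
                for word in name.lower().split():
                    yield "%s:name:%s" % (field, word)
            if addr:
                yield "%s:addr:%s" % (field, addr.lower())


msg = message_from_string(
    "From: Alice <a@example.com>\n"
    "To: bob@example.org\n"
    "Cc: Carol C <c@example.net>\n"
    "\n"
    "body\n"
)
tokens = list(address_header_tokens(msg))
```

Passing fields=("from",) would give the current behaviour, so To:/Cc: could sit behind an option as suggested.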

> Why is this tokenizing only "the first" piece of the Subject line?

Thinko.

> I changed this to loop over all the Subject parts, and saw some minor good
> effects on marginal msgs, so I'll check this one in without further ado.  It
> wasn't much of a win for you either, but it's cheap so why not.  In my
> personal email "subjectcharset:unknown" shows up a lot for some reason (but
> only in spam).

Hm. Dunno about that - Barry might know under what circumstances the
email package gives 'unknown' as a charset. I can't see how that
could happen.
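
A sketch of looping over all the Subject parts, written against the current email.header.decode_header(). The function name and the "subject:" word tokens are assumptions for illustration. As far as I can tell decode_header() simply reports whatever charset name the encoded word carries, so a header that literally says =?unknown?q?...?= would produce this token; whether that is what is in those spams is a guess.

```python
from email.header import decode_header


def subject_tokens(subject):
    """Yield charset and word tokens for every part of a Subject header."""
    for part, charset in decode_header(subject):
        if charset is not None:
            yield "subjectcharset:" + charset.lower()
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or "ascii", "replace")
            except LookupError:
                # Charset name Python has no codec for (e.g. "unknown").
                part = part.decode("latin-1")
        for word in part.lower().split():
            yield "subject:" + word


plain = list(subject_tokens("hello world"))
mixed = list(subject_tokens("=?iso-8859-1?q?cheap?= meds"))
odd = list(subject_tokens("=?unknown?q?hi?="))
```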


> > I plan to try something like tokenizing the oldest three received
> > lines (to hopefully avoid the previous issues with mail.python.org
> > blowing numbers to hell) to see if that will help this one.
> Did you try that yet?  I'm not replying in a timely fashion, but not because
> I'm not interested; it's just because I'm 244 msgs behind on this mailing
> list alone now <wink/sigh>.

Not yet, no. It's on the stack.
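
The idea, sketched under the assumption that each relay prepends its own Received: line, so the oldest hops are the last ones in the message. The function names and the "received:" token prefix are hypothetical.

```python
from email import message_from_string


def oldest_received(msg, count=3):
    """Return the count oldest Received: header values, newest first."""
    if count <= 0:
        return []
    received = msg.get_all("received", [])
    # Relays prepend, so the tail of the list holds the earliest hops.
    return received[-count:]


def received_tokens(msg, count=3):
    for line in oldest_received(msg, count):
        for word in line.lower().split():
            yield "received:" + word


msg = message_from_string(
    "Received: from hop5\n"
    "Received: from hop4\n"
    "Received: from hop3\n"
    "Received: from hop2\n"
    "Received: from hop1\n"
    "\n"
    "body\n"
)
oldest = oldest_received(msg)
```

Here hop5 and hop4 stand in for the list server's own hops (mail.python.org and friends), which are the ones that skewed things before.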

> > A base64d MP3 spam sent via zope-dev
> > (*H* 0.993904, *S* 0.187868 = 0.0969820429397)
> > which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
> > also the various mailman type clues (although that's better with the
> > first patch, above)
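
The three numbers in the quoted score line are consistent with combining the two indicators as (S - H + 1) / 2. That is a reading of the figures shown, not something stated in this thread, and the function name below is made up:

```python
def combined_score(h, s):
    """Combine a ham indicator h and a spam indicator s into one score.

    Assumption: score = (s - h + 1) / 2, so 0.0 is pure ham, 1.0 pure spam.
    """
    return (s - h + 1.0) / 2.0


score = combined_score(0.993904, 0.187868)
```

With H that close to 1, the message lands firmly on the ham side even though S is not negligible, which is why the hammy Subject and Mailman clues matter so much here.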

I'm going to try a patch to strip out mailing list [titles] at
some point, too.
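
Something along these lines, as a first cut. The regexp and function name are assumptions, and it naively treats any short bracketed chunk as a list title, which will also eat legitimate bracketed text in a subject.

```python
import re

# Hypothetical pattern: a short [bracketed] tag plus any trailing whitespace.
LIST_TAG_RE = re.compile(r"\[[^\[\]]{1,40}\]\s*")


def strip_list_tags(subject):
    """Remove [List-name] style tags from a Subject line."""
    return LIST_TAG_RE.sub("", subject).strip()


cleaned = strip_list_tags("[Zope-dev] Re: ofpa")
```

The cleaned string is what would then go to the Subject tokenizer, so the list name stops contributing hammy clues.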

Anthony


