[Spambayes] how spambayes handles image-only spams

Mon Sep 8 12:52:25 EDT 2003

From: Bill Yerazunis [mailto:wsy at merl.com]

> Um.... you're arguing politics of desire 
> against actual measured statistics.  

Not really; I have no stake in the prevalence of HTML mail. I just
think that corpora with small amounts of HTML ham are not representative
of the general, Windows-using email population. I also think the
trend of "more HTML ham" will continue, because of the default
configurations of popular mail clients.

Given that wonderful folks like you actually write these filters for
the Internet community, I am simply concerned that some harmful design
decisions were made because your ham corpora are so devoid of HTML.

> on the grounds that the SpamAssassin corpus 
> is a little less biased, I re-ran the tests
... 
> So, it seems that "font" is somewhat spammy, 
> and so is "br", but <a and <td aren't, and 
> <p> is totally equivocal.

This is what I was getting at. Here are results from the most recent
1549 messages of each of my own corpora, which are probably biased
towards HTML ham:

Tag      ham    ham %   spam   spam %

<P>      953    61.5%   1022   66.0%
<BR>     1223   79.0%   1009   65.1%
<TD      67      4.3%   425    27.4%
<font    1250   80.7%   1039   67.1%
<img     53      3.4%   817    52.7%

Total    1549           1549

As you can see, because so many people use Outlook, Outlook Express,
and Notes to send me ham, HTML tags are present in a large share of
what I receive. (The exceptions are <TD, which only seems to show up in
ham when someone sends me excerpts from a spreadsheet, and <img, which
only appears when people send me photos or joke images.)
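
To make the table concrete, here is a rough sketch of how those counts
turn into a per-tag spam estimate. It assumes a naive ratio of spam rate
to (spam rate + ham rate), which is not necessarily what SpamBayes
computes internally; the names COUNTS and spam_ratio are made up for
illustration.

```python
# Counts from the table above: tag -> (ham messages, spam messages).
COUNTS = {
    "<P>": (953, 1022),
    "<BR>": (1223, 1009),
    "<TD": (67, 425),
    "<font": (1250, 1039),
    "<img": (53, 817),
}
N_HAM = N_SPAM = 1549  # messages sampled from each corpus


def spam_ratio(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM):
    """Naive estimate (an assumption): spam rate / (spam rate + ham rate)."""
    ham_rate = ham_count / n_ham
    spam_rate = spam_count / n_spam
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    for tag, (ham, spam) in COUNTS.items():
        print(f"{tag:6s} {spam_ratio(ham, spam):.2f}")
```

On these numbers <img comes out around 0.94 and <TD around 0.86, while
<P>, <BR>, and <font all land between roughly 0.45 and 0.52.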

My basic argument is that arbitrarily throwing out some HTML tokens in
the parser, while leaving others, might make the filter more effective
for only certain corpora. What test corpora was this decision based on?

I think keeping some form of <img as a token would help my detection of
image-only spam, which seems to slip through SpamBayes more often than
other types of spam. I also think it would be even better to have a
multi-word token, something like those produced by the CRM-114 token
generator, which could find multi-tag strings like <img*src*http.
These suggestions are just based on my knowledge of the algorithms
involved and the contents of my corpora; I don't know enough Python to
really give them a try in SpamBayes (although I'm working on that ;-).
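
For what it's worth, a minimal sketch of the kind of token I have in
mind might look like the following. This is neither SpamBayes code nor
the CRM-114 generator; img_tokens and both regular expressions are
hypothetical, and the "relative" label for scheme-less URLs is my own
invention.

```python
import re

# Hypothetical patterns, not part of SpamBayes: one finds <img ...> tags,
# the other pulls the src value out of a tag's attribute text.
IMG_TAG = re.compile(r"<img\b([^>]*)>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def img_tokens(html):
    """Yield '<img' and, when a src is present, '<img*src*SCHEME'."""
    for match in IMG_TAG.finditer(html):
        yield "<img"
        src = SRC_ATTR.search(match.group(1))
        if src:
            url = src.group(1).lower()
            # Keep only the URL scheme (http, cid, ...) so the token
            # generalizes across hosts and file names.
            scheme = url.split(":", 1)[0] if ":" in url else "relative"
            yield f"<img*src*{scheme}"


if __name__ == "__main__":
    sample = '<IMG SRC="http://example.com/a.gif"><img src=cid:part1>'
    print(list(img_tokens(sample)))
```

The idea is that a remotely hosted image and an inline attachment
(cid:) would produce different tokens, so the classifier could learn
them separately.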

Regards,
	-Ryan-