[Spambayes] Introducing myself

Sun Nov 10 00:32:36 2002

Hello everyone,

Just a quick note to introduce myself; I ran the session at the 
Hackers conference that Guido mentioned, and passed on the 
suggestion of checking out Bill Y's combinatorial approach.

I've been playing with rules-based techniques for almost a year (see 
http://www.madoverlord.com/projects/told.t for details) and toying 
with bayesian systems for only the last couple of months, on and 
off.  So no expert in that regard; I have mostly replicated the early 
work you guys have done (skimmed the archive today).

I'm particularly impressed with the chi-square work, it looks very 
interesting (but more stats for my poor stats-challenged mind to work 
on; not to mention that now I'm going to have to get around to 
cramming python in there with all the other languages that have 
accumulated over the years...).  Also, it's nice the way you're 
testing a lot of variants, I've been crossing things off my "try 
this" list all afternoon.

Couple of comments (bear in mind, I haven't grabbed the source yet, 
and only skimmed the archive, so if this repeats things you've 
already tried, sorry).  This is just stuff that's been in my mind 
recently, plus stuff stimulated by my skim.

* The great headers debate; suggest you put both machine and human 
readable opinions in the header, eg:

	X-SpamBayes-Rating: 9 (Very Spammy)
	X-SpamBayes-Rating: 7 (Somewhat Spammy)
	X-SpamBayes-Rating: 5 (Unsure)
	X-SpamBayes-Rating: 3 (Probably Innocent)
	X-SpamBayes-Rating: 0 (The Finest Ham)

The reason being, many mailreaders can use a finer discriminant than 
(yes,no,beats me) in ranking spam.  A common strategy (which I like 
myself) is to start an email at neutral priority and bump it up and 
down based on various triggers, whitelists, whatever, then sort the 
inbox by the final priority.

A cute hack I used in TOLD was to output the result like this:

	X-SpamBayes-Rating: 0123456789 (Very Spammy)
	X-SpamBayes-Rating: 012345 (Unsure)

This permits a mailreader with limited filtering tools (like Eudora) 
to classify multiple results with a single rule (such as "if an 
X-SpamBayes-Rating header contains the string 12345678, set priority 
to double-low", which catches both 8 and 9 rated emails).
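For anyone who wants to try the cumulative-digit header, here is a rough 
Python sketch.  The function name and the cutoffs between labels are my 
assumptions; only the five labels and the 0-9 scale come from the 
examples above.

```python
# Hypothetical helper. The labels come from the examples above, but the
# cutoffs between them are assumed purely for illustration.
LABELS = [
    (9, "Very Spammy"),
    (7, "Somewhat Spammy"),
    (5, "Unsure"),
    (3, "Probably Innocent"),
    (0, "The Finest Ham"),
]


def format_rating(rating):
    """Return a header whose value lists every digit from 0 up to rating."""
    if not 0 <= rating <= 9:
        raise ValueError("rating must be between 0 and 9")
    digits = "".join(str(d) for d in range(rating + 1))
    # Pick the first label whose floor the rating reaches.
    label = next(text for floor, text in LABELS if rating >= floor)
    return "X-SpamBayes-Rating: %s (%s)" % (digits, label)
```

With this layout a single "contains 12345678" rule matches ratings 8 and 9 
and nothing lower, since a rating of 7 stops at "01234567".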

BTW, being pedantic, "rating" is a better word to use: it is more 
precisely what the discriminator is doing, is the same in all flavors 
of English, and is shorter.  "Score" might be even better.  ;^)

* Hashing to a 32-bit token is very fast, saves a ton of memory, and 
the number of collisions using the python hash (I appealed for hash 
functions on the hackers-l and Guido was kind enough to send me the 
source) is low.  About 1100 collisions out of 3.3 million unique 
tokens on a training set I was using.  CRC32, of all things, is 
actually slightly better, but only by a hair.  So this kind of 
hashing probably won't have much effect on the statistical results.
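If you want to measure the collision rate on your own corpus, something 
like the following would do.  It is only a sketch: it uses CRC32 from 
Python's zlib module rather than the interpreter's own string hash 
(which current Python versions salt per process), and the function 
names are made up for illustration.

```python
import zlib


def hash_token(token):
    """Map a token to an unsigned 32-bit integer using CRC32."""
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF


def count_collisions(tokens, hash_fn=hash_token):
    """Count unique tokens whose hash was already taken by another token."""
    seen = set()
    collisions = 0
    for token in set(tokens):  # duplicates of the same token are not collisions
        h = hash_fn(token)
        if h in seen:
            collisions += 1
        else:
            seen.add(h)
    return collisions
```

Passing a different hash_fn makes it easy to compare candidate hash 
functions on the same token list.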

* Bill Y's byte bucket system has a lot of problems, but there are 
probably some data reduction techniques that would work well.  One 
that occurred to me on the way back from Hackers would be simply to 
keep a 1-byte count of ham/spam hits for each token, and when the ham 
or spam count is about to wrap, cut each count in half, rounding up 
the other value; ie:

	// increment ham count for bucket i
	// apologies, my pseudocode is a bizarre mixture of
	// now-unknown languages

	if (ham[i] == 255)
	   {
	      ham[i] = 128;   // (255 + 1) / 2
	      spam[i] = (spam[i] / 2) + (spam[i] % 2);   // halve, rounding up
	   }
	else
	   ham[i]++;

The nice thing about this is that it would bias in favor of more 
recent email; things would "age".  But note this means when building 
the original database you have to feed it ham and spam in small 
chunks, or use a greater resolution before cramming it into 
individual bytes.
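The same idea as a small Python sketch, using bytearrays for the 1-byte 
counters.  The function name and argument order are my own choices; the 
halving rule is the one described above.

```python
def bump(counts, other, i):
    """Increment counts[i]; if it is about to wrap, halve both counters.

    counts and other are bytearrays of per-bucket hit counts (0..255),
    e.g. counts=ham and other=spam when recording a ham hit.
    """
    if counts[i] == 255:
        counts[i] = 128                           # (255 + 1) / 2
        other[i] = other[i] // 2 + other[i] % 2   # halve, rounding up
    else:
        counts[i] += 1


# Example: a ham hit on a saturated bucket ages both counters.
ham = bytearray([255, 10])
spam = bytearray([5, 7])
bump(ham, spam, 0)   # ham[0] becomes 128, spam[0] becomes 3
bump(ham, spam, 1)   # ham[1] becomes 11, spam[1] is unchanged
```

Recording a spam hit is the same call with the arguments swapped, 
bump(spam, ham, i).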

* I was playing a week or two back with 1 and 2 token groups, and 
found that a useful technique was, for each new token, to only 
consider the most deviant result.  So if the individual word was .99 
spam, and the two word phrase was .95, it would only consider the .99 
result.  This would probably help with Bill Y's combinatorial scheme. 
Dunno if you've tried this; it prevents a particularly spammy or 
hammy sequence from dominating the results (I was only considering 
the 16 or so most deviant results in my bayesian calc, at least on my 
corpus, more than that didn't really help).
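Roughly what I mean, as a Python sketch.  The names are hypothetical and 
this covers only the selection step, not the bayesian calculation that 
consumes the selected values.

```python
def most_deviant(probs):
    """Return the probability farthest from the neutral value 0.5."""
    return max(probs, key=lambda p: abs(p - 0.5))


def select_evidence(per_token_probs, limit=16):
    """Keep one probability per token position, then the most deviant few.

    per_token_probs holds one group per new token, e.g.
    [single_word_prob, two_word_phrase_prob].
    """
    picks = [most_deviant(group) for group in per_token_probs]
    picks.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return picks[:limit]
```

So for a word at .99 and its two-word phrase at .95, only the .99 goes 
forward into the calculation.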

* My personal bias (as I think Guido mentioned) is for a multifaceted 
approach, using Bayesian, rules-based (attacking things that bayesian 
isn't good at, like looking for obfuscated url structures), DNSBL, 
and whitelisting heuristics to generate an overall ranking.  So a 
hammy mail from a guy in your address book would bubble up to highest 
priority, whereas something spammy from him would stay neutral. 
There's lots of room for cooperation between the various approaches 
and multiple agents means it's less likely that a spam will get by. 
In particular, whitelisting heuristics can almost eliminate false 
positives.
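To make the bump-up, bump-down idea concrete, here is a toy Python 
sketch.  Every threshold and weight in it is an assumption picked for 
illustration, not something I have tuned, and the function and 
parameter names are hypothetical.

```python
def priority(spam_rating, in_address_book=False, on_dnsbl=False, rule_hits=0):
    """Start at neutral (0) and bump the priority up or down per signal.

    spam_rating is the 0-9 rating; all cutoffs and weights are assumed.
    """
    score = 0
    if spam_rating >= 7:       # Bayesian rating says spammy
        score -= 1
    elif spam_rating <= 3:     # Bayesian rating says hammy
        score += 1
    if in_address_book:        # whitelisting heuristic
        score += 1
    if on_dnsbl:               # sending host is listed on a DNSBL
        score -= 1
    score -= rule_hits         # each rules-based hit counts against the mail
    return score
```

With these weights a hammy mail from someone in the address book ends up 
at +2, while a spammy one from the same sender nets out at 0 and stays 
neutral.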

* Finally, if anyone needs more spam, I get over 300 a day (I've been 
around a while!) and have a cleaned corpus of over 130MB of spam and 
foreign email.  Also, given all the legit web-marketing email I get 
because of the url registration work I've done, I've got tons of the 
spammiest ham you could imagine.

Best
R

-- 
-----------------------------------------------------------------------
http://madoverlord.com/    World Domination - a fun family activity!