[Spambayes] Introducing myself
Robert Woodhead
trebor@animeigo.com
Sun Nov 10 00:32:36 2002
Hello everyone,
Just a quick note to introduce myself; I ran the session at that
Hacker's conference that Guido mentioned, and passed on the
suggestion of checking out Bill Y's combinatorial approach.
I've been playing with rules-based techniques for almost a year (see
http://www.madoverlord.com/projects/told.t for details) and toying
with Bayesian systems for only the last couple of months, on and
off. So no expert in that regard; I have mostly replicated the early
work you guys have done (skimmed the archive today).
I'm particularly impressed with the chi-square work; it looks very
interesting (but more stats for my poor stats-challenged mind to work
on, not to mention that now I'm going to have to get around to
cramming Python in there with all the other languages that have
accumulated over the years...). Also, it's nice the way you're
testing a lot of variants; I've been crossing things off my "try
this" list all afternoon.
Couple of comments (bear in mind, I haven't grabbed the source yet,
and only skimmed the archive, so if this repeats things you've
already tried, sorry). This is just stuff that's been in my mind
recently, plus stuff stimulated by my skim.
* The great headers debate; suggest you put both machine- and
human-readable opinions in the header, e.g.:
X-SpamBayes-Rating: 9 (Very Spammy)
X-SpamBayes-Rating: 7 (Somewhat Spammy)
X-SpamBayes-Rating: 5 (Unsure)
X-SpamBayes-Rating: 3 (Probably Innocent)
X-SpamBayes-Rating: 0 (The Finest Ham)
The reason being, many mailreaders can use a finer discriminant than
(yes, no, beats me) in ranking spam. A common strategy (which I like
myself) is to start an email at neutral priority and bump it up and
down based on various triggers, whitelists, whatever, then sort the
inbox by the final priority.
A cute hack I used in TOLD was to output the result like this:
X-SpamBayes-Rating: 0123456789 (Very Spammy)
X-SpamBayes-Rating: 012345 (Unsure)
This permits a mailreader with limited filtering tools (like Eudora)
to classify multiple results with a single rule (such as "if an
X-SpamBayes-Rating header contains the string 12345678, set priority
to double-low", which catches both 8 and 9 rated emails).
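The cumulative-digits trick above can be sketched in Python as follows.
This is only an illustration: the function names rating_header and
matches_rule are made up here, and the 0-9 range is assumed from the
example headers.

```python
def rating_header(rating, label):
    """Build a header whose value lists every digit from 0 up to `rating`.

    `rating` is assumed to be an integer from 0 (ham) to 9 (spam).
    """
    if not 0 <= rating <= 9:
        raise ValueError("rating must be between 0 and 9")
    digits = "".join(str(d) for d in range(rating + 1))
    return f"X-SpamBayes-Rating: {digits} ({label})"


def matches_rule(header, needle="12345678"):
    """Mimic a mailreader rule of the form 'header contains <needle>'."""
    return needle in header


print(rating_header(9, "Very Spammy"))
print(rating_header(5, "Unsure"))
```

Because a rating of 8 or 9 always contains the substring 12345678 and a
rating of 7 or lower never does, one substring rule covers the whole
top of the scale.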
BTW, being pedantic, "rating" is a better word to use; it is more
precisely what the discriminator is doing, is the same in all flavors
of English, and is shorter. "Score" might be even better. ;^)
* Hashing to a 32-bit token is very fast, saves a ton of memory, and
the number of collisions using the Python hash (I appealed for hash
functions on the hackers-l and Guido was kind enough to send me the
source) is low: about 1100 collisions out of 3.3 million unique
tokens on a training set I was using. CRC32, of all things, is
actually slightly better, but only by a hair. So this kind of
hashing probably won't have much effect on the statistical results.
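A rough way to check collision counts on your own corpus, sketched in
Python. It uses zlib.crc32 rather than the built-in hash(), since in
current Python versions hash() on strings varies between runs; the
helper names token_hash and count_collisions are invented for this
example.

```python
import zlib
from collections import defaultdict


def token_hash(token):
    """Map a token to an unsigned 32-bit value using CRC32."""
    return zlib.crc32(token.encode("utf-8"))


def count_collisions(tokens):
    """Count unique tokens that share a hash with an earlier unique token."""
    groups = defaultdict(set)
    for token in set(tokens):
        groups[token_hash(token)].add(token)
    # A bucket holding n distinct tokens contributes n - 1 collisions.
    return sum(len(group) - 1 for group in groups.values())


sample = ["free", "money", "meeting", "free", "python"]
print(count_collisions(sample))
```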
* Bill Y's byte bucket system has a lot of problems, but there are
probably some data reduction techniques that would work well. One
that occurred to me on the way back from Hackers would be simply to
keep a 1-byte count of ham/spam hits for each token, and when the ham
or spam count is about to wrap, cut each count in half, rounding up
the other value; i.e.:
// increment ham count for bucket i
// apologies, my pseudocode is a bizarre mixture of
// now-unknown languages
if (ham[i] == 255)
{
    // ham is about to wrap: halve both counts, rounding up
    ham[i] = 128;
    spam[i] = (spam[i] / 2) + (spam[i] % 2);
}
else
    ham[i]++;
The nice thing about this is that it would bias in favor of more
recent email; things would "age". But note this means when building
the original database you have to feed it ham and spam in small
chunks, or use a greater resolution before cramming it into
individual bytes.
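The same halving scheme as a small runnable Python sketch, using
bytearray so each count really is one byte. The class name ByteCounts
and its method names are assumptions made for illustration, not
anything from an existing codebase.

```python
class ByteCounts:
    """Per-bucket ham/spam counters stored in one byte each."""

    def __init__(self, n_buckets):
        self.ham = bytearray(n_buckets)
        self.spam = bytearray(n_buckets)

    @staticmethod
    def _bump(counts, other, i):
        if counts[i] == 255:
            # About to wrap: halve both counts, rounding up.
            counts[i] = 128                  # (255 + 1) / 2
            other[i] = (other[i] + 1) // 2   # same as x/2 + x%2
        else:
            counts[i] += 1

    def add_ham(self, i):
        self._bump(self.ham, self.spam, i)

    def add_spam(self, i):
        self._bump(self.spam, self.ham, i)


counts = ByteCounts(4)
for _ in range(255):
    counts.add_ham(0)
for _ in range(3):
    counts.add_spam(0)
counts.add_ham(0)  # triggers the halving step
print(counts.ham[0], counts.spam[0])
```

Each halving shrinks the weight of older hits, which is where the
"aging" effect comes from.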
* I was playing a week or two back with 1 and 2 token groups, and
found that a useful technique was, for each new token, to only
consider the most deviant result. So if the individual word was .99
spam, and the two word phrase was .95, it would only consider the .99
result. This would probably help with Bill Y's combinatorial scheme.
Dunno if you've tried this; it prevents a particularly spammy or
hammy sequence from dominating the results. (I was only considering
the 16 or so most deviant results in my Bayesian calc; at least on my
corpus, more than that didn't really help.)
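A minimal Python sketch of the "most deviant result" selection,
assuming each token position comes with a list of spam probabilities
(one for the single word, one for the two-word phrase ending at it) and
that "deviant" means far from 0.5. The names most_deviant and
select_evidence and the exact input layout are hypothetical.

```python
def most_deviant(probs):
    """Return the probability farthest from the neutral value 0.5."""
    return max(probs, key=lambda p: abs(p - 0.5))


def select_evidence(per_token_probs, limit=16):
    """Keep one probability per token, then the `limit` most deviant overall."""
    picked = [most_deviant(ps) for ps in per_token_probs if ps]
    picked.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return picked[:limit]


# Each inner list: [single-word probability, two-word-phrase probability]
evidence = select_evidence([[0.99, 0.95], [0.5, 0.6], [0.1]], limit=2)
print(evidence)
```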
* My personal bias (as I think Guido mentioned) is for a multifaceted
approach, using Bayesian, rules-based (attacking things that Bayesian
isn't good at, like looking for obfuscated URL structures), DNSBL,
and whitelisting heuristics to generate an overall ranking. So a
hammy mail from a guy in your address book would bubble up to highest
priority, whereas something spammy from him would stay neutral.
There's lots of room for cooperation between the various approaches,
and multiple agents means it's less likely that a spam will get by.
In particular, whitelisting heuristics can almost eliminate false
positives.
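One way the priority bumping could look, as a deliberately tiny Python
sketch. The thresholds (3 and 7), the one-step bumps, and the function
name mail_priority are all assumptions chosen to reproduce the address
book example above; a real setup would add rules-based and DNSBL
triggers in the same style.

```python
def mail_priority(rating, in_address_book):
    """Start at neutral (0) and bump up or down per trigger.

    `rating` is assumed to be a 0-9 spam rating (9 = very spammy).
    """
    priority = 0
    if rating <= 3:        # hammy according to the classifier
        priority += 1
    elif rating >= 7:      # spammy according to the classifier
        priority -= 1
    if in_address_book:    # whitelist trigger
        priority += 1
    return priority


print(mail_priority(1, True))   # hammy mail from a known sender
print(mail_priority(9, True))   # spammy mail from a known sender
print(mail_priority(9, False))  # spammy mail from a stranger
```

Sorting the inbox by this value puts known-sender ham on top and leaves
spammy mail from a known sender at neutral rather than burying it.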
* Finally, if anyone needs more spam, I get over 300 a day (I've been
around a while!) and have a cleaned corpus of over 130MB of spam and
foreign email. Also, given all the legit web-marketing email I get
because of the URL registration work I've done, I've got tons of the
spammiest ham you could imagine.
Best
R
--
-----------------------------------------------------------------------
http://madoverlord.com/ World Domination - a fun family activity!