[Spambayes] Introducing myself

Tim Peters tim.one@comcast.net
Sun Nov 10 06:55:59 2002


[Robert Woodhead]
> Hello everyone,

Hi!

> Just a quick note to introduce myself; I ran the session at that
> Hacker's conference that Guido mentioned, and passed on the
> suggestion of checking out Bill Y's combinatorial approach.

You can find test results for that in the list archives.  Bottom line is
that it did worse than what we're doing now, badly enough that I appear to be
the only one who tried it (my reports weren't encouraging).  I may
have misimplemented the idea, but I don't think so.  The results were in
line with earlier experiments we tried on gimmicks that systematically
generate highly correlated "words".  Such things appear to learn a lot
faster than word unigrams, but we've always found (so far) that unigrams
soon enough overcome that, and then go on to win.  What we're missing is any
practical approach to a scheme that can suck out phrases without identifying
them by hand first, and without generating highly correlated phrases
(overlapping word n-grams are highly correlated, of course, and Bill carries
that to extremes).

Something I didn't report on is later experiments using chi-combining
instead of Bill's "add up the raw counts".  chi-combining worked better.  I
know Bill has gone on to do a more Bayesian-like combination method, but I
expect that to do worse than what we've got now for the same reasons we gave
up on Paul Graham's combining scheme, but more so:  the word independence
assumption is bogus, and feeding the Bayesian calcs highly correlated words
grossly over- or underestimates the true probability as a result.  In the
end you get a scheme that claims certainty even when it's dead wrong, and
although it's not dead wrong often, it is dead wrong at a non-zero rate.

Revealing:  fiddle our chi2.py to use whatever combining scheme Bill is
using now, and feed it vectors of *random* probabilities.  Most of the code
needed for that, and to display a histogram of results, is already there.
Try it with Graham's combining scheme and you'll find that scores are almost
always very near 0 or very near 1 even when the inputs are random and
uniformly distributed.  I expect that can only get worse by doing "chain
rule" calcs on probs that are highly correlated to begin with.

The internal chi-combining S and H scores are uniformly distributed given
random inputs, so chi-combining doesn't infer certainty by chance any more
often than it "should" infer certainty by chance.  That appears to be what
makes it far more robust against embarrassing mistakes, and it reliably
pumps out a score near 0.5 given a highly ambiguous input msg (many strong
ham and many strong spam clues -- we call that "cancellation disease" here,
and chi-combining doesn't infer certainty when it happens; all other schemes
did, and didn't do better than chance when it happened).
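
Roughly, the chi-combining calculation is this (a simplified sketch of the
idea; the real chi2.py is more careful about numeric range and about which
words get used, and it assumes the probs are already clamped away from 0 and
1):

    from math import log, exp

    def chi2Q(x2, v):
        # Probability that a chi-squared variable with v (even) degrees of
        # freedom exceeds x2, via the standard series for even dof.
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_combine(probs):
        # Fisher's method, applied twice:  if the word probs were independent
        # and uniformly random, -2*sum(ln p) would be chi-squared with 2n
        # degrees of freedom, so H and S are each uniform on [0, 1] for
        # random input, and the final score hugs 0.5 unless the evidence
        # really leans one way.
        n = len(probs)
        H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)        # hammy
        S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)  # spammy
        return (S - H + 1.0) / 2.0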

> I've been playing with rules-based techniques for almost a year (see
> http://www.madoverlord.com/projects/told.t for details) and toying
> with bayesian  systems for only the last couple of months, on and
> off.  So no expert in that regard; I have mostly replicated the early
> work you guys have done (skimmed the archive today).
>
> I'm particularly impressed with the chi-square work, it looks very
> interesting (but more stats for my poor stats-challenged mind to work
> on;

So copy and paste <wink>.

> not to mention that now I'm going to have to get around to
> cramming python in there with all the other languages that have
> accumulated over the years...).

In return, you can throw twelve other languages out <0.7 wink>.

> Also, it's nice the way you're testing a lot of variants, I've been
> crossing things off my "try this" list all afternoon.

Testing has pretty much run out of steam here, though.  My error rates are
so low now I couldn't measure an improvement in a convincing way even if one
were to be made, and the same is true of a few others here too.  We appear
to be fresh out of big algorithmic wins, so are pushing on to wrestling with
deployment issues.

BTW, download the source code and read the comments in tokenizer.py:  the
results of many early experiments are given there in comment blocks.

> Couple of comments (bear in mind, I haven't grabbed the source yet,
> and only skimmed the archive, so if this repeats things you've
> already tried, sorry).  This is just stuff that's been in my mind
> recently, plus stuff stimulated by my skim.
>
> * The great headers debate; suggest you put both machine and human
> readable opinions in the header, eg:
>
> 	X-SpamBayes-Rating: 9 (Very Spammy)
> 	X-SpamBayes-Rating: 7 (Somewhat Spammy)
> 	X-SpamBayes-Rating: 5 (Unsure)
> 	X-SpamBayes-Rating: 3 (Probably Innocent)
> 	X-SpamBayes-Rating: 0 (The Finest Ham)
>
> The reason being, many mailreaders can use a finer discriminant than
> (yes,no,beats me) in ranking spam.  A common strategy (which I like
> myself) is to start an email at neutral priority and bump it up and
> down based on various triggers, whitelists, whatever, then sort the
> inbox by the final priority.

Spoken like someone who worked on a rule-based system <wink>.  We have three
categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
believe that a finer distinction than that can be quantitatively justified
(but on my primary test data the classifier makes 2 mistakes out of 34,000 msgs
now -- that's
what I mean by "can't measure an improvement anymore", and a finer-grained
scheme isn't going to touch those 2 mistakes; one of them is formally ham
because it was sent by a real person, but consists of a one-line comment
followed by a quote of an entire Nigerian scam spam -- nothing useful is
ever going to *call* that one ham, and it scores as spam *almost* as solidly
as an original Nigerian spam).

> A cute hack I used in TOLD was to output the result like this:
>
> 	X-SpamBayes-Rating: 0123456789 (Very Spammy)
> 	X-SpamBayes-Rating: 012345 (Unsure)
>
> This permits a mailreader with limited filtering tools (like Eudora)
> to classify multiple results with a single rule (such as "if an
> X-SpamBayes-Rating header contains the string 12345678, set priority
> to double-low", which catches both 8 and 9 rated emails).
>
> BTW, being pedantic, "rating" is a better word to use, it is more
> precisely what the discriminator is doing, is the same in all flavors
> of english, and is shorter.  "Score" might be even better.  ;^)

"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
"not spam" was my invention, and since that one caught on big, I'm not
fighting to the death for any others <wink>.
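
For what it's worth, the padding trick is a one-liner to emit.  A hypothetical
sketch (the header name and labels are just the ones from the example above):

    def rating_header(rating, label):
        # rating is 0..9; emit it as a prefix of "0123456789" so a reader
        # limited to substring matching can catch a whole range with one
        # rule (e.g. "contains 12345678" matches both 8- and 9-rated mail).
        return "X-SpamBayes-Rating: %s (%s)" % ("0123456789"[:rating + 1], label)

    print(rating_header(9, "Very Spammy"))   # X-SpamBayes-Rating: 0123456789 (Very Spammy)
    print(rating_header(5, "Unsure"))        # X-SpamBayes-Rating: 012345 (Unsure)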

> * Hashing to a 32-bit token is very fast, saves a ton of memory,
> and the number of collisions using the python hash (I appealed for hash
> functions on the hackers-l and Guido was kind enough to send me the
> source) is low.  About 1100 collisions out of 3.3 million unique
> tokens on a training set I was using.

That's significantly better than you could expect from a truly random hash
function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
should leave 3298733 buckets occupied on average, with an sdev of 35.58
buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
mean.
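
The back-of-the-envelope version of that arithmetic (using the birthday
approximation and treating collisions as roughly Poisson, which is fine at
this sparse an occupancy):

    m = 3.3e6        # unique tokens hashed
    n = 2.0 ** 32    # 32-bit hash buckets

    expected = m * (m - 1) / (2 * n)    # expected collisions, ~1268
    sdev = expected ** 0.5              # ~35.6
    print((expected - 1100) / sdev)     # ~4.7 sdevs fewer collisions than random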

> CRC32, of all things, is actually slightly better,

With sparse occupancy (3.3e6 out of 4.3e9 buckets is sparse) they may be
comparable.  PythonLabs ran large-scale statistical tests a few years ago on
this.  The Python string hash produced 32-bit numbers indistinguishable from
random (on the #-of-collision basis) as far as we pushed it; crc32 broke
down *very* badly as occupancy increased, with collision rates hundreds of
sdevs worse than random.  So I can't recommend crc32 for general string
hashing (and the Python docs indeed warn about this now), but can recommend
Python's string hash.  By coincidence, it turns out that Python's string
hash is very similar to what later became "the standard" Fowler-Noll-Vo
string hash, which may be the most widely tested "seems as good as random"
fast string hash now:

    http://www.isthe.com/chongo/tech/comp/fnv/
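
For reference, the 32-bit FNV-1a variant is tiny (this isn't Python's string
hash, just one of the FNV variants the page above describes):

    def fnv1a_32(data):
        # xor in each byte, then multiply by the 32-bit FNV prime.
        h = 0x811c9dc5                                   # FNV-1a offset basis
        for byte in data:
            h = ((h ^ byte) * 0x01000193) & 0xffffffff   # FNV prime, mod 2**32
        return h

    print(hex(fnv1a_32(b"unsubscribe")))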

> but only by a hair.  So this kind of hashing probably won't have much
> effect on the statistical results.

Since we're sticking to unigrams, we don't have an insane database burden.
We also (by default) limit ourselves to looking at no more than 150 words
per msg.  So I'm not sure saving some bytes of string storage is "worth it"
for us, and it's very nice that we can get back the exact list of words that
went into computing a score later.  A pile of hash codes wouldn't give the
same loving effect <wink>.

> * Bill Y's byte bucket system has a lot of problems, but a there are
> probably some data reduction techniques that would work well.  One
> that occurred to me on the way back from Hackers would be simply to
> keep a 1-byte count of ham/spam hits for each token, and when the ham
> or spam count is about to wrap, cut each count in half, rounding up
> the other value; ie:
>
> 	// increment ham count for bucket i
> 	// apologies, my pseudocode is a bizarre mixture of
> 	// now-unknown languages
>
> 	if (ham[i]==255)
> 	   {
> 	      ham[i]=128;
> 	      spam[i]=(spam[i]/2)+(spam[i]%2)
>     	   }
> 	else
> 	   ham[i]++;
>
> The nice thing about this is that it would bias in favor of more
> recent email; things would "age".  But note this means when building
> the original database you have to feed it ham and spam in small
> chunks, or use a greater resolution before cramming it into
> individual bytes.

Except I didn't get good enough results from his approach to justify
pursuing it here, even leaving the hash codes at the full 32 bits.  When I
went on to squash them to fit in a million buckets, a few false positives
popped up that were just too bad to bear (two can be found in the list
archives):  ham that was so obviously ham that no system that called them
spam would be acceptable to most people.
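
For concreteness, Robert's aging trick in Python (a hypothetical sketch of the
pseudocode above, not anything in spambayes):

    def bump_ham(ham, spam, i):
        # Saturating 1-byte counts with aging: when ham[i] is about to wrap,
        # halve both sides (rounding the spam side up), so old mail slowly
        # fades relative to new mail.
        if ham[i] == 255:
            ham[i] = 128
            spam[i] = spam[i] // 2 + spam[i] % 2
        else:
            ham[i] += 1

    ham = bytearray(2 ** 20)    # about a million 1-byte buckets
    spam = bytearray(2 ** 20)
    bump_ham(ham, spam, hash("free money") % len(ham))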

> * I was playing a week or two back with 1 and 2 token groups, and
> found that a useful technique was, for each new token, to only
> consider the most deviant result.  So if the individual word was .99
> spam, and the two word phrase was .95, it would only consider the .99
> result.  This would probably help with Bill Y's combinatorial scheme.

It could be a viable approach to the problem mentioned above:  a scheme to
suck out more than one word that doesn't systematically generate mounds of
nearly redundant (highly correlated) clues.  We're clearly missing info by
never looking at bigrams (or beyond) now, and that continues to bother me
(even if it doesn't seem to be bothering the error rates <wink>).

> Dunno if you've tried this; it prevents a particularly spammy or
> hammy sequence from dominating the results (I was only considering
> the 16 or so most deviant results in my bayesian calc, at least on my
> corpus, more than that didn't really help).

There's too much I don't know about everything you're doing to say much
about that.  *All* the biases in Graham's original scheme eventually went
away in this project, and things like clamping the spamprobs into [.01,
0.99] turned out to make it systematically useless to try to use more than
16 words under Graham-combining (it just caused more "cancellation disease",
and so caused more wildly wrong mistakes).  We use 150 now, but IIRC we
generally stopped seeing strong benefits after hitting about 40.  The finding
that 40 was better than 16 relied very much on removing all the biases, though (no
"ham boosts", no prob clamping, no minimum word count, no giving unknown
words spamprobs above 0.5 to favor ham, no doubling the ham count when
computing a spam prob, etc).
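
The clue-selection part is simple to sketch (an illustration of the idea only;
the real classifier is fussier about ties and about ignoring words too close
to 0.5 to matter):

    def strongest_clues(probs, n=150):
        # Keep the n spamprobs farthest from the neutral 0.5; everything
        # else carries little information either way.
        return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]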

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL,
> and whitelisting heuristics to generate an overall ranking.  So a
> hammy mail from a guy in your address book would bubble up to highest
> priority, whereas something spammy from him would stay neutral.

I'm not sure we really need it.  For example, *lots* of spam has been
discussed on this mailing list, so much so that the python.org email admin
had to castrate SpamAssassin for msgs to this list address else it kept
blocking ordinary list traffic.  My personal email classifier never calls
anything here spam, though, nor does it call the originals of the spams
posted here ham.

I do worry a little about obfuscated HTML.  We strip almost all HTML tags
by default for a reason I've harped on enough <wink>:  all HTML decorations
have very high spamprobs, and counting more than one of them as "a clue"
fools almost every combining scheme into believing the msg containing them
is spam (if you know a msg contains both <br> and <p>, it's not really more
likely to be spam than if you just know it contains <br>!).  So we blind the
classifier to HTML decorations now.
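
Conceptually the blinding is just this (a crude illustration; tokenizer.py is
considerably more careful than a one-line regex):

    import re

    html_re = re.compile(r"<[^>]*>")

    def strip_tags(text):
        # Replace tags with whitespace so <br>, <p>, font games and the like
        # never reach the classifier as "words", while the visible text does.
        return html_re.sub(" ", text)

    print(strip_tags("Buy <b>cheap</b> meds<br>now"))   # 'Buy  cheap  meds now'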

But a spam I forwarded here a week or so ago exploited that:  the spam was
interleaved with size=1 white-on-white news stories and tech mailing list
postings.  The classifier *did* see those, but didn't see the HTML
decorations hiding them.  This was a cancellation-disease-by-construction
kind of msg, and chi-combining scored it near 0.5 as a result (solidly
Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
popular technique, we'll have to take more HTML blinders off the classifier.

> There's lots of room for cooperation between the various approaches
> and multiple agents means its less likely that a spam will get by.
> In particular, whitelisting heuristics can almost eliminate false
> positives.

I'll let you know if I ever see one <wink>.  Seriously, one of the apps I've
especially got in mind is filtering the high-volume mailing lists on
python.org.  The only kind of FP I see there now in tests is administrative
requests to *-request addresses, which typically consist of a one word
"subscribe" or "unsubscribe" (themselves words with high spamprobs!),
followed by 6KB of employer-generated HTML disclaimers, and/or a forwarded
spam or conference announcement the sender didn't like.  There's still a
very low FP rate even on those, but text analysis simply can't be expected
to nail them every time.  Under SpamAssassin, those recipient addresses are
given strong ham boosts by the python.org email admin.

> * Finally, if anyone needs more spam, I get over 300 a day (I've been
> around a while!) and have a cleaned corpus of over 130MB of spam and
> foreign email.  Also, given all the legit web-marketing email I get
> because of the url registration work I've done, I've got tons of the
> spammiest ham you could imagine.

Wasn't Paul Graham collecting corpora?  Yup, still is:

    http://www.paulgraham.com/spamarchives.html

Getting vast quantities of spam isn't a problem anymore, but getting vast
quantities of ham is.  Since your spammy ham is presumably business-related,
I assume you can't share it.  Or can you?  Mixing spam and ham from
different sources also causes worlds of problems (indeed, we still (by
default) ignore most of the header lines partly for that reason, else the
system gets great results for bogus reasons).



