[Spambayes] Matt Sergeant: Introduction

Tim Peters tim.one@comcast.net
Tue, 01 Oct 2002 17:36:33 -0400


[Matt Sergeant]
> ...
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
>
>    colspan=2
>    face=Arial, Helvetica, sans-serif
>
> as tokens. Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).

Matt, what are you using as test data?  The experience here has been that
HTML is sooooo strongly correlated with spam that we've added gimmick after
gimmick to remove evidence that HTML ever existed; else the rare ham that
uses HTML-- or even *discusses* HTML with a few examples! --had an extremely
hard time avoiding getting classified as spam.

As a result, by default we're stripping all HTML tags unlooked-at (except
that we mine http etc thingies before stripping tags).  Even so, the mere
presence of text/html content-type, and "&nbsp;", still have very high
spamprob, and so still make it hard for content-free <0.1 wink> HTML hams to
get thru.
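
In case it helps to picture it, the net effect is roughly the sketch below.
This is illustration only -- the regexes, token prefix and function name are
made up here, and the real tokenizer.py logic is fancier:

    import re

    # Illustration only -- tokenizer.py is fancier than this.
    url_re = re.compile(r"(?:https?|ftp)://[^\s<>]+", re.IGNORECASE)
    tag_re = re.compile(r"<[^>]*>")

    def tokenize_text(text):
        tokens = []
        # Mine the http etc thingies for clues before the tags vanish.
        for url in url_re.findall(text):
            for piece in re.split(r"[/:.?&=]+", url):
                if piece:
                    tokens.append("url:" + piece.lower())
        # Then strip all HTML tags unlooked-at.
        plain = tag_re.sub(" ", text)
        tokens.extend(word.lower() for word in plain.split())
        return tokens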

A problem seems to be that everyone here subscribes to a few HTML marketing
newsletters (whether they think of them that way or not), but that the only
other HTML they get in their email is 100x more HTML spam.  That gives every
indication of HTML spamprobs >= 0.99, and legitimately so.  A compounding
problem then is that the simplifying assumption of word-probability
independence is grossly violated by HTML markup -- the prob that a msg
contains colspan=2 and the prob that a msg contains face=Arial aren't
independent at all, and pretending that they are independent grossly
overestimates the rarity of seeing them both in a single msg.
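
A toy computation makes the size of the overestimate plain (the numbers here
are invented for illustration, not measured):

    # Invented numbers, purely to illustrate the independence problem.
    p_colspan = 0.6        # P(HTML msg contains colspan=2)
    p_face    = 0.6        # P(HTML msg contains face=Arial)

    assumed_joint = p_colspan * p_face    # 0.36 if they were independent

    # But the two almost always travel together in real HTML mail, so the
    # true joint probability is close to 0.6:  "both present" gets treated
    # as much rarer -- much stronger evidence -- than it really is.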

Do you find, for example, that

    colspan=2

is common in HTML ham but rare in HTML spam, or vice versa?  I know of
specific cases where we're missing good clues by purging HTML decorations,
but nobody here has yet found a strategy for leaving them in that isn't a
disaster for at least one of the error rates.  I'm wondering what's sparing
you from that fate.

> Using word tuples is also a small win,

Word bigrams were a loss for us (comments in tokenizer.py).  This should be
revisited under Gary's scheme, and/or we should stop thinking of unsolicited
conference announcements as being ham <snarl>.

> but increases the database size and number of tokens you have to
> pull from the database enormously.

That was also our experience with word bigrams, but less than "enormously";
about a factor of 2; character 5-grams were snuggling up to enormously.
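
For concreteness, by "word bigrams" I mean emitting a token for each adjacent
pair on top of the plain words, along these lines (a sketch, not the actual
experimental tokenizer code):

    def tokens_with_bigrams(words):
        """Yield each word, plus a 'bi:' token for each adjacent pair.

        Sketch only -- the experimental bigram code differed in detail.
        """
        for word in words:
            yield word
        for first, second in zip(words, words[1:]):
            yield "bi:%s %s" % (first, second)

A message with n words then yields n unigrams plus n-1 bigrams, which is
roughly where the factor-of-2 growth comes from.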

> That's an issue for me because I'm not using an in-memory database (one
> implementation uses CDB, another uses SQL - the SQL one is really nice
> because you can so easily do data mining, and the code to extract the
> token probabilities is just a view).

I haven't hooked ours up to a database yet, but others have.  It's premature
for my present purposes <wink>.

> ...
> Well I very quickly found out that most of the academic research into
> this has been pretty bogus. For example everyone seems (seemed?) to
> think that stemming was a big win, but I found it to lose every time.

We haven't tried that.  OTOH, the academic research has been on Bayesian
classifiers, and this isn't one (despite Paul calling it one).

> ...
> The one thing that still bothers me still about Gary's method is that
> the threshold value varies depending on corpus. Though I expect there's
> some mileage in being able to say that the middle ground is "unknown".

It does allow for an easy, gradual, and effective way to favor f-n at the
expense of f-p, or vice versa.  There was no such possibility under Paul's
scheme, as the more training data we fed in, the rarer it was for *any*
score not to be extremely close to 0 or extremely close to 1, and regardless
of whether the classification was right or wrong.  Gary's method hasn't been
caught in such extreme embarrassment yet.

OTOH, it *is*, as you say, corpus dependent, and it seems hard to get that
across to people.  Gary has said he knows of ways to make the distinction
sharper, but we haven't yet been able to provoke him into revealing them
<wink>.  The central limit variations, and especially the logarithmic one,
are much more extreme this way.
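
In case the threshold point is unclear, the knob amounts to something like
this (the cutoff values and names here are illustrative only -- the right
values are corpus-dependent and have to be read off your own score
histograms):

    HAM_CUTOFF = 0.20      # illustrative, corpus-dependent
    SPAM_CUTOFF = 0.90     # illustrative, corpus-dependent

    def judge(score):
        """Map a combined score in [0.0, 1.0] to a decision."""
        if score >= SPAM_CUTOFF:
            return "spam"
        if score <= HAM_CUTOFF:
            return "ham"
        return "unsure"    # the middle ground Matt mentions

    # Nudging SPAM_CUTOFF up buys fewer f-p at the cost of more f-n;
    # nudging it down trades the other way -- the easy, gradual knob
    # Paul's scheme didn't offer.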

> ...
> OK, I'll go over it again this week and next time I get stuck I'll mail
> out for some help ;-) The hardest part really is getting from how my
> code is structured (i.e. where I get my data from, how I store it, etc)
> to your version. Simple examples like where you use a priority queue for
> the probabilities so you can extract the top N indicators, I just use an
> array, and use a sort to get the top N.

The priority queue was potentially much more efficient when
max_discriminators was 15.  I expect that it costs more than it's worth now
that we've boosted it to 150, so if there's ever a hint that the scoring
time is non-trivial, I'll probably use an array too.
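
Concretely, the two approaches are just the following (a sketch -- the
(strength, word, spamprob) tuples and the MAX_DISCRIMINATORS name are
stand-ins, not the classifier's actual internals):

    import heapq

    MAX_DISCRIMINATORS = 150    # was 15 back when the heap clearly paid off

    def strongest_clues_heap(clues):
        """clues: iterable of (strength, word, spamprob) tuples."""
        return heapq.nlargest(MAX_DISCRIMINATORS, clues)

    def strongest_clues_sort(clues):
        # Matt's approach:  dump everything in a list and sort.  With N up
        # at 150 out of a few hundred candidates, it's just as sensible.
        return sorted(clues, reverse=True)[:MAX_DISCRIMINATORS]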

> So mostly it's just the details of storage that confuse me.

I'm not sure what that means, but I expect it will get fleshed out in time.

> ...
> And where is compute_population_stats used?

It has a non-trivial implementation only under the central-limit variations,
of which there are two.  It's intended to be called after
update_probabilities is called at the end of training, to do a third
training pass of computing population ham & spam means & variances.  Most
people here aren't aware of that, as it happens "by magic" when a test
driver calls TestDriver.Driver.train():

    # CAUTION:  this just doesn't work for incremental training when
    # options.use_central_limit is in effect.
    def train(self, ham, spam):
        print "-> Training on", ham, "&", spam, "...",
        c = self.classifier
        nham, nspam = c.nham, c.nspam
        self.tester.train(ham, spam)
        print c.nham - nham, "hams &", c.nspam - nspam, "spams"
        c.compute_population_stats(ham, False)
        c.compute_population_stats(spam, True)
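
If it helps to see the shape of that third pass, it boils down to
accumulating a per-population mean and variance of message scores, something
like this simplified sketch (the names here are mine, not the real
classifier.py interface):

    def population_stats(msgs, score):
        """Return the mean and variance of per-message scores.

        Simplified sketch -- the real classifier keeps these numbers per
        population (ham vs spam) for the central-limit scoring; `score`
        stands in for whatever maps a message to its raw score.
        """
        n, total, total_sq = 0, 0.0, 0.0
        for msg in msgs:
            x = score(msg)
            n += 1
            total += x
            total_sq += x * x
        mean = total / n
        variance = total_sq / n - mean * mean
        return mean, variance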

>>>...
>>> such as how the probability stuff works so much better on individuals'
>>> corpora (or on a particular mailing list's corpus) than it does for
>>> hundreds of thousands of users.

> On my personal email I was seeing about 5 FP's in 4000, and about 20
> FN's in about the same number (can't find the exact figures right now).

So to match the units and order of the next sentence, about 0.5% FN rate and
0.13% FP rate.

> On a live feed of customer email we're seeing about 4% FN's and 2% FP's.

Is that across hundreds of thousands of users?  Do you know the
corresponding statistics for SpamAssassin?  For python.org use, I've thought
that as long as we could keep this scheme fast, it may be a good way to
reduce the SpamAssassin load.