[Spambayes] Chi**2 results

Sun, 13 Oct 2002 01:57:23 -0400

[Rob Hooft]
> Here is my chi results.

Thanks for trying this, Rob!

> I am amazed by the high cutoff it is advising me to use!

Well, you told it you hate fp 10x more than you hate fn
(best_cutoff_fp_weight = 10), and that pushes the best cutoff up.  Note that
the cutoff is an after-the-fact thing, and moving it improves one error rate
at the unavoidable expense of injuring the other -- it doesn't change any
scores.  It looks like this scheme has an extremely usable middle ground for
you, so provided your deployment can *do* something with a middle ground,
you've got a very large range for absolute cutoffs that would leave you
staring at very few "unsure" msgs.
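
To make the cutoff point concrete, here's a minimal sketch of what "best cutoff under an fp weight" and a middle ground amount to.  The function names, the 0-to-1 score scale, the default cutoffs, and the toy scores are all made up for illustration; this is not the test driver's actual code.

```python
def best_cutoff(ham_scores, spam_scores, fp_weight=10.0, steps=200):
    """Return the cutoff in [0, 1] minimizing fp_weight * FP + FN.

    Scores are assumed to be fixed already: this only picks where to
    draw the line, it never changes a score.
    """
    best = None
    best_cost = None
    for i in range(steps + 1):
        cutoff = i / steps
        fp = sum(1 for s in ham_scores if s >= cutoff)   # ham called spam
        fn = sum(1 for s in spam_scores if s < cutoff)   # spam called ham
        cost = fp_weight * fp + fn
        if best_cost is None or cost < best_cost:
            best, best_cost = cutoff, cost
    return best


def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way decision with a middle ground (cutoff values are hypothetical)."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Toy data: one ham scores high and one spam scores low.
ham = [0.0, 0.01, 0.02, 0.6]
spam = [0.55, 0.97, 0.99, 1.0]
print(best_cutoff(ham, spam, fp_weight=1))
print(best_cutoff(ham, spam, fp_weight=10))
print(classify(0.6), classify(0.995))
```

On the toy data, raising fp_weight from 1 to 10 moves the chosen cutoff above the stray high-scoring ham, at the price of missing the low-scoring spam.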

> This feels very good.

Looks good too <wink>.  One part is *too* good:

-> <stat> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
-> <stat> min -2.22045e-13; median 9.99201e-14; max 100
          ^^^^^^^^^^^^^^^^

It's not logically possible for a score to go negative -- we can thank
rounding errors for that.
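
The effect is easy to reproduce with ordinary floats.  A small sketch; clamp_score is a hypothetical helper for reporting, not something the classifier is known to do:

```python
def clamp_score(score):
    """Pin a score into [0.0, 1.0], absorbing tiny rounding overshoot."""
    return min(1.0, max(0.0, score))


# Mathematically 0.3 - 0.1 - 0.2 is exactly zero, but in binary floating
# point the result comes out a hair below zero.
tiny = 0.3 - 0.1 - 0.2
print(tiny < 0.0)          # True
print(clamp_score(tiny))   # 0.0
```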

> On the FP side bad messages are:
>   * a yahoo account created to correct incorrect listings in their
>     database
>   * A problem with my Linux Journal subscription
>   * India student applying for a course
>   * Amazon.com membership update
>   * Red Cross blood drive announcement
>
> Which is 5 out of 16000; but I have to admit that even missing 4 out of
> these 5 would not have been too costly.

I don't think any scheme can afford to throw msgs away entirely.  What I
hope instead is that a middle ground can shuffle unclear msgs into a "please
help me" folder (or two, if it's still valuable to record the "ham or spam?"
guess for these) where most mistakes live, and that any scheme tossing a msg
entirely try to notify the sender.  I personally would never use a scheme
that tosses msgs entirely, but that's just me.

Unless you create a lot of Yahoo accts, and have a lot of problems with your
Linux Journal subscriptions, and so on, it seems likely that the system just
won't get enough training examples to learn that they're OK for you.  A
whitelist might help, except it's hard to populate one without first
recognizing an FP from an unfortunate sender.

> The middle ground is amazingly empty! I'd almost want to set my cutoff
> at 0.99 or 0.995!

It's OK by me if you do <wink>.

> One thing that does bother me a bit is that some words have a very high
> correlation of co-existing in a message, and there is no way of finding
> this out. E.g. all the "bad jokes" I'm referring to in the attachment
> were sent by a friend of mine that uses a very strange way of
> forwarding by modifying the "From:" line:
>
>    From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter Stouten)
>
>
> Which results in the highly correlated:
>
> prob('from:pieter') = 0.00151566
> prob('message-id:@[158.117.170.103]') = 0.00306331
> prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183
> prob('from:stouten)') = 0.0115681
> prob('from:way') = 0.012894
> prob('from:(by') = 0.0167286

I don't know whether to call that a bug or a feature.  In this specific
example, I think I have to call it a feature:  the "bad joke" msgs appear to
confuse the system routinely, and this bundle of very low-spamprob words may
be all that's saving them from getting scores near 1.0.  There are a
significant number of my ham that are redeemed by this kind of thing too --
a well-known poster posting from a well-known address, but going on about
something that has nothing to do with the newsgroup.  Sucking out 8 distinct
clues about who they are and where they posted from helps them a *lot* in
these cases, even if all 8 come from the "From" line.
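
For illustration, here's one way a single header line can turn into a bundle of clues.  This is a sketch under the assumption that header tokens are made by lowercasing, splitting on whitespace, and tagging each piece with the header name; that's consistent with the tokens Rob listed, but the real tokenizer may well differ, and header_tokens is a made-up name.

```python
def header_tokens(name, value):
    """Split a header value on whitespace and tag each piece with the header name."""
    prefix = name.lower() + ":"
    return [prefix + word.lower() for word in value.split()]


tokens = header_tokens(
    "From",
    "callaway@indigo.picower.edu (David Callaway) (by way of Pieter Stouten)",
)
for tok in tokens:
    print(tok)
```

Every one of those tokens shows up whenever the others do, so the classifier sees several clues where a human sees one fact about the sender.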

If you turn on mine_received_headers, you'll also find that Neil goes out of
his way to present IP addr and machine-name info in multiple ways,
triggering the same kind of effect for "bad machines" and "bad networks".

So, overall, "this kind of thing" has appeared valuable to me.  OTOH, we've
been reduced to stripping all HTML tags else we get a mountain of
high-spamprob decorations (in legit HTML mail) that are nearly 100%
correlated but each of which counts as if it were a killer-good clue all by
itself.
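
A rough sketch of why that hurts.  The combining formula below is a Fisher-style chi-squared combination written from memory; it is not quoted from this message, so treat it (and the names chi2q and combine, and the toy probabilities) as assumptions.  The point is only that the combiner treats every clue as independent evidence, so ten copies of one correlated fact count ten times.

```python
import math


def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-clue spam probabilities into one score in [0, 1]."""
    if not probs:
        return 0.5
    n = len(probs)
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_s, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_h, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0


hammy_words = [0.1] * 4     # toy clues from the text of a legit msg
decoration = 0.99           # toy spamprob for one HTML decoration

print(combine(hammy_words + [decoration]))        # one decoration clue
print(combine(hammy_words + [decoration] * 10))   # ten correlated ones
```

With these toy numbers the first score lands on the ham side of 0.5 and the second lands well on the spam side, although the ten decorations carry little more information than one.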

So it's at best a mixed bag.  I don't know of a computationally cheap way to
take correlations into account, else I would have tried that before
resorting to stripping HTML tags (I hate throwing info away).