[Spambayes] Cunning use of quoted-printable

Tim Peters tim.one@comcast.net
Wed, 02 Oct 2002 23:30:16 -0400


[Richie Hindle, continuing to unravel the mystery of the now-it-is,
 now-it-ain't false positive]

> You're right.  Where 'richie.pickle' is my full ~4000-message database:

Ah!  If that's really been trained on all your msgs, then in particular it's
been trained on the very message you're predicting against.  The test
drivers are careful never to do that (unless two msgs happen to have
identical content, in which case that's fine -- if that's what real life
looks like, it's not cheating to exploit it).
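
For concreteness, the hold-out rule is easy to sketch in code.  This is a
toy version only (timcv.py's real fold bookkeeping, options and reporting
are much richer), and it assumes the classifier exposes the learn() and
spamprob() methods used in the session below:

    # Score fold i using a classifier trained on every *other* fold.
    # A sketch only -- not timcv.py.
    def score_fold(folds, i, make_classifier):
        # folds is a list of (ham_token_streams, spam_token_streams) pairs.
        bayes = make_classifier()
        for j, (hams, spams) in enumerate(folds):
            if j == i:
                continue                  # never train on the fold under test
            for toks in hams:
                bayes.learn(toks, False)  # False == ham
            for toks in spams:
                bayes.learn(toks, True)   # True == spam
        # Everything scored here is a msg the classifier has never seen.
        test_hams, test_spams = folds[i]
        return [bayes.spamprob(toks) for toks in test_hams + test_spams]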

> >>> import cPickle, pprint, tokenizer, classifier
> >>> from Options import options
> >>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
> >>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
> >>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
> >>> print options.spam_cutoff, score
> 0.56 0.402748505794
> >>> pprint.pprint( clues )
> [('header:Received:5', 0.13592289441927),
>  ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),

Let's pause here and ponder.  Earlier you said you believed this was the
only msg with ISO encodings in the Subject/From lines.  Suppose that's true.
Then you've trained on exactly one message (this one) producing (among
others) "word"

    'from:email addr:biglobe.ne.jp>'

The estimated *from counting* probability that a message containing this
word is spam is then exactly 0.0 (you've seen it once, and only in ham).
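
If you want that spelled out, the counting estimate amounts to this (a
sketch of the idea, not classifier.py verbatim):

    # Ratio-style counting estimate, sketched.  hamcount/spamcount are how
    # many trained ham/spam msgs contained the word; nham/nspam are the
    # total number of trained ham/spam msgs.
    def counting_prob(hamcount, spamcount, nham, nspam):
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        return spamratio / (hamratio + spamratio)

    # Seen once, and only in ham:  0.0 no matter how big the corpora are.
    print counting_prob(1, 0, 2000, 2000)   # -> 0.0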

Then Gary's Bayesian probability adjustment is applied, to account for how
much evidence you've got in favor of "the true" spamprob being 0.0:

    s*x + n*p
    ---------
       s+n

The default prior-belief strength (s) is 0.45, the default unknown-word prob
(x) is 0.5, the counting probability estimate (p) is 0 (as above), and the
total evidence (n -- the number of messages containing this word) is 1.  So
the adjusted spamprob is

     0.45*0.5 + 1*0     0.225
     -------------- =   ----- = 0.15517241379310345
          0.45+1        1.45

And that's exactly the prob shown on the line above, so we can be pretty
certain that your database was in fact trained on this msg.
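
In code, the adjustment is just this (a sketch of the formula above, with
the defaults plugged in):

    # Gary's adjustment, using the default prior strength s and
    # unknown-word prob x named above.
    def adjusted_spamprob(p, n, s=0.45, x=0.5):
        return (s*x + n*p) / (s + n)

    print repr(adjusted_spamprob(0.0, 1))   # -> 0.15517241379310345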

>  ('from:email name:<rxmx7x5x1', 0.15517241379310345),

And ditto for all the other words in your corpus unique to this msg.  Since
there are several of them, and they all have low spamprob, the overall score
favors ham.  That's not too surprising considering the system was already
*told* it was ham.

>  ('from:skip:= 30', 0.15517241379310345),
>  ('message-id:@biglobe.ne.jp', 0.15517241379310345),
>  ('subject:2022', 0.15517241379310345),
>  ('subject:IBskQiMxGyhC', 0.15517241379310345),
>  ('charset:us-ascii', 0.26241865802854009),
>  ('content-type:text/plain', 0.34572203385342953),
>  ('subject:ISO', 0.35151428063116696),
>  ('header:Message-Id:1', 0.64496476638361089),
>  ('x-mailer:none', 0.67584084707587),
>  ('subject:=?', 0.69778644753001717),
>  ('subject:?=', 0.7215916912471283),
>  ('unsubscribe', 0.93148161126231199)]
> >>>
>
> But running in the test environment, which uses the same 4000 messages
> (subject to a couple of hundred extras being shuffled around by
> rebal.py), I get this:
>
> > python timcv.py -n10 --ham=200 --spam=200 -s1

As at the start, timcv never predicts against a message that the classifier
has been trained on.  It would be a very much weaker test if it ever did so,
and the example we're discussing here shows why.  In the test environment,
then, *all* the words unique to this message have never been seen in the
msgs the classifier was trained on, and so they all get the "unknown word"
spamprob, 0.5.  Then they're ignored completely, because the default
robinson_minimum_prob_strength is 0.1, which ignores all words with spamprob
in 0.4 thru 0.6.
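
A sketch of that filter (not the scorer's literal code), using a couple of
the clue values above:

    # Clues whose spamprob is within 0.1 of the unknown-word prob 0.5 are
    # dropped before scoring, so unknown words (prob exactly 0.5) vanish.
    min_strength = 0.1      # default robinson_minimum_prob_strength
    clues = [('subject:2022', 0.5),                   # unknown in the test run
             ('unsubscribe', 0.93148161126231199)]    # seen, and spammy
    strong = [(w, p) for (w, p) in clues if abs(p - 0.5) >= min_strength]
    print strong   # only 'unsubscribe' survives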

> [snip]
> -> <stat> 1 new false positives
>     new fp: ['Data/Ham/Set4/1641']
> ******************************************************************************
> Data/Ham/Set4/1641
> prob = 0.581295852793
> prob('header:Received:5') = 0.141997
> prob('charset:us-ascii') = 0.26578
> prob('content-type:text/plain') = 0.346687
> prob('header:Message-Id:1') = 0.648679
> prob('x-mailer:none') = 0.674625
> prob('subject:=?') = 0.775229
> prob('subject:?=') = 0.908163
> prob('unsubscribe') = 0.928485

Without those other clues, the best judgment it can make is that it's spam.
This is also why the system needs to be trained over time!  It can only know
what it's been taught.
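
If you want to see the mechanics of that, here's a sketch of Gary
Robinson's combining scheme -- the spirit of how clue probabilities get
merged into one score; the project's exact combining code has shifted over
time, so treat the numbers it produces as illustrative only:

    # Robinson-style combining, sketched.  A batch of hammy clues near 0.155
    # drags the result well below 0.5; strip those clues out and the few
    # spammy ones (subject:=?, subject:?=, unsubscribe, ...) win instead.
    def combine(probs):
        n = len(probs)
        P = Q = 1.0
        for p in probs:
            P *= 1.0 - p
            Q *= p
        P = 1.0 - P ** (1.0 / n)    # near 1 when the clues look spammy
        Q = 1.0 - Q ** (1.0 / n)    # near 1 when the clues look hammy
        S = (P - Q) / (P + Q)
        return (1.0 + S) / 2.0      # 0.0 = ham-like, 1.0 = spam-like

Feeding it the two clue lists above won't reproduce 0.402748505794 and
0.581295852793 digit-for-digit, but it shows the direction:  the extra
low-prob clues in the pickle run pull the combined score under the cutoff.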

Very brief subscribe/unsubscribe msgs have been a problem in my data too,
probably more so than in yours:  such msgs don't belong on c.l.py at all,
and they're really quite rare there.  That prevents subscribe/unsubscribe from getting
milder spamprobs no matter how much c.l.py data I train them on.  But if you
get a non-trivial number of these, the system will act differently for your
data, over time.

> From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997
> [snip]
>
> What's going on??  Far fewer clues in the test environment

Right -- but what that really shows is that the test environment isn't
cheating, so that's a Good Thing.

> (and my other false positive prints 67 of them, so it's not a
> display issue).
>
> I have a bayescustomize.ini like this:
>
> [TestDriver]
> best_cutoff_fp_weight = 10
> nbuckets = 100
>
> which I guess shouldn't have any effect on this at all.

Right again, none at all -- they merely affect the histogram display.  The
only [TestDriver] option that can affect results is spam_cutoff, and even
that has no effect on scores.
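
For completeness, spam_cutoff only comes into play at the very end, when a
score is turned into a ham-or-spam call -- a one-liner, give or take how the
boundary case is treated:

    # spam_cutoff never changes a score, only how the score is judged.
    spam_cutoff = 0.56
    score = 0.581295852793          # the test run's score for Set4/1641
    print score >= spam_cutoff      # True -> counted as a false positive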