[Spambayes] First result from Gary Robinson's ideas

Wed, 18 Sep 2002 00:28:47 -0400

Sorry that I have to be telegraphic here -- no time to write up full
details, or even check in the code in a sane way for others to try it, and
I'll likely not get back to this at all until Thursday night.

This result comes from implementing just the first suggestion in

<http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html>

If you want to try it (and I sure hope someone does on a smaller test
corpus!), you need to change two places:

1. Tester.Test.predict:  change 0.90 to 0.50.  Somebody make that an
   .ini-file option?

2. classifier.GrahamBayes.spamprob:  Replace the final probability
   calculation with this:

        # P accumulates the product of (1 - prob), Q the product of prob.
        P = Q = 1.0
        num_clues = 0
        for distance, prob, word, record in nbest:
            if prob is None:  # it's one of the dummies nbest started with
                continue
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            if evidence:
                clues.append((word, prob))
            num_clues += 1
            P *= 1.0 - prob
            Q *= prob

        if num_clues:
            # Take the num_clues'th root of each product, then subtract from 1.
            P = 1.0 - P**(1./num_clues)
            Q = 1.0 - Q**(1./num_clues)
            prob = (P-Q)/(P+Q)  # in -1 .. 1
            prob = 0.5 + prob/2 # shift to 0 .. 1
        else:
            prob = 0.5  # no usable clues:  stay neutral
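
For anyone who wants to poke at the arithmetic outside the classifier, here
is a minimal standalone sketch of the same combining step.  The function
name robinson_score and the bare list of probabilities are assumptions made
for illustration; the real code walks nbest and updates killcount as shown
above.

```python
def robinson_score(probs):
    """Combine per-word spam probabilities into one score in 0 .. 1.

    Hypothetical helper for illustration: `probs` is a plain list of
    floats, each assumed to lie strictly between 0 and 1.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no clues at all: stay neutral

    prod_ham = 1.0   # running product of (1 - p)
    prod_spam = 1.0  # running product of p
    for p in probs:
        prod_ham *= 1.0 - p
        prod_spam *= p

    P = 1.0 - prod_ham ** (1.0 / n)
    Q = 1.0 - prod_spam ** (1.0 / n)
    score = (P - Q) / (P + Q)   # in -1 .. 1
    return 0.5 + score / 2.0    # shift to 0 .. 1


if __name__ == "__main__":
    print(robinson_score([0.99, 0.99, 0.01]))
    print(robinson_score([0.5, 0.5]))
```

Because each product goes through an n'th root before the two sides are
compared, adding more clues of similar strength does not by itself push the
score toward 0 or 1.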

A 10-fold cross-validation run against "my usual" monster corpus shows
almost no difference in results, but this isn't the interesting part of the
story <wink>:

"""
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.100  0.100  tied

won   0 times
tied  9 times
lost  1 times

total unique fp went from 4 to 5 lost   +25.00%
mean fp % went from 0.02 to 0.025 lost   +25.00%

false negative percentages
    0.218  0.218  tied
    0.364  0.364  tied
    0.000  0.000  tied
    0.218  0.218  tied
    0.218  0.218  tied
    0.291  0.291  tied
    0.218  0.218  tied
    0.145  0.145  tied
    0.291  0.291  tied
    0.073  0.073  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 28 to 28 tied
mean fn % went from 0.203636363636 to 0.203636363636 tied
"""

The *interesting* part of this story is the score histograms:

Ham distribution for all runs:
* = 311 items
  0.00 18608 ************************************************************
  2.50   301 *
  5.00   110 *
  7.50    54 *
 10.00    78 *
 12.50   177 *
 15.00   163 *
 17.50   140 *
 20.00    93 *
 22.50    53 *
 25.00    66 *
 27.50    44 *
 30.00    35 *
 32.50    22 *
 35.00    18 *
 37.50    12 *
 40.00    10 *
 42.50     6 *
 45.00     3 *
 47.50     2 *
 50.00     1 *
 52.50     0
 55.00     0
 57.50     0
 60.00     2 *
 62.50     0
 65.00     1 *
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     0
 92.50     0
 95.00     0
 97.50     1 *

Spam distribution for all runs:
* = 211 items
  0.00     2 *
  2.50     0
  5.00     0
  7.50     0
 10.00     0
 12.50     0
 15.00     0
 17.50     0
 20.00     0
 22.50     0
 25.00     1 *
 27.50     2 *
 30.00     1 *
 32.50     0
 35.00     1 *
 37.50     4 *
 40.00     3 *
 42.50     4 *
 45.00     6 *
 47.50     4 *
 50.00     4 *
 52.50     9 *
 55.00    11 *
 57.50    18 *
 60.00    42 *
 62.50    26 *
 65.00    41 *
 67.50    46 *
 70.00    55 *
 72.50    54 *
 75.00    59 *
 77.50    65 *
 80.00   160 *
 82.50   158 *
 85.00    61 *
 87.50    20 *
 90.00    11 *
 92.50    45 *
 95.00   233 **
 97.50 12604 ************************************************************

This is much more spread out than when using Graham's combining formula, and
has a useful "middle ground":  manual review of msgs with spamprob in [0.4,
0.6] would stop 1 of the false positives and 17 (of the 28 total) false
negatives.  There are also no cases where false positives or negatives have
insane "probabilities" like 1e-30 or 1.0000000000.  The sole very-high
scoring false positive has prob 0.989806241076, and is the fellow who added
one comment to a quote of an entire Nigerian scam.
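
The "middle ground" idea can be sketched as a three-way decision instead of
a single cutoff.  This is only an illustration:  the function name triage
and the labels are made up here, and the [0.4, 0.6] band is just the one
used in the count above, not a tuned setting.

```python
def triage(prob, lo=0.4, hi=0.6):
    """Sort a message into ham, spam, or a manual-review pile.

    Hypothetical helper: `lo` and `hi` bound the review band, inclusive.
    """
    if prob < lo:
        return "ham"
    if prob > hi:
        return "spam"
    return "unsure"  # falls in [lo, hi]: hand it to a human


if __name__ == "__main__":
    # Scores quoted in this message.
    for p in (0.0173583933026, 0.517172469699, 0.989806241076):
        print(p, triage(p))
```

A band like this only pays off when scores actually land in the middle,
which is what the histograms above show for this combining scheme.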

There are two very low-scoring false negatives.  One is the "Hello, my Name
is BlackIntrepid" spam mentioned a few days ago, which simply has no words
that even score above 0.05 for spamness (its prob is 0.0173583933026 now).
The other is an extremely long base64-encoded spam that suffers from
"cancellation disease".  (Gary:  because Graham clamps the word probs to lie
within [0.01, 0.99], we sometimes get messages with hundreds of words at
each extreme.  We changed his algorithm to do better than flipping a coin
<0.5 wink> when this happens, but it remains a challenge.  Your suggestion
for normalizing probabilities would stop this, although I won't know whether
it works better or worse than what we've got now until I can make time to
code and test it.)
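
To make the cancellation concrete, here is a small sketch of Graham's
combining rule as it is usually stated, prod(p) / (prod(p) + prod(1-p)), fed
an equal number of clamped 0.01 and 0.99 clues.  The function name and the
inputs are made up for illustration, and this is not the modified code the
project actually runs.  It needs Python 3.8 or later for math.prod.

```python
import math


def graham_score(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p)).

    Illustrative only; assumes few enough clues that the products
    do not underflow to zero.
    """
    prod_spam = math.prod(probs)
    prod_ham = math.prod(1.0 - p for p in probs)
    return prod_spam / (prod_spam + prod_ham)


if __name__ == "__main__":
    balanced = [0.01] * 50 + [0.99] * 50
    print(graham_score(balanced))           # very close to 0.5
    print(graham_score(balanced + [0.99]))  # very close to 0.99
```

The hundred extreme clues wipe each other out, so they say no more than
having no clues at all, while a single extra clue swings the score all the
way to its clamp.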

Anyway, on my large corpus this looks very much worth pursuing, as the
results are as good but the numbers it produces "feel" much more like actual
probabilities.

The one new false positive that snuck into this is a revival of an old
friend, and it just barely scored over 0.50:

************************************************************************
Data/Ham/Set6/24252.txt
prob = 0.517172469699
prob('url:rpm') = 0.01
prob('header:Organization:1') = 0.01
prob('url:fi') = 0.01
prob('url:linux') = 0.0107383
prob('header:Errors-To:1') = 0.0200348
prob('header:Message-ID:1') = 0.364341
prob('x-mailer:none') = 0.388829
prob('header:Date:1') = 0.471899
prob('header:To:1') = 0.489291
prob('header:Subject:1') = 0.495598
prob('header:From:1') = 0.496521
prob('url:phtml') = 0.523918
prob('url:www') = 0.55991
prob('url:com') = 0.654736
prob('url:net') = 0.752554
prob('url:html') = 0.785737
prob('url:links') = 0.897196
prob('url:es') = 0.951256
prob('url:index') = 0.95655
prob('url:pid') = 0.99
prob('url:1110') = 0.99
prob('url:id') = 0.99

From: "agc" <agc@redestb.es>
Newsgroups: comp.lang.python
Subject: enlaces
Date: Thu, 17 Feb 2000 17:36:18 +0100
Organization: Iddeo - Retevisión
Lines: 5
Message-ID: <88h8gg$1vnl84@SGI3651ef0.iddeo.es>
NNTP-Posting-Host: 62.82.233.22
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.00.2014.211
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2014.211
Path:
news!ffx.uu.net!uunet!ams.uu.net!do.de.uu.net!newsfeed01.sul.t-online.de!new
sfeed00.sul.t-online.de!t-online.de!bignews.mediaways.net!newsfeed.nettuno.i
t!server-b.cs.interbusiness.it!news-1.retevision.es!news.iddeo.es!not-for-ma
il
Xref: news comp.lang.python:84746
To: python-list@python.org
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 1.2 (experimental)
Precedence: bulk
List-Id: General discussion list for the Python programming language
        <python-list.python.org>

http://www.noguska.net/linux/rpm/rpm-HOWTO.html
http://www.ceu.fi.udc.es/GPUL/links/
http://www.demasiado.com/index.phtml?id=1110&pid=informática
************************************************************************

Note that there's so little to go on here that it's reduced to taking the
mere presence of a Subject header as "a clue".