[Spambayes] Spam at hackers conference

Tim Peters tim.one@comcast.net
Sun Nov 3 05:47:37 2002


---------------------- multipart/mixed attachment
[Tim, sketches the CRM114 algorithm]
> ...
> Fiddling our codebase to [try] something like it wouldn't be hard.

Proof attached.  Like the docs say, nothing is sacred here, and if that
algorithm works better, great, we can go home early <wink>.

The attached patches classifier.py to do CRM114 HOH generation and scoring
by default.  The token hash is Python's string hash, which is a better hash
than CRM114 uses.  The 16 HOH hashes used here are, I believe, identical to
the ones CRM114 uses; I grit my teeth at this because they don't appear to
be good HOH functions, but let's let that pass.

This runs very much slower and requires a lot more memory than what we're
using now.  OTOH, the memory use is bounded no matter how much training data
there is, due to layers of many-to-one hash mappings.  If this scheme
becomes serious here, recoding the scoring in C would seem necessary for
both speed and memory efficiency (I'm already playing obscure speed and
memory reduction tricks here, but they don't help enough).  The patch
doesn't change tokenization at all, although I believe Bill preserves case
when tokenizing, neither skips short words nor meta-tokenizes long
words, and doesn't do any of our "fancy" tokenization gimmicks (a note on
the project site suggests that he'll start decoding base64, because that's
been a problem with the scheme; we are decoding base64, of course).  The
patch also doesn't clamp counts to 1-byte values, although I doubt that
played a role here (later: unclear!).
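
To make the "bounded memory" point concrete, here is a minimal sketch of
count tables whose size is fixed up front, so that training on more data
only adds collisions, never memory.  This is not the data structure in the
attached patch: the class name BoundedCounts, the nbuckets default, and the
optional clamp argument are all hypothetical, and zlib.crc32 stands in for
the real hash functions purely because it is stable from run to run.

```python
import zlib
from array import array


class BoundedCounts:
    """Fixed-size ham/spam count tables indexed by a many-to-one hash."""

    def __init__(self, nbuckets=1 << 16, clamp=None):
        self.nbuckets = nbuckets
        # clamp=255 would mimic 1-byte counts; None means no clamping.
        self.clamp = clamp
        self.spam = array("L", [0]) * nbuckets
        self.ham = array("L", [0]) * nbuckets

    def _slot(self, feature):
        # Assumption: crc32 is a stand-in hash, not the one CRM114 uses.
        return zlib.crc32(feature.encode("utf-8")) % self.nbuckets

    def learn(self, features, is_spam):
        table = self.spam if is_spam else self.ham
        for f in features:
            i = self._slot(f)
            if self.clamp is None or table[i] < self.clamp:
                table[i] += 1

    def counts(self, feature):
        """Return (spamcount, hamcount) for the bucket this feature maps to."""
        i = self._slot(feature)
        return self.spam[i], self.ham[i]
```

Distinct features that land in the same bucket share a count, which is the
price paid for the fixed footprint.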

If you try this, set ham_cutoff and spam_cutoff to 0.5 (later: also unclear
what to do here).  It's just comparing raw counts, and the bigger count
wins.  The score returned here is

    S/(S+H)

where

    S is the sum of the ~16*N HOH spamcounts
    H is the sum of the ~16*N HOH hamcounts

So < 0.5 means S was smaller, and > 0.5 means S was larger.
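
As a rough illustration of that formula (a sketch, not the patch's code),
the score can be computed from two count mappings.  The function names, the
use of plain dicts, and the choice to return 0.5 when no feature has been
seen at all are assumptions made for the example.

```python
def score_from_totals(S, H):
    """Return S/(S+H); 0.5 when there is no evidence (an assumption)."""
    if S + H == 0:
        return 0.5
    return S / (S + H)


def crm_style_score(features, spamcount, hamcount):
    """Sum raw counts over a message's features, then compare them.

    `features` is an iterable of hashable feature keys; `spamcount` and
    `hamcount` are dicts mapping a feature to its raw training count.
    """
    S = sum(spamcount.get(f, 0) for f in features)
    H = sum(hamcount.get(f, 0) for f in features)
    return score_from_totals(S, H)
```

For instance, score_from_totals(173900, 59619) comes out at about 0.7447,
using the '*S*' and '*H*' totals reported for the highest-scoring ham
further down in this message.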

On a python.org email test I was running anyway, the results weren't
stellar:

filename:       base1        crm
ham:spam:    2741:948   2741:948
fp total:           5          2
fp %:            0.18       0.07
fn total:           2        271
fn %:            0.21      28.59
unsure t:          66          0
unsure %:        1.79       0.00
real cost:     $65.20    $291.00
best cost:     $21.40    $177.00
h mean:          0.84      18.28
h sdev:          6.21       6.17
s mean:         98.05      64.63
s sdev:          9.10      23.53
mean diff:      97.21      46.35
k:               6.35       1.56

This isn't a big test, but it bloated to 100MB and took so long I killed it
once suspecting a hang (it wasn't hung, so I got to start over again
<wink>).

However,

1. If there's a usable middle ground here, setting both cutoffs to 0.5
   can't reflect that.  Still, the run was done with nbuckets=200, which
   gave the automated histogram analysis a lot of resolution to play
   with, and the best-cost crm value was $177.00, 8x worse than the best-
   cost "before" value (deduced from the same nbuckets value).

2. It occurs to me that *because* it's just scoring by comparing raw
   counts, it's probably crucial to train on an equal number of ham
   and spam.  That there was 3x as much ham in this test made it much
   easier to get high raw hamcounts than high raw spamcounts.  That may
   (or may not) explain the bulk of the huge FN rate.
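
A toy calculation shows the effect in point 2.  Assume (purely for
illustration) a feature that carries no information at all, appearing in
the same fraction of ham and of spam; the function name and the 10% rate
are made up.

```python
def neutral_feature_score(n_ham, n_spam, rate=0.10):
    """Score S/(S+H) for a feature seen in `rate` of ham and of spam alike."""
    hamcount = n_ham * rate    # expected raw count from ham training
    spamcount = n_spam * rate  # expected raw count from spam training
    return spamcount / (spamcount + hamcount)


# With this test's 2741 ham and 948 spam, an uninformative feature
# already leans towards ham; with balanced training it sits at 0.5.
skewed = neutral_feature_score(2741, 948)
balanced = neutral_feature_score(2000, 2000)
```

Here skewed is 948/3689, roughly 0.26, while balanced is 0.5, so under raw
count comparison the training ratio alone pushes scores below the cutoff.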

OK, doing a 10-fold cross-validation run across 2000 random ham and 2000
random spam, but the same random sets for "before" and "after":

filename:      before        crm
ham:spam:   2000:2000  2000:2000
fp total:           1       1604
fp %:            0.05      80.20
fn total:           0          0
fn %:            0.00       0.00
unsure t:          20          0
unsure %:        0.50       0.00
real cost:     $14.00  $16040.00
best cost:      $2.00    $228.00
h mean:          0.55      53.54
h sdev:          4.50       5.30
s mean:         99.91      71.40
s sdev:          1.64       6.84
mean diff:      99.36      17.86
k:              16.18       1.47
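
For anyone wanting to repeat this kind of comparison, one way to get "the
same random sets" for both runs is to draw the sample from a seeded
generator and then rotate through ten folds.  This is only a sketch of the
idea, not the test driver used for the numbers above; the function names
and the seed are hypothetical.

```python
import random


def pick_fixed_sample(items, k, seed):
    """Pick k items reproducibly, so 'before' and 'after' see the same set."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(items), k)


def ten_fold(items, nfolds=10):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    folds = [items[i::nfolds] for i in range(nfolds)]
    for i in range(nfolds):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test
```

With 2000 messages per class, each of the ten rounds trains on 1800 and
scores the remaining 200.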

Well, that was a disaster.  My guess:  since virtually all ham contains
strong spam words, 0.5 is a lousy value for spam_cutoff.

For crm:

-> <stat> Ham scores for all runs: 2000 items; mean 53.54; sdev 5.30
-> <stat> min 24.6294; median 54.3869; max 74.4693
-> <stat> percentiles: 5% 43.7643; 25% 51.1123; 75% 56.8845; 95% 60.2207

-> <stat> Spam scores for all runs: 2000 items; mean 71.40; sdev 6.84
-> <stat> min 50; median 69.805; max 96.6838
-> <stat> percentiles: 5% 63.4695; 25% 66.6684; 75% 74.2597; 95% 86.1775

-> best cost for all runs: $228.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.61 & 0.695
->     fp 1; fn 17; unsure ham 63; unsure spam 942
->     fp rate 0.05%; fn rate 0.85%; unsure rate 25.1%
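
The cost figures can be checked by hand from the per-error prices quoted
above.  The sketch below does the arithmetic in whole cents to avoid
floating-point fuzz; the function names are made up, and the exact boundary
handling in classify (strictly below ham_cutoff is ham, at or above
spam_cutoff is spam) is an assumption.

```python
# Prices from the analysis output above, in cents.
FP_CENTS, FN_CENTS, UNSURE_CENTS = 1000, 100, 20


def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision on a score in [0, 1]."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def total_cost(fp, fn, unsure):
    """Total cost in dollars for the given error counts."""
    return (fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS) / 100


# Best-cost line: fp 1; fn 17; unsure ham 63; unsure spam 942.
best = total_cost(1, 17, 63 + 942)
```

That gives 10 + 17 + 201 = $228.00, matching the reported best cost.  With
both cutoffs set to 0.5 there is no unsure band at all, so the 1604 false
positives of the cross-validation run cost $16040.00 on their own.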

The highest-scoring ham is one our unigram scheme would never call spam;
since 0.75 is very near the spam 75th-percentile score, you'd have to call
about 25% of all spam "unsure" to avoid calling this spam (and, indeed, the
automated histogram analysis found its best-cost value at an unsure rate of
25.1%):

"""
Data/Ham/Set4/128466.txt
prob = 0.744693151307
prob('*H*') = 59619
prob('*S*') = 173900

Received: from [80.17.80.215] (helo=veronika.quadrante.com)
        by mail.python.org with smtp (Exim 3.21 #1)
        id 16dCZB-0000Op-00
        for python-list@python.org; Tue, 19 Feb 2002 10:52:29 -0500
Received: (qmail 29664 invoked by uid 64014); 19 Feb 2002 16:14:13 -0000
Received: from abottoni@quadrante.com by veronika
        by uid 64011 with qmail-scanner-1.10 (uvscan: v4.1.40/v4121. .
        Clear:0. Processed in 0.341367 secs); 19 Feb 2002 16:14:13 -0000
Received: from unknown (HELO backup.quadrante.com) (80.17.80.210)
  by 80.17.80.215 with SMTP; 19 Feb 2002 16:14:13 -0000
Message-Id: <5.1.0.14.0.20020219163858.00a901a8@veronika.quadrante.com>
X-Sender: abottoni@veronika.quadrante.com
X-Mailer: QUALCOMM Windows Eudora Version 5.1
Date: Tue, 19 Feb 2002 16:56:25 +0100
To: python-list@python.org
From: Alessandro Bottoni <abottoni@quadrante.com>
Subject: Python-based "Portal System"?
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.0.8 (101270)
Precedence: bulk
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=subscribe>
List-Id: General discussion list for the Python programming language
        <python-list.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>

Most likely, all of you know a number of open source, pre-built "portal
systems", like ezPublish (http://developer.ez.no), PHPNuke
(www.phpnuke.org), PostNuke (http://www.postnuke.com), Midgard
(http://www.midgard-project.org/) and so on.

Does anybody know if exists a Portal System like those, written in Python?

Thanks in advance

Alessandro Bottoni

PS: I know about Zope (http://www.zope.org) and WebWare
(http://webware.sourceforge.net), already...
"""

The 2nd-highest-scoring ham is due to our own Skip, and is at least as
mysterious:

"""
Data/Ham/Set2/146718.txt
prob = 0.693046527054
prob('*H*') = 80982
prob('*S*') = 182843

Received: from exim by mail.python.org with spamc (Exim 4.02)
        id 17DRZM-0005Xn-00
        for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400
Received: from 12-248-41-177.client.attbi.com ([12.248.41.177])
        by mail.python.org with esmtp (Exim 4.02)
        id 17DRZM-0005Xg-00
        for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400
Received: (from skip@localhost)
        by 12-248-41-177.client.attbi.com (8.11.6/8.11.6) id g4UFAPD25155;
        Thu, 30 May 2002 10:10:25 -0500
From: Skip Montanaro <skip@pobox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15606.16608.916236.657101@12-248-41-177.client.attbi.com>
Date: Thu, 30 May 2002 10:10:24 -0500
To: "David LeBlanc" <whisper@oz.net>
Cc: "Jeff Shannon" <jeff@ccvcorp.com>, <python-list@python.org>
Subject: RE: Crashing IDLE
In-Reply-To: <GCEDKONBLEFPPADDJCOEAEHODFAA.whisper@oz.net>
References: <MPG.175f041fd9efc1f09896e3@news.nwlink.com>
        <GCEDKONBLEFPPADDJCOEAEHODFAA.whisper@oz.net>
X-Mailer: VM 6.96 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid
Reply-To: skip@pobox.com
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.0.11 (101270)
Precedence: bulk
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=subscribe>
List-Id: General discussion list for the Python programming language
        <python-list.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>

    David> I would consider that a bug - "pass" should be checking for
    David> ctrl-c and other events imo. It sure strikes me as a point for
    David> relinquishing control.

It will relinquish control to another thread and sense KeyboardInterrupt.
If your app is not threaded though, Tk will never get control so it can
process its event queue.  That's what fills up.

--
Skip Montanaro (skip@pobox.com - http://www.mojam.com/)
Boycott Netflix - they spam - http://www.musi-cal.com/~skip/netflix.html

"""

Perhaps CRM114's one-byte count clamps are needed to prevent insane scores
(a form of bias acting against the extreme HOH correlation), or perhaps one
of the hash reductions mapped "control" to "big penis", or ... who knows?
If someone wants to pursue this (I've seen enough <wink>), it would be a lot
more interesting now to download CRM114 and run it the way the author
intended.

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: crm.patch
Type: application/octet-stream
Size: 11203 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021103/5cf1f41f/crm.exe

---------------------- multipart/mixed attachment--