From skip@pobox.com Tue Oct 1 00:00:32 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 30 Sep 2002 18:00:32 -0500
Subject: [Spambayes] mining dates?
In-Reply-To: <200209302241.g8UMfgJ08118@localhost.localdomain>
References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> <200209302241.g8UMfgJ08118@localhost.localdomain>
Message-ID: <15768.55184.227888.264009@12-248-11-90.client.attbi.com>

>> It didn't prove my hypothesis, but may have exposed something as
>> useful. Spam seems to be sent at a fairly constant rate throughout
>> the day, which stands to reason, since it's probably all sent
>> automatically. However, ham definitely seems to be sent
>> predominantly during waking hours (doh!). I'm going to give a little
>> date mining a try.

Anthony> Interesting. I'm not sure it actually buys that much, timezones
Anthony> being what they are. Unless you have evidence that, say, all
Anthony> spam is actually sent by a small team of Belgians, in which
Anthony> case we can just knock out stuff sent during business hours in
Anthony> belgian standard time.

That's why I simply ignored the timezone offset. The points plotted were in local time. As I mentioned in my mail, spam seems to be sent at all hours of the day and night. If anything, a small hamminess would be attributed to messages sent during waking hours.

Skip

From skip@pobox.com Tue Oct 1 00:16:12 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 30 Sep 2002 18:16:12 -0500
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
Message-ID: <15768.56124.22371.659117@12-248-11-90.client.attbi.com>

I figured out why the false positive I saw was interpreted as text. I had been incorrectly forwarding mail from the itineraries@mojam.com command processor alias (for probably five years or more). This wasn't a big deal in the past because I am the only person who receives such messages, but it was incorrect nonetheless.
Instead of sending the original message out with Resent-*: headers prepended, I sent a new message with the original message as the body, e.g.: From itin@manatee.mojam.com Tue Sep 24 15:34:42 2002 Return-Path: Received: from manatee.mojam.com (localhost [127.0.0.1]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id g8OKYf0F013847 for ; Tue, 24 Sep 2002 15:34:41 -0500 Received: (from itin@localhost) by manatee.mojam.com (8.12.1/8.12.1/Submit) id g8OKYfxH013839; Tue, 24 Sep 2002 15:34:41 -0500 Message-Id: <200209242034.g8OKYfxH013839@manatee.mojam.com> From: itin@manatee.mojam.com To: skip@mojam.com Subject: New Itinerary: "nancy fly artist's tour dates" from mg@nflyagency.com Date: Tue, 24 Sep 2002 15:34:41 -0500 Return-Path: Received: from txsmtp02.texas.rr.com (smtp2.texas.rr.com [24.93.36.230]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id g8OKYG0F013791 for ; Tue, 24 Sep 2002 15:34:17 -0500 Received: from [192.168.0.4] (cs24342-228.austin.rr.com [24.243.42.228]) by txsmtp02.texas.rr.com (8.12.5/8.12.2) with ESMTP id g8OKXano027834; Tue, 24 Sep 2002 16:33:36 -0400 (EDT) User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.02.2022 Date: Tue, 24 Sep 2002 15:33:56 -0500 Subject: Nancy Fly Artist's Tour Dates From: Martha Guthrie To: Tour Date Recipients Message-ID: Mime-version: 1.0 Content-type: multipart/mixed; boundary="MS_Mac_OE_3115726436_766524_MIME_Part" > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ... I just fixed that piece of code over the weekend. Since I won't be getting any new mail like the above note in the future, I suppose I should purge them from my collection or adjust those messages to have the correct format. So, should I pull the generate_long_skips option back out? 
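The fix Skip describes (resending the original message with Resent-*: headers rather than wrapping it as the body of a brand-new message) can be sketched with the stdlib email package. The addresses below are illustrative, not the real mojam.com plumbing, and note one glossed-over detail: Message.__setitem__ appends headers rather than literally prepending them as RFC 2822 envisions.

```python
# Sketch of the corrected forwarding: keep the original message intact
# and add Resent-* headers, instead of burying it in the body of a new
# message where its headers look like ordinary text to a tokenizer.
from email import message_from_string
from email.utils import formatdate, make_msgid

def resend(raw_message, resent_to):
    msg = message_from_string(raw_message)
    # RFC 2822 section 3.6.6: resent fields are added alongside the
    # original From:/To:/Date:, which stay untouched.
    msg["Resent-From"] = "itin@manatee.mojam.com"  # illustrative address
    msg["Resent-To"] = resent_to
    msg["Resent-Date"] = formatdate()
    msg["Resent-Message-ID"] = make_msgid()
    return msg
```

The point for the classifier is that the forwarded message's original headers stay headers, so they never masquerade as body text.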
Skip

From tim.one@comcast.net Tue Oct 1 01:27:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 20:27:41 -0400
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
In-Reply-To: <15768.56124.22371.659117@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> I figured out why the false positive I saw was interpreted as
> text. I had been incorrectly forwarding mail from the
> itineraries@mojam.com command processor alias (for probably five
> years or more). This wasn't a big deal in the past because I am
> the only person who receives such messages, but it was incorrect
> nonetheless. Instead of sending the original message out with
> Resent-*: headers prepended, I sent a new message with the
> original message as the body, e.g.:

[and the original headers "look like body text", ditto the MIME decorations]

> I just fixed that piece of code over the weekend. Since I won't
> be getting any new mail like the above note in the future, I suppose
> I should purge them from my collection or adjust those messages to
> have the correct format.

Out of curiosity, what percentage of your corpus consisted of such msgs? And were they all ham?

> So, should I pull the generate_long_skips option back out?

I'm neutral, but if you leave it in please change the comment (it's misleading now). I believe that whenever a skip token does some good, it's indicating a weakness in the tokenizer (this is nearly tautological: when skip does some good, it says there's useful info in "very long words"!). Over time, I hope people are inspired to find out just what good it is that we're getting by crudely summarizing via "skip" tokens, and extract it purposefully. An easy example is Asian spam, where the lack of whitespace ends up generating oodles of skip tokens (and '8bit%' tokens), but there must be a more effective way to generate useful tokens for that without bloating the database beyond reason.
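For reference, the "skip" tokens under discussion are a crude summary of overlong words. This is a from-memory sketch of what tokenizer.py does; the real code may differ in details such as the length cap and token spelling:

```python
# Words longer than max_word_size aren't used as tokens directly;
# they're crushed into a coarse summary recording only the first
# character and a rounded-down length bucket.
def tokenize_word(word, max_word_size=12):
    n = len(word)
    if n <= max_word_size:
        yield word
    else:
        # e.g. a 47-character run starting with 'x' becomes 'skip:x 40'
        yield "skip:%c %d" % (word[0], n // 10 * 10)
```

Whitespace-free Asian text tokenizes as a few enormous "words", which is why it generates so many of these.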
So I hope that skip-generation will eventually become worthless. From JasonR.Mastaler Tue Oct 1 01:38:38 2002 From: JasonR.Mastaler (JasonR.Mastaler) Date: Mon, 30 Sep 2002 18:38:38 -0600 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> Message-ID: Matt Sergeant writes: > I've been following this list on gmane.org for a while now (it's a > mail to nntp gateway for those interested in following multiple > technical mailing lists in a read-only fashion) Actually, Gmane is not read-only -- you can both read and post. -- (http://tmda.net/) From nas@python.ca Tue Oct 1 01:42:57 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 17:42:57 -0700 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <15768.56124.22371.659117@12-248-11-90.client.attbi.com> Message-ID: <20021001004256.GA27420@glacier.arctrix.com> Tim Peters wrote: > An easy example is Asian spam, where the lack of whitespace ends up > generating oodles of skip tokens (and '8bit%' tokens), but there must > be a more effective way to generate useful tokens for that without > bloating the database beyond reason. I tried generating 2 character-grams when has_highbit_char was true. I seem to recall that it worked okay. The bonus would be that there would be a limit of 2**16 of these tokens in the DB. Neil From skip@pobox.com Tue Oct 1 01:54:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 30 Sep 2002 19:54:20 -0500 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <15768.56124.22371.659117@12-248-11-90.client.attbi.com> Message-ID: <15768.62012.24430.856757@12-248-11-90.client.attbi.com> >> I just fixed that piece of code over the weekend. Since I won't be >> getting any new mail like the above note in the future, I suppose I >> should purge them from my collection or adjust those messages to have >> the correct format. 
Tim> Out of curiosity, what percentage of your corpus consisted of such
Tim> msgs? And were they all ham?

Of the current Data/{Ham,Spam}/Set* collection (2000 per side), three hams and 252 spams. Three types of mail get sent to itineraries@mojam.com: spam, legitimate (but unrecognized) submissions, and recognized legitimate submissions. I never see the last category, because the command processor dumps those to the correct files for later processing and forwards the rest to me. The vast majority of the other two classes of messages are spam.

>> So, should I pull the generate_long_skips option back out?

Tim> I'm neutral, but if you leave it in please change the comment (it's
Tim> misleading now).

Will do. Does this make sense?

    # If legitimate mail contains things that look like text to the
    # tokenizer, turning off this option may help (perhaps binary
    # attachments get 'defanged' by something upstream from this operation
    # and thus look like text), and should be an alert that perhaps the
    # tokenizer is broken.

Skip

From tim.one@comcast.net Tue Oct 1 02:49:09 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 21:49:09 -0400
Subject: [Spambayes] new option: generate_long_skips
In-Reply-To: <15768.54170.102815.684984@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> ...
> I notice it's suggesting an even lower cutoff now (0.375).
>
> Before:
>
> -> best cutoff for all runs: 0.4
> -> with weighted total 1*30 fp + 17 fn = 47
> -> fp rate 1.5% fn rate 0.85%
>
> After:
>
> -> best cutoff for all runs: 0.375
> -> with weighted total 1*35 fp + 7 fn = 42
> -> fp rate 1.75% fn rate 0.35%

It's suggesting that cutoff *if* what you want to do is minimize the total number of misclassified messages, without favoring errors of either kind. Most people here hate false positives more, and in that case you should set option best_cutoff_fp_weight (which defaults to 1) to how much more you hate fp than fn.
See the comments for that option in Options.py. You have such extreme overlap that you should also boost nbuckets up from its default 40; the resolution of the automated histogram analysis is limited by the number of buckets. From tim.one@comcast.net Tue Oct 1 03:16:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 30 Sep 2002 22:16:59 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: [Skip Montanaro] > ... > It didn't prove my hypothesis, but may have exposed something as useful. > Spam seems to be sent at a fairly constant rate throughout the day, > which stands to reason, since it's probably all sent automatically. > However, ham definitely seems to be sent predominantly during waking > hours (doh!). I'm going to give a little date mining a try. You have my encouragement, but are you talking about date mining or time mining? Date mining has hurt lots of folks, by giving good results for bogus reasons ("oops! that whole ham archive came from 1998, and none of my spam does"). So I suggest you *almost* stick to just time-of-day for now. Two extensions: 1. Day of week may also be interesting. I keep a hotmail account alive just to watch the spam pour in, and it definitely gets more spam on weekends. I speculate that the last 500 people to buy a CD of email addresses can't make time until the weekend to become an instant internet millionaire . 2. Greg Ward suggested two Date things SpamAssassin looks for: SPAM: * 1.6 -- Invalid Date: header (not RFC 2822) SPAM: * 2.7 -- Date: is 24 to 48 hours before Received: date If, OTOH, we were trying to distinguish email from Guido from the rest of our email, a great clue would be whether it came from Guido, but an even better one is whether his reply was sent before the original . 
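The kind of time-of-day / day-of-week features being proposed can be sketched with the stdlib's RFC 2822 date parser. The token spellings here are invented for illustration, not what the tokenizer actually emits, and, following Skip, the timezone offset is ignored so hours stay sender-local:

```python
import calendar
from email.utils import parsedate_tz

def date_tokens(date_header):
    parsed = parsedate_tz(date_header)
    if parsed is None:
        # SpamAssassin-style clue: an unparseable Date: header.
        return ["date:invalid"]
    year, month, day, hour = parsed[:4]
    # parsedate_tz leaves the weekday slot unfilled, so compute it.
    dow = calendar.weekday(year, month, day)  # 0 = Monday
    return ["time:%02d" % hour, "dow:%d" % dow]
```

Bucketing hours more coarsely (say, four 6-hour bins) would cut the number of distinct tokens if sparseness turns out to matter.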
From tim.one@comcast.net Tue Oct 1 03:22:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 30 Sep 2002 22:22:03 -0400 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: <20021001004256.GA27420@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > I tried generating 2 character-grams when has_highbit_char was true. In addition to, or in lieu of, generating skip tokens? > I seem to recall that it worked okay. The bonus would be that there > would be a limit of 2**16 of these tokens in the DB. Appreciated. I used to do character 5-grams in this case, and the database burden was significant. Plus results didn't get worse when I stopped doing n-grams altogether. Somebody want to try this on their corpus? 1. Current vs doing character 2-grams when has_highbit_char is true instead of generating skip tokens. 2. Current vs doing character 2-grams when has_highbit_char is true in addition to generating skip tokens. From nas@python.ca Tue Oct 1 04:21:00 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 20:21:00 -0700 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <20021001004256.GA27420@glacier.arctrix.com> Message-ID: <20021001032100.GA27892@glacier.arctrix.com> Tim Peters wrote: > [Neil Schemenauer] > > I tried generating 2 character-grams when has_highbit_char was true. > > In addition to, or in lieu of, generating skip tokens? In addition. > 1. Current vs doing character 2-grams when has_highbit_char is true > instead of generating skip tokens. 
Left is current: false positive percentages 0.000 0.000 tied 1.000 1.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.500 0.500 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied false negative percentages 0.000 0.000 tied 1.000 1.000 tied 1.000 1.000 tied 0.500 0.500 tied 1.500 1.500 tied 1.500 1.500 tied 0.500 0.500 tied 0.500 0.500 tied 1.000 1.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied ham mean ham sdev 27.66 27.62 -0.14% 8.52 8.51 -0.12% 26.51 26.47 -0.15% 8.75 8.79 +0.46% 25.82 25.76 -0.23% 7.92 7.91 -0.13% 27.03 27.00 -0.11% 8.22 8.28 +0.73% 26.95 26.88 -0.26% 8.21 8.26 +0.61% 29.23 29.19 -0.14% 9.28 9.27 -0.11% 27.25 27.20 -0.18% 8.15 8.16 +0.12% 26.89 26.83 -0.22% 7.88 7.89 +0.13% 27.02 26.93 -0.33% 9.02 8.99 -0.33% 26.63 26.57 -0.23% 7.20 7.18 -0.28% ham mean and sdev for all runs 27.10 27.05 -0.18% 8.38 8.39 +0.12% spam mean spam sdev 81.73 82.38 +0.80% 10.24 10.96 +7.03% 80.90 81.56 +0.82% 10.16 10.96 +7.87% 80.03 81.11 +1.35% 9.99 11.02 +10.31% 81.51 82.48 +1.19% 10.28 11.29 +9.82% 81.44 82.31 +1.07% 10.43 11.13 +6.71% 81.11 82.17 +1.31% 9.82 10.87 +10.69% 80.64 81.69 +1.30% 9.52 10.47 +9.98% 80.43 81.48 +1.31% 9.84 10.74 +9.15% 81.18 82.02 +1.03% 10.25 10.91 +6.44% 81.17 82.59 +1.75% 9.90 11.10 +12.12% spam mean and sdev for all runs 81.01 81.98 +1.20% 10.06 10.96 +8.95% ham/spam mean difference: 53.91 54.93 +1.02 > > 2. Current vs doing character 2-grams when has_highbit_char is true > in addition to generating skip tokens. 
Again, left is current: false positive percentages 0.000 0.000 tied 1.000 1.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.500 0.500 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied false negative percentages 0.000 0.000 tied 1.000 1.000 tied 1.000 1.000 tied 0.500 0.500 tied 1.500 1.500 tied 1.500 1.500 tied 0.500 0.500 tied 0.500 0.500 tied 1.000 1.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied ham mean ham sdev 27.66 27.66 +0.00% 8.52 8.52 +0.00% 26.51 26.52 +0.04% 8.75 8.79 +0.46% 25.82 25.82 +0.00% 7.92 7.92 +0.00% 27.03 27.06 +0.11% 8.22 8.28 +0.73% 26.95 26.96 +0.04% 8.21 8.25 +0.49% 29.23 29.23 +0.00% 9.28 9.28 +0.00% 27.25 27.26 +0.04% 8.15 8.16 +0.12% 26.89 26.89 +0.00% 7.88 7.88 +0.00% 27.02 27.02 +0.00% 9.02 9.02 +0.00% 26.63 26.63 +0.00% 7.20 7.20 +0.00% ham mean and sdev for all runs 27.10 27.10 +0.00% 8.38 8.39 +0.12% spam mean spam sdev 81.73 82.51 +0.95% 10.24 11.00 +7.42% 80.90 81.66 +0.94% 10.16 10.98 +8.07% 80.03 81.24 +1.51% 9.99 11.18 +11.91% 81.51 82.58 +1.31% 10.28 11.35 +10.41% 81.44 82.38 +1.15% 10.43 11.17 +7.09% 81.11 82.29 +1.45% 9.82 10.91 +11.10% 80.64 81.78 +1.41% 9.52 10.48 +10.08% 80.43 81.57 +1.42% 9.84 10.80 +9.76% 81.18 82.13 +1.17% 10.25 10.96 +6.93% 81.17 82.71 +1.90% 9.90 11.22 +13.33% spam mean and sdev for all runs 81.01 82.09 +1.33% 10.06 11.02 +9.54% ham/spam mean difference: 53.91 54.99 +1.08 From nas@python.ca Tue Oct 1 04:23:12 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 20:23:12 -0700 Subject: [Spambayes] mining dates? In-Reply-To: References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: <20021001032312.GB27892@glacier.arctrix.com> Tim Peters wrote: > 2. 
Greg Ward suggested two Date things SpamAssassin looks for:
>
> SPAM: * 1.6 -- Invalid Date: header (not RFC 2822)

Tried that. It didn't help my error rate so I mercilessly killed it.

Neil

From tim.one@comcast.net Tue Oct 1 04:47:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 23:47:38 -0400
Subject: [Spambayes] Central limit
In-Reply-To:
Message-ID:

[Tim]
> ...
> I made up a combination of "look at ratios" and "different cutoffs
> for different n" by iteratively staring at the errors and making
> stuff up.

It now appears that the "different cutoffs for different n" was just an accident based on the specific errors I stared at. Recall that the "certainty heuristic" was of the form:

    ratio = max(abs(zham / zspam), abs(zspam / zham))
    certain = ratio > cutoff

and then I went on to choose different cutoffs depending on n (n is the number of "extreme words" found in the msg, with a maximum of 50).

Here's an exhaustive account of all the times the log-central-limit code was wrong (meaning that abs(zham) < abs(zspam) but the msg was really spam, or that abs(zspam) < abs(zham) but the msg was really ham). This is segregated by n (the number of extreme words). For each n, a list of all ratios in the "but I was wrong" cases is given. The number in square brackets is the number of predictions made with this specific value of n. The number in curly braces is the percentage of incorrect predictions. So, for example, 35 times we did a prediction on a msg with 7 extreme words (that's a very short msg!). Twice the prediction was wrong (5.71% of 35), and in one of those cases ratio was 1.31, and in the other ratio was 1.72.
3: [36] {0.00%} 4: [21] {0.00%} 5: [14] {0.00%} 6: [22] {0.00%} 7: [35] {5.71%} 1.31 1.72 8: [42] {4.76%} 1.01 1.33 9: [72] {5.56%} 1.00 1.04 1.14 1.28 10: [123] {0.00%} 11: [129] {1.55%} 1.07 1.09 12: [123] {1.63%} 1.05 1.09 13: [131] {0.00%} 14: [169] {0.59%} 1.11 15: [180] {1.11%} 1.18 1.73 16: [232] {1.29%} 1.12 1.12 1.43 17: [315] {1.27%} 1.06 1.06 1.27 1.48 18: [344] {1.16%} 1.28 1.35 1.50 1.60 19: [333] {1.20%} 1.03 1.24 1.75 1.78 20: [375] {0.53%} 1.10 1.12 21: [448] {0.45%} 1.09 2.54 22: [492] {0.00%} 23: [535] {0.56%} 1.38 1.72 2.20 24: [604] {0.50%} 1.03 1.17 1.66 25: [638] {0.63%} 1.04 1.55 1.64 1.85 26: [594] {0.51%} 1.06 1.07 1.13 27: [676] {0.74%} 1.02 1.03 1.06 1.26 1.35 28: [789] {0.00%} 29: [811] {0.49%} 1.03 1.18 1.41 2.24 30: [763] {0.39%} 1.04 1.04 2.08 31: [805] {0.12%} 1.44 32: [787] {0.13%} 1.19 33: [763] {0.26%} 1.10 1.36 34: [764] {0.13%} 1.04 35: [822] {0.12%} 1.03 36: [796] {0.00%} 37: [819] {0.00%} 38: [947] {0.11%} 1.08 39: [907] {0.00%} 40: [873] {0.00%} 41: [877] {0.11%} 1.21 42: [1016] {0.00%} 43: [1005] {0.00%} 44: [1016] {0.00%} 45: [1003] {0.30%} 1.07 1.10 1.27 46: [1068] {0.09%} 1.24 47: [1019] {0.00%} 48: [1026] {0.10%} 1.15 49: [1056] {0.28%} 1.09 1.10 1.24 50: [63585] {0.07%} 1.02 1.02 1.02 1.03 1.03 1.04 1.04 1.04 1.05 1.05 1.05 1.06 1.06 1.08 1.09 1.09 1.09 1.10 1.10 1.11 1.11 1.12 1.13 1.14 1.17 1.17 1.18 1.18 1.18 1.19 1.19 1.19 1.20 1.21 1.25 1.27 1.27 1.29 1.30 1.30 1.40 1.44 1.48 1.52 1.56 1.63 Several things to note: 1. The error rate is generally lower the more words we've got to work with. 2. There are notable exceptions to that, but error rates are so low that a single message makes a large difference in error rate. 3. There doesn't appear to be any correlation between n and the maximum ratio "that works" for that n. 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8. 
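In code, the heuristic being evaluated amounts to the following sketch (the function name is made up; zham and zspam come from the central-limit machinery not shown, and neither is assumed to be exactly zero):

```python
def judge(zham, zspam, cutoff=1.8):
    # Predict whichever population's z-score is smaller in magnitude,
    # and call the prediction "certain" only when the two magnitudes
    # differ by more than the cutoff ratio.
    ratio = max(abs(zham / zspam), abs(zspam / zham))
    is_spam = abs(zspam) < abs(zham)
    return is_spam, ratio > cutoff
```

With cutoff fixed at 1.8 rather than varying by n, this is the scheme whose error counts are tallied below.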
If we were willing to accept half of 1 percent of 1 percent as an acceptable error rate for "certainty", a fixed cutoff of 1.8 would have caused 5 false negatives (sorry, you can't tell whether they're f-p or f-n from the above) in the region of certainty, and no false positives there: [overall results with a fixed ratio cutoff of 1.8] for all ham 45000 total certain 44830 99.622% (|zham| smaller and ratio > 1.8) wrong 0 0.000% unsure 170 0.378% (|zham| smaller and ratio <= 1.8) wrong 37 21.765% for all spam 45000 total certain 44563 99.029% (|zspam| smaller and ratio > 1.8) wrong 5 0.011% unsure 437 0.971% (|zspam| smaller and ratio <= 1.8) wrong 79 18.078% From tim.one@comcast.net Tue Oct 1 05:03:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 00:03:17 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <20021001032312.GB27892@glacier.arctrix.com> Message-ID: [Tim] >> 2. Greg Ward suggested two Date things SpamAssassin looks for: >> >> SPAM: * 1.6 -- Invalid Date: header (not RFC 2822) [Neil Schemenauer] > Tried that. It didn't help my error rate so I mercilessly killed it. Hmm. You generally chop off the lines revealing how large a test you're running, but from your total error rates in the last report: total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied it seems a safe bet that you're predicting against 200 messages per run. In that case, the smallest non-zero *change* in a one-run error rate you could possibly see is 0.5% (1 of 200 msgs), which essentially *is* your overall error rate. In other words, like me, you've reached the point where your corpus can no longer support measuring improvements reliably -- even if a solid but modest improvement were to be made, it's quite likely you couldn't measure it. 
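The resolution argument is just arithmetic: with N messages scored per run, one reclassified message is the smallest visible change in that run's error rate. A quick sketch:

```python
def min_detectable_change_pct(msgs_per_run):
    # One message flipping class is the finest granularity an error
    # rate measured over msgs_per_run messages can show.
    return 100.0 / msgs_per_run
```

At 200 messages per run the floor is 0.5%, which is already the size of the error rates being compared; a tenfold larger corpus would drop the floor to 0.05%.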
That leaves us staring at ham & spam means & sdevs, which are still good indicators of whether a change moves "in a good direction", but that isn't as exciting as watching error rates plummet. Moving to a larger corpus would help make your life more interesting again: sign up for more mailing lists .

From anthony@interlink.com.au Tue Oct 1 05:12:35 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 01 Oct 2002 14:12:35 +1000
Subject: [Spambayes] Central limit
In-Reply-To:
Message-ID: <200210010412.g914CZJ10661@localhost.localdomain>

>>> Tim Peters wrote
> and then I went on to choose different cutoffs depending on n (n is the
> number of "extreme words" found in the msg, with a maximum of 50).

What happens past 50?

extracting just the ones where it was "dead wrong"...

> 21: [448] {0.45%} 1.09 2.54
> 23: [535] {0.56%} 1.38 1.72 2.20
> 25: [638] {0.63%} 1.04 1.55 1.64 1.85
> 29: [811] {0.49%} 1.03 1.18 1.41 2.24
> 30: [763] {0.39%} 1.04 1.04 2.08

What's the plot of cutoff -vs- uncertain messages like? How do these relate?

> 2. There are notable exceptions to that, but error rates are so low
> that a single message makes a large difference in error rate.

Is there anything "magic" about those 5 fns? Were they the usual suspects? Does inspecting them by hand give any clues about other tokenisation clues that might have helped them? (e.g. if your corpus was sufficiently single-sourced that you could turn on all the disabled clue-extractors...)

> 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.

And all of those were fn, not fp.

Anthony

From nas@python.ca Tue Oct 1 05:25:32 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 30 Sep 2002 21:25:32 -0700
Subject: [Spambayes] mining dates?
In-Reply-To:
References: <20021001032312.GB27892@glacier.arctrix.com>
Message-ID: <20021001042532.GA28075@glacier.arctrix.com>

Tim Peters wrote:
> it seems a safe bet that you're predicting against 200 messages per run

Good work detective Peters.
> Moving to a larger corpus would help make your life more interesting > again: sign up for more mailing lists . I use a different email address for each email list I sign up on. That makes sorting easy. My ham and spam collection is taken from addresses that don't receive mailing list traffic. So, signing up for more lists wouldn't help. Neil From skip@pobox.com Tue Oct 1 05:36:44 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 30 Sep 2002 23:36:44 -0500 Subject: [Spambayes] mining dates? In-Reply-To: References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: <15769.9820.558475.996393@12-248-11-90.client.attbi.com> >> It didn't prove my hypothesis, but may have exposed something as >> useful. Spam seems to be sent at a fairly constant rate throughout >> the day, which stands to reason, since it's probably all sent >> automatically. However, ham definitely seems to be sent >> predominantly during waking hours (doh!). I'm going to give a little >> date mining a try. Tim> You have my encouragement, but are you talking about date mining or Tim> time mining? Well, I'm mining the Date: field for time information. The other mining option examines Received: headers for host and IP information. I was just following suit. Tim> Date mining has hurt lots of folks, by giving good results for Tim> bogus reasons ("oops! that whole ham archive came from 1998, and Tim> none of my spam does"). So I suggest you *almost* stick to just Tim> time-of-day for now. Two extensions: Tim> 1. Day of week may also be interesting. I keep a hotmail account Tim> alive just to watch the spam pour in, and it definitely gets Tim> more spam on weekends. I speculate that the last 500 people Tim> to buy a CD of email addresses can't make time until the Tim> weekend to become an instant internet millionaire . Yeah, I thought about dow. I'll give it a look-see. Of course, that requires me to actually call time.strptime() and come up with a couple plausible format strings. 
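For what it's worth, the stdlib's RFC 2822 date parser may save inventing strptime format strings: email.utils.parsedate (the rfc822 module in Pythons of this vintage) copes with the usual Date: variants, though it leaves the weekday slot unfilled, so dow has to be derived. A sketch over a couple of representative headers:

```python
import calendar
from email.utils import parsedate

samples = [
    "Mon, 27 May 2002 00:02:09 EDT",
    "26 Sep 2002 20:21:59 -0700",
    "Thu, 2 May 2002 11:12:41 -0700 (PDT)",
]
for s in samples:
    t = parsedate(s)  # time.struct_time-like 9-tuple, or None
    if t is None:
        continue      # spam's malformed dates land here
    year, month, day, hour = t[:4]
    # parsedate does not compute the weekday, so derive it.
    dow = calendar.weekday(year, month, day)  # 0 = Monday
```

A None result is itself a usable clue, echoing SpamAssassin's invalid-Date test.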
Here's a small sample from one of my ham Set directories:

    Date: Mon, 27 May 2002 00:02:09 EDT
    Date: 26 Sep 2002 20:21:59 -0700
    Date: Wed, 25 Sep 2002 23:02:40 -0400
    Date: Thu, 2 May 2002 11:12:41 -0700 (PDT)

Mining time info is simpler because it seems more uniformly formatted than the rest of the Date: header (in my limited experience anyway), so I can be more stupid when I collect that information and just extract it with a simple regular expression. A quickie shell pipeline suggests that spam generally violates date formats a lot more often than ham. Given this pipeline for ham:

    find Data/Ham/Set* -type f \
        | xargs sed -n -e '/^From /,/^$/p' \
        | egrep '^Date: ' \
        | egrep '^Date: [A-Z][a-z][a-z],' \
        | awk '{print $2}' \
        | sort \
        | uniq -c

I get this nice clean output:

    418 Fri,
    217 Mon,
    135 Sat,
    118 Sun,
    396 Thu,
    247 Tue,
    347 Wed,

Changing the first element of the pipe to scan my Spam collection gives the much messier output:

    228 Fri,
    317 Mon,
      1 Mon,16
      1 Mon,23
      2 Mon,27
    178 Sat,
      1 Sex,
    233 Sun,
    294 Thu,
      1 Thu,26
    339 Tue,
      2 Tue,17
    271 Wed,
      2 Wed,16

It is nice to know that every once in a great while Sex is a day of the week. Wish I could predict its occurrence though. Any experts on the list? Just eyeballing things, the frequency patterns look different between spam and ham.

Skip

From tim.one@comcast.net Tue Oct 1 06:25:28 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 01:25:28 -0400
Subject: [Spambayes] Central limit
In-Reply-To: <200210010412.g914CZJ10661@localhost.localdomain>
Message-ID:

[Tim]
>> and then I went on to choose different cutoffs depending on n (n is
>> the number of "extreme words" found in the msg, with a maximum of 50).

[Anthony Baxter]
> What happens past 50?

I don't know. Gary originally suggested 30, and the only reason I tried 50 this time was due to a braino (I was editing the 150 max_discriminators value we use now, and unthinkingly just deleted the "1"). I have no results for any value other than 50.
> extracting just the ones where it was "dead wrong"... By this I guess you mean the error cases where the ratio exceeded 1.8? >> 21: [448] {0.45%} 1.09 2.54 >> 23: [535] {0.56%} 1.38 1.72 2.20 >> 25: [638] {0.63%} 1.04 1.55 1.64 1.85 >> 29: [811] {0.49%} 1.03 1.18 1.41 2.24 >> 30: [763] {0.39%} 1.04 1.04 2.08 > What's the plot of cutoff -vs- uncertain messages like? > How do these relate? Sorry, I don't know what you mean. Here's a histogram showing the # of predictions made at each ratio, where the "99.0" bucket includes all ratios >= 99.0 (there are a lot of those!): 90000 items; mean 96.06; sdev 2174.03 * = 141 items 1.0 794 ****** 2.0 1411 *********** 3.0 2067 *************** 4.0 2373 ***************** 5.0 2640 ******************* 6.0 2708 ******************** 7.0 2883 ********************* 8.0 2747 ******************** 9.0 2598 ******************* 10.0 2478 ****************** 11.0 2307 ***************** 12.0 2194 **************** 13.0 2008 *************** 14.0 1906 ************** 15.0 2403 ****************** 16.0 5814 ****************************************** 17.0 5650 ***************************************** 18.0 3635 ************************** 19.0 2133 **************** 20.0 1762 ************* 21.0 1634 ************ 22.0 1351 ********** 23.0 1154 ********* 24.0 1001 ******** 25.0 937 ******* 26.0 871 ******* 27.0 898 ******* 28.0 861 ******* 29.0 949 ******* 30.0 952 ******* 31.0 937 ******* 32.0 869 ******* 33.0 812 ****** 34.0 715 ****** 35.0 745 ****** 36.0 736 ****** 37.0 576 ***** 38.0 573 ***** 39.0 551 **** 40.0 520 **** 41.0 463 **** 42.0 486 **** 43.0 445 **** 44.0 451 **** 45.0 374 *** 46.0 349 *** 47.0 365 *** 48.0 365 *** 49.0 288 *** 50.0 319 *** 51.0 299 *** 52.0 276 ** 53.0 281 ** 54.0 273 ** 55.0 255 ** 56.0 246 ** 57.0 239 ** 58.0 213 ** 59.0 236 ** 60.0 211 ** 61.0 188 ** 62.0 205 ** 63.0 178 ** 64.0 164 ** 65.0 162 ** 66.0 190 ** 67.0 177 ** 68.0 174 ** 69.0 145 ** 70.0 175 ** 71.0 155 ** 72.0 168 ** 73.0 123 * 74.0 140 * 75.0 132 * 
76.0 130 * 77.0 133 * 78.0 121 * 79.0 119 * 80.0 122 * 81.0 125 * 82.0 124 * 83.0 97 * 84.0 96 * 85.0 125 * 86.0 99 * 87.0 93 * 88.0 94 * 89.0 102 * 90.0 99 * 91.0 105 * 92.0 88 * 93.0 82 * 94.0 95 * 95.0 72 * 96.0 72 * 97.0 82 * 98.0 82 * 99.0 8580 ************************************************************* I suppose you can get a crude answer to whatever it is you're asking from staring at that . Here's restricted to ratios < 10.0: 20221 items; mean 61.62; sdev 23.19 * = 6 items 1.00 93 **************** 1.10 74 ************* 1.20 69 ************ 1.30 75 ************* 1.40 71 ************ 1.50 66 *********** 1.60 69 ************ 1.70 90 *************** 1.80 91 **************** 1.90 96 **************** 2.00 94 **************** 2.10 119 ******************** 2.20 126 ********************* 2.30 146 ************************* 2.40 136 *********************** 2.50 144 ************************ 2.60 134 *********************** 2.70 168 **************************** 2.80 167 **************************** 2.90 177 ****************************** 3.00 192 ******************************** 3.10 176 ****************************** 3.20 222 ************************************* 3.30 203 ********************************** 3.40 198 ********************************* 3.50 230 *************************************** 3.60 205 *********************************** 3.70 183 ******************************* 3.80 209 *********************************** 3.90 249 ****************************************** 4.00 207 *********************************** 4.10 253 ******************************************* 4.20 204 ********************************** 4.30 212 ************************************ 4.40 253 ******************************************* 4.50 240 **************************************** 4.60 249 ****************************************** 4.70 246 ***************************************** 4.80 270 ********************************************* 4.90 239 **************************************** 
5.00   258 *******************************************
5.10   240 ****************************************
5.20   242 *****************************************
5.30   256 *******************************************
5.40   248 ******************************************
5.50   279 ***********************************************
5.60   263 ********************************************
5.70   294 *************************************************
5.80   286 ************************************************
5.90   274 **********************************************
6.00   259 ********************************************
6.10   261 ********************************************
6.20   257 *******************************************
6.30   278 ***********************************************
6.40   278 ***********************************************
6.50   241 *****************************************
6.60   279 ***********************************************
6.70   287 ************************************************
6.80   287 ************************************************
6.90   281 ***********************************************
7.00   299 **************************************************
7.10   291 *************************************************
7.20   311 ****************************************************
7.30   285 ************************************************
7.40   281 ***********************************************
7.50   259 ********************************************
7.60   292 *************************************************
7.70   288 ************************************************
7.80   285 ************************************************
7.90   292 *************************************************
8.00   249 ******************************************
8.10   271 **********************************************
8.20   261 ********************************************
8.30   289 *************************************************
8.40   269 *********************************************
8.50   275 **********************************************
8.60   294 *************************************************
8.70   290 *************************************************
8.80   281 ***********************************************
8.90   268 *********************************************
9.00   258 *******************************************
9.10   263 ********************************************
9.20   268 *********************************************
9.30   280 ***********************************************
9.40   279 ***********************************************
9.50   247 ******************************************
9.60   253 *******************************************
9.70   244 *****************************************
9.80   265 *********************************************
9.90   241 *****************************************

>> 2. There are notable exceptions to that, but error rates are so low
>> that a single message makes a large difference in error rate.

> Is there anything "magic" about those 5 fns? Were they the usual
> suspects? Does inspecting them by hand give any clues about other
> tokenisation clues that might have helped them? (e.g. if your corpus
> was sufficiently single-sourced that you could turn on all the
> disabled clue-extractors...)

Sorry, I can't relate the errors to msgs. All I have is a binary pickle containing 90,000 of these:

    class Node(object):
        __slots__ = 'is_spam', 'n', 'zham', 'zspam', 'delta', 'score'

That was generated when I was testing a different "certainty heuristic" that performed much worse than the one I'm talking about now, and its text output file doesn't contain any error cases with ratios larger than about 1.1 (so it doesn't contain the errors in question now). It never made a mistake, but it considered huge numbers of msgs to be uncertain -- if 25% of msgs are kicked out for manual review, I'd consider the scheme wholly impractical.

>> 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.

> And all of those were fn, not fp.
That's right. In this particular test. OTOH, this particular test ran 90 times each training on 500+500 then predicting against 4500+4500, so it was giving itself a hard job. I've got lots of reasons to believe that training on 500 ham and 500 spam isn't enough to get reasonable coverage of the diversity in my corpora. Offline, Guido tried the use_central_limit2 code exactly as-is on a much larger test, training on about 8K ham + 3K spam for each run. I don't recommend doing that because the "scores" produced by the code as-is make no sense -- they basically produce 1 bit of information (which zscore was smaller?) in a highly confusing way, and a way that's not symmetric around 0.5. I believe he also used max_discriminators=150 (the default these days), which may well be "too large" for the log-central-limit code (Gary designed it to make extreme use of the extreme words, and there's no message that has 150 distinct extreme words). Even so, compared to our current default scheme, his bottom lines across 90 runs were: total unique fp went from 904 to 324 won -64.16% mean fp % went from 0.662958214428 to 0.232509170721 won -64.93% total unique fn went from 97 to 275 lost +183.51% mean fn % went from 0.127271524421 to 0.328802849112 lost +158.35% and we've already seen that this scheme is less certain about spam than about ham. Alas, there's no way to know what the "certainty heuristic" would have said in Guido's large run (there's no code checked in for that, and I'm having an increasingly hard time making insane amounts of time for this project). From tim.one@comcast.net Tue Oct 1 06:36:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 01:36:34 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <15769.9820.558475.996393@12-248-11-90.client.attbi.com> Message-ID: [Skip Montanaro, on day-of-week] > Yeah, I thought about dow. I'll give it a look-see. 
Of course, that
> requires me to actually call time.strptime() and come up with a couple
> plausible format strings.

Stupid almost certainly beats smart here. Match against r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s'. If that succeeds, generate a dow token with the day of the week, else generate a dow token with a "no day" value. All cases are then reduced to 8, and all goofy patterns you see in spam are reduced to one. You could refine that a little (e.g., to distinguish plain-missing from there-but-not-followed-by-space), but I expect more than that would be counterproductive. Testing is the final judge, of course, but trust me on this one: start stupid, and work your way up until results stop improving. From mjm@michaelmeltzer.com Tue Oct 1 07:01:38 2002 From: mjm@michaelmeltzer.com (Michael Meltzer) Date: Tue, 1 Oct 2002 02:01:38 -0400 Subject: [Spambayes] just an idea Message-ID: <010701c26910$017d5760$0b01a8c0@mjm2> This is a multi-part message in MIME format. ---------------------- multipart/alternative attachment For what it is worth, in the same way a time stamp might be useful, the current crop of black hole lists might be helpful. Their problem has always been that they are a little too touchy, slow to react, and a little dangerous for an admin due to their draconian nature. I have had my commercial DSL line included just because they were DSL lines. But the filter is a little more forgiving than a simple binary decision. Knowing an IP is a dial-up line, a cable modem, a known spammer address or an open relay could be useful in a close call. In fact, a real cute application would be for the filters to report to a spambayes blackhole list automatically, but it would not be a blackhole list, just one element of the filter used in the evaluation.
Works nicely with the properties of the filter; it should help with hammy email that might look spammy, especially if it's out of the norm for the user, and the network effect with a little address ageing should be self-maintaining. The down side: those DNS queries can be expensive. Just a thought. http://relays.osirusoft.com/cgi-bin/rbcheck.cgi MJM ---------------------- multipart/alternative attachment-- From skip@pobox.com Tue Oct 1 07:10:00 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 01:10:00 -0500 Subject: [Spambayes] more date field mining Message-ID: <15769.15416.639114.331796@12-248-11-90.client.attbi.com> I have now modified the Tokenizer class thus:

    class Tokenizer:

        date_hms_re = re.compile(r' (?P<hour>[0-9][0-9]):'
                                 r'(?P<minute>[0-9][0-9]):'
                                 r'(?P<second>[0-9][0-9]) ')

        date_formats = ("%a, %d %b %Y %H:%M:%S (%Z)",
                        "%a, %d %b %Y %H:%M:%S %Z",
                        "%d %b %Y %H:%M:%S (%Z)",
                        "%d %b %Y %H:%M:%S %Z")

        ...

        def tokenize_headers(self, msg):
            # Special tagging of header lines and MIME metadata.
            ...
            if options.mine_date_headers:
                for header in msg.get_all("date", ()):
                    mat = self.date_hms_re.search(header)
                    # return the time in Date: headers arranged in
                    # six-minute buckets
                    if mat is not None:
                        h = int(mat.group('hour'))
                        bucket = int(mat.group('minute')) // 10
                        yield 'time:%02d:%d' % (h, bucket)
                    # extract the day of the week
                    for fmt in self.date_formats:
                        try:
                            timetuple = time.strptime(header, fmt)
                        except ValueError:
                            pass
                        else:
                            yield 'dow:%d' % timetuple[6]
                            break
                    else:
                        yield 'dow:invalid'

Times and days of the week seem like they should be pretty distinct. I should probably analyze them separately using two options. Still, here are my initial results using this coarser grained scheme:

cutoffs -> times -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ...
false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    1.500  1.500  tied
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  2.000  lost  +33.33%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 30 to 32 lost +6.67%
mean fp % went from 1.5 to 1.6 lost +6.67%

false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  0.500  tied
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied

won   0 times
tied  9 times
lost  1 times

total unique fn went from 17 to 18 lost +5.88%
mean fn % went from 0.85 to 0.9 lost +5.88%

ham mean                     ham sdev
  20.82   21.05   +1.10%       6.43    6.47   +0.62%
  21.86   22.00   +0.64%       6.63    6.61   -0.30%
  21.38   21.56   +0.84%       6.49    6.57   +1.23%
  21.96   22.13   +0.77%       6.26    6.27   +0.16%
  21.51   21.73   +1.02%       6.72    6.73   +0.15%
  21.66   21.88   +1.02%       6.98    7.01   +0.43%
  21.45   21.62   +0.79%       7.66    7.59   -0.91%
  21.74   21.93   +0.87%       6.69    6.67   -0.30%
  21.71   21.88   +0.78%       7.44    7.43   -0.13%
  21.87   22.01   +0.64%       5.93    5.93   +0.00%

ham mean and sdev for all runs
  21.60   21.78   +0.83%       6.75    6.75   +0.00%

spam mean                    spam sdev
  74.10   73.79   -0.42%      12.99   12.71   -2.16%
  72.47   72.11   -0.50%      13.92   13.63   -2.08%
  74.05   73.75   -0.41%      13.00   12.80   -1.54%
  74.00   73.68   -0.43%      12.27   12.03   -1.96%
  72.43   72.06   -0.51%      13.73   13.33   -2.91%
  72.68   72.35   -0.45%      13.27   13.04   -1.73%
  72.57   72.29   -0.39%      13.03   12.84   -1.46%
  71.50   71.26   -0.34%      12.12   11.95   -1.40%
  73.25   72.92   -0.45%      12.67   12.39   -2.21%
  73.02   72.73   -0.40%      12.44   12.24   -1.61%

spam mean and sdev for all runs
  73.01   72.69   -0.44%      12.98   12.73   -1.93%

ham/spam mean difference: 51.41 50.91 -0.50

I'll try it with a more fine-grained set of options tomorrow after a little snooze.
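Tim's "stupid beats smart" day-of-week matcher from earlier in the thread can be sketched as a standalone function. This is only an illustration; the 'dow:' token spellings here are assumptions, not the actual spambayes ones:

```python
import re

# Match the literal day-name prefix of a Date: header instead of
# parsing it with time.strptime().  Anything that doesn't match --
# missing day names and all the goofy spam patterns -- collapses
# into one "no day" token, so there are only 8 possible outputs.
DOW_RE = re.compile(r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s')

def dow_token(date_header):
    """Reduce a Date: header to one of 8 day-of-week tokens."""
    mat = DOW_RE.match(date_header)
    if mat is not None:
        return 'dow:' + mat.group(1)
    return 'dow:none'
```

For example, dow_token("Tue, 24 Sep 2002 15:33:56 -0500") gives 'dow:Tue', while a header with no leading day name gives 'dow:none'.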
Skip From msergeant@startechgroup.co.uk Tue Oct 1 10:18:13 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:18:13 +0100 Subject: [Spambayes] Matt Sergeant: Introduction References: Message-ID: <3D996855.9030707@startechgroup.co.uk> Tim Peters wrote: > [Matt Sergeant] > > Thanks for the introduction, Matt! Welcome. > > >>... >>Like you all, I discovered very quickly that it's the tokenisation >>techniques that are the biggest "win" when it comes down to it. > > The first thing I tried after implementing Graham's scheme was special > tokenization and tagging of embedded http/https/ftp thingies. Consider that adopted ;-) And to give back I'll tell you that one of my biggest wins was parsing HTML (with HTML::Parser - a C implementation so it's very fast) and tokenising all attributes, so I get: colspan=2 face=Arial, Helvetica, sans-serif as tokens. Plus using a proper HTML parser I get to parse HTML comments too (which is a win). Using word tuples is also a small win, but increases the database size and number of tokens you have to pull from the database enormously. That's an issue for me because I'm not using an in-memory database (one implementation uses CDB, another uses SQL - the SQL one is really nice because you can so easily do data mining, and the code to extract the token probabilities is just a view). > That > instantly cut the false negative rate in half. It remains the single > biggest win we ever got. Well I very quickly found out that most of the academic research into this has been pretty bogus. For example everyone seems (seemed?) to think that stemming was a big win, but I found it to lose every time. > The rest has been an aggregation of many smaller > wins, and the benefit gotten over time from finding and removing the biases > in Paul's formulation has been highly significant. 
That eventually hit a
> wall, where this set of 3 artificialities was stubborn:
>
>     artificially clamping spamprobs into [0.01, 0.99]
>     artificially boosting ham counts
>     looking at only the 16 most-extreme words
>
> Changing any one, or any two, of those, gave at best mixed results. It took
> wholesale adoption of all of Gary Robinson's ideas at once (some of which
> aren't really explained (yet?) on his webpage) to nuke them all. The fewer
> the number of "mystery knobs", the better results have gotten, but the
> original biases sometimes acted to cancel each other out in the areas they
> hurt most, so you can't get here from there removing just one at a time.

(I've followed this all so far in read-only mode, but thanks for rounding it up into 2 paragraphs ). The one thing that still bothers me about Gary's method is that the threshold value varies depending on corpus. Though I expect there's some mileage in being able to say that the middle ground is "unknown".

>> so I'm hopefully going to get CLT done this week and see how it fares.
>> Unfortunately I find python incredibly difficult to read, so it takes
>> me a while!

> Hmm. I could tell you to mentally translate
>
>     a.b
>
> to
>
>     $a->{b}
>
> but I doubt your problem is at that level . Post a snippet of Python
> you find "incredibly difficult to read", and someone will be happy to walk
> you thru it. I really can't guess, as this particular criticism of Python
> is one I've never heard before!

OK, I'll go over it again this week and next time I get stuck I'll mail out for some help ;-) The hardest part really is getting from how my code is structured (i.e. where I get my data from, how I store it, etc) to your version. Simple examples like where you use a priority queue for the probabilities so you can extract the top N indicators, I just use an array, and use a sort to get the top N. So mostly it's just the details of storage that confuse me.
Oh, and not being able to figure out where a block ends :-P

Off the top of my head, what does frexp() do? And where is compute_population_stats used?

>> ...
>> such as how the probability stuff works so much better on individuals'
>> corpora (or on a particular mailing list's corpus) than it does for
>> hundreds of thousands of users.

> That's been my suspicion, but we haven't tested it here yet. So save us the
> effort and tell us the bottom line from your tests .

On my personal email I was seeing about 5 FP's in 4000, and about 20 FN's in about the same number (can't find the exact figures right now). On a live feed of customer email we're seeing about 4% FN's and 2% FP's. I don't yet have your fancy histograms, mostly because the code works on one email in isolation right now, and knows nothing about what result it should have given - I need to write wrappers to do that stuff yet. From anthony@interlink.com.au Tue Oct 1 10:29:48 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 01 Oct 2002 19:29:48 +1000 Subject: [Spambayes] Matt Sergeant: Introduction In-Reply-To: <3D996855.9030707@startechgroup.co.uk> Message-ID: <200210010929.g919Tnn02347@localhost.localdomain>

>>> Matt Sergeant wrote
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
>
> colspan=2
> face=Arial, Helvetica, sans-serif
>
> as tokens. Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).

With the Graham code, we found that the simple-minded parsing of HTML actually hurt more than it gained, but it was a _very_ simple split-on-whitespace. In a case of synchronicity, at the moment I'm running a test over my newer larger monster corpus (35Kh/17Ks) to extract the avpairs from HTML tokens.
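Matt's attribute tokenisation above uses Perl's HTML::Parser; a minimal Python analogue of the same idea (standalone sketch, not anything in the spambayes tree) looks like this:

```python
from html.parser import HTMLParser

class AttrTokenizer(HTMLParser):
    """Collect attr=value tokens from markup, HTML comments included."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already dequoted
        for name, value in attrs:
            self.tokens.append('%s=%s' % (name, value))

    def handle_comment(self, data):
        # a proper parser also surfaces HTML comments, "which is a win"
        self.tokens.append('comment:%s' % data.strip())

p = AttrTokenizer()
p.feed('<td colspan=2><font face="Arial, Helvetica, sans-serif">x</font></td>')
# p.tokens is now ['colspan=2', 'face=Arial, Helvetica, sans-serif']
```

The point of the design is that a real parser sees through the quoting, casing, and comment tricks that defeat a simple split-on-whitespace.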
Anthony

From msergeant@startechgroup.co.uk Tue Oct 1 10:37:29 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:37:29 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> Message-ID: <3D996CD9.2030300@startechgroup.co.uk>

Jason R. Mastaler wrote:
> Matt Sergeant writes:
>
>> I've been following this list on gmane.org for a while now (it's a
>> mail to nntp gateway for those interested in following multiple
>> technical mailing lists in a read-only fashion)
>
> Actually, Gmane is not read-only -- you can both read and post.

Does it depend on the list? I tried to post once and my post never showed up.

From anthony@interlink.com.au Tue Oct 1 10:50:01 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 01 Oct 2002 19:50:01 +1000 Subject: [Spambayes] memory consumption... In-Reply-To: <20020926192704.GA3931@schaduw.felnet> Message-ID: <200210010950.g919o3e02522@localhost.localdomain>

>>> Carel Fellinger wrote
> I take it that you're new to linux? Otherwise ignore my rambling.
> Linux uses all its free memory for caching, but only truly free
> memory. So before any swapping starts the cache will shrink to its
> bare minimum first.

That's what I'd expected. But it looked like this little laptop had got confused, and wouldn't let go of the cached RAM. A reboot later and it's happy again, tossing cached data away rather than paging everything else out. Oh well. File under "one of those freaky things that computers sometimes do".

Anthony

From msergeant@startechgroup.co.uk Tue Oct 1 10:54:56 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:54:56 +0100 Subject: [Spambayes] memory consumption... References: <200210010950.g919o3e02522@localhost.localdomain> Message-ID: <3D9970F0.8090007@startechgroup.co.uk>

Anthony Baxter wrote:
>>>> Carel Fellinger wrote
>>>
>> I take it that you're new to linux? Otherwise ignore my rambling.
>> Linux uses all its free memory for caching, but only truly free
>> memory. So before any swapping starts the cache will shrink to its
>> bare minimum first.
>
> That's what I'd expected. But it looked like this little laptop had
> got confused, and wouldn't let go of the cached RAM. A reboot later
> and it's happy again, tossing cached data away rather than
> paging everything else out.

FWIW, this is very much dependent on Linux kernel version. Red Hat's stock kernels seem to perform much better than anyone else's at this type of thing.

Matt.

From mwh@python.net Tue Oct 1 12:04:38 2002 From: mwh@python.net (Michael Hudson) Date: 01 Oct 2002 12:04:38 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D996855.9030707@startechgroup.co.uk> Message-ID:

Matt Sergeant writes:
> Off the top of my head, what does frexp() do?

>>> print math.frexp.__doc__
frexp(x)

Return the mantissa and exponent of x, as pair (m, e).
m is a float and e is an int, such that x = m * 2.**e.
If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0.

Cheers, M.

-- The bottom tier is what a certain class of wanker would call "business objects" ... -- Greg Ward, 9 Dec 1999

From mwh@python.net Tue Oct 1 12:03:28 2002 From: mwh@python.net (Michael Hudson) Date: 01 Oct 2002 12:03:28 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> <3D996CD9.2030300@startechgroup.co.uk> Message-ID:

Matt Sergeant writes:
> Jason R. Mastaler wrote:
>> Matt Sergeant writes:
>>
>>> I've been following this list on gmane.org for a while now (it's a
>>> mail to nntp gateway for those interested in following multiple
>>> technical mailing lists in a read-only fashion)
>>
>> Actually, Gmane is not read-only -- you can both read and post.
>
> Does it depend on the list? I tried to post once and my post never
> showed up.

You should get a once-per-list confirmation email. Reply to that, and you should be able to post via gmane.
If you see this post, you know it works... Cheers, M. -- /* I'd just like to take this moment to point out that C has all the expressive power of two dixie cups and a string. */ -- Jamie Zawinski from the xkeycaps source From msergeant@startechgroup.co.uk Tue Oct 1 13:05:51 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 13:05:51 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D996855.9030707@startechgroup.co.uk> Message-ID: <3D998F9F.1030301@startechgroup.co.uk> Michael Hudson wrote: > Matt Sergeant writes: > > >>Off the top of my head, what does frexp() do? > > >>>>print math.frexp.__doc__ >>> > frexp(x) > > Return the mantissa and exponent of x, as pair (m, e). > m is a float and e is an int, such that x = m * 2.**e. > If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0. Ah cool. Same as Math::BigFloat's $x->parts(). From richie@entrian.com Tue Oct 1 14:15:21 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 01 Oct 2002 14:15:21 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <20020928153427.D68E.JCARLSON@uci.edu> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> Message-ID: Hi Josiah, > I have (in the past) had email software that doesn't allow arbitrary > header matching. By inserting the Subject, I guarantee that ANY email > software can filter it. A case for an option, maybe. How old was this software? (please say "Very old" 8-) Thanks for the explanations of everything else. I hope my comments were useful. 
-- Richie Hindle richie@entrian.com

From richie@entrian.com Tue Oct 1 14:15:25 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 01 Oct 2002 14:15:25 +0100 Subject: [Spambayes] Cunning use of quoted-printable Message-ID:

Afternoon all, I've just found this message in my spam corpus:

-----------------------------------------------------------------------
[Some headers snipped]
Subject: Mail for Richie Hindle
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Mail Express

Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t=
o=20announce=20the=20release=20of=20the=20following=20new=20software=20=
programs:=0D=0A=0D=0A1)=20FTP=20Navigator=206.58=0D=0Ahttp://www.intern=
et-soft.com/DEMO/ftpnavigator.exe=0D=0A=0D=0A2)=20Web=20Site=20eXtracto=
r=208.01=0D=0Ahttp://www.esalesbiz.com/extra/webextrasetup.exe=0D=0A=0D=

[more of the same snipped]
-----------------------------------------------------------------------

Looks like an attempt to fox systems like spambayes. It doesn't make much difference, because the tokenizer decodes the quoted-printable, but it could trigger a clue token. I doubt there are enough spams out there for that to make any difference, and how to quantify whether a message looks like it's using this trick is not obvious. I only really mention it as a curiosity. It did come out as a false positive in my testing, but I don't think that was because of the quoting.

Less interesting are the results of running Tim's 4000-message tests on my corpora:

-> best cutoff for all runs: 0.56
-> with weighted total 10*2 fp + 37 fn = 57
-> fp rate 0.1% fn rate 1.85%

total unique false pos 2
total unique false neg 37
average fp % 0.1
average fn % 1.85

This tells me two things: I am Mr. Average, and the results are astonishingly impressive!
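The obfuscation above is ordinary quoted-printable, and Python's quopri module shows what the tokenizer actually sees after decoding. A minimal illustration (the byte string is retyped from a fragment of the sample, not the full message):

```python
import quopri

# "=20" is a space, "=0D=0A" is CRLF, and a bare "=" at the end of a
# line is a soft line break that simply joins the lines back together.
raw = b"Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t=\no=20announce"
decoded = quopri.decodestring(raw)
# decoded == b"Dear Richie Hindle,\r\n\r\nInternet-Soft.Com is pleased to announce"
```

So by the time tokenization happens, the message is plain text again, which is why the trick buys the spammer nothing here.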
-- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Tue Oct 1 14:32:22 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 14:32:22 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> Message-ID: <3D99A3E6.4050403@startechgroup.co.uk> Richie Hindle wrote: > Hi Josiah, > > >>I have (in the past) had email software that doesn't allow arbitrary >>header matching. By inserting the Subject, I guarantee that ANY email >>software can filter it. > > > A case for an option, maybe. How old was this software? (please say "Very > old" 8-) Lotus Notes still can't filter on arbitrary headers. Matt. From msergeant@startechgroup.co.uk Tue Oct 1 14:36:35 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 14:36:35 +0100 Subject: [Spambayes] Tokenising clues Message-ID: <3D99A4E3.9000108@startechgroup.co.uk> It seems everyone is slowly stumbling on "tokenising clues" here. A "date" header issue here, a "message-id" issue there, and a particular way to format body text as another possible clue. This seems like a vast waste of your time to me. There's a couple of projects out there that have already spent vast amounts of time and programming effort into figuring out these other clues that spambayes misses out on. Rather than repeating that work, why not just rip all the rules out of SpamAssassin or some other spam checking project wholesale, and stuff those into your database? Sorry, I don't want to demean any of your work, but we need to work together to fight spam, and I'd rather not see so much time wasted on individual clues when SpamAssassin already extracts about 800 of them! Matt. 
From skip@pobox.com Tue Oct 1 14:52:55 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 08:52:55 -0500 Subject: [Spambayes] results of mining post time - slight loss Message-ID: <15769.43191.229664.140708@12-248-11-90.client.attbi.com>

(forgot to press the send key yesterday evening...)

Using six-minute time buckets gleaned from Date: headers, here are the results (executive summary: slight loss). Buckets were computed as I suggested in my previous email:

    (h*60+m)//10

that is, six-minute intervals (maybe I should name this option the lawyer-fee-increment (*)?)

Before:

    [TestDriver]
    spam_cutoff: 0.4

After:

    [Tokenizer]
    mine_date_headers: True

    [TestDriver]
    spam_cutoff: 0.4

Results:

cutoffs -> times -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ... yadda yadda yadda

false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    1.500  1.500  tied
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  2.000  lost  +33.33%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 30 to 32 lost +6.67%
mean fp % went from 1.5 to 1.6 lost +6.67%

false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  0.500  tied
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 17 to 17 tied
mean fn % went from 0.85 to 0.85 tied

ham mean                     ham sdev
  20.82   20.98   +0.77%       6.43    6.47   +0.62%
  21.86   21.96   +0.46%       6.63    6.62   -0.15%
  21.38   21.52   +0.65%       6.49    6.56   +1.08%
  21.96   22.09   +0.59%       6.26    6.29   +0.48%
  21.51   21.67   +0.74%       6.72    6.75   +0.45%
  21.66   21.78   +0.55%       6.98    7.00   +0.29%
  21.45   21.59   +0.65%       7.66    7.62   -0.52%
  21.74   21.88   +0.64%       6.69    6.68   -0.15%
  21.71   21.84   +0.60%       7.44    7.43   -0.13%
  21.87   21.96   +0.41%       5.93    5.93   +0.00%

ham mean and sdev for all runs
  21.60   21.73   +0.60%       6.75    6.76   +0.15%

spam mean                    spam sdev
  74.10   73.87   -0.31%      12.99   12.80   -1.46%
  72.47   72.28   -0.26%      13.92   13.79   -0.93%
  74.05   73.83   -0.30%      13.00   12.85   -1.15%
  74.00   73.83   -0.23%      12.27   12.11   -1.30%
  72.43   72.18   -0.35%      13.73   13.45   -2.04%
  72.68   72.44   -0.33%      13.27   13.11   -1.21%
  72.57   72.44   -0.18%      13.03   12.94   -0.69%
  71.50   71.34   -0.22%      12.12   12.01   -0.91%
  73.25   73.05   -0.27%      12.67   12.50   -1.34%
  73.02   72.81   -0.29%      12.44   12.29   -1.21%

spam mean and sdev for all runs
  73.01   72.81   -0.27%      12.98   12.82   -1.23%

ham/spam mean difference: 51.41 51.08 -0.33

Skip

(*) It's a sad commentary on the litigiousness of Americans if someone like me who's basically never been to a lawyer recognizes the stereotypical six-minute increment lawyers are supposed to use to bill their clients. (Or maybe I watched too much "LA Law" at a crucial period of my life...)

From anthony@interlink.com.au Tue Oct 1 15:22:16 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 02 Oct 2002 00:22:16 +1000 Subject: [Spambayes] Tokenising clues In-Reply-To: <3D99A4E3.9000108@startechgroup.co.uk> Message-ID: <200210011422.g91EMHT04893@localhost.localdomain>

>>> Matt Sergeant wrote
> This seems like a vast waste of your time to me. There's a couple of
> projects out there that have already spent vast amounts of time and
> programming effort into figuring out these other clues that spambayes
> misses out on. Rather than repeating that work, why not just rip all the
> rules out of SpamAssassin or some other spam checking project wholesale,
> and stuff those into your database?

The problems are that

- many of the existing tools are of the "if this header says _this_, it indicates spamminess of -this- much". The stuff here is more trying to work out answers that work without having to try and produce magic numbers for what a particular header value means.

- a lot of the problems are from the testing corpuses (yes, I know the word is corpora, corpuses looks cooler :) and the mixed nature of them. This rules out a bunch of "obvious" tricks.

- spamassassin, in particular, is written in perl.
I tried looking through it to grok clues and started having twitches and convulsions. Been through the perl horror, not going back :) I couldn't find a simple doco of "here's what SA looks at" in the docs. > Sorry, I don't want to demean any of your work, but we need to work > together to fight spam, and I'd rather not see so much time wasted on > individual clues when SpamAssassin already extracts about 800 of them! The problem with SA for at least one of the applications I have is that it's way, way too aggressive. My monster corpus is the main contact email for the company I work for. SA kicks out far too many legitimate commercial email messages. But that mailbox gets (in the last week) something like 200 spams a day - probably more. Sifting through the hits looking for the real posts is too much work. If there is a list of existing tokenisation clues we can work from, excellent! I know I won't mind re-using someone else's hard-won experience in this area. :) Anthony -- Anthony Baxter It's never too late to have a happy childhood. From skip@pobox.com Tue Oct 1 15:39:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 09:39:20 -0500 Subject: [Spambayes] new virus... Message-ID: <15769.45976.592234.222829@12-248-11-90.client.attbi.com> Not quite on-topic for this group, but I know some people are interested in getting this project to identify viruses. FYI... Virus Could Prove Real Bugbear for Networks A new mass-mailing virus, which hit the Internet on Monday, could cause quite a bit of damage to vulnerable networks. The virus, known as Bugbear, installs a Trojan on infected machines that is capable of logging users' keystrokes, which could include passwords and other sensitive information. 
http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSHe0EWaTF0E4J0q1G0Ac

Skip

From msergeant@startechgroup.co.uk Tue Oct 1 15:41:56 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 15:41:56 +0100 Subject: [Spambayes] Tokenising clues References: <200210011422.g91EMHT04893@localhost.localdomain> Message-ID: <3D99B434.1000006@startechgroup.co.uk>

Anthony Baxter wrote:
>>>> Matt Sergeant wrote
>>>
>> This seems like a vast waste of your time to me. There's a couple of
>> projects out there that have already spent vast amounts of time and
>> programming effort into figuring out these other clues that spambayes
>> misses out on. Rather than repeating that work, why not just rip all the
>> rules out of SpamAssassin or some other spam checking project wholesale,
>> and stuff those into your database?
>
> The problems are that
>
> - many of the existing tools are of the "if this header says _this_,
> it indicates spamminess of -this- much". The stuff here is more
> trying to work out answers that work without having to try and
> produce magic numbers for what a particular header value means.

The scoring is independent of the matching. The scoring is merely a by-product of running the matches through the genetic algorithm - in order to feed that genetic algorithm we have to not care what the score is (as that's prior knowledge, thus bad).

> - a lot of the problems are from the testing corpuses (yes, I know
> the word is corpora, corpuses looks cooler :) and the mixed nature
> of them. This rules out a bunch of "obvious" tricks.

This is suggested as an extension of what you do, not a replacement though. You've already got accurate code, but it seems that spamassassin was able to get clues from your FN's that word tokenisation missed. The very nature of what you're doing will mean that if the SA rules aren't as accurate as the tokens you do find in an email then it won't matter. But it's just that little bit more information.
> - spamassassin, in particular, is written in perl. I tried looking
>   through it to grok clues and started having twitches and convulsions.
>   Been through the perl horror, not going back :)
>   I couldn't find a simple doco of "here's what SA looks at" in the docs.

Check the rules/ directory. You can read regexps I assume. That's all
SpamAssassin is - a big regexp engine. There are rules that run code (we
call them eval tests), but most of them aren't that complex, for example
something that looks at eval:subject_all_caps() will run:

  sub subject_is_all_caps {
    my ($self) = @_;
    my $subject = $self->get('Subject');
    $subject =~ s/^\s+//;
    $subject =~ s/\s+$//;
    return 0 if $subject !~ /\s/;        # don't match one word subjects
    return 0 if (length $subject < 10);  # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
    return length($subject) && ($subject eq uc($subject));
  }

if you change all the arrows to dots, and remove all the dollars,
semi-colons and curly brackets, you get:

  sub subject_is_all_caps
    subject = self.get('Subject')
    subject =~ s/^\s+//
    subject =~ s/\s+$//
    return 0 if subject !~ /\s/        # don't match one word subjects
    return 0 if (length subject < 10)  # don't match short subjects
    subject =~ s/[^a-zA-Z]//g          # only look at letters
    return length(subject) && (subject eq uc(subject))

It's almost like python! ;-)

>>Sorry, I don't want to demean any of your work, but we need to work
>>together to fight spam, and I'd rather not see so much time wasted on
>>individual clues when SpamAssassin already extracts about 800 of them!

> The problem with SA for at least one of the applications I have is that
> it's way, way too aggressive.

So up your threshold, or train it yourself. Isn't that what you're doing
with spambayes?

> My monster corpus is the main contact email
> for the company I work for. SA kicks out far too many legitimate
> commercial email messages. But that mailbox gets (in the last week)
> something like 200 spams a day - probably more.
> Sifting through the hits looking for the real posts is too much work.
>
> If there is a list of existing tokenisation clues we can work from,
> excellent! I know I won't mind re-using someone else's hard-won experience
> in this area. :)

Yep, check the rules/ directory. Particularly the 20_* files, which are
the header, body and rawbody rules (don't worry about the distinction
between body and rawbody for now - it's really rather bogus ;-)

Matt.

From tim.one@comcast.net Tue Oct 1 16:10:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 11:10:12 -0400
Subject: [Spambayes] Re: Matt Sergeant: Introduction
Message-ID: <39e6338f5b.38f5b39e63@icomcast.net>

>>>Off the top of my head, what does frexp() do?

>> frexp(x)
>>
>> Return the mantissa and exponent of x, as pair (m, e).
>> m is a float and e is an int, such that x = m * 2.**e.
>> If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0.

> Ah cool. Same as Math::BigFloat's $x->parts().

Maybe -- I like to think of it as being the same as the frexp() defined by
the C standard.
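[Editorial aside: Python's math module exposes the C-standard frexp()
directly, so the invariants quoted above are easy to check. A quick
illustrative snippet, not from the thread:]

```python
import math

# frexp(x) returns (m, e) with x == m * 2.0**e and 0.5 <= abs(m) < 1.0
# for nonzero x; x == 0 gives (0.0, 0), exactly as documented above.
m, e = math.frexp(12.0)
print(m, e)                          # 0.75 4, since 12.0 == 0.75 * 2**4

assert math.frexp(0.0) == (0.0, 0)

x = 0.1
m, e = math.frexp(x)
assert x == m * 2.0 ** e and 0.5 <= abs(m) < 1.0
```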
From noreply@sourceforge.net Tue Oct 1 10:31:38 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 01 Oct 2002 02:31:38 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 13:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From msergeant@startechgroup.co.uk Tue Oct 1 16:22:59 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 16:22:59 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <39e6338f5b.38f5b39e63@icomcast.net> Message-ID: <3D99BDD3.5020802@startechgroup.co.uk> Tim Peters wrote: >>>>Off the top of my head, what does frexp() do? >>> > >>>frexp(x) >>> >>>Return the mantissa and exponent of x, as pair (m, e). >>>m is a float and e is an int, such that x = m * 2.**e. >>>If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0. >> > >>Ah cool. Same as Math::BigFloat's $x->parts(). > > > Maybe -- I like to think of it as being the same as the frexp() defined > by the C standard . Duh, yeah. That was just the first search.cpan.org result for mantissa ;-) So it's the same as POSIX::frexp() ;-) From gward@python.net Tue Oct 1 16:41:24 2002 From: gward@python.net (Greg Ward) Date: Tue, 1 Oct 2002 11:41:24 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: References: Message-ID: <20021001154123.GA1581@cthulhu.gerg.ca> On 01 October 2002, Richie Hindle said: [... message with lots of quoted-printable in it ...] 
> Looks like an attempt to fox systems like spambayes. It doesn't make
> much difference, because the tokenizer decodes the quoted-printable, but
> it could trigger a clue token.

SpamAssassin has a test for this -- MIME_EXCESSIVE_QP:

  rawbody  MIME_EXCESSIVE_QP  eval:check_for_mime_excessive_qp()
  describe MIME_EXCESSIVE_QP  Excessive quoted-printable encoding in body
  score    MIME_EXCESSIVE_QP  2.070

The implementation is pretty simple:

  sub check_for_mime_excessive_qp {
    my ($self) = @_;

    # Note: We don't use rawbody because it removes MIME parts. Instead,
    # we get the raw unfiltered body. We must not change any lines.
    my $body = join('', @{$self->{msg}->get_body()});

    my $length = length($body);
    my $qp = $body =~ s/\=([0-9A-Fa-f]{2,2})/$1/g;

    # this seems like a decent cutoff
    return ($length != 0 && ($qp > ($length / 20)));
  }

(Hey, now that Matt Sergeant is on the list, I can stop being the local
SpamAssassin expert! *phew*!)

I guess there are a couple of ways to translate this to a stream-of-tokens
approach:

  * do a tokenizing pass over the raw message body, and spit out a whole
    lot of "=20" tokens

  * examine the raw body in a non-tokenizing way, and just emit a "lots
    of quoted-printable" token

  * ...?

Greg

--
Greg Ward http://www.gerg.ca/
Did YOU find a DIGITAL WATCH in YOUR box of VELVEETA?

From gward@python.net Tue Oct 1 16:50:13 2002
From: gward@python.net (Greg Ward)
Date: Tue, 1 Oct 2002 11:50:13 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <3D99A4E3.9000108@startechgroup.co.uk>
References: <3D99A4E3.9000108@startechgroup.co.uk>
Message-ID: <20021001155013.GB1581@cthulhu.gerg.ca>

On 01 October 2002, Matt Sergeant said:
> This seems like a vast waste of your time to me. There's a couple of
> projects out there that have already spent vast amounts of time and
> programming effort into figuring out these other clues that spambayes
> misses out on.
Rather than repeating that work, why not just rip all the > rules out of SpamAssassin or some other spam checking project wholesale, > and stuff those into your database? The tricky part is not stealing relevant code from SpamAssassin -- I just posted SA's "excessive quoted printable" hack, and I'm sure I could translate it into Python in 10 minutes. Not all Python hackers are afraid of Perl. ;-) (Tim could probably do it in 10 seconds, but never mind.) The trick is how to integrate it into spambayes' overall approach, where a message is simply distilled into a stream of tokens for training or prediction. It's a very different model from SpamAssassin -- it's one thing to write the code that says, "this message has a lot of quoted-printable characters in it", and it's another thing entirely to decide how to use that knowledge in an appropriate way. It's like the difference between writing the rule and coming up with a score for it. This, IMHO, is one respect in which SA is much more mature than spambayes: I see a lot of people here groping through a multi-dimensional space made up of various options and algorithm tweaks, trying to optimize something (the FP rate, the FN rate, the distance between the two histograms, whatever). In contrast, SpamAssassin drastically simplifies the space to explore -- it's the space of all SA rules and scores -- and automates the optimization by using a genetic algorithm. There's a middle ground waiting to be found somewhere... Greg -- Greg Ward http://www.gerg.ca/ I used to be a FUNDAMENTALIST, but then I heard about the HIGH RADIATION LEVELS and bought an ENCYCLOPEDIA!! 
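[Editorial aside: Greg's earlier suggestion for the excessive
quoted-printable check -- examine the raw body in a non-tokenizing way and
emit a single "lots of quoted-printable" token -- is easy to prototype. A
minimal sketch reusing SpamAssassin's 1-in-20 cutoff; the function name
and token spelling are invented here, not spambayes' actual code:]

```python
import re

QP_ESCAPE = re.compile(r'=([0-9A-Fa-f]{2})')

def qp_tokens(raw_body):
    """Yield one synthetic clue token when quoted-printable escapes make
    up an excessive share of the raw (undecoded) body.  The 1-in-20
    threshold mirrors SpamAssassin's check_for_mime_excessive_qp; the
    token name itself is made up for illustration."""
    n_escapes = len(QP_ESCAPE.findall(raw_body))
    if raw_body and n_escapes > len(raw_body) / 20.0:
        yield 'control: excessive quoted-printable'

# A heavily QP-encoded body (like the spam Richie posted) trips the token:
spammy = 'Dear=20Richie=20Hindle,=0D=0A' * 10
print(list(qp_tokens(spammy)))        # ['control: excessive quoted-printable']
print(list(qp_tokens('plain text')))  # []
```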
From skip@pobox.com Tue Oct 1 17:00:09 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 11:00:09 -0500 Subject: [Spambayes] Tokenising clues In-Reply-To: <3D99B434.1000006@startechgroup.co.uk> References: <200210011422.g91EMHT04893@localhost.localdomain> <3D99B434.1000006@startechgroup.co.uk> Message-ID: <15769.50825.784716.147473@12-248-11-90.client.attbi.com> >>> Sorry, I don't want to demean any of your work, but we need to work >>> together to fight spam, and I'd rather not see so much time wasted >>> on individual clues when SpamAssassin already extracts about 800 of >>> them! >> The problem with SA for at least one of the applications I have is >> that it's way, way too aggressive. Matt> So up your threshold, or train it yourself. Isn't that what you're Matt> doing with spambayes? If I understand things correctly, the SA genetic algorithm trains using a huge body of mail (how many ham & spam test inputs are fed to the GA?). If a huge collection is necessary, that would pretty much rule out individuals doing their own training. Have the SA gang done any tests to see how accurate the GA is with small ham/spam collections? Are the inputs fed to the GA pruned periodically to eliminate old messages? I assume that training using an individual's ham/spam collection would make it more accurate for that person's future mail. On the other hand, spambayes training (ignoring all the experimenting we're doing at the moment) pretty much just consists of separating known ham and spam, training on that periodically, then feeding incoming messages to the classifier. It looks like at this point, the spambayes stuff works pretty well for individuals with relatively small collections (200-400 of each). It remains to be seen if a "default" set of indicators would work for a large population. 
Even within this small community, we have pretty variable results across individuals, partly because we make mistakes establishing our training sets and partly because our email interests vary. I view the two projects as complementary and don't find any of the potential duplication of effort a problem. Having multiple ways to look at ham and spam makes it much harder for the bad guys to sneak something through and also creates new opportunities for each other. Last night I noticed that one of my strongest ham indicators is skip:_ 40 Turns out that many mailing lists - at least those managed by Mailman - by default add a trailer to the end of each message, like so: _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes I subscribe to and administer a number of Mailman-managed mailing lists, so it's a good ham indicator for me. For others who tend not to subscribe to any such lists it would obviously be less valuable. There's a hammy rule for the SA gang which I doubt is currently in the SA rule set. ("url:mailman" is not quite as good a ham indicator as the forty underscore token.) Skip From nas@python.ca Tue Oct 1 17:19:31 2002 From: nas@python.ca (Neil Schemenauer) Date: Tue, 1 Oct 2002 09:19:31 -0700 Subject: [Spambayes] Tokenising clues In-Reply-To: <20021001155013.GB1581@cthulhu.gerg.ca> References: <3D99A4E3.9000108@startechgroup.co.uk> <20021001155013.GB1581@cthulhu.gerg.ca> Message-ID: <20021001161931.GA29333@glacier.arctrix.com> Greg Ward wrote: > This, IMHO, is one respect in which SA is much more mature than > spambayes: I see a lot of people here groping through a > multi-dimensional space made up of various options and algorithm tweaks, > trying to optimize something (the FP rate, the FN rate, the distance > between the two histograms, whatever). 
> In contrast, SpamAssassin drastically simplifies the space to explore --
> it's the space of all SA rules and scores -- and automates the
> optimization by using a genetic algorithm. There's a middle ground
> waiting to be found somewhere...

SpamAssassin's smaller search space comes at a price. People have to
continuously come up with new rules.

I don't like the way the tokenizer is heading right now either. I want to
try generating n-grams from the headers. If that can be made to work
reasonably well I think it will be a much better approach long term.

Neil

From msergeant@startechgroup.co.uk Tue Oct 1 17:14:57 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 01 Oct 2002 17:14:57 +0100
Subject: [Spambayes] Tokenising clues
References: <200210011422.g91EMHT04893@localhost.localdomain> <3D99B434.1000006@startechgroup.co.uk> <15769.50825.784716.147473@12-248-11-90.client.attbi.com>
Message-ID: <3D99CA01.1090100@startechgroup.co.uk>

Skip Montanaro wrote:
> >>> Sorry, I don't want to demean any of your work, but we need to work
> >>> together to fight spam, and I'd rather not see so much time wasted
> >>> on individual clues when SpamAssassin already extracts about 800 of
> >>> them!
>
> >> The problem with SA for at least one of the applications I have is
> >> that it's way, way too aggressive.
>
> Matt> So up your threshold, or train it yourself. Isn't that what you're
> Matt> doing with spambayes?
>
> If I understand things correctly, the SA genetic algorithm trains using a
> huge body of mail (how many ham & spam test inputs are fed to the GA?). If
> a huge collection is necessary, that would pretty much rule out individuals
> doing their own training. Have the SA gang done any tests to see how
> accurate the GA is with small ham/spam collections? Are the inputs fed to
> the GA pruned periodically to eliminate old messages? I assume that
> training using an individual's ham/spam collection would make it more
> accurate for that person's future mail.
We haven't done that much testing on small data sets, but that's because the project aims are very different - I see spambayes as an experiment right now, whereas SpamAssassin has to genericise to large numbers of users out of the box. Feel free to try your own training though and let us know how it goes! > On the other hand, spambayes training (ignoring all the experimenting we're > doing at the moment) pretty much just consists of separating known ham and > spam, training on that periodically, then feeding incoming messages to the > classifier. Same as SpamAssassin. You run mass-check on a bunch of spam and non-spam, then feed that into the GA. It takes a *lot* longer than a statistical classifier, but that's the only difference I can see. > I view the two projects as complementary and don't find any of the potential > duplication of effort a problem. Having multiple ways to look at ham and > spam makes it much harder for the bad guys to sneak something through and > also creates new opportunities for each other. Last night I noticed that > one of my strongest ham indicators is > > skip:_ 40 > > Turns out that many mailing lists - at least those managed by Mailman - by > default add a trailer to the end of each message, like so: > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman-21/listinfo/spambayes > > I subscribe to and administer a number of Mailman-managed mailing lists, so > it's a good ham indicator for me. For others who tend not to subscribe to > any such lists it would obviously be less valuable. > > There's a hammy rule for the SA gang which I doubt is currently in the SA > rule set. ("url:mailman" is not quite as good a ham indicator as the forty > underscore token.) We have a much more robust mailman detector already. 
And that's my point - a spammer can get around your naive "mailman detector" with a bunch of underscores anywhere in his message, but he has to work a lot harder to get around a more robust detection system (it's not invincible, but it would probably require him modifying his software). So give the dog (spambayes) a bone. Let it eat all the information you can give it. None of it is going to hurt, or if it does you can chuck that out like you have been doing for a few weeks already with other tokenising ideas! Matt. From noreply@sourceforge.net Tue Oct 1 17:04:12 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 01 Oct 2002 09:04:12 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 04:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 11:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. 
----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702

From carel.fellinger@chello.nl Tue Oct 1 18:11:26 2002
From: carel.fellinger@chello.nl (Carel Fellinger)
Date: Tue, 1 Oct 2002 19:11:26 +0200
Subject: [Spambayes] memory consumption...
In-Reply-To: <200210010950.g919o3e02522@localhost.localdomain>
References: <20020926192704.GA3931@schaduw.felnet> <200210010950.g919o3e02522@localhost.localdomain>
Message-ID: <20021001171126.GA4184@mail.felnet>

On Tue, Oct 01, 2002 at 07:50:01PM +1000, Anthony Baxter wrote:
...
> That's what I'd expected. But it looked like this little laptop had
> got confused, and wouldn't let go of the cached ram. A reboot later
> and it's happy again, and tossing cached data away, rather than
> paging everything else out.

Any chance you're running an early 2.4 kernel? There were lots of
problems that sounded just like what you're saying here. Version 18
should work, probably earlier versions too.

--
groetjes, carel

From neale@woozle.org Tue Oct 1 17:57:19 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Oct 2002 09:57:19 -0700
Subject: [Spambayes] Some ideas I have....
In-Reply-To: References: Message-ID:

So then, John Draper is all like:
> I want to start up another discussion about what the direction of the
> group is heading, as far as addressing the issues of where spam filter
> should take place. IE: Client side, Vs Server side.

Currently we have two applications of the classifier: hammie and
pop3proxy. Both of these can run on either the client or the server.

Your "bureaucrat" model sounds a lot like an observer pattern. This is
what procmail does with incoming mail, dispatching events to various
processing functions (like hammie or spamassassin) who can each take a
swing at the message. What might be really useful would be a hook into an
existing SMTP server.
But before that happens, we need to answer some questions like whether or not it's feasible to run one classifier database against an entire organization or ISP's email. Still trying to port this to my Palm Pilot, Neale From neale@woozle.org Tue Oct 1 18:03:15 2002 From: neale@woozle.org (Neale Pickett) Date: 01 Oct 2002 10:03:15 -0700 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > Actually, none of mine do, because BruceG's spam didn't. I removed > all the "From " lines from the c.l.py archive to match that (easier > than inventing such lines for Bruce's msgs). I don't know that it > makes any difference for the way I run the tests, but it certainly > could make a difference if "From " lines were getting mined for clues. > I forced all my msgs alike in this respect just to cut off that > possibility. Sorry to enter this discussion a little late--I've been pretty busy with a release at work. I understand some people may not have them, but the "From " lines seem to be very useful, as they report who the sender identified themselves as in the MAIL command of the SMTP envelope. I've had a great deal of success stopping spam at the gate by denying access to people who identify themselves with addresses from certain domains. I would expect that looking at "From " lines would be a clear win for anyone. Here, I'll put my money where my mouth is. My mail program writes the >From header as an X-From: line. 
I add this to my bayescustomize.ini:

  [Tokenizer]
  basic_header_tokenize: True
  basic_header_skip: received date x-[^f][^r].*

And I get this on my tiny corpus (2x5x200 messages):

"""
false positive percentages
    1.500  1.500  tied
    1.000  1.000  tied
    2.000  1.000  won    -50.00%
    1.500  1.000  won    -33.33%
    1.500  1.000  won    -33.33%

won   3 times
tied  2 times
lost  0 times

total unique fp went from 15 to 11  won -26.67%
mean fp % went from 1.5 to 1.1      won -26.67%

false negative percentages
    1.500  1.000  won    -33.33%
    0.000  0.500  lost  +(was 0)
    1.000  1.000  tied
    0.500  0.000  won   -100.00%
    1.000  1.000  tied

won   2 times
tied  2 times
lost  1 times
"""

In all but one case where something changed, it was just a single message.
That's not a huge improvement, but maybe enough of one to convince someone
with a larger test set to try it out?

Neale

From tim.one@comcast.net Tue Oct 1 17:35:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 12:35:46 -0400
Subject: [Spambayes] RE: spambayes: CLT
In-Reply-To: <3D9973EC.5020303@startechgroup.co.uk>
Message-ID:

This is a popular offline question, so I'll take the liberty of answering
it once and for all here:

> Is there one place (or email in the archives) that describes clearly
> what the Central Limit Theorem does, and how it works? I can't seem to
> find one, and the code isn't helping all that much ;-)

CLT is a fundamental tool in statistics, and a google search will turn up
dozens of good intros. Skipping the weasel words, if you've got a large
population P, with mean M and variance V, suppose you want to estimate M.
You can take N values from P at random, and compute *their* mean, M'.
That's called "the sample mean", for obvious reasons. The CLT says that
the sample mean follows a normal distribution, with mean M and variance
V/N. The remarkable thing is that no assumption need be made about the
distribution of P: it may be normal itself, or uniform, or darned-near
crazy, or anything in-between -- it just doesn't matter.
M' is normally distributed regardless. In practical terms this says two things: M' is an unbiased estimate of M, and that you can cut the chance that M' differs significantly from M as low as you want by increasing N. Heck, by observing the variance V' of M' over many trials, you can even estimate the variance V of P, via multiplying V' by N. All that said, it's pretty much irrelevant to the "central limit" code: the calculations act as if the CLT applied here, but CLT doesn't actually apply here. The problem is that picking the N "most extreme" words from a single email isn't a *random* sampling from P by a long shot, and that violates the key precondition for applying the CLT. What we've got instead are two distinct populations, and a seat-of-the-pants rule for deciding whether we're certain a given email belongs to one of the populations (although note that no "certainty code" has been checked in yet, and "the scores" returned by the central-limit code that is checked in are pretty much nonsense right now). From python-spambayes@discworld.dyndns.org Tue Oct 1 18:12:04 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Tue, 1 Oct 2002 11:12:04 -0600 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: ; from neale@woozle.org on Tue, Oct 01, 2002 at 10:03:15AM -0700 References: Message-ID: <20021001111204.A4413@discworld.dyndns.org> Neale Pickett wrote: > > I understand some people may not have them, but the "From " lines seem > to be very useful, as they report who the sender identified themselves > as in the MAIL command of the SMTP envelope. This information is also normally recorded in a Return-Path: header, which is not dependent on the mail storage format, unlike the mbox-only "From " lines. 
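[Editorial aside: a sketch of how either source of the envelope sender
could feed the tokenizer -- prefer the mbox "From " line, fall back to the
Return-Path: header Charles mentions. The helper and token names are
invented here; spambayes' real tokenizer does neither of these verbatim:]

```python
import email
import re

def envelope_sender_tokens(text):
    """Yield a clue token for the SMTP envelope sender's domain, taken
    from an mbox-style "From " separator line when present, else from
    the Return-Path: header.  Names are made up for illustration."""
    sender = None
    m = re.match(r'From (\S+) ', text)   # mbox "From " line (no colon)
    if m:
        sender = m.group(1)
    else:
        msg = email.message_from_string(text)
        rp = msg.get('Return-Path', '')
        m = re.search(r'([^<>\s]+@[^<>\s]+?)>?$', rp)
        if m:
            sender = m.group(1)
    if sender and '@' in sender:
        yield 'env-from domain: ' + sender.split('@', 1)[1].lower()

mbox_msg = 'From mg@nflyagency.com Tue Sep 24 15:33:56 2002\nSubject: x\n\nbody\n'
maildir_msg = 'Return-Path: <mg@nflyagency.com>\nSubject: x\n\nbody\n'
print(list(envelope_sender_tokens(mbox_msg)))     # ['env-from domain: nflyagency.com']
print(list(envelope_sender_tokens(maildir_msg)))  # ['env-from domain: nflyagency.com']
```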
Charles
--
-----------------------------------------------------------------------
Charles Cazabon GPL'ed software available at:
http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From neale@woozle.org Tue Oct 1 19:00:06 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Oct 2002 11:00:06 -0700
Subject: [Spambayes] Patch and info on how to run a test
In-Reply-To: <3D98B463.3040006@videotron.ca>
References: <3D98B463.3040006@videotron.ca>
Message-ID:

So then, papaDoc is all like:
> Hi,
>
> This is a small patch to help people who don't have python in
> their path.

Hmm, what you probably ought to do is run runtest.sh like so:

  PATH=$PATH:/python/dir ./runtest.sh testname

But maybe you should figure out how to get python in your path instead :)
I have a /home/neale/bin directory in my path, where I can make symbolic
links to or write wrappers around stuff that wouldn't otherwise be in my
path. Failing that, you can always add something like this at the
beginning of the file:

  python() {
      /path/to/python "$@"
  }

which makes a "python" function that runs python.

> runtest.sh is talking about
> # This test requires you have an appropriately-modified
> # Tester.py.new and classifier.py.new as detailed in
> #
> Where can I find those two files ?

That test is obsolete now; Tim's already pronounced on Gary's ideas (he
liked them). I've taken it out of runtest.sh in CVS. Now it only has a
"set1" and "set2" target, which you can use to run two timcv tests.
You'll want to run

  ./runtest.sh -r set1

at least once, and from then on you can diddle with the code and run

  ./runtest.sh set2

to see what the diddling has done.

> I don't have 2000 spams yet only 1546 now but going up every day.
I never thought I'd live to see the day when people were hoping for more spam :) Neale From neale@woozle.org Tue Oct 1 19:48:15 2002 From: neale@woozle.org (Neale Pickett) Date: 01 Oct 2002 11:48:15 -0700 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: <20021001111204.A4413@discworld.dyndns.org> References: <20021001111204.A4413@discworld.dyndns.org> Message-ID: So then, Charles Cazabon is all like: > Neale Pickett wrote: > > > > I understand some people may not have them, but the "From " lines seem > > to be very useful, as they report who the sender identified themselves > > as in the MAIL command of the SMTP envelope. > > This information is also normally recorded in a Return-Path: header, which is > not dependent on the mail storage format, unlike the mbox-only "From " lines. Touché--so it is. I've been working with pipermail archives too long! Thanks for the correction, Neale From tim.one@comcast.net Tue Oct 1 20:17:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 15:17:46 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: Message-ID: [Richie Hindle] > I've just found this message in my spam corpus: > > ----------------------------------------------------------------------- > > [Some headers snipped] > > Subject: Mail for Richie Hindle > Content-Type: text/plain; charset="us-ascii" > Content-Transfer-Encoding: quoted-printable > X-Mailer: Mail Express > > Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t= > o=20announce=20the=20release=20of=20the=20following=20new=20software=20= > programs:=0D=0A=0D=0A1)=20FTP=20Navigator=206.58=0D=0Ahttp://www.intern= > et-soft.com/DEMO/ftpnavigator.exe=0D=0A=0D=0A2)=20Web=20Site=20eXtracto= > r=208.01=0D=0Ahttp://www.esalesbiz.com/extra/webextrasetup.exe=0D=0A=0D= > > [more of the same snipped] > > ----------------------------------------------------------------------- > > Looks like an attempt to fox system like spambayes. 
> It doesn't make much difference, because the tokenizer decodes the
> quoted-printable, but it could trigger a clue token.

The other trick of this nature is to encode the whole msg in base64. We
decode that too. tokenizer.py contains a comment block with
before-and-after tests run with and without generating tokens for
Content-Transfer-Encoding. Results were random (some runs got better,
others got worse), so I left it out. That didn't aim at catching
intentional obfuscation, though.

> I doubt there are enough spams out there for that to make any difference,
> and how to quantify whether a message looks like it's using this trick is
> not obvious. I only really mention it as a curiosity. It did come out
> as a false positive in my testing,

I *think* you meant it was a false negative, since you said it was in your
spam collection, and haven't argued that it's actually ham.

> but I don't think that was because of the quoting.

It's currently as if the quoted-printable business didn't exist. It
likely got mild ham boosts for the text/plain and us-ascii parts of the
Content-Type line.

> Less interesting are the results of running Tim's 4000-message tests on my
> corpora:
>
> -> best cutoff for all runs: 0.56
> -> with weighted total 10*2 fp + 37 fn = 57
> -> fp rate 0.1% fn rate 1.85%
> total unique false pos 2
> total unique false neg 37
> average fp % 0.1
> average fn % 1.85
>
> This tells me two things: I am Mr. Average, and the results are
> astonishingly impressive!

If you can without revealing a confidence, it would be good if you could
share the fp. Short of that, are these fp that bother you? Would you be
upset if you lost them in real life? There are about 10 msgs in my ham I
couldn't care less about, but I keep them in the ham just because they're
not truly spam.
8 of them are correctly classified at the moment, but if I found a change
that slashed the f-n rate at the cost of putting those 8 back into the
f-p class, I wouldn't count the latter against the change much.

BTW, non-Python conference announcements appear to be hated by the
central-limit versions of the classifier too -- but at least those
versions "know" they're confused about what to call them!

From tim.one@comcast.net Tue Oct 1 20:54:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 15:54:32 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <20021001161931.GA29333@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer]
> ...
> I don't like the way the tokenizer is heading right now either.

I only care which way the results are heading.

> I want to try generating n-grams from the headers. If that can be
> made to work reasonably well I think it will be a much better
> approach long term.

Be sure to read the comments in tokenizer.py about previous experiments
with character n-grams. A string of length N produces N-n+1 character
n-grams, and that's a ton of clues for a single string. For example,

  Organization: Massachusetts Institute of Technology

is going to generate a big pile of ham clues, and if a spammer happens to
include that header too, it's going to be hard to overcome them. There
are some specific examples in the aforementioned comments. This should be
less severe now, though, since max_discriminators is about 10x larger than
it used to be. Certainly worth trying!

From rob@hooft.net Tue Oct 1 20:57:57 2002
From: rob@hooft.net (Rob Hooft)
Date: Tue, 01 Oct 2002 21:57:57 +0200
Subject: [Spambayes] Central limit
References: Message-ID: <3D99FE45.3010905@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>
>> - The standard deviations seem "underestimated". Gary already said
>> this can be caused by correlations between scores.
>> Alternatively this can indicate that the data is not 1D: in more than
>> one dimension, a higher percentage of normally distributed data lies
>> outside of the "core regions".  Anyway, something can be done about
>> this: just calculate the RMS Z-score, and scale it to 1.0.
>
> Sorry, I don't know what that means or how to compute it; neither does
> google .  Let's say this is my population: {2, 5, 10, 64}.  Then what
> are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32, 64, and 1000?

You can calculate the Root Mean Square (RMS) of all Z-scores.  That is the
same as the "standard deviation" of the population.  This appears to be
around 3-4.  If we calculate the value for one test run, it can be used as
a parameter on the next run, to make sure the Z-scores really form a
distribution of 0+/-1.  These parameters might even be relatively
corpus-insensitive.

>> - The "certainty" rule of Tim should be formalized.

> Sure, but how?  I made up a combination of "look at ratios" and
> "different cutoffs for different n" by iteratively staring at the errors
> and making stuff up.  Even then all I get is a binary "certain or
> uncertain?" decision out of it, and without a clear connection to
> quantifiable probabilities I don't have strong reason to believe it's a
> sensible approach in general.

I'd say something like erf(Zspam) is the chance that the message belongs to
the spam corpus (assuming the renormalized Z-scores) and erf(Zham) is the
chance that the message belongs to the ham corpus.  If
1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance" that
the message belongs to neither; erf(Zham)+erf(Zspam) is then a way to
express the "classifyability".  Normally Zham and Zspam are not both small,
but the math might need to handle the weird case where the sum of these two
is larger than one...  Somebody with a proper statistical background can
probably improve on this.  Unfortunately "import math" does not come with
the error function....

Rob
--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Tue Oct 1 21:07:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 16:07:53 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <3D99CA01.1090100@startechgroup.co.uk>
Message-ID:

[Matt Sergeant]
> ...
> We have a much more robust mailman detector already.  And that's my point
> - a spammer can get around your naive "mailman detector" with a bunch of
> underscores anywhere in his message, but he has to work a lot harder to
> get around a more robust detection system (it's not invincible, but it
> would probably require him modifying his software).

Matt, we don't have *any* "mailman detector", and that's a key point.  We
generate "skip" tokens for every string longer than 12 chars, and that it
happened to catch a Mailman clue is pure luck.  It's not trying to *do*
anything specific.  We catch so many "Mailman clues", in fact, that I dare
not look at most of the header lines in my mixed-source data -- the Mailman
clues it picks up purely by luck then are too strong.

As to a spammer trying to exploit it, not a problem.  No single word can
determine the outcome, and if spammers take to putting '-'*40 in their
spam, the system will learn to disregard it.  I've done this experiment: I
ran my fat test, looked at the list of the top 50 discriminators, and
purged them all from the database.  Then I ran my fat test again.  The
performance wasn't significantly worse.  If one set of clues becomes
worthless, it finds another set.  So long as spam is trying to sell you
something, "it's different".

> So give the dog (spambayes) a bone.  Let it eat all the information
> you can give it.

This is fine, provided it doesn't bloat the database size, or increase
classification time, without a compensating measurable improvement in
results.
Part of the tokenizer is as finicky as it is because I'm aiming to keep
size and time requirements in bounds too (so, e.g., I deliberately don't
tokenize Content-Transfer-Encoding, and note the presence or absence of an
Organization line but without tokenizing its value: experiments showed that
what I *do* do in these cases helped, but that the parts I left out did not
help).

> None of it is going to hurt, or if it does you can chuck that out like
> you have been doing for a few weeks already with other tokenising ideas!

As a general rule, I add things that help, rather than adding in lots of
ideas at once and then throwing out the things that don't help.  Our
results steadily progress in the right direction, so I'm going to stick
with what works.

From tim.one@comcast.net Tue Oct 1 22:36:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 17:36:33 -0400
Subject: [Spambayes] Matt Sergeant: Introduction
In-Reply-To: <3D996855.9030707@startechgroup.co.uk>
Message-ID:

[Matt Sergeant]
> ...
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
>
>     colspan=2
>     face=Arial, Helvetica, sans-serif
>
> as tokens.  Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).

Matt, what are you using as test data?  The experience here has been that
HTML is sooooo strongly correlated with spam that we've added gimmick after
gimmick to remove evidence that HTML ever existed; else the rare ham that
uses HTML-- or even *discusses* HTML with a few examples! --had an
extremely hard time avoiding getting classified as spam.  As a result, by
default we're stripping all HTML tags unlooked-at (except that we mine http
etc thingies before stripping tags).  Even so, the mere presence of a
text/html content-type, and "&nbsp;", still have very high spamprob, and so
still make it hard for content-free <0.1 wink> HTML hams to get thru.
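The strip-tags-but-mine-URLs behavior Tim describes can be sketched roughly
like this.  This is not the actual spambayes tokenizer -- the regexes, the
"url:" token prefix, and the word-length bounds below are invented for
illustration only:

```python
import re

# Hedged sketch of "mine http thingies, then strip tags unlooked-at".
# The patterns and the "url:" prefix are made up; the real tokenizer
# is more careful than this.
URL_RE = re.compile(r"https?://[^\s'\">]+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")

def tokenize_html(text):
    # Mine URL clues first, so stripping the tags doesn't lose them.
    tokens = ["url:" + u.lower() for u in URL_RE.findall(text)]
    # Then drop all markup without looking at it.
    stripped = TAG_RE.sub(" ", text)
    tokens.extend(w.lower() for w in stripped.split() if 3 <= len(w) <= 12)
    return tokens
```

Run on a spammy anchor tag, this yields the URL token plus the visible
words, and nothing about the markup itself.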
A problem seems to be that everyone here subscribes to a few HTML marketing
newsletters (whether they think of them that way or not), but that the only
other HTML they get in their email is 100x more HTML spam.  That gives
every indication of HTML spamprobs >= 0.99, and legitimately so.

A compounding problem then is that the simplifying assumption of
word-probability independence is grossly violated by HTML markup -- the
prob that a msg contains colspan=2 and the prob that a msg contains
face=Arial aren't independent at all, and pretending that they are
independent grossly overestimates the rarity of seeing them both in a
single msg.

Do you find, for example, that

    colspan=2

is common in HTML ham but rare in HTML spam, or vice versa?  I know of
specific cases where we're missing good clues by purging HTML decorations,
but nobody here has yet found a strategy for leaving them in that isn't a
disaster for at least one of the error rates.  I'm wondering what's sparing
you from that fate.

> Using word tuples is also a small win,

Word bigrams were a loss for us (comments in tokenizer.py).  This should be
revisited under Gary's scheme, and/or we should stop thinking of
unsolicited conference announcements as being ham .

> but increases the database size and number of tokens you have to
> pull from the database enormously.

That was also our experience with word bigrams, but less than "enormously";
about a factor of 2; character 5-grams were snuggling up to enormously.

> That's an issue for me because I'm not using an in-memory database (one
> implementation uses CDB, another uses SQL - the SQL one is really nice
> because you can so easily do data mining, and the code to extract the
> token probabilities is just a view).

I haven't hooked ours up to a database yet, but others have.  It's
premature for my present purposes .

> ...
> Well I very quickly found out that most of the academic research into
> this has been pretty bogus.  For example everyone seems (seemed?)
> to think that stemming was a big win, but I found it to lose every time.

We haven't tried that.  OTOH, the academic research has been on Bayesian
classifiers, and this isn't one (despite that Paul called it one).

> ...
> The one thing that still bothers me about Gary's method is that
> the threshold value varies depending on corpus.  Though I expect there's
> some mileage in being able to say that the middle ground is "unknown".

It does allow for an easy, gradual, and effective way to favor f-n at the
expense of f-p, or vice versa.  There was no such possibility under Paul's
scheme, as the more training data we fed in, the rarer it was for *any*
score not to be extremely close to 0 or extremely close to 1, regardless of
whether the classification was right or wrong.  Gary's method hasn't been
caught in such extreme embarrassment yet.

OTOH, it *is*, as you say, corpus dependent, and it seems hard to get that
across to people.  Gary has said he knows of ways to make the distinction
sharper, but we haven't yet been able to provoke him into revealing them .
The central limit variations, and especially the logarithmic one, are much
more extreme this way.

> ...
> OK, I'll go over it again this week and next time I get stuck I'll mail
> out for some help ;-)  The hardest part really is getting from how my
> code is structured (i.e. where I get my data from, how I store it, etc)
> to your version.  Simple examples like where you use a priority queue for
> the probabilities so you can extract the top N indicators, I just use an
> array, and use a sort to get the top N.

The priority queue was potentially much more efficient when
max_discriminators was 15.  I expect that it costs more than it's worth now
that we've boosted it to 150, so if there's ever a hint that the scoring
time is non-trivial, I'll probably use an array too.

> So mostly it's just the details of storage that confuse me.

I'm not sure what that means, but expect it will get fleshed out in time.

> ...
> And where is compute_population_stats used?

It has a non-trivial implementation only under the central-limit
variations, of which there are two.  It's intended to be called after
update_probabilities is called at the end of training, to do a third
training pass computing population ham & spam means & variances.  Most
people here aren't aware of that, as it happens "by magic" when a test
driver calls TestDriver.Driver.train():

    # CAUTION: this just doesn't work for incremental training when
    # options.use_central_limit is in effect.
    def train(self, ham, spam):
        print "-> Training on", ham, "&", spam, "...",
        c = self.classifier
        nham, nspam = c.nham, c.nspam
        self.tester.train(ham, spam)
        print c.nham - nham, "hams &", c.nspam - nspam, "spams"
        c.compute_population_stats(ham, False)
        c.compute_population_stats(spam, True)

>>> ...
>>> such as how the probability stuff works so much better on individuals'
>>> corpora (or on a particular mailing list's corpus) than it does for
>>> hundreds of thousands of users.

> On my personal email I was seeing about 5 FP's in 4000, and about 20
> FN's in about the same number (can't find the exact figures right now).

So to match the units and order of the next sentence, about 0.5% FN rate
and 0.13% FP rate.

> On a live feed of customer email we're seeing about 4% FN's and 2% FP's.

Is that across hundreds of thousands of users?  Do you know the
corresponding statistics for SpamAssassin?  For python.org use, I've
thought that as long as we could keep this scheme fast, it may be a good
way to reduce the SpamAssassin load.

From tim.one@comcast.net Wed Oct 2 00:35:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 19:35:14 -0400
Subject: [Spambayes] to From_ or not to From_?
In-Reply-To:
Message-ID:

[Neale Pickett, "From " lines]
> ...
> Here, I'll put my money where my mouth is.  My mail program writes the
> From header as an X-From: line.
> I add this to my bayescustomize.ini:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received
>     date
>     x-[^f][^r].*

Note that this tokenizes a great many header lines beyond just x-from.
Something like

    basic_header_skip: (?!x-from)

would have been sharper (that's a negative lookahead assertion: it matches
iff the header name doesn't match x-from, so it skips a header line iff
it's not x-from, so it looks only at x-from -- all obvious to the most
casual observer ).

> And I get this on my tiny corpus (2x5x200 messages):
>
> """
> false positive percentages
>     1.500  1.500  tied
>     1.000  1.000  tied
>     2.000  1.000  won    -50.00%
>     1.500  1.000  won    -33.33%
>     1.500  1.000  won    -33.33%
>
> won   3 times
> tied  2 times
> lost  0 times
>
> total unique fp went from 15 to 11 won -26.67%
> mean fp % went from 1.5 to 1.1 won -26.67%
>
> false negative percentages
>     1.500  1.000  won    -33.33%
>     0.000  0.500  lost  +(was 0)
>     1.000  1.000  tied
>     0.500  0.000  won   -100.00%
>     1.000  1.000  tied
>
> won   2 times
> tied  2 times
> lost  1 times
> """
>
> In all but one case where something changed, it was just a single
> message.  That's not a huge improvement,

*Relative to* your error rates, it was a huge improvement, but it's hard to
be confident about it because the absolute # of msgs involved is so small.
Still, that it won 3 times on f-p, and never lost, adds to the confidence
you should have that it truly helped.

> but maybe enough of one to convince someone with a larger test
> set to try it out?

I can't get away with tokenizing so many header lines; there are too many
"good clues for bad reasons" in my mixed-source data.

From tim.one@comcast.net Wed Oct 2 00:52:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 19:52:41 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <20021001155013.GB1581@cthulhu.gerg.ca>
Message-ID:

[Greg Ward]
> ...
> This, IMHO, is one respect in which SA is much more mature than
> spambayes: I see a lot of people here groping through a
> multi-dimensional space made up of various options and algorithm tweaks,
> trying to optimize something (the FP rate, the FN rate, the distance
> between the two histograms, whatever).  In contrast, SpamAssassin
> drastically simplifies the space to explore -- it's the space of all SA
> rules and scores -- and automates the optimization by using a genetic
> algorithm.  There's a middle ground waiting to be found somewhere...

There are many ways to automatically improve learning algorithms, and at
least boosting has been mentioned here several times.  It would likely also
be suitable for improving SpamAssassin.  Robert Schapire's papers are the
ones to read:

    http://www.research.att.com/~schapire/boost.html

I can't make time to pursue it, and, at least on my corpus, I doubt any
algorithm is going to do significantly better than what we've got right now
(given my current error rates, that's not really open to rational
debate ).

From tim.one@comcast.net Wed Oct 2 01:03:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 20:03:08 -0400
Subject: [Spambayes] new virus...
In-Reply-To: <15769.45976.592234.222829@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> Not quite on-topic for this group, but I know some people are
> interested in getting this project to identify viruses.

Greg Ward explained the very simple scheme he uses to catch viruses at
python.org, and by all signs it's very effective.  Provided users are
willing to block executables of all kinds, piece o' cake!

I'm working on a much more insidious virus, which exploits the confusion
caused by mixing tabs and spaces .
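Greg's actual python.org scheme isn't spelled out in this thread, but
extension-based blocking of executable attachments is simple to sketch with
the standard email package.  The blocklist below is a guess for
illustration, not python.org's real list:

```python
import email

# Hypothetical blocklist -- the extensions python.org actually blocks
# aren't given in this thread.
BLOCKED_EXTENSIONS = {".exe", ".scr", ".pif", ".bat", ".com", ".vbs"}

def has_blocked_attachment(raw_message):
    """Return True if any MIME part carries an executable-looking filename."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        filename = part.get_filename()
        if filename and any(filename.lower().endswith(ext)
                            for ext in BLOCKED_EXTENSIONS):
            return True
    return False
```

The appeal of the scheme is exactly what Tim notes: it needs no training at
all, only users willing to live without executables in their mail.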
From jcarlson@uci.edu Wed Oct 2 02:42:51 2002
From: jcarlson@uci.edu (Josiah Carlson)
Date: Tue, 01 Oct 2002 18:42:51 -0700
Subject: [Spambayes] Good evening/morning/afternoon everyone
In-Reply-To:
References: <20020928153427.D68E.JCARLSON@uci.edu>
Message-ID: <20021001183417.D9A6.JCARLSON@uci.edu>

Richie,

> > I have (in the past) had email software that doesn't allow arbitrary
> > header matching.  By inserting the Subject, I guarantee that ANY email
> > software can filter it.
>
> A case for an option, maybe.  How old was this software?  (please say
> "Very old" 8-)
>
> Thanks for the explanations of everything else.  I hope my comments were
> useful.

It is in fact fairly old...like '98 vintage or so.  I recently upgraded to
their 2.0 release and thought it was the same.  Turns out it automatically
parses new headers and allows one to search for those specifically.  Pretty
neat.  (FYI I'm using Becky! internet email and just wrote a console Python
app to parse and read email from their proprietary files; works pretty
smoothly.)

Your comments were very useful.  I've fixed the size stuff and went with
the X-Hammie-Disposition header.  Also fixed some other code and added a
few utilities (like one to convert mbox format files to pasp, and vice
versa).  No problem about the explanations, thanks for the comments *smile*

- Josiah

From tim_one@email.msn.com Wed Oct 2 05:30:50 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 2 Oct 2002 00:30:50 -0400
Subject: [Spambayes] Central limit
In-Reply-To: <3D99FE45.3010905@hooft.net>
Message-ID:

[Rob Hooft]
>>> - The standard deviations seem "underestimated".  Gary already said
>>> this can be caused by correlations between scores.  Alternatively
>>> this can indicate that the data is not 1D: in more than one
>>> dimension, a higher percentage of normally distributed data lies
>>> outside of the "core regions".  Anyway, something can be done about
>>> this: just calculate the RMS Z-score, and scale it to 1.0.
Just noting that since the central limit theorem doesn't really apply here,
the justification for dividing the population variance by n to estimate the
sample variance seems approximately non-existent.  That division has a big
effect on what we're taking to be "the sdev" when n=50.

[Tim]
>> Sorry, I don't know what that means or how to compute it; neither does
>> google .  Let's say this is my population: {2, 5, 10, 64}.  Then
>> what are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32,
>> 64, and 1000?

[Rob]
> You can calculate the Root Mean Square (RMS) of all Z-scores.

By RMS I understand you to mean the square root of the mean of the Z-score
squares.  Is that what you mean?

> That is the same as the "standard deviation" of the population.

But now I'm lost again.  The RMS of a population isn't the same as the sdev
of a population (as I understand sdev), unless the mean of the population
happens to be 0.  The mean Z-score is definitely not 0 in the results I'm
seeing; the zscores for ham are highly skewed to one side of 0.

> This appears to be around 3-4.

For what?  The RMS of {1, 2, 32, 64, 1000} is about 450:

    >>> math.sqrt((1**2 + 2**2 + 32**2 + 64**2 + 1000**2)/5.)
    448.35811579584458

The RMS of {2, 5, 10, 64} is about 32:

    >>> math.sqrt((2**2 + 5**2 + 10**2 + 64**2)/4.)
    32.5

OTOH, what I understand to be the sdevs of those are about 32 and 25,
respectively.  So I'm out of ideas for where 3-4 might come from.  I note
that most google hits on "RMS Z-score" land on the WHAT IF program you
worked on as a postdoc -- so I suspect this is a case where something is so
obvious to you it may be impossible for you to explain it <0.9 wink>.

> If we calculate the value for one test run, it can be used as a parameter
> on the next run, to make sure the Z-scores really form a distribution of
> 0+/-1.  These parameters might even be relatively
> corpus-insensitive.
> ...
> I'd say something like erf(Zspam) is the chance that the message belongs
> to the spam corpus (assuming the renormalized Z scores) and erf(Zham) is
> the chance that the message belongs to the ham corpus.
>
> If 1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance"
> that the message belongs to neither; erf(Zham)+erf(Zspam) is then a way
> to express the "classifyability".  Normally not both of Zham and Zspam
> are small, but the math might need to handle the case that the sum of
> these two is larger than one for the weird case...  Somebody with a
> proper statistical background can probably improve on this.
> Unfortunately import math does not come with the error function....

That won't be a problem if there's something useful here.  I'm willing to
pursue it, but am getting hints that it will work worse than the
seat-of-the-pants ratio gimmick.  The "region of certainty" where the ratio
gimmick has never been *observed* to make a mistake includes cases where
both zscores are so large that, no matter how we fiddle them, the system
would believe there's no earthly chance the msg came from either
population.  But so long as

    |larger zscore| / |smaller zscore|

"is big enough", it doesn't seem to matter how large the smaller zscore is.
The most extreme ham in this "region of (seeming) certainty" had a ham
zscore of -24.3, and a spam zscore of -42.9.  I only tallied percentages up
to ham |zscores| of 10 before, but it's clear from this that the percentage
at 24.3 would be insignificant:

    This % of hams had    abs(zham) <= this
    ------------------    -----------------
         18.377%                 1.0
         36.525%                 2.0
         53.650%                 3.0
         67.919%                 4.0
         78.301%                 5.0
         85.831%                 6.0
         90.788%                 7.0
         93.762%                 8.0
         95.696%                 9.0
         97.044%                10.0

The similar table for spam zscores showed lower variance.

From tim_one@email.msn.com Wed Oct 2 06:01:18 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 2 Oct 2002 01:01:18 -0400
Subject: [Spambayes] mining dates?
In-Reply-To: <20021001042532.GA28075@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer]
> ...
> I use a different email address for each email list I sign up on.  That
> makes sorting easy.  My ham and spam collection is taken from addresses
> that don't receive mailing list traffic.  So, signing up for more lists
> wouldn't help.

Who said anything about signing up for lists you want to read?  We're just
trying to get you more ham .

From crunch@shopip.com Tue Oct 1 20:51:11 2002
From: crunch@shopip.com (John Draper)
Date: Tue, 1 Oct 2002 12:51:11 -0700
Subject: [Spambayes] Some ideas I have....
In-Reply-To:
References:
Message-ID:

Neale writes:

>So then, John Draper is all like:
>
>> I want to start up another discussion about what the direction of the
>> group is heading, as far as addressing the issues of where spam
>> filtering should take place.  IE: Client side vs. server side.
>
>Currently we have two applications of the classifier: hammie and
>pop3proxy.  Both of these can run on either the client or the server.
>
>Your "bureaucrat" model sounds a lot like an observer pattern
>.  This is what procmail
>does with incoming mail, dispatching events to various processing
>functions (like hammie or spamassassin) who can each take a swing at the
>message.
>
>What might be really useful would be a hook into an existing SMTP
>server.  But before that happens, we need to answer some questions like
>whether or not it's feasible to run one classifier database against an
>entire organization or ISP's email.

We wrote a cheap and dirty MTA (SMTP server) in Python.  Definitely not
RFC 821 compliant, but I use it as a test or proof of performance.  These
issues I think are important (system-wide filtering vs. local user-level
filtering).

John

From crunch@shopip.com Wed Oct 2 09:03:27 2002
From: crunch@shopip.com (John Draper)
Date: Wed, 2 Oct 2002 01:03:27 -0700
Subject: [Spambayes] Another proposal from one of us.
Message-ID:

I propose a preprocessor which would convert message "meta-information"
into tokens which would be appended to the email message prior to digestion
by SpamBayes.  Call this SBMIP.

Meta-information is content information which is outside that of the text
in the message header or body.  Example meta-information rules might be:

----------------------------------------
Was the message body entirely in upper case?
Have I sent mail to this address before?
Have I received mail from this address before?
Is this sender in my address book?
Are all the recipients in my address book?
Was this sent to more than one person?
Was this sent to more than two people?
Was this sent between the hours of 8AM and 5PM?
Was this received between the hours of midnight and 6AM?
Did this message have attachments?
Was this message plain or HTML?
Was this sent from a Pacific Rim IP address block (see IANA list)?
Does the subject include non-ASCII (>127) characters?
Does the body include a large number of non-ASCII characters?
Is the character set other than Latin, ISO, ...?
Is this from a time zone other than mine?
Is there a greater than three hour difference in time zones?
Does the source IP address have a DNS entry?
Is the source IP ping-able (requires sending ping and waiting for response)?
Was the domain name forged?
Is this a valid source email (would require connecting to source POP acct)?

There is some overlap between these and what the tokens themselves reveal;
that is not a concern.  The more information, the better.  The last four
rules are relatively slow, so should be made optional.  This rule checker
would probably be implemented as a Python module, like a plug-in.  There's
too much customization to have simply a rules text file, though a text file
and matching engine would work nicely.  This way many people could easily
contribute their own preprocessor rules.
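A minimal sketch of how such a rule checker might look as a Python plug-in.
The two rules, the rule IDs, and the user number below are all invented;
they only show the shape of the rule-to-token mapping:

```python
USER_ID = 54321  # per-user random number, as in the proposal

def rule_mostly_upper_case(body):
    # Invented rule: treat a body as "upper case" if >80% of letters are.
    letters = [c for c in body if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.8

def rule_has_high_bit_chars(body):
    # Invented rule: any character beyond ASCII 127.
    return any(ord(c) > 127 for c in body)

# (rule id, rule function) pairs; a real deployment would register many more.
RULES = [(1234, rule_mostly_upper_case), (1235, rule_has_high_bit_chars)]

def meta_tokens(body):
    """One token per rule: SB<user>-<rule id> followed by T or F."""
    return ["SB%d-%04d%s" % (USER_ID, rule_id, "T" if rule(body) else "F")
            for rule_id, rule in RULES]
```

Each token is an ordinary word as far as SpamBayes is concerned, so the
classifier learns each rule's weight from training data rather than from a
hand-assigned score.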
Each of these rules would generate a unique tag, one value for true and
another for false.  For example, "Is this message mostly in upper case?"
might output "SB54321-1234T" or "SB54321-1234F".  "SB" (SpamBayes) is just
an identifier, "54321" is a user-unique random number so that other
people's posted or forwarded messages don't coincidentally match your own
preprocessor, "1234" is the rule ID number, and the "T" or "F" is whether
the rule was matched or not.

The preprocessor output/SpamBayes input might look something like:
------------------------------------------------------------------
Date: xxxx
From: xxxx
Subject: xxxx
To: xxx
Bla: xxxx
Header junk: xxx

This is the text of any old random message.

24973 is unique to that user.  If that user imports someone else's
knowledge base, they would change the #'s to their own.

SB24973-0001T
SB24973-0002F
SB24973-0004T
SB24973-0034F
...
SB24973-0234T
SB24973-0235T
SB24973-0236F

-BBC, 2002-10-01.

From msergeant@startechgroup.co.uk Wed Oct 2 09:26:41 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 02 Oct 2002 09:26:41 +0100
Subject: [Spambayes] Matt Sergeant: Introduction
References:
Message-ID: <3D9AADC1.9030207@startechgroup.co.uk>

Tim Peters wrote:
> [Matt Sergeant]
>
>> ...
>> And to give back I'll tell you that one of my biggest wins was parsing
>> HTML (with HTML::Parser - a C implementation so it's very fast) and
>> tokenising all attributes, so I get:
>>
>>     colspan=2
>>     face=Arial, Helvetica, sans-serif
>>
>> as tokens.  Plus using a proper HTML parser I get to parse HTML comments
>> too (which is a win).
>
> Matt, what are you using as test data?  The experience here has been
> that HTML is sooooo strongly correlated with spam that we've added
> gimmick after gimmick to remove evidence that HTML ever existed; else
> the rare ham that uses HTML-- or even *discusses* HTML with a few
> examples! --had an extremely hard time avoiding getting classified as
> spam.
We have a live feed from one of our towers.  You have to be careful to
classify only HTML that is actually going to be rendered as HTML by the
client (i.e. content-type: text/html, or the whole thing is HTML, which is
a heuristic Outlook seems to use -- infuriating).  Due to it being a live
feed, we get all sorts of HTML newsletters in there, so only real spammy
indicators get noticed, rather than HTML being a generic catch-all.  I
guess the point being that we see more HTML newsletters than we see HTML
spam ;-)

> Do you find, for example, that
>
>     colspan=2
>
> is common in HTML ham but rare in HTML spam, or vice versa?

select * from words where word = 'colspan=2';

   word    | goodcount | badcount
-----------+-----------+----------
 colspan=2 |      3950 |     4197

Hmm, I guess colspan=2 wasn't a good example .

> I'm wondering what's sparing you from that fate.

I suspect it's just the corpus.

>> but increases the database size and number of tokens you have to
>> pull from the database enormously.
>
> That was also our experience with word bigrams, but less than
> "enormously"; about a factor of 2; character 5-grams were snuggling up
> to enormously.

I think for me it was more me hitting the limits of the performance I
could expect from PostgreSQL.  Expecting 10,000 selects to come back in
anything like a reasonable timeframe was a bit much to ask ;-)

>> Well I very quickly found out that most of the academic research into
>> this has been pretty bogus.  For example everyone seems (seemed?) to
>> think that stemming was a big win, but I found it to lose every time.
>
> We haven't tried that.  OTOH, the academic research has been on Bayesian
> classifiers, and this isn't one (despite that Paul called it one).

True, but my original classifier was bayesian (naive).

>> The one thing that still bothers me about Gary's method is that
>> the threshold value varies depending on corpus.  Though I expect
>> there's some mileage in being able to say that the middle ground is
>> "unknown".
> It does allow for an easy, gradual, and effective way to favor f-n at
> the expense of f-p, or vice versa.  There was no such possibility under
> Paul's scheme, as the more training data we fed in, the rarer it was
> for *any* score not to be extremely close to 0 or extremely close to 1,
> regardless of whether the classification was right or wrong.  Gary's
> method hasn't been caught in such extreme embarrassment yet.
>
> OTOH, it *is*, as you say, corpus dependent, and it seems hard to get
> that across to people.  Gary has said he knows of ways to make the
> distinction sharper, but we haven't yet been able to provoke him into
> revealing them .  The central limit variations, and especially the
> logarithmic one, are much more extreme this way.

Is that central_limit_2 as you call it?

>> On my personal email I was seeing about 5 FP's in 4000, and about 20
>> FN's in about the same number (can't find the exact figures right now).
>
> So to match the units and order of the next sentence, about 0.5% FN rate
> and 0.13% FP rate.
>
>> On a live feed of customer email we're seeing about 4% FN's and 2% FP's.
>
> Is that across hundreds of thousands of users?

It's just on one particular email tower, so around a few thousand I think.

> Do you know the corresponding statistics for SpamAssassin?  For
> python.org use, I've thought that as long as we could keep this scheme
> fast, it may be a good way to reduce the SpamAssassin load.

I don't keep stats for SpamAssassin - we don't use it "pure" so it wouldn't
be worth it.

FWIW, I'm working on making SpamAssassin 3 significantly faster (like about
50x) by using a decision tree rather than a linear scan of all rules.

I think for your purposes (python.org mailing lists) there's probably a lot
of mileage in doing spambayes first, then if spambayes is unsure (say
between .40 and .60) run the email through spamassassin (but set the
threshold to 7).

Matt.

From rob@hooft.net Wed Oct 2 11:22:16 2002
From: rob@hooft.net (Rob W.W.
Hooft)
Date: Wed, 02 Oct 2002 12:22:16 +0200
Subject: [Spambayes] Central limit
References:
Message-ID: <3D9AC8D8.5020308@hooft.net>

Tim Peters wrote:

[Tim]
>>> Sorry, I don't know what that means or how to compute it; neither does
>>> google .  Let's say this is my population: {2, 5, 10, 64}.  Then
>>> what are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32,
>>> 64, and 1000?

[Rob]
>> You can calculate the Root Mean Square (RMS) of all Z-scores.

[Tim]
> By RMS I understand you to mean the square root of the mean of the
> Z-score squares.  Is that what you mean?

Yep.

>> That is the same as the "standard deviation" of the population.

> But now I'm lost again.  The RMS of a population isn't the same as the
> sdev of a population (as I understand sdev), unless the mean of the
> population happens to be 0.  The mean Z-score is definitely not 0 in the
> results I'm seeing; the zscores for ham are highly skewed to one side
> of 0.

OK.  I'd like to see a "histogram" to see what causes this.  Is 0 still
the most frequently observed value, and is the distribution asymmetric, or
does it look like a bell-curve that is offset?

What we are trying to do is to "describe" the histogram in as few
parameters as possible.  For that, it is not very important to get the
"bulk-form" right, as long as the tails of the distribution (or at least
the relevant one of the two tails) are reasonably described.  This is
because we are not interested in Z-scores lower than 1-2; they indicate by
themselves that the tested message is part of the population.  Only for
abs(Z)>2 do we need a reasonable description, to be able to calculate a
"chance".  It may be necessary to describe the thing with more than an
average and standard deviation (skew, kurtosis), but chances are that we
can do with a "fake average" and a "fake standard deviation" to describe
the one interesting tail.

>> This appears to be around 3-4.
>
> For what?
The RMS of {1, 2, 32, 64, 1000} is about 450: Sorry, I didn't look at your example set, but at the distribution you gave earlier and again a bit lower in this message. > I note that most google hits on "RMS Z-score" land on the WHAT IF program > you worked on as a postdoc -- so I suspect this is a case where something is > so obvious to you it may be impossible for you to explain it <0.9 wink>. :-) and/or I didn't see enough of the clt results yet to communicate in a reasonable way. In fact, my machine is crunching on a clt run now for the first time, but it is taking several hours for my corpora. >>1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance" that >>the message belongs to neither; erf(Zham)+erf(Zspam) is then a way to >>express the "classifyability". Normally not both of Zham and Zspam are >>small, but the math might need to handle the case that the sum of these >>two is larger than one for the weird case... Somebody with a proper >>statistical background can probably improve on this. Unfortunately >>import math does not come with the error function.... > > > That won't be a problem if there's something useful here. I'm willing to > pursue it, but am getting hints that it will work worse than the > seat-of-the-pants ratio gimmick. The "region of certainty" where the ratio > gimmick has never been *observed* to make a mistake includes cases where > both zscores are so large that, no matter how we fiddle them, the system > would believe there's no earthly chance the msg came from either population. > But so long as > > |larger zscore| / |smaller zscore| > > "is big enough", it doesn't seem to matter how large the smaller zscore is. > The most extreme ham in this "region of (seeming) certainty" had a ham > zscore of -24.3, and a spam zscore of -42.9.
I only tallied percentages up > to ham |zscores| of 10 before, but it's clear from this that the percentage > at 24.3 would be insignificant: Does that -24.3/-42.9 look like anything else you have ever seen? For me this would clearly indicate that "it is unlike anything in the two corpora, but if you really want to express chances, it is VERY much more likely (18 sigma!) to be ham than spam".

> This % of hams had    abs(zham) <= this
> --------------        ---------------------
> 18.377%               1.0
> 36.525%               2.0
> 53.650%               3.0
> 67.919%               4.0
> 78.301%               5.0
> 85.831%               6.0
> 90.788%               7.0
> 93.762%               8.0
> 95.696%               9.0
> 97.044%              10.0

This is where I got my scaling factor of ~4: This population has its 65% cutoff at Z=~4 and 95% cutoff at Z=~9. In a normal distribution these two occur at 1 sigma and 2 sigma, respectively. This indicates that Zham is about 4 times too large. Your -24.3 above thus reduces to ~6 standard deviations: still very unlikely (Hm, I have some problems with the definition and use of the error function. Do I need to do erfc(sqrt(p))? That gives p=5e-4 for Z=6), but less so than the original 24.... For spam the reduction factor might be ~3 (you say it is sharper), reducing the 42.9 to 14 standard deviations (p=1e-7), which is indeed much, much less likely (pham/pspam>4000) than the ~6 for the distance to the ham population. But if it were my incoming mail, I think I'd want to see it "unclassified" because it is so different from both. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Wed Oct 2 12:52:56 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 12:52:56 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <3D99A3E6.4050403@startechgroup.co.uk> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: > Lotus Notes still can't filter on arbitrary headers. Grr.
Do you know what it *can* filter on? Is there a sensible behaviour for pop3proxy that would work for Notes? Preferably something less intrusive than Josiah's idea of modifying the Subject line. -- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Wed Oct 2 13:34:05 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Wed, 02 Oct 2002 13:34:05 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: <3D9AE7BD.5090607@startechgroup.co.uk> Richie Hindle wrote: >>Lotus Notes still can't filter on arbitrary headers. > > > Grr. Do you know what it *can* filter on? Is there a sensible > behaviour for pop3proxy that would work for Notes? Preferably > something less intrusive than Josiah's idea of modifying the > Subject line. We've been thinking about this at work. We *think* it might be able to look at the Precedence headers, so you could potentially set them to "junk" and have it work. Alternatively you could modify the From header (and set Reply-To if it's not set) to something like "spammer". Or finally yes, you can modify the subject. Definitely the worst piece of junk email client I've ever had to deal with. Wait until you have to ask for an original email from them with all headers intact.
Bwahahahahahaha ;-) From noreply@sourceforge.net Wed Oct 2 10:53:51 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Wed, 02 Oct 2002 02:53:51 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 13:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Sinchi Pacharuraq (sinchi) Date: 2002-10-02 13:53 Message: Logged In: YES user_id=621182 I just want to have this anti-spam filter built in Mozilla message filters. For example, user might activate this filter to delete spam messages from inbox or to move it to special folder. ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 20:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From skip@pobox.com Wed Oct 2 14:20:17 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 2 Oct 2002 08:20:17 -0500 Subject: [Spambayes] Matt Sergeant: Introduction In-Reply-To: <3D9AADC1.9030207@startechgroup.co.uk> References: <3D9AADC1.9030207@startechgroup.co.uk> Message-ID: <15770.62097.97078.342522@12-248-11-90.client.attbi.com> Matt> We have a live feed from one of our towers.... then later: Matt> It's just on one particular email tower, ... What's an "email tower"? Skip From neale@woozle.org Wed Oct 2 16:02:35 2002 From: neale@woozle.org (Neale Pickett) Date: 02 Oct 2002 08:02:35 -0700 Subject: [Spambayes] Another proposal from one of us. In-Reply-To: References: Message-ID: So then, John Draper is all like: > I propose a preprocessor which would convert message > "meta-information" into tokens which would be appended to the email > message prior to digestion by SpamBayes. Call this SBMIP. Why not just run it through SpamAssassin first, then have your tokenizer pay attention to the tests reported in the X-Spam-Status header? SA is probably a better place to be doing this sort of testing anyhow--that's what SA is all about. :) Neale From jcarlson@uci.edu Wed Oct 2 16:21:50 2002 From: jcarlson@uci.edu (Josiah Carlson) Date: Wed, 02 Oct 2002 08:21:50 -0700 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: References: <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: <20021002081804.E1B7.JCARLSON@uci.edu> > > Lotus Notes still can't filter on arbitrary headers. > > Grr. Do you know what it *can* filter on? Is there a sensible > behaviour for pop3proxy that would work for Notes? Preferably > something less intrusive than Josiah's idea of modifying the > Subject line. I only modified it when it was a suspected spam. In those cases, it was nice to know. 
Of course my software knows to not use that portion of the subject (or really any word longer than 10 characters). But it now does the X-Hammie-Disposition thing. *grin* And subject line modification is not that intrusive when you consider how intrusive spam itself is. - Josiah From tim.one@comcast.net Wed Oct 2 16:34:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 11:34:12 -0400 Subject: [Spambayes] Central limit In-Reply-To: <3D9AC8D8.5020308@hooft.net> Message-ID: A quickie: [Rob W.W. Hooft] > ... > (Hm, I have some problems with the definition and use of the error function. > Do I need to do erfc(sqrt(p))? I think I can clear this one up: erf() is often documented incorrectly. For hysterical raisins, erf(x) computes the area under the unit Gaussian from -x*sqrt(2) to x*sqrt(2). So if you want the area under the unit Gaussian from -x to x, you need to do erf(x/sqrt(2)). (erf integrates over exp(-t**2) for historical simplicity, while the unit Gaussian integrates over exp(-t**2/2); the difference is where the sqrt(2) comes from) From noreply@sourceforge.net Wed Oct 2 14:33:57 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Wed, 02 Oct 2002 06:33:57 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 09:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Richie Hindle (richiehindle) Date: 2002-10-02 13:33 Message: Logged In: YES user_id=85414 I'm no expert on how Mozilla filters work... 
can you add a filter that says "If a message contains an X-Hammie-Disposition header whose value starts with Yes then "? If so, you can use either hammie.py (as part of your unix mail delivery system) or pop3proxy.py (on either a server machine or your own client machine). Both of these add an X-Hammie-Disposition header, with which you can filter your messages. ---------------------------------------------------------------------- Comment By: Sinchi Pacharuraq (sinchi) Date: 2002-10-02 09:53 Message: Logged In: YES user_id=621182 I just want to have this anti-spam filter built in Mozilla message filters. For example, user might activate this filter to delete spam messages from inbox or to move it to special folder. ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 16:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From chk@pobox.com Wed Oct 2 16:49:21 2002 From: chk@pobox.com (Harald Koch) Date: Wed, 02 Oct 2002 11:49:21 -0400 Subject: [Spambayes] Re: Tokenising clues In-Reply-To: Your message of "Tue, 01 Oct 2002 15:54:32 -0400". References: Message-ID: <17064.1033573761@elisabeth.cfrq.net> > I only care which way the results are heading . As you've mentioned before, at this point you're tuning the tokenizer to *your* sample, which doesn't necessarily represent the global population of spam. I still strongly suspect that you're entering chaotic space at this point. 
> Organization: Massachussetts Institute of Technology > > is going to generate a big pile of ham clues, and if a spammer happens to > include that header too, it's going to be hard to overcome them. The first time, yes. Then that message gets moved into the spam corpus, the probabilities are recalculated, and those words are no longer discriminators. What's the problem with that? -- Harald Koch From richie@entrian.com Wed Oct 2 17:19:35 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 17:19:35 +0100 Subject: [Spambayes] Cunning use of quoted-printable Message-ID: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> [Sent privately to Tim by accident; now forwarding to the list] [Tim] > I *think* you meant it was a false negative, since you said it was in your > spam collection, and haven't argued that it's actually ham. Correct, sorry. [Tim] > If you can without revealing a confidence, it would be good if you could > share the fp. Short of that, are these fp that bother you? Would you be > upset if you lost them in real life? Here they are. The first is a request to unsubscribe from a mailing list - this one I certainly *would* be bothered about. I've censored the email address slightly in deference to its author - I've replaced every other character with 'x'.
'header:Received:5': 0.14; 'from:email addr:biglobe.ne.jp>': 0.16; 'from:email name:From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997 Received: from punt-2.mail.demon.net by mailstore for sr-list@sundog.demon.co.uk id 862608130:10:24450:1; Fri, 02 May 97 22:22:10 BST Received: from mailsv1.pcvan.or.jp ([192.47.117.193]) by punt-2.mail.demon.net id aa1024075; 2 May 97 22:21 BST Received: from mail-gw.biglobe.ne.jp (mailsv5.pcvan.or.jp [192.47.117.85]) by mailsv1.pcvan.or.jp (8.7.5+2.6Wbeta6/3.5W9-PCVAN01) with ESMTP id GAA11518 for ; Sat, 3 May 1997 06:21:40 +0900 (JST) Received: by mail-gw.biglobe.ne.jp (8.7.5+2.6Wbeta6/6.4J.6-BIGLOBE_GW) id GAA02729; Sat, 3 May 1997 06:21:15 +0900 (JST) Received: by biglobe.ne.jp id 1023702; Sat, 03 May 1997 06:21:22 +0900 Message-Id: <970503062118.23085B03.1023702@biglobe.ne.jp> Date: Sat, 03 May 1997 06:21:22 +0900 From: =?ISO-2022-JP?B?GyRCJV8layUtITwbKEI=?= To: sr-list@sundog.demon.co.uk Subject: =?ISO-2022-JP?B?IBskQiMxGyhC?= Content-Type: Text/Plain; charset=us-ascii MIME-Version: 1.0 unsubscribe [] end ------------------------------------------------------------------------ The second is a spam-looking mail from one of my ISPs, telling me that their web address has changed. I wouldn't care if I'd missed that. 
------------------------------------------------------------------------ >From Orange#18.3250.d5-BLEXlg11G9rR.1@socket.cyberdialogue.com Thu Sep 26 10:04:37 2002 Return-Path: Received: from punt-2.mail.demon.net by mailstore for entrian@sundog.demon.co.uk id 1033031768:20:16776:120; Thu, 26 Sep 2002 09:16:08 GMT Received: from westhost19.westhost.net ([216.71.84.92]) by punt-2.mail.demon.net id aa2017667; 26 Sep 2002 9:15 GMT Received: from accumx-2.cyberdialogue.com (accumx-2.cyberdialogue.com [209.123.95.101]) by westhost19.westhost.net (8.11.6/8.11.6) with SMTP id g8Q9Dxu31643 for ; Thu, 26 Sep 2002 04:13:59 -0500 Received: (qmail 31858 invoked from network); 26 Sep 2002 08:26:33 -0000 Received: from socket.fulcrumanalytics.com (HELO socket.cyberdialogue.com) (209.123.95.99) by 0 with SMTP; 26 Sep 2002 08:26:33 -0000 Message-ID: <7160165.1033031077537.JavaMail.root@socket.cyberdialogue.com> Date: Thu, 26 Sep 2002 05:04:37 -0400 (EDT) From: "orange@orange.co.uk" To: richie@entrian.com Subject: Orange Internet has moved Mime-Version: 1.0 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Mailer: Accucast (http://www.accucast.com) X-Mailer-Version: 2.7.2-1 X-Hammie-Disposition: Yes Orange Internet moving to orange.co.uk
Orange Internet moving to orange.co.uk
Hello Richard
Orange Internet has moved from its old home at orange.net to its brand new address at orange.co.uk. You can still organise your life exactly the way you have been, with the same Orange email address and log in, your diary, and free text messages - all available to you on Orange today. Orange today is our new look site, you'll find the link at the top right of Orange.co.uk.
get just the news you want
Your news service can now be personalised, so you can receive updates on the news that matters to you. Go to Orange today for more details.
tell me more
Do you want to keep up with all the latest news on Orange products and services? Simply click here to provide your contact details.
Click here to see the Orange privacy statement
If you don't want to receive marketing information from us by email, please click here to unsubscribe
orange™
------------------------------------------------------------------------ -- Richie Hindle richie@entrian.com From nas@python.ca Wed Oct 2 18:45:10 2002 From: nas@python.ca (Neil Schemenauer) Date: Wed, 2 Oct 2002 10:45:10 -0700 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> References: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> Message-ID: <20021002174510.GA32247@glacier.arctrix.com> Richie Hindle wrote: > The bit I understand least here is this: > > 'header:Message-Id:1': 0.64 It means there is one Message-Id header. Neil From tim.one@comcast.net Wed Oct 2 18:45:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 13:45:12 -0400 Subject: [Spambayes] Re: Tokenising clues In-Reply-To: <17064.1033573761@elisabeth.cfrq.net> Message-ID: [Harald Koch] > As you've mentioned before, at this point you're tuning the tokenizer to > *your* sample, which doesn't necessarily represent the global population > of spam. I still strongly suspect that you're entering chaotic space at > this point. I do very little tokenizer tuning anymore, for this very reason. Nearly all changes I've made recently were supported by tests conducted publicly on this list, across multiple corpora. When 10 of 10 runs across each of 3 distinct testers all say "yup, it worked the same way here too", chaos isn't a likely explanation . >> Organization: Massachussetts Institute of Technology >> >> is going to generate a big pile of ham clues, and if a spammer >> happens to include that header too, it's going to be hard to overcome them. > The first time, yes. Then that message gets moved into the spam corpus, > the probabilities are recalculated, and those words are no longer > discriminators. Sorry, it takes time for the algorithm to learn, and if there are many ham containing this header line now then it will take almost that many training samples of spam containing the same thing before the spamprobs decrease to neutrality.
> What's the problem with that? I advised Neil to read the comments in tokenizer.py about previous experiments with character n-grams, and can only repeat that advice to you. When a single phrase can generate a large number of clues, a single unlucky (or lucky) phrase can determine the entire outcome. *Mixing* character n-grams in one part of the tokenizer with word-based tokenization elsewhere effectively gives more weight to the parts tokenized via characters, as the latter generate more clues per character of input text. This suggests it introduces biases, and we've never seen a case yet where a bias was actually helpful. Testing is the final judge, but I have reasonable cause to suspect it will backfire, and have seen it backfire in previous attempts. From tim.one@comcast.net Wed Oct 2 19:03:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 14:03:02 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <20021002174510.GA32247@glacier.arctrix.com> Message-ID: [Richie Hindle] >> The bit I understand least here is this: >> >> 'header:Message-Id:1': 0.64 [Neil Schemenauer] > It means there is one Message-Id header. More, that "Message-Id" was exactly how it was spelled: this count is case-sensitive. It's interesting that this Camel-Case spelling is a mild spam indicator for Richie. In Neil's reply, the message id was spelled "Message-id". Staring at your database entries will turn up some interesting things! For example, in my corpus MiME-Version has killer-strong spamprob.
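A rough sketch of the header-count tokens under discussion (illustrative only -- the real tokenizer.py differs in details, and the function name here is made up):

```python
from collections import Counter

def header_count_tokens(header_names):
    """Generate case-sensitive 'header:<Name>:<count>' tokens from the
    raw header names of a message, e.g. ['Received', 'Received',
    'Message-Id'].  Spelling is preserved exactly, so 'Message-Id' and
    'Message-ID' produce different tokens -- which is how an oddball
    spelling like 'MiME-Version' can become a strong clue by itself."""
    counts = Counter(header_names)
    return ['header:%s:%d' % (name, n) for name, n in sorted(counts.items())]

print(header_count_tokens(['Received'] * 5 + ['Message-Id']))
# ['header:Message-Id:1', 'header:Received:5']
```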
From anthony@interlink.com.au Wed Oct 2 19:19:10 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 03 Oct 2002 04:19:10 +1000 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> Message-ID: <200210021819.g92IJA711989@localhost.localdomain> >>> Richie Hindle wrote > 'header:Received:5': 0.14; > 'from:email addr:biglobe.ne.jp>': 0.16; 'from:email name: 'from:skip:= 30': 0.16; 'message-id:@biglobe.ne.jp': 0.16; > 'subject:2022': 0.16; 'subject:IBskQiMxGyhC': 0.16; 'charset:us-ascii': 0.26 ; > 'content-type:text/plain': 0.35; 'subject:ISO': 0.35; > 'header:Message-Id:1': 0.64; 'x-mailer:none': 0.68; 'subject:=?': 0.70; > 'subject:?=': 0.72; 'unsubscribe': 0.93 It looks like it's tokenizing the encoded version of the subject here... ? -- Anthony Baxter It's never too late to have a happy childhood. From tim.one@comcast.net Wed Oct 2 20:23:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 15:23:07 -0400 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <20021002081804.E1B7.JCARLSON@uci.edu> Message-ID: [Josiah Carlson] > ... > And subject line modification is not that intrusive when you consider how > intrusive spam itself is. My employer fiddled our system to prepend a tilde (~) to the Subject of suspected spam. I never even noticed this until it was pointed out to me! Which was months after we started doing it. Then again, there's not much that doesn't escape me . 
From tim.one@comcast.net Wed Oct 2 20:51:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 15:51:26 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <200210021819.g92IJA711989@localhost.localdomain> Message-ID: [Richie Hindle, after untangling the compressed clue listing so it's readable]

>> 'header:Received:5': 0.14;
>> 'from:email addr:biglobe.ne.jp>': 0.16;
>> 'from:email name:
> 'from:skip:= 30': 0.16;
>> 'message-id:@biglobe.ne.jp': 0.16;
>> 'subject:2022': 0.16;
>> 'subject:IBskQiMxGyhC': 0.16;
>> 'charset:us-ascii': 0.26;
>> 'content-type:text/plain': 0.35;
>> 'subject:ISO': 0.35;
>> 'header:Message-Id:1': 0.64;
>> 'x-mailer:none': 0.68;
>> 'subject:=?': 0.70;
>> 'subject:?=': 0.72;
>> 'unsubscribe': 0.93

[Anthony Baxter] > It looks like it's tokenizing the encoded version of the subject here... ? That's right, both the Subject and From headers: From: =?ISO-2022-JP?B?GyRCJV8layUtITwbKEI=?= Subject: =?ISO-2022-JP?B?IBskQiMxGyhC?= The only things we ever decode are text/* quoted-printable and text/* base64 MIME sections. This isn't by design either way -- AFAIK, it's simply that nobody has *tried* to decode anything else, and so it's unknown how doing so would affect results. (There's one false negative in my corpus that would go away if we decoded uuencoded sections, btw.) OTOH, we don't have a clue about how to tokenize Asian languages anyway, so I'm not sure that anyone here knows *how* to "decode" this in a way that might help. Richie, what do you have spam_cutoff set to? I thought your first message implied it was set to 0.56. The thing that strikes me hardest about this false positive is that it's got a lot more ham clues than spam clues listed. You chopped off the overall score (or the test driver you're using doesn't display it), but looks to me like it should be about 0.4. That's an awfully low value for spam_cutoff!
If spam_cutoff wasn't that low, this should not have *been* a false positive (.4 is too low for the system to consider it spam unless spam_cutoff is less than .4). Or didn't you send the full list of clues? That this was a false positive simply doesn't make sense based on what you've told us. From richie@entrian.com Wed Oct 2 21:26:39 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 21:26:39 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <3D9AE7BD.5090607@startechgroup.co.uk> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> <3D9AE7BD.5090607@startechgroup.co.uk> Message-ID: <58kmpuknmmknkhpep044pc1nak4i7u7s96@4ax.com> [Matt] > Lotus Notes still can't filter on arbitrary headers. [Richie] > Grr. Do you know what it *can* filter on? [Matt] > We've been thinking about this at work. We *think* it might be able to > look at the Precedence headers, so you could potentially set them to > "junk" and have it work. That would be good - a more portable version of the X-Hammie-Disposition header. If you confirm whether Notes can do this, please let us know! [Josiah] > subject line modification is not that intrusive when you consider how > intrusive spam itself is. An excellent point! 8-) -- Richie Hindle richie@entrian.com From richie@entrian.com Wed Oct 2 22:34:48 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 22:34:48 +0100 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: References: <200210021819.g92IJA711989@localhost.localdomain> Message-ID: [Tim] > Richie, what do you have spam_cutoff set to? I thought your first message > implied it was set to 0.56. It is, yes. [Tim] > this should not have *been* a false positive You're right. 
Where 'richie.pickle' is my full ~4000-message database:

>>> import cPickle, pprint, tokenizer, classifier
>>> from Options import options
>>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
>>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
>>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
>>> print options.spam_cutoff, score
0.56 0.402748505794
>>> pprint.pprint( clues )
[('header:Received:5', 0.13592289441927),
 ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),
 ('from:email name:>>

But running in the test environment, which uses the same 4000 messages (subject to a couple of hundred extras being shuffled around by rebal.py), I get this:

> python timcv.py -n10 --ham=200 --spam=200 -s1
[snip]
-> 1 new false positives
new fp: ['Data/Ham/Set4/1641']
******************************************************************************
Data/Ham/Set4/1641
prob = 0.581295852793
prob('header:Received:5') = 0.141997
prob('charset:us-ascii') = 0.26578
prob('content-type:text/plain') = 0.346687
prob('header:Message-Id:1') = 0.648679
prob('x-mailer:none') = 0.674625
prob('subject:=?') = 0.775229
prob('subject:?=') = 0.908163
prob('unsubscribe') = 0.928485
>From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997
[snip]

What's going on?? Far fewer clues in the test environment (and my other false positive prints 67 of them, so it's not a display issue). I have a bayescustomize.ini like this:

[TestDriver]
best_cutoff_fp_weight = 10
nbuckets = 100

which I guess shouldn't have any effect on this at all.
-- Richie Hindle richie@entrian.com From skip@pobox.com Thu Oct 3 01:13:58 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 2 Oct 2002 19:13:58 -0500 Subject: [Spambayes] Integration w/ mail clients Message-ID: <15771.35782.414547.651734@localhost.localdomain> There's a tracker item asking for "Mozilla Mail integration": https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 In my own fiddling around trying to create ham & spam collections (*) as correctly as possible using Emacs' VM mail reader, I've noticed that the simple act of dumping a message into either a ham or spam file after deciding its category is a bit tedious. I think this is where most of my mistaken hams-as-spam or spams-as-ham come from. Currently, all messages so disposed also wind up deleted, so I have to undelete them if I don't want them lost from the regular mail stream. Occasionally, I forget to do this. I should break down and write a little ELisp to do a bit better job of the task. For any other VM users out there, it seems to me that "l h" and "l s" would be decent keybindings for whatever commands I develop to save ham and spam. I think it would be worthwhile understanding what tasks can and/or should be integrated into various mail clients. Here's what I see as required:

* training mode - the user makes the ham/spam distinction (my "l h"/"l s" example for VM)
* run mode - the mail client calls out to SB to get a reading on the message - this may not be necessary in many unixoid environments since other tools upstream from the MUA may run the classifier
* override - the user corrects a mistake by the classifier - presumably it should be able to incrementally subtract the incorrect classification info from the score database and add the correct info

What MUA functionality do other people think is necessary? Skip (*) Can we settle on "collection" instead of "corpus" to avoid the weird plural?
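Skip's "override" task amounts to subtracting the message's word counts from the wrong category and adding them to the right one. A toy sketch of that bookkeeping (invented names -- this is not the actual spambayes API):

```python
class ToyScoreDB:
    """Toy word-count database illustrating train/untrain for the
    'override' case: correcting a misclassification means reversing
    the earlier (wrong) training and redoing it in the other category."""

    def __init__(self):
        self.ham, self.spam = {}, {}
        self.nham = self.nspam = 0

    def learn(self, words, is_spam):
        table = self.spam if is_spam else self.ham
        for w in words:
            table[w] = table.get(w, 0) + 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, words, was_spam):
        # Exactly reverses a previous learn() with the same arguments.
        table = self.spam if was_spam else self.ham
        for w in words:
            table[w] -= 1
            if not table[w]:
                del table[w]
        if was_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

# Override: a message was first filed as ham, then corrected to spam.
db = ToyScoreDB()
msg = ['cheap', 'meds', 'unsubscribe']
db.learn(msg, is_spam=False)
db.unlearn(msg, was_spam=False)
db.learn(msg, is_spam=True)
```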
From jcarlson@uci.edu Thu Oct 3 01:28:28 2002 From: jcarlson@uci.edu (Josiah Carlson) Date: Wed, 02 Oct 2002 17:28:28 -0700 Subject: [Spambayes] Integration w/ mail clients In-Reply-To: <15771.35782.414547.651734@localhost.localdomain> References: <15771.35782.414547.651734@localhost.localdomain> Message-ID: <20021002171806.E1BD.JCARLSON@uci.edu> > What MUA functionality do other people think is necessary? If the mail client stores email in any decent format (mbox, '\n.\n' delimited, etc.), I can't imagine it would be a big deal to just have the classifier check at regular intervals whether or not the file has changed, and if so, re-index. It wouldn't be difficult to add support for multiple recursive folders for people who don't have 15 folders in their email root, but have subdirectories (I've done it in other projects), and multiple databases (to make re-indexing easier). It also wouldn't be a big deal to require that the user keep a 'spam' folder, named in lowercase, or even '__spam__', so that the software can easily determine which is the bad stuff. Of course assuming that the good stuff is every other email anywhere else in your email archives. In terms of actual integration, I don't believe more than the above is required, especially if the proxy knows how to do multiple mail servers (check out pasp or popfile on how we do it). If the above were implemented for a few major email clients, a virtually drop-in spam filter is possible. Mozilla uses mbox, I don't know what Eudora or pegasus use, most linux mail clients use mbox. Outlook is a whore, but you can import your outlook mail into mozilla, then it becomes mbox. Of course there is the borrowing of the outlook->mbox code from the mozilla project that could happen, if only for outlook people.
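A minimal sketch of Josiah's "check at regular intervals whether the file has changed" idea, assuming Python's stdlib mailbox module; the callback name and the single-step structure are invented for illustration:

```python
import mailbox
import os
import tempfile

def maybe_reindex(path, last_mtime, reindex):
    """One polling step: if the mbox file's mtime has changed since
    last_mtime, re-read it and hand every message to `reindex` (a
    caller-supplied callback).  Returns the current mtime, to be
    passed back in on the next poll (e.g. from a timer loop)."""
    mtime = os.stat(path).st_mtime
    if mtime != last_mtime:
        reindex(list(mailbox.mbox(path)))
    return mtime

# Demo against a throwaway one-message mbox file.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.mbox')
with open(tmp, 'w') as f:
    f.write("From demo@example.com Thu Oct  3 00:00:00 2002\n"
            "Subject: hello\n\nbody text\n")
seen = []
mtime = maybe_reindex(tmp, None, seen.append)   # first poll: re-indexes
mtime = maybe_reindex(tmp, mtime, seen.append)  # unchanged: no re-index
```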
- Josiah From tim.one@comcast.net Thu Oct 3 04:30:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 23:30:16 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: Message-ID: [Richie Hindle, continuing to unravel the mystery of the now-it-is, now-it-ain't false positive] > You're right. Where 'richie.pickle' is my full ~4000-message database: Ah! If that's really been trained on all your msgs, then in particular it's been trained on the very message you're predicting against. The test drivers are careful never to do that (unless two msgs happen to have identical content, in which case that's fine -- if that's what real life looks like, it's not cheating to exploit it).

> >>> import cPickle, pprint, tokenizer, classifier
> >>> from Options import options
> >>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
> >>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
> >>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
> >>> print options.spam_cutoff, score
> 0.56 0.402748505794
> >>> pprint.pprint( clues )
> [('header:Received:5', 0.13592289441927),
> ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),

Let's pause here and ponder. Earlier you said you believed this was the only msg with ISO encodings in the Subject/From lines. Suppose that's true. Then you've trained on exactly one message (this one) producing (among others) the "word"

    'from:email addr:biglobe.ne.jp>'

The estimated *from counting* probability that a message containing this word is spam is then exactly 0.0 (you've seen it once, and only in ham). Then Gary's Bayesian probability adjustment is applied, to account for how much evidence you've got in favor of "the true" spamprob being 0.0:

    s*x + n*p
    ---------
       s+n

The default prior-belief strength (s) is 0.45, the default unknown-word prob (x) is 0.5, the counting probability estimate (p) is 0 (as above), and the total evidence (n -- the number of messages containing this word) is 1.
So the adjusted spamprob is

    0.45*0.5 + 1*0   0.225
    -------------- = ----- = 0.15517241379310345
        0.45+1        1.45

And that's exactly the prob shown on the line above, so we can be pretty certain that your database was in fact trained on this msg.

> ('from:email name: ('from:skip:= 30', 0.15517241379310345),
> ('message-id:@biglobe.ne.jp', 0.15517241379310345),
> ('subject:2022', 0.15517241379310345),
> ('subject:IBskQiMxGyhC', 0.15517241379310345),
> ('charset:us-ascii', 0.26241865802854009),
> ('content-type:text/plain', 0.34572203385342953),
> ('subject:ISO', 0.35151428063116696),
> ('header:Message-Id:1', 0.64496476638361089),
> ('x-mailer:none', 0.67584084707587),
> ('subject:=?', 0.69778644753001717),
> ('subject:?=', 0.7215916912471283),
> ('unsubscribe', 0.93148161126231199)]
> >>>
>
> But running in the test environment, which uses the same 4000 messages
> (subject to a couple of hundred extras being shuffled around by
> rebal.py), I get this:
>
> > python timcv.py -n10 --ham=200 --spam=200 -s1

As at the start, timcv never predicts against a message that the classifier has been trained on. It would be a very much weaker test if it ever did so, and the example we're discussing here shows why. In the test environment, then, *all* the words unique to this message have never been seen in the msgs the classifier was trained on, and so they all get the "unknown word" spamprob, 0.5. Then they're ignored completely, because the default robinson_minimum_prob_strength is 0.1, which ignores all words with spamprob in 0.4 thru 0.6.
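[For readers following along, the arithmetic above is easy to check in a few lines of Python -- a standalone sketch; the function name and parameter defaults are taken from the description above, not from the spambayes source:]

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Gary Robinson's adjustment of a counting-based spam probability.

    p -- probability estimated by counting (spam occurrences / total)
    n -- number of training messages containing the word
    s -- strength of the prior belief in the unknown-word prob
    x -- assumed spamprob for a word never seen before
    """
    return (s * x + n * p) / (s + n)

# The word from the message above: seen once, and only in ham (p=0, n=1).
print(adjusted_spamprob(0.0, 1))  # 0.15517241379310345

# With no evidence at all (n=0), the unknown-word prob comes back unchanged.
print(adjusted_spamprob(0.5, 0))  # 0.5
```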
> [snip] > -> 1 new false positives > new fp: ['Data/Ham/Set4/1641'] > ****************************************************************** > ************ > Data/Ham/Set4/1641 > prob = 0.581295852793 > prob('header:Received:5') = 0.141997 > prob('charset:us-ascii') = 0.26578 > prob('content-type:text/plain') = 0.346687 > prob('header:Message-Id:1') = 0.648679 > prob('x-mailer:none') = 0.674625 > prob('subject:=?') = 0.775229 > prob('subject:?=') = 0.908163 > prob('unsubscribe') = 0.928485 Without those other clues, the best judgment it can make is that it's spam. This is also why the system needs to be trained over time! It can only know what it's been taught. Very brief subscribe/unsubscribe msgs have been a problem in my data too, but I expect more so: such msgs don't belong on c.l.py at all, and they're really quite rare there. That prevents subscribe/unsubscribe from getting milder spamprobs no matter how much c.l.py data I train them on. But if you get a non-trivial number of these, the system will act differently for your data, over time. > From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997 > [snip] > > What's going on?? Far fewer clues in the test environment Right -- but what that really shows is that the test environment isn't cheating, so that's a Good Thing. > (and my other false positive prints 67 of them, so it's not a > display issue). > > I have a bayescustomize.ini like this: > > [TestDriver] > best_cutoff_fp_weight = 10 > nbuckets = 100 > > which I guess shouldn't have any effect on this at all. Right again, none at all -- they merely affect the histogram display. The only [TestDriver] option that can affect results is spam_cutoff, and even that has no effect on scores. 
From tim.one@comcast.net Thu Oct 3 04:35:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 02 Oct 2002 23:35:01 -0400
Subject: [Spambayes] Integration w/ mail clients
In-Reply-To: <15771.35782.414547.651734@localhost.localdomain>
Message-ID:

[Skip Montanaro, raises important issues which I'm ignoring because I'm spread too thin -- but if anyone wants a challenge, try getting msgs out of Outlook 2000 natively without 8 weeks of Mark Hammond's help ]
> ...
> (*) Can we settle on "collection" instead of "corpus" to avoid the
> weird plural?

If we agree to call the plural of collection collecora, sure!

From jbublitz@nwinternet.com Thu Oct 3 06:28:30 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Wed, 02 Oct 2002 22:28:30 -0700 (PDT)
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
Message-ID:

Tim Peters wrote:
> An easy example is Asian spam, where the lack of whitespace
> ends up generating oodles of skip tokens (and '8bit%' tokens),
> but there must be a more effective way to generate useful tokens
> for that without bloating the database beyond reason. So I hope
> that skip-generation will eventually become worthless.

I'm not sure this'll help much, but: I'm playing around with Graham and have just started looking at Spambayes. I have something more than 6500 Asian language spams (about 4 mos. worth, over half my spam), and what I use to tokenize is:

    re.compile(r"[\w'$_-]+", re.U)

which gives tokens (from Asian languages) in the 1 to 10 character length range (mostly toward the low end of that, similar to a distribution of English words). I imagine you could apply something like this when 8 bit data is detected. OTOH, in running some very preliminary tests with hammie.py out of the box, Spambayes catches all of the Asian language spam but gets right the ham msgs which contain a small portion of Asian chars (same with Graham), so your handling of 8 bit data seems to work pretty well.
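[Jim's regex can be tried on its own; the sample text below is an invented illustration, not from his corpus:]

```python
import re

# Jim's Unicode-aware tokenizer: runs of word chars plus ' $ _ -
token_re = re.compile(r"[\w'$_-]+", re.U)

# \w matches CJK and kana characters too, so 8-bit/Asian text yields
# real tokens (broken at punctuation and whitespace) rather than one
# long undifferentiated run feeding the skip-token generator.
sample = u"Free\u30aa\u30d5\u30a1\u30fc! visit now"  # "Free" + katakana + '!'
tokens = token_re.findall(sample)
print(tokens)  # three tokens: the Free+katakana run, 'visit', 'now'
```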
Tokenizing as above certainly adds to the database size, but nowhere near as much as the equivalent number of English language messages probably would. I haven't really quantified it, but I'd guess it adds less than 10% -- perhaps a lot less. I haven't seen any strings of unbounded length, and at the moment I'm not trimming any tokens from the above regex. lower() also works with Asian characters (it doesn't raise an exception, anyway), but I get better results staying case sensitive.

Jim

From msergeant@startechgroup.co.uk Thu Oct 3 11:20:49 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 03 Oct 2002 11:20:49 +0100
Subject: [Spambayes] Matt Sergeant: Introduction
References: <3D9AADC1.9030207@startechgroup.co.uk> <15770.62097.97078.342522@12-248-11-90.client.attbi.com>
Message-ID: <3D9C1A01.1010208@startechgroup.co.uk>

Skip Montanaro wrote:
> Matt> We have a live feed from one of our towers....
>
> then later:
>
> Matt> It's just on one particular email tower, ...
>
> What's an "email tower"?

Sorry -- internal lingo for a rack. Well, not quite a rack; sometimes multiple racks. Basically a group of email servers. We use multiple towers throughout the world for redundancy and proximity reasons.

Matt.

From papaDoc@videotron.ca Thu Oct 3 14:48:46 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 03 Oct 2002 09:48:46 -0400
Subject: [Spambayes] Result of a test
Message-ID: <3D9C4ABE.8070407@videotron.ca>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment

Hi,

The attachment is the result of a run on my ham and spam. They are coming from 3 different email addresses. The email can be in English or French. Most of the fp are emails from companies (Palm and APC) whose mailing lists I subscribed to. (Even if I don't see those emails I won't miss them, because usually I don't read them.) Others are subscription verifications, and some spam (what I consider spam) are emails forwarded to me by my boss.
I am using all the default values. Most of the false negatives are spam in French! Since my ratio of French/English is really low, and the ratio of French spam/French ham is very low, I did not play with the Python code yet, since I'm new to Python.

Looking at the prob of each word, I saw something:

prob('battery"') = 0.844828
prob('battery,') = 0.844828

prob('powernews,') = 0.77651
prob('powernews.') = 0.77651

prob('outlet,') = 0.844828
prob('outlet.') = 0.844828

prob('luncheon') = 0.844828
prob('luncheon:') = 0.844828
prob('luncheons') = 0.844828

I think it could be interesting to try to remove the punctuation (the . , ? !) at the end of a word and then count it as the same word, and to do the same thing with the plural (luncheon and luncheons) based on a dictionary like the one in ispell.

papaDoc

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: run1.zip
Type: application/x-zip-compressed
Size: 41935 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021003/f71c7995/run1.bin
---------------------- multipart/mixed attachment--

From richie@entrian.com Thu Oct 3 17:44:00 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 03 Oct 2002 17:44:00 +0100
Subject: [Spambayes] Cunning use of quoted-printable
In-Reply-To:
References:
Message-ID:

[Tim]
> you've trained on exactly one message (this one) producing (among
> others) "word"
>
> 'from:email addr:biglobe.ne.jp>'
>
> The estimated *from counting* probability that a message containing this
> word is spam is then exactly 0.0 (you've seen it once, and only in ham).
>
> Then Gary's Bayesian probability adjustment is applied [...]

I did briefly think that this might be due to this message having unique words, but I thought that the non-zero scores for those words meant they must have appeared in a mix of ham and spam.
I confess I've let the mathematical discussions slip past me, so I wasn't expecting words unique to the ham corpus to have non-zero probabilities. I should have looked more carefully at the words ('from:email name: Message-ID: [Richie Hindle] > No need! It's amusing, though . > ... > Many thanks for the explanation, and sorry to have wasted your time. Let's be clear about that: it wasn't a waste of time at all. Tracking down the details to the bloody end verified that some key parts of the system are working as intended, which raises confidence; and made an opportunity to write a little tutorial on what's going on behind the scenes, which should be helpful to those who followed it. If I had to do the same thing every day, then it would become a drag, but I thought this one a very good use of time. From gward@python.net Thu Oct 3 20:07:48 2002 From: gward@python.net (Greg Ward) Date: Thu, 3 Oct 2002 15:07:48 -0400 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: References: <20021002081804.E1B7.JCARLSON@uci.edu> Message-ID: <20021003190748.GA29525@cthulhu.gerg.ca> On 02 October 2002, Tim Peters said: > My employer fiddled our system to prepend a tilde (~) to the Subject of > suspected spam. I never even noticed this until it was pointed out to me! > Which was months after we started doing it. Then again, there's not much > that doesn't escape me . They're not the only ones. I noticed this was so prevalent in spam to mail.python.org that I 1) wondered why spammers would make life so easy for filters by making their spam obvious and 2) added this SA rule: header SUBJECT_TILDE Subject =~ /^\~/ describe SUBJECT_TILDE Subject starts with a tilde (~) score SUBJECT_TILDE 2.5 I guess it's not the spammers adding tildes. Whatever, it helps! Greg -- Greg Ward http://www.gerg.ca/ Money is truthful. If a man speaks of his honor, make him pay cash. 
From seant@webreply.com Thu Oct 3 21:59:54 2002 From: seant@webreply.com (Sean True) Date: Thu, 3 Oct 2002 16:59:54 -0400 Subject: [Spambayes] Microsoft Outlook 'support' Message-ID: I've written a couple of scripts which use Mark H's win32com package to do the following: 1) Dump arbitrary mail folders in Outlook 2000 to Data/Spam/reservoir and/or Data/Ham/Reservoir 2) Train a classifier directly from Outlook 2000 folders 3) Move messages from folder to folder based on a thresholded classifier score These scripts are quite raw, and do require Outlook (as opposed to Outlook Express). If there is general interest, I'd be glad to share. The question is, where? -- Sean ------- Sean True WebReply.Com, Inc. From seant@webreply.com Thu Oct 3 22:04:40 2002 From: seant@webreply.com (Sean True) Date: Thu, 3 Oct 2002 17:04:40 -0400 Subject: [Spambayes] Bad at math. Message-ID: Is it plausible to use the classifier as a multi-class classifier by using multiple independent classifiers and 'somehow' taking the best score? Anybody want to comment on the 'somehow'? I strongly suspect that there is a better way to do this, but the results of the multiple independent classifiers appear to match the hand crafted regexp recognizer that I wrote some time ago. And are easier to maintain, if someone will keep the training set up to date. I should have paid more attention in my last job . -- Sean ------- Sean True WebReply.Com, Inc. 
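[Sean's 'somehow' admits at least one naive reading -- nothing tested on this list, just a sketch: train one binary classifier per message category and let the most confident one win. The stub classifiers below are hypothetical keyword counters standing in for real trained classifiers:]

```python
def classify(text, classifiers):
    """Score text with every binary classifier; return the winner.

    classifiers maps a category name to a callable returning a score
    (for a spambayes-style setup, each callable would be a classifier
    trained on that category vs everything else).
    """
    scores = {name: clf(text) for name, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy stand-ins so the sketch runs -- in practice each would be a
# separately trained classifier, not a keyword counter.
classifiers = {
    "itinerary": lambda t: t.lower().count("tour") + t.lower().count("dates"),
    "order":     lambda t: t.lower().count("invoice") + t.lower().count("ship"),
}

print(classify("Nancy Fly Artist's Tour Dates", classifiers))  # ('itinerary', 2)
```

[The obvious weakness, which Tim's reply below gets at: independently trained binary scores aren't calibrated against each other, so "best score" is only meaningful if the per-category scores happen to be comparable.]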
From gward@python.net Thu Oct 3 22:13:51 2002
From: gward@python.net (Greg Ward)
Date: Thu, 3 Oct 2002 17:13:51 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <3D9C4ABE.8070407@videotron.ca>
References: <3D9C4ABE.8070407@videotron.ca>
Message-ID: <20021003211351.GC29525@cthulhu.gerg.ca>

On 03 October 2002, papaDoc said:
> Looking at the prob of each word I saw something
>
> prob('battery"') = 0.844828
> prob('battery,') = 0.844828
>
> prob('powernews,') = 0.77651
> prob('powernews.') = 0.77651
>
> prob('outlet,') = 0.844828
> prob('outlet.') = 0.844828
>
> prob('luncheon') = 0.844828
> prob('luncheon:') = 0.844828
> prob('luncheons') = 0.844828
>
> I think it can be interesting to try to remove the punctuation (the . ,
> ? !) at the end of a word
> and then count it as the same word and do the same thing with the
> plural (luncheon and luncheons) based
> on a dictionary like the one in ispell.

Tim played with this very early in the project. It turned out that keeping punctuation, preserving case, and not stemming were all wins. A bit counter-intuitive, but there you go. Experiment beats intuition every time in this project.

Greg
--
Greg Ward http://www.gerg.ca/
All right, you degenerates! I want this place evacuated in 20 seconds!

From whisper@oz.net Thu Oct 3 22:19:32 2002
From: whisper@oz.net (David LeBlanc)
Date: Thu, 3 Oct 2002 14:19:32 -0700
Subject: [Spambayes] Microsoft Outlook 'support'
In-Reply-To:
Message-ID:

I'd dearly love to have your scripts for munging Outlook 2000 mail folders!! This seems like a good thing to add to a sub-project of spambayes!
David LeBlanc Seattle, WA USA > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Sean True > Sent: Thursday, October 03, 2002 14:00 > To: spambayes@python.org > Subject: [Spambayes] Microsoft Outlook 'support' > > > I've written a couple of scripts which use Mark H's win32com package to do > the following: > > 1) Dump arbitrary mail folders in Outlook 2000 to > Data/Spam/reservoir and/or > Data/Ham/Reservoir > 2) Train a classifier directly from Outlook 2000 folders > 3) Move messages from folder to folder based on a thresholded classifier > score > > These scripts are quite raw, and do require Outlook (as opposed to Outlook > Express). If there is general interest, > I'd be glad to share. The question is, where? > > -- Sean > ------- > Sean True > WebReply.Com, Inc. > > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman-21/listinfo/spambayes From tim.one@comcast.net Thu Oct 3 22:28:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 03 Oct 2002 17:28:25 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Message-ID: [Sean True] > I've written a couple of scripts which use Mark H's win32com package > to do the following: > > 1) Dump arbitrary mail folders in Outlook 2000 to > Data/Spam/reservoir and/or > Data/Ham/Reservoir > 2) Train a classifier directly from Outlook 2000 folders > 3) Move messages from folder to folder based on a thresholded classifier > score > > These scripts are quite raw, and do require Outlook (as opposed > to Outlook Express). If there is general interest, I'd be glad to share. > The question is, where? If you're willing to let the PSF (Python Software Foundation) hold copyright, I'd be delighted to add these to the project, probably in a new Outlook2000 subdirectory. And if you've got a SourceForge account, I'd be delighted to add you as a developer to the project. 
general-interest-be-damned-*i*-use-outlook-2k-ly y'rs - tim From tim.one@comcast.net Thu Oct 3 22:40:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 03 Oct 2002 17:40:18 -0400 Subject: [Spambayes] Bad at math. In-Reply-To: Message-ID: [Sean True] > Is it plausible to use the classifier as a multi-class classifier by > using multiple independent classifiers and 'somehow' taking the best > score? It hasn't been tried here. The math in the schemes here use p(w) and 1-p(w) in various ways, where p(w) is a guess about the probability that a msg is spam given that it contains word w, and so doesn't generalize in a screamingly obvious way to N-way decisions. There's also the demonstrated error rates on traditional N-way classifiers (like ifile), which are decent but much worse than we're getting on this binary decision problem. > Anybody want to comment on the 'somehow'? Undoubtedly . > I strongly suspect that there is a better way to do this, but the > results of the multiple independent classifiers appear to match the hand > crafted regexp recognizer that I wrote some time ago. Is that good or bad? Can you quantify? If you have N classifiers, in which order do you run them? How are error rates affected if you vary the order in which you run them? Why does the porridge bird lay its eggs in the air? > And are easier to maintain, if someone will keep the training set > up to date. > > I should have paid more attention in my last job . Indeed, I'd especially love to have Larry Gillick's help here. I learned a lot about how to test from your last job, although don't tell my old boss as he would have thought that a waste of time . From whisper@oz.net Thu Oct 3 23:09:06 2002 From: whisper@oz.net (David LeBlanc) Date: Thu, 3 Oct 2002 15:09:06 -0700 Subject: [Spambayes] Bad at math. 
In-Reply-To:
Message-ID:

> [Sean True]
> > Is it plausible to use the classifier as a multi-class classifier by
> > using multiple independent classifiers and 'somehow' taking the best
> > score?
>
> It hasn't been tried here. The math in the schemes here use p(w) and 1-p(w)
> in various ways, where p(w) is a guess about the probability that a msg is
> spam given that it contains word w, and so doesn't generalize in a
> screamingly obvious way to N-way decisions. There's also the demonstrated
> error rates on traditional N-way classifiers (like ifile), which are decent
> but much worse than we're getting on this binary decision problem.
>
> > Anybody want to comment on the 'somehow'?
>
> Undoubtedly .
>
> > I strongly suspect that there is a better way to do this, but the
> > results of the multiple independent classifiers appear to match the hand
> > crafted regexp recognizer that I wrote some time ago.
>
> Is that good or bad? Can you quantify? If you have N classifiers, in which
> order do you run them? How are error rates affected if you vary the order
> in which you run them? Why does the porridge bird lay its eggs in the air?
>
> > And are easier to maintain, if someone will keep the training set
> > up to date.
> >
> > I should have paid more attention in my last job .
>
> Indeed, I'd especially love to have Larry Gillick's help here. I learned a
> lot about how to test from your last job, although don't tell my old boss as
> he would have thought that a waste of time .

From the literature search I've done, the best n-way classifier is based on Support Vector Machines. It's significantly better than naive Bayes. (As Tim points out, the Graham-Peters binary classifier isn't Bayesian at all.)
Dave LeBlanc
Seattle, WA USA

From papaDoc@videotron.ca Fri Oct 4 04:16:09 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Thu, 03 Oct 2002 23:16:09 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <20021003211351.GC29525@cthulhu.gerg.ca>
References: <3D9C4ABE.8070407@videotron.ca> <20021003211351.GC29525@cthulhu.gerg.ca>
Message-ID: <1033701369.4141.17.camel@localhost.localdomain>

Hi,

> > I think it can be interesting to try to remove the punctuation (the . ,
> > ? !) at the end of a word
> > and then count it as the same word and do the same thing with the
> > plural (luncheon and luncheons) based
> > on a dictionary like the one in ispell.
>
> Tim played with this very early in the project. Turned out that keeping
> punctuation, preserving case, and not stemming, were all wins. A bit
> counter-intuitive, but there you go. Experiment beats intuition every
> time in this project.

I read the comments in the file tokenizer.py and saw that it was already tried. Sorry... So I tried something else ;-) Since spam wants to catch your attention, it uses ? and ! very often. So I removed only the ',' and '.' and ':'.

This is the patch:

    # Tokenize everything in the body.
    for w in text.split():
        n = len(w)
        # Make sure this range matches in tokenize_word().
        if 3 <= n <= 12:
            if w[-1] == ',' or w[-1] == '.' or w[-1] == ':':
                w = w[:-1]
            yield w
        elif n >= 3:
            for t in tokenize_word(w):
                yield t

Please don't flame me -- this is my first modification of Python code; I'm more a C and C++ guy....
This is the result:

run1s -> run2s
-> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> tested 218 hams & 268 spams against 948 hams & 1124 spams
-> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> tested 218 hams & 268 spams against 948 hams & 1124 spams

false positive percentages
    0.889  0.444  won   -50.06%
    0.826  1.240  lost  +50.12%
    1.594  1.594  tied
    1.304  1.304  tied
    0.000  0.000  tied

won   1 times
tied  3 times
lost  1 times

total unique fp went from 11 to 11 tied
mean fp % went from 0.922661698796 to 0.916417438007 won -0.68%

false negative percentages
    0.717  0.717  tied
    0.727  0.364  won   -49.93%
    1.342  1.678  lost  +25.04%
    0.000  0.368  lost  +(was 0)
    0.746  0.373  won   -50.00%

won   2 times
tied  1 times
lost  2 times

total unique fn went from 10 to 10 tied
mean fn % went from 0.706533828263 to 0.699823195589 won -0.95%

ham mean                          ham sdev
   24.18   24.58   +1.65%            9.24    8.93   -3.35%
   25.70   26.23   +2.06%            8.47    8.21   -3.07%
   25.51   25.87   +1.41%            9.12    8.90   -2.41%
   25.01   25.34   +1.32%            8.08    8.07   -0.12%
   24.93   25.36   +1.72%            8.27    8.17   -1.21%

ham mean and sdev for all runs
   25.08   25.49   +1.63%            8.67    8.49   -2.08%

spam mean                         spam sdev
   80.43   79.91   -0.65%            8.79    8.78   -0.11%
   79.72   79.38   -0.43%            8.30    8.12   -2.17%
   79.67   79.25   -0.53%            8.83    8.69   -1.59%
   80.09   79.73   -0.45%            8.15    8.17   +0.25%
   79.84   79.48   -0.45%            9.35    9.07   -2.99%

spam mean and sdev for all runs
   79.95   79.55   -0.50%            8.70    8.58   -1.38%

ham/spam mean difference: 54.87 54.06 -0.81

papaDoc

From tim.one@comcast.net Fri Oct 4 06:02:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 04 Oct 2002 01:02:31 -0400
Subject: [Spambayes] Result of a test
In-Reply-To:
<1033701369.4141.17.camel@localhost.localdomain> Message-ID: [Remi Ricard] >>> I think it can be interesting to try to remove the >>> punctuation (the . , > > > ? !) at the end of a word >>> and then count it as the same word and do the same thing with the >>> plural (luncheon and luncheons) based >>> on a dictionary like the one in ispell. [Greg Ward] >> Tim played with this very early in the project. Turned out that >> keeping punctuation, preserving case, and not stemming, were all >> wins. A bit counter-intuitive, but there you go. Experiment beats >> intuition every time in this project. Rather than saying that keeping punctuation won, I'd say instead that simple split-on-whitespace beat searching for alphanumeric runs (where "alphanumeric runs" sometimes included things like '$' and '_' and '-' too). I'd say that because that's what I tested . This should be revisited, since boosting max_discriminators in particular may change the conclusion. About preserving case, it was my *intuition* (and, indeed, a strong intuition) that preserving case would be a clear win. Testing didn't back me on that, though (see comments in tokenizer.py), and so we stopped preserving case -- the data said it bloated the database without significant benefit. This too should be revisited. However, experiments also showed that preserving case in Subject lines was a winner, so we do preserve case there. See tokenizer.py comments too for another scheme that, at the time, significantly cut the f-n rate at the cost of major database bloat. About stemming/lemmatization, I didn't even try it, mostly because the code to do so is expensive and highly language-dependent. We've since heard Matt Sergeant's testimony that it hurt when he tried it, so I'm even less motivated to bother. Everyone should feel encouraged to try these things on their own! If you find something that wins for you, share it, and we can set up a cross-corpus test here. 
[Remi Ricard]
> I read the comments in the file tokenizer.py and saw that It was
> already tried. Sorry...

No problem. Read TESTING.txt too -- heresy can pay .

> So I tried something else ;-)
> Since spam want to catch your attention they use ? ! very often. So
> I remove only the ',' and '.' and ':'
>
> This is the patch:
>     # Tokenize everything in the body.
>     for w in text.split():
>         n = len(w)
>         # Make sure this range matches in tokenize_word().
>         if 3 <= n <= 12:
>             if w[-1] == ',' or w[-1] == '.' or w[-1] == ':':
>                 w = w[:-1]
>             yield w

I'd write that part:

    while w and w[-1] in ',.:':
        w = w[:-1]
        n -= 1
    if n >= 3:
        yield w

For whatever reason, putting "words" with fewer than 3 chars in the database has hurt results whenever I've tried it.

> ...
> Please don't flame me this is my first modification of python code
> I'm more a C and C++ guy....

That won't last .

> This is the result:
> run1s -> run2s
> -> tested 225 hams & 279 spams against 941 hams & 1113 spams
> -> tested 242 hams & 275 spams against 924 hams & 1117 spams
> -> tested 251 hams & 298 spams against 915 hams & 1094 spams
> -> tested 230 hams & 272 spams against 936 hams & 1120 spams
> -> tested 218 hams & 268 spams against 948 hams & 1124 spams
> -> tested 225 hams & 279 spams against 941 hams & 1113 spams
> -> tested 242 hams & 275 spams against 924 hams & 1117 spams
> -> tested 251 hams & 298 spams against 915 hams & 1094 spams
> -> tested 230 hams & 272 spams against 936 hams & 1120 spams
> -> tested 218 hams & 268 spams against 948 hams & 1124 spams

Do "rebal -h" for instructions on how to use an automagical "rebalancing" script -- rebal will even out the # of ham and spam across your directories. It's *OK* if they're unbalanced, it just complicates life a little.
> false positive percentages
>     0.889  0.444  won   -50.06%
>     0.826  1.240  lost  +50.12%
>     1.594  1.594  tied
>     1.304  1.304  tied
>     0.000  0.000  tied
>
> won   1 times
> tied  3 times
> lost  1 times

Seems to have had small random effects in both directions, with no overall tendency.

> total unique fp went from 11 to 11 tied
> mean fp % went from 0.922661698796 to 0.916417438007 won -0.68%
>
> false negative percentages
>     0.717  0.717  tied
>     0.727  0.364  won   -49.93%
>     1.342  1.678  lost  +25.04%
>     0.000  0.368  lost  +(was 0)
>     0.746  0.373  won   -50.00%
>
> won   2 times
> tied  1 times
> lost  2 times

Ditto. BTW, it looks like the default value of spam_cutoff is probably too high for your data.

> total unique fn went from 10 to 10 tied
> mean fn % went from 0.706533828263 to 0.699823195589 won -0.95%
>
> ham mean                          ham sdev
>    24.18   24.58   +1.65%            9.24    8.93   -3.35%
>    25.70   26.23   +2.06%            8.47    8.21   -3.07%
>    25.51   25.87   +1.41%            9.12    8.90   -2.41%
>    25.01   25.34   +1.32%            8.08    8.07   -0.12%
>    24.93   25.36   +1.72%            8.27    8.17   -1.21%
>
> ham mean and sdev for all runs
>    25.08   25.49   +1.63%            8.67    8.49   -2.08%
>
> spam mean                         spam sdev
>    80.43   79.91   -0.65%            8.79    8.78   -0.11%
>    79.72   79.38   -0.43%            8.30    8.12   -2.17%
>    79.67   79.25   -0.53%            8.83    8.69   -1.59%
>    80.09   79.73   -0.45%            8.15    8.17   +0.25%
>    79.84   79.48   -0.45%            9.35    9.07   -2.99%
>
> spam mean and sdev for all runs
>    79.95   79.55   -0.50%            8.70    8.58   -1.38%
>
> ham/spam mean difference: 54.87 54.06 -0.81

Those are mixed signs, but overall on the bad side: the average score of ham went up on every run, and the average score of spam went down on every run. That means they're closer together . The variance of both decreased, though, so while your populations grew closer together, they're each tighter than they were; alas, the decrease in variance wasn't enough relative to the decrease in spread (the difference between means) to reduce the likely overlap any. You're off to a great start, Remi! Keep at it.
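[Tim's suggested rewrite, pulled out into a runnable sketch -- the enclosing function, the tokenize_word stub, and the sample text are mine, not the spambayes tokenizer:]

```python
def tokenize_body(text, tokenize_word=lambda w: iter(())):
    # Simplified version of the loop under discussion.  tokenize_word is
    # a stub standing in for the helper that breaks up over-long "words".
    for w in text.split():
        n = len(w)
        # Tim's variant: strip any run of trailing ',' '.' ':' first,
        # then only keep "words" that are still at least 3 chars long.
        while w and w[-1] in ',.:':
            w = w[:-1]
            n -= 1
        if 3 <= n <= 12:
            yield w
        elif n > 12:
            for t in tokenize_word(w):
                yield t

# Note '!' is deliberately not stripped -- as Remi says, it's a spam signal.
print(list(tokenize_body("Visit our outlet, free luncheon: now!!")))
# ['Visit', 'our', 'outlet', 'free', 'luncheon', 'now!!']
```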
From tim.one@comcast.net Fri Oct 4 06:11:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 04 Oct 2002 01:11:58 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <20021003211351.GC29525@cthulhu.gerg.ca>
Message-ID:

>> prob('powernews,') = 0.77651
>> prob('powernews.') = 0.77651

BTW, it's impossible under Gary's probability adjustment (provided you stick to the default "unknown word prob" of 0.5) for a spamprob to move "to the other side" of 0.5 than the probability-by-counting estimate was (this wasn't true when we were using Paul's prob calculations: there it was possible for a word to be a ham indicator even if it appeared more often in spam(!)). So that tells me that these variants of "powernews" *did* appear more often in spam than in ham in the training data. But that's a very unlikely word, and it shows up routinely in all the "APC PowerNews" false positives papaDoc reported. This very strongly suggests that the spam in that collection is polluted with ham, and specifically that some APC PowerNews newsletters were incorrectly classified as spam in the training data. This would go a long way toward explaining why the "APC PowerNews" false positives got such extremely high scores (if the system was fed some and *told* they were spam, it believes you ).

From stephena@hiwaay.net Fri Oct 4 07:31:42 2002
From: stephena@hiwaay.net (Stephen Anderson)
Date: Fri, 4 Oct 2002 01:31:42 -0500 (CDT)
Subject: [Spambayes] splitndirs bug [need help]
Message-ID:

Hi,

I'm trying to use splitndirs to split collected spam mboxes into maildir format for testing. I've got exactly 3 hours of Python experience and I'm running into a wall.

Splitndirs is incorrectly capturing and munging the "From " line of the "next" message at the end of the preceding message. This oddity seems to be happening because of a malfunction of a .read(length) call. In mailbox.py on line 53, part of the _Subfile.read(length) function, there is a call to "self.fp.read(length)".
Now length is defined from self.stop - self.pos. Before the call, self.pos is 7058L and self.stop is 10987L. Consequently, length is 2788L. Now if I understand this right, we should read 2788 bytes. But after the call, self.pos is 11101L. This represents an overread of 114 bytes, which also happens to be the length of the "From " line that we were supposed to stop in front of. And said "From " line is part of the read data.

I traced it down to this point, but I can't seem to find the definition of the self.fp.read(length) function. I suspect it's OS specific, but I can't find it in the Python libraries. FYI, I am running WinXP Pro. Can somebody please help me out; I've hit my Python newbie limit.

Thanks!
Stephen Anderson

From richie@entrian.com Fri Oct 4 09:06:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 04 Oct 2002 09:06:25 +0100
Subject: [Spambayes] Cunning use of quoted-printable
In-Reply-To:
References:
Message-ID:

Tim,

> [...] made an opportunity to
> write a little tutorial on what's going on behind the scenes, which should
> be helpful to those who followed it.

Glad to help -- any time you want to write such a tutorial, and need a slow and dimwitted pupil to explain things to, I'm your man. 8-)

--
Richie Hindle richie@entrian.com

From richie@entrian.com Fri Oct 4 09:06:59 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 04 Oct 2002 09:06:59 +0100
Subject: [Spambayes] splitndirs bug [need help]
In-Reply-To:
References:
Message-ID:

Hi Stephen,

> Splitndirs is incorrectly capturing and munging the "From " line of the
> "next" message at the end of the preceding message. [...] This represents
> an overread of 114 bytes.

This is because mboxutils.py is opening the mailbox file in text mode, but the Python mailbox library uses tell() and seek() to navigate around the file, which is no good with text-mode files on Windows.
I've patched my mboxutils.py by changing the third-to-last line of mboxutils.py from: fp = open(name) to fp = open(name, "rb") and that seemed to fix it. I've been meaning to commit this, but I need to work out whether reading the '\r\n' line endings will break anything (Tim?) In the meantime, that should fix your problem. -- Richie Hindle richie@entrian.com From Alexander@Leidinger.net Fri Oct 4 09:17:19 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 10:17:19 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: References: Message-ID: <20021004101719.2730fdf5.Alexander@Leidinger.net> On Thu, 03 Oct 2002 17:28:25 -0400 Tim Peters wrote: > If you're willing to let the PSF (Python Software Foundation) hold > copyright, I'd be delighted to add these to the project, probably in a > new Outlook2000 subdirectory. I suggest to add a level of indirection here, please don't put the Outlook directory directly into the spambayes root. If someone else provides some sort of code for other MUAs we would end up with a lot of directories in the base. Bye, Alexander. -- ...and that is how we know the Earth to be banana-shaped. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From guido@python.org Fri Oct 4 13:31:03 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 08:31:03 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: Your message of "Fri, 04 Oct 2002 09:06:59 BST." References: Message-ID: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> > This is because mboxutils.py is opening the mailbox file in text mode, but > the Python mailbox library uses tell() and seek() to navigate around the > file, which is no good with text-mode files on Windows. > > I've patched my mboxutils.py by changing the third-to-last line of > mboxutils.py from: > > fp = open(name) > > to > > fp = open(name, "rb") Good catch! 
> and that seemed to fix it. I've been meaning to commit this, but I need > to work out whether reading the '\r\n' line endings will break anything > (Tim?) In the meantime, that should fix your problem. I think Tim is already opening his message files with 'rb' -- his code doesn't use mboxutils.py. So please go ahead! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Oct 4 13:32:06 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 08:32:06 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 10:17:19 +0200." <20021004101719.2730fdf5.Alexander@Leidinger.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> Message-ID: <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> > I suggest to add a level of indirection here, please don't put the > Outlook directory directly into the spambayes root. If someone else > provides some sort of code for other MUAs we would end up with a lot > of directories in the base. I disagree. There aren't going to be that many, and (as you may have noticed :-) I'm not fond of deep hierarchies -- they tend to obscure more than they help. --Guido van Rossum (home page: http://www.python.org/~guido/) From Alexander@Leidinger.net Fri Oct 4 13:56:14 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 14:56:14 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20021004145614.452ae254.Alexander@Leidinger.net> On Fri, 04 Oct 2002 08:32:06 -0400 Guido van Rossum wrote: > > I suggest to add a level of indirection here, please don't put the > > Outlook directory directly into the spambayes root. 
If someone else > > provides some sort of code for other MUAs we would end up with a lot > > of directories in the base. > > I disagree. There aren't going to be that many, and (as you may have > noticed :-) I'm not fond of deep hierarchies -- they tend to obscure > more than they help. I don't want something like MUA_Interfaces/Windows/Outlook2000, just one level of indirection, like ProgramInterfaces/Outlook2000 or something like this. Having several pages of ls output isn't very userfriendly. Grouping relevant pieces together and hiding things which aren't relevant in the actual context is userfriendly. Bye, Alexander. -- Weird enough for government work. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From guido@python.org Fri Oct 4 14:07:34 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 09:07:34 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 14:56:14 +0200." <20021004145614.452ae254.Alexander@Leidinger.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> <20021004145614.452ae254.Alexander@Leidinger.net> Message-ID: <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> > I don't want something like MUA_Interfaces/Windows/Outlook2000, > just one level of indirection, like ProgramInterfaces/Outlook2000 or > something like this. Then at least use "mua/Outlook2000". Let's please start a convention that directory names should be short lowercase words, and keep the ugly camelcase for class names. > Having several pages of ls output isn't very userfriendly. You already got that now. It has to become a *lot* worse before it's going to bother me. One extra subdir per supported MUA won't add that much (especially since most MUAs are irrelevant in practice :-). 
> Grouping relevant pieces together and hiding things > which aren't relevant in the actual context is userfriendly. One directory per MUA seems plenty of hiding to me. --Guido van Rossum (home page: http://www.python.org/~guido/) From Alexander@Leidinger.net Fri Oct 4 14:22:39 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 15:22:39 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> <20021004145614.452ae254.Alexander@Leidinger.net> <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20021004152239.4bd25185.Alexander@Leidinger.net> On Fri, 04 Oct 2002 09:07:34 -0400 Guido van Rossum wrote: > > I don't want something like MUA_Interfaces/Windows/Outlook2000, > > just one level of indirection, like ProgramInterfaces/Outlook2000 or > > something like this. > > Then at least use "mua/Outlook2000". Let's please start a convention > that directory names should be short lowercase words, and keep the > ugly camelcase for class names. I don't care how the directory is spelled. I'm fine with everything you come up with (someone may want to add files for MTAs, I like to have them in the same directory, but I don't care if you want to add a mta directory for them ;-) ). > > Having several pages of ls output isn't very userfriendly. > > You already got that now. It has to become a *lot* worse before it's Yes. That's the reason I wrote my initial mail on this topic. > going to bother me. One extra subdir per supported MUA won't add that > much (especially since most MUAs are irrelevant in practice :-). > > > Grouping relevant pieces together and hiding things > > which aren't relevant in the actual context is userfriendly. > > One directory per MUA seems plenty of hiding to me. 
Normally you are only interested in files for your MUA, aren't you? Bye, Alexander. -- Give a man a fish and you feed him for a day; teach him to use the Net and he won't bother you for weeks. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From tim@zope.com Fri Oct 4 17:56:02 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 12:56:02 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: Message-ID: [Richie Hindle] > ... > This is because mboxutils.py is opening the mailbox file in text mode, > but the Python mailbox library uses tell() and seek() to navigate > around the file, which is no good with text-mode files on Windows. Not quite. seek and tell are fine with text-mode Windows files, provided you stick to what C guarantees about them with text mode files: you can seek to a position previously returned by tell(), but that's essentially *all* that's defined. In particular, trying to do arithmetic on text-mode tell() results has no meaning, and Stephen found code doing > a call exists to "self.fp.read(length)". Now length is defined from > self.stop - self.pos. *That* makes no sense for text-mode files on Windows. (BTW, good detective work, Stephen!) > I've patched my mboxutils.py by changing the third-to-last line of > mboxutils.py from: > > fp = open(name) > > to > > fp = open(name, "rb") > > and that seemed to fix it. Yes! Please check that in. Besides the seek/tell business, opening a mail archive in text mode under Windows is likely to truncate the data prematurely, if the archive contains any 8-bit chars (the first instance of chr(26) is taken to mean EOF in Windows text mode). > I've been meaning to commit this, but I need to work out whether > reading the '\r\n' line endings will break anything (Tim?) I can't say for sure, but if it does I'll fix it.
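The chr(26) hazard Tim describes is easy to make concrete: Windows text mode treats a stray Ctrl-Z byte as end-of-file, while binary mode always returns every byte. A small sketch of the binary-mode guarantee (the truncation itself only manifests on Windows, so only the "rb" side is asserted here):

```python
import os, tempfile

# A message containing a raw Ctrl-Z byte (chr(26)) and \r\n line endings.
payload = b"Subject: test\r\n\r\nbefore\x1aafter\r\n"

fd, name = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)

with open(name, "rb") as fp:       # the fix: always open mail files "rb"
    data = fp.read()
os.remove(name)

assert data == payload             # every byte survives, \r\n untouched
assert b"\x1a" in data             # the Ctrl-Z byte is still there
```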
Offhand, the only pieces that *might* be vulnerable are regular expressions assuming plain \n line endings, but it's unlikely they would fall into a trap here. I normalized all line endings to plain \n in my data, BTW: before that, all my spam had \r\n, and all my ham plain \n, and when experimenting with character n-grams the mere fact of different line endings proved to be a killer strong clue! Bottom lines: all mail files should always be opened in binary mode, and spambayes code should never be sensitive to line endings. From tim@zope.com Fri Oct 4 18:01:17 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 13:01:17 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > I think Tim is already opening his message files with 'rb' Yes, all code I've written for the project already opens files in binary mode. > ... > So please go ahead! Ditto! From tim@zope.com Fri Oct 4 18:05:51 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 13:05:51 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <20021004101719.2730fdf5.Alexander@Leidinger.net> Message-ID: [Alexander Leidinger] > ... > I suggest to add a level of indirection here, please don't put the > Outlook directory directly into the spambayes root. If someone else > provides some sort of code for other MUAs we would end up with a lot > of directories in the base. [and Guido and Alexander go back & forth on this] Sorry, I'm unpersuaded. Most users have no idea what "MUA" or "MTA" mean, and I'm not going to hide what they're looking for under layers of geek-speak. Neither you nor they will be confused by a directory named Outlook2000; if 30 other such directories appear, then I'll think about rearranging stuff; I predict it ain't gonna happen. 
From tim.one@comcast.net Fri Oct 4 20:11:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 15:11:11 -0400 Subject: [Spambayes] For the bold Message-ID: I checked in enough stuff so that bold experimenters can play with the central-limit schemes, but not yet enough so that I (and Rob, bless his heart) can get a detailed picture of what's going on under the covers (patience, please). You CANNOT use a cross-validation test with these schemes. So don't use timcv or mboxtest. timtest is fine, or any other grid driver (are there any?). I believe I'll need to whip up a custom driver for deeper analysis to make progress. You CANNOT meaningfully compare error rates between a cross-validation driver and a grid driver. Don't even think about it. If you want to do comparisons with a central-limit scheme, use a grid driver for both. A sample .ini file:

"""
[Classifier]
use_central_limit2: True
max_discriminators: 50
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
nbuckets: 4
"""

Note that, for now, every message gets one of just 4 distinct scores when a central-limit scheme is in use:

0.00 -- certain it's ham
0.49 -- guesses ham but is unsure
0.51 -- guesses spam but is unsure
1.00 -- certain it's spam

That's the reason for setting nbuckets to 4: more than that won't do you a lick of good, as there are only 4 possible scores. spam_cutoff must also be exactly 0.50, and for the same reason; the "best cutoff" histogram analysis is still displayed, but is meaningless. Nothing is known about how max_discriminators affects this. Play! Nothing is known about how use_central_limit (as opposed to use_central_limit2) works with this. Play! When one of the central-limit schemes is in use, the list of (word, prob) clues returned by spamprob() now has two made-up entries at the start, in this order: ('*zham*', zham), ('*zspam*', zspam) These are the ham and spam zscores.
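The message doesn't spell out how the two zscores turn into the four scores, but one rule that is consistent with every worked example quoted in this thread is: guess the class whose zscore is nearer zero, and be "certain" only when the other zscore is at least zscore_ratio_cutoff times as extreme. A sketch of that guessed rule (an assumption, not the checked-in code):

```python
def central_limit_score(zham, zspam, zscore_ratio_cutoff=1.9):
    """Hypothetical mapping from the two zscores to the four scores.
    NOT the actual spambayes logic -- just one rule that reproduces
    the worked examples in this thread."""
    guess_spam = abs(zspam) < abs(zham)      # nearer-zero zscore wins
    near, far = sorted([abs(zham), abs(zspam)])
    certain = far >= zscore_ratio_cutoff * near
    if guess_spam:
        return 1.00 if certain else 0.51
    return 0.00 if certain else 0.49

# The Nigerian-scam false positive reported later in the thread
# (zham=-39.112, zspam=-7.06214) comes out "certain it's spam":
assert central_limit_score(-39.112, -7.06214) == 1.00
# The false-positive listing quoted next (zham=-65.9011, zspam=-53.3419)
# comes out "guesses spam but is unsure":
assert central_limit_score(-65.9011, -53.3419) == 0.51
```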
So, for example, a listing of a false positive now begins like so:

Data/Ham/Set2/143733.txt
prob = 0.51
prob('*zham*') = -65.9011
prob('*zspam*') = -53.3419
prob('header:Errors-To:1') = 0.0266272
prob('subject:: ') = 0.0266272
prob('python') = 0.0412844
...

Here's something remarkable. I just tried this, with the .ini file given above, like so: timtest.py -n5 --s=10 --h=10 -s123 In other words, this does 5**2-5 = 20 runs, training the classifier each time on *just* 10 random ham and 10 random spam, and then predicting against 10 disjoint random ham and 10 disjoint random spam. Here's the bottom line from this run (the "all runs" histograms at the end):

-> Ham scores for all runs: 200 items; mean 5.42; sdev 15.42
-> min 0; median 0; max 51
* = 3 items
  0.0 178 ************************************************************
 25.0  19 *******
 50.0   3 *
 75.0   0

-> Spam scores for all runs: 200 items; mean 93.13; sdev 17.03
-> min 49; median 100; max 100
* = 3 items
  0.0   0
 25.0   1 *
 50.0  27 *********
 75.0 172 **********************************************************

The 0.00 score ends up in the 0.0 bucket. The 0.49 score ends up in the 25.0 bucket. The 0.51 score ends up in the 50.0 bucket. The 1.00 score ends up in the 75.0 bucket. Even with such little data, this was never wrong when it was certain. For ham, it was wrong 3 of the 19+3=22 times it was unsure. For spam, it was wrong 1 of the 27+1=28 times it was unsure. What surprised me most there is-- given how little training was done --just how often it *was* "certain". This continues to suggest that these schemes have enormous potential, but we still don't know how to exploit it (although with my pragmatic hat on, I'd say we're already doing a not-too-shabby job of exploiting it ).
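The bucket placement Tim walks through is ordinary fixed-width histogram binning on a 0-100 scale; a sketch (assuming that standard binning rule, which reproduces all four placements):

```python
def bucket_label(score, nbuckets=4):
    """Lower bound of the histogram bucket a score in [0, 1] lands in,
    on a 0-100 scale with nbuckets fixed-width buckets."""
    index = min(int(score * nbuckets), nbuckets - 1)  # clamp score 1.0
    return index * (100.0 / nbuckets)

assert bucket_label(0.00) == 0.0    # certain ham
assert bucket_label(0.49) == 25.0   # guesses ham but is unsure
assert bucket_label(0.51) == 50.0   # guesses spam but is unsure
assert bucket_label(1.00) == 75.0   # certain spam
```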
From tim.one@comcast.net Fri Oct 4 20:43:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 15:43:26 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: BTW, that teensy test run I reported on uncovered a ham hiding in BruceG's spam -- it was one of the "false negatives" the central-limit scheme said it was unsure about, but *guessed* it was ham (note that both zscores are very large):

"""
Data/Spam/Set5/6510.txt
prob = 0.49
prob('*zham*') = -31.6082
prob('*zspam*') = -44.4025
prob('header:Organization:1') = 0.00738916
prob('wrote:') = 0.0110024
prob('header:User-Agent:1') = 0.0167286
prob('class') = 0.0412844
prob('files') = 0.0412844
prob('comes') = 0.0505618
prob('hi,') = 0.0652174
prob('might') = 0.12963
prob('subject:: ') = 0.135891
prob('contains:') = 0.155172
prob('files.') = 0.155172
prob('there.') = 0.155172
prob('inc.') = 0.155172
prob('subject:?') = 0.194323
prob('charset:us-ascii') = 0.244597
prob('line') = 0.263314
prob('content-type:text/plain') = 0.306763
prob('proto:http') = 0.681245
prob('skip:p 10') = 0.691388
prob('will') = 0.700267
prob('url:org') = 0.701342
prob('url:www') = 0.702475
prob('easily') = 0.724719
prob('been') = 0.740964
prob('your') = 0.752572
prob('addresses') = 0.775229
prob('subject:-') = 0.775229
prob('people') = 0.776817
prob('url:html') = 0.776817
prob('world') = 0.810078
prob('subject:000') = 0.844828
prob('subject:. ') = 0.844828
prob('ease') = 0.844828
prob('sent') = 0.85503
prob('bulk') = 0.908163
prob('subject:,') = 0.908163
prob('emails') = 0.908163
prob('low') = 0.908163
prob('our') = 0.918944
prob('regardless') = 0.934783
prob('received.') = 0.934783
prob('info') = 0.958716
prob('million') = 0.965116
prob('send') = 0.969799
prob('unsubscribe') = 0.969799
prob('header:Return-Path:1') = 0.971807
prob('header:Received:7') = 0.973373
prob('money') = 0.983271
prob('email') = 0.991159
prob('please') = 0.991803

Return-Path:
Delivered-To: lists-linux-kernel@bruce-guenter.dyndns.org
Received: (qmail 27880 invoked from network); 16 Apr 2002 17:23:30 -0000
Received: from vger.kernel.org (209.116.70.75) by bruce-guenter.dyndns.org (192.168.1.3) with ESMTP; 16 Apr 2002 17:23:30 -0000
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Tue, 16 Apr 2002 13:20:37 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Tue, 16 Apr 2002 13:20:36 -0400
Received: from moutvdomng1.kundenserver.de ([212.227.126.181]:56795 "EHLO moutvdomng1.kundenserver.de") by vger.kernel.org with ESMTP id ; Tue, 16 Apr 2002 13:20:35 -0400
Received: from [212.227.126.155] (helo=mrvdomng2.kundenserver.de) by moutvdomng1.kundenserver.de with esmtp (Exim 3.22 #2) id 16xWd5-0001Gw-00 for linux-kernel@vger.kernel.org; Tue, 16 Apr 2002 19:20:31 +0200
Received: from pd9e23b10.dip.t-dialin.net ([217.226.59.16] helo=ngforever.de) by mrvdomng2.kundenserver.de with esmtp (Exim 3.22 #2) id 16xWd4-0007sA-00 for linux-kernel@vger.kernel.org; Tue, 16 Apr 2002 19:20:31 +0200
Message-ID: <3CBC5D5D.7060909@ngforever.de>
Date: Tue, 16 Apr 2002 11:20:29 -0600
From: Thunder from the hill
Organization: The LuckyNet Administration
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:0.9.9+) Gecko/20020405
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: LKML
Subject: Re: 60 Million Emails inc. 600,000 Uk =?ISO-8859-1?Q?=A319=2E95?=
References: <20020416154606Z313666-22651+7853@vger.kernel.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 948

Hi,

Bulk Email Cd wrote:
> Bulk Email CD just #19.95 inc. p&p and contains:
>
> 60 Million World wide email addresses.
> 600,000 VALIDATED UK email addresses - Verified in March 2002, ensuring a low failure rate.
>
> The World-wide emails have been split and compressed into many files for ease of use. The UK lists comes in easily identifiable files.
>
> The CD comes with simple instuctions and will be sent by first class post as soon as your money has been received.
> [Snip]

People selling email addresses really make me sick. People degraded to wares, regardless of their personalities. We might even find Alan in there.

Regards,
Thunder
--
Thunder from the hill. Citizen of our universe.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
"""

If I have to keep the quote of the Nigerian-scam spam in my ham, this one has noooooo excuse for being called spam . Here's another one it was unsure about:

"""
Return-Path:
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 15516 invoked from network); 8 Aug 2002 22:20:13 -0000
Received: from mail.inet.pl (195.116.59.85) by churchill.factcomp.com with SMTP; 8 Aug 2002 22:20:13 -0000
Received: (qmail 26458 invoked by uid 33); 8 Aug 2002 22:26:04 -0000
Date: 8 Aug 2002 22:26:04 -0000
Message-ID: <20020808222604.26455.qmail@mail.inet.pl>
TO: bruceg@em.ca
From: jax@inet.pl
Subject: Wiadomość została dostarczona
Content-Length: 129

Twoja Wiadomość została dostarczona !
Zostanie jednak przeczytana 12 sierpnia.
Do tego czasu korzystam z wypoczynku.
"""

I have no idea -- do you? I really despise the presumption that non-English msgs are spam, BTW. From guido@python.org Fri Oct 4 20:59:12 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 15:59:12 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 13:05:51 EDT." References: Message-ID: <200210041959.g94JxCR29867@pcp02138704pcs.reston01.va.comcast.net> > Sorry, I'm unpersuaded. Most users have no idea what "MUA" or "MTA" mean, > and I'm not going to hide what they're looking for under layers of > geek-speak. Neither you nor they will be confused by a directory named > Outlook2000; if 30 other such directories appear, then I'll think about > rearranging stuff; I predict it ain't gonna happen. What I said. --Guido van Rossum (home page: http://www.python.org/~guido/) From richie@entrian.com Fri Oct 4 20:59:23 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 04 Oct 2002 20:59:23 +0100 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> References: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> Message-ID: > So please go ahead! Done. -- Richie Hindle richie@entrian.com From python-spambayes@discworld.dyndns.org Fri Oct 4 21:02:08 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Fri, 4 Oct 2002 14:02:08 -0600 Subject: [Spambayes] For the bold In-Reply-To: ; from tim.one@comcast.net on Fri, Oct 04, 2002 at 03:43:26PM -0400 References: Message-ID: <20021004140208.A3542@discworld.dyndns.org> Tim Peters wrote: > > Here's another one it was unsure about: > """ > Return-Path: > Delivered-To: em-ca-bruceg@em.ca Original envelope recipient address: bruceg@em.ca, his most widely-advertised address, and one that is trivial to harvest from webpages. Suspicious. > Twoja Wiadomość została dostarczona !
> Zostanie jednak przeczytana 12 sierpnia. > Do tego czasu korzystam z wypoczynku. > > I have no idea -- do you? I really despise the presumption that non-English > msgs are spam, BTW. In this case, it appears to be true. I can't read it, but identical messages (with identical envelope senders) went to the linux-kernel mailing list, the qmail mailing list, and a bunch of other people who have it listed in their spam blocklists and such. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From tim.one@comcast.net Fri Oct 4 21:30:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 16:30:51 -0400 Subject: [Spambayes] For the bold In-Reply-To: <20021004140208.A3542@discworld.dyndns.org> Message-ID: >> Twoja Wiadomość została dostarczona ! >> Zostanie jednak przeczytana 12 sierpnia. >> Do tego czasu korzystam z wypoczynku. > In this case, it appears to be true. I can't read it, but > identical messages (with identical envelope senders) went to the > linux-kernel mailing list, the qmail mailing list, and a bunch of > other people who have it listed in their spam blocklists and such. Luckily, I was able to find an excellent Polish -> Ukrainian dictionary online, and with my encyclopedic knowledge of Ukrainian can supply this fine translation: Your it's known < known > your ¶ & # 230; & # 179 zosta; but supplied! However, it will be read 12 august. I use with refreshment for this time. OTOH, it may be a technical service manual for a 1993 Subaru wagon, or perhaps a translation of Romeo and Juliet. I'll keep researching it in my spare time ...
From python-spambayes@discworld.dyndns.org Fri Oct 4 21:38:29 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Fri, 4 Oct 2002 14:38:29 -0600 Subject: [Spambayes] For the bold In-Reply-To: ; from tim.one@comcast.net on Fri, Oct 04, 2002 at 04:30:51PM -0400 References: <20021004140208.A3542@discworld.dyndns.org> Message-ID: <20021004143829.A5033@discworld.dyndns.org> Tim Peters wrote: > > Luckily, I was able to find an excellent Polish -> Ukrainian dictionary > online, and with my encyclopedic knowledge of Ukrainian can supply this fine > translation: > > Your it's known < known > your ¶ & # 230; & # 179 zosta; but supplied! > However, it will be read 12 august. > I use with refreshment for this time. > > OTOH, it may be a technical service manual for a 1993 Subaru wagon, or perhaps > a translation of Romeo and Juliet. I'll keep researching it in my spare > time ... Your translation sounds suspiciously like a Polish vacation message. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From popiel@wolfskeep.com Fri Oct 4 23:15:17 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 04 Oct 2002 15:15:17 -0700 Subject: [Spambayes] Effects of training set size Message-ID: <20021004221518.04E60F59A@cashew.wolfskeep.com> Executive summary: Increasing the training set size helps, but not as much as one might think. Specifically, the ham/spam means spread apart, but the error rates stay fairly constant. More data improves classification of ham, but it seems that a _very_ small sample of spam (200 messages) is enough to represent it. I'm running with everything at defaults, which means I'm using the Robinson classifier, spam_cutoff of 0.560, x = 0.5, s = 0.45, et cetera, et cetera, ad nauseam.
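Those x and s defaults are the parameters of Gary Robinson's prior adjustment, which blends a word's raw counting estimate toward the unknown-word probability x according to how often the word has been seen. A sketch of the formula, including the can't-cross-0.5 property Tim relied on in the "powernews" post:

```python
def robinson_adjust(p, n, s=0.45, x=0.5):
    """Gary Robinson's prior adjustment: blend the raw counting
    estimate p toward the unknown-word prior x, weighted by the
    number of times (n) the word was actually seen."""
    return (s * x + n * p) / (s + n)

# A rarely seen spammy word is pulled toward 0.5 but stays above it:
assert 0.5 < robinson_adjust(0.9, n=1) < 0.9
# ...and a rarely seen hammy word stays below it:
assert 0.1 < robinson_adjust(0.1, n=1) < 0.5
# With x = 0.5 the result is a convex combination of 0.5 and p, so it
# can never land on the other side of 0.5 from the counting estimate:
for n in range(10):
    for p in (0.01, 0.3, 0.7, 0.99):
        assert (robinson_adjust(p, n) - 0.5) * (p - 0.5) >= 0
```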
I have about 3000 spam and nearly 2000 ham, representing everything from my own personal mail feed since 22 Aug 2002 (when I stopped throwing away a significant portion of my ham). I should have a full 2000 ham in another day or two, at which point I'll probably redo my data directories. I did cross-validation (via timcv.py) using --ham-keep and --spam-keep at each of 50, 70, 90, 110, 130, 150, 170, and 190. This means that I used training corpus sizes of 200, 280, 360, 440, 520, 600, 680, and 760 hams and spams, testing against the smaller numbers of messages. I used the following adaptation of runtest.sh:

"""
#! /bin/sh -x
##
## runsizes.sh -- run some tests for Tim
##
## This does everything you need to test yer data. You may want to skip
## the rebal steps if you've recently moved some of your messages
## (because they were in the wrong corpus) or you may suffer my fate and
## get stuck forever re-categorizing email.
##
## Just set up your messages as detailed in README.txt; put them all in
## the reservoir directories, and this script will take care of the
## rest. Paste the output (also in results.txt) to the mailing list for
## good karma.
##
## Neale Pickett
##

if [ "$1" = "-r" ]; then
    REBAL=1
    shift
fi

# Number of messages per rebalanced set
RNUM=190

# Number of sets
SETS=5

# Seed for random number generator
SEED=13666

if [ -n "$REBAL" ]; then
    # Put them all into reservoirs
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
    # Rebalance
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
fi

for keep in 50 70 90 110 130 150 170 190; do
    python2.2 timcv.py -n $SETS --ham-keep $keep --spam-keep $keep -s $SEED > run$keep.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=`echo $k1 20 + p | dc`
    python2.2 rates.py run$k1 run$k2 > runrates$k1.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results$k1.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=190
    python2.2 rates.py run$k1 run$k2 > runrates${k1}-190.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results${k1}-190.txt
done
"""

I then hand-munged the results output to reveal:

keep:              50    70    90   110   130   150   170   190
fp %:           (meaningless, only 1 or 2 fp in any run)
fn %:            3.20  4.57  4.00  4.36  4.15  3.20  3.53  4.53
h mean:         25.28 24.38 22.19 21.35 21.21 20.91 20.37 19.50
h sdev:          7.45  7.56  6.86  6.89  7.05  6.92  6.87  6.81
s mean:         74.21 74.54 73.65 73.92 74.63 74.99 74.81 74.52
s sdev:          8.56  9.10  8.84  9.13  8.98  8.76  8.62  8.99
mean difference: 48.93 50.16 51.46 52.57 53.42 54.08 54.44 55.02

I'm not sure if the fn % are significant, and they're jumping enough for me to suspect they're not. No obvious trend there, anyway. The ham mean drifted down steadily with more data, and the spam mean held fairly constant with a very slight upward drift. Ham sdev seems to get slowly tighter, with spam sdev jiggling in no particularly obvious direction. Finally, the difference in means steadily increased, echoing the downward drift of the ham mean.
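The mean-difference row is just s mean minus h mean, column by column; recomputing it from the table confirms both the row itself and the observation that the growing spread comes almost entirely from the falling ham mean:

```python
# Values copied from the results table above.
h_mean = [25.28, 24.38, 22.19, 21.35, 21.21, 20.91, 20.37, 19.50]
s_mean = [74.21, 74.54, 73.65, 73.92, 74.63, 74.99, 74.81, 74.52]

diff = [round(s - h, 2) for s, h in zip(s_mean, h_mean)]
assert diff == [48.93, 50.16, 51.46, 52.57, 53.42, 54.08, 54.44, 55.02]

# The ham mean moved about 5.8 points across the runs, the spam mean
# only about 0.3, so the ham drift dominates the spread:
assert abs(h_mean[0] - h_mean[-1]) > 10 * abs(s_mean[0] - s_mean[-1])
```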
All of the reports are available at: http://www.wolfskeep.com/~popiel/spambayes/trainsize My next experiment: try this all again with --ham-keep constant and only --spam-keep variable. :-) - Alex From papaDoc@videotron.ca Sat Oct 5 02:26:25 2002 From: papaDoc@videotron.ca (Remi Ricard) Date: Fri, 04 Oct 2002 21:26:25 -0400 Subject: [Spambayes] New tokenization of the Subject line Message-ID: <1033781185.1125.7.camel@localhost.localdomain> Hi, I try something again. Since most of the mail from subscribed groups have in their subject [spambayes] or [freesco] i.e "[" and "]". I decided to keep this as a word so my words from a subject line like: Re: [Spambayes] Moving closer to Gary's ideal will be Re: [Spambayes] Moving closer to Gary's ideal And this is the result.

-> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> tested 200 hams & 268 spams against 800 hams & 1124 spams
-> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> tested 200 hams & 268 spams against 800 hams & 1124 spams

false positive percentages
1.000 0.500 won -50.00%
1.500 1.500 tied
2.000 2.500 lost +25.00%
1.000 1.000 tied
0.000 0.000 tied

won 1 times tied 3 times lost 1 times
total unique fp went from 11 to 11 tied
mean fp % went from 1.1 to 1.1 tied

false negative percentages
0.717 0.717 tied
0.727 0.727 tied
1.007 1.342 lost +33.27%
0.000 0.368 lost +(was 0)
0.746 0.373 won -50.00%

won 1 times tied 2 times lost 2 times
total unique fn went from 9 to 10 lost +11.11%
mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32%

ham mean ham sdev
24.51 25.20 +2.82% 9.45 9.09 -3.81%
26.14 27.20 +4.06% 8.62 8.32 -3.48%
26.04 26.94 +3.46% 10.00 9.68 -3.20%
25.15 25.85 +2.78% 8.05 7.93 -1.49%
25.12 26.11 +3.94% 8.28 8.16 -1.45%

ham mean and sdev for all runs
25.39 26.26 +3.43% 8.93 8.69 -2.69%

spam mean spam sdev
80.41 79.86 -0.68% 8.80 8.81 +0.11%
79.87 79.47 -0.50% 8.20 8.11 -1.10%
79.87 79.31 -0.70% 8.79 8.73 -0.68%
80.42 80.03 -0.48% 8.13 8.22 +1.11%
80.11 79.70 -0.51% 9.32 9.07 -2.68%

spam mean and sdev for all runs
80.13 79.66 -0.59% 8.66 8.60 -0.69%

ham/spam mean difference: 54.74 53.40 -1.34

I'm still having problems reading the results; can someone explain this a little bit? My statistics knowledge is coming from a course I took almost 15 years ago, and it was the only course I managed to fall asleep in..... even if I like math (I did a B.Sc. in physics). papaDoc From popiel@wolfskeep.com Sat Oct 5 02:30:50 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 04 Oct 2002 18:30:50 -0700 Subject: [Spambayes] New tokenization of the Subject line In-Reply-To: Message from Remi Ricard of "Fri, 04 Oct 2002 21:26:25 EDT." <1033781185.1125.7.camel@localhost.localdomain> References: <1033781185.1125.7.camel@localhost.localdomain> Message-ID: <20021005013050.B393CF59A@cashew.wolfskeep.com> In message: <1033781185.1125.7.camel@localhost.localdomain> Remi Ricard writes: > >I try something again. > >Since most of the mail from subscribed groups have in their >subject [spambayes] or [freesco] i.e "[" and "]". > >I decided to keep this as a word Unfortunately, this makes things worse overall. Good idea, but I think that it's not helping because mailing lists get spammed, too... so showing that something is on a mailing list really doesn't help (it just gives the spam that does show up on the list some apparent validity). >total unique fp went from 11 to 11 tied >mean fp % went from 1.1 to 1.1 tied This is neutral.
>total unique fn went from 9 to 10 lost +11.11% >mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32% This is a loss, though too small of one to be significant. (One message in either direction is too small to care about.) >ham mean and sdev for all runs > 25.39 26.26 +3.43% 8.93 8.69 -2.69% This shows the ham scores moving up, and getting tighter together. The first is bad, the second is good. >spam mean and sdev for all runs > 80.13 79.66 -0.59% 8.66 8.60 -0.69% This shows the spam scores moving down, and getting tighter. Again, first is bad, second is good. >ham/spam mean difference: 54.74 53.40 -1.34 This shows ham and spam getting closer together overall, and is bad. The reduction in the standard deviation is (I think) too small to overcome this... but I'm just eyeballing it; can someone with a bit of the theory help here? - Alex From tim.one@comcast.net Sat Oct 5 05:23:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 00:23:04 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: Two new programs have been checked in, to help with analyzing internals of the central-limit schemes: clgen.py A test driver. Its primary purpose is to generate a binary pickle, recording all relevant details of every prediction made: msg id, whether it's really ham or really spam, the raw ham-mean value, the ham zscore, the raw spam-mean value, the spam zscore, and the number of "extreme words" used in scoring (up to the maximum of max_discriminators). There's enough info here so that the exact results of changing anything about deciding whether a msg is ham or spam, or about deciding how confident we are, can be determined quickly (i.e., without needing to rerun the test). clpik.py A sample analysis program, showing how to load the pickles created by clgen, how to extract info from them, and how to generate histograms from the data. 
Note that the Histogram module was previously made more robust (numerically speaking) and more flexible, in anticipation of this. Apart from that, here are central_limit2 results from a larger training set than I've reported on before: trained on 2000 ham + 2000 spam, then predicted against 8000 of each (see earlier email for the .ini file I used here; max_discriminators is 50 here, and there are only 4 possible scores): -> Ham scores for all runs: 8000 items; mean 0.17; sdev 3.07 -> min 0; median 0; max 100 * = 131 items 0.0 7975 ************************************************************* 25.0 21 * unsure but right 50.0 2 * unsure but wrong 75.0 2 * sure but wrong -> Spam scores for all runs: 8000 items; mean 99.82; sdev 3.07 -> min 0; median 100; max 100 * = 131 items 0.0 1 * sure but wrong 25.0 3 * unsure but wrong 50.0 24 * unsure but right 75.0 7972 ************************************************************* So the results are even more intense with more training data: it's certain about almost everything, has minuscule error rates when it is certain, and has large error rates on the few msgs it's unsure about. The 2 "certain but wrong" false positives were, again, the Nigerian-scam quote: prob('*zham*') = -39.112 prob('*zspam*') = -7.06214 prob('*hmean*') = -3.35442 prob('*smean*') = -0.569831 prob('*n*') = 50 and the lady with the obnoxious employer-generated sig: prob('*zham*') = -21.6362 prob('*zspam*') = -9.10452 prob('*hmean*') = -1.98372 prob('*smean*') = -0.684297 prob('*n*') = 50 The 1 "certain but wrong" false negative's body consists of a uuencoded text file, which we throw away without decoding: prob('*zham*') = -5.97943 prob('*zspam*') = -12.4922 prob('*hmean*') = -1.45919 prob('*smean*') = -1.92436 prob('*n*') = 8 The histograms generated by clpik on this data are encouraging too (that's your cue, Rob ).
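Since each pickled record carries everything used in scoring, an alternative certainty rule can be tried without rerunning the test. Here is a minimal sketch, assuming a hypothetical flat record layout built from the fields Tim lists (msg id, true category, raw means, zscores, extreme-word count); the actual clgen.py pickle format may differ, and the margin value is illustrative:

```python
# Hypothetical record layout; the real clgen.py pickle may differ:
# record = (msg_id, is_spam, hmean, zham, smean, zspam, n)

def decide(record, margin=3.0):
    """Classify by relative badness only: call a msg spam when its
    spam-rule zscore is much less extreme than its ham-rule zscore,
    and vice versa; anything in between is 'unsure'."""
    msg_id, is_spam, hmean, zham, smean, zspam, n = record
    if abs(zspam) + margin < abs(zham):
        return "spam"
    if abs(zham) + margin < abs(zspam):
        return "ham"
    return "unsure"
```

Applied to the Nigerian-scam record above (zham = -39.112, zspam = -7.06214), this rule confidently calls the message spam, reproducing the "sure but wrong" false positive.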
From janzert@haskincentral.com Fri Oct 4 21:13:02 2002 From: janzert@haskincentral.com (Brian Haskin) Date: Fri, 04 Oct 2002 16:13:02 -0400 Subject: [Spambayes] Re: For the bold References: Message-ID: Tim Peters wrote: > Here's another one it was unsure about: > > """ > Return-Path: > Delivered-To: em-ca-bruceg@em.ca > Received: (qmail 15516 invoked from network); 8 Aug 2002 22:20:13 -0000 > Received: from mail.inet.pl (195.116.59.85) > by churchill.factcomp.com with SMTP; 8 Aug 2002 22:20:13 -0000 > Received: (qmail 26458 invoked by uid 33); 8 Aug 2002 22:26:04 -0000 > Date: 8 Aug 2002 22:26:04 -0000 > Message-ID: <20020808222604.26455.qmail@mail.inet.pl> > TO: bruceg@em.ca > From: jax@inet.pl From the toplevel domain we can guess polish and http://www.poltran.com/ supplies the following > Subject: Wiadomo¶æ zosta³a dostarczona It's known < known > zosta supplied > Content-Length: 129 > > Twoja Wiadomo¶æ zosta³a dostarczona ! It's known < known > your supplied zosta! > Zostanie jednak przeczytana 12 sierpnia. However, it will be read 12 august. > Do tego czasu korzystam z wypoczynku. I use with refreshment for this time. > "" > > I have no idea -- do you? I really despise the presumption that non-English > msgs are spam, BTW. Anyone have an idea what zosta is? or know someone that can actually read polish? Brian Haskin Janzert@haskincentral.com From tim.one@comcast.net Sat Oct 5 05:42:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 00:42:22 -0400 Subject: [Spambayes] Re: For the bold In-Reply-To: Message-ID: [Brian Haskin] > Anyone have an idea what zosta is? An online dictionary said "to have been". > or know someone that can actually read polish? Yes, but not well enough to bother with this trivia . 
From tim.one@comcast.net Sat Oct 5 08:18:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 03:18:16 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: There's one more "central limit" scheme on the table now: use_central_limit3. The spamprob() code is identical to use_central_limit2, but the ham and spam populations are computed differently. Under central_limit2, the spam population is computed like so: for each msg in the training spam: for each extreme word w in msg: if we haven't seen w before: add ln(prob(w)) to the spam population The ham population is computed similarly, except using ln(1-prob(w)) instead. Under central_limit3, the spam population is composed of whole-msg scores, not of individual word scores: for each msg in the training spam: compute the mean of ln(prob(w)) over the extreme words w in msg add that mean to the spam population And likewise for the ham population, using ln(1-prob(w)) instead. There's not even a ghost of an illusion that the central limit theorem applies to this variant, but the spamprob() code remains identical, happily ignoring that it's utterly unjustified . Still, brief preliminary tests suggest this *may* actually work better. Here's the bottom line for a run training against 5000 ham + 5000 spam, then predicting against 5000 of each: -> Ham scores for all runs: 5000 items; mean 0.09; sdev 2.31 -> min 0; median 0; max 100 * = 82 items 0 4992 ************************************************************* 25 7 * 50 0 75 1 * this was the Nigerian scam spam -> Spam scores for all runs: 5000 items; mean 99.68; sdev 4.07 -> min 0; median 100; max 100 * = 82 items 0 1 * this was the spam with a uuencoded body we ignore 25 6 * 50 24 * 75 4969 ************************************************************* The advantage-- if it's real --is that it's certain more often. 
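The contrast between the two population definitions above can be sketched directly from the pseudocode. This is a sketch only, assuming each training message is represented as a list of (word, prob) pairs for its extreme words; the checked-in code is organized differently:

```python
import math

def cl2_population(train_msgs, ham_rule):
    # central_limit2: one entry per unique extreme word seen
    # anywhere in the training msgs.
    seen, pop = set(), []
    for msg in train_msgs:          # msg = [(word, prob), ...]
        for word, p in msg:
            if word not in seen:
                seen.add(word)
                pop.append(math.log(1.0 - p) if ham_rule else math.log(p))
    return pop

def cl3_population(train_msgs, ham_rule):
    # central_limit3: one entry per *message* -- the mean of the
    # per-word logs over that msg's extreme words.
    pop = []
    for msg in train_msgs:
        logs = [math.log(1.0 - p) if ham_rule else math.log(p)
                for (word, p) in msg]
        pop.append(sum(logs) / len(logs))
    return pop
```

With ham_rule=False these build the spam populations (ln(prob(w))); with ham_rule=True, the ham populations (ln(1-prob(w))).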
The populations are sharply separated: ham ham mean: 5000 items; mean -0.35; sdev 0.20 -> min -3.55286; median -0.316515; max -0.00523756 spam ham mean: 5000 items; mean -3.87; sdev 0.92 -> min -6.03683; median -3.857; max -1.22996 That is, when we score a ham using the ham ln(1-prob) rule, the mean msg mean is -0.35 with a small sdev of 0.20. But when we score a spam using the ham ln(1-prob) rule, the mean msg mean is -3.87, with a larger sdev. Another pair of results says what happens when we score ham and spam using the spam ln(prob) rule: ham spam mean: 5000 items; mean -3.02; sdev 0.71 -> min -5.72426; median -2.91819; max -0.602309 spam spam mean: 5000 items; mean -0.11; sdev 0.14 -> min -2.23055; median -0.0546932; max -0.00268306 It's essentially impossible for a msg to score well under both measures, but it's easy for a msg to score poorly under both measures. The most appropriate rule again appears to be that it doesn't matter how poorly a msg scores, it only matters how much more poorly it scores under the other measure. From rob@hooft.net Sat Oct 5 09:34:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 10:34:12 +0200 Subject: [Spambayes] clt tests Message-ID: <3D9EA404.8090603@hooft.net> I focussed for our night on optimizing the max_discriminators for clt2 using 10x(200+200) messages out of my corpora, running rebal.py between runs to get the best possible idea of variations. It appears the fp and fn counts drop up to max_discriminators ~ 30, and after that they appear constant up to 400. An optimist might see a slight descent between 30 and 100. I will post a plot later. This morning I programmed a Debugger class with the same functionality as Tim's debugging pickle. Will start analysing those results sometime this weekend. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From paul-bayes@svensson.org Sat Oct 5 10:58:34 2002 From: paul-bayes@svensson.org (Paul Svensson) Date: Sat, 5 Oct 2002 05:58:34 -0400 (EDT) Subject: [Spambayes] Re: For the bold In-Reply-To: Message-ID: On Fri, 4 Oct 2002, Brian Haskin wrote: >> Subject: Wiadomo¶æ zosta³a dostarczona >It's known < known > zosta supplied >> Content-Length: 129 >> >> Twoja Wiadomo¶æ zosta³a dostarczona ! >It's known < known > your supplied zosta! >> Zostanie jednak przeczytana 12 sierpnia. >However, it will be read 12 august. >> Do tego czasu korzystam z wypoczynku. >I use with refreshment for this time. >> "" >> >> I have no idea -- do you? I really despise the presumption that non-English >> msgs are spam, BTW. > >Anyone have an idea what zosta is? or know someone that can actually >read polish? Machine translation has ways to go. A translator colleague at http://www.proz.com gave me: zosta'a = has been delivered "Your message has been delivered, however, it will be read on August 12. I enjoy my holiday until then." /Paul From rob@hooft.net Sat Oct 5 14:31:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 15:31:12 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9EE9A0.7010505@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Tim Peters wrote: > Nothing is known about how max_discrimators affects this. Play! See the attached plot anarun1.pdf where this is the horizontal axis. I'd think ~30 would be enough, but more doesn't seem to take much more time and doesn't hurt for me. The raw data are below; run1ref is a comparable non-clt result. 
Rob ==> run1ref.txt <== -> fp rate 1.43% fn rate 1.23% ==> run1_10.txt <== -> fp rate 1.9% fn rate 1.16% ==> run1_15.txt <== -> fp rate 1.22% fn rate 0.867% ==> run1_20.txt <== -> fp rate 1.33% fn rate 1.01% ==> run1_25.txt <== -> fp rate 0.922% fn rate 0.867% ==> run1_30.txt <== -> fp rate 1.07% fn rate 0.95% ==> run1_35.txt <== -> fp rate 1.02% fn rate 1.18% ==> run1_40.txt <== -> fp rate 1.23% fn rate 0.983% ==> run1_45.txt <== -> fp rate 0.939% fn rate 0.983% ==> run1_50.txt <== -> fp rate 0.65% fn rate 1.16% ==> run1_55.txt <== -> fp rate 0.672% fn rate 1.23% ==> run1_60.txt <== -> fp rate 0.983% fn rate 1.26% ==> run1_65.txt <== -> fp rate 0.756% fn rate 1.04% ==> run1_70.txt <== -> fp rate 0.739% fn rate 0.828% ==> run1_75.txt <== -> fp rate 0.917% fn rate 0.867% ==> run1_80.txt <== -> fp rate 0.717% fn rate 0.944% ==> run1_85.txt <== -> fp rate 0.828% fn rate 0.883% ==> run1_90.txt <== -> fp rate 0.972% fn rate 1.07% ==> run1_95.txt <== -> fp rate 1.04% fn rate 0.861% ==> run1_100.txt <== -> fp rate 0.822% fn rate 1.21% ==> run1_110.txt <== -> fp rate 0.989% fn rate 1.2% ==> run1_120.txt <== -> fp rate 0.556% fn rate 1.46% ==> run1_130.txt <== -> fp rate 0.85% fn rate 1.26% ==> run1_140.txt <== -> fp rate 0.794% fn rate 1.33% ==> run1_150.txt <== -> fp rate 0.606% fn rate 1.47% ==> run1_160.txt <== -> fp rate 0.733% fn rate 1.08% ==> run1_170.txt <== -> fp rate 0.628% fn rate 1.37% ==> run1_180.txt <== -> fp rate 0.517% fn rate 1.25% ==> run1_200.txt <== -> fp rate 0.589% fn rate 1.18% ==> run1_300.txt <== -> fp rate 0.872% fn rate 1.17% ==> run1_400.txt <== -> fp rate 0.822% fn rate 0.917% -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: anarun1.pdf Type: application/pdf Size: 6545 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/efb61190/anarun1.pdf ---------------------- multipart/mixed attachment-- From rob@hooft.net Sat Oct 5 15:22:22 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 16:22:22 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9EF59E.4040207@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Here are two zham/zspam scatter plots: one for my spam body, and one for my ham body. This was done using clt2. Tim's test basically says "certain" if the distance to the diagonal line is sufficiently large. You can see that that is a reasonable proposal. I'll do more analyses. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: hamscat.png Type: image/png Size: 27946 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/6da9b8a3/hamscat.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: spamscat.png Type: image/png Size: 22771 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/6da9b8a3/spamscat.png ---------------------- multipart/mixed attachment-- From rob@hooft.net Sat Oct 5 16:26:34 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 17:26:34 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9F04AA.8050706@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Another large message. Appended is a pdf containing six histograms made using max_discriminators=55 The first one is zham for all ham messages. As you can see, the distribution is asymmetric. 
Furthermore, a simple average and standard deviation calculation results in a bell curve that does not follow the important tail of the histogram: the chances will be severely underestimated by these parameters. The second one is abs(zham) for all ham messages. The bell curve fits this histogram much better! The third page is zspam for all spam messages. The fourth page is abs(zspam) for all spam messages. Also much better. Fifth and sixth are zspam for all ham and zham for all spam, just to complete the picture. From the second and fourth image, I drew the conclusion that my Z-scores are overestimated by a factor of 6.7/6.6. This means e.g. that the zspam for all ham distribution is not -53 +/- 20, but -8 +/- 3 and the zham for all spam distribution is not -43 +/- 18, but -6.4 +/- 2.6 I will try a discriminator based on this. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: all.pdf Type: application/pdf Size: 56510 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/df86955c/all.pdf ---------------------- multipart/mixed attachment-- From noreply@sourceforge.net Sat Oct 5 14:46:02 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:46:02 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 13:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. 
Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From noreply@sourceforge.net Sat Oct 5 14:52:48 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:52:48 -0700 Subject: [Spambayes] [ spambayes-Patches-618932 ] fpfn.py: add interactivity on unix Message-ID: Patches item #618932, was opened at 2002-10-05 13:52 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618932&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: fpfn.py: add interactivity on unix Initial Comment: * Add "-i" option to show all falses using "less", and ask the user what to do with them. I used this a lot to clean up my spam/ham corpuses. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618932&group_id=61702 From noreply@sourceforge.net Sat Oct 5 14:53:39 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:53:39 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 13:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. 
Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- >Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 13:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From tim.one@comcast.net Sat Oct 5 19:26:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 14:26:23 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Message-ID: [Sean True] > I've written a couple of scripts which use Mark H's win32com package > to do the following: [for Outlook 2000] > ... Those not on the spambayes-checkins mailing list probably missed that I checked Sean's files in to the project yesterday. Have at it! They're all in a top-level Outlook2000 directory; see its README.txt. From bkc@murkworks.com Sat Oct 5 19:32:36 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 14:32:36 -0400 Subject: [Spambayes] CL2 results Message-ID: <3D9EF7D4.23399.2790EC5D@localhost> (2nd time posting, first time was rejected for name of attachment quoted in this message ending with a .bat extension) looks like virus buster for this list isn't decoding the message, just scanning for content-type even if it's body text ? --- Wow, sure found a lot of ham in my spam.. Also, turns out I had a lot of zero length message files that came up as false negatives.. I've rm `find -empty` and rebal.. I'm doing 50-50 training testing. -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 
6500 hams & 6500 spams hammean -0.258919766598 hamvar 0.235232283813 spammean -0.238803626095 spamvar 0.189273495163 -> population hammean -0.258919766598 hamvar 0.235232283813 -> population spammean -0.238803626095 spamvar 0.189273495163 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 1.12307692308 -> false negative %: 0.369230769231 -> 73 new false positives A lot of the false positives are messages from e-trade, paypal, novell, ingram-micro, HP reseller, my mother ... Here's a false negative I think should be caught somehow.. though I don't know how.. (actually, I think this is klez.. I've saved those as spam too. I have 3 other msgs with the same subject, 138k, when I open them Sophos says it's klez) Data/Spam/Set5/11706 prob = 0.0 prob('*zham*') = -1.74447 prob('*zspam*') = -23.1814 prob('*hmean*') = -0.396172 prob('*smean*') = -1.87484 prob('*n*') = 38 prob('header:Received:1') = 0.00372208 prob('base64') = 0.0121951 prob('from:email addr:murkworks.com>') = 0.0261568 prob('skip:g 70') = 0.0266272 prob('from:email name:From webmaster@technofile.com Fri Jun 28 06:09:54 2002 Received: from Izhpfl ([68.98.236.100]) by mail.netmatrix.com with SMTP (IOA-IPAD 3.18a/96) id 1350300; Fri, 28 Jun 2002 06:09:54 -0600 From: bkc To: david900@channeli.net Subject: Japanese girl VS playboy MIME-Version: 1.0 Content-Type: multipart/alternative; boundary=S61861Nu4e17 Date: Fri, 28 Jun 2002 06:09:54 -0600 Message-Id: <200206281009.1350300@mail.netmatrix.com> --S61861Nu4e17 Content-Type: text/html; Content-Transfer-Encoding: quoted-printable --S61861Nu4e17 Content-Type: audio/x-midi; name=x.txt Content-Transfer-Encoding: base64 Content-ID: TVqQAAMAAAAEAAAA//8AALgAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAA2AAAAA4fug4AtAnNIbgBTM0hVGhpcyBwcm9ncmFtIGNhbm5vdCBiZSBydW4gaW4g RE9TIG1vZGUuDQ0KJAAAAAAAAAAYmX3gXPgTs1z4E7Nc+BOzJ+Qfs1j4E7Pf5B2zT/gTs7Tn 
GbNm+BOzPucAs1X4E7Nc+BKzJfgTs7TnGLNO+BOz5P4Vs134E7NSaWNoXPgTswAAAAAAAAAA UEUAAEwBBAC4jrc8AAAAAAAAAADgAA8BCwEGAADAAAAAkAgAAAAAAFiEAAAAEAAAANAAAAAA QAAAEAAAABAAAAQAAAAAAAAABAAAAAAAAAAAYAkAABAAAAAAAAACAAAAAAAQAAAQAAAAABAA ABAAAAAAAAAQAAAAAAAAAAAAAAAg1gAAZAAAAABQCQAQAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ANAAAOwBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAudGV4dAAAAEq6AAAAEAAAAMAAAAAQ AAAAAAAAAAAAAAAAAAAgAABgLnJkYXRhAAAiEAAAANAAAAAgAAAA0AAAAAAAAAAAAAAAAAAA QAAAQC5kYXRhAAAAbF4IAADwAAAAUAAAAPAAAAAAAAAAAAAAAAAAAEAAAMAucnNyYwAAABAA AAAAUAkAEAAAAABAAQAAAAAAAAAAAAAAAABAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 
****************************************************************************** Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sat Oct 5 20:12:00 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 15:12:00 -0400 Subject: [Spambayes] CL3 results vs. CL2 Message-ID: <3D9F0110.22796.27B4FEB2@localhost> Didn't change anything from the CL2 test. I haven't had a chance to examine the new false negatives. CL3 Results: /tmp/clgen-cl3-5x5 -> /tmp/clgen-cl3-5x5s.txt -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 6500 hams & 6500 spams -> population hammean -0.353152965041 hamvar 0.0190809424597 -> population spammean -0.230738145294 spamvar 0.0141618505748 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 0.8 -> false negative %: 0.569230769231 0.800 0.569 -> 52 new false positives -> 37 new false negatives -> Ham scores for all in this training set: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 -> Spam scores for all in this training set: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 -> best cutoff for all in this training set: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 -> best cutoff for all runs: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% total unique false pos 52 total unique false neg 37 average fp % 0.8 average fn % 0.569230769231 CL2 Results: /tmp/clgen-cl2-5x5 -> /tmp/clgen-cl2-5x5s.txt -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 
6500 hams & 6500 spams -> population hammean -0.258919766598 hamvar 0.235232283813 -> population spammean -0.238803626095 spamvar 0.189273495163 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 1.12307692308 -> false negative %: 0.369230769231 1.123 0.369 -> 73 new false positives -> 24 new false negatives -> Ham scores for all in this training set: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 -> Spam scores for all in this training set: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 -> best cutoff for all in this training set: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 -> best cutoff for all runs: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% total unique false pos 73 total unique false neg 24 average fp % 1.12307692308 average fn % 0.369230769231 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 5 20:19:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 15:19:32 -0400 Subject: [Spambayes] CL2 results In-Reply-To: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: [Brad Clements] > (2nd time posting, first time was rejected for name of attachment > quoted in this message ending with a .bat extension) looks like virus > buster for this list isn't decoding the message, just scanning for > content-type even if it's body text ? Greg Ward explained how python.org checks for viruses here: http://mail.python.org/pipermail-21/spambayes/2002-September/000327.html Something tagged by that never even makes it to the list moderator. 
This is one area where people seem to have a very high tolerance for false positives! > Wow, sure found a lot of ham in my spam.. > > Also, turns out I had a lot of zero length message files that > came up as false negatives.. I've rm `find -empty` and rebal.. How *should* empty msgs be treated (that's a question for everyone)? When there's nothing to go on, it's hard to decide . > I'm doing 50-50 training testing. > > -> Training on Data/Ham/Set{6,7,8,9,10} & > Data/Spam/Set{6,7,8,9,10} ... 6500 hams & > 6500 spams > hammean -0.258919766598 hamvar 0.235232283813 > spammean -0.238803626095 spamvar 0.189273495163 > -> population hammean -0.258919766598 hamvar 0.235232283813 > -> population spammean -0.238803626095 spamvar 0.189273495163 > -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... > -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams > -> false positive %: 1.12307692308 > -> false negative %: 0.369230769231 > -> 73 new false positives Please, please, please, show us the tiny 4-line histograms from the end of the full output file! More than half the point of the clt schemes is whether they *know* when they're uncertain. A "false positive" with a score of 1.0 is bad news, but a false positive with a score of 0.51 is a huge success for the clt schemes relative to the non-clt scheme. Similarly for false negatives, the difference between scores of 0.0 and 0.49 is most of the show here. The histograms reveal all this, and nothing else does. > A lot of the false positives are messages from > > e-trade, paypal, novell, ingram-micro, HP reseller, my mother ... Which is why, for purposes of evaluating the clt schemes, it's vital to know how many of those were "I'm certain" false positives, and how many "here's a guess, but I'm really confused about this one" false positives. No sane deployment would block a "I'm really confused about this one" msg, but *might* shuffle such a thing off to a distinct "please help me, I'm lost" folder. 
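The deployment Tim describes needs a three-way decision rather than a single spam/ham cutoff: confident calls get filed normally, and confused ones go to a review folder. A minimal sketch, with illustrative cutoff values (not project defaults):

```python
def bucket(score, ham_cutoff=0.10, spam_cutoff=0.90):
    # Confident decisions get filed normally; anything in between
    # lands in a "please help me, I'm lost" review folder.
    if score <= ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Under this rule a false positive scoring 0.51 is merely "unsure" (a success for the clt schemes), while one scoring 1.0 is a genuine blocked-ham disaster.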
> Here's a false negative I think should be caught somehow.. though > I don't know how.. > > (actually, I think this is klez.. I've saved those as spam too. I > have 3 other msgs with the same subject, 138k, when I open them > Sophos says it's klez) > > Data/Spam/Set5/11706 > prob = 0.0 > prob('*zham*') = -1.74447 > prob('*zspam*') = -23.1814 > prob('*hmean*') = -0.396172 > prob('*smean*') = -1.87484 > prob('*n*') = 38 > prob('header:Received:1') = 0.00372208 > prob('base64') = 0.0121951 > prob('from:email addr:murkworks.com>') = 0.0261568 > prob('skip:g 70') = 0.0266272 > prob('from:email name: prob('content-id:') = 0.0412844 > prob('skip:/ 70') = 0.0610425 > prob('skip:f 70') = 0.0847751 > prob('skip:t 70') = 0.0910781 > prob('skip:n 40') = 0.0983936 > prob('skip:r 70') = 0.117225 > prob('skip:- 10') = 0.131484 > prob('skip:k 70') = 0.14497 > prob('skip:q 70') = 0.14497 > prob('skip:h 70') = 0.14497 > prob('skip:z 70') = 0.14497 > prob('skip:j 70') = 0.14497 > prob('skip:u 70') = 0.14497 > prob('skip:v 70') = 0.14497 > prob('skip:x 70') = 0.14497 > prob('skip:d 70') = 0.194323 > prob('skip:n 70') = 0.228997 > prob('skip:b 70') = 0.239777 > prob('skip:s 70') = 0.268638 > prob('skip:c 70') = 0.29021 > prob('skip:e 70') = 0.299427 > prob('skip:9 70') = 0.308612 > prob('skip:i 70') = 0.308612 > prob('skip:6 70') = 0.308612 > prob('skip:m 70') = 0.314126 > prob('skip:w 70') = 0.356734 > prob('skip:p 70') = 0.378419 > prob('skip:a 70') = 0.391599 > prob('skip:c 20') = 0.392071 > prob('x-mailer:none') = 0.642702 > prob('subject:Japanese') = 0.844828 > prob('content-type:multipart/alternative') = 0.884053 > prob('content-type:text/html') = 0.941554 That collection of clues *suggests* the email package couldn't parse this msg, so that we fell back to the raw text. You could open this file "by hand" and try to get the email package to parse it, and that would answer the question. 
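Checking that hypothesis by hand is straightforward with Python's email package: parse the raw file and see whether the non-text parts come back as separate MIME sections. A sketch of the check (the project's tokenizer does its own MIME handling internally):

```python
import email

def text_parts(raw):
    """Parse a raw message (bytes) and keep only the text/* leaf
    parts, mirroring the policy of ignoring non-text MIME sections.
    If parsing degraded to one flat text blob, the base64 payload
    would show up inside the returned text."""
    msg = email.message_from_bytes(raw)
    return [part.get_payload(decode=True)
            for part in msg.walk()
            if part.get_content_maintype() == "text"]
```

For a well-formed multipart message like the one quoted, the audio/x-midi section is parsed as its own part and simply dropped here.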
If it was a well-formed message, we *should* have skipped the base64 part entirely, since we ignore all MIME sections that don't have a text/* type. What you showed us here is likely truncated because you have a default show_charlimit setting of 3000 (and there are indeed about 3K bytes in the rest of what you passed on). From jbublitz@nwinternet.com Sat Oct 5 20:32:59 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sat, 05 Oct 2002 12:32:59 -0700 (PDT) Subject: [Spambayes] Sequential Test Results Message-ID: I have a very unusual corpus of ham and spam compared to "normal", so these results may not be widely applicable. In evaluating Graham and Spambayes I've used both random testing (not as extensive as Spambayes) and sequential testing (train on first N, test next M). Since I use qmail, all of my mail is individual files and the filenames are the delivery timestamp, so it's easy to get an accurate sequence. In my experience, testing sequentially (as above) has always been "worst case" performance. A few days ago I switched to simulating actual performance, since I have to implement something sooner or later. My test procedure is: 1. Train on first T msgs 2. Test next t msgs 3. Train (incrementally) on t msgs 4.
Loop on 2 & 3 for N msgs

(all numbers are 50/50 spam/ham, which matches my average of receiving about 200 msgs/day)

For T = 8000, t = 200, N = 14400, the results I got for Graham were (cutoff is independent of anything else, so select the most desirable result):

(zero above)
cutoff: 0.15 -- fn = 0 (0.00%) fp = 7 (0.33%)
cutoff: 0.16 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.17 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.18 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.19 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.20 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.21 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.22 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.23 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.24 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.25 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.26 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.27 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.28 -- fn = 0 (0.00%) fp = 4 (0.19%)
cutoff: 0.29 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.30 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.31 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.32 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.33 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.34 -- fn = 0 (0.00%) fp = 1 (0.05%)
------------------------------------------------------
cutoff: 0.35 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.36 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.37 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.38 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.39 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.40 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.41 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.42 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.43 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.44 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.45 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.46 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.47 -- fn = 0 (0.00%) fp = 0 (0.00%)
------------------------------------------------------
cutoff: 0.48 -- fn = 1 (0.05%) fp = 0 (0.00%)
cutoff: 0.49 -- fn = 1 (0.05%) fp = 0 (0.00%)
cutoff: 0.50 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.51 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.52 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.53 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.54 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.55 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.56 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.57 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.58 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.59 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.60 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.61 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.62 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.63 -- fn = 6 (0.29%) fp = 0 (0.00%)
cutoff: 0.64 -- fn = 6 (0.29%) fp = 0 (0.00%)
cutoff: 0.65 -- fn = 7 (0.33%) fp = 0 (0.00%)
cutoff: 0.66 -- fn = 7 (0.33%) fp = 0 (0.00%)
cutoff: 0.67 -- fn = 8 (0.38%) fp = 0 (0.00%)
cutoff: 0.68 -- fn = 9 (0.43%) fp = 0 (0.00%)
cutoff: 0.69 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.70 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.71 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.72 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.73 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.74 -- fn = 15 (0.71%) fp = 0 (0.00%)
(zero below)

Graham     Spam    Ham
Mean       0.98    0.01
Std Dev    0.04    0.02
3 sigma    0.86    0.07

For Spambayes ("out of the box" - CVS from 10/2):

cutoff: 0.41 -- fn = 0 (0.00%) fp = 164 (2.13%)
cutoff: 0.42 -- fn = 1 (0.01%) fp = 140 (1.82%)
cutoff: 0.43 -- fn = 1 (0.01%) fp = 121 (1.57%)
cutoff: 0.44 -- fn = 1 (0.01%) fp = 103 (1.34%)
cutoff: 0.45 -- fn = 1 (0.01%) fp = 90 (1.17%)
cutoff: 0.46 -- fn = 1 (0.01%) fp = 68 (0.88%)
cutoff: 0.47 -- fn = 2 (0.03%) fp = 55 (0.71%)
cutoff: 0.48 -- fn = 2 (0.03%) fp = 47 (0.61%)
cutoff: 0.49 -- fn = 2 (0.03%) fp = 36 (0.47%)
cutoff: 0.50 -- fn = 3 (0.04%) fp = 30 (0.39%)
cutoff: 0.51 -- fn = 5 (0.06%) fp = 23 (0.30%)
cutoff: 0.52 -- fn = 8 (0.10%) fp = 15 (0.19%)
cutoff: 0.53 -- fn = 11 (0.14%) fp = 13 (0.17%)
cutoff: 0.54 -- fn = 15 (0.19%) fp = 7 (0.09%)
cutoff: 0.55 -- fn = 18 (0.23%) fp = 7 (0.09%)
cutoff: 0.56 -- fn = 28 (0.36%) fp = 5 (0.06%)
cutoff: 0.57 -- fn = 36 (0.47%) fp = 3 (0.04%)
cutoff: 0.58 -- fn = 46 (0.60%) fp = 2 (0.03%)
cutoff: 0.59 -- fn = 55 (0.71%) fp = 2 (0.03%)
cutoff: 0.60 -- fn = 63 (0.82%) fp = 2 (0.03%)
cutoff: 0.61 -- fn = 73 (0.95%) fp = 1 (0.01%)
cutoff: 0.62 -- fn = 90 (1.17%) fp = 0 (0.00%)

Spambayes  Spam    Ham
Mean       0.85    0.16
Std Dev    0.10    0.10
3 sigma    0.54    0.46

For Graham, the modifications made are:

1. Word freq threshold = 1 instead of 5
2. Case sensitive tokenizing
3. Use Gary Robinson's score calculation
4. Use token count instead of msg count in computing probability.

Counting msgs instead of tokens in computing probability is a fairly subtle bias (noted by Graham in "A Plan for Spam") and is still included in Spambayes. If I count msgs instead of tokens I can get about the same results and the mean and std dev are unaffected, but the tails of the distributions for ham/spam scores move closer together (no large dead band as above). Here's why (sort of):

The probability calculation is (s is spam count for a token, h is ham count, H/S are either the number of msgs seen or number of tokens seen):

    prob = (s/S)/((h/H) + (s/S))

which can be refactored to:

    prob = 1/(1 + (S/H)*(h/s))

or with Graham's bias:

    prob = 1/(1 + (S/H)*(2*h/s))

For my mail/testing,

    msgs   -- S/H = 1
    tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time)

so (for my unusual data anyway) counting msgs doubles the bias on the ham probability, but surprisingly affects the shape of my score distributions adversely. If I count msgs and remove Graham's 2.0 bias I get only slightly worse results than if I count tokens and include Graham's bias, since they're almost the same calculation and the sensitivity to S/H is small but noticeable. Playing around with Spambayes, I get slightly better results if I a) count tokens; b) count every token; c) drop robinson_probability_s to .05, but I still have overlap on the score distribution tails. (Adding Graham's bias back in helps too.)
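The refactoring above is easy to sanity-check numerically. A throwaway sketch (the counts are made up, and `bias` stands for Graham's 2.0 ham multiplier):

```python
def direct(s, h, S, H, bias=2.0):
    # prob = (s/S) / (bias*h/H + s/S), Graham's form with the ham bias
    return (s / S) / (bias * h / H + s / S)

def refactored(s, h, S, H, bias=2.0):
    # algebraically the same thing: 1 / (1 + (S/H) * bias * h/s)
    return 1.0 / (1.0 + (S / H) * bias * h / s)

# Made-up counts: token in 3 spams and 5 hams, totals S=10000, H=20000.
assert abs(direct(3, 5, 10000, 20000) - refactored(3, 5, 10000, 20000)) < 1e-12

# Jim's observation: msg counts give S/H = 1, token counts give S/H ~= 0.5,
# so counting msgs doubles the weight on the ham side and lowers the prob.
assert refactored(3, 5, 1.0, 1.0) < refactored(3, 5, 1.0, 2.0)
```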
Nothing I did to Spambayes had much effect on mean/std dev, but did reshape the distribution curves. I get a lot more tokens than Spambayes, but the ratios are close. Process sizes are comparable (about 100MB peak for the tests above). Spambayes is about 2X faster.

For my data (which, again, is unusual) I'd conclude:

1. Counting msgs and counting tokens once per msg seems wrong to me. It seems to me to be enumerating containers rather than enumerating contents, or at least mixing the two.

2. Sequential testing/training is important to look at (there may be time related effects - certainly S/H (counting tokens) varies over time). These are better than any other test results I've had for either method.

3. I'd concentrate on shaping the tails of the distribution rather than worrying about mean and std dev. Some adjustments will degrade the mean/std dev but improve the shape of the distribution/sharpness of discrimination. If you look at the 3 sigma limits on either method (covers about 99.7% of the distribution in theory), the fns and fps are out past 3 sigma. In EE terms, you want sharper rolloff, not necessarily higher Q or a change in center frequency. Graham appears to be less sensitive to choice of cutoff than Spambayes for my dataset.

As far as training sample size goes, the results above were based on an initial training sample of 8000 msgs. If I start with 200 msgs, I get an additional 15 fns in the first 200 tested, and one additional fn through the rest of the first 8000 msgs, at which time I'm back to the test/results above (choosing a fairly optimal cutoff value). If I start with 1 ham and 1 spam, I get 89 fps on the first 200 and no more failures after that. It seems to converge. Not very sensitive to initial conditions.

All of this might only work for my email. YMMV. For either method the results are fantastic. I'd be happy with a 90% spam reduction and no fps.
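The procedure above amounts to a rolling train/score loop. A toy sketch (the `train`/`score` interface and `CountClassifier` are illustrative stand-ins, not the spambayes or Graham implementations):

```python
class CountClassifier:
    """Toy stand-in: scores a one-token message by the fraction of
    training messages with that token that were spam."""
    def __init__(self):
        self.spam, self.total = {}, {}
    def train(self, msgs, labels):
        for m, is_spam in zip(msgs, labels):
            self.total[m] = self.total.get(m, 0) + 1
            if is_spam:
                self.spam[m] = self.spam.get(m, 0) + 1
    def score(self, m):
        return self.spam.get(m, 0) / self.total.get(m, 1)

def sequential_test(msgs, labels, clf, T, t):
    """1. Train on first T msgs; 2. score next t; 3. train on those t;
    4. loop steps 2 & 3 over the rest. Returns (score, is_spam) pairs."""
    clf.train(msgs[:T], labels[:T])
    results = []
    for i in range(T, len(msgs), t):
        batch, truth = msgs[i:i + t], labels[i:i + t]
        results += [(clf.score(m), s) for m, s in zip(batch, truth)]
        clf.train(batch, truth)  # incremental training before moving on
    return results

msgs = ['$$$', 'meeting'] * 20
labels = [True, False] * 20
res = sequential_test(msgs, labels, CountClassifier(), T=10, t=5)
assert all(score == 1.0 for score, is_spam in res if is_spam)
assert all(score == 0.0 for score, is_spam in res if not is_spam)
```

Because each batch is scored before it is trained on, every prediction is made only from strictly earlier mail, which is what makes this a simulation of live delivery order.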
Jim From rob@hooft.net Sat Oct 5 21:20:59 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 22:20:59 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9F49AB.6040900@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment I am attaching my new version of clpik.py that implements my RMS Z-score ideas. Some results I get are listed hereunder. I'm very interested to hear what other people get with this! amigo[143]clpik%% python clpik.py climbig12.pk 34 descriptions from knownfalse.dat Reading climbig12.pk ... Nham= 12800 RmsZham= 2.76178782393 Nspam= 5600 RmsZspam= 4.64849650515 ====================================================================== HAM: FALSE POSITIVE: zham=-3.92 zspam=-1.51 Data/Ham/Set4/h06542.txt SURE! ==> Mailing list removal confirmation request FALSE POSITIVE: zham=-4.99 zspam=-2.55 Data/Ham/Set4/h07701.txt SURE! ==> E-mail provider newsletter (in German) FALSE POSITIVE: zham=-3.87 zspam=-1.53 Data/Ham/Set6/h03075.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.73 zspam=-1.49 Data/Ham/Set7/m05802.txt SURE! ==> Student from India applying to a course FALSE POSITIVE: zham=-4.87 zspam=-0.99 Data/Ham/Set7/h16981.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.76 zspam=-1.73 Data/Ham/Set8/h07523.txt SURE! ==> Postmaster autoreply FALSE POSITIVE: zham=-6.02 zspam=-1.54 Data/Ham/Set8/h13038.txt SURE! ==> Amazon.com customer data change announcement FALSE POSITIVE: zham=-4.55 zspam=-2.45 Data/Ham/Set9/h16973.txt SURE! ==> Headhunter hunting me FALSE POSITIVE: zham=-5.90 zspam=-2.32 Data/Ham/Set9/h07516.txt SURE! ==> Autoreply on website request FALSE POSITIVE: zham=-3.63 zspam=-1.62 Data/Ham/Set9/h03070.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.70 zspam=-1.74 Data/Ham/Set9/h17001.txt SURE! 
==> Postmaster autoreply

Sure/ok       12477
Unsure/ok       239
Unsure/not ok    73
Sure/not ok      11
Unsure rate = 2.44%
Sure fp rate = 0.09%; Unsure fp rate = 23.40%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.28 zspam=-6.70 Data/Spam/Set4/m06556.txt SURE!
FALSE NEGATIVE: zham=-1.69 zspam=-3.63 Data/Spam/Set5/h16027.txt SURE!
FALSE NEGATIVE: zham=-2.28 zspam=-6.70 Data/Spam/Set6/m01349.txt SURE!
Sure/ok        5437
Unsure/ok       141
Unsure/not ok    19
Sure/not ok       3
Unsure rate = 1.25%
Sure fn rate = 0.06%; Unsure fn rate = 11.88%

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
#! /usr/bin/env python

# Analyze a clim.pik file.

"""Usage: %(program)s [options] [central_limit_pickle_file]

An example analysis program showing how to access info from a
central-limit pickle file created by clgen.py. This program produces
histograms of various things.

Scores for all predictions are saved at the end of binary pickle
clim.pik. This contains two lists of tuples, the first list with a
tuple for every ham predicted, the second list with a tuple for every
spam predicted. Each tuple has these values:

    tag         the msg identifier
    is_spam     True if msg came from a spam Set, False if from a ham Set
    zham        the msg zscore relative to the population ham
    zspam       the msg zscore relative to the population spam
    hmean       the raw mean ham score
    smean       the raw mean spam score
    n           the number of clues used to judge this msg

Note that hmean and smean are the same under use_central_limit; they're
very likely to differ under use_central_limit2.

Where:
    -h
        Show usage and exit.

If no file is named on the cmdline, clim.pik is used.
""" surefactor=1000 # This is basically the inverse of the accepted fp/fn rate punsure=0 # Print unsure decisions (otherwise only sure-but-false) import sys,math,os import cPickle as pickle program = sys.argv[0] def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def chance(x): if x>=0: return 1.0 x=-x/math.sqrt(2) if x<1.4: return 1.0 assert x>=1.4 x=float(x) pre=math.exp(-x**2)/math.sqrt(math.pi)/x post=1-(1/(2*x**2)) return pre*post knownfalse={} def readknownfalse(): global knownfalse knownfalse={} try: f=open('knownfalse.dat') except IOError: return while 1: line=f.readline() if not line: break key,desc=line.split(None,1) knownfalse[key]=desc[:-1] print "%d descriptions from knownfalse.dat"%len(knownfalse) def prknown(tag): bn=os.path.basename(tag) if knownfalse.has_key(bn): print " ==>",knownfalse[bn] def drive(fname): print 'Reading', fname, '...' f = open(fname, 'rb') ham = pickle.load(f) spam = pickle.load(f) f.close() zhamsum2=0 nham=0 for msg in ham: if msg[1]: print "spam in ham",msg else: zhamsum2+=msg[2]**2 nham+=1 rmszham=math.sqrt(zhamsum2/nham) print "Nham=",nham print "RmsZham=",rmszham zspamsum2=0 nspam=0 for msg in spam: if not msg[1]: print "ham in spam",msg else: zspamsum2+=msg[3]**2 nspam+=1 rmszspam=math.sqrt(zspamsum2/nspam) print "Nspam=",nspam print "RmsZspam=",rmszspam #========= Analyze ham print "="*70 print "HAM:" nsureok=0 nunsureok=0 nunsurenok=0 nsurenok=0 for msg in ham: zham=msg[2]/rmszham zspam=msg[3]/rmszspam cham=chance(zham) cspam=chance(zspam) if cham>surefactor*cspam and cham>0.01: nsureok+=1 # very certain elif cham>cspam: nunsureok+=1 #print "Unsure",msg[0] #prknown(msg[0]) else: if cspam>surefactor*cham and cspam>0.01: reason="SURE!" nsurenok+=1 elif cham<0.01 and cspam<0.01: reason="neither?" nunsurenok+=1 elif cham>0.1 and cspam>0.1: reason="both?" 
nunsurenok+=1 else: reason="Unsure" nunsurenok+=1 if reason=="SURE!" or punsure: print "FALSE POSITIVE: zham=%.2f zspam=%.2f %s %s"%(zham,zspam,msg[0],reason) prknown(msg[0]) print "Sure/ok ",nsureok print "Unsure/ok ",nunsureok print "Unsure/not ok",nunsurenok print "Sure/not ok ",nsurenok print "Unsure rate = %.2f%%"%(100.*(nunsureok+nunsurenok)/len(ham)) print "Sure fp rate = %.2f%%; Unsure fp rate = %.2f%%"%(100.*nsurenok/(nsurenok+nsureok),100.*nunsurenok/(nunsurenok+nunsureok)) #========= Analyze spam print "="*70 print "SPAM:" nsureok=0 nunsureok=0 nunsurenok=0 nsurenok=0 for msg in spam: zham=msg[2]/rmszham zspam=msg[3]/rmszspam cham=chance(zham) cspam=chance(zspam) if cspam>surefactor*cham and cspam>0.01: nsureok+=1 # very certain elif cspam>cham: nunsureok+=1 #print "Unsure",msg[0] #prknown(msg[0]) else: if cham>surefactor*cspam and cham>0.01: reason="SURE!" nsurenok+=1 elif cham<0.01 and cspam<0.01: reason="neither?" nunsurenok+=1 elif cham>0.1 and cspam>0.1: reason="both?" nunsurenok+=1 else: reason="Unsure" nunsurenok+=1 if reason=="SURE!" 
or punsure: print "FALSE NEGATIVE: zham=%.2f zspam=%.2f %s %s"%(zham,zspam,msg[0],reason) prknown(msg[0]) print "Sure/ok ",nsureok print "Unsure/ok ",nunsureok print "Unsure/not ok",nunsurenok print "Sure/not ok ",nsurenok print "Unsure rate = %.2f%%"%(100.*(nunsureok+nunsurenok)/len(ham)) print "Sure fn rate = %.2f%%; Unsure fn rate = %.2f%%"%(100.*nsurenok/(nsurenok+nsureok),100.*nunsurenok/(nunsurenok+nunsureok)) def main(): import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'h:', ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) nbuckets = 100 for opt, arg in opts: if opt == '-h': usage(0) fname = 'clim.pik' if args: fname = args.pop(0) if args: usage(1, "No more than one positional argument allowed") readknownfalse() drive(fname) if __name__ == "__main__": main() ---------------------- multipart/mixed attachment-- From bkc@murkworks.com Sat Oct 5 21:33:55 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 16:33:55 -0400 Subject: [Spambayes] CL2 results and CL3 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <3D9F1443.24980.27FFFFA0@localhost> Uh, they're not 4-lines because my .ini settings aren't default.. but, I've made them four lines now, snip snip. 
CL2 RESULTS -> Ham scores for all in this training set: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 * = 104 items 0 6321 ************************************************************* 48 106 ** 50 52 * 98 21 * -> Spam scores for all in this training set: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 * = 105 items 0 10 * 48 14 * 50 75 * 98 6401 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 * = 104 items 0 6321 ************************************************************* 48 106 ** 50 52 * 98 21 * -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 * = 105 items 0 10 * 48 14 * 50 75 * 98 6401 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik CL3 RESULTS -> Ham scores for all in this training set: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 * = 105 items 0 6373 ************************************************************* 48 75 * 50 34 * 98 18 * -> Spam scores for all in this training set: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 * = 105 items 0 7 * 48 30 * 50 92 * 98 6371 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 * = 105 items 0 6373 
************************************************************* 48 75 * 50 34 * 98 18 * -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 * = 105 items 0 7 * 48 30 * 50 92 * 98 6371 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sat Oct 5 21:35:36 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 16:35:36 -0400 Subject: [Spambayes] CL histograms Message-ID: <3D9F14A7.29070.280186A8@localhost> Actually, my .ini doesn't specify nbuckets, but my histograms are still 40 lines.. ?? [Tokenizer] mine_received_headers: True [Classifier] use_central_limit2 = False use_central_limit3 = True [TestDriver] spam_cutoff: 0.50 show_false_negatives: True show_spam_lo: 0.0 show_spam_hi: 0.45 save_trained_pickles: True save_histogram_pickles: True Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 5 22:42:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 17:42:28 -0400 Subject: [Spambayes] CL histograms In-Reply-To: <3D9F14A7.29070.280186A8@localhost> Message-ID: [Brad Clements] > Actually, my .ini doesn't specify nbuckets, but my histograms are > still 40 lines.. ?? Right, nbuckets defaults to 40. All options and their default values are in Options.py. For central limit runs, I recommend this base .ini file. 
"base" means it's irrelevant to test evaluation how you set the various display options (whether you want to see false negatives and/or f-p, how many characters you want to clamp those to, whether you want to save pickles, etc -- none of that has any effect on error rates, or on the stuff that displays error rates): """ [Classifier] use_central_limit2: True # or use_central_limit: True # or use_central_limit3: True max_discriminators: 50 zscore_ratio_cutoff: 1.9 [TestDriver] spam_cutoff: 0.50 nbuckets: 4 """ From tim.one@comcast.net Sun Oct 6 00:32:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 19:32:11 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9F49AB.6040900@hooft.net> Message-ID: [Rob Hooft] > I am attaching my new version of clpik.py that implements my RMS > Z-score ideas. Cool! I'm going to check this into the project, but under the name rmspik.py. People playing along: you DO NOT need to rerun a test to try this! rmspik.py analyzes the binary pickle (clim.pik) left behind by clgen.py (the central-limit analysis test driver), and very quickly (a matter of seconds) determines exactly what would have happened had we used Rob's RMS certainty rules instead. > Some results I get are listed hereunder. I'm very interested to > hear what other people get with this! Here's a use_central_limit2 run with max_discriminators=50, trained on 5000 ham and 5000 spam, then predicting against 7500 of each: -> Ham scores for all runs: 7500 items; mean 0.14; sdev 2.72 -> min 0; median 0; max 100 * = 123 items 0 7480 ************************************************************* 25 18 * 50 1 * 75 1 * -> Spam scores for all runs: 7500 items; mean 99.86; sdev 2.85 -> min 0; median 100; max 100 * = 123 items 0 2 * 25 1 * 50 16 * 75 7481 ************************************************************* Under rmspik, Reading clim.pik ... 
Nham= 7500 RmsZham= 2.27249107964 Nspam= 7500 RmsZspam= 2.354280998 ====================================================================== HAM: Sure/ok 7325 Unsure/ok 172 Unsure/not ok 3 Sure/not ok 0 Unsure rate = 2.33% Sure fp rate = 0.00%; Unsure fp rate = 1.71% ====================================================================== SPAM: FALSE NEGATIVE: zham=-2.39 zspam=-4.93 Data/Spam/Set7/99999.txt SURE! Sure/ok 7422 Unsure/ok 75 Unsure/not ok 2 Sure/not ok 1 Unsure rate = 1.03% Sure fn rate = 0.01%; Unsure fn rate = 2.60% So RMS was unsure much more often, and especially unsure about ham. In the end RMS had one more false positive (2 versus 3), but all 3 were in its region of uncertainty. They both had 3 false negatives, but RMS had one fewer in its region of certainty. The sole f-n it was certain about is also one clim2 was certain about, and is a spam with a uuencoded body that we don't decode. This is a tradeoff in the tokenizer: it simply doesn't generate enough clues to nail this one (10 "words" total). It's especially embarrassing because the subject line is Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!! Sheesh . BTW, for python.org use, an uncertainty rate over 2% may not fly -- Greg already gripes about reviewing a trivial number of msgs each day. Now all over again, but with use_central_limit3; max_discriminators still 50, and same sets of msgs trained on and predicted against: -> Ham scores for all runs: 7500 items; mean 0.05; sdev 1.61 -> min 0; median 0; max 51 * = 123 items 0 7492 ************************************************************* 25 7 * 50 1 * 75 0 -> Spam scores for all runs: 7500 items; mean 99.63; sdev 4.43 -> min 0; median 100; max 100 * = 123 items 0 2 * 25 5 * 50 48 * 75 7445 ************************************************************* The uncertainty rate on ham is plain jaw-dropping there. It's less sure about spam, but in the end makes the same "but I was certain" mistakes. 
Let's see how rmspik does on it: Reading clim.pik ... Nham= 7500 RmsZham= 9.77605846416 Nspam= 7500 RmsZspam= 10.1887670936 ====================================================================== HAM: Sure/ok 7316 Unsure/ok 183 Unsure/not ok 1 Sure/not ok 0 Unsure rate = 2.45% Sure fp rate = 0.00%; Unsure fp rate = 0.54% ====================================================================== SPAM: FALSE NEGATIVE: zham=-2.32 zspam=-6.04 Data/Spam/Set7/99999.txt SURE! Sure/ok 7269 Unsure/ok 225 Unsure/not ok 5 Sure/not ok 1 Unsure rate = 3.07% Sure fn rate = 0.01%; Unsure fn rate = 2.17% RMS's uncertainty about spam skyrocketed under this scheme, but it did a little better on ham under this scheme (1 fp total versus 3 before). In return, it has more fn (6 total vs 3 before). From tim.one@comcast.net Sun Oct 6 00:47:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 19:47:32 -0400 Subject: [Spambayes] CL2 results and CL3 results In-Reply-To: <3D9F1443.24980.27FFFFA0@localhost> Message-ID: [Brad Clements] > Uh, they're not 4-lines because my .ini settings aren't default.. > > but, I've made them four lines now, snip snip. Thanks! > CL2 RESULTS > ... > -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 > -> min 0; median 0; max 100 > * = 104 items > 0 6321 ************************************************************* > 48 106 ** > 50 52 * > 98 21 * > > -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 > -> min 0; median 100; max 100 > * = 105 items > 0 10 * > 48 14 * > 50 75 * > 98 6401 ************************************************************* > CL3 RESULTS > ... 
> -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 > -> min 0; median 0; max 100 > * = 105 items > 0 6373 ************************************************************* > 48 75 * > 50 34 * > 98 18 * > > -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 > -> min 0; median 100; max 100 > * = 105 items > 0 7 * > 48 30 * > 50 92 * > 98 6371 ************************************************************* Your test data looks tougher than mine, but three outcomes are the same: 1. CL3 is certain more often than CL2 about ham, and makes fewer mistakes when it is certain. 2. CL3 is certain less often than CL2 about spam, but makes fewer mistakes when it's certain there too. 3. CL2 and CL3 both have high error rates in their regions of uncertainty. I think that's a Very Good Thing, because it means manual review won't be overwhelmingly a waste of time. If the error rate in the uncertainty region is just a percent or two, I believe manual review will become careless, or even skipped. But if it's actually wrong in its guess a third of the time, it will be fun to remind yourself of how much smarter you are than a stupid computer . If you've still got the clim pickles from these runs, please try Rob's rmspik.py on them too (I just checked that into the project). From tim.one@comcast.net Sun Oct 6 01:35:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 20:35:49 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: One more test result here, using Gary's *original* central-limit scheme. That didn't get a fair trial when it was introduced: at the time, the business about "certainty" under these schemes wasn't known, or even suspected, so it looked poorer by comparison due to a seemingly large increase in errors rates. Now we know that *most* of that was just the system very helpfully telling us it's unsure of its decision. 
But at the time, Gary immediately came up with central_limit2, and central_limit has been neglected ever since. Same setup as before, but with use_central_limit: -> Ham scores for all runs: 7500 items; mean 0.26; sdev 3.67 -> min 0; median 0; max 100 * = 123 items 0 7461 ************************************************************* 25 37 * 50 1 * 75 1 * -> Spam scores for all runs: 7500 items; mean 99.75; sdev 3.59 -> min 0; median 100; max 100 * = 123 items 0 1 * 25 3 * 50 33 * 75 7463 ************************************************************* Overall, it's quite comparable to the two other central limit variations, just uncertain slightly (in absolute terms) more often. The uncertainty increase is large in *relative* terms, though, which is why this looked like a big jump in error rates when it was first tried. Crunching the raw data via rmspik: Reading clim.pik ... Nham= 7500 RmsZham= 2.93763751621 Nspam= 7500 RmsZspam= 3.62374621717 ====================================================================== HAM: Sure/ok 7491 Unsure/ok 8 Unsure/not ok 1 Sure/not ok 0 Unsure rate = 0.12% Sure fp rate = 0.00%; Unsure fp rate = 11.11% ====================================================================== SPAM: FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! 
Sure/ok 0 Unsure/ok 0 Unsure/not ok 7495 Sure/not ok 5 Unsure rate = 99.93% Sure fn rate = 100.00%; Unsure fn rate = 100.00% So the RMS business is certain very much more often under the original central limit scheme: RMS ham unsure RMS spam unsure -------------- --------------- central_limit 9 0 central_limit2 175 77 central_limit3 184 227 That suggests to me that, whatever the heck RMS is doing, it's a much better fit to the original central_limit scheme, but has a bizarre problem with spam there. I don't know whether I care about it, though, as it would have leaked 5 spam out of 7500, and that's a measly 0.067% total f-n rate. Let's look at the "sure but wrong" FN there: FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! The "Hello, my Name is BlackIntrepid" spam, discussed at length previously here. Had no spam indicators at all when max_discriminators was 16 under the Graham scheme (highest spamprob among the 16 most extreme was about 0.05(!)). FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! A short "just folks" spam that has given lots of schemes trouble: """ Return-Path: Delivered-To: em-ca-bruceg@em.ca Received: (qmail 13437 invoked from network); 16 Aug 2002 02:37:15 -0000 Received: from unknown (HELO pakistan) (203.135.9.174) by churchill.factcomp.com with SMTP; 16 Aug 2002 02:37:15 -0000 From: "Scott Mark" To: Subject: Hello ! Mime-Version: 1.0 Content-Type: text/html; charset="iso-8859-1" Date: Fri, 9 Aug 2002 08:35:24 Content-Length: 609
Hi,

Just wanted you to check out this cool online website builder. It lets people create cool websites in minutes and for free. You can create your own Flash animations and Intro as well. Its really simple and easy to use :) and its all a matter of minutes, you'll have an impressive website up and running in no time, i'm impressed ... I bet you'll be impressed as well.

This website gives a nice review and how to get started creating your first website easily : www.click-free.com

Thanks,
Scott Mark.
""" FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! "Subject: Website Programmers Available Now" Loaded with tech terms related to web design and programming, a frequent topic on c.l.py (my ham). The more-extreme central- limit schemes get huge benefit out of extremely large spamprob words like "offshore". Extreme extreme words don't have such extreme effect under the original cl scheme. FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! "Subject: www.NameYork.com / Webmaster link directory" I've had lots of trouble with this one before. It's a long HTML msg full of links that would be of actual interest to webmasters. It even includes a link for Python. I've never been entirely sure that it's spam, but it "smells more" like spam than ham to me. FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! Another "just folks" spam that has given lots of schemes trouble: """ Return-Path: Delivered-To: em-ca-bruceg@em.ca Received: (qmail 27970 invoked from network); 14 Jul 2002 01:43:00 -0000 Received: from agamemnon.bfsmedia.com (204.83.201.2) by churchill.factcomp.com with SMTP; 14 Jul 2002 01:43:00 -0000 Received: (qmail 28917 invoked from network); 14 Jul 2002 01:26:53 -0000 Received: from c-24-131-114-96.mw.client2.attbi.com (HELO core.com) (24.131.114.96) by agamemnon.bfsmedia.com with SMTP; 14 Jul 2002 01:26:53 -0000 From: "jarph3@core.com" To: Subject: I want to share with you what I found Sender: "jarph3@core.com" Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 8bit Date: Sat, 13 Jul 2002 21:16:04 -0400 Content-Length: 676 My brother asked me to design a web page for his band Tainted Emotions. At first, his site was nothing more than a few paragraphs describing his unique psychotic melodies. Although a good start, mere words failed to convey the complete Tainted Emotions experience. For that, I needed graphics. Not just any graphics though. 
Fast, sleek, and professional images that only my brother's band deserves. I found all the free public domain photos I needed at freewebgrafix.com. They had everything an aspiring graphics designer needs to transform a texty site into a graphic sensation. Animated GIFs, backgrounds, banners, and of course--photos. http://www.freewebgrafix.com """ I can live with spam like that -- the combination of original-cl and RMS looks very much worth pursuing. From tim.one@comcast.net Sun Oct 6 01:46:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 20:46:32 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: Oops! I misread this data badly. > Crunching the raw data via rmspik [from the original use_central_limit]: > > Reading clim.pik ... > Nham= 7500 > RmsZham= 2.93763751621 > Nspam= 7500 > RmsZspam= 3.62374621717 > ====================================================================== > HAM: > Sure/ok 7491 > Unsure/ok 8 > Unsure/not ok 1 > Sure/not ok 0 > Unsure rate = 0.12% > Sure fp rate = 0.00%; Unsure fp rate = 11.11% > ====================================================================== > SPAM: > FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! > FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! > FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! > FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! > FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! > Sure/ok 0 > Unsure/ok 0 > Unsure/not ok 7495 > Sure/not ok 5 > Unsure rate = 99.93% > Sure fn rate = 100.00%; Unsure fn rate = 100.00% It was actually unsure about almost 100% of the spam! So this table's first row: > RMS ham unsure RMS spam unsure > -------------- --------------- > central_limit 9 0 > central_limit2 175 77 > central_limit3 184 227 should have said > central_limit 9 7495 instead. I assume this is evidence of a bug somewhere.
Note that the hmean and smean for a msg are always identical under the original central limit scheme. From tim.one@comcast.net Sun Oct 6 02:32:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 21:32:10 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: [Tim] > ... > It was actually unsure about almost 100% of the spam! So this table's first > row: > > RMS ham unsure RMS spam unsure > -------------- --------------- > central_limit 9 0 > central_limit2 175 77 > central_limit3 184 227 > > should have said > > central_limit 9 7495 > > instead. I assume this is evidence of a bug somewhere. Note > that the hmean and smean for a msg are always identical under the > original central limit scheme. The stuff below changes the first line to central_limit 49 11 I believe "the bug" is in rmspik.chance(), which appears to assume that a zscore in the positive direction is an indicator of certainty. That seems to be true in the logarithmic central-limit schemes, but isn't true in the original central-limit scheme. Changing the first three lines like so:

    # if x >= 0:
    #     return 1.0
    # x = -x/math.sqrt(2)
    x = abs(x)/math.sqrt(2)

and rerunning rmspik leads to very different results under the original central limit scheme: Reading clim.pik ... Nham= 7500 RmsZham= 2.93763751621 Nspam= 7500 RmsZspam= 3.62374621717 ====================================================================== HAM: FALSE POSITIVE: zham=6.64 zspam=-1.66 Data/Ham/Set10/107687.txt SURE! Sure/ok 7413 Unsure/ok 79 Unsure/not ok 7 Sure/not ok 1 Unsure rate = 1.15% Sure fp rate = 0.01%; Unsure fp rate = 8.14% ====================================================================== SPAM: Sure/ok 7451 Unsure/ok 38 Unsure/not ok 11 Sure/not ok 0 Unsure rate = 0.65% Sure fn rate = 0.00%; Unsure fn rate = 22.45% All the problems with spam went away then, and ham gives it more trouble now.
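[Editor's note: a stand-alone sketch of the symmetric tail probability, using the standard library's math.erfc rather than rmspik's own approximation; this is not the actual rmspik.py code.]

```python
import math

def chance(z):
    # Two-tailed tail probability of a unit normal, P(|Z| >= |z|),
    # which is erfc(|z| / sqrt(2)).  Treating positive and negative
    # z-scores symmetrically is exactly the abs() change above.
    return math.erfc(abs(z) / math.sqrt(2))
```

For z = 1.96 this gives about 0.05, the familiar two-sided 5% level, and chance(z) == chance(-z) by construction.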
It's still certain much more often here than under the extreme central-limit schemes, so I still suspect RMS is a better fit to the original cl scheme (but the probability calculation has to change to something more symmetric). The false positive it was certain about was the lady with a brief relevant question, and a long, obnoxious, employer-generated sig. That's one of my two remaining f-p under the all-default scheme too (it so happens that the Nigerian scam quote was in the training data on these runs, so can't show up as an f-p). From tim.one@comcast.net Sun Oct 6 02:45:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 21:45:34 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: >> RMS ham unsure RMS spam unsure >> -------------- --------------- >> central_limit 9 0 >> central_limit2 175 77 >> central_limit3 184 227 >> >> should have said >> >> central_limit 9 7495 >> > ... > The stuff below changes the first line to > > central_limit 49 11 Heh. I give up. That should have read central_limit 86 49 None of the conclusions change: RMS is happiest with the original central limit scheme, and does very well with it indeed, but rmspik.chance() needs to change to be symmetric when used with the original central limit scheme else the spam results are a disaster. From tim.one@comcast.net Sun Oct 6 04:36:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 23:36:25 -0400 Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: [Jim Bublitz] Thanks for sharing this! It's an excellent report. > I have a very unusual corpus of ham and spam compared > to "normal", so these results may not be widely > applicable. Could you say something about *what* makes you abnormal ? > In evaluating Graham and Spambayes I've used both > random testing (not as extensive as Spambayes) and > sequential testing (train on first N, test next M). 
> Since I use qmail, all of my mail is individual files > and the filenames are the delivery timestamp, so it's > easy to get an accurate sequence. In my experience, > testing sequentially (as above) has always been "worst > case" performance. > > A few days ago I switched to simulating actual > performance, since I have to implement something sooner > or later. Excellent -- nobody here has done that yet (that I know of), and I've worried out loud that randomization allows msgs to get benefit from training msgs that appeared *after* them in time; e.g., a ham msg can be helped by the fact that a reply to it appeared in the training ham, but that can never happen in real life. > My test procedure is: > > 1. Train on first T msgs > 2. Test next t msgs > 3. Train (incrementally) on t msgs > 4. Loop on 2 & 3 for N msgs > > (all numbers are 50/50 spam/ham, which matches my average of > about 200 msgs/day) > > For T = 8000, t = 200, N = 14400, the results I got for > Graham were (cutoff is independent of anything else, so > select the most desirable result): I'm not sure what these are results *of* -- like, the last time you ran step #2? An average over all times you ran step #2?
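[Editor's note: Jim's steps 1-4 can be sketched as a small harness; `train` and `score` here are hypothetical stand-ins, not the actual spambayes interfaces.]

```python
def sequential_test(msgs, labels, train, score, T, t, N, cutoff=0.5):
    """Train on the first T msgs, then alternate scoring and
    incremental training in chronological batches of t (steps 1-4).
    Returns the total number of misclassified msgs."""
    errors = 0
    train(msgs[:T], labels[:T])                    # step 1
    for start in range(T, T + N, t):
        batch = msgs[start:start + t]
        truth = labels[start:start + t]
        for msg, is_spam in zip(batch, truth):     # step 2
            if (score(msg) >= cutoff) != is_spam:
                errors += 1
        train(batch, truth)                        # step 3 (step 4 = loop)
    return errors
```

The point of the chronological loop is that no message is ever scored with the help of training data that arrived after it.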
> (zero above) > cutoff: 0.15 -- fn = 0 (0.00%) fp = 7 (0.33%) > cutoff: 0.16 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.17 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.18 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.19 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.20 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.21 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.22 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.23 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.24 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.25 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.26 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.27 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.28 -- fn = 0 (0.00%) fp = 4 (0.19%) > cutoff: 0.29 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.30 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.31 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.32 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.33 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.34 -- fn = 0 (0.00%) fp = 1 (0.05%) > ------------------------------------------------------ > cutoff: 0.35 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.36 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.37 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.38 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.39 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.40 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.41 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.42 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.43 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.44 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.45 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.46 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.47 -- fn = 0 (0.00%) fp = 0 (0.00%) > ------------------------------------------------------ > cutoff: 0.48 -- fn = 1 (0.05%) fp = 0 (0.00%) > cutoff: 0.49 -- fn = 1 (0.05%) fp = 0 (0.00%) > cutoff: 0.50 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.51 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.52 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.53 -- fn = 2 (0.10%) fp = 0 
(0.00%) > cutoff: 0.54 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.55 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.56 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.57 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.58 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.59 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.60 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.61 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.62 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.63 -- fn = 6 (0.29%) fp = 0 (0.00%) > cutoff: 0.64 -- fn = 6 (0.29%) fp = 0 (0.00%) > cutoff: 0.65 -- fn = 7 (0.33%) fp = 0 (0.00%) > cutoff: 0.66 -- fn = 7 (0.33%) fp = 0 (0.00%) > cutoff: 0.67 -- fn = 8 (0.38%) fp = 0 (0.00%) > cutoff: 0.68 -- fn = 9 (0.43%) fp = 0 (0.00%) > cutoff: 0.69 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.70 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.71 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.72 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.73 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.74 -- fn = 15 (0.71%) fp = 0 (0.00%) > (zero below) > > > Graham > Spam Ham > Mean 0.98 0.01 And these are the means of what? For example, there's no false-negative rate as large as 0.98 in the table above, so 0.98 certainly isn't the mean of the table entries. 
> Std Dev 0.04 0.02 > 3 sigma 0.86 0.07 > > > For Spambayes ("out of the box" - CVS from 10/2) > > > cutoff: 0.41 -- fn = 0 (0.00%) fp = 164 (2.13%) > cutoff: 0.42 -- fn = 1 (0.01%) fp = 140 (1.82%) > cutoff: 0.43 -- fn = 1 (0.01%) fp = 121 (1.57%) > cutoff: 0.44 -- fn = 1 (0.01%) fp = 103 (1.34%) > cutoff: 0.45 -- fn = 1 (0.01%) fp = 90 (1.17%) > cutoff: 0.46 -- fn = 1 (0.01%) fp = 68 (0.88%) > cutoff: 0.47 -- fn = 2 (0.03%) fp = 55 (0.71%) > cutoff: 0.48 -- fn = 2 (0.03%) fp = 47 (0.61%) > cutoff: 0.49 -- fn = 2 (0.03%) fp = 36 (0.47%) > cutoff: 0.50 -- fn = 3 (0.04%) fp = 30 (0.39%) > cutoff: 0.51 -- fn = 5 (0.06%) fp = 23 (0.30%) > cutoff: 0.52 -- fn = 8 (0.10%) fp = 15 (0.19%) > cutoff: 0.53 -- fn = 11 (0.14%) fp = 13 (0.17%) > cutoff: 0.54 -- fn = 15 (0.19%) fp = 7 (0.09%) > cutoff: 0.55 -- fn = 18 (0.23%) fp = 7 (0.09%) > cutoff: 0.56 -- fn = 28 (0.36%) fp = 5 (0.06%) > cutoff: 0.57 -- fn = 36 (0.47%) fp = 3 (0.04%) > cutoff: 0.58 -- fn = 46 (0.60%) fp = 2 (0.03%) > cutoff: 0.59 -- fn = 55 (0.71%) fp = 2 (0.03%) > cutoff: 0.60 -- fn = 63 (0.82%) fp = 2 (0.03%) > cutoff: 0.61 -- fn = 73 (0.95%) fp = 1 (0.01%) > cutoff: 0.62 -- fn = 90 (1.17%) fp = 0 (0.00%) > > > Spambayes > Spam Ham > Mean 0.85 0.16 > Std Dev 0.10 0.10 > 3 sigma 0.54 0.46 > For Graham, the modifications made are: > > 1. Word freq threshold = 1 instead of 5 That helped us a lot when we were using Graham. > 2. Case sensitive tokenizing That did not (made no overall difference in error rates; it systematically called conference announcements spam, but was better at distinguishing spam screaming about MONEY from casual mentions of money in ham). > 3. Use Gary Robinson's score calculation With or without artificially clamping spamprobs into [0.01, 0.99] first (as Graham does)? > 4. Use token count instead of msg count in computing probability. We haven't tried that.
> Counting msgs instead of tokens in computing probability is > a fairly subtle bias (noted by Graham in "A Plan for Spam") > and is still included in Spambayes. Not really. We currently depart from Graham too in counting multiple occurrences of a word only once in both training and scoring. Our hamcounts and spamcounts are counts of the # of messages a word appears in now, not counts of the total number of times the word appears in msgs (as they were under Graham). > If I count msgs instead of tokens I can get about the same results > and the mean and std dev are unaffected, but the tails of the > distributions for ham/spam scores move closer together (no large > dead band as above). Here's why (sort of): > > The probability calculation is: > > (s is spam count for a token, h is ham count, H/S are either > the number of msgs seen or number of tokens seen) I'm not sure what "spam count for a token" means. For Graham, it means the total number of times a token appears in spam, regardless of msg boundaries. For us today, it means the number of spams in which the token appears (and "Nigeria" appearing 100 times in a single spam adds only 1 to Nigeria's spam count for us; it adds 100 to Graham's Nigeria spam count). Our error rates got lower when we made training symmetric with scoring in this respect, although that wasn't true before we purged *all* of the deliberate biases in Paul's scheme. > prob = (s/S)/((h/H) + (s/S)) > > which can be refactored to: > > prob = 1/(1 + (S/H)*(h/s)) > > or with Graham's bias: > > prob = 1/(1 + (S/H)*(2*h/s)) Did you keep Graham's ham bias? We have not. > For my mail/testing, > > msgs -- S/H = 1 > tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time) > > so (for my unusual data anyway) counting msgs doubles the bias on > the ham probability, but surprisingly affects the shape of my > score distributions adversely. 
If I count msgs and remove Graham's > 2.0 bias I get only slightly worse results than if I count tokens > and include Graham's bias, since they're almost the same calculation > and the sensitivity to S/H is small but noticeable. Playing around > with Spambayes, I get slightly better results if I a) count tokens; > b) count every token; c) drop robinson_probability_s to .05, but I > still have overlap on the score distribution tails. (Adding Graham's > bias back in helps too). Note that overlapping tails aren't something our default scheme tries to eliminate. It's considered "a feature" here that Gary's scheme has a middle ground where mistakes are very likely to live. This is something you learn to love after realizing that mistakes cannot be stopped. For example, under Graham's scheme, you're *eventually* going to find ham that scores 1.0 (and spam that scores 0.0). For example, with 15 discriminators, sooner or later you're going to find a ham that just happens to have 8 .99 clues and 7 .01 clues, and then Graham is certain it's spam. There's no cutoff value that can save you from this kind of false positive, short of never calling anything spam. When Gary's scheme makes a mistake, it's almost always within a short distance of the data's best spam_cutoff value. In a system with manual human review, this is very exploitable; in a system without manual review, I suppose you just pass such msgs on, but still have the *possibility* to say clearly that the system is known to make mistakes in this range. > Nothing I did to Spambayes had much effect on mean/std dev, but did > reshape the distribution curves. I get a lot more tokens than > Spambayes, ? What does that mean? If you're using spambayes, it's generating tokens, so it seems hard to get a lot more than that . > but the ratios are close. Process sizes are comparable (about 100MB > peak for the tests above). Spambayes is about 2X faster. > > For my data (which, again, is unusual) I'd conclude: > > 1. 
Counting msgs and counting tokens once per msg seems wrong to > me. It seems to me to be enumerating containers rather than > enumerating contents, or at least mixing the two. Graham also *scores* tokens (at most) once per message. The training we do matches the way our scoring uses the information produced by training. We've seen reason to believe that the density of a word in a msg does contain exploitable information, but don't have a way to exploit it; experiment showed that using this info to distort spamprobs was not a useful way to exploit it; for now, we just ignore it. > 2. Sequential testing/training is important to look at (there > may be time related effects - certainly S/H (counting tokens) > varies over time). No argument, and we really need to test that too. > These are better than any other test results I've had for either > method. > > 3. I'd concentrate on shaping the tails of the distribution > rather than worrying about mean and std dev. The so-called central-limit schemes we're investigating now are almost entirely about separating the tails, and *knowing* when we can't, so that should give you cause for hope. > Some adjustments will degrade the mean/std dev but improve the > shape of the distribution/sharpness of discrimination. If you > look at the 3 sigma limits on either method (covers about 99.7% > of the distribution in theory), I've got no reason to assume the distributions here are normal, and short of that you're reduced to Chebyshev's inequality (that at least 8/9ths of the data lives inside 3 sigmas, regardless of distribution). That said, they "look pretty normal" , apart indeed from the long dribbly tails. OTOH, some ham and some spam simply aren't clearcut, even for human judgment, so I see no hope that this can be wholly eliminated "even in theory". > the fns and fps are out past 3 sigma. In EE terms, you want sharper > rolloff, not necessarily higher Q or a change in center frequency. 
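[Editor's note: Tim's 3-sigma comparison above can be checked numerically; a small sketch of the two coverage figures, standard results and nothing spambayes-specific.]

```python
import math

def normal_within(k):
    # Fraction of a normal distribution lying within k standard
    # deviations of the mean: erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

def chebyshev_within(k):
    # Chebyshev's distribution-free lower bound: at least 1 - 1/k**2
    # of *any* distribution lies within k standard deviations.
    return 1.0 - 1.0 / k**2
```

At k = 3 the normal figure is about 99.73%, while the distribution-free guarantee is only 8/9ths (about 88.9%), which is the gap Tim is pointing at.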
> Graham appears to be less sensitive to choice of cutoff than > Spambayes for my dataset. This was universally observed: the Graham score histograms approximated two solid bars, one at 0.0, the other at 1.0, the more data it was trained on. Unfortunately, its *mistakes* also lived on these bars. > As far as training sample size, the results above were based on > an initial training sample of 8000 msgs. If I start with 200 > msgs, I get an additional 15 fns in the first 200 tested, and > one additional fn through the rest of the first 8000 msgs, > at which time I'm back to the test/results above (choosing > a fairly optimum cutoff value). If I start with 1 ham and > 1 spam, I get 89 fps on the first 200 and no more failures > after that. It seems to converge. Not very sensitive to > initial conditions. That's been my experience too, and so has finding that it takes a very large increase in training data to cut an error rate in half. My corpus now is at the point where it can't really improve, due to the "some ham and spam simply aren't clearcut" reason above. It would take telepathy, and even people on this list argue about whether specific msgs are ham or spam. > All of this might only work for my email. YMMV. For either > method the results are fantastic. I'd be happy with a 90% > spam reduction and no fps. You can get a 90% spam reduction, but you won't be as happy with that when the amount of spam you get increases by a factor of 10 again. But there's no way you can get no fps, short of calling nothing spam. I've noted before that the chance my classifier would produce an FP over the next year is smaller than the chance I'll die in that time, and I personally don't fear a false positive more than death . From tim.one@comcast.net Sun Oct 6 06:00:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 01:00:56 -0400 Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: [Jim Bublitz] > ...
> Playing around with Spambayes, I get slightly better results if I > ... > c) drop robinson_probability_s to .05, That's a very low value. I find this way of rewriting Gary's adjustment easier to reason about:

    s*x + n*p         x - p
    --------- = p + -------
      s + n         1 + n/s

This makes it clear that it moves p in the direction of x, but less so the larger n is, or the smaller s is. For you, s=.05, and then that's

         x - p
    p + --------
        1 + 20*n

At n=1, that's p + (x-p)/21. The *interesting* thing there is that, since you said you effectively removed Graham's mincount gimmick, under pure Graham you *were* getting extreme spamprobs of 0.01 and 0.99 for words that had been seen only once in the training data. Setting s to 0.05 gives a very similar effect under Gary's adjustment. If x is 0.5, 0 + .5/21 ~= 0.024 and 1 + -.5/21 ~= 0.976 Those are really extreme probability estimates based on 1 measly occurrence in training data, but perhaps this ties in to the unusual nature of your data. For example, I've seen that low s helps ham message threads when a typo or unusual word gets repeated in replies. From rob@hooft.net Sun Oct 6 06:49:06 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 07:49:06 +0200 Subject: [Spambayes] RE: For the bold References: Message-ID: <3D9FCED2.4050802@hooft.net> Tim Peters wrote: > I believe "the bug" is in rmspik.chance(), which appears to assume that a > zscore in the positive direction is an indicator of certainty. That seems > to be true in the logarithmic central-limit schemes, but isn't true in the > original central-limit scheme. Changing the first three lines like so:
>
>     # if x >= 0:
>     #     return 1.0
>     # x = -x/math.sqrt(2)
>     x = abs(x)/math.sqrt(2)

Indeed, the chance function as I wrote it uses the information I had, which was only based on my clt2 experience, where positive Z-scores mean "absolute certainty", and negative Z-scores are increasingly uncertain.
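[Editor's note: Tim's rewritten form of Gary's s adjustment, earlier in this thread, reduces to a one-liner; a minimal sketch with his example values as defaults.]

```python
def adjusted_prob(p, x=0.5, s=0.05, n=1):
    # Gary Robinson's adjustment in the rewritten form
    # p + (x - p) / (1 + n/s): move the raw word probability p toward
    # the prior x, less so the larger n is or the smaller s is.
    # x=0.5, s=0.05, n=1 reproduce Tim's worked example.
    return p + (x - p) / (1.0 + n / s)
```

With these defaults, p = 0 maps to about 0.024 and p = 1 to about 0.976, matching the numbers in Tim's message.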
But: in practice, even for clt2, positive Z-scores above 2.0 do not appear very frequently if at all, and if/when that happens, the chance that the message belongs to the "other" group is extremely small. I just tried it for my clt2 data: your fix doesn't change anything there. In case you're wondering what chance(x) is using under these if statements:

    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post

This is an approximation of the integral under the tail of the unit normal Gaussian, but the approximation is only valid for x >> 1, so for the "mass" of the curve, we just return 1. Tim: It does look like your messages are a bit easier to classify than mine.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 06:54:05 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 07:54:05 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9FCFFD.3060609@hooft.net> Tim Peters wrote: [clt2] > Nham= 7500 > RmsZham= 2.27249107964 > Nspam= 7500 > RmsZspam= 2.354280998 [clt3] > Nham= 7500 > RmsZham= 9.77605846416 > Nspam= 7500 > RmsZspam= 10.1887670936 OOF! Under clt3 your rms values are 4x bigger! I have to look at the details of that: the assumption under which the rmspik.py code works is that the distributions of zham and zspam values are normally distributed if all values are "mirrored" around 0. I'll have to test that assumption for clt1 and clt3! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 07:10:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 08:10:12 +0200 Subject: [Spambayes] tokenizing identical words Message-ID: <3D9FD3C4.9060902@hooft.net> I have only been following the tokenizer from a distance, but has it been tried yet to use logarithm tokens for multiple occurrences of a word?
So, a spam mentioning Nigeria a couple of times could result in "nigeria nigeria:2 nigeria:4 nigeria:8" tokens. I can imagine that the:16 is not going to mean a lot, but nigeria:4, as in this very message, may quickly result in a spam score... So: if you want to be removed, take your credit card, get rich quick, pay $100000 and click here: http://123456789/ :-) Rob PS: In my ham corpus there is a message from someone sending a list of all ISO country codes. In my spam corpus there is a spam that lists a lot of countries where this company is selling stuff.... -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Sun Oct 6 07:14:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 02:14:15 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9F04AA.8050706@hooft.net> Message-ID: [Rob Hooft] >... > Appended is a pdf containing six histograms made using > max_discriminators=55 > > The first one is zham for all ham messages. As you can see, the > distribution is asymmetric. Furthermore, a simple average and standard > deviation calculation results in a bell curve that does not follow the > important tail of the histogram: the chances will be severely > underestimated by these parameters. Two things. First, the raw spam score (smean) of a msg is the natural log of the geometric mean of the extreme-word spamprobs. This statistic can never be positive, has no theoretical bound on how low it can go, and is typically a small negative number, around -0.12. It's simply impossible to get a raw score "much larger" (much more positive) than that, but easy to get one much smaller (much more negative), so I think the asymmetry is inevitable. The raw ham score (hmean) is similar, but uses the log of the geometric mean of 1-prob, and is typically farther away from 0.0, nearer -0.33.
That gives more room for larger scores to exist (remember that it can never be positive!), and I expect that's why the first stab at fitting a bell curve to the ham worked better than for the spam, even though both were poor fits. All this may well be why the original use_central_limit scheme (which uses the straight mean of the word spamprobs -- no logs, no geometric means, no two-way prob vs 1-prob scoring) worked better for me under your scheme in my tests: that's got no fundamental reason (as far as I can see) to be *so* lopsided; indeed, the mean and median of hmean are very close under use_central_limit, and likewise for smean. This isn't true under the other central limit schemes. They're still lopsided, though; here from an original central limit run:

    ham ham mean: 6000 items; mean 0.18; sdev 0.09 -> min 0.00620435; median 0.183251; max 0.840666
    spam spam mean: 6000 items; mean 0.93; sdev 0.07 -> min 0.486362; median 0.950825; max 0.996632

The ham mean can't get below 0 under that scheme, and 0 is just two sdevs away from the ham-mean mean ~= the ham-mean median. The spam mean can't get above 1.0 under that scheme, and 1.0 is just one sdev removed from the spam-mean mean ~= (but less so) the spam-mean median. So here again, fitting the ham in a bell curve is easier than fitting the spam. Second, there's no real justification for the way zscores are computed in the classifier code now. You may get better results if you ignore the zscores in the pickle, and work directly with the raw hmean and smean scores instead (which are also in the binary pickle saved by clgen). They're the actual data here, and the zscores are a distorted version that factor in n (the number of extreme words) in a way that doesn't make real sense. Note that n is also in the clgen pickle tuples: all the relevant info is there, except for the individual word probabilities used. > The second one is abs(zham) for all ham messages. The bell curve fits > this histogram much better!
Since use_central_limit2 and use_central_limit3 produce inherently and highly lopsided distributions, I think that makes good sense. From tim.one@comcast.net Sun Oct 6 07:43:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 02:43:30 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9FCFFD.3060609@hooft.net> Message-ID: [Tim] > [clt2] > Nham= 7500 > RmsZham= 2.27249107964 > Nspam= 7500 > RmsZspam= 2.354280998 > > [clt3] > Nham= 7500 > RmsZham= 9.77605846416 > Nspam= 7500 > RmsZspam= 10.1887670936 [Rob Hooft] > OOF! Under clt3 your rms values are 4x bigger! I have to look at the > details of that: clt1 and clt2 build ham and spam populations out of individual word probabilities. If the central limit theorem actually applied (which it does not), the way zscores are computed would make sense (at least when n > 30). clt3 builds ham and spam populations out of whole-msg scores. The way zscores are computed there is the same as under clt2, but it makes no sense whatsoever under clt3. I didn't care, because the results were at least as good regardless; "zscores" in the hundreds are pretty common under clt3. I think you should ignore the classifier's zscores, Rob: *none* of them make good sense, and under clt3 they make no sense. The only virtue they have is that tests say they work really well . > the assumption under which the rmspik.py code works is that the > distributions of zham and zspam values are normally distributed > if all values are "mirrored" around 0. I'll have to test that > assumption for clt1 and clt3! I didn't catch the meaning there, but expect any assumption you would like to make is most likely to be true under clt1 (which is the least extreme of these gimmicks). 
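[Editor's note: the raw smean/hmean scores discussed in this thread -- the natural log of the geometric mean of the extreme-word probabilities, or of 1-prob for ham -- reduce to an average of logs. A minimal sketch with made-up probability lists, not the actual classifier code.]

```python
import math

def smean(spamprobs):
    # ln(geometric mean of spamprobs) == arithmetic mean of ln(p).
    # Since every p <= 1, this can never be positive.
    return sum(math.log(p) for p in spamprobs) / len(spamprobs)

def hmean(spamprobs):
    # The raw ham score uses 1 - p instead.
    return sum(math.log(1.0 - p) for p in spamprobs) / len(spamprobs)
```

A batch of 0.9 spamprobs gives smean about -0.105, in the neighborhood of the typical -0.12 Tim cites; pushing the probabilities toward 1.0 moves the score toward 0 but never past it.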
From noreply@sourceforge.net Sun Oct 6 07:41:02 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 23:41:02 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 06:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. Hooft (hooft) >Assigned to: Neale Pickett (npickett) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 06:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From noreply@sourceforge.net Sun Oct 6 07:48:07 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 23:48:07 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 06:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None >Status: Closed >Resolution: Accepted Priority: 5 Submitted By: Rob W.W. Hooft (hooft) Assigned to: Neale Pickett (npickett) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- >Comment By: Neale Pickett (npickett) Date: 2002-10-05 23:48 Message: Logged In: YES user_id=619391 Looks good, thanks for the patch! 
---------------------------------------------------------------------- Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 06:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From neale@woozle.org Sun Oct 6 07:53:00 2002 From: neale@woozle.org (Neale Pickett) Date: 05 Oct 2002 23:53:00 -0700 Subject: [Spambayes] ["Neale Pickett" ] [Spambayes-checkins] spambayes runtest.sh,1.6,1.7 Message-ID: ---------------------- multipart/mixed attachment Take heed: the runtest.sh I just checked in uses "cv1.txt" and "cv2.txt" instead of "run1.txt" and "run2.txt", as there can now be different types of runs with different output. If you want to keep your old run data, rename your "run1.txt" and "run2.txt" files. Neale ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: "Neale Pickett" Subject: [Spambayes-checkins] spambayes runtest.sh,1.6,1.7 Date: Sat, 05 Oct 2002 23:47:38 -0700 Size: 6169 Url: http://mail.python.org/pipermail-21/spambayes/attachments/20021005/77169193/attachment.txt ---------------------- multipart/mixed attachment-- From jbublitz@nwinternet.com Sun Oct 6 10:42:23 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sun, 06 Oct 2002 02:42:23 -0700 (PDT) Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: On 06-Oct-02 Tim Peters wrote: > Thanks for sharing this! It's an excellent report. Thanks for taking time to reply - I realize I'm somewhat off-topic here. >> I have a very unusual corpus of ham and spam compared >> to "normal", so these results may not be widely >> applicable. > Could you say something about *what* makes you abnormal ? You mean besides coding Python? 
Spam: over 50% is Asian language; includes virus msgs (raw or ISP scrubbed/tagged); business-related or industry-specific spam; plus all of the other usual kinds of spam. A lot of virus msgs (raw or scrubbed/tagged by an ISP or firewall filtering) - my favorite was "I insult your mother". Ham: Over 3/4 is lists (some long) of part numbers/quantities/some other info (e.g. TMS320P14FNL 1000 99DC $10.00) or related correspondence (quotes, RFQs, inquiries, etc), some with a small amount of Asian text mixed in. Otherwise newsletters, mailing lists, personal stuff (small amount). > Excellent -- nobody here has done that yet (that I know of), and > I've worried out loud that randomization allows msgs to get > benefit from training msgs that appeared *after* them in time; > e.g., a ham msg can be helped by the fact that a reply to it appeared in > the training ham, but that can never happen in real life. It seems the opposite is true - my results were worse before (0.5% to 1.0% failures or worse). You might have already achieved perfection and not know it due to randomization :) It appears that the systems both learn gradually. For example, one of my ISPs started virus filtering at a point after the initial training data, and that produced problems in the past when training on N msgs then testing the next M without any retraining. That didn't occur here. Some other hard-to-filter msgs (again, for both methods) also didn't fail. > I'm not sure what these are results *of* -- like, the last time > you ran step #2? An average over all times you ran step #2? Total results for testing 14400 msgs in batches of 200 (and training after each 200) - failures against (virtual) cutoff setting. > Graham > Spam Ham > Mean 0.98 0.01 > And these are the means of what? For example, there's no > false-negative rate as large as 0.98 in the table above, so 0.98 > certainly isn't the mean of the table entries. Mean/std deviation for scores of all msgs tested.
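Jim's sequential protocol (score each batch of 200 against the model trained so far, then train on that batch before moving on) can be sketched as below. This is a hypothetical illustration, not the Spambayes test driver; the classifier object with `train`/`predict` methods is an assumption.

```python
def sequential_test(messages, labels, classifier, batch=200):
    # Train-as-you-go: score each batch with the model trained on all
    # earlier batches, then train on it -- so no message is ever scored
    # by a model that has seen "future" mail.
    errors = 0
    for i in range(0, len(messages), batch):
        chunk = list(zip(messages[i:i + batch], labels[i:i + batch]))
        if i:  # nothing has been trained before the first batch
            errors += sum(classifier.predict(m) != y for m, y in chunk)
        for m, y in chunk:
            classifier.train(m, y)
    return errors
```

Unlike randomized cross-validation, this never lets a reply that arrived later "help" an earlier message.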
>> Std Dev 0.04 0.02 >> 3 sigma 0.86 0.07 >> 1. Word freq threshold = 1 instead of 5 > That helped us a lot when we were using Graham. >> 2. Case sensitive tokenizing > That did not (made no overall difference in error rates; it > systematically called conference announcements spam, but was > better at distinguishing spam screaming about MONEY from casual > mentions of money in ham). Everything made a *small* difference - I'm really quite surprised everything lined up in the same direction for once. I went through most of the tweaks from scratch one at a time (including some of my own that I thought were really cool but ultimately didn't work very well) and what's left is what worked the best. Finally having clean samples really helped too. >> 3. Use Gary Robinson's score calculation > With or without artificially clamping spamprobs into [0.01, 0.99] > first (as Graham does)? Same as Graham. I went back and tried Graham's scoring again too, and it's only marginally worse than Robinson's (but has the problem of extreme values of fp & fn). My "Robinson scoring" is just the S = (P - Q)/(P + Q) kind. >> 4. Use token count instead of msg count in computing >> probability. > We haven't tried that. It's a programming error (wrong indentation in token processing loop) that led to better results. Wish I could say I thought of it, but it makes more sense to me now. Again, it makes a small difference overall, but has a bigger effect on the shape of the score distribution in my tests. >> Counting msgs instead of tokens in computing probability is >> a fairly subtle bias (noted by Graham in "A Plan for Spam") >> and is still included in Spambayes. > Not really. We currently depart from Graham too in counting > multiple occurrences of a word only once in both training and > scoring. Our hamcounts and spamcounts are counts of the # of > messages a word appears in now, not counts of the total number > of times the word appears in msgs (as they were > under Graham).
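For reference, the "S = (P - Q)/(P + Q) kind" of Robinson scoring mentioned above can be sketched as follows. This uses Robinson's geometric-mean aggregates for P and Q; whether Jim's version does the same is an assumption.

```python
from math import prod  # Python 3.8+

def robinson_score(probs):
    # Gary Robinson's combining scheme: P and Q are geometric-mean
    # aggregates of the per-token spam probabilities, and
    # S = (P - Q)/(P + Q) lands in [-1, 1]; rescale to [0, 1] so that
    # 0.5 means "no evidence either way".
    n = len(probs)
    P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0
```

Because S degrades gracefully toward 0.5 instead of saturating at 0 or 1, it avoids Graham's problem of mistakes scoring at the extremes.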
Yes - that's what bothers me. >> If I count msgs instead of tokens I can get about the same >> results >> and the mean and std dev are unaffected, but the tails of the >> distributions for ham/spam scores move closer together (no large >> dead band as above). Here's why (sort of): >> >> The probability calculation is: >> >> (s is spam count for a token, h is ham count, H/S are either >> the number of msgs seen or number of tokens seen) > I'm not sure what "spam count for a token" means. For Graham, it > means the total number of times a token appears in spam, > regardless of msg boundaries. For us today, it means the number > of spams in which the token appears (and "Nigeria" appearing 100 > times in a single spam adds only 1 to Nigeria's spam count for > us; it adds 100 to Graham's Nigeria spam count). Our error rates > got lower when we made training symmetric with scoring in this > respect, although that wasn't true before we purged *all* of the > deliberate biases in Paul's scheme. "spam count" means the same as your "spamcount" variable in update_probabilities - you count once per msg, I count every occurrence in a msg. Making "training symmetric with scoring" is what seems intuitively incorrect to me, along with nham/nspam being msg counts instead of token counts. If you arbitrarily see the word "fussball" in a wordstream, is the wordstream German or English ("football" in German, a table game found in bars in US English)? I'd guess German because I'm also guessing the word occurs with greater frequency in German wordstreams than in English wordstreams (absent context) - not because I think more German books contain at least one occurrence of the word compared to English books. On the testing side, if the test wordstream contained "fussball!fussball!fussball!", would you change your guess? I'd suggest your guess would still be based on a single occurrence - the repetition doesn't change the probability of which set the wordstream belongs to.
I can't see that it would be 3X more likely one way or the other - what else could you conclude then but that "fussball" and "fussball!fussball!fussball!" have identical probabilities of being elements of a German wordstream without some other kind of data? >> prob = 1/(1 + (S/H)*(2*h/s)) > Did you keep Graham's ham bias? We have not. Yes - again, a small (positive) difference. > Note that overlapping tails aren't something our default scheme > tries to eliminate. It's considered "a feature" here that > Gary's scheme has a middle ground where mistakes are very likely > to live. This is something you learn to love after > realizing that mistakes cannot be stopped. Yes - and if your scores really indicate the actual probability of spamminess, you can use that info to sort the msgs for manual review. Given the volume of spam, fatigue is a real problem in manual review - I wouldn't risk the possibility of fps except that they're more likely with a manual system (as I found out in sorting 25K msgs semi-manually). I'm actually concerned that if the fp rate is too low, there won't be enough reward in reviewing the results manually - my fps could be very expensive. It appears to me that perfect results are not obtainable because everyone probably has msgs that they can't reliably bucket as spam or ham. > For example, under Graham's scheme, you're *eventually* going to > find ham that scores 1.0 (and spam that scores 0.0). For > example, with 15 discriminators, sooner or later you're going to > find a ham that just happens to have 8 .99 clues and 7 .01 > clues, and then Graham is certain it's spam. Happened a lot in other kinds of testing, but not much when testing sequentially as described above - I have no idea why. > There's no cutoff value that can save you from this kind of false > positive, short of never calling anything spam. When Gary's > scheme makes a mistake, it's almost always within a short > distance of the data's best spam_cutoff value.
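The probability formula quoted above, prob = 1/(1 + (S/H)*(2*h/s)), is algebraically the same as Graham's per-word probability (s/S) / (s/S + 2*h/H) with a ham bias of 2. A hedged sketch; whether S and H count messages or tokens is exactly the point under debate in this thread:

```python
def word_prob(s, h, nspam, nham, ham_bias=2.0):
    # Graham-style spam probability for one token.  s, h: the token's
    # spam/ham counts; nspam, nham: corpus totals (messages or tokens,
    # depending on which side of the thread you are on).  For s > 0 this
    # equals 1 / (1 + (nspam/nham) * (ham_bias*h/s)).
    # Tokens unseen in both corpora (s == h == 0) are skipped in
    # practice; this sketch does not guard against that.
    p = (s / nspam) / (s / nspam + ham_bias * h / nham)
    return max(0.01, min(0.99, p))  # Graham's clamp into [0.01, 0.99]
```

The clamp is the "artificial clamping" Tim asks about; dropping it lets rare tokens swing probabilities to exactly 0.0 or 1.0.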
> In a system with manual human review, this is very exploitable; Agree - I should have read ahead before the response above. > in a system without manual review, I suppose you just pass such > msgs on, but still have the *possibility* to say clearly that > the system is known to make mistakes in this range. >> Nothing I did to Spambayes had much effect on mean/std dev, but >> did reshape the distribution curves. I get a lot more tokens >> than Spambayes, > ? What does that mean? If you're using spambayes, it's > generating tokens, so it seems hard to get a lot more than that > . My tokenizer consists of a findall on re.compile(r"[\w'$_-]+", re.U). I get a lot more tokens than the spambayes tokenizer produces. "I get" meant my Graham version vs. spambayes. >> 3. I'd concentrate on shaping the tails of the distribution >> rather than worrying about mean and std dev. > The so-called central-limit schemes we're investigating now are > almost entirely about separating the tails, and *knowing* when > we can't, so that should give you cause for hope. I gathered that from today's list msgs - didn't notice it before. > OTOH, some ham and some spam simply aren't clearcut, even for > human judgment, so I see no hope that this can be wholly > eliminated "even in theory". Either method does better than I do (or thinks it does at any rate) >> the fns and fps are out past 3 sigma. In EE terms, you want >> sharper rolloff, not necessarily higher Q or a change in center >> frequency. Graham appears to be less sensitive to choice of >> cutoff than Spambayes for my dataset. > This was universally observed: the Graham score histograms > approximated two solid bars, one at 0.0, the other at 1.0, the > more data it was trained on. Unfortunately, its *mistakes* also > lived on these bars. Yes, but the (P - Q)/(P + Q) scoring fixes that nicely for my data. > It would take telepathy, and even people on this list argue about > whether specific msgs are ham or spam.
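Jim's one-line tokenizer quoted above, spelled out for comparison with the Spambayes one. It is just a Unicode-aware findall, so runs like "$$$" and hyphenated words survive as tokens:

```python
import re

# Jim's tokenizer: anything matching this character class is a token,
# so currency runs like "$$$" come through intact, and the re.U flag
# makes \w match non-ASCII word characters (relevant for Asian text).
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

Because it never splits on apostrophes, dollar signs, or hyphens, this yields a larger and differently shaped vocabulary than the Spambayes tokenizer, which is presumably why the token counts differ.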
The computer is always right. > I've noted before that the chance my classifier > would produce an FP over the next year is smaller than the > chance I'll die in that time, and I personally don't fear a > false positive more than death . You haven't met my wife - one persistent fp was from a woman who is both my wife's best friend and was (and may be again) our best customer. I suppose that's what whitelists are for. Jim From nas@python.ca Sun Oct 6 17:03:38 2002 From: nas@python.ca (Neil Schemenauer) Date: Sun, 6 Oct 2002 09:03:38 -0700 Subject: [Spambayes] CL2 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <20021006160338.GA9127@glacier.arctrix.com> Tim Peters wrote: > Greg Ward explained how python.org checks for viruses here: > > http://mail.python.org/pipermail-21/spambayes/2002-September/000327.html Here's what I'm using with qmail: http://arctrix.com/~nas/misc/qmail-filter-exe.py Messages with viruses are usually pretty big, so keeping them out of the corpus saves a lot of space. Neil From tim.one@comcast.net Sun Oct 6 17:20:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 12:20:48 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9FCED2.4050802@hooft.net> Message-ID: [Rob Hooft] > ... > Tim: It does look like your messages are a bit easier to classify than > mine.... I don't know. The results I reported were: > Here's a use_central_limit2 run with max_discriminators=50, trained > on 5000 ham and 5000 spam, then predicting against 7500 of each and all runs were on the same set of msgs. The last time you mentioned how "big" your tests are was: > I focussed for our night on optimizing the max_discriminators for > clt2 using 10x(200+200) messages out of my corpses, I'm not sure exactly what 10x(200+200) means, but at the plausible extremes it means your classifiers were trained on 200 on each, or on 1800 of each.
Error rates certainly improve with more training data, albeit slowly. OTOH, later you showed output saying > Reading climbig12.pk ... > Nham= 12800 > RmsZham= 2.76178782393 > Nspam= 5600 so at *some* point you stopped predicting against equal amounts of ham and spam, but there's no way to guess how much was trained on for that result. Interpreting results here gets very difficult because it's often not clear what a tester is reporting on (how much training data, how much prediction data, which test driver produced the results, what the relevant options were). That said, I expect my ham is easier than most, because newsgroup traffic almost never contains personal msgs -- no screaming red HTML birthday wishes from 9-year-old nieces, no confirmations of payment received, no opt-in marketing newsletters, no chain letters forwarded from naive brothers, etc. From bkc@murkworks.com Sun Oct 6 17:34:07 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 12:34:07 -0400 Subject: [Spambayes] CL2 test part II Message-ID: <3DA02D8B.11446.2C4AD3E5@localhost> In my earlier CL2 and CL3 tests, I trained on the 2nd half of my corpus, and tested the first half. Now, I'm training on the first half and testing the 2nd half. First run of CL2 uncovered more misclassifications (which probably affected the training of my first test). I'm temporarily "borrowing" a client's dual Xeon machine, still only using one processor of course, but it seems a lot faster than my PIII-933 In any case, here's CL2 results training first, testing second half. 
-> Ham scores for all runs: 6500 items; mean 0.94; sdev 7.21
-> min 0; median 0; max 100
* = 105 items
0 6384 *************************************************************
25 87 *
50 21 *
75 8 *
-> Spam scores for all runs: 6500 items; mean 99.32; sdev 5.94
-> min 0; median 100; max 100
* = 106 items
0 3 *
25 15 *
50 68 *
75 6414 *************************************************************
-> best cutoff for all runs: 0.5
-> with weighted total 1*29 fp + 18 fn = 47
-> fp rate 0.446% fn rate 0.277%

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit2 = True
use_central_limit3 = False
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
show_false_negatives: True
nbuckets: 4
show_spam_lo: 0.0
show_spam_hi: 0.45
save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 17:48:44 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 12:48:44 -0400 Subject: [Spambayes] CL3 test part II Message-ID: <3DA030F7.11395.2C5834AA@localhost> As in the CL2 part II test, here are the results of training on the first half, and testing on the 2nd half of my data set:

-> Ham scores for all runs: 6500 items; mean 0.64; sdev 5.98
-> min 0; median 0; max 100
* = 106 items
0 6422 *************************************************************
25 59 *
50 13 *
75 6 *
-> Spam scores for all runs: 6500 items; mean 98.85; sdev 7.84
-> min 0; median 100; max 100
* = 105 items
0 8 *
25 22 *
50 113 **
75 6357 *************************************************************
-> best cutoff for all runs: 0.5
-> with weighted total 1*19 fp + 30 fn = 49
-> fp rate 0.292% fn rate 0.462%

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 18:01:03 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:01:03 -0400 Subject: [Spambayes]
rmspik results on CL2 and CL3 Message-ID: <3DA033DA.25462.2C637B3C@localhost> on CL2 test II. Reading results/cl2-b/clim.pik ... Nham= 6500 RmsZham= 4.48175325539 Nspam= 6500 RmsZspam= 3.7809204202 ====================================================================== HAM: FALSE POSITIVE: zham=-5.79 zspam=-2.33 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.27 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-3.97 zspam=-1.72 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-6.17 zspam=-2.35 Data/Ham/Set9/6415 SURE! Sure/ok 6297 Unsure/ok 181 Unsure/not ok 18 Sure/not ok 4 Unsure rate = 3.06% Sure fp rate = 0.06%; Unsure fp rate = 9.05% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.86 zspam=-3.48 Data/Spam/Set7/6718 SURE! FALSE NEGATIVE: zham=-1.48 zspam=-6.37 Data/Spam/Set10/10979 SURE! Sure/ok 6240 Unsure/ok 232 Unsure/not ok 26 Sure/not ok 2 Unsure rate = 3.97% Sure fn rate = 0.03%; Unsure fn rate = 10.08% All the hams really are hams. Network Computing renewal, etc.. The Set10/10979 spam .. really was a ham (oops) The Set7/6718 is .. I think a spam, you decide >From ???@??? 
Sat Sep 21 16:59:31 2002 Received: from SpoolDir by GIMPELSTIMER (Mercury 1.44); 12 Sep 02 11:10:59 -0400 Received: from anvil.murkworks.com (128.153.43.1) by coal.murkworks.com (Mercury 1.44) with ESMTP; 12 Sep 02 11:10:49 -0400 Received: from hotmail.com (oe19.pav1.hotmail.com [64.4.30.123]) by anvil.murkworks.com (8.9.1/8.9.1) with ESMTP id LAA14649 for ; Thu, 12 Sep 2002 11:00:52 -0400 (EDT) Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Thu, 12 Sep 2002 08:11:15 -0700 X-Originating-IP: [202.88.161.198] From: "preeti" To: Cc: , Date: Mon, 19 Aug 2002 22:01:06 +0530 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_0005_01C247CB.EB1BDC40" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Message-ID: X-OriginalArrivalTime: 12 Sep 2002 15:11:15.0656 (UTC) FILETIME=[A38C0480:01C25A6E] This is a multi-part message in MIME format. ------=_NextPart_000_0005_01C247CB.EB1BDC40 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable I want information about novell netware. =20 I think it's spam because the return path is maxi_sexy@hotmail.com Though, we do have software for NetWare that we market, but I'd expect web inquiries to go to info@, not support@ I suppose technically this is not spam, since a human individually typed this in, I'm sure. CL3 test II rmspik results Reading results/cl3-b/clim.pik ... Nham= 6500 RmsZham= 12.6590343376 Nspam= 6500 RmsZspam= 14.8475623174 ====================================================================== HAM: FALSE POSITIVE: zham=-6.65 zspam=-2.35 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-6.19 zspam=-2.29 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-4.60 zspam=-1.73 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-7.11 zspam=-2.37 Data/Ham/Set9/6415 SURE! 
Sure/ok 6294 Unsure/ok 182 Unsure/not ok 20 Sure/not ok 4 Unsure rate = 3.11% Sure fp rate = 0.06%; Unsure fp rate = 9.90% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.37 zspam=-6.36 Data/Spam/Set10/10979 SURE! Sure/ok 6271 Unsure/ok 207 Unsure/not ok 21 Sure/not ok 1 Unsure rate = 3.51% Sure fn rate = 0.02%; Unsure fn rate = 9.21% All hams really are hams, The ONE spam .. isn't a spam. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 18:34:37 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:34:37 -0400 Subject: [Spambayes] CL3 part I (reprise) Message-ID: <3DA03BB8.17777.2C82369A@localhost> I've re-run CL3 test from yesterday after further cleaning up my corpus (training on second half, testing on first half) -> Ham scores for all runs: 6500 items; mean 0.98; sdev 7.64 -> min 0; median 0; max 100 * = 105 items 0 6386 ************************************************************* 25 74 * 50 26 * 75 14 * -> Spam scores for all runs: 6500 items; mean 98.90; sdev 7.55 -> min 0; median 100; max 100 * = 105 items 0 5 * 25 33 * 50 101 * 75 6361 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*40 fp + 38 fn = 78 -> fp rate 0.615% fn rate 0.585% Reading results/cl3-a/clim.pik ... Nham= 6500 RmsZham= 14.4660600316 Nspam= 6500 RmsZspam= 15.176558614 ====================================================================== HAM: FALSE POSITIVE: zham=-11.40 zspam=-2.41 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=-6.90 zspam=-2.54 Data/Ham/Set1/10852 SURE! FALSE POSITIVE: zham=-4.81 zspam=-2.54 Data/Ham/Set3/5943 SURE! FALSE POSITIVE: zham=-6.97 zspam=-2.22 Data/Ham/Set3/6480 SURE! FALSE POSITIVE: zham=-4.69 zspam=-1.39 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=-4.88 zspam=-2.31 Data/Ham/Set4/5548 SURE! 
FALSE POSITIVE: zham=-12.55 zspam=-1.49 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=-6.22 zspam=-2.06 Data/Ham/Set4/10937 SURE! FALSE POSITIVE: zham=-12.06 zspam=-0.40 Data/Ham/Set5/5105 SURE! FALSE POSITIVE: zham=-5.21 zspam=-2.42 Data/Ham/Set5/6369 SURE! Sure/ok 6272 Unsure/ok 182 Unsure/not ok 36 Sure/not ok 10 Unsure rate = 3.35% Sure fp rate = 0.16%; Unsure fp rate = 16.51% ====================================================================== SPAM: FALSE NEGATIVE: zham=-0.96 zspam=-5.47 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=-2.12 zspam=-4.62 Data/Spam/Set2/6457 SURE! FALSE NEGATIVE: zham=-1.97 zspam=-18.20 Data/Spam/Set3/3010 SURE! Sure/ok 6248 Unsure/ok 215 Unsure/not ok 34 Sure/not ok 3 Unsure rate = 3.83% Sure fn rate = 0.05%; Unsure fn rate = 13.65% I'm going to stick with the ham and spam classification .. hams are mostly network computing renewals, discover card statement, etc. spams .. stuff I don't want.. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 18:40:25 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 19:40:25 +0200 Subject: [Spambayes] RE: For the bold References: Message-ID: <3DA07589.50407@hooft.net> Tim Peters wrote: > [Rob Hooft] > >>... >>Tim: It does look like your messages are a bit easier to classify than >>mine.... > > > I don't know. The results I reported were: > > >>Here's a use_central_limit2 run with max_discriminators=50, trained >>on 5000 ham and 5000 spam, then predicting against 7500 of each > > > and all runs were on the same set of msgs. > > The last time you mentioned how "big" your tests are was: > > >>I focussed for our night on optimizing the max_discriminators for >>clt2 using 10x(200+200) messages out of my corpses, > > > I'm not sure exactly what 10x(200+200) means, but at the plausible extremes > it means your classifiers were trained on 200 on each, or on 1800 of each. 
> So at worst, my classifier was trained on 3x as much data, and at best on > 25x as much data. Error rates certainly improve with more training data, > albeit slowly. I did have 10 sets each of ham and spam, each set containing 200 messages out of a total reservoir of ~17000 ham and 7500 spam. This subset of everything was heavy enough for this optimization: it took about 24 hours of calculating to get that analysis done.... > OTOH, later you showed output saying > > >>Reading climbig12.pk ... >>Nham= 12800 >>RmsZham= 2.76178782393 >>Nspam= 5600 > > > so at *some* point you stopped predicting against equal amounts of ham and > spam, but there's no way to guess how much was trained on for that result. At that point, I had 10 sets, each ham set contained 1600 hams, and each spam set 700 spams. I was using 2 sets each to train, and 8 to analyse. Since that time I have cleaned out the spam body by looking for duplicate "Date:" headers, and removed ~1300 spams that were identical (only sent to different addresses). I think this is a useful thing to do, to prevent the same spam from ending up in both the training set and the test set. The "Message-ID" sort I did in the beginning didn't help all that much, because lots of these spams do not have their message-id added by the spammer. I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis. > That said, I expect my ham is easier than most, because newsgroup traffic > almost never contains personal msgs -- no screaming red HTML birthday wishes > from 9-year-old nieces, no confirmations of payment received, no opt-in > marketing newsletters, no chain letters forwarded from naive brothers, etc. Exactly. I find that my ham is very diverse.
Besides all the things you mentioned, I had (but removed) communications with postmasters over early-day spam that was sent using their machines. And I am using some ham from my previous job. There is not a lot of mailing list traffic, because I am no longer storing all that. Lots of customer E-mails with many different computer backgrounds. I removed all the viruses. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From bkc@murkworks.com Sun Oct 6 18:51:17 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:51:17 -0400 Subject: [Spambayes] CL2 part I (reprise) Message-ID: <3DA03FA0.19882.2C917996@localhost> FP rate is down after cleaning up training sets but CL2 seems less sure than CL3 -> Ham scores for all runs: 6500 items; mean 1.36; sdev 8.88 -> min 0; median 0; max 100 * = 104 items 0 6339 ************************************************************* 25 100 * 50 44 * 75 17 * -> Spam scores for all runs: 6500 items; mean 99.27; sdev 6.21 -> min 0; median 100; max 100 * = 106 items 0 4 * 25 14 * 50 74 * 75 6408 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*61 fp + 18 fn = 79 -> fp rate 0.938% fn rate 0.277% Reading results/cl2-a/clim.pik ... Nham= 6500 RmsZham= 4.91231538116 Nspam= 6500 RmsZspam= 3.7454876964 ====================================================================== HAM: FALSE POSITIVE: zham=-9.71 zspam=-2.38 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=-6.08 zspam=-2.50 Data/Ham/Set1/10852 SURE! FALSE POSITIVE: zham=-6.09 zspam=-2.19 Data/Ham/Set3/6480 SURE! FALSE POSITIVE: zham=-4.08 zspam=-1.36 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=-4.32 zspam=-2.28 Data/Ham/Set4/5548 SURE! FALSE POSITIVE: zham=-10.65 zspam=-1.44 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.03 Data/Ham/Set4/10937 SURE! FALSE POSITIVE: zham=-10.25 zspam=-0.34 Data/Ham/Set5/5105 SURE! 
FALSE POSITIVE: zham=-4.61 zspam=-2.40 Data/Ham/Set5/6369 SURE! Sure/ok 6280 Unsure/ok 184 Unsure/not ok 27 Sure/not ok 9 Unsure rate = 3.25% Sure fp rate = 0.14%; Unsure fp rate = 12.80% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.00 zspam=-5.48 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=-2.01 zspam=-4.62 Data/Spam/Set2/6457 SURE! FALSE NEGATIVE: zham=-2.09 zspam=-18.28 Data/Spam/Set3/3010 SURE! FALSE NEGATIVE: zham=-2.46 zspam=-4.49 Data/Spam/Set4/6367 SURE! FALSE NEGATIVE: zham=-2.46 zspam=-4.49 Data/Spam/Set4/6371 SURE! Sure/ok 6226 Unsure/ok 224 Unsure/not ok 45 Sure/not ok 5 Unsure rate = 4.14% Sure fn rate = 0.08%; Unsure fn rate = 16.73% Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 19:25:20 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 20:25:20 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3DA08010.3010803@hooft.net> I made a number of changes to rmspik.py:

- The "chance" function was replaced by something a bit more scientific (this helps!).
- There are new parameters in the source code (I'm hoping someone else can make these configurable through the .ini file).

# surefactor: the ratio of the two p's to decide we're sure a message
# belongs to one of the two populations.  Raising this number increases
# the "unsures" on both sides, decreasing the "sure fp" and "sure fn"
# rates.  A value of 1000 works well for me; at 10000 you get slightly
# less sure fp/fn at a cost of a lot more middle ground; at 10 you have
# much less work on the middle ground but ~50% more "sure false"
# scores.  This variable operates on messages that are "a bit of both
# ham and spam".
surefactor = 100

# pminhamsure: the minimal pham at which we say it's surely ham.
# Lowering this value gives less "unsure ham" and more "sure ham"; it
# might however result in more "sure fn".  0.01 works well, but to
# accept a bit more fn, I set it to 0.005.  This variable operates on
# messages that are "neither ham nor spam; but a bit more ham than
# spam".
pminhamsure = 0.005

# pminspamsure: the minimal pspam at which we say it's surely spam.
# Lowering this value gives less "unsure spam" and more "sure spam"; it
# might however result in more "sure fp".  Since most people find fp
# worse than fn, this value should most probably be higher than
# pminhamsure.  0.01 works well, but to accept a bit less fp, I set it
# to 0.02.  This variable operates on messages that are "neither ham
# nor spam; but a bit more spam than ham".
pminspamsure = 0.02

# usetail: if False, use complete distributions to renormalize the
# Z-scores; if True, use only the worst tail value.  I get worse
# results if I set this to True, so the default is False.
usetail = False

# medianoffset: if True, set the median of the zham and zspam to 0
# before calculating rmsZ.  If False, do not shift the data and hence
# assume that 0 is the center of the population.  True seems to help
# for my data.
medianoffset = True

I'd like to invite everyone to play with this. It takes only a few seconds to run once the .pik is set up using "clgen"! I'll post some of my results under separate cover. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From bkc@murkworks.com Sun Oct 6 19:37:30 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 14:37:30 -0400 Subject: [Spambayes] Re: For the bold In-Reply-To: <3DA08010.3010803@hooft.net> Message-ID: <3DA04A75.3377.2CBBC9B0@localhost> On 6 Oct 2002 at 20:25, Rob Hooft wrote: > I made a number of changes to rmspik.py: > > - The "chance" function was replaced by something a bit more > scientific (this helps!). > - There are new parameters in the source code (I'm hoping someone else > can make these configurable through the .ini file). > Does this mean it works for CL now? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 19:43:16 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 14:43:16 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? Message-ID: <3DA04BCE.30178.2CC10EB2@localhost> Someone mentioned they did incremental testing and posted their results, but I couldn't figure out what the results meant. So, I want to try it too. I notice in the TestDriver, comments like:

# CAUTION: this just doesn't work for incremental training when
# options.use_central_limit is in effect.
def train(self, ham, spam):

I'm not planning on using untrain(), so does this comment still apply? My plan is:

1. Receive 100 (configurable) messages "per day", with a (configurable) percentage of those being spam.
2. Run the classifier on those messages and make 3 categories: ham, spam, unsure. I want to know how many fall into each category on each "day".
3. Some percentage (configurable) of each category will be fed back into training each "day".
4. Plot fn and fp rate "per day" for .. 30 days (configurable) to show how rates vary..
5. Modulate max_discriminators, training feedback (% of messages in each category fed back into system) vs.
"days" to get a feel for the results a typical user might expect.. 6. re-run testing using new classifier schemes.. where do I start? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 19:43:30 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 20:43:30 +0200 Subject: [Spambayes] rmspik results Message-ID: <3DA08452.8060105@hooft.net> As promised, here are some of my results from the current version of rmspik.py. For the record: I just wrote in a previous message: I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis. These two tests I did with clt1, clt2 and with clt3 resulting in 6 pik files that I analysed using rmspik.py. This results in such a mass of results that I wrote a quick script to make a "score" out of each run, something that weighs the work of filtering unsure messages, the occurrence of fp's and the occurrence of fn's. The score is done using: fprate=float(nfp)/nham fnrate=float(nfn)/nspam unsurerate=float(nunsure)/ntot score=fprate*fpfac+fnrate*fnfac+unsurerate*unsurefac Where: fpfac=3000.0; fnfac=300.0; unsurefac=100.0 representing one possible "private" mix of priorities (you could think of these as the cost in Euros or Dollars for such a mistake). For a mailing list a philosophy tells me fnfac/unsurefac should be about the number of members of the list, and fp's are not too bad if you can send a nice message to the poster telling him what happened and how to get his message posted anyway. The score is the last number on each line describing a run. 
surefactor=1000 pmin(sp|h)amsure=0.01 usetail=False medianoffset=False

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7745   228    21     6     2683    104    13     0    5.6
clt1-67890   7738   225    33     4     2680    108     8     4    5.4
clt2-12345   7814   155    26     5     2690    101     9     0    4.6
clt2-67890   7781   180    34     5     2714     75     8     3    4.9
clt3-12345   7751   211    32     6     2681    110     9     0    5.6
clt3-67890   7704   256    35     5     2699     91     7     3    5.8

With pminhamsure=0.005 and pminspamsure=0.02

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7746   227    21     6     2671    116    13     0    5.7
clt1-67890   7738   225    34     3     2670    118     8     4    5.1
clt2-12345   7835   134    26     5     2673    118     9     0    4.5
clt2-67890   7822   139    34     5     2693     96     8     3    4.8
clt3-12345   7783   179    33     5     2665    126     9     0    5.1
clt3-67890   7752   208    37     3     2672    118     7     3    4.9

With surefactor=10000

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7492   481    23     4     2618    169    13     0    7.9
clt1-67890   7481   482    34     3     2601    187     9     3    8.0
clt2-12345   7810   159    27     4     2644    147     9     0    4.7
clt2-67890   7792   169    35     4     2660    129     8     3    5.0
clt3-12345   7743   219    34     4     2640    151     9     0    5.3
clt3-67890   7717   243    37     3     2643    147     7     3    5.5

With surefactor=10

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7922    51    19     8     2733     54    11     2    4.5
clt1-67890   7905    39    32     5     2749     39     7     5    3.5
clt2-12345   7864   105    25     6     2683    108     8     1    4.6
clt2-67890   7852   109    33     6     2701     88     7     4    4.9
clt3-12345   7814   148    33     5     2675    116     9     0    4.7
clt3-67890   7779   181    36     4     2679    111     6     4    5.0

With surefactor=100, usetail=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7777   192    24     7     2705     83    12     0    5.5
clt1-67890   7791   171    34     4     2704     84     8     4    4.7
clt2-12345   7824   143    28     5     2695     96     9     0    4.4
clt2-67890   7668   277    47     8     2731     62     4     3    6.9
clt3-12345   7802   165    28     5     2692     99     9     0    4.7
clt3-67890   7636   309    48     7     2727     66     4     3    6.9

With surefactor=100, usetail=True, medianoffset=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7728   241    24     7     2704     84    12     0    6.0
clt1-67890   7753   210    33     4     2695     93     9     3    5.0
clt2-12345   7813   154    28     5     2699     92     9     0    4.5
clt2-67890   7653   292    48     7     2733     60     4     3    6.7
clt3-12345   7803   164    28     5     2693     98     9     0    4.6
clt3-67890   7636   307    50     7     2728     65     4     3    6.9

With surefactor=100, usetail=False, medianoffset=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7842   127    25     6     2672    116    12     0    4.8
clt1-67890   7832   131    34     3     2675    113     8     4    4.2
clt2-12345   7816   147    32     5     2675    116     9     0    4.7
clt2-67890   7786   174    36     4     2684    106     6     4    4.9
clt3-12345   7807   156    32     5     2673    118     9     0    4.8
clt3-67890   7774   187    35     5     2684    106     6     4    5.4

Conclusions so far:
- medianoffset=True helps
- usetail=False is better than True
- clt1 seems to do best, although the difference is not large.
- there are large differences between the 12345 and 67890 runs.

I'm sure that systematic variation of the parameters (e.g. using a simplex optimization?) will give me even better scores. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 20:01:57 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 21:01:57 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3DA088A5.9060100@hooft.net> Tim Peters wrote: >>the assumption under which the rmspik.py code works is that the >>distributions of zham and zspam values are normally distributed >>if all values are "mirrored" around 0. I'll have to test that >>assumption for clt1 and clt3! > > > I didn't catch the meaning there, but expect any assumption you would like > to make is most likely to be true under clt1 (which is the least extreme of > these gimmicks). I intended to say that the approach assumes that the (tails of the) zham and zspam values in the pickles can be described with a Gaussian centered on 0. This was a fairly good approximation for clt2 that I tried first, but appeared HORRIBLE for both clt1 and clt3. That is when I started trying to explain the tails by looking at the tails only. But that didn't help.
What did help is recentering the data on the median value (not the average value; that is a bad approximation for an asymmetric distribution). BTW: I tried using the direct values hmean and smean in the pickles as well. This worked fine immediately for clt2 and gives (as expected) the exact same results. But for clt1 and clt3 this needed the medianoffset parameter, which I have only implemented now. There seems to be a small difference, but I did not investigate that yet. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From jbublitz@nwinternet.com Sun Oct 6 22:10:28 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sun, 06 Oct 2002 14:10:28 -0700 (PDT) Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: On 06-Oct-02 Brad Clements wrote: > Someone mentioned they did incremental testing and posted their > results, but I couldn't figure out what the results meant. That would be me. Apparently nobody could figure out what I wrote. The short summary is that for my data, running it sequentially with "daily" retraining gave far better results than any other testing method, Graham worked slightly better than Spambayes for me (< 0.3% difference in fp/fn %'s - small), and the effect of initial training size (as low as 1 ham, 1 spam) disappeared after the first "day". > So, I want to try it too. > I notice in the TestDriver, comments like: > # CAUTION: this just doesn't work for incrememental training > when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > I'm not planning on using untrain(), so does this comment still > apply? > > my plan is: I'd suggest: 0. Start with a size-configurable basic training sample. > 1. Receive 100 (configurable) messages "per day", with a > (configurable) percentage of > those being spam. > > 2. run the classifier on those messages and make 3 categories: > ham, spam, unsure.
I > want to know how many fall into each category on each "day". > > 3. some percentage (configurable) of each category will be fed > back into training each > "day". > > 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to > show how rates vary.. I had no errors in 21 day tests (with large enough initial training sample - otherwise only errors on first "day"). I needed to test 7K to 8K of *each* type of msg to see any errors in the best case. Short tests are nice for code debugging/checking the effects of methodology changes, as in (5) and (6) below. > 5. modulate max_discriminators, training feedback (% of messages > in each category > fed back into system) vs. "days" to get a feel for the results a > typical user might expect.. The other thing that would be interesting (to me anyway) is if it's possible/desirable for the system to modify the discrimination cutoff(s) automatically based on new training data. In other words, if the system starts at "score > 0.5 is spam", can learning adjust that number to compensate for changes in newly learned data? > 6. re-run testing using new classifier schemes.. > > where do I start? I'm not sure what other info you need - you seem to have it all in order. For my data, msg filename == delivery timestamp (one msg per file), but otherwise you'd probably get the most accurate ordering from the first encountered "Received" line in the headers if the msgs aren't already ordered. Otherwise, I instantiated Hammie (from hammie.py) with hammie.createBayes and just did calls to Hammie.train, Hammie.update_probabilities, and Hammie.score. I didn't try "untrain" either - it would be interesting to see whether using that is good or bad. I also accumulated "weekly" totals thinking I might need to smooth out "daily" variations, but the error rates are so low, the only thing it told me is whether the errors occurred early or late in the sequence.
The last test run I did only had two errors - one on the first "day" and one at almost the last "day" (somewhere between 7700 and 8000 ham msgs). Jim From tim.one@comcast.net Sun Oct 6 23:10:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:10:51 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: [Brad Clements] > ... > I notice in the TestDriver, comments like: > > # CAUTION: this just doesn't work for incrememental training when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > I'm not planning on using untrain(), so does this comment still apply? Yes, afraid so. A do-something compute_population_stats() is unique to the central limit schemes, and all it knows about the world is the ham and spam passed to train(). If you had trained on 20000 ham and 20000 spam, and then passed 10 of each to train() in another call, the population statistics for the previous 20000 of each would be lost, overwritten by the stats for the new 20 msgs. I don't see an obvious way to fix this, alas. It would be easiest to fix under clt1. You could train on every previous msg every time, but that's a quadratic-time proposition overall. From tim.one@comcast.net Sun Oct 6 23:18:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:18:13 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA07589.50407@hooft.net> Message-ID: [Tim] >> I'm not sure exactly what 10x(200+200) means, but at the >> plausible extremes it means your classifiers were trained on >> 200 of each, or on 1800 of each. So at worst, my classifier was >> trained on 3x as much data, and at best on 25x as much data. >> Error rates certainly improve with more training data, >> albeit slowly. [Rob Hooft] > I did have 10 sets each of ham and spam, each set containing 200 > messages out of a total reservoir of ~17000 ham and 7500 spam. See?
This still doesn't give the reader a clue about how many msgs your classifiers were trained on, or how many they predicted against. It's important info, and I don't know how to convince people to reveal it -- the cmp.py output even prints it, but most people snip that part off, as if cmp.py printed it by mistake . > This subset of everything was heavy enough for this optimization: it > took about 24 hours of calculating to get that analysis done.... Without knowing what you did, I really can't comment. This seems like an awfully long time to go through 4,000 (10*(200+200)) total msgs, though, no matter what you were doing. On my box (866MHz, 256MB RAM), the system scores about 80 msgs per second. From tim.one@comcast.net Sun Oct 6 23:26:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:26:08 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA088A5.9060100@hooft.net> Message-ID: [Rob Hooft] > ... > BTW: I tried using the direct values hmean and smean in the pickles as > well. This worked fine immediately for clt2 and gives (as expected) the > exact same results. Actually, I wouldn't expect that: the "zscores" aren't a function of smean or hmean alone, they also depend on each msg's n value (in a way that's highly dubious in clt1 and clt2, and wholly unjustified in clt3). To the limited extent that the zscores *may* make sense under clt1 and clt2, it's the dependence on n that the sense comes from. If you got the same results by ignoring n completely (which just looking at hmean or smean does), then that at least confirms that fiddling with n is in fact of no value in the current pseudo-zscore computation. 
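For readers following along: the n-dependence Tim is questioning is, roughly, dividing a message's deviation from the population's mean extreme-word probability by the population sdev scaled down by sqrt(n), n being the number of extreme words in the message. This is a sketch of the idea only, not the actual classifier.py code, and the numbers are made up:

```python
import math

def pseudo_zscore(msg_mean, pop_mean, pop_sdev, n):
    # How many "standard errors" the message's mean extreme-word
    # probability sits from the training population's mean; the
    # sqrt(n) factor is the n-dependence under discussion.
    return (msg_mean - pop_mean) / (pop_sdev / math.sqrt(n))

# Ignoring n (i.e. n = 1) reduces this to a plain comparison of means:
z_with_n = pseudo_zscore(0.8, 0.5, 0.2, 25)    # -> 7.5
z_without_n = pseudo_zscore(0.8, 0.5, 0.2, 1)  # -> 1.5
```

If scores with and without the sqrt(n) factor rank messages the same way, that is the "fiddling with n is of no value" observation above.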
From tim.one@comcast.net Sun Oct 6 23:53:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:53:13 -0400 Subject: [Spambayes] CL2 test part II In-Reply-To: <3DA02D8B.11446.2C4AD3E5@localhost> Message-ID: [Brad Clements] > In my earlier CL2 and CL3 tests, I trained on the 2nd half of my > corpus, and tested the first half. > > Now, I'm training on the first half and testing the 2nd half. > > First run of CL2 uncovered more misclassifications (which > probably affected the training of my first test). > > ... > > In any case, here's CL2 results training first, testing second half. > > > Ham scores for all runs: 6500 items; mean 0.94; sdev 7.21 > -> min 0; median 0; max 100 > * = 105 items > 0 6384 ************************************************************* > 25 87 * > 50 21 * > 75 8 * Let me interleave these from various followup msgs. Ham for clt2 rmspick: Reading results/cl2-b/clim.pik ... Nham= 6500 RmsZham= 4.48175325539 Nspam= 6500 RmsZspam= 3.7809204202 ====================================================================== HAM: FALSE POSITIVE: zham=-5.79 zspam=-2.33 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.27 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-3.97 zspam=-1.72 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-6.17 zspam=-2.35 Data/Ham/Set9/6415 SURE! Sure/ok 6297 Unsure/ok 181 Unsure/not ok 18 Sure/not ok 4 Unsure rate = 3.06% Sure fp rate = 0.06%; Unsure fp rate = 9.05% Ham for clt3: -> Ham scores for all runs: 6500 items; mean 0.98; sdev 7.64 -> min 0; median 0; max 100 * = 105 items 0 6386 ************************************************************* 25 74 * 50 26 * 75 14 * Ham for clt3 rmspick: Reading results/cl3-b/clim.pik ... Nham= 6500 RmsZham= 12.6590343376 Nspam= 6500 RmsZspam= 14.8475623174 ====================================================================== HAM: FALSE POSITIVE: zham=-6.65 zspam=-2.35 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-6.19 zspam=-2.29 Data/Ham/Set6/10068 SURE! 
FALSE POSITIVE: zham=-4.60 zspam=-1.73 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-7.11 zspam=-2.37 Data/Ham/Set9/6415 SURE! Sure/ok 6294 Unsure/ok 182 Unsure/not ok 20 Sure/not ok 4 Unsure rate = 3.11% Sure fp rate = 0.06%; Unsure fp rate = 9.90% I don't see a significant difference between clt2 & clt3 here; clt2 may be doing slightly better. The differences after rmspick are clearly insignificant. rmspick is unsure twice as often, but is dead wrong half as often as clt2, and even better wrt raw clt3. On to the spam: > -> Spam scores for all runs: 6500 items; mean 99.32; sdev 5.94 > -> min 0; median 100; max 100 > * = 106 items > 0 3 * > 25 15 * > 50 68 * > 75 6414 ************************************************************* > -> best cutoff for all runs: 0.5 > -> with weighted total 1*29 fp + 18 fn = 47 > -> fp rate 0.446% fn rate 0.277% Spam for clt2 rmspick: SPAM: FALSE NEGATIVE: zham=-1.86 zspam=-3.48 Data/Spam/Set7/6718 SURE! FALSE NEGATIVE: zham=-1.48 zspam=-6.37 Data/Spam/Set10/10979 SURE! BOGUS Sure/ok 6240 Unsure/ok 232 Unsure/not ok 26 Sure/not ok 2 Unsure rate = 3.97% Sure fn rate = 0.03%; Unsure fn rate = 10.08% and the 2nd false negative was bogus (really a ham). rmspick was uncertain 3x as often. Spam for clt3: -> Spam scores for all runs: 6500 items; mean 98.85; sdev 7.84 -> min 0; median 100; max 100 * = 105 items 0 8 * 25 22 * 50 113 ** 75 6357 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*19 fp + 30 fn = 49 -> fp rate 0.292% fn rate 0.462% Less certain than clt2, *and* made more mistakes when certain (that's a bad combination ). Spam for clt3 rmspick: SPAM: FALSE NEGATIVE: zham=-1.37 zspam=-6.36 Data/Spam/Set10/10979 SURE! Sure/ok 6271 Unsure/ok 207 Unsure/not ok 21 Sure/not ok 1 Unsure rate = 3.51% Sure fn rate = 0.02%; Unsure fn rate = 9.21% Uncertain more often than raw clt3, but far fewer errors when certain. 
Overall, I'd say that clt2 works better for you than clt3, and that rmspick gives an improvement either way. I bet you're just dying to try clt1.

> [Tokenizer]
> mine_received_headers: True
>
> [Classifier]
> use_central_limit2 = True
> use_central_limit3 = False
> zscore_ratio_cutoff: 1.9
>
> [TestDriver]
> spam_cutoff: 0.50
> show_false_negatives: True
> nbuckets: 4
>
> show_spam_lo: 0.0
> show_spam_hi: 0.45
>
> save_trained_pickles: True
> save_histogram_pickles: True

Looks good! Thank you, Brad. From tim.one@comcast.net Mon Oct 7 03:08:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 22:08:26 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA08010.3010803@hooft.net> Message-ID: [Rob Hooft] > I made a number of changes to rmspik.py: > > - The "chance" function was replaced by something a bit more > scientific (this helps!). If you think more accuracy would help more, there are three well-known routines for computing areas under the unit Gaussian in this file: http://lib.stat.cmu.edu/apstat/66 The fanciest is good to 15 digits, which is possibly more than is really needed here. From bkc@murkworks.com Mon Oct 7 03:23:02 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 22:23:02 -0400 Subject: [Spambayes] CL1 tests Message-ID: <3DA0B78F.29877.2E65FDB5@localhost> Two tests, a and b using cl1 and rmspik. Not formatted too well, doing this via vnc.
Set A -> Ham scores for all in this training set: 6500 items; mean 1.99; sdev 10.31 -> min 0; median 0; max 100 * = 103 items 0 6254 ************************************************************* 25 169 ** 50 62 * 75 15 * -> Spam scores for all in this training set: 6500 items; mean 99.08; sdev 6.76 -> min 0; median 100; max 100 * = 105 items 0 2 * 25 9 * 50 108 ** 75 6381 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*77 fp + 11 fn = 88 -> fp rate 1.18% fn rate 0.169% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.99; sdev 10.31 -> min 0; median 0; max 100 * = 103 items 0 6254 ************************************************************* 25 169 ** 50 62 * 75 15 * -> Spam scores for all runs: 6500 items; mean 99.08; sdev 6.76 -> min 0; median 100; max 100 * = 105 items 0 2 * 25 9 * 50 108 ** 75 6381 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*77 fp + 11 fn = 88 -> fp rate 1.18% fn rate 0.169% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Reading results/cl1-a/clim.pik ... Nham= 6500 RmsZham= 4.15398302786 Nspam= 6500 RmsZspam= 4.50455044819 ====================================================================== HAM: FALSE POSITIVE: zham=3.62 zspam=-1.05 Data/Ham/Set7/9964 SURE! Sure/ok 6236 Unsure/ok 232 Unsure/not ok 31 Sure/not ok 1 Unsure rate = 4.05% Sure fp rate = 0.02%; Unsure fp rate = 11.79% ====================================================================== SPAM: FALSE NEGATIVE: zham=0.91 zspam=-3.42 Data/Spam/Set10/9656 SURE! 
Sure/ok 6144 Unsure/ok 336 Unsure/not ok 19 Sure/not ok 1 Unsure rate = 5.46% Sure fn rate = 0.02%; Unsure fn rate = 5.35% Set B -> Ham scores for all in this training set: 6500 items; mean 1.72; sdev 9.38 -> min 0; median 0; max 100 * = 103 items 0 6282 ************************************************************* 25 173 ** 50 37 * 75 8 * -> Spam scores for all in this training set: 6500 items; mean 99.16; sdev 6.43 -> min 0; median 100; max 100 * = 105 items 0 1 * 25 10 * 50 99 * 75 6390 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*45 fp + 11 fn = 56 -> fp rate 0.692% fn rate 0.169% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.72; sdev 9.38 -> min 0; median 0; max 100 * = 103 items 0 6282 ************************************************************* 25 173 ** 50 37 * 75 8 * -> Spam scores for all runs: 6500 items; mean 99.16; sdev 6.43 -> min 0; median 100; max 100 * = 105 items 0 1 * 25 10 * 50 99 * 75 6390 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*45 fp + 11 fn = 56 -> fp rate 0.692% fn rate 0.169% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Reading results/cl1-b/clim.pik ... Nham= 6500 RmsZham= 4.43688346925 Nspam= 6500 RmsZspam= 4.49901192821 ====================================================================== HAM: FALSE POSITIVE: zham=8.00 zspam=-1.23 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=3.55 zspam=-1.56 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=8.59 zspam=-0.61 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=8.86 zspam=-0.32 Data/Ham/Set5/5105 SURE! 
Sure/ok 6251 Unsure/ok 193 Unsure/not ok 52 Sure/not ok 4 Unsure rate = 3.77% Sure fp rate = 0.06%; Unsure fp rate = 21.22% ====================================================================== SPAM: FALSE NEGATIVE: zham=0.60 zspam=-3.25 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=0.19 zspam=-9.53 Data/Spam/Set3/3010 SURE! Sure/ok 6131 Unsure/ok 337 Unsure/not ok 30 Sure/not ok 2 Unsure rate = 5.65% Sure fn rate = 0.03%; Unsure fn rate = 8.17%

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit = True
use_central_limit2 = False
use_central_limit3 = False
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
show_false_negatives: True
nbuckets: 4

show_spam_lo: 0.0
show_spam_hi: 0.45

save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Mon Oct 7 04:12:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 23:12:10 -0400 Subject: [Spambayes] New tokenization of the Subject line In-Reply-To: <1033781185.1125.7.camel@localhost.localdomain> Message-ID: [Remi Ricard] > I try something again. > > Since most of the mail from subscribed groups have in their > subject [spambayes] or [freesco] i.e "[" and "]". > I decided to keep this as a word so my words from a subject line > like: Re: [Spambayes] Moving closer to Gary's ideal > will be
> Re:
> [Spambayes]
> Moving
> closer
> to
> Gary's
> ideal
Two things about that: 1. It's not a precise enough description to know exactly what you did. On a list with programmers, don't be afraid to show code. 2. Do you think it's more likely that a spam would have "freesco" than "[freesco]" in its Subject line? Not bloody likely. That is, you couldn't have picked worse examples for selling the idea that this *might* help. Indeed, that may be why it didn't help.
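For concreteness, Remi's idea — keeping "[Spambayes]" as one token instead of splitting out the brackets — can be shown in a few lines. This is a hypothetical reconstruction of the change, not his actual patch:

```python
import re

# One alternation: a bracketed run such as "[Spambayes]" is kept
# whole; otherwise fall back to the stock subject-word pattern.
subject_word_re = re.compile(r"\[[^\]]+\]|[\w\x80-\xff$.%]+")

subject = "Re: [Spambayes] Moving closer to Gary's ideal"
tokens = subject_word_re.findall(subject)
# -> ['Re', '[Spambayes]', 'Moving', 'closer', 'to', 'Gary', 's', 'ideal']
```

Note the bracketed alternative must come first, so it wins over the plain word pattern.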
It's usually more fruitful to stare at mistakes made by the system, and then see if there's something about them in common that the tokenizer isn't presenting in a usable way (very clear example: we throw away uuencoded pieces entirely; very muddy example: we throw away info about how many times a word appears in a msg). > And this is the result. Alex did a nice job of running through this, so I'll skip to the end.

> -> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> tested 200 hams & 268 spams against 800 hams & 1124 spams
> -> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> tested 200 hams & 268 spams against 800 hams & 1124 spams
>
> false positive percentages
> 1.000 0.500 won -50.00%
> 1.500 1.500 tied
> 2.000 2.500 lost +25.00%
> 1.000 1.000 tied
> 0.000 0.000 tied
>
> won 1 times
> tied 3 times
> lost 1 times
>
> total unique fp went from 11 to 11 tied
> mean fp % went from 1.1 to 1.1 tied
>
> false negative percentages
> 0.717 0.717 tied
> 0.727 0.727 tied
> 1.007 1.342 lost +33.27%
> 0.000 0.368 lost +(was 0)
> 0.746 0.373 won -50.00%
>
> won 1 times
> tied 2 times
> lost 2 times
>
> total unique fn went from 9 to 10 lost +11.11%
> mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32%
>
> ham mean ham sdev
> 24.51 25.20 +2.82% 9.45 9.09 -3.81%
> 26.14 27.20 +4.06% 8.62 8.32 -3.48%
> 26.04 26.94 +3.46% 10.00 9.68 -3.20%
> 25.15 25.85 +2.78% 8.05 7.93 -1.49%
> 25.12 26.11 +3.94% 8.28 8.16 -1.45%
>
> ham mean and sdev for all runs
> 25.39 26.26 +3.43% 8.93 8.69 -2.69%
>
> spam mean spam sdev
> 80.41 79.86 -0.68% 8.80 8.81 +0.11%
> 79.87 79.47 -0.50% 8.20 8.11 -1.10%
> 79.87 79.31 -0.70% 8.79 8.73 -0.68%
> 80.42 80.03 -0.48% 8.13 8.22 +1.11%
> 80.11 79.70 -0.51% 9.32 9.07 -2.68%
>
> spam mean and sdev for all runs
> 80.13 79.66 -0.59% 8.66 8.60 -0.69%
>
> ham/spam mean difference: 54.74 53.40 -1.34

[T. Alexander Popiel]
> This shows ham and spam getting closer together overall, and
> is bad. The reduction in the standard deviation is (I think)
> too small to overcome this... but I'm just eyeballing it;
> can someone with a bit of the theory help here?

Not much in this case, because it had nothing else going for it: the conclusion to give up on this idea should have been reached long before getting to this point. We don't know what this distribution "looks like", exactly. It appears to be "kinda normal", but is tighter than normal at the endpoints, and looser than normal where the tails dribble toward each other. This limits the usefulness we can get out of sdevs: the only thoroughly general result is that, for *any* distribution, no more than 1/k**2 of the data lives more than k standard deviations away from the mean. This is an especially useless result when k <= 1. There's a one-tailed version that says something non-trivial for k <= 1: http://www.btinternet.com/~se16/hgb/cheb.htm But we're more interested in the overlap, and that occurs at higher k. The rule of thumb I fall back on is that, *whatever* sdev means for this distribution, I assume it means much the same thing across testers, and that (which is justified although hard to quantify here) separating the means by more sdevs is a good thing. So I look for the value of k such that (and assuming mean1 < mean2):

    mean1 + k * sdev1 = mean2 - k * sdev2

or, rearranging,

        mean2 - mean1
    k = -------------
        sdev1 + sdev2

That tells us the score that's "equally far away" from both means in a standard-deviation sense, and how far away that is from both means (in units of standard deviations).
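The 1/k**2 result mentioned above is Chebyshev's inequality, and it really does hold for any distribution with finite variance; a quick empirical check on uniform random data (chosen purely as an illustration):

```python
import math
import random

random.seed(42)
data = [random.random() for _ in range(100000)]
mean = sum(data) / len(data)
sdev = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

for k in (1.5, 2.0, 3.0):
    tail = sum(1 for x in data if abs(x - mean) > k * sdev)
    frac = float(tail) / len(data)
    # Chebyshev guarantees frac <= 1/k**2, whatever the distribution;
    # for uniform data the observed fraction is far below the bound.
    assert frac <= 1.0 / k ** 2
```

For a uniform distribution sdev is about 0.289, so nothing at all lies beyond 2 sdevs — the bound is loose, which is Tim's point about it being weakest exactly where the overlap lives.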
A little Python helps:

    def findk(mean1, sdev1, mean2, sdev2):
        """Solve mean1 + k*sdev1 = mean2 - k*sdev2 for k.
        Return (k, common value).
        """
        assert mean1 < mean2
        k = (mean2 - mean1) / (sdev1 + sdev2)
        score = mean1 + k * sdev1
        return k, score

Plugging in the "before" means and sdevs gives:

    >>> findk(25.39, 8.93, 80.13, 8.66)
    (3.1119954519613415, 53.180119386014781)

BTW, if you don't favor one kind of error over another, this suggests spam_cutoff=0.5318 may well be a good value for this data. If it isn't, the direction it errs in is a clue about which distribution is stranger. Plugging in the "after" values gives:

    >>> findk(26.26, 8.69, 79.66, 8.60)
    (3.0884904569115093, 53.098982070561014)
    >>>

So the means have gotten a tiny bit closer in an sdev sense too (they meet at 3.09 sdevs from both, instead of at 3.11 before). The difference is so small as to be insignificant, though. From tim.one@comcast.net Mon Oct 7 06:30:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 07 Oct 2002 01:30:41 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: [Brad Clements] > Someone mentioned they did incremental testing and posted their > results, but I couldn't > figure out what the results meant. > > So, I want to try it too. > > I notice in the TestDriver, comments like: > > # CAUTION: this just doesn't work for incrememental training when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > > I'm not planning on using untrain(), so does this comment still apply? I replied to this before with "sorry, yes", but this issue needs to be forced, and I checked in changes so we can at least *try* this. Let me explain the problem: Under the all-default scheme, the only thing we remember about training msgs is how many msgs each word appears in. That's all.
Given any msg, we can add it or remove it at will, and the only effect it has is on the word->hamcount and word->spamcount maps (from which we guess probabilities). The central limit schemes are quite different this way: we not only save word->hamcount and word->spamcount maps (and in exactly the same way, so no problem there), we also do a third training pass (.central_limit_compute_population_stats{,2,3}) under the covers. This looks for the set of "extreme words" in each training message (which can't be known until after update_probabilities() completes), and saves away statistics about their probabilities, one set of statistics for all the ham messages trained on, and a parallel, distinct set for all the spam messages trained on. The problem with incremental training under the clt schemes is in that third pass: when you train on any new data:

1. The word->hamcount and word->spamcount maps change.

2. This in turn changes word probabilities. The word probabilities that *were* used in the third training pass for *previous* data are no longer current, and so the statistics computed from them are also incorrect for the new state of the world.

3. Changing word probabilities can in turn even change the *set* of extreme words in a msg. And again, the set of extreme words found by the third training pass for previous data may not even be the correct extreme words for the new state of the world.

There's simply no way to repair #2 and #3 short of recomputing them from scratch for every msg ever trained on, and that requires feeding them all into the system again (or a moral equivalent, like storing, for each msg ever trained on, the set of tokens it generated). In particular, as time goes on the probabilities computed in #2 get more extreme (closer to 0.0 and closer to 1.0) for strong clues, and clt2 and clt3 in particular make extreme use of extreme words. clt1 is less sensitive that way.
This implies that, if you don't retrain on every msg, the mild spamprobs in the msgs first trained on will forever after drag down the statistics toward neutrality. There are two hacks I can think of to try, short of retraining on every msg ever seen:

1. Just keep adding in new statistics, and don't worry about the moderating effects of the early msgs. The code as checked in now will do this: so long as you don't call new_classifier(), each time train() is called it just adds the new statistics to the old ones (before I checked in the changes, it overwrote the old statistics, as if they had never existed).

2. Indeed simply overwrite the old statistics. This is as if the third training pass had never been done for older messages.

My intuition (which isn't worth much!) is that #2 is quirkier and riskier, making much of the effect of the central-limit gimmicks depend solely on the last batch of msgs trained on. #1 should have much greater stability over time, but that's not necessarily a good thing if the stability is bought at the cost of not moving quickly enough toward the true state of the world. Anyway, the only way to know is to try it. > my plan is: > > 1. Receive 100 (configurable) messages "per day", with a > (configurable) percentage of those being spam. You're ordering these by time received, right? > 2. run the classifier on those messages and make 3 categories: > ham, spam, unsure. I want to know how many fall into each > category on each "day". I would like to see eight categories instead:

    ham sure correct
    ham sure incorrect
    ham unsure correct
    ham unsure incorrect

and the same four for spam. > 3. some percentage (configurable) of each category will be fed > back into training each "day". There's a world of interesting variations here. For example, what if you only feed it "sure but wrong" false positives and false negatives? Or only those plus "unsure but wrong" mistakes? Or only the latter? Etc.
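Going back to the two hacks above: the only difference between them is what happens to the stored population statistics on each train() call. A toy sketch with made-up names (the real third-pass statistics live inside the classifier, not in a class like this):

```python
class PopStats:
    # Running count / sum / sum-of-squares of extreme-word probabilities.
    def __init__(self):
        self.n, self.total, self.totalsq = 0, 0.0, 0.0

    def add_batch(self, probs, overwrite=False):
        if overwrite:
            # hack #2: forget everything earlier batches contributed
            self.n, self.total, self.totalsq = 0, 0.0, 0.0
        for p in probs:
            # hack #1: keep accumulating across train() calls
            self.n += 1
            self.total += p
            self.totalsq += p * p

    def mean(self):
        return self.total / self.n

stats = PopStats()
stats.add_batch([0.1, 0.2, 0.3])             # first batch of training msgs
stats.add_batch([0.9, 0.8])                  # hack #1: mean reflects both batches
print(round(stats.mean(), 2))                # -> 0.46
stats.add_batch([0.6, 0.7], overwrite=True)  # hack #2: only this batch survives
print(round(stats.mean(), 2))                # -> 0.65
```

The sketch shows the stability trade-off directly: the accumulated mean drifts slowly, while the overwritten one jumps to whatever the last batch looked like.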
Semi-realistic is to feed it all mistakes, and a random sampling from correct results. It's hard to know what people would really do, but I'm *most* interested at first in what happens if intelligent use of the system is made. > 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to > show how rates vary.. Note that there are two f-n and two f-p rates under the clt schemes (the "sure" and "unsure" mistake rates). > 5. modulate max_discriminators, training feedback (% of messages > in each category fed back into system) vs. "days" to get a feel > for the results a typical user might expect.. Like such a beast exists. I know one of my sisters well enough to guess that she would feed it every false negative, and nothing else. > 6. re-run testing using new classifier schemes.. > > where do I start? At step #1. You'll need a custom test driver, but those are easy enough to write. Really stare at the differences between, e.g., timtest.py and timcv.py: the differences between strategies as different as a grid driver and a cross-validation driver amount to a few dozen lines of code in one function.
For this, something like:

    d = TestDriver.Driver()
    ham, spam = some initial set of msgs to get things started
    d.train(ham, spam)
    for day in range(number_of_days):
        ham, spam = get the day's new msgs
        d.test(ham, spam)
        d.finishtest()
        print out whatever stats you want, although d.finishtest()
            automatically prints out all the stuff you're interested
            in, so this may be much more a matter of writing a custom
            output analyzer; inferring the 4 error rates from pairs of
            4-line histograms would be a PITA that we could make easier
            (adding new "-> " lines is easy, and harmless so long as
            they're not easily confusable with the lines of this kind
            other programs are already extracting)
        ham2, spam2 = the msgs from ham & spam you want to train on
        d.train(ham2, spam2)
    d.alldone()

From papaDoc@videotron.ca Mon Oct 7 13:17:02 2002 From: papaDoc@videotron.ca (papaDoc) Date: Mon, 07 Oct 2002 08:17:02 -0400 Subject: [Spambayes] New tokenization of the Subject line References: Message-ID: <3DA17B3E.4060503@videotron.ca>

Hi,

>[Remi Ricard]
>
>>I try something again.
>>
>>Since most of the mail from subscribed groups have in their
>>subject [spambayes] or [freesco] i.e "[" and "]".
>>I decided to keep this as a word so my words from a subject line
>>like: Re: [Spambayes] Moving closer to Gary's ideal
>>will be
>>Re:
>>[Spambayes]
>>Moving
>>closer
>>to
>>Gary's
>>ideal
>>
>
>Two things about that:
>
>1. It's not a precise enough description to know exactly what you
>   did. On a list with programmers, don't be afraid to show code.
>
>2. Do you think it's more likely that a spam would have "freesco"
>   than "[freesco]" in its Subject line? Not bloody likely.
>   That is, you couldn't have picked worse examples for selling the
>   idea that this *might* help. Indeed, that may be why it didn't
>   help.
>It's usually more fruitful to stare at mistakes made by the system, and then
>see if there's something about them in common that the tokenizer isn't
>presenting in a usable way (very clear example: we throw away uuencoded
>pieces entirely; very muddy example: we throw away info about how many
>times a word appears in a msg).

OK, this is the code I changed. This:

    subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
    punctuation_run_re = re.compile(r'\W+')

for:

    subject_word_re = re.compile(r"[\w\x80-\xff\[\]$.%]+")
    punctuation_run_re = re.compile(r'\W^\[^\]+')

Why I did that is because I found this:

    prob(subject: '[') 0.0012345
    prob(subject: ']') 0.0012345

and usually I have a '[' or ']' in the subject if I have "[someword_from_a_mailing_list]", so instead of having '[', 'someword_from_a_mailing_list' and ']' as three tokens, why not use [someword_from_a_mailing_list] as one token. It is more likely that a ham will have [freesco] in its subject than only freesco "for my case", and I think a spam won't have freesco in its subject at all. (This is a clean mailing list he he.. this is still possible.....)

And I don't want a spam with a subject like "[[[[[[New free porn site]]]]]]" to have its '[' and ']' count as ham.

papaDoc
P.S. Thanks for the statistic explanation of my result :-)

From popiel@wolfskeep.com Mon Oct 7 20:58:51 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 07 Oct 2002 12:58:51 -0700 Subject: [Spambayes] Effects of ham to spam ratio Message-ID: <20021007195851.A54C3F57F@cashew.wolfskeep.com>

Executive summary: more spam is VERY good. 1:4 ham:spam is _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.

I'm back with another unusual experiment. This time, I varied the ratio of ham to spam, while keeping the total number of messages trained and tested constant. Once again, I'm doing this using the all-defaults Robinson classifier.
If someone gives me a good set of .ini files, I'd be more than happy to run this test using any of the central limit algorithms, too.

I again used timcv.py as my test driver, this time with 200 messages in each ham/spam set. For the different runs, I used the --{ham,spam}-keep options to control how much of each set got used, with the total used always being 250 ham+spam from each pair. The script I used (along with all the run output, etc.) is on my website at:

    http://www.wolfskeep.com/~popiel/spambayes/ratio

I also mangled a version of cmp.py (now called table.py, also on the website) to generate the following output:

-> tested 50 hams & 200 spams against 200 hams & 800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       1       2       2       3       3       1
fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
fn tot:         12      17      20      28      28      30      36
fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84

There are several interesting things here:

1. The false positive rate remains insignificant throughout.

2. The false negative rate drops significantly as the ham:spam
   ratio goes down. The more spam you have in your mailfeed,
   the better this whole thing works.

3. The ham:spam ratio affects the spam sdev much more than the
   ham sdev.

4. Tim's k value (mean separation divided by sum of standard
   deviations) is best with slightly less ham than spam (at 2:3),
   which happens to be about the same ratio as in my real mailfeed.

It would be very interesting to find out if the best ham:spam ratio for k (#4 above) is constant, or if it's actually tied to the ratio in the real mail feed from which the training data is taken.
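As a cross-check, the k row of the table can be recomputed from the printed means and sdevs; a throwaway sketch (the helper name is mine, not from table.py or the spambayes tools):

```python
def k_value(h_mean, h_sdev, s_mean, s_sdev):
    # k = mean separation divided by the sum of the standard deviations
    return (s_mean - h_mean) / (h_sdev + s_sdev)

# Reproducing the first and last columns from the table:
# 50-200 column: (78.32 - 28.80) / (8.37 + 7.87) -> 3.05
# 200-50 column: (68.10 - 16.59) / (7.30 + 10.86) -> 2.84
```

The recomputed values round to the printed k row, which suggests table.py is combining the statistics the same way.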
This may be hard to measure for people who are using corpora augmented from several sources.

- Alex

From skip@pobox.com Mon Oct 7 21:13:14 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 7 Oct 2002 15:13:14 -0500 Subject: [Spambayes] CL2 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <15777.60122.260016.744625@montanaro.dyndns.org>

>> Also, turns out I had a lot of zero length message files that came up
>> as false negatives.. I've rm `find -empty` and rebal..

Tim> How *should* empty msgs be treated (that's a question for
Tim> everyone)? When there's nothing to go on, it's hard to decide
Tim> .

Well, even empty messages will have headers. Sounds like Brad's files were truly zero-length, that is, not really mail messages.

(I suspect this response is kind of late for this thread. I'm still working through mail problems on my new computer, so this will also serve as a test to see if it makes it out and back...)

Skip

From chk@pobox.com Mon Oct 7 21:17:20 2002 From: chk@pobox.com (Harald Koch) Date: Mon, 07 Oct 2002 16:17:20 -0400 Subject: [Spambayes] Re: Effects of ham to spam ratio In-Reply-To: popiel's message of "Mon, 07 Oct 2002 12:58:51 -0700". <20021007195851.A54C3F57F@cashew.wolfskeep.com> References: <20021007195851.A54C3F57F@cashew.wolfskeep.com> Message-ID: <9288.1034021840@elisabeth.cfrq.net>

> Executive summary: more spam is VERY good. 1:4 ham:spam is
> _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.

Thank the Gods I don't *receive* spam in that ratio...

--
Harald Koch

From tim@zope.com Tue Oct 8 01:47:01 2002 From: tim@zope.com (Tim Peters) Date: Mon, 7 Oct 2002 20:47:01 -0400 Subject: [Spambayes] CL2 results In-Reply-To: <15777.60122.260016.744625@montanaro.dyndns.org> Message-ID:

[Skip Montanaro]
> Well, even empty messages will have headers. Sounds like Brad's
> files were truly zero-length, that is, not really mail messages.

Yup, and they should come out with "a score" of 0.5 then.
Judging a msg from the headers alone (with an empty body) seems too much a crapshoot, though.

> (I suspect this response is kind of late for this thread. I'm
> still working through mail problems on my new computer, so this will
> also serve as a test to see if it makes it out and back...)

I can testify it made it one way. BTW, I don't think it's ever too late for a response. There's been an unreal amount of traffic on this list. I see I still have 134 msgs here I intended to reply to, and God only knows if I'll ever get to a fraction of them. So if someone feels a good point has been lost in the shuffle, please bring it up again.

From tim@zope.com Tue Oct 8 18:02:03 2002 From: tim@zope.com (Tim Peters) Date: Tue, 8 Oct 2002 13:02:03 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID:

This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment

The attached sets up an experiment:

    create a vector of 50 "probabilities" at random, uniformly
    distributed in (0.0, 1.0)

    combine them using Paul Graham's scheme, and using Gary
    Robinson's scheme

    record the results

    repeat 5000 times

The results should look familiar for those playing this game from the start:

Result for random vectors of 50 probs, + 0 forced to 0.99

Graham combining 5000 items; mean 0.50; sdev 0.47
-> min 9.54792e-022; median 0.506715; max 1
* = 35 items
0.00 2051 ***********************************************************
0.05  100 ***
0.10   75 ***
0.15   63 **
0.20   44 **
0.25   35 *
0.30   40 **
0.35   34 *
0.40   30 *
0.45   25 *
0.50   34 *
0.55   32 *
0.60   31 *
0.65   24 *
0.70   39 **
0.75   43 **
0.80   56 **
0.85   55 **
0.90  108 ****
0.95 2081 ************************************************************

Robinson combining 5000 items; mean 0.50; sdev 0.04
-> min 0.350831; median 0.500083; max 0.649056
* = 34 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35   20 *
0.40  450 **************
0.45 2027 ************************************************************
0.50 2019 ************************************************************
0.55  452 **************
0.60   32 *
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

IOW, Paul's scheme is almost always "certain" given 50 discriminators, even in the face of random input. Gary's is never "certain" then.

OTOH, do the experiment all over again, but attach one prob of 0.99 to each random vector of 50 probs. The probs are now systematically biased:

Result for random vectors of 50 probs, + 1 forced to 0.99

Graham combining 5000 items; mean 0.65; sdev 0.45
-> min 8.36115e-021; median 0.992403; max 1
* = 47 items
0.00 1353 *****************************
0.05   92 **
0.10   50 **
0.15   42 *
0.20   40 *
0.25   35 *
0.30   26 *
0.35   31 *
0.40   32 *
0.45   31 *
0.50   23 *
0.55   29 *
0.60   30 *
0.65   31 *
0.70   45 *
0.75   33 *
0.80   58 **
0.85   84 **
0.90  113 ***
0.95 2822 *************************************************************

Robinson combining 5000 items; mean 0.51; sdev 0.04
-> min 0.377845; median 0.513446; max 0.637992
* = 42 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    2 *
0.40  181 *****
0.45 1549 *************************************
0.50 2527 *************************************************************
0.55  698 *****************
0.60   43 **
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

There's a dramatic difference in the Paul results, while the Gary results move subtly (in comparison).
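The two combining schemes can be sketched directly in Python. This is a paraphrase of the experiment, not the attached combine.py (which was scrubbed from the archive); the formulas below are the commonly described Graham and Robinson geometric-mean rules and should be checked against the actual classifier source:

```python
import math
import random

def graham_combine(probs):
    # Paul Graham's scheme: P = prod(p) / (prod(p) + prod(1 - p)).
    p = math.prod(probs)
    q = math.prod(1.0 - x for x in probs)
    return p / (p + q)

def robinson_combine(probs):
    # Gary Robinson's geometric-mean scheme (as commonly described):
    #   P = 1 - (prod(1 - p))**(1/n),  Q = 1 - (prod(p))**(1/n)
    #   score = (1 + (P - Q) / (P + Q)) / 2
    n = len(probs)
    P = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

def experiment(n_random=50, n_forced=0, trials=1000, seed=42):
    # Mirror the experiment: random probs plus n_forced probs of 0.99.
    rng = random.Random(seed)
    g_scores, r_scores = [], []
    for _ in range(trials):
        probs = [rng.random() for _ in range(n_random)] + [0.99] * n_forced
        g_scores.append(graham_combine(probs))
        r_scores.append(robinson_combine(probs))
    return g_scores, r_scores
```

On purely random input, Robinson scores cluster tightly around 0.5 while Graham scores pile up near 0 and 1, matching the histograms above.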
If we force 10 additional .99 spamprobs, the differences are night and day:

Result for random vectors of 50 probs, + 10 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.01
-> min 0.213529; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    1 *
0.25    0
0.30    1 *
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 4998 *************************************************************

Robinson combining 5000 items; mean 0.59; sdev 0.03
-> min 0.49794; median 0.58555; max 0.694905
* = 51 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed. At least in this test, Paul-combining seemed far more sensitive (even when the data is random).

Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
Even forcing 50 spamprobs of 0.99, the latter only moves up to an average of 0.7:

Result for random vectors of 50 probs, + 50 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.00
-> min 1; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 5000 *************************************************************

Robinson combining 5000 items; mean 0.70; sdev 0.02
-> min 0.628976; median 0.704543; max 0.810235
* = 45 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60   40 *
0.65 2070 **********************************************
0.70 2743 *************************************************************
0.75  146 ****
0.80    1 *
0.85    0
0.90    0
0.95    0

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: combine.py
Type: application/octet-stream
Size: 1294 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021008/6dd66b22/combine.exe
---------------------- multipart/mixed attachment--

From neale@woozle.org Tue Oct 8 18:45:36 2002 From: neale@woozle.org (Neale Pickett) Date: 08 Oct 2002 10:45:36 -0700 Subject: [Spambayes] quick hammie poll Message-ID:

RSVP, but only if you use hammie :)

1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d
   option)?

2. How big is your store file?

3. Would you be able and willing to run an XML-RPC server process all
   the time for mail scoring?
(Note to Tim: my resistance is weakening ;)

Thanks
Neale

From neale@woozle.org Tue Oct 8 18:58:31 2002 From: neale@woozle.org (Neale Pickett) Date: 08 Oct 2002 10:58:31 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: References: Message-ID:

So then, "Tim Peters" is all like:

> The attached sets up an experiment:
>
>     create a vector of 50 "probabilities" at random, uniformly
>     distributed in (0.0, 1.0)
>
>     combine them using Paul Graham's scheme, and using Gary
>     Robinson's scheme
>
>     record the results
>
>     repeat 5000 times
>
> The results should look familiar for those playing this game from the start:

Heh, I got an exception:

    Traceback (most recent call last):
      File "combine.py", line 56, in ?
        h1.display()
      File "Histogram.py", line 116, in display
        raise ValueError("nbuckets %g > 0 required" % nbuckets)
    TypeError: float argument required

I patched Histogram.py to do what I think you meant (Also ITYM "buckets", not "buckts"):

    @@ -111,6 +112,8 @@
         # buckts to a list of nbuckets counts, but only if at least one
         # data point is in the collection.
         def display(self, nbuckets=None, WIDTH=61):
    +        if nbuckets is None:
    +            nbuckets = self.nbuckets
             if nbuckets <= 0:
                 raise ValueError("nbuckets %g > 0 required" % nbuckets)
             self.compute_stats()

Submitted for your approval,
Neale
>         h1.display()
>       File "Histogram.py", line 116, in display
>         raise ValueError("nbuckets %g > 0 required" % nbuckets)
>     TypeError: float argument required
>
> I patched Histogram.py to do what I think you meant (Also ITYM
> "buckets", not "buckts"):
>
>     @@ -111,6 +112,8 @@
>          # buckts to a list of nbuckets counts, but only if at least one
>          # data point is in the collection.
>          def display(self, nbuckets=None, WIDTH=61):
>     +        if nbuckets is None:
>     +            nbuckets = self.nbuckets
>          if nbuckets <= 0:

Heh. I had made the same change here but neglected to check it in. Be my guest!

> Submitted for your approval,

I always approve of you, Neale.

From tim.one@comcast.net Tue Oct 8 22:45:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 08 Oct 2002 17:45:02 -0400 Subject: [Spambayes] Effects of ham to spam ratio In-Reply-To: <20021007195851.A54C3F57F@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> Executive summary: more spam is VERY good. 1:4 ham:spam is
> _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.
>
> I'm back with another unusual experiment. This time, I varied
> the ratio of ham to spam, while keeping the total number of
> messages trained and tested constant. Once again, I'm doing
> this using the all-defaults Robinson classifier. If someone
> gives me a good set of .ini files, I'd be more than happy to
> run this test using any of the central limit algorithms, too.

They're all the same, except for which one of

    use_central_limit: True
    use_central_limit2: True
    use_central_limit3: True

you want to use. Other than that, the spam cutoff ratio must be 0.5, and the only semi-automated way to extract the 4 error rates (fp/fn when certain/uncertain) is to set nbuckets to 4 and stare at the little histograms.

> I again used timcv.py as my test driver, this time with 200
> messages in each ham/spam set.

How many sets (-n10, -n5, ...?). Looks like 5.
> For the different runs, I used the --{ham,spam}-keep options to
> control how much of each set got used, with the total used always
> being 250 ham+spam from each pair. The script I used (along with
> all the run output, etc.) is on my website at:
>
>     http://www.wolfskeep.com/~popiel/spambayes/ratio
>
> I also mangled a version of cmp.py (now called table.py,
> also on the website) to generate the following output:
>
> -> tested 50 hams & 200 spams against 200 hams & 800 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 800 hams & 200 spams
>
> ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
> fp tot:          2       1       2       2       3       3       1
> fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
> fn tot:         12      17      20      28      28      30      36
> fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
> h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
> h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
> s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
> s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
> mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
> k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84
>
> There are several interesting things here:
>
> 1. The false positive rate remains insignificant throughout.
>
> 2. The false negative rate drops significantly as the ham:spam
>    ratio goes down. The more spam you have in your mailfeed,
>    the better this whole thing works.

The reason isn't clear, though: it may well have less to do with the ratio than with the absolute quantity of spam trained on. If there's sufficient variety in your spam, it could simply be that 200 is way too few to get a representative sampling of the diversity your spam, umm, enjoys.

> 3. The ham:spam ratio affects the spam sdev much more than the
>    ham sdev.

Which is more reason to be suspicious: sdev is a measure of how wild the data is. If the sdev gets steady as the absolute count increases, it means the data is "settling down".
Your spam sdev goes up by about 0.50 in each column, with no sign of settling down "to the left", which suggests that even at the 50-200 extreme it's *still* finding plenty of new stuff in the spam.

Do you have a lot of Asian spam? The gimmicks we've got for that ("skip" and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is just a lucky accident.

> 4. Tim's k value (mean separation divided by sum of standard
>    deviations) is best with slightly less ham than spam (at 2:3),
>    which happens to be about the same ratio as in my real mailfeed.
>
> It would be very interesting to find out if the best ham:spam
> ratio for k (#4 above) is constant, or if it's actually tied to
> the ratio in the real mail feed from which the training data is
> taken. This may be hard to measure for people who are using
> corpora augmented from several sources.

It would be better to get independent results from the same kind of test but run with more data. I know that, for example, in my data, I have to train on several thousand spam before the improvement in spam identification slows to a crawl.

Thanks for the report, Alex! Well done and provocative.

From popiel@wolfskeep.com Tue Oct 8 23:58:37 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 08 Oct 2002 15:58:37 -0700 Subject: [Spambayes] Effects of ham to spam ratio In-Reply-To: Message from Tim Peters of "Tue, 08 Oct 2002 17:45:02 EDT." References: Message-ID: <20021008225837.32620F588@cashew.wolfskeep.com>

In message: Tim Peters writes:

>the only semi-automated way to extract the 4 error rates (fp/fn when
>certain/uncertain) is to set nbuckets to 4 and stare at the little
>histograms.

I'll see if I can get something to read those histograms for me, when I start doing the central limit testing. ;-)

>> I again used timcv.py as my test driver, this time with 200
>> messages in each ham/spam set.
>
>How many sets (-n10, -n5, ...?). Looks like 5.
Yeah, I was only using 5 sets, even though I have 10 available. Doh!

>> There are several interesting things here:
>>
>> 1. The false positive rate remains insignificant throughout.
>>
>> 2. The false negative rate drops significantly as the ham:spam
>>    ratio goes down. The more spam you have in your mailfeed,
>>    the better this whole thing works.
>
>The reason isn't clear, though: it may well have less to do with the ratio
>than with the absolute quantity of spam trained on. If there's sufficient
>variety in your spam, it could simply be that 200 is way too few to get a
>representative sampling of the diversity your spam, umm, enjoys
>
>> 3. The ham:spam ratio affects the spam sdev much more than the
>>    ham sdev.
>
>Which is more reason to be suspicious: sdev is a measure of how wild the
>data is. If the sdev gets steady as the absolute count increases, it means
>the data is "settling down". Your spam sdev goes up by about 0.50 in each
>column, with no sign of settling down "to the left", which suggests that
>even at the 50-200 extreme it's *still* finding plenty of new stuff in the
>spam.

True. Hrm.

>Do you have a lot of Asian spam? The gimmicks we've got for that ("skip"
>and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is
>just a lucky accident.

Nope. No Asian spam at all. My spam is mostly in English, with a fair amount of German porn spam (I have _no_ idea how I got onto that list) and one or two spams in Spanish or Italian (I'm not sure which).

>> 4. Tim's k value (mean separation divided by sum of standard
>>    deviations) is best with slightly less ham than spam (at 2:3),
>>    which happens to be about the same ratio as in my real mailfeed.
>>
>> It would be very interesting to find out if the best ham:spam
>> ratio for k (#4 above) is constant, or if it's actually tied to
>> the ratio in the real mail feed from which the training data is
>> taken.
>> This may be hard to measure for people who are using corpora
>> augmented from several sources.
>
>It would be better to get independent results from the same kind of
>test but run with more data. I know that, for example, in my data, I have
>to train on several thousand spam before the improvement in spam
>identification slows to a crawl.

I'll rerun using all 10 sets instead of just 5. *blush*

- Alex

From popiel@wolfskeep.com Wed Oct 9 06:02:43 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 08 Oct 2002 22:02:43 -0700 Subject: [Spambayes] More ratio experiments Message-ID: <20021009050244.1D47FF588@cashew.wolfskeep.com>

Executive summary: Yes, a low ham:spam ratio is good even with larger (to the limit of my available corpora) data sets, but the degree of the goodness seems to go down as the training corpora get larger. Also, the 2:3 ham:spam ratio seems to be interesting for some reason...

The methodology I'm using for this experiment is almost identical to that I used for my original ratio experiment:

    http://www.wolfskeep.com/~popiel/spambayes/ratio

The only thing that I changed was the number of sets I was using for the timcv.py (from 5 in the original experiment to 8, 10, and 15). At 10 this more than doubled and at 15 it more than tripled the training set size for each run, keeping the testing set size the same. For the runs with 15 sets, I had to rebalance my ham and spam, and I could only go up to 1:1 instead of 4:1 due to lack of raw data.

The original experiment (with 5 sets) produced:

-> tested 50 hams & 200 spams against 200 hams & 800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       1       2       2       3       3       1
fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
fn tot:         12      17      20      28      28      30      36
fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84

The new experiment (with 8 sets) produced:

-> tested 50 hams & 200 spams against 350 hams & 1400 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 1400 hams & 350 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          1       2       3       2       4       2       2
fp %:         0.25    0.33    0.38    0.20    0.33    0.14    0.12
fn tot:         18      27      34      40      44      44      45
fn %:         1.12    1.93    2.83    4.00    5.50    7.33   11.25
h mean:      26.37   23.64   21.76   19.87   19.03   18.30   17.02
h sdev:       7.73    7.18    6.95    6.89    7.01    7.16    7.35
s mean:      78.66   77.49   76.49   74.85   73.92   72.44   69.86
s sdev:       7.96    8.50    8.64    9.14    9.74   10.31   10.82
mean diff:   52.29   53.85   54.73   54.98   54.89   54.14   52.84
k:            3.33    3.43    3.51    3.43    3.28    3.10    2.91

With 10 sets it produced:

-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       3       3       3       4       3       3
fp %:         0.40    0.40    0.30    0.24    0.27    0.17    0.15
fn tot:         32      41      43      43      47      48      51
fn %:         1.60    2.34    2.87    3.44    4.70    6.40   10.20
h mean:      24.25   21.75   20.12   18.87   18.33   17.72   16.71
h sdev:       7.52    7.13    7.04    7.09    7.16    7.31    7.43
s mean:      77.56   76.66   75.93   74.85   74.13   72.80   70.57
s sdev:       8.24    8.62    8.77    9.09    9.68    9.90   10.54
mean diff:   53.31   54.91   55.81   55.98   55.80   55.08   53.86
k:            3.38    3.49    3.53    3.46    3.31    3.20    3.00

With 15 sets it produced:

-> tested 50 hams & 200 spams against 700 hams & 2800 spams
[... edited for brevity ...]
-> tested 125 hams & 125 spams against 1750 hams & 1750 spams

ham-spam:   50-200  75-175 100-150 125-125
fp tot:          2       3       4       3
fp %:         0.27    0.27    0.27    0.16
fn tot:         61      69      62      62
fn %:         2.03    2.63    2.76    3.31
h mean:      21.13   19.54   18.44   17.90
h sdev:       6.96    7.02    6.95    7.24
s mean:      76.89   76.47   76.41   75.85
s sdev:       8.35    8.65    8.85    9.02
mean diff:   55.76   56.93   57.97   57.95
k:            3.64    3.63    3.67    3.56

The value of a small ham:spam ratio seems to go down as the training set size increases... or perhaps the sweet spot on the curve is moving, since the fn rates went up on the small ham:spam ratios while they went down on the large ham:spam ratios.

Of note, the best k seems to remain at the 2:3 ratio, independent of training set size. This is also the point at which the fn rates switched directions as more data was added. _Something_ is interesting about that ratio. This could be due to that being near the real ratio of my mail, or it could be due to some of the tunables (spam cutoff, a, s, whatever) in the classifier, or it could be something completely unexpected.

All of this is (of course) on my website at:

    http://www.wolfskeep.com/~popiel/spambayes/ratio2

- Alex

From rob@hooft.net Wed Oct 9 08:09:49 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 09 Oct 2002 09:09:49 +0200 Subject: [Spambayes] More ratio experiments References: <20021009050244.1D47FF588@cashew.wolfskeep.com> Message-ID: <3DA3D63D.5090203@hooft.net>

T. Alexander Popiel wrote:

> Executive summary: Yes, a low ham:spam ratio is good even with larger
> (to the limit of my available corpora) data sets, but the degree of the
> goodness seems to go down as the training corpora get larger. Also,
> the 2:3 ham:spam ratio seems to be interesting for some reason...
>
> The methodology I'm using for this experiment is almost identical to
> that I used for my original ratio experiment:
>
>     http://www.wolfskeep.com/~popiel/spambayes/ratio
>
> The only thing that I changed was the number of sets I was using for
> the timcv.py (from 5 in the original experiment to 8, 10, and 15).
> At 10 this more than doubled and at 15 it more than tripled the
> training set size for each run, keeping the testing set size the same.

Nope, it is the other way around: you still train on s+h=250 messages all the time, you're just testing the scores of more messages.

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From Alexander@Leidinger.net Wed Oct 9 09:33:25 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Wed, 9 Oct 2002 10:33:25 +0200 Subject: [Spambayes] quick hammie poll In-Reply-To: References: Message-ID: <20021009103325.62746537.Alexander@Leidinger.net>

On 08 Oct 2002 10:45:36 -0700 Neale Pickett wrote:

> RSVP, but only if you use hammie :)

I don't use it for classifying my regular mail; at the moment I just use it to separate the spam from the ham in my mega corpus (10^6 mails total).

> 1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d
>    option)?

I hadn't time to investigate my dbm issue, so I still use the pickle store.

> 2. How big is your store file?

This depends... :-) If I only train on 11 ham sets (~950 msgs each) and one spam set (~4600 at the moment) it only has 9MB.

> 3. Would you be able and willing to run an XML-RPC server process all
>    the time for mail scoring?

I don't think it makes a difference in my scenario, but if it makes a speed difference in a delivery pipeline: sure (is it an option to make this optional?).

--
Press every key to continue.
http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7

From tim_one@email.msn.com Wed Oct 9 09:35:41 2002 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 9 Oct 2002 04:35:41 -0400 Subject: [Spambayes] More ratio experiments In-Reply-To: <3DA3D63D.5090203@hooft.net> Message-ID:

[T. Alexander Popiel]
> ...
> The only thing that I changed was the number of sets I was using for
> the timcv.py (from 5 in the original experiment to 8, 10, and 15).
> At 10 this more than doubled and at 15 it more than tripled the
> training set size for each run, keeping the testing set size the same.

[Rob W.W. Hooft]
> Nope, it is the other way around: you still train on s+h=250 messages
> all the time, you're just testing the scores of more messages.

It can be confusing if you didn't write these test drivers, which gives me a small advantage here. Alex left some of the test driver output intact, which really helps:

> The original experiment (with 5 sets) produced:
>
> -> tested 50 hams & 200 spams against 200 hams & 800 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 800 hams & 200 spams

...

> The new experiment (with 8 sets) produced:
>
> -> tested 50 hams & 200 spams against 350 hams & 1400 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 1400 hams & 350 spams

So all can see that the # of ham & spam trained on really did increase; that's why the test driver prints this stuff, and the summary file retains it, of course.
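The two drivers differ in how they pair training sets with test sets; a minimal sketch (function names here are illustrative, not the real driver APIs, and the pairing logic is a paraphrase of how timcv and timtest are described on this list):

```python
# Illustrative sketch of the set pairings the two drivers produce,
# expressed as (train_set_indices, test_set_indices) tuples.

def timcv_runs(n):
    # timcv-style: n classifiers; each trains on n-1 sets and predicts
    # against the sole remaining set (leave-one-set-out).
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def timtest_runs(n):
    # timtest-style: n classifiers; each trains on 1 set and predicts
    # against each of the n-1 remaining sets (a grid of runs).
    return [([i], [j]) for i in range(n) for j in range(n) if j != i]
```

With n=5, timcv does 5 runs training on 4 sets each, while timtest does 20 predictions training on only 1 set each, which is why timtest is the much harder test.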
If you're running timcv with n sets:

    n classifiers are built

    1 run is done with each classifier

    each classifier is trained on n-1 sets, and predicts against the
    sole remaining set (the set not used to train the classifier)

    mboxtest does the same

    timcv should not be used for central limit tests (it requires
    incremental learning and unlearning)

If you're running timtest with n sets:

    n classifiers are built

    n-1 runs are done with each classifier

    each classifier is trained on 1 set, and predicts against each of
    the n-1 remaining sets (those not used to train the classifier)

    central limit tests are fine with timtest

    this is a much harder test than timcv, because it trains on less
    data, and makes each classifier predict against n-1 times more
    data than it's been taught about

From jm@jmason.org Wed Oct 9 13:21:11 2002 From: jm@jmason.org (Justin Mason) Date: Wed, 09 Oct 2002 13:21:11 +0100 Subject: [Spambayes] fully-public corpus of mail available Message-ID: <20021009122116.6EB2416F03@jmason.org>

(Please feel free to forward this message to other possibly-interested parties.)

Hi all,

One of the big problems working with spam classification is finding good mail to test with. There are few public corpora available; Ion Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately this does not contain all of the mail message data, so would not be useful to a SpamAssassin-style system (which relies heavily on header data), for example.

Another effect of not having a common, shared corpus is the difficulty this introduces in comparing accuracy rates between spam filter software; since everyone tests using different corpora, statistics can be unportable as a result.

Building public corpora is difficult, as it typically involves saving your own (classified) mail. This brings privacy problems, as your mail senders may not wish to see this made public.
But what the heck, that's what I've done anyway ;) Here's a public corpus I've assembled from my own corpora, removing messages which were not public in the first place. Please feel free to download it and use it for spam-filter development. It's quite small, but should be big enough for use as a reference corpus, at least, so that hit-rate statistics can be compared across tools. Hope it helps. It lives here: http://spamassassin.org/publiccorpus/ and here's the README.txt: Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

  - All headers are reproduced in full.  Some address obfuscation has
    taken place; hostnames in some cases have been replaced with
    "example.com", which should have a valid MX record (if I recall
    correctly).  In most cases though, the headers appear as they were
    received.

  - All of these messages were posted to public fora, were sent to me in
    the knowledge that they may be made public, were sent by me, or
    originated as newsletters from public news web sites.

  - Copyright for the text in the messages remains with the original
    senders.

OK, now onto the corpus description. It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 350 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any
    spammish signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

The corpora are prefixed with "200210", because that's the date when I assembled them, so it's as good a version string as anything else ;) . They are compressed using "bzip2". This corpus lives at http://spamassassin.org/publiccorpus/ . Mail jm - public - corpus AT jmason dot org if you have questions, or to donate mail.
(Oct 9 2002 jm) From popiel@wolfskeep.com Wed Oct 9 16:37:43 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 09 Oct 2002 08:37:43 -0700 Subject: [Spambayes] More ratio experiments In-Reply-To: Message from "Tim Peters" of "Wed, 09 Oct 2002 04:35:41 EDT." References: Message-ID: <20021009153743.9EC45F54A@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >Alex left some of the test driver output intact All of the test driver output is available at http://www.wolfskeep.com/~popiel/spambayes/ratio2 just in case someone wants to look at it. Histograms, more verbose indications of the training and testing cycles, false positive excerpts, and everything. After sleeping on the data (yes, my bedroom is over the computer rooms ;-) ), some more things are niggling at me... like the error rates (specifically fn) going _UP_ as more training data is added for the very low ham:spam ratios. I'm guessing that that's due to the classifier seeming to discover that yes, there _is_ ham in the universe, and maybe more stuff should be classified as ham. I'm also wondering if there's a point at which dropping the ham:spam ratio starts increasing the fn rate, holding the training set size constant (this I can test), and if there's an amount of training data above which low ham:spam is no longer good, or even bad (this I don't have enough data to test). Lastly, I'm wondering if I should even bother with the non-central-limit stuff anymore, since the central-limit stuff seems from other reports to be more interesting. (I really ought to do comparisons among the 7 extant classifiers (default, clt[123] x {cl,rms}pik) on my data... heck, it might even be getting close to shootout time again... - Alex From popiel@wolfskeep.com Wed Oct 9 19:04:39 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Wed, 09 Oct 2002 11:04:39 -0700 Subject: [Spambayes] Modifications to timcv.py Message-ID: <20021009180439.62465F54A@cashew.wolfskeep.com> The inability to use timcv.py with the central limit stuff annoyed me. I offer this patch to correct that problem... - Alex

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.9
diff -u -r1.9 timcv.py
--- timcv.py    24 Sep 2002 05:37:11 -0000      1.9
+++ timcv.py    9 Oct 2002 17:59:56 -0000
@@ -26,6 +26,15 @@
 at least on of {--ham-keep, --spam-keep} is specified.  If -s isn't
 specifed, the seed is taken from current time.
 
+If you want full retraining for each classifier (because untrain and
+retrain don't work),
+
+    --trainstyle arg
+        Use one of the following training styles:
+            partial: train on everything, then untrain individual sets
+            full: train from scratch on only applicable sets
+        partial is the historical (and default) behaviour.
+
 In addition, an attempt is made to merge bayescustomize.ini into the
 options.  If that exists, it can be used to change the settings in
 Options.options.
 """
@@ -48,7 +57,7 @@
     print >> sys.stderr, __doc__ % globals()
     sys.exit(code)
 
-def drive(nsets):
+def drive(nsets, trainstyle):
     print options.display()
 
     hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
@@ -67,16 +76,28 @@
             spamstream = msgs.SpamStream(s, [s])
 
             if i > 0:
-                # Forget this set.
-                d.untrain(hamstream, spamstream)
+                if trainstyle == 'partial':
+                    # Forget this set.
+                    d.untrain(hamstream, spamstream)
+                elif trainstyle == 'full':
+                    # Retrain with the other sets.
+                    hname = "%s-%d, except %d" % (hamdirs[0], nsets, i + 1)
+                    h2 = hamdirs * 1
+                    del h2[i]
+                    sname = "%s-%d, except %d" % (spamdirs[0], nsets, i + 1)
+                    s2 = spamdirs * 1
+                    del s2[i]
+                    d.new_classifier()
+                    d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
 
             # Predict this set.
             d.test(hamstream, spamstream)
             d.finishtest()
 
             if i < nsets - 1:
-                # Add this set back in.
-                d.train(hamstream, spamstream)
+                if trainstyle == 'partial':
+                    # Add this set back in.
+                    d.train(hamstream, spamstream)
 
     d.alldone()
 
@@ -85,11 +106,12 @@
     try:
         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
-                                   ['ham-keep=', 'spam-keep='])
+                                   ['ham-keep=', 'spam-keep=', 'trainstyle='])
     except getopt.error, msg:
         usage(1, msg)
 
     nsets = seed = hamkeep = spamkeep = None
+    trainstyle = 'partial'
     for opt, arg in opts:
         if opt == '-h':
             usage(0)
@@ -101,14 +123,18 @@
             hamkeep = int(arg)
         elif opt == '--spam-keep':
             spamkeep = int(arg)
+        elif opt == '--trainstyle':
+            trainstyle = arg
 
     if args:
         usage(1, "Positional arguments not supported")
     if nsets is None:
         usage(1, "-n is required")
+    if trainstyle not in ('partial', 'full'):
+        usage(1, "Unknown train style '%s'" % trainstyle)
 
     msgs.setparms(hamkeep, spamkeep, seed)
-    drive(nsets)
+    drive(nsets, trainstyle)
 
 if __name__ == "__main__":
     main()

From tim@zope.com Wed Oct 9 19:30:03 2002 From: tim@zope.com (Tim Peters) Date: Wed, 9 Oct 2002 14:30:03 -0400 Subject: [Spambayes] Modifications to timcv.py In-Reply-To: <20021009180439.62465F54A@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > The inability to use timcv.py with the central limit stuff > annoyed me. I offer this patch to correct that problem... Thank you! It annoys me too. I can't work on this now, but will check this in (or a minor variant) tonight, when I can test it first. PS: Don't worry -- I won't tell anyone you're writing Python code. From bkc@murkworks.com Wed Oct 9 21:09:50 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 09 Oct 2002 16:09:50 -0400 Subject: [Spambayes] runratio with timcv.py Message-ID: <3DA45489.14551.B29C989@localhost> Hmm, Well I didn't get tim's message about "using timcv.py for incremental is bad" until after I started my ratio testing. This is timcv.py on -n 10 with 1200 messages total per set.
use_central_limit: True

I also have timtest.py running, I have no idea if runratio.sh will handle its output.. Will post when that finishes running. Also, I modified runratio.sh to handle arbitrary list of spam/ham count steps.. want me to post?

(last stat line)
-> tested 1050 hams & 150 spams against 9450 hams & 1350 spams

And the table

ham-spam:  150-1050 300-900 450-750 600-600 750-450 900-300 1050-150
fp tot:          30      39      45      48      48      50       40
fp %:          2.00    1.30    1.00    0.80    0.64    0.56     0.38
fn tot:          14      20      17      19      14      16       15
fn %:          0.13    0.22    0.23    0.32    0.31    0.53     1.00
h mean:        3.31    2.36    1.93    1.74    1.53    1.36     1.08
h sdev:       13.41   10.95    9.86    9.47    8.86    8.26     7.31
s mean:       99.37   99.16   99.02   98.74   98.57   98.26    97.11
s sdev:        5.61    6.50    7.04    8.08    8.54    9.46    12.29
mean diff:    96.06   96.80   97.09   97.00   97.04   96.90    96.03
k:             5.05    5.55    5.74    5.53    5.58    5.47     4.90

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From bkc@murkworks.com Wed Oct 9 22:00:03 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 09 Oct 2002 17:00:03 -0400 Subject: [Spambayes] runratio with timtest.py Message-ID: <3DA4604E.10211.B57C38C@localhost>

use_central_limit: true

this is runratio.sh, but using timtest.py

(last stat line is)
-> tested 1050 hams & 150 spams against 1050 hams & 150 spams

ham-spam:  150-1050 300-900 450-750 600-600 750-450 900-300 1050-150
fp tot:         237     213     208     209     168     119       59
fp %:          8.40    4.37    2.72    1.69    1.01    0.58     0.20
fn tot:          34      42      57      87     130     186      181
fn %:          0.06    0.16    0.28    0.54    1.18    2.57     7.23
h mean:       16.84    7.22    4.53    3.38    2.43    1.68     0.96
h sdev:       26.45   19.08   15.16   13.02   10.96    9.08     6.82
s mean:       99.66   99.18   98.71   97.93   96.71   94.66    88.92
s sdev:        4.11    6.34    7.92   10.05   12.62   16.17    23.10
mean diff:    82.82   91.96   94.18   94.55   94.28   92.98    87.96
k:             2.71    3.62    4.08    4.10    4.00    3.68     2.94

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From richie@entrian.com Wed Oct 9 22:08:44 2002 From: richie@entrian.com (Richie Hindle) Date:
Wed, 09 Oct 2002 22:08:44 +0100 Subject: [Spambayes] quick hammie poll In-Reply-To: References: Message-ID: Hi Neale, > RSVP, but only if you use hammie :) I use it to create the pickle, but not to classify mails (I use pop3proxy for that, surprise surprise!) > 1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d > option)? Pickle (but mostly out of laziness, not for any concrete reason). > 2. How big is your store file? 4,775,607 bytes. That's after training on around 4,200 messages. > 3. Would you be able and willing to run an XML-RPC server process all > the time for mail scoring? Sure. I have to start and stop pop3proxy anyway, so that would be no problem (but for exactly that reason I'm probably not the sort of user you should be asking...) -- Richie Hindle richie@entrian.com From mhammond@skippinet.com.au Wed Oct 9 23:38:42 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 10 Oct 2002 08:38:42 +1000 Subject: [Spambayes] Demo Outlook Plugin available Message-ID: Hi all, I just released new win32all builds that contain support for Microsoft Outlook Extensions. If you install win32all-149, 150 or the most recent CVS snapshot build, you will find a file win32com\demos\outlookAddin.py - please see the comments in the file for information on how to install and test this plugin. Please let me know if you try it. Also, feel free to contact me if you need some help turning it into something useful. Mark. From jbublitz@nwinternet.com Wed Oct 9 23:55:47 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Wed, 09 Oct 2002 15:55:47 -0700 (PDT) Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: On 08-Oct-02 Tim Peters wrote: > It's hard to know what to make of this, especially in light of > the claim that Gary-combining has been proven to be the most > sensitive possible test for rejecting the hypothesis that a > collection of probs is uniformly distributed. 
At least in this
> test, Paul-combining seemed far more sensitive (even when the
> data is random).

> Intuitively, it *seems* like it would be good to get something
> not so insanely sensitive to random input as Paul-combining, but
> more sensitive to overwhelming amounts of evidence than
> Gary-combining.  Even forcing 50 spamprobs of 0.99, the latter
> only moves up to an average of 0.7:

Since my last msg was incomprehensible, I'm just going to attach my code at the bottom and refer to it. Graham's original score calculation - product/(product + inverseProduct) - does give the kind of score distribution you described. If you substitute Gary Robinson's suggestion (see below - last few lines), the score distribution does spread out to the center a little bit. You can get Robinson's scoring calculation (as below) to produce a normal distribution around the mean ham or spam score if you either:

  a. Increase VECTOR_SIZE (max_discriminators??) - a value of around 100
     seems to do pretty well

  b. Instead of selecting the most extreme N word probabilities from the
     msg being tested, select the words randomly from the list of words
     in the msg (not shown in code below).  You immediately
     (VECTOR_SIZE = 15) get a normal distribution around the means, but
     accuracy sucks until you select 75 to 100 words/msg randomly.

Neither (a) nor (b) works as well as the 15 most extreme words on my test data. Also, Robinson's calculation doesn't produce ham at 0.99 or spam at 0.01 - in fact the msgs that I had a hard time classifying manually are (mostly) the ones that fall near the cutoff. Note also that the code below will produce an unpredictable score if the msg contains only 30 .01 words and 30 .99 words. It depends on how pairs.sort(...) handles ties. Making the limits asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to work very well. The other thing that helps make the scores extreme in actual use is that the distribution of word probabilities is extreme.
For my corpora using the code below I get 169378 unique tokens (from 24000 msgs, 50% spam):

    Probability     Number of Tokens   % of Total
    [0.00, 0.01)          46329          27.4%  (never in spam)
    [0.99, 1.00)         104367          61.7%  (never in ham)
                                         -----
                                         89.1%

From looking at failures (and assuming passes behave similarly) the 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the work, which makes sense, since those are the most commonly used words. My experience has been that the tail tips of the score distribution maintain about the same distance from the mean score no matter what you do. If you improve the shape of the distribution (make it look more normal), you move the tails about the same distance as the distribution has spread out, and the ham and spam tails overlap more and more, increasing the fp/fn rates. The little testing I did on Spambayes (last week's CVS) seemed to show the same effect. For the code below, if I train on 8000 msgs (50% spam) and then test 200, retrain on those 200, and repeat for 16000 msgs, I get 4 fns (3 are identical msgs from the same sender with different dates, all are Klez msgs) and 1 fp (an ISP msg "Routine Service Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The failures all scored in the range [0.495, 0.511] (cutoff at 0.50). I ran the SA Corpus today also and don't get any failures if I train on 8K of my msgs and 50/100 of their msgs (worse results under other conditions), but the sample sizes there are too small to do an adequate training sample and have enough test data to have confidence in the results. I can post those results if anyone is interested. Graham's method was basically designed to produce extreme scores, and the distribution of words in the data seems to reinforce that. If it's of any use to anybody (it's certainly beyond me), both the distribution of msg scores and distribution of word probabilities look like exponential or Weibull distributions.
(They're "bathtub" curves, if anyone is familiar with reliability statistics). This is all based on my data, which is not the same as your data. YMMV. Jim

import re

# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

FREQUENCY_THRESHHOLD = 1    # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001          # was 0.01
BAD_PROB = 0.9999           # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5          # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
# eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
num_re = re.compile(r"^[0-9]+")

class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after testing the msg)
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg, either in building
    # the training sample or when testing a msg (can process entire
    # msg or one part of msg at a time); 'data' is a string
    def process(self, data):
        tokens = TOKEN_RE.findall(str(data))
        if not len(tokens):
            return
        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len(token) > 20) \
               or (pn1_re.search(token) != None) \
               or (pn2_re.search(token) != None) \
               or (num_re.search(token) != None):
                deletes += 1
                continue
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        # count tokens, not msgs
        self.count += len(tokens) - deletes

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability; run once after
    # training from the ham/spam Corpus instances; the ham/spam Corpus
    # instances can be destroyed (after saving?) after 'build' is run
    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        # print ngood, nbad, float(nbad)/float(ngood)
        for token in good.keys() + bad.keys():  # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS * good.get(token, 0)
                b = BAD_BIAS * bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB, min(BAD_PROB, badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:VECTOR_SIZE]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        # Graham scoring - was:
        #   return pairs, significant, product/(product + inverseProduct)
        # 'pairs' and 'significant' added to assist data logging, evaluation
        # Robinson scoring - don't know why, but this works great
        n = float(len(significant))  # n could be < VECTOR_SIZE
        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except:
            S = 0.99
        return pairs, significant, S

From tim.one@comcast.net Thu Oct 10 01:34:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 09 Oct 2002 20:34:15 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Tim]
> ...
> Intuitively, it *seems* like it would be good to get something not so
> insanely sensitive to random input as Paul-combining, but more
> sensitive to overwhelming amounts of evidence than Gary-combining.
So there's a new option,

    [Classifier]
    use_tim_combining: True

The comments (from Options.py) explain it:

    # For the default scheme, use "tim-combining" of probabilities.  This
    # has no effect under the central-limit schemes.  Tim-combining is a
    # kind of cross between Paul Graham's and Gary Robinson's combining
    # schemes.  Unlike Paul's, it's never crazy-certain, and compared to
    # Gary's, in Tim's tests it greatly increased the spread between mean
    # ham-scores and spam-scores, while simultaneously decreasing the
    # variance of both.  Tim needed a higher spam_cutoff value for best
    # results, but spam_cutoff is less touchy than under Gary-combining.
    use_tim_combining: False

"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores). Don't use a central-limit scheme with this (it has no effect on those). If you test it, use whatever variations on the "all default" scheme you usually use, but it will probably help to boost spam_cutoff. Note that the default max_discriminators is still 150, and that's what I used below.
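As a minimal sketch of that description (not the code from the classifier itself; the function name is mine), tim-combining of a list of spamprobs looks like:

```python
import math

def tim_combine(probs):
    # S = geometric mean of the spamprobs, H = geometric mean of (1 - p);
    # the score is S/(S+H).  Logs are used to avoid underflow on long lists.
    n = len(probs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)

print(tim_combine([0.5] * 20))    # no evidence either way -> 0.5
print(tim_combine([0.99] * 20))   # extreme, but never 1.0000000000000
```

Note how a run of uniform 0.5 probabilities yields exactly 0.5, while a run of 0.99s yields 0.99 rather than crazy-certain 1.0.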
Here's a 10-set cross-validation run on my data, restricted to 100 ham and 100 spam per set, with all defaults, except

                       before  after
                       ------  -----
    use_tim_combining  False   True
    spam_cutoff        0.55    0.615

-> tested 100 hams & 100 spams against 900 hams & 900 spams
[ditto 19 times]

false positive percentages
    0.000  0.000  tied
    1.000  0.000  won  -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 2 to 1  won  -50.00%
mean fp % went from 0.2 to 0.1  won  -50.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 1 to 1  tied
mean fn % went from 0.1 to 0.1  tied

The real story here is in the score distributions; contrary to what the comment said above, the ham-score variance increased with this little data:

ham mean                      ham sdev
 30.63   18.80  -38.62%        6.03    6.83  +13.27%
 29.31   17.35  -40.81%        5.48    6.84  +24.82%
 29.96   18.50  -38.25%        6.95    9.02  +29.78%
 29.66   18.12  -38.91%        5.89    6.81  +15.62%
 29.51   17.34  -41.24%        5.73    6.71  +17.10%
 29.40   17.43  -40.71%        5.73    6.61  +15.36%
 29.75   17.74  -40.37%        5.76    6.96  +20.83%
 29.71   18.17  -38.84%        5.97    6.48   +8.54%
 31.98   20.41  -36.18%        5.96    8.02  +34.56%
 29.83   18.11  -39.29%        4.75    5.41  +13.89%

ham mean and sdev for all runs
 29.97   18.20  -39.27%        5.90    7.08  +20.00%

spam mean                     spam sdev
 79.23   88.38  +11.55%        6.96    5.52  -20.69%
 79.40   88.70  +11.71%        7.00    5.64  -19.43%
 78.68   88.06  +11.92%        6.69    5.13  -23.32%
 79.65   89.01  +11.75%        7.20    5.22  -27.50%
 79.91   88.87  +11.21%        6.35    4.67  -26.46%
 80.47   89.16  +10.80%        7.22    6.06  -16.07%
 80.94   89.78  +10.92%        6.60    4.45  -32.58%
 80.30   89.41  +11.34%        6.95    5.49  -21.01%
 78.54   87.70  +11.66%        7.30    6.45  -11.64%
 80.06   89.06  +11.24%        6.98    5.43  -22.21%

spam mean and sdev for all runs
 79.72   88.81  +11.40%        6.97    5.47  -21.52%
ham/spam mean difference: 49.75 70.61 +20.86 So before, the score equidistant from both means was 52.78, at 3.87 sdevs from each; after, it was 58.03, at 5.63 sdevs from each. The populations are much better separated by this measure. Histograms before: -> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 -> min 13.521; median 29.6919; max 60.8937 * = 2 items ... 13 2 * 14 0 15 2 * 16 8 **** 17 4 ** 18 9 ***** 19 17 ********* 20 14 ******* 21 16 ******** 22 24 ************ 23 38 ******************* 24 47 ************************ 25 62 ******************************* 26 65 ********************************* 27 69 *********************************** 28 73 ************************************* 29 70 *********************************** 30 76 ************************************** 31 70 *********************************** 32 61 ******************************* 33 51 ************************** 34 50 ************************* 35 34 ***************** 36 30 *************** 37 27 ************** 38 18 ********* 39 12 ****** 40 11 ****** 41 13 ******* 42 2 * 43 5 *** 44 8 **** 45 2 * 46 1 * 47 3 ** 48 1 * 49 0 50 3 ** 51 0 52 0 53 0 54 0 55 1 * 56 0 57 0 58 0 59 0 60 1 * ... -> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 -> min 52.3428; median 79.9799; max 98.1879 * = 2 items ... 
52 1 * 53 0 54 0 55 0 56 3 ** 57 1 * 58 0 59 1 * 60 4 ** 61 4 ** 62 4 ** 63 3 ** 64 4 ** 65 7 **** 66 9 ***** 67 10 ***** 68 13 ******* 69 16 ******** 70 26 ************* 71 18 ********* 72 29 *************** 73 35 ****************** 74 40 ******************** 75 39 ******************** 76 56 **************************** 77 52 ************************** 78 50 ************************* 79 76 ************************************** 80 60 ****************************** 81 77 *************************************** 82 45 *********************** 83 61 ******************************* 84 50 ************************* 85 43 ********************** 86 41 ********************* 87 33 ***************** 88 19 ********** 89 11 ****** 90 11 ****** 91 8 **** 92 2 * 93 9 ***** 94 4 ** 95 9 ***** 96 2 * 97 11 ****** 98 3 ** 99 0 Histograms after: -> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 -> min 5.6946; median 17.1757; max 73.1302 * = 2 items ... 5 1 * 6 13 ******* 7 16 ******** 8 25 ************* 9 22 *********** 10 37 ******************* 11 45 *********************** 12 56 **************************** 13 70 *********************************** 14 61 ******************************* 15 66 ********************************* 16 79 **************************************** 17 63 ******************************** 18 59 ****************************** 19 59 ****************************** 20 56 **************************** 21 47 ************************ 22 36 ****************** 23 37 ******************* 24 32 **************** 25 9 ***** 26 20 ********** 27 17 ********* 28 8 **** 29 7 **** 30 11 ****** 31 6 *** 32 7 **** 33 5 *** 34 4 ** 35 2 * 36 2 * 37 6 *** 38 1 * 39 0 40 3 ** 41 3 ** 42 0 43 1 * 44 1 * 45 1 * 46 0 47 1 * 48 0 49 0 50 2 * 51 1 * 52 0 53 0 54 0 55 0 56 0 57 0 58 0 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 1 * -> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 -> min 54.9382; median 89.5188; max 98.3805 * = 2 items 
... 54 1 * 55 0 56 0 57 0 58 0 59 0 60 0 61 0 62 0 63 1 * 64 3 ** 65 0 66 1 * 67 0 68 2 * 69 2 * 70 3 ** 71 3 ** 72 2 * 73 2 * 74 4 ** 75 4 ** 76 6 *** 77 8 **** 78 8 **** 79 6 *** 80 12 ****** 81 25 ************* 82 26 ************* 83 25 ************* 84 39 ******************** 85 58 ***************************** 86 70 *********************************** 87 64 ******************************** 88 74 ************************************* 89 106 ***************************************************** 90 85 ******************************************* 91 62 ******************************* 92 86 ******************************************* 93 79 **************************************** 94 37 ******************* 95 23 ************ 96 42 ********************* 97 25 ************* 98 6 *** 99 0 There are snaky tails in either case, but "the middle ground" here is larger, sparser, and still contains the errors. Across my full test data, which I actually ran first, you can ignore the "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the overall results would have been virtually identical had I boosted spam_cutoff in the second run (recall that I can't demonstrate an improvement on this data anymore! I can only determine whether something is a disaster, and this ain't). -> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams [ditto 19 times] ... 
false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 2 to 6  lost  +200.00%
mean fp % went from 0.01 to 0.03  lost  +200.00%

false negative percentages
    0.000  0.000  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.071  0.071  tied
    0.143  0.071  won  -50.35%
    0.143  0.000  won  -100.00%
    0.143  0.143  tied
    0.143  0.000  won  -100.00%
    0.071  0.000  won  -100.00%
    0.000  0.000  tied

won   4 times
tied  6 times
lost  0 times

total unique fn went from 11 to 5  won  -54.55%
mean fn % went from 0.0785714285714 to 0.0357142857143  won  -54.55%

ham mean                      ham sdev
 25.65   10.68  -58.36%        5.67    5.44   -4.06%
 25.61   10.68  -58.30%        5.50    5.29   -3.82%
 25.57   10.68  -58.23%        5.67    5.49   -3.17%
 25.66   10.71  -58.26%        5.54    5.27   -4.87%
 25.42   10.55  -58.50%        5.72    5.71   -0.17%
 25.51   10.43  -59.11%        5.39    5.11   -5.19%
 25.65   10.40  -59.45%        5.59    5.29   -5.37%
 25.61   10.51  -58.96%        5.41    5.21   -3.70%
 25.84   10.80  -58.20%        5.48    5.30   -3.28%
 25.81   10.85  -57.96%        5.81    5.73   -1.38%

ham mean and sdev for all runs
 25.63   10.63  -58.53%        5.58    5.39   -3.41%

spam mean                     spam sdev
 83.86   93.17  +11.10%        7.09    4.55  -35.83%
 83.64   93.16  +11.38%        6.83    4.52  -33.82%
 83.27   92.91  +11.58%        6.81    4.52  -33.63%
 83.82   93.14  +11.12%        6.88    4.67  -32.12%
 83.89   93.29  +11.21%        6.65    4.56  -31.43%
 83.78   93.11  +11.14%        6.96    4.72  -32.18%
 83.42   93.00  +11.48%        6.82    4.74  -30.50%
 83.86   93.29  +11.24%        6.71    4.55  -32.19%
 83.88   93.22  +11.13%        6.98    4.71  -32.52%
 83.75   93.28  +11.38%        6.65    4.32  -35.04%

spam mean and sdev for all runs
 83.72   93.16  +11.28%        6.84    4.59  -32.89%

ham/spam mean difference: 58.09 82.53 +24.44

So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to 55.20 at 8.27 sdevs from each. That's big.
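The "equidistant score" figures quoted in these runs can be checked directly; this is just the algebra, with the means and sdevs read off the tables above:

```python
# Find the score x equally many sdevs above the ham mean and below the
# spam mean: (x - ham_mean)/ham_sdev == (spam_mean - x)/spam_sdev.
def equidistant(ham_mean, ham_sdev, spam_mean, spam_sdev):
    x = (ham_mean * spam_sdev + spam_mean * ham_sdev) / (ham_sdev + spam_sdev)
    return x, (x - ham_mean) / ham_sdev

print(equidistant(25.63, 5.58, 83.72, 6.84))  # ~ (51.73, 4.68)  before
print(equidistant(10.63, 5.39, 93.16, 4.59))  # ~ (55.20, 8.27)  after
```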
The "after" histograms had 200 buckets in this run: -> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 -> min 0.281945; median 9.69929; max 81.9673 * = 17 items 0.0 7 * 0.5 13 * 1.0 21 ** 1.5 41 *** 2.0 86 ****** 2.5 166 ********** 3.0 239 *************** 3.5 326 ******************** 4.0 466 **************************** 4.5 554 ********************************* 5.0 642 ************************************** 5.5 701 ****************************************** 6.0 793 *********************************************** 6.5 804 ************************************************ 7.0 933 ******************************************************* 7.5 972 ********************************************************** 8.0 997 *********************************************************** 8.5 934 ******************************************************* 9.0 947 ******************************************************** 9.5 939 ******************************************************** 10.0 839 ************************************************** 10.5 786 *********************************************** 11.0 752 ********************************************* 11.5 760 ********************************************* 12.0 636 ************************************** 12.5 606 ************************************ 13.0 554 ********************************* 13.5 483 ***************************** 14.0 461 **************************** 14.5 399 ************************ 15.0 360 ********************** 15.5 317 ******************* 16.0 275 ***************** 16.5 224 ************** 17.0 193 ************ 17.5 169 ********** 18.0 172 *********** 18.5 154 ********** 19.0 153 ********* 19.5 92 ****** 20.0 104 ******* 20.5 99 ****** 21.0 74 ***** 21.5 73 ***** 22.0 73 ***** 22.5 50 *** 23.0 38 *** 23.5 50 *** 24.0 38 *** 24.5 34 ** 25.0 26 ** 25.5 39 *** 26.0 24 ** 26.5 34 ** 27.0 18 ** 27.5 15 * 28.0 20 ** 28.5 15 * 29.0 14 * 29.5 15 * 30.0 12 * 30.5 15 * 31.0 14 * 31.5 10 * 32.0 12 * 32.5 6 * 33.0 10 * 33.5 4 
* 34.0 8 * 34.5 5 * 35.0 5 * 35.5 6 * 36.0 7 * 36.5 4 * 37.0 2 * 37.5 3 * 38.0 1 * 38.5 4 * 39.0 6 * 39.5 2 * 40.0 2 * 40.5 5 * 41.0 0 41.5 2 * 42.0 3 * 42.5 3 * 43.0 1 * 43.5 2 * 44.0 1 * 44.5 2 * 45.0 1 * 45.5 1 * 46.0 2 * 46.5 0 47.0 3 * 47.5 0 48.0 1 * 48.5 1 * 49.0 1 * 49.5 0 50.0 1 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 1 * 54.0 1 * 54.5 2 * 55.0 0 55.5 0 56.0 1 * 56.5 1 * 57.0 0 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 0 60.5 0 61.0 1 * 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 1 * the lady with the long & obnoxious employer-generated sig 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 1 * the verbatim quote of a long Nigerian-scam spam ... -> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 -> min 24.3497; median 93.8141; max 99.6769 * = 15 items ... 24.0 1 * not really sure -- it's a giant base64-encoded plain text file 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 1 * the spam with the uuencoded body we throw away 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 1 * Hello, my Name is BlackIntrepid 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 1 * unclear; a collection of webmaster links 54.0 1 * Susan makes a propsal (sic) to Tim 54.5 0 55.0 1 * 55.5 0 56.0 0 56.5 1 * 57.0 2 * 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 2 * 61.0 1 * 61.5 1 * 62.0 0 62.5 1 * 63.0 1 * 63.5 0 64.0 1 * 64.5 1 * 65.0 0 65.5 1 * 66.0 1 * 66.5 2 * 67.0 4 * 67.5 2 * 68.0 0 68.5 1 * 69.0 0 69.5 3 * 70.0 1 * 70.5 5 * 71.0 5 * 71.5 3 * 72.0 4 * 72.5 3 * 73.0 3 * 73.5 6 * 74.0 3 * 74.5 4 * 75.0 8 * 
75.5 8 * 76.0 10 * 76.5 10 * 77.0 10 * 77.5 17 ** 78.0 14 * 78.5 27 ** 79.0 16 ** 79.5 23 ** 80.0 28 ** 80.5 29 ** 81.0 37 *** 81.5 37 *** 82.0 46 **** 82.5 55 **** 83.0 47 **** 83.5 53 **** 84.0 58 **** 84.5 68 ***** 85.0 86 ****** 85.5 118 ******** 86.0 135 ********* 86.5 159 *********** 87.0 165 *********** 87.5 178 ************ 88.0 209 ************** 88.5 231 **************** 89.0 299 ******************** 89.5 391 *************************** 90.0 425 ***************************** 90.5 402 *************************** 91.0 501 ********************************** 91.5 582 *************************************** 92.0 636 ******************************************* 92.5 667 ********************************************* 93.0 713 ************************************************ 93.5 685 ********************************************** 94.0 610 ***************************************** 94.5 621 ****************************************** 95.0 721 ************************************************* 95.5 735 ************************************************* 96.0 870 ********************************************************** 96.5 742 ************************************************** 97.0 449 ****************************** 97.5 447 ****************************** 98.0 556 ************************************** 98.5 561 ************************************** 99.0 264 ****************** 99.5 171 ************ The mistakes are all familiar; the good news is that "the normal cases" are far removed from what might plausibly be called a middle ground. For example, if we called the region from 40 thru 70 here "the middle ground", and kicked those out for manual review, there would be very few msgs to review, but they would contain almost all the mistakes. How does this do on your data? I'm in favor of what works.
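[Editor's aside: the review policy Tim floats here -- treat scores from 40 thru 70 as a middle ground and hold those messages for a human -- reduces to a few lines. A hedged sketch; the function name and the thresholds-as-parameters are mine, and the 40/70 band is the one floated above, not a project default:]

```python
def route(score, low=40.0, high=70.0):
    """Route a message by its 0-100 score: confident ham below `low`,
    confident spam above `high`, and everything in the middle band
    is kicked out for manual review."""
    if score < low:
        return "ham"
    if score > high:
        return "spam"
    return "review"

# Per the histograms above, almost every message scores far outside
# the band, so few land in "review" -- but the mistakes mostly do.
```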
From grobinson@transpose.com Thu Oct 10 02:06:56 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:06:56 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: The thing about the geometric mean is that it is much more sensitive to numbers near 0, so the S/(S+H) technique is biased in that way. If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S(S+H). That would remove that bias. It wouldn't be invoking that optimality theorem, but whatever works... It really seems, as a matter of being educated, that the arithmetic approach is worth trying if it doesn't take a lot of trouble to try it. >"but more sensitive to overwhelming amounts of evidence than Gary-combining" From the email you sent at 1:02PM yesterday: 0.40 0 0.45 2 * 0.50 412 ********* 0.55 3068 ************************************************************* 0.60 1447 ***************************** 0.65 71 ** 0.70 0 One thing I'd like to be more clear on. If I understand the experiment correctly you set 10 to .99 and 40 were random. What percentage actually ended up as > .5, without regard to HOW MUCH over .5? > It's hard to know what to make of this, especially in light of the claim > that Gary-combining has been proven to be the most sensitive possible test > for rejecting the hypothesis that a collection of probs is uniformly > distributed. It's not the (S-H)/(S+H) that is the most sensitive (under certain conditions), it's that the geometric mean approach for computing S gives a result that is MONOTONIC WITH a calculation which is the most sensitive. The real technique would take S and feed it into an inverse chi-square function with (in this experiment) 100 degrees of freedom. The output (roughly speaking) would be the probability that that S (or a more extreme one) might have occurred by chance alone. Call these numbers S' and H' for S and H respectively.
The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') is > 0 (unless I've made some error). So, as a binary indicator, the two are equivalent. However, if you used S' and H', you would see something more like real probabilities that would probably be of magnitudes that would be more attractive to you. You could probably use a table to approximate the inverse chi-square calc rather than actually doing the computations all the time. I didn't suggest doing that, at first, because I was interested in providing a binary indicator and wanting to keep things simple -- and from the POV of a binary indicator, it doesn't make any difference. So, if it happens that you feel like taking the time to go "all the way" with this approach, I would suggest actually computing S' and H' and seeing what happens. I think you would like the results better -- I just didn't suggest it at first because I didn't know the spread would be of such interest and I wanted to keep things simple. I think this would work better than the S/(S+H) approach, because if you use geometric means, it's more sensitive to one condition than the other, and if you use arithmetic means, you don't invoke the optimality theorem. Of course, this is ALL speculative. But the probabilities involved will DEFINITELY be of greater magnitude, and so a better-defined spread, if the inverse chi-square is used. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Wed, 09 Oct 2002 20:34:15 -0400 > To: SpamBayes > Cc: Gary Robinson > Subject: RE: [Spambayes] spamprob combining > > [Tim] >> ... >> Intuitively, it *seems* like it would be good to get something not so >> insanely sensitive to random input as Paul-combining, but more >> sensitive to overwhelming amounts of evidence than Gary-combining.
> > So there's a new option, > > [Classifier] > use_tim_combining: True > > The comments (from Options.py) explain it: > > # For the default scheme, use "tim-combining" of probabilities. This > # has no effect under the central-limit schemes. Tim-combining is a > # kind of cross between Paul Graham's and Gary Robinson's combining > # schemes. Unlike Paul's, it's never crazy-certain, and compared to > # Gary's, in Tim's tests it greatly increased the spread between mean > # ham-scores and spam-scores, while simultaneously decreasing the > # variance of both. Tim needed a higher spam_cutoff value for best > # results, but spam_cutoff is less touchy than under Gary-combining. > use_tim_combining: False > > "Tim combining" simply takes the geometric mean of the spamprobs as a > measure of spamminess S, and the geometric mean of 1-spamprob as a measure > of hamminess H, then returns S/(S+H) as "the score". This is well-behaved > when fed random, uniformly distributed probabilities, but isn't reluctant to > let an overwhelming number of extreme clues lead it to an extreme conclusion > (although you're not going to see it give Graham-like 1e-30 or > 1.0000000000000 scores). > > Don't use a central-limit scheme with this (it has no effect on those). If > you test it, use whatever variations on the "all default" scheme you usually > use, but it will probably help to boost spam_cutoff. Note that the default > max_discriminators is still 150, and that's what I used below. 
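[Editor's aside: for readers following along, the scheme the quoted comment describes reduces to a few lines. This is a hedged sketch, not the spambayes source: the function name is mine, and the log-space sums are just a standard way to compute geometric means without underflow.]

```python
import math

def tim_combine(probs):
    """Score a message from its word spamprobs: S is the geometric mean
    of the probs, H the geometric mean of (1 - prob), and the score is
    S/(S+H)."""
    n = len(probs)
    # geometric mean = exp(arithmetic mean of the logs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)
```

Fed probs that are all 0.5 it returns exactly 0.5, and unlike Graham-style multiplication it cannot reach 0.0 or 1.0 on finite evidence, matching the "never crazy-certain" claim above.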
> > Here's a 10-set cross-validation run on my data, restricted to 100 ham and > 100 spam per set, with all defaults, except > > before after > ------ ----- > use_tim_combining False True > spam_cutoff 0.55 0.615 > > > -> tested 100 hams & 100 spams against 900 hams & 900 spams > [ditto 19 times] > > false positive percentages > 0.000 0.000 tied > 1.000 0.000 won -100.00% > 1.000 1.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > > won 1 times > tied 9 times > lost 0 times > > total unique fp went from 2 to 1 won -50.00% > mean fp % went from 0.2 to 0.1 won -50.00% > > false negative percentages > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 1.000 1.000 tied > 0.000 0.000 tied > > won 0 times > tied 10 times > lost 0 times > > total unique fn went from 1 to 1 tied > mean fn % went from 0.1 to 0.1 tied > > The real story here is in the score distributions; contrary to what the > comment said above, the ham-score variance increased with this little data: > > ham mean ham sdev > 30.63 18.80 -38.62% 6.03 6.83 +13.27% > 29.31 17.35 -40.81% 5.48 6.84 +24.82% > 29.96 18.50 -38.25% 6.95 9.02 +29.78% > 29.66 18.12 -38.91% 5.89 6.81 +15.62% > 29.51 17.34 -41.24% 5.73 6.71 +17.10% > 29.40 17.43 -40.71% 5.73 6.61 +15.36% > 29.75 17.74 -40.37% 5.76 6.96 +20.83% > 29.71 18.17 -38.84% 5.97 6.48 +8.54% > 31.98 20.41 -36.18% 5.96 8.02 +34.56% > 29.83 18.11 -39.29% 4.75 5.41 +13.89% > > ham mean and sdev for all runs > 29.97 18.20 -39.27% 5.90 7.08 +20.00% > > spam mean spam sdev > 79.23 88.38 +11.55% 6.96 5.52 -20.69% > 79.40 88.70 +11.71% 7.00 5.64 -19.43% > 78.68 88.06 +11.92% 6.69 5.13 -23.32% > 79.65 89.01 +11.75% 7.20 5.22 -27.50% > 79.91 88.87 +11.21% 6.35 4.67 -26.46% > 80.47 89.16 +10.80% 7.22 6.06 -16.07% > 80.94 89.78 +10.92% 6.60 4.45 -32.58% > 80.30 89.41 +11.34% 6.95 
5.49 -21.01% > 78.54 87.70 +11.66% 7.30 6.45 -11.64% > 80.06 89.06 +11.24% 6.98 5.43 -22.21% > > spam mean and sdev for all runs > 79.72 88.81 +11.40% 6.97 5.47 -21.52% > > ham/spam mean difference: 49.75 70.61 +20.86 > > So before, the score equidistant from both means was 52.78, at 3.87 sdevs > from each; after, it was 58.03, at 5.63 sdevs from each. The populations > are much better separated by this measure. > > Histograms before: > > -> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 > -> min 13.521; median 29.6919; max 60.8937 > * = 2 items > ... > 13 2 * > 14 0 > 15 2 * > 16 8 **** > 17 4 ** > 18 9 ***** > 19 17 ********* > 20 14 ******* > 21 16 ******** > 22 24 ************ > 23 38 ******************* > 24 47 ************************ > 25 62 ******************************* > 26 65 ********************************* > 27 69 *********************************** > 28 73 ************************************* > 29 70 *********************************** > 30 76 ************************************** > 31 70 *********************************** > 32 61 ******************************* > 33 51 ************************** > 34 50 ************************* > 35 34 ***************** > 36 30 *************** > 37 27 ************** > 38 18 ********* > 39 12 ****** > 40 11 ****** > 41 13 ******* > 42 2 * > 43 5 *** > 44 8 **** > 45 2 * > 46 1 * > 47 3 ** > 48 1 * > 49 0 > 50 3 ** > 51 0 > 52 0 > 53 0 > 54 0 > 55 1 * > 56 0 > 57 0 > 58 0 > 59 0 > 60 1 * > ... > > -> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 > -> min 52.3428; median 79.9799; max 98.1879 > * = 2 items > ... 
> 52 1 * > 53 0 > 54 0 > 55 0 > 56 3 ** > 57 1 * > 58 0 > 59 1 * > 60 4 ** > 61 4 ** > 62 4 ** > 63 3 ** > 64 4 ** > 65 7 **** > 66 9 ***** > 67 10 ***** > 68 13 ******* > 69 16 ******** > 70 26 ************* > 71 18 ********* > 72 29 *************** > 73 35 ****************** > 74 40 ******************** > 75 39 ******************** > 76 56 **************************** > 77 52 ************************** > 78 50 ************************* > 79 76 ************************************** > 80 60 ****************************** > 81 77 *************************************** > 82 45 *********************** > 83 61 ******************************* > 84 50 ************************* > 85 43 ********************** > 86 41 ********************* > 87 33 ***************** > 88 19 ********** > 89 11 ****** > 90 11 ****** > 91 8 **** > 92 2 * > 93 9 ***** > 94 4 ** > 95 9 ***** > 96 2 * > 97 11 ****** > 98 3 ** > 99 0 > > Histograms after: > > -> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 > -> min 5.6946; median 17.1757; max 73.1302 > * = 2 items > ... 
> 5 1 * > 6 13 ******* > 7 16 ******** > 8 25 ************* > 9 22 *********** > 10 37 ******************* > 11 45 *********************** > 12 56 **************************** > 13 70 *********************************** > 14 61 ******************************* > 15 66 ********************************* > 16 79 **************************************** > 17 63 ******************************** > 18 59 ****************************** > 19 59 ****************************** > 20 56 **************************** > 21 47 ************************ > 22 36 ****************** > 23 37 ******************* > 24 32 **************** > 25 9 ***** > 26 20 ********** > 27 17 ********* > 28 8 **** > 29 7 **** > 30 11 ****** > 31 6 *** > 32 7 **** > 33 5 *** > 34 4 ** > 35 2 * > 36 2 * > 37 6 *** > 38 1 * > 39 0 > 40 3 ** > 41 3 ** > 42 0 > 43 1 * > 44 1 * > 45 1 * > 46 0 > 47 1 * > 48 0 > 49 0 > 50 2 * > 51 1 * > 52 0 > 53 0 > 54 0 > 55 0 > 56 0 > 57 0 > 58 0 > 59 0 > 60 0 > 61 1 * > 62 0 > 63 0 > 64 0 > 65 0 > 66 0 > 67 0 > 68 0 > 69 0 > 70 0 > 71 0 > 72 0 > 73 1 * > > -> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 > -> min 54.9382; median 89.5188; max 98.3805 > * = 2 items > ... 
> 54 1 * > 55 0 > 56 0 > 57 0 > 58 0 > 59 0 > 60 0 > 61 0 > 62 0 > 63 1 * > 64 3 ** > 65 0 > 66 1 * > 67 0 > 68 2 * > 69 2 * > 70 3 ** > 71 3 ** > 72 2 * > 73 2 * > 74 4 ** > 75 4 ** > 76 6 *** > 77 8 **** > 78 8 **** > 79 6 *** > 80 12 ****** > 81 25 ************* > 82 26 ************* > 83 25 ************* > 84 39 ******************** > 85 58 ***************************** > 86 70 *********************************** > 87 64 ******************************** > 88 74 ************************************* > 89 106 ***************************************************** > 90 85 ******************************************* > 91 62 ******************************* > 92 86 ******************************************* > 93 79 **************************************** > 94 37 ******************* > 95 23 ************ > 96 42 ********************* > 97 25 ************* > 98 6 *** > 99 0 > > There are snaky tails in either case, but "the middle ground" here is > larger, sparser, and still contains the errors. > > Across my full test data, which I actually ran first, you can ignore the > "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the > overall results would have been virtually identical had I boosted > spam_cutoff in the second run (recall that I can't demonstrate an > improvement on this data anymore! I can only determine whether something is > a disaster, and this ain't). > > -> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams > [ditto 19 times] > ... 
> false positive percentages > 0.000 0.050 lost +(was 0) > 0.000 0.050 lost +(was 0) > 0.000 0.050 lost +(was 0) > 0.000 0.000 tied > 0.050 0.100 lost +100.00% > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.050 0.050 tied > > won 0 times > tied 6 times > lost 4 times > > total unique fp went from 2 to 6 lost +200.00% > mean fp % went from 0.01 to 0.03 lost +200.00% > > false negative percentages > 0.000 0.000 tied > 0.071 0.071 tied > 0.000 0.000 tied > 0.071 0.071 tied > 0.143 0.071 won -50.35% > 0.143 0.000 won -100.00% > 0.143 0.143 tied > 0.143 0.000 won -100.00% > 0.071 0.000 won -100.00% > 0.000 0.000 tied > > won 4 times > tied 6 times > lost 0 times > > total unique fn went from 11 to 5 won -54.55% > mean fn % went from 0.0785714285714 to 0.0357142857143 won -54.55% > > ham mean ham sdev > 25.65 10.68 -58.36% 5.67 5.44 -4.06% > 25.61 10.68 -58.30% 5.50 5.29 -3.82% > 25.57 10.68 -58.23% 5.67 5.49 -3.17% > 25.66 10.71 -58.26% 5.54 5.27 -4.87% > 25.42 10.55 -58.50% 5.72 5.71 -0.17% > 25.51 10.43 -59.11% 5.39 5.11 -5.19% > 25.65 10.40 -59.45% 5.59 5.29 -5.37% > 25.61 10.51 -58.96% 5.41 5.21 -3.70% > 25.84 10.80 -58.20% 5.48 5.30 -3.28% > 25.81 10.85 -57.96% 5.81 5.73 -1.38% > > ham mean and sdev for all runs > 25.63 10.63 -58.53% 5.58 5.39 -3.41% > > spam mean spam sdev > 83.86 93.17 +11.10% 7.09 4.55 -35.83% > 83.64 93.16 +11.38% 6.83 4.52 -33.82% > 83.27 92.91 +11.58% 6.81 4.52 -33.63% > 83.82 93.14 +11.12% 6.88 4.67 -32.12% > 83.89 93.29 +11.21% 6.65 4.56 -31.43% > 83.78 93.11 +11.14% 6.96 4.72 -32.18% > 83.42 93.00 +11.48% 6.82 4.74 -30.50% > 83.86 93.29 +11.24% 6.71 4.55 -32.19% > 83.88 93.22 +11.13% 6.98 4.71 -32.52% > 83.75 93.28 +11.38% 6.65 4.32 -35.04% > > spam mean and sdev for all runs > 83.72 93.16 +11.28% 6.84 4.59 -32.89% > > ham/spam mean difference: 58.09 82.53 +24.44 > > So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to > 55.20 at 8.27 sdevs from each. That's big. 
> > The "after" histograms had 200 buckets in this run: > > -> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 > -> min 0.281945; median 9.69929; max 81.9673 > * = 17 items > 0.0 7 * > 0.5 13 * > 1.0 21 ** > 1.5 41 *** > 2.0 86 ****** > 2.5 166 ********** > 3.0 239 *************** > 3.5 326 ******************** > 4.0 466 **************************** > 4.5 554 ********************************* > 5.0 642 ************************************** > 5.5 701 ****************************************** > 6.0 793 *********************************************** > 6.5 804 ************************************************ > 7.0 933 ******************************************************* > 7.5 972 ********************************************************** > 8.0 997 *********************************************************** > 8.5 934 ******************************************************* > 9.0 947 ******************************************************** > 9.5 939 ******************************************************** > 10.0 839 ************************************************** > 10.5 786 *********************************************** > 11.0 752 ********************************************* > 11.5 760 ********************************************* > 12.0 636 ************************************** > 12.5 606 ************************************ > 13.0 554 ********************************* > 13.5 483 ***************************** > 14.0 461 **************************** > 14.5 399 ************************ > 15.0 360 ********************** > 15.5 317 ******************* > 16.0 275 ***************** > 16.5 224 ************** > 17.0 193 ************ > 17.5 169 ********** > 18.0 172 *********** > 18.5 154 ********** > 19.0 153 ********* > 19.5 92 ****** > 20.0 104 ******* > 20.5 99 ****** > 21.0 74 ***** > 21.5 73 ***** > 22.0 73 ***** > 22.5 50 *** > 23.0 38 *** > 23.5 50 *** > 24.0 38 *** > 24.5 34 ** > 25.0 26 ** > 25.5 39 *** > 26.0 24 ** > 26.5 34 ** > 27.0 18 ** > 
27.5 15 * > 28.0 20 ** > 28.5 15 * > 29.0 14 * > 29.5 15 * > 30.0 12 * > 30.5 15 * > 31.0 14 * > 31.5 10 * > 32.0 12 * > 32.5 6 * > 33.0 10 * > 33.5 4 * > 34.0 8 * > 34.5 5 * > 35.0 5 * > 35.5 6 * > 36.0 7 * > 36.5 4 * > 37.0 2 * > 37.5 3 * > 38.0 1 * > 38.5 4 * > 39.0 6 * > 39.5 2 * > 40.0 2 * > 40.5 5 * > 41.0 0 > 41.5 2 * > 42.0 3 * > 42.5 3 * > 43.0 1 * > 43.5 2 * > 44.0 1 * > 44.5 2 * > 45.0 1 * > 45.5 1 * > 46.0 2 * > 46.5 0 > 47.0 3 * > 47.5 0 > 48.0 1 * > 48.5 1 * > 49.0 1 * > 49.5 0 > 50.0 1 * > 50.5 0 > 51.0 2 * > 51.5 0 > 52.0 1 * > 52.5 0 > 53.0 0 > 53.5 1 * > 54.0 1 * > 54.5 2 * > 55.0 0 > 55.5 0 > 56.0 1 * > 56.5 1 * > 57.0 0 > 57.5 0 > 58.0 0 > 58.5 1 * > 59.0 0 > 59.5 0 > 60.0 0 > 60.5 0 > 61.0 1 * > 61.5 0 > 62.0 0 > 62.5 0 > 63.0 0 > 63.5 0 > 64.0 0 > 64.5 0 > 65.0 0 > 65.5 0 > 66.0 0 > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 0 > 69.0 0 > 69.5 0 > 70.0 1 * the lady with the long & obnoxious employer-generated sig > 70.5 0 > 71.0 0 > 71.5 0 > 72.0 0 > 72.5 0 > 73.0 0 > 73.5 0 > 74.0 0 > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 0 > 76.5 0 > 77.0 0 > 77.5 0 > 78.0 0 > 78.5 0 > 79.0 0 > 79.5 0 > 80.0 0 > 80.5 0 > 81.0 0 > 81.5 1 * the verbatim quote of a long Nigerian-scam spam > ... > > -> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 > -> min 24.3497; median 93.8141; max 99.6769 > * = 15 items > ... 
> 24.0 1 * not really sure -- it's a giant base64-encoded plain text file > 24.5 0 > 25.0 0 > 25.5 0 > 26.0 0 > 26.5 0 > 27.0 0 > 27.5 0 > 28.0 0 > 28.5 0 > 29.0 1 * the spam with the uuencoded body we throw away > 29.5 0 > 30.0 0 > 30.5 0 > 31.0 0 > 31.5 0 > 32.0 0 > 32.5 0 > 33.0 0 > 33.5 0 > 34.0 0 > 34.5 0 > 35.0 0 > 35.5 0 > 36.0 0 > 36.5 0 > 37.0 0 > 37.5 0 > 38.0 0 > 38.5 0 > 39.0 0 > 39.5 0 > 40.0 0 > 40.5 0 > 41.0 0 > 41.5 0 > 42.0 0 > 42.5 0 > 43.0 0 > 43.5 0 > 44.0 0 > 44.5 0 > 45.0 0 > 45.5 0 > 46.0 1 * Hello, my Name is BlackIntrepid > 46.5 0 > 47.0 0 > 47.5 0 > 48.0 0 > 48.5 0 > 49.0 0 > 49.5 0 > 50.0 0 > 50.5 0 > 51.0 0 > 51.5 0 > 52.0 0 > 52.5 0 > 53.0 0 > 53.5 1 * unclear; a collection of webmaster links > 54.0 1 * Susan makes a propsal (sic) to Tim > 54.5 0 > 55.0 1 * > 55.5 0 > 56.0 0 > 56.5 1 * > 57.0 2 * > 57.5 0 > 58.0 0 > 58.5 1 * > 59.0 0 > 59.5 0 > 60.0 1 * > 60.5 2 * > 61.0 1 * > 61.5 1 * > 62.0 0 > 62.5 1 * > 63.0 1 * > 63.5 0 > 64.0 1 * > 64.5 1 * > 65.0 0 > 65.5 1 * > 66.0 1 * > 66.5 2 * > 67.0 4 * > 67.5 2 * > 68.0 0 > 68.5 1 * > 69.0 0 > 69.5 3 * > 70.0 1 * > 70.5 5 * > 71.0 5 * > 71.5 3 * > 72.0 4 * > 72.5 3 * > 73.0 3 * > 73.5 6 * > 74.0 3 * > 74.5 4 * > 75.0 8 * > 75.5 8 * > 76.0 10 * > 76.5 10 * > 77.0 10 * > 77.5 17 ** > 78.0 14 * > 78.5 27 ** > 79.0 16 ** > 79.5 23 ** > 80.0 28 ** > 80.5 29 ** > 81.0 37 *** > 81.5 37 *** > 82.0 46 **** > 82.5 55 **** > 83.0 47 **** > 83.5 53 **** > 84.0 58 **** > 84.5 68 ***** > 85.0 86 ****** > 85.5 118 ******** > 86.0 135 ********* > 86.5 159 *********** > 87.0 165 *********** > 87.5 178 ************ > 88.0 209 ************** > 88.5 231 **************** > 89.0 299 ******************** > 89.5 391 *************************** > 90.0 425 ***************************** > 90.5 402 *************************** > 91.0 501 ********************************** > 91.5 582 *************************************** > 92.0 636 ******************************************* > 92.5 667 
********************************************* > 93.0 713 ************************************************ > 93.5 685 ********************************************** > 94.0 610 ***************************************** > 94.5 621 ****************************************** > 95.0 721 ************************************************* > 95.5 735 ************************************************* > 96.0 870 ********************************************************** > 96.5 742 ************************************************** > 97.0 449 ****************************** > 97.5 447 ****************************** > 98.0 556 ************************************** > 98.5 561 ************************************** > 99.0 264 ****************** > 99.5 171 ************ > > The mistakes are all familiar; the good news is that "the normal cases" are > far removed from what might plausibly be called a middle ground. For > example, if we called the region from 40 thru 70 here "the middle ground", > and kicked those out for manual review, there would be very few msgs to > review, but they would contain almost all the mistakes. > > How does this do on your data? I'm in favor what works . > From grobinson@transpose.com Thu Oct 10 02:18:28 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:18:28 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: If you do decide to try the chi-square thing, the idea is to find or create a function (perhaps using a lookup table) that takes a chi-square random variable and outputs the associated p-value. The input random variable is the product of the p's or (1-p)'s as the case may be. If the p's are uniformly distributed under the null hypothesis, the product is chi-square with 2n degrees of freedom, where n is the number of terms making up the product. So the inverse chi-square function gives the probability associated with that product. 
--Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From grobinson@transpose.com Thu Oct 10 02:41:31 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:41:31 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: No no no, I had that wrong. Silly of me. Sorry. It's been too long since I did that calc. It's not the product of the p's that is a chi-square distribution, it's the following, given p1, p2..., pn: -2*((ln p1) + (ln p2) + ... + (ln pn)) That expression has a chi-square distribution with 2n degrees of freedom. So you feed THAT into the inverse chi-square function to get a p-value. Let invchi(x, f), where x is the random variable and f is the degrees of freedom, be the inverse chi-square function. Let S be a number near 1 when the email looks spammy and H be a number near 1 when the email looks hammy. Then you want S = 1 - invchi(-2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n) and H = 1 - invchi(-2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n) I am a little out-of-practice but I am about 99.9% sure that the above is right. I looked up some of my notes from a few years ago to get the calc. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one@comcast.net Thu Oct 10 04:08:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 09 Oct 2002 23:08:03 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Gary Robinson] > The thing about the geometric mean is that it is much more sensitive to > numbers near 0, so the S/(S+H) technique is biased in that way. A single geometric mean would surely be biased, but the combination of two used here doesn't appear to be.
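[Editor's aside: Gary's corrected recipe in the message just above can be made concrete. The sketch below is mine, not project code: note that the statistic must be the Fisher form, -2 times the sum of the logs, for it to be nonnegative and chi-square distributed; `chi2Q` is the standard series for the chi-square survival function with an even number of degrees of freedom, playing the role of the p-value Gary describes.]

```python
import math

def chi2Q(x2, v):
    """Survival function of the chi-square distribution: probability
    that a chi-square variable with v degrees of freedom exceeds x2.
    Exact series expansion, valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_sh(probs):
    """Gary's S' and H': feed -2*sum(ln(1-p)) and -2*sum(ln p) (minus
    signs keep the statistics nonnegative) into the survival function
    with 2n degrees of freedom."""
    n = len(probs)
    s_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    h_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(s_stat, 2 * n)  # near 1 when the evidence is spammy
    H = 1.0 - chi2Q(h_stat, 2 * n)  # near 1 when the evidence is hammy
    return S, H
```

By construction S and H are genuine tail probabilities, which is why this variant should give the better-defined spread Gary predicts.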
That is, throwing random data at it, the mean and median are 0.5, and it's symmetric around that: 5000 items; mean 0.50; sdev 0.06 -> min 0.291521; median 0.500264; max 0.726668 * = 24 items 0.25 3 * 0.30 34 ** 0.35 211 ********* 0.40 816 ********************************** 0.45 1431 ************************************************************ 0.50 1442 ************************************************************* 0.55 809 ********************************** 0.60 219 ********** 0.65 33 ** 0.70 2 * If I do the same random-data experiment and force a prob of 0.99, the mean rises to 0.52; if I force a prob of 0.01, it falls to 0.48. If there's a bias, it's hiding pretty well . If there is a spamprob near 0, it's very much the intent that S take that seriously, and if one near 1, that H take that seriously; else, as now, I see screaming spam or screaming ham barely cracking scores above 70 or below 30. "Too much ends up in the middle." > If you want to try something like that, I would suggest using the > ARITHMETIC means in computing S and H and again using S(S+H). That > would remove that bias. That doesn't appear promising: If S = Smean = (sum p_i)/n and H = Hmean = (sum 1-p_i)/n then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether you meant S*(S+H) or S/(S+H), the result is S. To within roundoff error, that's what happens, too. > It wouldn't be invoking that optimality theorem, but whatever works... I'm not sure the optimality theorem in question is relevant to the task at hand, though. Why should we care about rejecting a hypothesis that the word probabilities are uniformly distributed? There's virtually no message in which they are, and no reason to believe that the *majority* of words in spam will have spamprobs over 0.5. Graham got results as good as he did because the spamprob strength of a mere handful of words is usually enough to decide it. In a sense, I am trying to move back toward what worked best in his formulation.
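[Editor's aside: Tim's algebra -- that with arithmetic means H is identically 1 - S, so S/(S+H) collapses to S -- is quick to confirm numerically. A throwaway check, not project code:]

```python
import random

random.seed(1)
probs = [random.random() for _ in range(100)]
n = len(probs)
S = sum(probs) / n                    # arithmetic mean of spamprobs
H = sum(1.0 - p for p in probs) / n   # arithmetic mean of 1 - spamprob
# H is identically 1 - S, so S + H == 1 and S/(S+H) is just S again.
assert abs((S + H) - 1.0) < 1e-9
assert abs(S / (S + H) - S) < 1e-9
```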
> It really seems, as a matter of being educated, that the > arithmetic approach is worth trying if it doesn't take a lot of > trouble to try it. Nope, no trouble, but my test data can't demonstrate improvements, just disasters. On a brief 10-fold cv run with 100 ham + 100 spam in each set, using the arithmetic spamprob mean gave results pretty much the same as the default scheme; error rates were the same, but the best range for spam_cutoff shifted from 0.52 thru 0.54, to 0.56 thru 0.58; it increased the spread a little: ham mean and sdev for all runs 30.35 30.53 +0.59% 5.83 5.91 +1.37% spam mean and sdev for all runs 80.97 84.08 +3.84% 7.07 6.38 -9.76% ham/spam mean difference: 50.62 53.55 +2.93 >> "but more sensitive to overwhelming amounts of evidence than >> Gary-combining" > From the email you sent at 1:02PM yesterday: > > 0.40 0 > 0.45 2 * > 0.50 412 ********* > 0.55 3068 ************************************************************* > 0.60 1447 ***************************** > 0.65 71 ** > 0.70 0 > > One thing I'd like to be more clear on. If I understand the experiment > correctly you set 10 to .99 and 40 were random. I have to dig up that email to find the context ... OK, this one was tagged Result for random vectors of 50 probs, + 10 forced to 0.99 That means there were 60 probs in all, 50 drawn from (0.0, 1.0), + 10 of 0.99. > What percentage actually ended up as > .5, without regard to > HOW MUCH over .5? From the histogram, all but 2, out of 5000 trials. 0.5 doesn't work as a spam_cutoff on anyone's corpus here, though (it's too low; too many false positives). The median value in that run was 0.58555, which is close to what some people have been using for spam_cutoff.
Under the S/(S+H) scheme, the same experiment yields 5000 items; mean 0.68; sdev 0.05 -> min 0.490773; median 0.683328; max 0.819528 * = 34 items 0.45 2 * 0.50 27 * 0.55 171 ****** 0.60 991 ****************************** 0.65 2016 ************************************************************ 0.70 1510 ********************************************* 0.75 275 ********* 0.80 8 * So if the percentage above 0.5 is the sole measure of goodness here, S/(S+H) did equally well in this experiment. > ... > It's not the (S-H)/(S+H) that is the most sensitive (under certain > conditions), it that the geometric mean approach for computing S gives a > result that is MONOTONIC WITH a calculation which is the most sensitive. > > The real technique would take S and feed it into an inverse chi-square > function with (in this experiment) 100 degrees of freedom. The output > (roughly speaking) would be the probability that that S (or a more extreme > one) might have occurred by chance alone. > > Call these numbers S' and H' for S and H respectively. > > The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') > (unless I've made some error). > > So, as a binary indicator, the two are equivalent. However, if you used S' > and H', you would see something more like real probabilities that would > probably be of magnitudes that would be more attractive to you. > > You could probably use a table to approximate the inverse chi-square calc > rather than actually doing the computations all the time. > > I didn't suggest doing that, at first, because I was interested > in providing a binary indicator and wanting to keep things simple -- > and from the POV of a binary indicator, it doesn't make any difference. It's not a question of attraction so much as that this "binary indicator" doesn't come with a decision rule for knowing which outcome is which: it varies across corpus, and within a given corpus varies over time, depending on how much data has been trained on.
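[Editor's aside: the experiment described here -- 50 uniform probs plus 10 forced to 0.99, scored 5000 times under S/(S+H) -- is easy to rerun. A sketch with my own helper name and an arbitrary seed, so the exact numbers will differ slightly from Tim's:]

```python
import math
import random

def combine(probs):
    # geometric means of p and 1-p, then S/(S+H)
    n = len(probs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)

random.seed(2002)
scores = [combine([random.random() for _ in range(50)] + [0.99] * 10)
          for _ in range(5000)]
mean = sum(scores) / len(scores)
above_half = sum(s > 0.5 for s in scores) / len(scores)
# Expect a mean near 0.68, with essentially every score above 0.5,
# matching the histogram quoted above.
```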
So we get a stream of test results where the numbers have to be fudged retroactively via "but if I had set the cutoff to *this* on this run, the results would have been very different". It's just too delicate as is. > So, if it happens that feel like taking the time to go "all the way" > with this approach, I would suggest actually computing S' and H' and > seeing what happens. Sounds like fun. > I think you would like the results better -- I just didn't suggest > it at first because I didn't know the spread would be of such > interest and I wanted to keep things simple. That's fine. In practice, the touchiness of spam_cutoff has been an ongoing practical problem; but it's been the *only* ongoing problem, so that's why we're talking about it. > > I think this would work better than the S/(S+H) approach, because > if you use geometric means, it's more sensitive to one condition than > the other, and if you use arithmetic means, you don't invoke the > optimality theorem. As above, I've found no reason yet to believe S/(S+H) favors one side over the other, and the test runs didn't show me evidence of that either. Indeed, it made the same mistakes on the same messages, but moved mounds of correctly classified messages out of "the middle ground". > Of course, this is ALL speculative. But the probabilities involved will > DEFINATELY be of greater magnitude, and so a better-defined spread, if > the inverse chi-square is used. It's doable, but the experimental results so far are promising enough that I'm still keener to see how it works for others here. From popiel@wolfskeep.com Thu Oct 10 04:19:46 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 09 Oct 2002 20:19:46 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: Message from Tim Peters of "Wed, 09 Oct 2002 20:34:15 EDT."
References:
Message-ID: <20021010031946.EA9ACF54A@cashew.wolfskeep.com>

In message: Tim Peters writes:
>
> So there's a new option,
>
> [Classifier]
> use_tim_combining: True
>
> How does this do on your data? I'm in favor of what works.

Oooh, goodie! Another thing to consume CPU-hours! I'll run this one
after I get done with my initial clt tests (which are taking about 4.5
hours each :-/ ).

I can't really say anything else, yet, but clt seems _much_ slower than
the default classifier.

- Alex

From tim.one@comcast.net Thu Oct 10 04:29:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 09 Oct 2002 23:29:38 -0400
Subject: [Spambayes] spamprob combining
In-Reply-To: <20021010031946.EA9ACF54A@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> Oooh, goodie! Another thing to consume CPU-hours!

Yup, that's the only idea here.

> I'll run this one after I get done with my initial clt tests
> (which are taking about 4.5 hours each :-/ ).

Use less data?

> I can't really say anything else, yet, but clt seems _much_ slower
> than the default classifier.

I haven't really noticed that. If you're using your "--trainstyle full"
patch with timcv, then, yes, it would be enormously slower -- timcv gets
enormous *efficiency* benefits (both instruction-count and temporal
cache locality) out of incremental learning and unlearning. The "third
training pass" unique to the clt methods also doubles the training time
(each msg in the training data is tokenized once to update the
wordprobs, and then a second time to compute the clt ham and spam
population statistics).

From tim.one@comcast.net Thu Oct 10 05:11:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 10 Oct 2002 00:11:20 -0400
Subject: [Spambayes] Demo Outlook Plugin available
In-Reply-To:
Message-ID:

[Mark Hammond]
> I just released new win32all builds that contain support for Microsoft
> Outlook Extensions.
>
> If you install win32all-149, 150 or the most recent CVS snapshot
> build, you will find a file win32com\demos\outlookAddin.py - please
> see the comments in the file for information on how to install and
> test this plugin.

Those of you who aren't Python natives may be wondering where to find
that! Here you go:

    http://starship.python.net/crew/mhammond/

If you want to know how to *use* it, Mark is the co-author of O'Reilly's
"Python Programming on Win32". Tell 'em Uncle Timmy sent you.

From tim.one@comcast.net Thu Oct 10 05:58:06 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 10 Oct 2002 00:58:06 -0400
Subject: [Spambayes] Modifications to timcv.py
In-Reply-To: <20021009180439.62465F54A@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> The inability to use timcv.py with the central limit stuff
> annoyed me. I offer this patch to correct that problem...

Thanks again! I checked this in. The biggest difference is that this has
become a new option in a new section:

[CV Driver]
build_each_classifier_from_scratch: False

See associated comments in Options.py. mboxtest.py is also a
cross-validation (CV) driver, so should also learn how to do this. When
this option is True, a CV driver can be used safely with a central-limit
test -- although it will run much slower due to the "build each from
scratch" business that *makes* it safe.

From quinlan@pathname.com Thu Oct 10 05:47:02 2002
From: quinlan@pathname.com (Daniel Quinlan)
Date: 09 Oct 2002 21:47:02 -0700
Subject: [Spambayes] Re: [SAdev] fully-public corpus of mail available
In-Reply-To: jm@jmason.org's message of "Wed, 09 Oct 2002 13:21:11 +0100"
References: <20021009122116.6EB2416F03@jmason.org>
Message-ID:

> (Please feel free to forward this message to other possibly-interested
> parties.)

Some caveats (in descending order of concern):

1. These messages could end up being falsely (or incorrectly) reported
to Razor, DCC, Pyzor, etc. Certain RBLs too.
I don't think the results for these distributed tests can be trusted in
any way, shape, or form when running over a public corpus.

2. These messages could also be submitted (more than once) to projects
like SpamAssassin that rely on filtering results submission for GA
tuning and development.

3. Spammers could adopt elements of the good messages to throw off
filters. And, of course, there's always progression in technology (by
both spammers and non-spammers).

The second problem could be alleviated somewhat by adding a Nilsimsa
signature (or similar) to the mass-check file (the results format used
by SpamAssassin) and giving the message files unique names (MD5 or SHA-1
of each file). The third problem doesn't really worry me.

These problems (and perhaps others I have not identified) are unique to
spam filtering. Compression corpuses and other performance-related
corpuses have their own set of problems, of course. In other words, I
don't think there's any replacement for having multiple independent
corpuses. Finding better ways to distribute testing and collate results
seems like a more viable long-term solution (and I'm glad we're working
on exactly that for SpamAssassin).

If you're going to seriously work on filter development, building a
corpus of 10000-50000 messages (half spam/half non-spam) is not really
that much work. If you don't get enough spam, creating multi-technique
spamtraps (web, usenet, replying to spam) is pretty easy. And who
doesn't get thousands of non-spam every week? ;-)

Dan

From rob@hooft.net Thu Oct 10 06:00:04 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 10 Oct 2002 07:00:04 +0200
Subject: [Spambayes] spamprob combining
References:
Message-ID: <3DA50954.7040503@hooft.net>

Tim Peters wrote:
> "Tim combining" simply takes the geometric mean of the spamprobs as a
> measure of spamminess S, and the geometric mean of 1-spamprob as a
> measure of hamminess H, then returns S/(S+H) as "the score".
> This is well-behaved when fed random, uniformly distributed
> probabilities, but isn't reluctant to let an overwhelming number of
> extreme clues lead it to an extreme conclusion (although you're not
> going to see it give Graham-like 1e-30 or 1.0000000000000 scores).

While reading this I had a sudden thought: with the distributions I'm
normally interested in, I want to explain the "bulk" accurately, without
being extremely sensitive to the tails. E.g. in my previous job, the
bulk was a database of protein structures, and I wanted to describe the
bulk so that I could recognize the outliers. In my current job, the
population is pixel activity on a CCD, and I don't want to be sensitive
to bad pixels.

The standard way to calculate a standard deviation is to calculate the
mean first, and then calculate sum((x - mean)^2)/(n-1) in a second pass
over the numbers. This is rather sensitive to outliers, however. In
both cases I have experience with, the best way to describe the bulk is
to use the median, and "median ways" to calculate the standard
deviation. These methods absolutely ignore the extreme values.

But now spambayes. The bulk are words like "the" and "with" and "want"
and ... all totally uninteresting. So if we want to be sensitive to
outliers, we should "go the other way". We have two options I can think
of:

* use a (x - mean)^4 function. This will be very sensitive to extremes.

* calculate the mean and standard deviation both using the standard
technique and using medians, and then use the DIFFERENCE between the
results as a measure of the extreme-characteristic.

Just some random ideas I wouldn't yet know how to apply.

Rob

--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From mhammond@skippinet.com.au Thu Oct 10 09:52:04 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 10 Oct 2002 18:52:04 +1000
Subject: [Spambayes] Demo Outlook Plugin available
In-Reply-To:
Message-ID:

[Tim]
> [Mark Hammond]
> > I just released new win32all builds that contain support for Microsoft
> > Outlook Extensions.
> >
> > If you install win32all-149, 150 or the most recent CVS snapshot
> > build, you will find a file win32com\demos\outlookAddin.py - please
> > see the comments in the file for information on how to install and
> > test this plugin.
>
> Those of you who aren't Python natives may be wondering where to
> find that! Here you go:
>
> http://starship.python.net/crew/mhammond/

Thanks Tim! Except unfortunately I did screw up the packaging. However,
as I only announced the new package here, I simply re-issued these
win32all builds.

If you have already installed win32all-149, win32all-150, or the CVS
snapshot build, you should move the files
"site-packages\win32com\pythoncom.*" to the "site-packages" directory -
that is, move pythoncom.py/pyc/pyo to its parent directory. All COM
related stuff should then work. You can ignore the fact that the sample
COM objects weren't registered. Of course, simply re-download the
package if you want to make sure!

It looks like Sean True and I are going to be doing a little more work
on this ;)

Mark.

From grobinson@transpose.com Thu Oct 10 13:45:53 2002
From: grobinson@transpose.com (Gary Robinson)
Date: Thu, 10 Oct 2002 08:45:53 -0400
Subject: [Spambayes] chi-square
Message-ID:

> >> It wouldn't be invoking that optimality theorem, but whatever works...
>
> I'm not sure the optimality theorem in question is relevant to the task
> at hand, though. Why should we care about rejecting a hypothesis that
> the word probabilities are uniformly distributed?
> There's virtually no message in which they are, and no reason to
> believe that the *majority* of words in spam will have spamprobs over
> 0.5. Graham got results as good as he did because the spamprob strength
> of a mere handful of words is usually enough to decide it. In a sense,
> I am trying to move back toward what worked best in his formulation.

Right, I agree, and I've noted earlier that because the variables aren't
independent this isn't really an "optimal" use of the optimality
theorem. ;) Nevertheless, I think it is a good idea to come as close as
we can to invoking it, because even approximately invoking such a
theorem is often better than doing something which has no real
mathematics underlying it at all.

> There's a dramatic difference in the Paul results, while the Gary
> results move subtly (in comparison).
>
> If we force 10 additional .99 spamprobs, the differences are night and
> day:
>
> Result for random vectors of 50 probs, + 10 forced to 0.99
> [Histogram here]
>
> It's hard to know what to make of this, especially in light of the
> claim that Gary-combining has been proven to be the most sensitive
> possible test for rejecting the hypothesis that a collection of probs
> is uniformly distributed. At least in this test, Paul-combining seemed
> far more sensitive (even when the data is random).

If you do the chi-square transformation, it should respond strongly to
this experiment, because it figures out a probability in association
with that kind of distortion. That is, doing the inverse chi-square
thing uncovers the probabilistic information that is now completely
buried in the product of the p's, and that can only emerge when the
number of p's is considered, which is done by means of the inverse
chi-square computation. The number of p's is currently ignored; when it
is considered, a very different result will emerge.

Look at it this way. You're saying that in your experiment 17% of the
p's are artificially forced to .99.
If there are 6 p's to start with, 17% would only mean 1 p was skewed and
that is not very unusual. But if you had 1,000,000 p's, and 17% of them
were totally out-of-whack with a uniform distribution, the odds against
it happening by chance alone would be completely astronomical. So, you
have to figure in the number of p's if you want to get anything like a
real probability. You can compute that real probability using the
inverse chi-square calc. Otherwise all the probabilistic detail is lost;
it just gets buried in the process of calculating the geometric mean.

If you are playing with different cutoffs, the details that are lost
when you don't do the inverse chi-square calc may really matter. They
DON'T matter if you are only using a .5 cutoff, because the monotonic
property we've discussed means that a binary choice based on a .5 cutoff
will be the same either way. But the details will matter more as you get
away from .5 for the cutoff.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454

From grobinson@transpose.com Thu Oct 10 13:46:01 2002
From: grobinson@transpose.com (Gary Robinson)
Date: Thu, 10 Oct 2002 08:46:01 -0400
Subject: [Spambayes] spamprob combining
In-Reply-To:
Message-ID:

>> If you want to try something like that, I would suggest using the
>> ARITHMETIC means in computing S and H and again using S(S+H). That
>> would remove that bias.
>
> That doesn't appear promising:
>
> If
>     S = Smean = (sum p_i)/n
>
> and
>     H = Hmean = (sum 1-p_i)/n
>
> then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether
> you meant S*(S+H) or S/(S+H), the result is S. To within roundoff
> error, that's what happens, too.

Ha ha ha! I should have thought of that!
:) Gary From bkc@murkworks.com Thu Oct 10 15:08:41 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 10 Oct 2002 10:08:41 -0400 Subject: [Spambayes] timcombine results (long) Message-ID: <3DA55161.28574.F0586AC@localhost> I re-ran tests on my corpus using this morning's checkout. The only difference between the two runs was use_tim_combining: (false first, then true) Note, I left spam_cutoff at 0.50 for both runs! [bkc@strader2 spambayes]$ more bayescustomize.ini [Tokenizer] mine_received_headers: True [Classifier] use_central_limit = False use_central_limit2 = False use_central_limit3 = False zscore_ratio_cutoff: 1.9 use_tim_combining: False [TestDriver] spam_cutoff: 0.50 show_false_negatives: True nbuckets: 100 show_spam_lo: 0.0 show_spam_hi: 0.45 save_trained_pickles: True save_histogram_pickles: True Histogram from from use_tim_combining: false -> Ham scores for all runs: 13000 items; mean 25.35; sdev 6.95 -> min 0.771618; median 24.6878; max 78.0095 * = 20 items 0 12 * 1 12 * 2 1 * 3 7 * 4 12 * 5 12 * 6 29 ** 7 13 * 8 32 ** 9 32 ** 10 45 *** 11 35 ** 12 80 **** 13 122 ******* 14 130 ******* 15 191 ********** 16 230 ************ 17 300 *************** 18 369 ******************* 19 533 *************************** 20 701 ************************************ 21 839 ****************************************** 22 946 ************************************************ 23 998 ************************************************** 24 1165 *********************************************************** 25 975 ************************************************* 26 859 ******************************************* 27 780 *************************************** 28 659 ********************************* 29 541 **************************** 30 388 ******************** 31 299 *************** 32 267 ************** 33 197 ********** 34 168 ********* 35 144 ******** 36 135 ******* 37 102 ****** 38 96 ***** 39 84 ***** 40 51 *** 41 64 **** 42 44 *** 43 42 *** 44 39 ** 45 34 ** 46 17 * 47 28 
** 48 24 ** 49 19 * 50 26 ** 51 7 * 52 15 * 53 12 * 54 6 * 55 5 * 56 11 * 57 3 * 58 3 * 59 2 * 60 0 61 2 * 62 1 * 63 0 64 2 * 65 1 * 66 0 67 1 * 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 1 * 79 0 80 0 81 0 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 13000 items; mean 81.18; sdev 7.55 -> min 34.1005; median 82.5437; max 99.5356 * = 17 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 0 20 0 21 0 22 0 23 0 24 0 25 0 26 0 27 0 28 0 29 0 30 0 31 0 32 0 33 0 34 1 * 35 0 36 1 * 37 0 38 0 39 1 * 40 0 41 0 42 0 43 3 * 44 1 * 45 5 * 46 2 * 47 1 * 48 5 * 49 0 50 7 * 51 9 * 52 7 * 53 12 * 54 18 ** 55 18 ** 56 23 ** 57 22 ** 58 30 ** 59 38 *** 60 37 *** 61 49 *** 62 74 ***** 63 56 **** 64 102 ****** 65 83 ***** 66 100 ****** 67 102 ****** 68 124 ******** 69 160 ********** 70 191 ************ 71 228 ************** 72 211 ************* 73 261 **************** 74 318 ******************* 75 312 ******************* 76 411 ************************* 77 413 ************************* 78 497 ****************************** 79 627 ************************************* 80 745 ******************************************** 81 780 ********************************************** 82 861 *************************************************** 83 991 *********************************************************** 84 903 ****************************************************** 85 860 *************************************************** 86 771 ********************************************** 87 622 ************************************* 88 506 ****************************** 89 510 ****************************** 90 230 ************** 91 158 ********** 92 142 ********* 93 112 ******* 94 79 ***** 95 52 **** 96 38 *** 97 14 * 98 18 ** 99 48 *** -> best cutoff for all runs: 0.53 -> with weighted total 1*50 fp + 43 fn = 93 -> fp rate 0.385% fn rate 0.331% -> matched at 0.54 with 38 fp & 55 
fn; fp rate 0.292%; fn rate 0.423% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik And now, histogram from timcombine true -> Ham scores for all runs: 13000 items; mean 11.93; sdev 8.33 -> min 0.584578; median 9.92718; max 87.3273 * = 18 items 0 61 **** 1 220 ************* 2 233 ************* 3 398 *********************** 4 670 ************************************** 5 811 ********************************************** 6 930 **************************************************** 7 1080 ************************************************************ 8 1081 ************************************************************* 9 1088 ************************************************************* 10 871 ************************************************* 11 788 ******************************************** 12 702 *************************************** 13 629 *********************************** 14 558 ******************************* 15 419 ************************ 16 337 ******************* 17 268 *************** 18 229 ************* 19 195 *********** 20 146 ********* 21 141 ******** 22 125 ******* 23 105 ****** 24 102 ****** 25 75 ***** 26 50 *** 27 60 **** 28 47 *** 29 58 **** 30 48 *** 31 49 *** 32 31 ** 33 35 ** 34 19 ** 35 30 ** 36 34 ** 37 21 ** 38 9 * 39 19 ** 40 13 * 41 25 ** 42 10 * 43 10 * 44 12 * 45 14 * 46 12 * 47 11 * 48 12 * 49 11 * 50 13 * 51 13 * 52 4 * 53 7 * 54 8 * 55 5 * 56 8 * 57 3 * 58 4 * 59 3 * 60 5 * 61 4 * 62 3 * 63 1 * 64 3 * 65 3 * 66 0 67 2 * 68 2 * 69 0 70 0 71 0 72 0 73 2 * 74 1 * 75 2 * 76 0 77 0 78 0 79 0 80 0 81 0 82 0 83 1 * 84 0 85 0 86 0 87 1 * 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 13000 items; mean 90.62; sdev 7.36 -> min 11.1229; median 92.6441; max 99.5389 * = 21 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 1 * 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 1 * 20 0 21 1 * 22 0 23 0 24 0 25 0 26 0 27 0 28 0 29 0 30 0 31 0 32 1 * 33 0 34 0 35 1 * 
36 0 37 1 * 38 0 39 1 * 40 0 41 2 * 42 5 * 43 1 * 44 1 * 45 0 46 3 * 47 1 * 48 0 49 0 50 2 * 51 7 * 52 3 * 53 3 * 54 6 * 55 3 * 56 7 * 57 13 * 58 6 * 59 11 * 60 13 * 61 16 * 62 11 * 63 18 * 64 22 ** 65 20 * 66 28 ** 67 33 ** 68 24 ** 69 36 ** 70 55 *** 71 39 ** 72 55 *** 73 77 **** 74 59 *** 75 69 **** 76 93 ***** 77 100 ***** 78 110 ****** 79 152 ******** 80 156 ******** 81 172 ********* 82 210 ********** 83 193 ********** 84 242 ************ 85 278 ************** 86 313 *************** 87 393 ******************* 88 477 *********************** 89 608 ***************************** 90 689 ********************************* 91 950 ********************************************** 92 1131 ****************************************************** 93 1278 ************************************************************* 94 1244 ************************************************************ 95 945 ********************************************* 96 902 ******************************************* 97 1056 *************************************************** 98 604 ***************************** 99 48 *** -> best cutoff for all runs: 0.57 -> with weighted total 1*40 fp + 51 fn = 91 -> fp rate 0.308% fn rate 0.392% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik And rates cmp.py results/timcombinefalses.txt -> results/timcombinetrues.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 1.077 1.077 tied 0.769 0.769 tied 0.769 0.769 tied 0.923 0.923 tied 0.769 0.769 tied 0.538 0.538 tied 0.538 0.538 tied 0.692 0.692 tied 0.769 0.769 tied 0.692 0.692 tied won 0 times tied 10 times lost 0 times total unique fp went from 98 to 98 tied mean fp % went from 0.753846153846 to 0.753846153846 tied false negative percentages 0.154 0.154 tied 0.154 0.154 tied 0.231 0.231 tied 0.077 0.077 tied 0.000 0.000 tied 0.231 0.231 tied 0.231 0.231 tied 0.077 0.077 tied 0.154 0.154 tied 0.231 0.231 tied won 0 times tied 
10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.153846153846 to 0.153846153846 tied ham mean ham sdev 25.47 12.23 -51.98% 7.31 9.02 +23.39% 25.37 12.04 -52.54% 7.07 8.57 +21.22% 25.56 12.08 -52.74% 6.96 8.44 +21.26% 25.57 12.21 -52.25% 7.09 8.65 +22.00% 25.33 11.98 -52.70% 6.94 8.40 +21.04% 25.56 12.20 -52.27% 6.77 8.16 +20.53% 25.29 11.69 -53.78% 6.71 7.80 +16.24% 25.19 11.61 -53.91% 6.71 7.91 +17.88% 25.07 11.63 -53.61% 7.02 8.31 +18.38% 25.14 11.60 -53.86% 6.88 7.94 +15.41% ham mean and sdev for all runs 25.35 11.93 -52.94% 6.95 8.33 +19.86% spam mean spam sdev 80.93 90.31 +11.59% 7.72 7.59 -1.68% 81.17 90.59 +11.61% 7.73 7.68 -0.65% 81.36 90.72 +11.50% 7.52 7.40 -1.60% 81.51 90.91 +11.53% 7.40 7.16 -3.24% 81.02 90.54 +11.75% 7.19 6.93 -3.62% 81.26 90.68 +11.59% 7.41 7.23 -2.43% 81.03 90.49 +11.67% 7.52 7.25 -3.59% 81.08 90.61 +11.75% 7.48 7.29 -2.54% 81.47 90.93 +11.61% 7.54 7.21 -4.38% 80.93 90.40 +11.70% 7.95 7.80 -1.89% spam mean and sdev for all runs 81.18 90.62 +11.63% 7.55 7.36 -2.52% ham/spam mean difference: 55.83 78.69 +22.86 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Thu Oct 10 15:57:46 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 10 Oct 2002 10:57:46 -0400 Subject: [Spambayes] timcombine comparison with varied cutoff Message-ID: <3DA55CE2.767.F327508@localhost> I re-ran the comparison between use_tim_combine false --> true this time, I set the spam_cutoff to the recommended value for the false (0.53) and true (0.57) case. ran rates and cmp. 
results/timcombinefalse053s.txt -> results/timcombinetrue057s.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 0.385 0.231 won -40.00% 0.462 0.462 tied 0.385 0.231 won -40.00% 0.615 0.538 won -12.52% 0.462 0.231 won -50.00% 0.231 0.231 tied 0.154 0.154 tied 0.154 0.154 tied 0.615 0.538 won -12.52% 0.385 0.308 won -20.00% won 6 times tied 4 times lost 0 times total unique fp went from 50 to 40 won -20.00% mean fp % went from 0.384615384615 to 0.307692307692 won -20.00% false negative percentages 0.308 0.385 lost +25.00% 0.385 0.385 tied 0.385 0.385 tied 0.385 0.462 lost +20.00% 0.231 0.231 tied 0.308 0.385 lost +25.00% 0.385 0.385 tied 0.308 0.385 lost +25.00% 0.308 0.538 lost +74.68% 
0.308 0.385 lost +25.00% won 0 times tied 4 times lost 6 times total unique fn went from 43 to 51 lost +18.60% mean fn % went from 0.330769230769 to 0.392307692307 lost +18.60% ham mean ham sdev 25.47 12.23 -51.98% 7.31 9.02 +23.39% 25.37 12.04 -52.54% 7.07 8.57 +21.22% 25.56 12.08 -52.74% 6.96 8.44 +21.26% 25.57 12.21 -52.25% 7.09 8.65 +22.00% 25.33 11.98 -52.70% 6.94 8.40 +21.04% 25.56 12.20 -52.27% 6.77 8.16 +20.53% 25.29 11.69 -53.78% 6.71 7.80 +16.24% 25.19 11.61 -53.91% 6.71 7.91 +17.88% 25.07 11.63 -53.61% 7.02 8.31 +18.38% 25.14 11.60 -53.86% 6.88 7.94 +15.41% ham mean and sdev for all runs 25.35 11.93 -52.94% 6.95 8.33 +19.86% spam mean spam sdev 80.93 90.31 +11.59% 7.72 7.59 -1.68% 81.17 90.59 +11.61% 7.73 7.68 -0.65% 81.36 90.72 +11.50% 7.52 7.40 -1.60% 81.51 90.91 +11.53% 7.40 7.16 -3.24% 81.02 90.54 +11.75% 7.19 6.93 -3.62% 81.26 90.68 +11.59% 7.41 7.23 -2.43% 81.03 90.49 +11.67% 7.52 7.25 -3.59% 81.08 90.61 +11.75% 7.48 7.29 -2.54% 81.47 90.93 +11.61% 7.54 7.21 -4.38% 80.93 90.40 +11.70% 7.95 7.80 -1.89% spam mean and sdev for all runs 81.18 90.62 +11.63% 7.55 7.36 -2.52% ham/spam mean difference: 55.83 78.69 +22.86 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From popiel@wolfskeep.com Thu Oct 10 17:07:57 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 09:07:57 -0700 Subject: [Spambayes] CLT run results Message-ID: <20021010160757.6AF69F59E@cashew.wolfskeep.com> Not much to say about this one. The magic of the 2:3 ham:spam ratio is maintained across default, clt1, clt2, and clt3. This nudges me to believe that it's something about my corpus or perhaps a universal constant. (Brad's posted results seemed to have high k at 2:3, too). As others have shown, the clt total error rate is lower than that of the default classifier, but the fp rate is higher. I have not yet looked at the certainty stuff that clt gives. 
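An aside on the k figure mentioned above and tabulated below: from the numbers, it appears to be the ham/spam mean difference divided by the sum of the ham and spam score sdevs. That is an inference from the tables, not a definition stated in the thread; a quick sketch:

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Separation between the ham and spam score populations, measured
    in units of their combined spread."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)

# First column of the default-classifier table below (50-200 ham:spam):
# h mean 24.25, h sdev 7.52, s mean 77.56, s sdev 8.24
k = separation_k(24.25, 7.52, 77.56, 8.24)
print(round(k, 2))  # matches the tabulated k of 3.38
```

Larger k means the two score distributions are farther apart relative to their widths, which is why a higher k suggests a more comfortable place to put the cutoff.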
I used my modified timcv.py (posted earlier, and on my website)... but if you want to reproduce my results, use Tim's version instead (it makes more sense to have the training style as an ini option). Just make sure to use the full retraining when doing clt tests. I also retrieved my 10 set configuration (from before I rebalanced for 15 sets in my last experiment). Note that I did _not_ rebalance to get back to this point; I untarred the archive I'd made. This means that the comparison against the 10set data from my ratio2 experiment might actually be valid. The default (robinson) classifier results (from the ratio2 experiment): -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 2 3 3 3 4 3 3 fp %: 0.40 0.40 0.30 0.24 0.27 0.17 0.15 fn tot: 32 41 43 43 47 48 51 fn %: 1.60 2.34 2.87 3.44 4.70 6.40 10.20 h mean: 24.25 21.75 20.12 18.87 18.33 17.72 16.71 h sdev: 7.52 7.13 7.04 7.09 7.16 7.31 7.43 s mean: 77.56 76.66 75.93 74.85 74.13 72.80 70.57 s sdev: 8.24 8.62 8.77 9.09 9.68 9.90 10.54 mean diff: 53.31 54.91 55.81 55.98 55.80 55.08 53.86 k: 3.38 3.49 3.53 3.46 3.31 3.20 3.00 clt1 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 9 4 6 6 10 10 11 fp %: 1.80 0.53 0.60 0.48 0.67 0.57 0.55 fn tot: 6 6 4 6 9 10 13 fn %: 0.30 0.34 0.27 0.48 0.90 1.33 2.60 h mean: 3.17 1.58 1.29 1.22 1.09 0.91 0.77 h sdev: 14.74 9.77 8.77 8.66 8.54 7.83 7.09 s mean: 99.55 99.32 99.18 98.85 98.22 97.88 96.42 s sdev: 5.66 6.68 7.06 7.96 10.00 11.57 14.67 mean diff: 96.38 97.74 97.89 97.63 97.13 96.97 95.65 k: 4.72 5.94 6.18 5.87 5.24 5.00 4.40 clt2 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] 
-> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 10 5 6 6 9 10 8 fp %: 2.00 0.67 0.60 0.48 0.60 0.57 0.40 fn tot: 6 6 4 6 11 14 16 fn %: 0.30 0.34 0.27 0.48 1.10 1.87 3.20 h mean: 3.37 1.39 0.89 0.68 0.57 0.57 0.47 h sdev: 15.03 9.31 8.28 7.56 7.17 7.15 6.20 s mean: 99.65 99.43 99.37 99.01 98.46 97.94 96.41 s sdev: 5.22 6.49 6.37 7.75 9.45 11.49 15.04 mean diff: 96.28 98.04 98.48 98.33 97.89 97.37 95.94 k: 4.75 6.21 6.72 6.42 5.89 5.22 4.52 clt3 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 9 4 5 6 8 9 8 fp %: 1.80 0.53 0.50 0.48 0.53 0.51 0.40 fn tot: 7 7 5 11 18 21 21 fn %: 0.35 0.40 0.33 0.88 1.80 2.80 4.20 h mean: 3.27 1.06 0.74 0.48 0.53 0.46 0.38 h sdev: 14.54 8.44 7.51 6.31 6.81 5.85 5.12 s mean: 99.58 99.35 99.18 98.61 97.81 97.07 95.11 s sdev: 5.78 6.80 7.30 8.89 11.15 13.34 17.31 mean diff: 96.31 98.29 98.44 98.13 97.28 96.61 94.73 k: 4.74 6.45 6.65 6.46 5.42 5.03 4.22 The clt variants all are sensitive to the ham:spam ratio in both fp and fn, and the directions are crossed (which makes sense). It's impossible to tell from the fp and fn numbers where the sweet spot really is, but the k values seem to point at 2:3. All of this is (of course) on my website at: http://www.wolfskeep.com/~popiel/spambayes/clt - Alex From popiel@wolfskeep.com Thu Oct 10 19:58:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 11:58:27 -0700 Subject: [Spambayes] A Tim-Combining run Message-ID: <20021010185827.39DF9F59E@cashew.wolfskeep.com> All of this is on my website at: http://www.wolfskeep.com/~popiel/spambayes/timcomb Tim combining looks interesting, but I can't really tell if it's a win or a lose. The fp/fn rates only changed by a message or two. 
On the other hand, Tim combining shoved one of my ham scores into the 80s, and some of my spam scores into the single digits. Ouch. For this one, I haven't done the full ratio analysis... I only did the 1:1 case with 10 sets of 200 each. Like Brad, I ran it once to find out what the cutoffs should be, then ran it again to utilize those cutoffs. Default combining is cv1, Tim combining is cv2. The one thing of note is that I seem to have surprisingly low cutoff values (0.42 and 0.38). Default combining, spam_cutoff 0.42: -> Ham scores for all runs: 2000 items; mean 18.07; sdev 7.22 -> min 0.763677; median 17.9834; max 63.3652 * = 3 items 0 2 * 1 20 ******* 2 14 ***** 3 26 ********* 4 33 *********** 5 21 ******* 6 27 ********* 7 29 ********** 8 26 ********* 9 25 ********* 10 41 ************** 11 56 ******************* 12 53 ****************** 13 88 ****************************** 14 115 *************************************** 15 134 ********************************************* 16 153 *************************************************** 17 142 ************************************************ 18 130 ******************************************** 19 159 ***************************************************** 20 129 ******************************************* 21 111 ************************************* 22 87 ***************************** 23 74 ************************* 24 59 ******************** 25 55 ******************* 26 40 ************** 27 28 ********** 28 18 ****** 29 17 ****** 30 19 ******* 31 8 *** 32 2 * 33 10 **** 34 8 *** 35 7 *** 36 12 **** 37 3 * 38 2 * 39 1 * 40 0 41 1 * 42 1 * 43 0 44 1 * 45 0 46 1 * 47 0 48 3 * 49 0 50 2 * 51 0 52 1 * 53 0 54 2 * 55 0 56 0 57 0 58 0 59 2 * 60 0 61 1 * 62 0 63 1 * 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 0 79 0 80 0 81 0 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 2000 items; mean 75.97; sdev 9.00 -> min 19.5284; median 
77.2341; max 98.0328 * = 2 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 1 * 20 0 21 0 22 1 * 23 0 24 1 * 25 0 26 0 27 0 28 0 29 1 * 30 1 * 31 0 32 0 33 0 34 1 * 35 0 36 0 37 0 38 0 39 0 40 0 41 0 42 1 * 43 0 44 2 * 45 4 ** 46 4 ** 47 4 ** 48 2 * 49 3 ** 50 7 **** 51 3 ** 52 4 ** 53 1 * 54 6 *** 55 12 ****** 56 11 ****** 57 12 ****** 58 7 **** 59 17 ********* 60 19 ********** 61 21 *********** 62 17 ********* 63 17 ********* 64 31 **************** 65 31 **************** 66 44 ********************** 67 32 **************** 68 53 *************************** 69 44 ********************** 70 51 ************************** 71 63 ******************************** 72 76 ************************************** 73 95 ************************************************ 74 88 ******************************************** 75 90 ********************************************* 76 97 ************************************************* 77 107 ****************************************************** 78 106 ***************************************************** 79 110 ******************************************************* 80 110 ******************************************************* 81 90 ********************************************* 82 87 ******************************************** 83 90 ********************************************* 84 73 ************************************* 85 63 ******************************** 86 42 ********************* 87 56 **************************** 88 26 ************* 89 18 ********* 90 13 ******* 91 11 ****** 92 8 **** 93 2 * 94 1 * 95 1 * 96 5 *** 97 3 ** 98 3 ** 99 0 -> best cutoff for all runs: 0.42 -> with weighted total 1*15 fp + 6 fn = 21 -> fp rate 0.75% fn rate 0.3% -> matched at 0.43 with 14 fp & 7 fn; fp rate 0.7%; fn rate 0.35% -> matched at 0.44 with 14 fp & 7 fn; fp rate 0.7%; fn rate 0.35% With Tim combining, spam_cutoff 0.38: -> Ham scores for all runs: 2000 items; mean 8.43; sdev 6.15 -> min 0.66161; 
median 7.56964; max 81.0785 * = 4 items 0 14 **** 1 95 ************************ 2 115 ***************************** 3 126 ******************************** 4 169 ******************************************* 5 189 ************************************************ 6 188 *********************************************** 7 186 *********************************************** 8 176 ******************************************** 9 178 ********************************************* 10 128 ******************************** 11 90 *********************** 12 64 **************** 13 76 ******************* 14 45 ************ 15 45 ************ 16 21 ****** 17 18 ***** 18 14 **** 19 14 **** 20 3 * 21 5 ** 22 7 ** 23 5 ** 24 3 * 25 1 * 26 3 * 27 2 * 28 1 * 29 3 * 30 0 31 0 32 0 33 0 34 0 35 1 * 36 0 37 1 * 38 0 39 1 * 40 0 41 0 42 0 43 1 * 44 0 45 0 46 1 * 47 2 * 48 0 49 0 50 1 * 51 0 52 1 * 53 0 54 0 55 1 * 56 0 57 0 58 1 * 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 3 * 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 0 79 0 80 0 81 1 * 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 2000 items; mean 86.91; sdev 10.12 -> min 8.51412; median 89.773; max 98.1726 * = 3 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 1 * 9 1 * 10 0 11 0 12 0 13 0 14 1 * 15 1 * 16 0 17 1 * 18 0 19 0 20 0 21 1 * 22 0 23 0 24 0 25 0 26 1 * 27 0 28 0 29 0 30 0 31 0 32 0 33 1 * 34 0 35 0 36 0 37 0 38 3 * 39 1 * 40 1 * 41 0 42 1 * 43 1 * 44 2 * 45 4 ** 46 0 47 2 * 48 1 * 49 2 * 50 5 ** 51 2 * 52 2 * 53 1 * 54 2 * 55 2 * 56 1 * 57 0 58 5 ** 59 8 *** 60 6 ** 61 2 * 62 3 * 63 6 ** 64 5 ** 65 10 **** 66 10 **** 67 9 *** 68 12 **** 69 14 ***** 70 11 **** 71 6 ** 72 12 **** 73 14 ***** 74 23 ******** 75 18 ****** 76 15 ***** 77 20 ******* 78 21 ******* 79 30 ********** 80 35 ************ 81 39 ************* 82 41 ************** 83 66 ********************** 84 72 ************************ 85 75 ************************* 86 98 
********************************* 87 69 *********************** 88 118 **************************************** 89 112 ************************************** 90 154 **************************************************** 91 151 *************************************************** 92 160 ****************************************************** 93 140 *********************************************** 94 142 ************************************************ 95 83 **************************** 96 91 ******************************* 97 49 ***************** 98 4 ** 99 0 -> best cutoff for all runs: 0.38 -> with weighted total 1*14 fp + 8 fn = 22 -> fp rate 0.7% fn rate 0.4% Finally, results.txt: cv1s -> cv2s -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams false positive percentages 0.500 0.500 tied 0.500 0.500 
tied 0.000 0.000 tied 1.000 1.000 tied 0.500 0.500 tied 2.000 1.500 won -25.00% 1.000 1.000 tied 0.500 0.500 tied 1.000 1.000 tied 0.500 0.500 tied won 1 times tied 9 times lost 0 times total unique fp went from 15 to 14 won -6.67% mean fp % went from 0.75 to 0.7 won -6.67% false negative percentages 0.500 0.500 tied 0.000 0.000 tied 1.000 1.500 lost +50.00% 0.500 0.500 tied 0.000 0.500 lost +(was 0) 0.500 0.500 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 8 times lost 2 times total unique fn went from 6 to 8 lost +33.33% mean fn % went from 0.3 to 0.4 lost +33.33% ham mean ham sdev 17.22 7.72 -55.17% 7.39 6.43 -12.99% 18.69 8.46 -54.74% 7.27 5.77 -20.63% 18.86 8.94 -52.60% 6.50 4.71 -27.54% 16.79 7.92 -52.83% 7.75 6.01 -22.45% 18.66 8.88 -52.41% 7.09 5.98 -15.66% 18.47 8.99 -51.33% 7.83 8.27 +5.62% 18.19 8.51 -53.22% 6.99 6.02 -13.88% 18.38 8.44 -54.08% 6.80 5.45 -19.85% 17.67 8.38 -52.57% 7.88 7.12 -9.64% 17.72 8.10 -54.29% 6.18 4.79 -22.49% ham mean and sdev for all runs 18.07 8.43 -53.35% 7.22 6.15 -14.82% spam mean spam sdev 75.58 86.54 +14.50% 9.15 10.45 +14.21% 76.81 87.80 +14.31% 8.53 8.21 -3.75% 74.95 85.60 +14.21% 9.44 12.09 +28.07% 76.18 87.24 +14.52% 8.64 9.83 +13.77% 76.55 87.63 +14.47% 8.84 9.54 +7.92% 76.08 86.83 +14.13% 8.69 10.19 +17.26% 75.61 86.38 +14.24% 9.72 11.14 +14.61% 76.51 87.65 +14.56% 8.30 8.75 +5.42% 75.92 86.79 +14.32% 9.62 11.13 +15.70% 75.52 86.64 +14.72% 8.76 8.95 +2.17% spam mean and sdev for all runs 75.97 86.91 +14.40% 9.00 10.12 +12.44% ham/spam mean difference: 57.90 78.48 +20.58 - Alex From popiel@wolfskeep.com Thu Oct 10 20:13:49 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 12:13:49 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: Message from Tim Peters of "Wed, 09 Oct 2002 23:29:38 EDT." References: Message-ID: <20021010191349.57718F59E@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. 
Alexander Popiel] >> I'll run this one after I get done with my initial clt tests >> (which are taking about 4.5 hours each :-/ ). > >Use less data? Yes, I could go back to using only 5 sets instead of 10... but then my results would be a bit less comparable with other runs I've done. >> I can't really say anything else, yet, but clt seems _much_ slower >> than the default classifier. > >I haven't really noticed that. If you're using your "--trainstyle full" >patch with timcv, then, yes, it would be enormously slower -- timcv gets >enormous *efficiency* benefits (both instruction-count and temporal cache >locality) out of incremental learning and unlearning. > >The "third training pass" unique to the clt methods also doubles the >training time (each msg in the training data is tokenized once to update the >wordprobs, and then a second time to compute the clt ham and spam population >statistics). Is it worth caching the token streams somehow? (I'm thinking not, since this is still in the research-project stage...) Quite possibly the problem is that I'm running all this on a PII-300 with only 64M RAM, which is also running X (but not Gnome or KDE; I'm a hardened twm user!)... - Alex From tim.one@comcast.net Thu Oct 10 21:09:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 10 Oct 2002 16:09:14 -0400 Subject: [Spambayes] A Tim-Combining run In-Reply-To: <20021010185827.39DF9F59E@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > All of this is on my website at: > > http://www.wolfskeep.com/~popiel/spambayes/timcomb > > Tim combining looks interesting, but I can't really tell if it's > a win or a lose. The fp/fn rates only changed by a message or two. The error rates aren't really the point here. As with the clt schemes, the point is whether you get a more useful middle ground. There's no way to tell that just from running the tests, though -- you have to stare at the mistakes and think.
> On the other hand, Tim combining shoved one of my ham scores into > the 80s, and some of my spam scores into the single digits. Ouch. Why is that painful? For example, if they were mistakes before too, what difference does it make to you if their scores change? In my large test run, it made the same mistakes before and after, but the worst mistakes were already so far out of range of a *usable* "middle ground" before that it made no difference that the scores got more extreme after. Those particular false positives and negatives are never going to swing into the other category, short of never calling anything spam, or never calling anything ham. The usefulness of the change was in a different dimension: the middle ground in which *marginal* mistakes lived contained significantly fewer messages after. The extremes are hopeless under any scheme (e.g., my "ham" consisting almost entirely of a giant Nigerian spam quote is simply never going to be *called* ham by any useful scheme -- and this doesn't have much of anything to do with how spamprobs get combined, the msg simply has an overwhelming number of overwhelmingly large-spamprob words). So one question for you is whether the extreme mistakes in your runs are in fact hopeless (remember that we're running a computer program here, not a psychic hotline ). This requires thought and careful judgment more than running tests. At some point the belief is that we'll actually deploy this code, and then we need a no-fudging way of saying "ham", "spam", "not sure". The clt schemes have appeared to be the only hope of getting a useful "not sure" category, but this variation on the non-clt scheme *may* be able to get there without the extra complication and expense of the clt schemes. From popiel@wolfskeep.com Thu Oct 10 21:36:09 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 13:36:09 -0700 Subject: [Spambayes] A Tim-Combining run In-Reply-To: Message from Tim Peters of "Thu, 10 Oct 2002 16:09:14 EDT."
References: Message-ID: <20021010203609.73FC1F59E@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> On the other hand, Tim combining shoved one of my ham scores into >> the 80s, and some of my spam scores into the single digits. Ouch. > >Why is that painful? For example, if they were mistakes before too, what >difference does it make to you if their scores change? In my large test >run, it made the same mistakes before and after, but the worst mistakes were >already so far out of range of a *usable* "middle ground" before that it >made no difference that the scores got more extreme after. Point. I wasn't thinking. *bonk* >So one question for you is whether the extreme mistakes in your runs are in >fact hopeless (remember that we're running a computer program here, not a >psychic hotline ). Some are... like the FDIC sending me notice that NextBank folded, and (insert long list of marketroid-named services) would no longer be available as it went into receivership. I'll take a closer look. But darn it, I wanted to have it tell me if I was gonna have a hot date this weekend after I won the lottery! (I'd even settle for it telling me that my life is hard because I don't listen to it more often at $1.75 a minute, first 10 minutes free.) ;-) - Alex From tim.one@comcast.net Sat Oct 12 00:53:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 11 Oct 2002 19:53:45 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Gary Robinson] > ... > It's not the product of the p's that is a chi-square distribution, it's > the following, given p1, p2..., pn: > > 2*((ln p1) + (ln p2) + ... + (ln pn)) > > That expression has a chi-square distribution with 2n degrees of > freedom. I haven't found a reference to this online, and don't have reasonable access to a good technical library, so I need your help to get this straight. 
The first thing that strikes me is that it can't be quite right : a chi-squared statistic is positive by its very nature, but the expression is a sum of logs of values in (0., 1), so is necessarily negative. Here's the chi-squared function I'm using:

"""
import math as _math

def chi2Q(x2, v, exp=_math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term
    return sum
"""

It's an especially simple and numerically stable calculation when v is even, and v always is even if the formulation is right. I understand I could save time with tabulated values, but speeding this is premature. Example:

>>> chi2Q(129.561, 100)
0.025000686582048785
>>>

Abramowitz & Stegun give 129.561 as the 0.025 point for the chi-squared distribution with 100 degrees of freedom. Etc. I'm confident *that* function is working correctly.

> So you feed THAT into the inverse-chi square function to get a p-value.

I'm feeding its negation in instead, since the correct result would be 1.0 for any negative x2 input.

> Let invchi(x, f), where x is the random variable and f is the degrees of
> freedom, be the inverse chi square function. Let S be a number near 1 when
> the email looks spammy and H be a number near 1 when the email
> looks hammy.
>
> Then you want
>
> S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)
>
> and
>
> H = 1 - invchi(2*((ln p1) + (ln p2) + ...
+ (ln pn)), 2*n) OK, I believe I'm doing that, but multiplying the first argument by -2 instead of 2: """ from Histogram import Hist from random import random import sys h = Hist(20, lo=0.0, hi=1.0) def judge(ps, ln=_math.log): H = S = 0.0 for p in ps: S += ln(1.0 - p) H += ln(p) n = len(ps) S = 1.0 - chi2Q(-2.0 * S, 2*n) H = 1.0 - chi2Q(-2.0 * H, 2*n) return S/(S+H) warp = 0 if len(sys.argv) > 1: warp = int(sys.argv[1]) for i in range(5000): ps = [random() for j in range(50)] p = judge(ps + [0.99] * warp) h.add(p) print "Result for random vectors of 50 probs, +", warp, "forced to 0.99" print h.display() """ Note: as usual, scaling (S-H)/(S+H) from [-1, 1] into [0, 1] is ((S-H)/(S+H) + 1)/2 = ((S-H+S+H)/(S+H))/2 = (2*S/(S+H))/2 = S/(S+H) The bad(?) news is that, on random inputs, this is all over the map: Result for random vectors of 50 probs, + 0 forced to 0.99 5000 items; mean 0.50; sdev 0.27 -> min 0.000219435; median 0.49027; max 0.999817 * = 6 items 0.00 206 *********************************** 0.05 215 ************************************ 0.10 209 *********************************** 0.15 240 **************************************** 0.20 282 *********************************************** 0.25 239 **************************************** 0.30 270 ********************************************* 0.35 289 ************************************************* 0.40 276 ********************************************** 0.45 325 ******************************************************* 0.50 291 ************************************************* 0.55 300 ************************************************** 0.60 278 *********************************************** 0.65 267 ********************************************* 0.70 234 *************************************** 0.75 234 *************************************** 0.80 211 ************************************ 0.85 207 *********************************** 0.90 213 ************************************ 0.95 214 
************************************ I don't think it's uniformly distributed (across many runs, the small peakedness near the midpoint persists), but it's close. The better news is that it's indeed very sensitive to bias (perhaps that's *why* all-random data scores all over the map?): Result for random vectors of 50 probs, + 1 forced to 0.99 5000 items; mean 0.59; sdev 0.24 -> min 0.00175781; median 0.596673; max 0.999818 * = 7 items 0.00 49 ******* 0.05 94 ************** 0.10 119 ***************** 0.15 109 **************** 0.20 153 ********************** 0.25 185 *************************** 0.30 214 ******************************* 0.35 253 ************************************* 0.40 286 ***************************************** 0.45 338 ************************************************* 0.50 344 ************************************************** 0.55 381 ******************************************************* 0.60 369 ***************************************************** 0.65 340 ************************************************* 0.70 325 *********************************************** 0.75 315 ********************************************* 0.80 292 ****************************************** 0.85 285 ***************************************** 0.90 255 ************************************* 0.95 294 ****************************************** Result for random vectors of 50 probs, + 2 forced to 0.99 5000 items; mean 0.66; sdev 0.21 -> min 0.0214171; median 0.667916; max 0.9999 * = 7 items 0.00 4 * 0.05 17 *** 0.10 45 ******* 0.15 50 ******** 0.20 58 ********* 0.25 103 *************** 0.30 137 ******************** 0.35 181 ************************** 0.40 259 ************************************* 0.45 324 *********************************************** 0.50 372 ****************************************************** 0.55 377 ****************************************************** 0.60 412 *********************************************************** 0.65 427 
************************************************************* 0.70 345 ************************************************** 0.75 369 ***************************************************** 0.80 379 ******************************************************* 0.85 370 ***************************************************** 0.90 376 ****************************************************** 0.95 395 ********************************************************* Result for random vectors of 50 probs, + 10 forced to 0.99 5000 items; mean 0.88; sdev 0.13 -> min 0.494068; median 0.922177; max 1 * = 33 items 0.00 0 0.05 0 0.10 0 0.15 0 0.20 0 0.25 0 0.30 0 0.35 0 0.40 0 0.45 1 * 0.50 84 *** 0.55 139 ***** 0.60 172 ****** 0.65 257 ******** 0.70 282 ********* 0.75 326 ********** 0.80 369 ************ 0.85 560 ***************** 0.90 800 ************************* 0.95 2010 ************************************************************* Result for random vectors of 50 probs, + 20 forced to 0.99 5000 items; mean 0.97; sdev 0.06 -> min 0.543929; median 0.996147; max 1 * = 69 items 0.00 0 0.05 0 0.10 0 0.15 0 0.20 0 0.25 0 0.30 0 0.35 0 0.40 0 0.45 0 0.50 1 * 0.55 4 * 0.60 12 * 0.65 25 * 0.70 32 * 0.75 48 * 0.80 128 ** 0.85 203 *** 0.90 383 ****** 0.95 4164 ************************************************************* I won't bother to show it here, but the histograms are essentially mirror images if I force the bias value to 0.01 instead of to 0.99. Does the near-uniform spread of "scores" on wholly random inputs strike you as sane or insane? I don't understand the theoretical underpinnings of this test, so can't really guess. It did surprise me -- I guess I expected strong clustering at 0.5. 
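For readers who want to reproduce the shape of these experiments without the Histogram/random scaffolding, here is a self-contained sketch of the same machinery (chi2Q restated so it runs alone) on deterministic inputs. Note the sign convention being debated above: it is -2*sum(ln p), per Fisher's method for combining p-values, that follows a chi-squared distribution with 2n degrees of freedom, which is why the negation gets fed in.

```python
import math

def chi2Q(x2, v):
    """Return prob(chisq >= x2) for v degrees of freedom (v even).

    Same series as the function quoted earlier in the thread: the
    chi-squared survival function with even dof is a partial Poisson
    sum with mean x2/2.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

# Sanity check against the Abramowitz & Stegun 0.025 point at 100 dof
# quoted above.
print(chi2Q(129.561, 100))  # ~0.025

def judge(ps):
    """The S/(S+H) combination from the script above, minus the warp loop."""
    S = sum(math.log(1.0 - p) for p in ps)
    H = sum(math.log(p) for p in ps)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)  # negated: Fisher's -2*sum(ln(1-p))
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return S / (S + H)

print(judge([0.5] * 50))   # perfectly neutral evidence -> 0.5
print(judge([0.99] * 50))  # uniform spammy bias -> very near 1
print(judge([0.01] * 50))  # uniform hammy bias -> very near 0
```

Deterministic neutral inputs sit at exactly 0.5 by the symmetry of the S/(S+H) scaling; the all-over-the-map histograms above come from feeding in *random* vectors, where S and H each wander freely.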
From tim.one@comcast.net Sat Oct 12 02:34:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 11 Oct 2002 21:34:50 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: Regardless of whether the chi-squared code makes sense, I whipped up another spamprob() variant to use it, and checked it in. There's a new option: [Classifier] use_chi_squared_combining: False This is yet another alternative to use_tim_combining (by the way, offline Gary and I agreed that tim_combining isn't biased, but are still butting heads over whether it's actually just a trivial transformation of Gary-combining ; scores from each are always on the same *side* of 0.5, but tim-combining scores are always at least as far from 0.5 as Gary-combining scores, and usually significantly farther -- that's why the spread increases so dramatically). Small test run, 10-fold CV with 400+400 in each set. As usual when switching combining schemes, the "won/lost" things don't make sense for the "after" run, because the appropriate value for spam_cutoff changes.
The before run is all-default, the after run just setting the new option true: -> tested 400 hams & 400 spams against 3600 hams & 3600 spams [ditto 19 times] false positive percentages 0.000 0.000 tied 0.000 0.250 lost +(was 0) 0.000 0.250 lost +(was 0) 0.000 0.000 tied 0.250 0.500 lost +100.00% 0.000 0.250 lost +(was 0) 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 6 times lost 4 times total unique fp went from 1 to 5 lost +400.00% mean fp % went from 0.025 to 0.125 lost +400.00% false negative percentages 0.000 0.000 tied 0.250 0.000 won -100.00% 0.000 0.000 tied 0.250 0.250 tied 0.250 0.000 won -100.00% 0.500 0.250 won -50.00% 0.000 0.000 tied 0.250 0.000 won -100.00% 0.500 0.250 won -50.00% 0.000 0.000 tied won 5 times tied 5 times lost 0 times total unique fn went from 8 to 3 won -62.50% mean fn % went from 0.2 to 0.075 won -62.50% ham mean ham sdev 27.29 0.49 -98.20% 5.80 3.68 -36.55% 27.62 0.62 -97.76% 5.57 4.91 -11.85% 27.25 0.66 -97.58% 5.52 5.40 -2.17% 27.75 0.25 -99.10% 5.36 2.39 -55.41% 27.47 0.84 -96.94% 6.07 6.78 +11.70% 27.65 0.78 -97.18% 5.84 4.68 -19.86% 28.00 0.75 -97.32% 5.85 4.41 -24.62% 27.44 0.29 -98.94% 5.35 2.47 -53.83% 27.55 0.36 -98.69% 5.31 2.66 -49.91% 27.95 0.68 -97.57% 5.85 4.37 -25.30% ham mean and sdev for all runs 27.60 0.57 -97.93% 5.66 4.39 -22.44% spam mean spam sdev 82.89 99.96 +20.59% 7.17 0.48 -93.31% 82.11 99.84 +21.59% 7.04 2.11 -70.03% 81.34 99.93 +22.85% 7.30 0.79 -89.18% 81.73 99.84 +22.16% 7.38 2.66 -63.96% 82.07 99.85 +21.66% 6.78 1.85 -72.71% 82.02 99.70 +21.56% 7.32 3.28 -55.19% 82.03 99.91 +21.80% 7.05 1.27 -81.99% 82.22 99.93 +21.54% 6.75 0.73 -89.19% 82.14 99.70 +21.38% 7.50 3.27 -56.40% 82.30 99.92 +21.41% 7.30 0.84 -88.49% spam mean and sdev for all runs 82.08 99.86 +21.66% 7.17 2.00 -72.11% ham/spam mean difference: 54.48 99.29 +44.81 Stare at what happened to the means, and it's easy to see that this is more Graham-like in its score distribution than anything we've seen 
since using Graham-combining: -> Ham scores for all runs: 4000 items; mean 0.57; sdev 4.39 -> min -2.22045e-013; median 8.33096e-009; max 100 Check out the median there: that's extreme. Note that one ham scored 1.0! That's the Nigerian-scam quote, and I don't care because it's hopeless. It actually scored 0.999999988294. * = 63 items 0.0 3813 ************************************************************* 0.5 32 * 1.0 18 * 1.5 13 * 2.0 6 * 2.5 5 * 3.0 3 * 3.5 4 * 4.0 7 * 4.5 7 * 5.0 8 * 5.5 2 * 6.0 2 * 6.5 3 * 7.0 2 * 7.5 3 * 8.0 4 * 8.5 0 9.0 4 * 9.5 2 * 10.0 2 * 10.5 0 11.0 2 * 11.5 1 * 12.0 1 * 12.5 1 * 13.0 2 * 13.5 1 * 14.0 1 * 14.5 1 * 15.0 1 * 15.5 2 * 16.0 1 * 16.5 3 * 17.0 1 * 17.5 1 * 18.0 3 * 18.5 0 19.0 1 * 19.5 0 20.0 1 * 20.5 1 * 21.0 0 21.5 1 * 22.0 0 22.5 0 23.0 1 * 23.5 0 24.0 0 24.5 0 25.0 0 25.5 2 * 26.0 2 * 26.5 0 27.0 1 * 27.5 0 28.0 0 28.5 1 * 29.0 1 * 29.5 2 * 30.0 0 30.5 0 31.0 1 * 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 1 * 34.5 0 35.0 0 35.5 0 36.0 1 * 36.5 3 * 37.0 2 * 37.5 0 38.0 0 38.5 0 39.0 0 39.5 2 * 40.0 0 40.5 1 * 41.0 1 * 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 1 * 45.0 0 45.5 1 * 46.0 0 46.5 0 47.0 0 47.5 1 * 48.0 0 48.5 0 49.0 2 * 49.5 0 50.0 0 50.5 0 51.0 0 51.5 1 * 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 1 * 55.0 1 * haven't seen this get a high score since using bigrams; 55.5 0 it's someone putting together a Python user group; 56.0 0 "fully functional", etc -- accidental spam phrases 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * "If you are interested in saving money ...": someone looking 63.5 0 to share a hotel room at a Python conference, but neglecting 64.0 0 to mention it *is* a Python conference 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 1 * this is a disturbing fp -- it's not spammish at all; 71.0 0 someone looking for help writing a webmasterish program; 71.5 0 lots of accidental high-spamprob words 72.0 0 72.5 0 
73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 1 * "TOOLS Europe 2000" conference announcement 77.0 0 ... 99.5 1 * Nigerian-scam quote -> Spam scores for all runs: 4000 items; mean 99.86; sdev 2.00 -> min 46.9565; median 100; max 100 * = 65 items Note that the *median* is 100: that's extreme. ... 46.5 1 * "Hello, my Name is BlackIntrepid" 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 1 * "Website Programmers Available Now!"; lots of tech terms 52.5 0 53.0 0 53.5 0 54.0 1 * This one slays me. It has this meta tag we ignore: It may be the most obvious spam ever created . 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 1 * 59.5 0 60.0 2 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 1 * 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 1 * If spam_cutoff had been here, it would have matched the 8 FN from the "before" run, and would have left only the Nigerian-scam and TOOLS announcement as f-p. 75.5 0 76.0 0 76.5 0 And if spam_cutoff had been here, the wretched TOOLS announcement would have gotten thru too (sorry, but that announcement is spam in my eyes) 77.0 1 * 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 1 * 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 1 * 86.0 0 86.5 1 * 87.0 0 87.5 0 88.0 3 * 88.5 1 * 89.0 2 * 89.5 0 90.0 0 90.5 0 91.0 0 91.5 1 * 92.0 2 * 92.5 0 93.0 3 * 93.5 0 94.0 2 * 94.5 0 95.0 0 95.5 1 * 96.0 1 * 96.5 3 * 97.0 2 * 97.5 1 * 98.0 4 * 98.5 6 * 99.0 3 * 99.5 3953 ************************************************************* Looks promising, albeit uncomfortably extreme. There's a huge and sparsely populated middle ground where all the mistakes live, except for the hopeless Nigerian scam quote.
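A sparse middle ground like this maps directly onto the three-way decision a deployed filter needs. A minimal sketch, where the 0.50 and 0.80 cutoffs and the ham_cutoff parameter name are purely illustrative (only spam_cutoff is a real option in these runs):

```python
def classify(score, ham_cutoff=0.50, spam_cutoff=0.80):
    """Three-way decision: scores inside [ham_cutoff, spam_cutoff]
    get kicked out for manual review.  Both cutoff values here are
    hypothetical, not spambayes defaults."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

# With chi-squared combining, almost everything scores near 0 or 1,
# so very little mail lands in "unsure".
print(classify(1e-8))    # -> ham
print(classify(0.65))    # -> unsure
print(classify(0.9999))  # -> spam
```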
Example: if we called everything from 50 thru 80 "the middle ground", that easily contains all but the Nigerian mistake, yet contains only 6 (of 4000 total) ham and only 8 (of 4000 total) spam. So in a manual-review system, this combines all the desirable properties: 1. Very little is kicked out for review. 2. There are high error rates among the msgs kicked out for review. 3. There are unmeasurably low error rates among the msgs not kicked out for review. Feel encouraged to try this if you like, but keep in mind that the *point* here is how useful the middle ground may be -- just pasting in f-p and f-n rates without analysis (== staring at the mistakes and thinking about them) won't help (unless they're both disasters). It may be wise to wait for Gary to look over my previous questions about the math -- I can't swear the implementation even makes sense at this point. From tim.one@comcast.net Sat Oct 12 07:27:29 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 02:27:29 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: OK! Gary and I exchanged info offline, and I believe the implementation of use_chi_squared_combining matches his intent for it. > ... > Example: if we called everything from 50 thru 80 "the middle > ground", ... in a manual-review system, this combines all the > desirable properties: > > 1. Very little is kicked out for review. > > 2. There are high error rates among the msgs kicked out for review. > > 3. There are unmeasurably low error rates among the msgs not kicked > out for review. On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the all-default scheme with the very touchy spam_cutoff. The middle ground is the *interesting* thing, and it's like a laser beam here (yippee!). In the "50 thru 80" range guessed at above, 1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). 
The other 2 FP scored 0.999999929221 (Nigerian scam quote) and 0.972986477986 (lady with the short question and long obnoxious employer-generated SIG). I don't believe any usable scheme will ever call those ham, though, or put them in a middle ground without greatly bloating the middle ground with correctly classified messages. 2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near 0.61, 1 near 0.63, and 1 near 0.68). The 3 remaining spam scored below 0.50: 0.35983017036 "Hello, my Name is BlackIntrepid" Except that it contained a URL and an invitation to visit it, this could have been a poorly written c.l.py post explaining a bit about hackers to newbies (and if you don't think there are plenty of those in my ham, you don't read c.l.py ). 0.39570232415 The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, whose body consists of a uuencoded text file we throw away unlooked at. (This is quite curable, but I doubt it's worth the bother -- at least until spammers take to putting everything in uuencoded text files!) 0.499567195859 (about as close to "middle ground" cutoff as can be) A giant (> 20KB) base64-encoded plain text file. I've never bothered to decode this to see what it says; like the others, though, it's been a persistent FN under all schemes. Note that we do decode this; I've always assumed it's of the "long, chatty, just-folks" flavor of tech spam that's hard to catch; the list of clues contains "cookies", "editor", "ms-dos", "backslashes", "guis", "commands", "folder", "dumb", "(well,", "cursor", and "trick" (a spamprob 0.00183748 word!). For my original purpose of looking at a scheme for c.l.py traffic, this has become the clear leader among all schemes: while it's more extreme than I might like, it made very few errors, and a minuscule middle ground (less than 0.08% of all msgs) contains 64+% of all errors.
3 FN would survive, and 2 FP, but I don't expect that any usable scheme could do better on this data. Note that Graham combining was also very extreme, but had *no* usable middle ground on this data: all mistakes had scores of almost exactly 0.0 or almost exactly 1.0 (and there were more mistakes). How does it do for you? An analysis like the above is what I'm looking for, although it surely doesn't need to be so detailed. Here's the .ini file I used: """ [Classifier] use_chi_squared_combining: True [TestDriver] spam_cutoff: 0.70 nbuckets: 200 best_cutoff_fp_weight: 10 show_false_positives: True show_false_negatives: True show_best_discriminators: 50 show_spam_lo = 0.40 show_spam_hi = 0.80 show_ham_lo = 0.40 show_ham_hi = 0.80 show_charlimit: 100000 """ Your best spam_cutoff may be different, but the point to this exercise isn't to find the best cutoff, it's to think about the middle ground. Note that I set show_{ham,spam}_{lo,hi} to values such that I would see every ham and spam that lived in my presumed middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set show_charlimit to a large value so that I'd see the full text of each such msg. Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 Sep 2000), and consists almost entirely of a rather long transcript of part of the infamous Chicago Seven trial: http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html I learned two things from this : 1. There are so many unique lexical clues when I post a thing, I can get away with posting anything. 2. 
"tyranny" is a spam clue, but "nazi" a ham clue: prob('tyranny') = 0.850877 prob('nazi') = 0.282714 leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim From grobinson@transpose.com Sat Oct 12 16:39:24 2002 From: grobinson@transpose.com (Gary Robinson) Date: Sat, 12 Oct 2002 11:39:24 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: This sounds like it's working out pretty well! If we get to the point that it becomes the accepted technique for spambayes, I'll add it to the my online essay. NOTE: As we've discussed ad nauseum, this multipicative thing is one-sided in its sensitivity, which is why we end up having to do something like S/(S+H) where S is based on (1-p) calcs for combining the p's and H is based on p calcs. There ARE meta-analytical ways of combining the p-values which are equally sensitive on both sides... but are a TAD overall less sensitive than the chi-square thing. And frankly, the S/(S+H)-style trick may take away a lot of that super-strength super sensitivity anyway -- maybe even all of the advantage over other methods (I just don't know without directly testing it). So a two-sided combining approach may perform equally well for our practical purposes... there's no way of knowing without trying. The advantage of such an approach would essentially be algorithmic elegance. No longer would we need that klugy (P-Q)/(P+Q) or S/(S+H) stuff which doesn't convert to a real probability. Instead, the combined P would be all we would need. Combined P near 1 would be spammy, and combined P near 0 would by hammy. And P would be a REAL probability (against the null hypothesis of randomness). I wouldn't expect any performance ADVANTAGE to this other approach, but it WOULD be more elegant. (Note, all these approaches depend on one or another statistical function as the current one does the inverse-chi-square). If you are interested in going that way let me know, and I'll send info on how to do it. 
Maybe you'll have another beautifully simple algorithm up your sleeve to implement the necessary statistical function. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Sat, 12 Oct 2002 02:27:29 -0400 > To: SpamBayes > Cc: Gary Robinson > Subject: RE: [Spambayes] spamprob combining > > OK! Gary and I exchanged info offline, and I believe the implementation of > use_chi_squared_combining matches his intent for it. > >> ... >> Example: if we called everything from 50 thru 80 "the middle >> ground", ... in a manual-review system, this combines all the >> desirable properties: >> >> 1. Very little is kicked out for review. >> >> 2. There are high error rates among the msgs kicked out for review. >> >> 3. There are unmeasurably low error rates among the msgs not kicked >> out for review. > > On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this > got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the > all-default scheme with the very touchy spam_cutoff. The middle ground is > the *interesting* thing, and it's like a laser beam here (yippee!). In the > "50 thru 80" range guessed at above, > > 1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). > The other 2 FP scored 0.999999929221 (Nigerian scam quote) and > 0.972986477986 (lady with the short question and long obnoxious > employer-generated SIG). I don't believe any usable scheme will > ever call those ham, though, or put them in a middle ground without > greatly bloating the middle ground with correctly classified > messages. > > 2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN > (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near > 0.61, 1 near 0.63, and 1 near 0.68).
The 3 remaining spam scored > below 0.50: > > 0.35983017036 > "Hello, my Name is BlackIntrepid" > Except that it contained a URL and an invitation to visit it, this > could have been a poorly written c.l.py post explaining a bit > about hackers to newbies (and if you don't think there are > plenty of those in my ham, you don't read c.l.py ). > > 0.39570232415 > The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, > whose body consists of a uuencoded text file we throw away > unlooked at. (This is quite curable, but I doubt it's worth > the bother -- at least until spammers take to putting everything > in uuencoded text files!) > > 0.499567195859 (about as close to "middle ground" cutoff as can be) > A giant (> 20KB) base64-encoded plain text file. I've never > bothered to decode this to see what it says; like the others, > though, it's been a persistent FN under all schemes. Note that > we do decode this; I've always assumed it's of the "long, chatty, > just-folks" flavor of tech spam that's hard to catch; the list of > clues contains "cookies", "editor", "ms-dos", "backslashes", > "guis", "commands", "folder", "dumb", "(well,", "cursor", > and "trick" (a spamprob 0.00183748 word!). > > > For my original purpose of looking at a scheme for c.l.py traffic, this has > become the clear leader among all schemes: while it's more extreme than I > might like, it made very few errors, and a miniscule middle ground (less > than 0.08% of all msgs) contains 64+% of all errors. 3 FN would survive, > and 2 FP, but I don't expect that any usable scheme could do better on this > data. Note that Graham combining was also very extreme, but had *no* usable > middle ground on this data: all mistakes had scores of almost exactly 0.0 > or almost exactly 1.0 (and there were more mistakes). > > How does it do for you? An analysis like the above is what I'm looking for, > although it surely doesn't need to be so detailed. 
Here's the .ini file I > used: > > """ > [Classifier] > use_chi_squared_combining: True > > [TestDriver] > spam_cutoff: 0.70 > > nbuckets: 200 > best_cutoff_fp_weight: 10 > > show_false_positives: True > show_false_negatives: True > show_best_discriminators: 50 > show_spam_lo = 0.40 > show_spam_hi = 0.80 > show_ham_lo = 0.40 > show_ham_hi = 0.80 > show_charlimit: 100000 > """ > > Your best spam_cutoff may be different, but the point to this exercise isn't > to find the best cutoff, it's to think about the middle ground. Note that I > set > > show_{ham,spam}_{lo,hi} > > to values such that I would see every ham and spam that lived in my presumed > middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set > show_charlimit to a large value so that I'd see the full text of each such > msg. > > Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close > to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 > Sep 2000), and consists almost entirely of a rather long transcript of part > of the infamous Chicago Seven trial: > > http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html > > I learned two things from this : > > 1. There are so many unique lexical clues when I post a thing, I can > get away with posting anything. > > 2. "tyranny" is a spam clue, but "nazi" a ham clue: > > prob('tyranny') = 0.850877 > prob('nazi') = 0.282714 > > leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim > From jm@jmason.org Wed Oct 9 13:21:11 2002 From: jm@jmason.org (Justin Mason) Date: Wed, 09 Oct 2002 13:21:11 +0100 Subject: [Spambayes] [SAtalk] fully-public corpus of mail available Message-ID: <20021009122116.6EB2416F03@jmason.org> (Please feel free to forward this message to other possibly-interested parties.) Hi all, One of the big problems working with spam classification, is finding good mail to test with. 
There are few public corpora available; Ion Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately this does not contain all of the mail message data, so would not be useful to a SpamAssassin-style system (which relies heavily on header data), for example. Another effect of not having a common, shared corpus, is the difficulty this introduces in comparing accuracy rates between spam filter software; since everyone tests using different corpora, statistics can be unportable as a result. Building public corpora is difficult, as it typically involves saving your own (classified) mail. This brings privacy problems, as your mail senders may not wish to see this made public. But what the heck, that's what I've done anyway ;) Here's a public corpus I've assembled from my own corpora, removing messages which were not public in the first place. Please feel free to download it and use it for spam-filter development. It's quite small, but should be big enough for use as a reference corpus, at least, so that hit-rate statistics can be compared across tools. Hope it helps. It lives here: http://spamassassin.org/publiccorpus/ and here's the README.txt: Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points: - All headers are reproduced in full. Some address obfuscation has taken place; hostnames in some cases have been replaced with "example.com", which should have a valid MX record (if I recall correctly). In most cases though, the headers appear as they were received. - All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites. - Copyright for the text in the messages remains with the original senders. OK, now onto the corpus description. 
It's split into three parts, as follows: - spam: 500 spam messages, all received from non-spam-trap sources. - easy_ham: 350 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc). - hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc. The corpora are prefixed with "200210", because that's the date when I assembled it, so it's as good a version string as anything else ;) . They are compressed using "bzip2". This corpus lives at http://spamassassin.org/publiccorpus/ . Mail jm - public - corpus AT jmason dot org if you have questions, or to donate mail. (Oct 9 2002 jm) ------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ Spamassassin-talk mailing list Spamassassin-talk@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-talk From bkc@murkworks.com Sat Oct 12 20:07:50 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 12 Oct 2002 15:07:50 -0400 Subject: [Spambayes] Chi True results Message-ID: <3DA83A74.20675.1A642869@localhost> I ran this twice, first to get the recommended spam cutoff, the 2nd time with the recommended cutoff in the .ini then I compared it against the tim_combine_true test I ran previously. In this message: .ini, cmp.py results, histograms from chi true run. 
[Tokenizer] mine_received_headers: True [Classifier] use_central_limit = False use_central_limit2 = False use_central_limit3 = False use_tim_combining: False use_chi_squared_combining: True [TestDriver] spam_cutoff: 0.98 show_false_negatives: True show_false_positives: True nbuckets: 200 best_cutoff_fp_weight: 10 show_spam_lo: 0.4 show_spam_hi: 0.80 show_ham_lo = 0.40 show_ham_hi = 0.80 show_charlimit: 10000 save_trained_pickles: True save_histogram_pickles: True results/timcombinetrues.txt -> results/chitrues.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 1.077 0.154 won -85.70% 0.769 0.231 won -69.96% 0.769 0.077 won -89.99% 0.923 0.154 won -83.32% 0.769 0.154 won -79.97% 0.538 0.077 won -85.69% 0.538 0.077 won -85.69% 0.692 0.000 won -100.00% 0.769 0.231 won -69.96% 0.692 0.000 won -100.00% won 10 times tied 0 times lost 0 times total unique fp went from 98 to 15 won -84.69% mean fp % went from 0.753846153846 to 0.115384615385 won -84.69% false negative percentages 0.154 0.846 lost +449.35% 0.154 1.231 lost +699.35% 0.231 1.154 lost +399.57% 0.077 0.615 lost +698.70% 0.000 0.923 lost +(was 0) 0.231 1.308 lost +466.23% 0.231 0.692 lost +199.57% 0.077 1.077 lost +1298.70% 0.154 1.231 lost +699.35% 0.231 1.231 lost +432.90% won 0 times tied 0 times lost 10 times total unique fn went from 20 to 134 lost +570.00% mean fn % went from 0.153846153846 to 1.03076923077 lost +570.00% ham mean ham sdev 12.23 1.40 -88.55% 9.02 8.67 -3.88% 12.04 1.12 -90.70% 8.57 8.09 -5.60% 12.08 1.12 -90.73% 8.44 8.02 -4.98% 12.21 1.26 -89.68% 8.65 8.62 -0.35% 11.98 1.06 -91.15% 8.40 8.03 -4.40% 12.20 1.01 -91.72% 8.16 6.87 -15.81% 11.69 0.85 -92.73% 7.80 6.57 -15.77% 11.61 0.96 -91.73% 7.91 7.06 -10.75% 11.63 1.15 -90.11% 8.31 8.38 +0.84% 11.60 1.01 -91.29% 7.94 7.62 -4.03% ham mean and sdev for all runs 11.93 1.09 -90.86% 8.33 7.83 -6.00% spam mean spam sdev 90.31 99.74 +10.44% 7.59 3.59 -52.70% 90.59 99.67 +10.02% 7.68 4.17 -45.70% 90.72 
99.68 +9.88% 7.40 4.12 -44.32% 90.91 99.83 +9.81% 7.16 2.68 -62.57% 90.54 99.84 +10.27% 6.93 2.20 -68.25% 90.68 99.66 +9.90% 7.23 4.29 -40.66% 90.49 99.67 +10.14% 7.25 4.68 -35.45% 90.61 99.79 +10.13% 7.29 2.98 -59.12% 90.93 99.75 +9.70% 7.21 3.24 -55.06% 90.40 99.54 +10.11% 7.80 5.07 -35.00% spam mean and sdev for all runs 90.62 99.72 +10.04% 7.36 3.80 -48.37% ham/spam mean difference: 78.69 98.63 +19.94 -- histogram from chi: true -> Ham scores for all runs: 13000 items; mean 1.09; sdev 7.83 -> min -2.66454e-13; median 2.85882e-12; max 100 * = 204 items 0.0 12433 ************************************************************* 0.5 71 * 1.0 43 * 1.5 33 * 2.0 14 * 2.5 15 * 3.0 12 * 3.5 5 * 4.0 14 * 4.5 11 * 5.0 6 * 5.5 9 * 6.0 9 * 6.5 5 * 7.0 6 * 7.5 3 * 8.0 7 * 8.5 2 * 9.0 5 * 9.5 5 * 10.0 5 * 10.5 5 * 11.0 3 * 11.5 4 * 12.0 7 * 12.5 2 * 13.0 3 * 13.5 2 * 14.0 3 * 14.5 4 * 15.0 3 * 15.5 3 * 16.0 0 16.5 3 * 17.0 2 * 17.5 1 * 18.0 0 18.5 5 * 19.0 3 * 19.5 1 * 20.0 1 * 20.5 3 * 21.0 0 21.5 1 * 22.0 1 * 22.5 2 * 23.0 1 * 23.5 2 * 24.0 2 * 24.5 0 25.0 0 25.5 3 * 26.0 2 * 26.5 2 * 27.0 1 * 27.5 1 * 28.0 2 * 28.5 3 * 29.0 2 * 29.5 2 * 30.0 1 * 30.5 3 * 31.0 1 * 31.5 1 * 32.0 4 * 32.5 2 * 33.0 2 * 33.5 3 * 34.0 1 * 34.5 3 * 35.0 1 * 35.5 3 * 36.0 5 * 36.5 4 * 37.0 0 37.5 3 * 38.0 1 * 38.5 1 * 39.0 0 39.5 2 * 40.0 2 * 40.5 3 * 41.0 2 * 41.5 1 * 42.0 1 * 42.5 3 * 43.0 2 * 43.5 1 * 44.0 2 * 44.5 3 * 45.0 3 * 45.5 5 * 46.0 1 * 46.5 3 * 47.0 1 * 47.5 5 * 48.0 1 * 48.5 3 * 49.0 9 * 49.5 11 * 50.0 8 * 50.5 1 * 51.0 3 * 51.5 1 * 52.0 7 * 52.5 3 * 53.0 2 * 53.5 1 * 54.0 0 54.5 1 * 55.0 2 * 55.5 0 56.0 3 * 56.5 0 57.0 0 57.5 1 * 58.0 2 * 58.5 0 59.0 0 59.5 1 * 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 2 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 1 * 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 2 * 69.0 0 69.5 1 * 70.0 1 * 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 1 * 73.0 0 73.5 0 74.0 1 * 74.5 0 75.0 0 75.5 0 76.0 2 * 76.5 0 77.0 0 77.5 0 78.0 0 78.5 1 * 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 
81.5 0 82.0 2 * 82.5 0 83.0 0 83.5 1 * 84.0 0 84.5 3 * 85.0 0 85.5 1 * 86.0 1 * 86.5 0 87.0 1 * 87.5 1 * 88.0 2 * 88.5 1 * 89.0 0 89.5 0 90.0 2 * 90.5 0 91.0 0 91.5 1 * 92.0 0 92.5 1 * 93.0 1 * 93.5 0 94.0 2 * 94.5 1 * 95.0 1 * 95.5 2 * 96.0 1 * 96.5 2 * 97.0 1 * 97.5 2 * 98.0 0 98.5 0 99.0 3 * 99.5 12 * thanks for joining paypal, ETrade news, HP Symposiom, Registration ack from Cingular, EDN renewal, X10 newsletter (argh!), FAFSA US Dept Education renewal :-(, United Connection, Network Computing Renewal, Infotel Distributing -> Spam scores for all runs: 13000 items; mean 99.72; sdev 3.80 This histogram seems broken, I have 4 or 5 spams with prob < .0.05 > Survey on Software Reuse Views and Activity > You are invited to participate in my Dissertation research on the topic of ^M > Software Reuse. (naw) VoIP solutions for providers HP Enterprise Technical Symposium (oops, this should be ham, guess I got sick of getting these) -> min 0.000127988; median 100; max 100 * = 210 items 0.0 1 * ***New SAP Opportunities*** Client interviewing now!! 0.5 1 * Certified IT professional with over 6 years of Experience on Design and Coding. 1.0 0 1.5 1 * Senior Consultant with Experience on JD Edwards, ONE WORLD, XE, CNC, AS/400 is available 2.0 0 2.5 1 * Fax / Copier Sales / service call 2078787 3.0 1 * Development Services on Telecom/Datacom Protocols 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 1 * Certified IT professional with over 6 years of Experience on Design and Coding. 
7.5 0 8.0 0 8.5 1 * 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 1 * Use the Session Scheduler to personalize your training (hp, probably mis-classified, guess I did get sick of them) 16.5 1 * VoIP solutions for providers 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 1 * 21.0 0 21.5 0 22.0 2 * 22.5 0 23.0 0 23.5 0 24.0 0 24.5 1 * 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 1 * 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 1 * 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 1 * 44.5 2 * 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 1 * 49.0 0 49.5 1 * 50.0 9 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 1 * 53.5 0 54.0 0 54.5 0 55.0 0 55.5 2 * 56.0 1 * 56.5 0 57.0 1 * 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 2 * 63.0 0 63.5 0 64.0 0 64.5 2 * 65.0 0 65.5 1 * 66.0 1 * 66.5 0 67.0 0 67.5 0 68.0 0 68.5 1 * 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 1 * 73.0 0 73.5 1 * 74.0 0 74.5 0 75.0 0 75.5 0 76.0 5 * 76.5 0 77.0 1 * 77.5 2 * 78.0 2 * 78.5 1 * 79.0 2 * 79.5 2 * 80.0 1 * 80.5 0 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 2 * 83.5 1 * 84.0 3 * 84.5 0 85.0 1 * 85.5 1 * 86.0 2 * 86.5 1 * 87.0 0 87.5 0 88.0 2 * 88.5 0 89.0 1 * 89.5 1 * 90.0 2 * 90.5 5 * 91.0 0 91.5 4 * 92.0 3 * 92.5 2 * 93.0 1 * 93.5 3 * 94.0 2 * 94.5 5 * 95.0 3 * 95.5 4 * 96.0 5 * 96.5 6 * 97.0 5 * 97.5 4 * 98.0 10 * 98.5 16 * 99.0 33 * 99.5 12807 ************************************************************* -> best cutoff for all runs: 0.98 -> with weighted total 10*15 fp + 134 fn = 284 -> fp rate 0.115% fn rate 1.03% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 12 21:09:55 
2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 16:09:55 -0400 Subject: [Spambayes] Chi True results In-Reply-To: <3DA83A74.20675.1A642869@localhost> Message-ID: [Brad Clements] > I ran this twice, first to get the recommended spam cutoff, the > 2nd time with the recommended cutoff in the .ini > > then I compared it against the tim_combine_true test I ran previously. Brad, of all the approaches you've tried here (and I really appreciate how many you've tried!), which have *you* been happiest with? The numbers can't tell me that, it's a human judgment. Note that the "recommended cutoff" isn't really a recommendation, it's a dry-as-dust number that objectively minimizes best_fp_cutoff_weight * total_fp + total_fn What you get out of that is a function of what you feed into it via selecting best_fp_cutoff_weight. The value for that *you* like is also a matter of personal judgment. > In this message: .ini, cmp.py results, histograms from chi true run. > > [Tokenizer] > mine_received_headers: True > > [Classifier] > use_central_limit = False > use_central_limit2 = False > use_central_limit3 = False > use_tim_combining: False > use_chi_squared_combining: True > > [TestDriver] > spam_cutoff: 0.98 Dang, that's big . 
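[Editor's note: the "recommended cutoff" objective Tim describes — minimize best_cutoff_fp_weight * total_fp + total_fn — amounts to a brute-force search over candidate cutoffs. A hypothetical sketch of that computation; the helper name and signature are mine, not the TestDriver's actual code.]

```python
def best_cutoff(ham_scores, spam_scores, fp_weight=10.0):
    """Brute-force the spam cutoff that minimizes
    fp_weight * fp + fn. With fp_weight=10, one false positive
    costs as much as ten false negatives."""
    best_cost, best_c = None, None
    for c in sorted(set(ham_scores) | set(spam_scores) | {0.0, 1.0}):
        fp = sum(1 for s in ham_scores if s >= c)   # ham scored as spam
        fn = sum(1 for s in spam_scores if s < c)   # spam scored as ham
        cost = fp_weight * fp + fn
        if best_cost is None or cost < best_cost:
            best_cost, best_c = cost, c
    return best_c
```

As Tim says, what comes out is a function of the weight fed in: a large fp_weight drags the "best" cutoff upward, trading false negatives for fewer false positives.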
> show_false_negatives: True > show_false_positives: True > nbuckets: 200 > best_cutoff_fp_weight: 10 > > show_spam_lo: 0.4 > show_spam_hi: 0.80 > show_ham_lo = 0.40 > show_ham_hi = 0.80 > show_charlimit: 10000 > > save_trained_pickles: True > save_histogram_pickles: True > > > > results/timcombinetrues.txt -> results/chitrues.txt > -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams > > false positive percentages > 1.077 0.154 won -85.70% > 0.769 0.231 won -69.96% > 0.769 0.077 won -89.99% > 0.923 0.154 won -83.32% > 0.769 0.154 won -79.97% > 0.538 0.077 won -85.69% > 0.538 0.077 won -85.69% > 0.692 0.000 won -100.00% > 0.769 0.231 won -69.96% > 0.692 0.000 won -100.00% > > won 10 times > tied 0 times > lost 0 times > > total unique fp went from 98 to 15 won -84.69% > mean fp % went from 0.753846153846 to 0.115384615385 won -84.69% > > false negative percentages > 0.154 0.846 lost +449.35% > 0.154 1.231 lost +699.35% > 0.231 1.154 lost +399.57% > 0.077 0.615 lost +698.70% > 0.000 0.923 lost +(was 0) > 0.231 1.308 lost +466.23% > 0.231 0.692 lost +199.57% > 0.077 1.077 lost +1298.70% > 0.154 1.231 lost +699.35% > 0.231 1.231 lost +432.90% > > won 0 times > tied 0 times > lost 10 times > > total unique fn went from 20 to 134 lost +570.00% > mean fn % went from 0.153846153846 to 1.03076923077 lost +570.00% So is that a tradeoff (massive decrease in fp vs massive increase in fn) you're happy with? Would the middle ground here be useful to you? 
> ham mean ham sdev > 12.23 1.40 -88.55% 9.02 8.67 -3.88% > 12.04 1.12 -90.70% 8.57 8.09 -5.60% > 12.08 1.12 -90.73% 8.44 8.02 -4.98% > 12.21 1.26 -89.68% 8.65 8.62 -0.35% > 11.98 1.06 -91.15% 8.40 8.03 -4.40% > 12.20 1.01 -91.72% 8.16 6.87 -15.81% > 11.69 0.85 -92.73% 7.80 6.57 -15.77% > 11.61 0.96 -91.73% 7.91 7.06 -10.75% > 11.63 1.15 -90.11% 8.31 8.38 +0.84% > 11.60 1.01 -91.29% 7.94 7.62 -4.03% > > ham mean and sdev for all runs > 11.93 1.09 -90.86% 8.33 7.83 -6.00% > > spam mean spam sdev > 90.31 99.74 +10.44% 7.59 3.59 -52.70% > 90.59 99.67 +10.02% 7.68 4.17 -45.70% > 90.72 99.68 +9.88% 7.40 4.12 -44.32% > 90.91 99.83 +9.81% 7.16 2.68 -62.57% > 90.54 99.84 +10.27% 6.93 2.20 -68.25% > 90.68 99.66 +9.90% 7.23 4.29 -40.66% > 90.49 99.67 +10.14% 7.25 4.68 -35.45% > 90.61 99.79 +10.13% 7.29 2.98 -59.12% > 90.93 99.75 +9.70% 7.21 3.24 -55.06% > 90.40 99.54 +10.11% 7.80 5.07 -35.00% > > spam mean and sdev for all runs > 90.62 99.72 +10.04% 7.36 3.80 -48.37% > > ham/spam mean difference: 78.69 98.63 +19.94 > > > -- > > histogram from chi: true > > -> Ham scores for all runs: 13000 items; mean 1.09; sdev 7.83 > -> min -2.66454e-13; median 2.85882e-12; max 100 > * = 204 items > 0.0 12433 ************************************************************* > 0.5 71 * > 1.0 43 * > 1.5 33 * > 2.0 14 * > 2.5 15 * > 3.0 12 * > 3.5 5 * > 4.0 14 * > 4.5 11 * > 5.0 6 * > 5.5 9 * > 6.0 9 * > 6.5 5 * > 7.0 6 * > 7.5 3 * > 8.0 7 * > 8.5 2 * > 9.0 5 * > 9.5 5 * > 10.0 5 * > 10.5 5 * > 11.0 3 * > 11.5 4 * > 12.0 7 * > 12.5 2 * > 13.0 3 * > 13.5 2 * > 14.0 3 * > 14.5 4 * > 15.0 3 * > 15.5 3 * > 16.0 0 > 16.5 3 * > 17.0 2 * > 17.5 1 * > 18.0 0 > 18.5 5 * > 19.0 3 * > 19.5 1 * > 20.0 1 * > 20.5 3 * > 21.0 0 > 21.5 1 * > 22.0 1 * > 22.5 2 * > 23.0 1 * > 23.5 2 * > 24.0 2 * > 24.5 0 > 25.0 0 > 25.5 3 * > 26.0 2 * > 26.5 2 * > 27.0 1 * > 27.5 1 * > 28.0 2 * > 28.5 3 * > 29.0 2 * > 29.5 2 * > 30.0 1 * > 30.5 3 * > 31.0 1 * > 31.5 1 * > 32.0 4 * > 32.5 2 * > 33.0 2 * > 33.5 3 * > 34.0 1 * > 34.5 
3 * > 35.0 1 * > 35.5 3 * > 36.0 5 * > 36.5 4 * > 37.0 0 > 37.5 3 * > 38.0 1 * > 38.5 1 * > 39.0 0 > 39.5 2 * > 40.0 2 * > 40.5 3 * > 41.0 2 * > 41.5 1 * > 42.0 1 * > 42.5 3 * > 43.0 2 * > 43.5 1 * > 44.0 2 * > 44.5 3 * > 45.0 3 * > 45.5 5 * > 46.0 1 * > 46.5 3 * > 47.0 1 * > 47.5 5 * > 48.0 1 * > 48.5 3 * > 49.0 9 * > 49.5 11 * Suppose you were to call scores of .50 thru .80 "unsure", and stuffed them in a different folder(s). Then the ham from here: > 50.0 8 * > 50.5 1 * > 51.0 3 * > 51.5 1 * > 52.0 7 * > 52.5 3 * > 53.0 2 * > 53.5 1 * > 54.0 0 > 54.5 1 * > 55.0 2 * > 55.5 0 > 56.0 3 * > 56.5 0 > 57.0 0 > 57.5 1 * > 58.0 2 * > 58.5 0 > 59.0 0 > 59.5 1 * > 60.0 1 * > 60.5 1 * > 61.0 0 > 61.5 0 > 62.0 0 > 62.5 0 > 63.0 2 * > 63.5 0 > 64.0 0 > 64.5 0 > 65.0 0 > 65.5 1 * > 66.0 0 > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 2 * > 69.0 0 > 69.5 1 * > 70.0 1 * > 70.5 0 > 71.0 0 > 71.5 1 * > 72.0 0 > 72.5 1 * > 73.0 0 > 73.5 0 > 74.0 1 * > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 2 * > 76.5 0 > 77.0 0 > 77.5 0 > 78.0 0 > 78.5 1 * > 79.0 0 > 79.5 0 through here would get "booted out for manual review". There's not much ham in this range compared to 13,000 msgs. The spam in this range would *also* get "booted out for manual review", and that would catch 41 (see next histogram) of your false negatives. Is that attractive to you? Useless? In either case, would shifting the "unsure" range change your judgment? For example, if you shifted the upper end of the unsure range from .8 to .9, that would add 16 ham to the "booted out" range, in return for catching another 19 spam. 
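[Editor's note: the "unsure" routing sketched above is just a pair of thresholds. A minimal illustration using the 0.50/0.80 cutoffs discussed in this thread; the function itself is hypothetical, not spambayes code.]

```python
def classify(score, ham_cutoff=0.50, spam_cutoff=0.80):
    """Three-way routing: ham below ham_cutoff, spam at or above
    spam_cutoff, and everything in between booted out as 'unsure'
    for manual review. Defaults are the 0.50-0.80 range guessed at
    in this thread."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Shifting the upper cutoff (e.g. from 0.80 to 0.90) widens the unsure range, catching more spam at the cost of kicking out more ham for review.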
> 80.0 1 * > 80.5 1 * > 81.0 1 * > 81.5 0 > 82.0 2 * > 82.5 0 > 83.0 0 > 83.5 1 * > 84.0 0 > 84.5 3 * > 85.0 0 > 85.5 1 * > 86.0 1 * > 86.5 0 > 87.0 1 * > 87.5 1 * > 88.0 2 * > 88.5 1 * > 89.0 0 > 89.5 0 > 90.0 2 * > 90.5 0 > 91.0 0 > 91.5 1 * > 92.0 0 > 92.5 1 * > 93.0 1 * > 93.5 0 > 94.0 2 * > 94.5 1 * > 95.0 1 * > 95.5 2 * > 96.0 1 * > 96.5 2 * > 97.0 1 * > 97.5 2 * > 98.0 0 > 98.5 0 > 99.0 3 * > 99.5 12 * thanks for joining paypal, ETrade news, HP > Symposiom, Registration ack from Cingular, > EDN renewal, X10 newsletter (argh!), FAFSA US Dept > Education renewal :-(, > United Connection, Network Computing Renewal, Infotel Distributing > > -> Spam scores for all runs: 13000 items; mean 99.72; sdev 3.80 > > This histogram seems broken, I have 4 or 5 spams with prob < .0.05 I'm not sure what you mean here. Note that for hysterical raisins, the histogram buckets are labelled with 100x the score values. So a prob of 0.05 is in the bucket with label 5.0. There are 5 spam in the histogram below under 5.0 (= score 0.05). > > Survey on Software Reuse Views and Activity > > > You are invited to participate in my Dissertation research on > the topic of ^M > > Software Reuse. > > (naw) > > VoIP solutions for providers > > HP Enterprise Technical Symposium (oops, this should be ham, > guess I got sick of getting these) > > -> min 0.000127988; median 100; max 100 > * = 210 items > 0.0 1 * ***New SAP Opportunities*** Client interviewing now!! > 0.5 1 * Certified IT professional with over 6 years of > Experience on Design > and Coding. > 1.0 0 > 1.5 1 * Senior Consultant with Experience on JD Edwards, ONE > WORLD, XE, CNC, > AS/400 is available > 2.0 0 > 2.5 1 * Fax / Copier Sales / service call 2078787 > 3.0 1 * Development Services on Telecom/Datacom Protocols > 3.5 0 > 4.0 0 > 4.5 0 > 5.0 0 > 5.5 0 > 6.0 0 > 6.5 0 > 7.0 1 * Certified IT professional with over 6 years of > Experience on Design and Coding. 
> 7.5 0 > 8.0 0 > 8.5 1 * > 9.0 0 > 9.5 0 > 10.0 0 > 10.5 0 > 11.0 0 > 11.5 0 > 12.0 0 > 12.5 0 > 13.0 0 > 13.5 0 > 14.0 0 > 14.5 0 > 15.0 0 > 15.5 0 > 16.0 1 * Use the Session Scheduler to personalize your > training (hp, probably mis-classified, guess I did get sick of them) > 16.5 1 * VoIP solutions for providers > 17.0 0 > 17.5 0 > 18.0 0 > 18.5 0 > 19.0 0 > 19.5 0 > 20.0 0 > 20.5 1 * > 21.0 0 > 21.5 0 > 22.0 2 * > 22.5 0 > 23.0 0 > 23.5 0 > 24.0 0 > 24.5 1 * > 25.0 0 > 25.5 0 > 26.0 0 > 26.5 0 > 27.0 0 > 27.5 0 > 28.0 0 > 28.5 0 > 29.0 0 > 29.5 0 > 30.0 0 > 30.5 0 > 31.0 1 * > 31.5 0 > 32.0 0 > 32.5 0 > 33.0 0 > 33.5 0 > 34.0 0 > 34.5 0 > 35.0 0 > 35.5 0 > 36.0 0 > 36.5 0 > 37.0 0 > 37.5 0 > 38.0 0 > 38.5 1 * > 39.0 0 > 39.5 0 > 40.0 0 > 40.5 0 > 41.0 0 > 41.5 0 > 42.0 0 > 42.5 0 > 43.0 0 > 43.5 0 > 44.0 1 * > 44.5 2 * > 45.0 0 > 45.5 0 > 46.0 0 > 46.5 0 > 47.0 0 > 47.5 0 > 48.0 0 > 48.5 1 * > 49.0 0 > 49.5 1 * .50 thru .80 covers the spam starting here: > 50.0 9 * > 50.5 0 > 51.0 2 * > 51.5 0 > 52.0 1 * > 52.5 0 > 53.0 1 * > 53.5 0 > 54.0 0 > 54.5 0 > 55.0 0 > 55.5 2 * > 56.0 1 * > 56.5 0 > 57.0 1 * > 57.5 0 > 58.0 0 > 58.5 0 > 59.0 0 > 59.5 0 > 60.0 0 > 60.5 0 > 61.0 0 > 61.5 0 > 62.0 0 > 62.5 2 * > 63.0 0 > 63.5 0 > 64.0 0 > 64.5 2 * > 65.0 0 > 65.5 1 * > 66.0 1 * > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 1 * > 69.0 0 > 69.5 0 > 70.0 0 > 70.5 0 > 71.0 0 > 71.5 0 > 72.0 0 > 72.5 1 * > 73.0 0 > 73.5 1 * > 74.0 0 > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 5 * > 76.5 0 > 77.0 1 * > 77.5 2 * > 78.0 2 * > 78.5 1 * > 79.0 2 * > 79.5 2 * and ending above. > 80.0 1 * > 80.5 0 > 81.0 1 * > 81.5 0 > 82.0 1 * > 82.5 1 * > 83.0 2 * > 83.5 1 * > 84.0 3 * > 84.5 0 > 85.0 1 * > 85.5 1 * > 86.0 2 * > 86.5 1 * > 87.0 0 > 87.5 0 > 88.0 2 * > 88.5 0 > 89.0 1 * > 89.5 1 * And another 19 spam lived in [0.8, 0.9). 
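[Editor's note: as Tim explains above, the histogram buckets are labelled with 100x the score, so with nbuckets set to 200 each bucket covers half a point. A small sketch of that mapping, assuming evenly sized buckets; the helper name is mine.]

```python
def bucket_label(score, nbuckets=200):
    """Map a score in [0.0, 1.0] to its histogram bucket label.
    Labels are 100x the score ("hysterical raisins"); with 200
    buckets they run 0.0, 0.5, 1.0, ..., 99.5, so a prob of 0.05
    lands in the bucket labelled 5.0."""
    i = min(int(score * nbuckets), nbuckets - 1)  # clamp score == 1.0
    return i * (100.0 / nbuckets)
```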
> 90.0 2 * > 90.5 5 * > 91.0 0 > 91.5 4 * > 92.0 3 * > 92.5 2 * > 93.0 1 * > 93.5 3 * > 94.0 2 * > 94.5 5 * > 95.0 3 * > 95.5 4 * > 96.0 5 * > 96.5 6 * > 97.0 5 * > 97.5 4 * > 98.0 10 * > 98.5 16 * > 99.0 33 * > 99.5 12807 ************************************************************* > -> best cutoff for all runs: 0.98 > -> with weighted total 10*15 fp + 134 fn = 284 > -> fp rate 0.115% fn rate 1.03% One thing I noticed that doesn't require your personal judgment : unless, e.g., you join PayPal over and over again, the system is never going to learn that Subject: Get $5 by Referring Your Friends to PayPal! Dear Tim Peters, Thank you for joining PayPal! You can use your new account to make purchases from over 3 million eBay(TM) auctions, shop online at over 20,000 online stores that accept PayPal, or just collect money from friends and co-workers. [blah blah blah blah blah, and a URL containing "refer" ] isn't spam. Likewise for yearly renewal notices. I haven't had to deal with this since my ham is composed of newsgroup traffic. It does suggest that some form of whitelist is needed for personal email, else there appears no hope for these kinds of rare, commercial, pseudo-personalized mass mailings. They've got all the earmarks of spam; the only difference is that sometime in the past, you asked for them; but the tokenizer can't know that. From bkc@murkworks.com Sat Oct 12 21:21:33 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 12 Oct 2002 16:21:33 -0400 Subject: [Spambayes] Chi True results In-Reply-To: References: <3DA83A74.20675.1A642869@localhost> Message-ID: <3DA84BBA.2174.1AA7A657@localhost> On 12 Oct 2002 at 16:09, Tim Peters wrote: > Brad, of all the approaches you've tried here (and I really appreciate how > many you've tried!), which have *you* been happiest with? The numbers can't > tell me that, it's a human judgment. > Oh, I reached nirvana a few weeks ago. Any of these schemes seem like a big win for me. 
though I did like the central limit schemes well enough. That is, the original graham method didn't have "sure, mostly sure" (ham x spam).. Which I like to have. I can appreciate gary's interest in numerical purity, but the absolute difference between 1% fn and 2%fn is, in my case, only 1 spam message a day. At this point, I'm working to put the rubber on the road and tackle deployment issues .. Like how could you implement this scheme for 300 users on an IMAP server? Not with a 20 megabyte pickle per user! if tim_combining works "nearly as well" as chi, but takes 1/4 the processor time.. I'd probably choose the former. Sorry, guess I haven't answered your question. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 12 21:59:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 16:59:45 -0400 Subject: [Spambayes] Chi True results In-Reply-To: <3DA84BBA.2174.1AA7A657@localhost> Message-ID: [Brad Clements] > Oh, I reached nirvana a few weeks ago. Cool -- I hope to join you there soon . > Any of these schemes seem like a big win for me. though I did > like the central limit schemes well enough. Because? That is, what about them was attractive to you, in contrast to the others? > That is, the original graham method didn't have "sure, mostly > sure" (ham x spam).. > Which I like to have. > > I can appreciate gary's interest in numerical purity, but the > absolute difference between 1% fn and 2%fn is, in my case, only > 1 spam message a day. All of the remaining schemes beyond the current default (the 3 clt schemes, tim combining, and chi combining) haven't been about numerical purity, but about refining "the middle ground": isolating as many mistakes into as small a group of "unsure" msgs as possible, with as least touchy a set of cutoff values as possible.
On my test data, chi combining blows the others out of the water by these measures, and python.org: 1. Deals with many more msgs than any individual deals with. and 2. Has a mail admin notorious for whining about currently reviewing a measly 20 msgs per day . Cutting an error rate in half means half the work, and probably a quarter of the whining, in that context. > At this point, I'm working to put the rubber on the road and > tackle deployment issues .. > Like how could you implement this scheme for 300 users on an IMAP > server? There you go: cut an error rate in half there, and your "1 msg per day" instantly turns into 300. > Not with a 20 megabyte pickle per user! Things to look at: we shouldn't need an 8-byte timestamp per word; the killcount may not be useful at all when we stop *comparing* schemes; about half of all words will be found only once in the whole database (this is an Invariant Truth across all computer indexing applications -- "hapax legomena"(*) is what it's called in the literature), so half the words in your database can be expected to be useless because unique; work needs to be done on pruning the database over time; and these are all related. Note that incremental adjustments to the clt schemes bristle with problems the non-clt schemes don't have, due to the third training pass unique to the clt schemes. > If tim_combining works "nearly as well" as chi, but takes 1/4 the > processor time.. I'd probably choose the former. Processor time won't be a factor here -- tokenization and I/O times dominate all schemes so far, and the combining method is an expense distinct from those (note that all the variations discussed here are purely variations in the combining method: they all see the same token streams and word counts, the differences are in how they *use* the evidence). 
I barely noticed the time difference as-is, yet chi combining is invoking log about 50x more often than necessary now, and computing chi2Q() to about 14 significant digits is way more than necessary too. > Sorry, guess I haven't answered your question. Indeed not, but you answered other interesting questions I didn't think to ask . (*) For our grammarians, the plural is hapaxes, as in 31.6% of English hapaxes have corresponding Lithuanian hapaxes. and Among the evangelists, Luke is the most capable of apparently writing "uncharacteristically" since he has the largest vocabulary, the greatest number of hapax legomena, and a disturbing habit of varying his synonyms. Paffenroth does not engage, for example, with Michael Goulder's claim that Luke introduces more hapaxes into Mark than he takes over. And you thought we were getting academic *here* . From rob@hooft.net Sat Oct 12 22:00:58 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 12 Oct 2002 23:00:58 +0200 Subject: [Spambayes] Chi**2 results Message-ID: <3DA88D8A.3070509@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Here is my chi results. I am amazed by the high cutoff it is advising me to use! This feels very good. On the FP side bad messages are:
* a yahoo account created to correct incorrect listings in their database
* A problem with my Linux Journal subscription
* India student applying for a course
* Amazon.com membership update
* Red Cross blood drive announcement
Which is 5 out of 16000; but I have to admit that even missing 4 out of these 5 would not have been too costly. The middle ground is amazingly empty! I'd almost want to set my cutoff at 0.99 or 0.995! One thing that does bother me a bit is that some words have a very high correlation of co-existing in a message, and there is no way of finding this out. E.g. 
all the "bad jokes" I'm referring to in the attachment were sent by a friend of mine that uses a very strange way of forwarding by modifying the "From:" line: From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter Stouten) Which results in the highly correlated: prob('from:pieter') = 0.00151566 prob('message-id:@[158.117.170.103]') = 0.00306331 prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183 prob('from:stouten)') = 0.0115681 prob('from:way') = 0.012894 prob('from:(by') = 0.0167286 Regards, Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment [TestDriver] pickle_basename = class save_trained_pickles = False show_histograms = True show_ham_lo = 0.40 show_best_discriminators = 50 nbuckets = 200 show_ham_hi = 0.80 spam_cutoff = 0.70 spam_directories = Data/Spam/Set%d show_spam_lo = 0.40 show_false_negatives = True ham_directories = Data/Ham/Set%d compute_best_cutoffs_from_histograms = True show_false_positives = True best_cutoff_fp_weight = 10 show_spam_hi = 0.80 save_histogram_pickles = False show_charlimit = 100000 [CV Driver] build_each_classifier_from_scratch = False [Tokenizer] mine_received_headers = False octet_prefix_size = 5 generate_long_skips = True count_all_header_lines = False check_octets = False ignore_redundant_html = False basic_header_tokenize = False safe_headers = abuse-reports-to date errors-to from importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face basic_header_skip = received date x-.* basic_header_tokenize_only = False retain_pure_html_tags = False -> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03 -> min -2.22045e-13; median 9.99201e-14; max 100 * = 253 items 0.0 15408 ************************************************************* 0.5 115 * 1.0 59 * 1.5 27 * 2.0 25 * 2.5 19 * 3.0 27 * 3.5 9 * 4.0 6 * 4.5 8 * 5.0 12 * 5.5 7 * 6.0 4 * 6.5 8 * 
7.0 6 * 7.5 4 * 8.0 6 * 8.5 5 * 9.0 4 * 9.5 12 * 10.0 9 * 10.5 6 * 11.0 3 * 11.5 1 * 12.0 6 * 12.5 4 * 13.0 1 * 13.5 1 * 14.0 2 * 14.5 6 * 15.0 3 * 15.5 2 * 16.0 3 * 16.5 5 * 17.0 4 * 17.5 5 * 18.0 1 * 18.5 2 * 19.0 2 * 19.5 2 * 20.0 7 * 20.5 1 * 21.0 4 * 21.5 2 * 22.0 4 * 22.5 5 * 23.0 2 * 23.5 3 * 24.0 1 * 24.5 3 * 25.0 2 * 25.5 1 * 26.0 2 * 26.5 1 * 27.0 0 27.5 1 * 28.0 2 * 28.5 0 29.0 2 * 29.5 2 * 30.0 4 * 30.5 3 * 31.0 1 * 31.5 1 * 32.0 1 * 32.5 2 * 33.0 0 33.5 1 * 34.0 2 * 34.5 1 * 35.0 1 * 35.5 2 * 36.0 0 36.5 2 * 37.0 0 37.5 6 * 38.0 2 * 38.5 1 * 39.0 4 * 39.5 0 40.0 2 * Someone replying to a spam on a mailinglist (NO Fwd:!); Bad joke 40.5 2 * Official company press release; Bad joke Bruker AXS Announces Appointment of Laura Francis as New Chief Financial Officer 3/25/2002 9:03:00 AM MADISON, Wis., Mar 25, 2002 (BUSINESS WIRE) Bruker AXS Inc., a leading global provider of advanced X-ray solutions for life and advanced materials sciences, today announced that it has appointed Laura Francis as its new Chief Financial Officer, effective April 8, 2002. Ms. Francis will also be responsible for investor relations. 41.0 2 * Bad joke; ISP helpdesk reply (payment related) 41.5 2 * Internic regret; Unsubscribe confirmation commercial mailing list. We regret to inform you that we were unable to accept your credit card payment for the domain names listed below, in the amount of $100.00. To determine the specific reason your credit card was not accepted please contact your credit card company as we do not receive that information. For accounting purposes, we can not reflect a paid status for this domain name. Please resubmit payment by calling (703)742-4777, or by sending a check to the address listed on your invoice. If you submit a check, please ensure that the domain name and invoice number are listed as references. We apologize for any inconvenience, and hope this matter can be resolved as quickly as possible. 
Thank you, Jill Dodson InterNIC Registration Services 42.0 0 42.5 4 * Unsubscribe from commercial newsletter; Bad joke; Bad joke; Someone mass-asking for help 43.0 1 * My wife sending me a link to a housing service. 43.5 2 * Happy birthday via WBW; Happy birthday via WBW. 44.0 2 * Press release International Court of Justice (Nigeria....); Linux journal autoreply 44.5 3 * Linux journal autoreply; Bad joke; Bad joke. 45.0 2 * Bad joke; Bad Joke. 45.5 2 * Customer license request; Bad joke ------=_NextPart_000_0005_01C00135.2727FB40 Content-Type: text/plain; charset="ks_c_5601-1987" Content-Transfer-Encoding: base64 SGksDQpJIGhhdmUgYSBxdWVzdGlvbi4uDQpEbyBJIG5lZWQgdG8gZ2V0IG5ldyBsaWNlbnNlIGlm IEkgdXBncmFkZSB0aGUgY29sbGVjdCBzb2Z0d2FyZT8NCkkgaGF2ZSB1cGdyYWRlZCBpdCBqdXN0 IG5vdywgYnV0IGl0IGRvZXNuJ3Qgc2VlbSB0byBiZSBjb25uZWN0ZWQgdG8gQ0NEIGNvbnRyb2xs ZXIuDQpTbyBJJ20gdXNpbmcgdGhlIG9sZCB2ZXJzaW9uLg0KDQpQbGVhc2UsIHRlbGwgbWUgd2hh dCBzaG9sZCBJIGRvLg0KDQogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNp bmNlcmVseSwNCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgRG9uZyBN b2sgU2hpbg0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBTZW91bCBO YXRpb25hbCBVbml2ZXJzaXR5DQo= 46.0 3 * Bad joke; Shogi mailing list posting; Bad joke 46.5 1 * My boss asking me for help with a SPAM on his mailing list (Fwd) 47.0 0 47.5 2 * Bad joke; Happy birthday via WBW 48.0 4 * Bad joke; Company press release; Bad joke; Colleague notifying me of an important new web service. The following press release went out earlier this morning. Bruker AXS Acquires MAC Science to Further Penetrate Japanese Life Science and Materials Research Markets 48.5 4 * Conference invitation; Happy birthday from WBW; Company sales budget (in German); Press release International Court of Justice (Congo, Burundi,...) 49.0 2 * Headhunter hunting me via mailing; Bad joke. 
49.5 3 * Shogi mailing list posting; My boss asking me how to deal with a spam message (Fwd:); Shogi mailing list posting announcing a tournament 50.0 1 * Customer sending 1.44MB binary file as text/plain attachment :-) 50.5 0 51.0 2 * Bad joke; Bad joke. 51.5 0 52.0 0 52.5 1 * Bad joke. 53.0 0 53.5 1 * ECA Annual fee reminder Dear ECA Members Just to remember those that have not paid the 2001 annual fee can do that in Krakow. It is easier and "cheaper". 54.0 0 54.5 0 55.0 1 * Python Professional Services Europe (PPSE) announcement. 55.5 0 56.0 0 56.5 0 57.0 1 * Bad joke. 57.5 0 58.0 0 58.5 1 * Bad joke. 59.0 1 * Happy birthday via WBW 59.5 0 60.0 1 * ISP Newsletter in German 60.5 2 * Happy birthday via WBW; request for information Hi, I see you name im the web museum, in M. C. Escher page, OK, I am building a site about ilusions, can you help me in this? I use 2 pictures of the artist in my page this is ok? Anyway you known other images that I can use in my work? Please visit my page at: http://www.geocities.com/SoHo/Studios/4762/ 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * Happy birthday via WBW 63.5 1 * Bad joke 64.0 0 64.5 0 65.0 2 * Bad joke; Bad joke. 65.5 0 66.0 0 66.5 0 67.0 0 67.5 2 * Free copy of Caldera linux; Web-site registration code. Linux Developer: We greatly appreciate the contribution you have made to the Linux community and, to demonstrate that appreciation, we would like to send you a free copy of our latest Linux-based product, OpenLinux Standard 1.1. 
68.0 1 * Happy birthday via WBW 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 1 * Happy birthday via WBW 71.5 0 72.0 0 72.5 2 * Self-reminder of a bug in a program; Auto-reply to a web request fields can start with 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 1 * Happy birthday via WBW 77.0 0 77.5 0 78.0 0 78.5 1 * Four11 directory listing announcement 79.0 0 79.5 0 80.0 0 80.5 1 * 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 1 * 85.5 0 86.0 0 86.5 1 * 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 1 * 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 1 * 97.0 1 * 97.5 0 98.0 1 * 98.5 0 99.0 1 * 99.5 7 * -> Spam scores for all runs: 5600 items; mean 99.35; sdev 5.40 -> min 4.22602e-09; median 100; max 100 * = 89 items 0.0 3 * 0.5 0 1.0 0 1.5 1 * 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 1 * 5.0 0 5.5 0 6.0 1 * 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 1 * 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 1 * 32.0 0 32.5 0 33.0 1 * 33.5 0 34.0 1 * 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 1 * 40.0 1 * "we would like to send you our information". May be misclassified. 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 1 * ObjectSpace C++ product announcement 45.0 0 45.5 0 46.0 0 46.5 1 * Webcounter You tried other counters now try something AMAZING!!! ONE CODE FOR ALL YOUR PAGES AND DOMAINS!!! http://www.freewebcounter.com 1. View your full raw log files and perform tracerouts from the hosts! 2. See the every page the person visited in order! 3. Top 50 full search phrase used to find your site! 4. All countries 5. unique visits / page views. 6. visites by day/week/month/year. 7. Top 50 browser agents. 8. Emails. 
47.0 0 47.5 0 48.0 0 48.5 1 * Character analysis This is a: Commercial Electronic Mail Message. It is TOTALLY LEGAL (Washington.' Law; chapter 19.190 RCW) and with U.S. Federal requirements for commercial email under bill: S 1618 Title 111 section 301 paragraph (a) (2) (C) because it includes a removal mechanism. To be removed:the list: please see below. 49.0 0 49.5 3 * Translation company based in Beijing; Distance education IT school; Anti-aids medicin from Beijing 50.0 7 * HTML-only with image maps; How to juggle women (book); Far east spam; HTML only far east spam; conference announcement; Conference announcement; Hunza diet bread; Tim's hometown stories; Far east spam 50.5 2 * Web advertising 51.0 1 * Far east spam 51.5 3 * Far east spam; Get rich via python mailinglist; empty message 52.0 0 52.5 0 53.0 1 * Hyper porn YIKES: prob('subject:porn') = 0.696523 only! From: HairyKevin Return-path: To: HairyKevin@aol.com Subject: hyper porn Date: Sun, 24 May 1998 15:11:51 EDT Organization: AOL (http://www.aol.com) Mime-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7bit click here 53.5 0 54.0 0 54.5 0 55.0 2 * Web hosting (German); Internet programming offered 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 2 * Affengeil; Far east spam AFFENGEIL !!!! 002.45.29.65.83 ... Ruf an! 58.5 0 59.0 1 * Microsoft office training 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 1 * here is the picture of me that you asked for... 62.5 1 * "E-bay auction" spam (Congradulations (sic) on your selling) 63.0 1 * Dahanut newsletter 63.5 1 * "My friend is going out with this girl" 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 1 * Make a million 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 2 * Both same mailinglist removal confirmation. MISCLASSIFIED. 70.0 0 70.5 0 71.0 0 71.5 0 72.0 1 * Spanish, HTML only. 
72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 1 * Happy birthday via WBW with commercial appendix 75.5 0 76.0 1 * Christian site in Jerusalem 76.5 1 * "\Below is the result of your feedback form." 77.0 0 77.5 0 78.0 2 * Diet science (close to the biology I worked in for a while); Medical website on-line announcement 78.5 1 * Medical conference announcement 79.0 0 79.5 1 * "I have attached my web page with new photos!" 80.0 1 * 80.5 0 81.0 2 * 81.5 4 * 82.0 2 * 82.5 2 * 83.0 0 83.5 1 * 84.0 1 * 84.5 1 * 85.0 2 * 85.5 1 * 86.0 1 * 86.5 1 * 87.0 0 87.5 5 * 88.0 0 88.5 0 89.0 1 * 89.5 1 * 90.0 1 * 90.5 2 * 91.0 2 * 91.5 2 * 92.0 2 * 92.5 3 * 93.0 36 * 93.5 1 * 94.0 4 * 94.5 4 * 95.0 5 * 95.5 6 * 96.0 9 * 96.5 8 * 97.0 4 * 97.5 7 * 98.0 7 * 98.5 15 * 99.0 26 * 99.5 5378 ************************************************************* -> best cutoff for all runs: 0.87 -> with weighted total 10*12 fp + 71 fn = 191 -> fp rate 0.075% fn rate 1.27% -> matched at 0.875 with 12 fp & 71 fn; fp rate 0.075%; fn rate 1.27% ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Sat Oct 12 23:55:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 18:55:49 -0400 Subject: [Spambayes] chi-squared versus "cancellation disease" Message-ID: Turns out there's a good reason to keep 0.5 in your "middle ground" when using chi-squared combining: it's not fooled by "cancellation disease", and refuses to make a choice in either direction when it happens. I think this is good. chi2.py has a new function showscore() you can use to see exactly what happens on a given vector of probabilities. Like so: >>> from chi2 import showscore as s >>> s([.01, .99] * 30) # 30 pairs of "cancelling" extremes P(chisq >= 276.913 | v=120) = 2.00377e-014 P(chisq >= 276.913 | v=120) = 2.00377e-014 spam prob 1.0 ham prob 1.0 S/(S+H) 0.5 >>> The sums are so large that there's virtually no chance the probs are random under either the ham or spam measures. 
For a peculiar reason, this causes the internals to estimate the probability of both outcomes to be 1. When they're combined, though, 0.5 is the best guess it can make. Adding a bunch more clues in one direction doesn't really change this: >>> s([.01, .99] * 30 + [.99] * 10) P(chisq >= 369.017 | v=140) = 1.55622e-022 P(chisq >= 277.114 | v=140) = 4.54929e-011 spam prob 1.0 ham prob 0.999999999955 S/(S+H) 0.500000000011 >>> That only managed to convince it that spam was a *tiny* bit more likely -- the distribution is still wildly unlikely under either measure. In that sense it's accomplishing much of what the clt schemes try to do, but with less mechanism and pain. Seeing a non-pathological case should make normal behavior clearer: >>> s([.1, .1, .2, .3, .4, .4, .45, .7, .8]) P(chisq >= 10.4469 | v= 18) = 0.91634 P(chisq >= 21.259 | v= 18) = 0.266549 spam prob 0.0836602351022 ham prob 0.733450737215 S/(S+H) 0.102385401661 >>> That was clearly a hammish probability vector, and the scheme has no trouble realizing that. Note that we get some intuitive outcomes via unintuitive means: >>> s([0.5] * 20) P(chisq >= 27.7259 | v= 40) = 0.928958 P(chisq >= 27.7259 | v= 40) = 0.928958 spam prob 0.071042357154 ham prob 0.071042357154 S/(S+H) 0.5 >>> That is, a vector of all 0.5 is quite unlikely against the hypothesis that the probs are uniformly distributed (the sums are too small -- it's *too* regular), but it's equally unlikely under both measures, so 0.5 is the best guess it can make. BTW, when eyeballing this stuff, it's helpful to know that a chi-squared distribution with v degrees of freedom has mean v and sdev sqrt(2*v). From tim.one@comcast.net Sun Oct 13 02:14:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 21:14:24 -0400 Subject: [Spambayes] chi-squared versus "prob strength" Message-ID: Note the default robinson_minimum_prob_strength is still 0.1, meaning that we ignore words with spamprobs in 0.4 to 0.6. 
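In code terms, that option is just a pre-filter on the clue vector before any combining happens. A minimal sketch for readers following along -- the helper name here is invented, only the option name and the "ignore spamprobs near 0.5" behavior come from this thread, and the exact treatment of the boundary values is an implementation detail I'm not asserting:

```python
def strong_clues(spamprobs, minimum_prob_strength=0.1):
    """Drop 'bland' words: spamprobs within minimum_prob_strength
    of the neutral 0.5.  With the default 0.1, words scoring roughly
    between 0.4 and 0.6 contribute nothing; a strength of 0.0 keeps
    every word."""
    return [p for p in spamprobs
            if abs(p - 0.5) >= minimum_prob_strength]

strong_clues([0.05, 0.45, 0.5, 0.55, 0.95])  # -> [0.05, 0.95]
```

Setting the option to 0.0 simply makes this filter a no-op, which is the rerun described in the rest of this message.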
Since the chi-squared test is testing the hypothesis that the probs are uniformly distributed, systematically leaving a chunk of probs "out of the middle" may bias it. Rerunning my fat test with this option set to 0.0 (don't ignore any words) gave nearly identical final results, but I didn't like the fine-grained differences. In particular, there's a seeming paradox:

ham mean                      ham sdev
 0.39  0.20  -48.72%           3.47  2.56  -26.22%
 0.33  0.15  -54.55%           3.13  2.30  -26.52%
 0.40  0.28  -30.00%           3.54  3.34   -5.65%
 0.23  0.09  -60.87%           2.24  1.40  -37.50%
 0.47  0.30  -36.17%           4.38  3.69  -15.75%
 0.31  0.18  -41.94%           3.05  2.56  -16.07%
 0.38  0.19  -50.00%           3.23  2.30  -28.79%
 0.29  0.15  -48.28%           2.80  2.12  -24.29%
 0.30  0.17  -43.33%           2.90  2.37  -18.28%
 0.55  0.32  -41.82%           4.45  3.78  -15.06%
ham mean and sdev for all runs
 0.36  0.20  -44.44%           3.38  2.74  -18.93%

spam mean                     spam sdev
99.93 99.95   +0.02%           1.25  1.18   -5.60%
99.94 99.96   +0.02%           1.24  1.11  -10.48%
99.98 99.99   +0.01%           0.34  0.32   -5.88%
99.92 99.93   +0.01%           1.84  2.24  +21.74%
99.93 99.94   +0.01%           1.72  1.44  -16.28%
99.88 99.90   +0.02%           1.95  1.75  -10.26%
99.86 99.86   +0.00%           2.22  2.60  +17.12%
99.91 99.96   +0.05%           1.26  0.57  -54.76%
99.90 99.93   +0.03%           1.75  1.43  -18.29%
99.96 99.97   +0.01%           0.73  0.61  -16.44%
spam mean and sdev for all runs
99.92 99.94   +0.02%           1.53  1.50   -1.96%

ham/spam mean difference: 99.56 99.74 +0.18

While the tight ham distribution got significantly tighter (but note that the effect on the spam distribution was inconsistent and overall virtually nil), the ham at the wrong end of the scale got worse. 
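For anyone who wants to poke at these scores without the project's chi2.py handy, the showscore() transcripts quoted in the previous message can be reproduced with a small standalone sketch. This is a reconstruction under stated assumptions -- the closed-form chi-squared survival function for even degrees of freedom, and S/(S+H) as the final blend, matching the printed output -- not the project source:

```python
import math

def chi2q(x2, v):
    """P(chisq >= x2) with v degrees of freedom, v even:
    exp(-m) * sum(m**i / i!) for i in 0 .. v/2 - 1, where m = x2/2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Chi-squared combining of word spamprobs: run one test against
    'looks like ham' and one against 'looks like spam', then blend
    the two one-sided results as S/(S+H)."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return s, h, s / (s + h)
```

Feeding it the hammish vector from the "cancellation disease" post gives spam prob about 0.084, ham prob about 0.733, and S/(S+H) about 0.102, matching the thread; the pathological [.01, .99] * 30 vector drives both S and H to essentially 1 and lands the blend on 0.5, as described there.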
Here are the ham-score histograms starting at 50, combined into one, "before" in the middle column (default prob strength), "after" in the right column (don't ignore any words): 50.0 1 1 50.5 3 0 51.0 0 1 51.5 - 52.0 1 0 52.5 - 53.0 - 53.5 0 1 54.0 - 54.5 - 55.0 - 55.5 - 56.0 2 0 56.5 - 57.0 - 57.5 - 58.0 - 58.5 1 0 59.0 0 1 59.5 - 60.0 - 60.5 - 61.0 - 61.5 - 62.0 - 62.5 0 1 63.0 - 63.5 1 0 64.0 - 64.5 - 65.0 - 65.5 0 1 66.0 1 0 66.5 - 67.0 - 67.5 - 68.0 1 0 68.5 - 69.0 - 69.5 - 70.0 0 1 70.5 - 71.0 - 71.5 - 72.0 - 72.5 - 73.0 - 73.5 1 0 74.0 0 1 74.5 - 75.0 - 75.5 - 76.0 - 76.5 - 77.0 - 77.5 - 78.0 - 78.5 0 1 79.0 - 79.5 - 80.0 - 80.5 0 1 81.0 - 81.5 - 82.0 - 82.5 - 83.0 - 83.5 - 84.0 - 84.5 - 85.0 0 1 85.5 - 86.0 - 86.5 - 87.0 0 1 87.5 - 88.0 - 88.5 - 89.0 - 89.5 - 90.0 - 90.5 - 91.0 - 91.5 - 92.0 - 92.5 - 93.0 - 93.5 - 94.0 - 94.5 - 95.0 - 95.5 - 96.0 - 96.5 - 97.0 1 0 97.5 - 98.0 - 98.5 - 99.0 - 99.5 1 2 The ham scores drift up here rather dramatically. I haven't found any particular *sense* to it. The lady with the brief question and the obnoxious employer-generated sig saw her score climb from 0.972986477986 to 0.998446743969. Here's the full list of "after" clues: prob('python.') = 0.000144374 prob('subject:Python') = 0.00115551 prob('header:Errors-To:1') = 0.0225343 prob('thanks,') = 0.0642414 prob('x-mailer:microsoft outlook express 4.72.3155.0') = 0.0652174 prob('help?') = 0.134215 prob('edinburgh') = 0.155172 prob('there,') = 0.164471 prob('but') = 0.223265 prob('skip:r 20') = 0.245934 prob('standard') = 0.260615 prob('road,') = 0.283848 prob('tel:') = 0.286105 prob('content-type:text/plain') = 0.306072 prob('calls') = 0.307063 prob('return') = 0.323593 prob('addressee') = 0.340522 prob('alteration') = 0.340522 prob('header:Message-ID:1') = 0.372119 prob('scan') = 0.388899 "before" ignored all words from here ... 
prob('caused') = 0.42498 prob('fax:') = 0.441534 prob('not') = 0.442943 prob('header:Date:1') = 0.47242 prob('the') = 0.476861 prob('to:2**0') = 0.48041 prob('subject: ') = 0.488475 prob('header:To:1') = 0.489883 prob('header:Subject:1') = 0.495711 prob('header:From:1') = 0.496624 prob('0131') = 0.5 prob('1127') = 0.5 prob('1550') = 0.5 prob('2552') = 0.5 prob('2dh,') = 0.5 prob('eh1') = 0.5 prob('email addr:standardlife.com') = 0.5 prob('email name:vickie_mills') = 0.5 prob('from:email addr:standardlife.com>') = 0.5 prob('from:email name:>> from chi2 import showscore as s >>> s([.2, .8, .9]) P(chisq >= 8.27033 | v= 6) = 0.218959 P(chisq >= 3.87588 | v= 6) = 0.693468 spam prob 0.781040515476 ham prob 0.306531778646 S/(S+H) 0.71815043441 >>> s([.2, .8, .9] + [0.5] * 10) P(chisq >= 22.1333 | v= 26) = 0.681383 P(chisq >= 17.7388 | v= 26) = 0.885068 spam prob 0.318617174026 ham prob 0.114932197304 S/(S+H) 0.734904015772 >>> I can't love that adding a pile of 100% neutral probs intensifies the spam judgment, and under the covers the effects on S and H are seen to be dramatic. Yes, "it's even more not uniformly distributed" after adding in 10 0.5s, but that's really got nothing to do with whether the msg is ham or spam! From tim.one@comcast.net Sun Oct 13 06:57:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 13 Oct 2002 01:57:23 -0400 Subject: [Spambayes] Chi**2 results In-Reply-To: <3DA88D8A.3070509@hooft.net> Message-ID: [Rob Hooft] > Here is my chi results. Thanks for trying this, Rob! > I am amazed by the high cutoff it is advising me to use! Well, you told it you hate fp 10x more than you hate fn (best_cutoff_fp_weight = 10), and that pushes the best cutoff up. Note that the cutoff is an after-the-fact thing, and moving it improves one error rate at the unavoidable expense of injuring the other -- it doesn't change any scores. 
It looks like this scheme has an extremely usable middle ground for you, so provided your deployment can *do* something with a middle ground, you've got a very large range for absolute cutoffs that would leave you staring at very few "unsure" msgs. > This feels very good. Looks good too . One part is *too* good:

-> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
-> min -2.22045e-13; median 9.99201e-14; max 100
       ^^^^^^^^^^^^^^^^

It's not logically possible for a score to go negative -- we can thank rounding errors for that. > On the FP side bad messages are: > * a yahoo account created to correct incorrect listings in their > database > * A problem with my Linux Journal subscription > * India student applying for a course > * Amazon.com membership update > * Red Cross blood drive announcement > > Which is 5 out of 16000; but I have to admit that even missing 4 out of > these 5 would not have been too costly. I don't think any scheme can afford to throw msgs away entirely. What I hope instead is that a middle ground can shuffle unclear msgs into a "please help me" folder (or two, if it's still valuable to record the "ham or spam?" guess for these) where most mistakes live, and that any scheme tossing a msg entirely try to notify the sender. I personally would never use a scheme that tosses msgs entirely, but that's just me. Unless you create a lot of Yahoo accts, and have a lot of problems with your Linux Journal subscriptions, and etc, seems likely that the system just won't get enough training examples to learn that they're OK for you. A whitelist might help, except it's hard to populate one without first recognizing an FP from an unfortunate sender. > The middle ground is amazingly empty! I'd almost want to set my cutoff > at 0.99 or 0.995! It's OK by me if you do . > One thing that does bother me a bit is that some words have a very high > correlation of co-existing in a message, and there is no way of finding > this out. E.g. 
all the "bad jokes" I'm referring to in the attachment > were sent by a friend of mine that uses a very strange way of > forwarding by modifying the "From:" line: > > From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter > Stouten) > > > Which results in the highly correlated: > > prob('from:pieter') = 0.00151566 > prob('message-id:@[158.117.170.103]') = 0.00306331 > prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183 > prob('from:stouten)') = 0.0115681 > prob('from:way') = 0.012894 > prob('from:(by') = 0.0167286 I don't know whether to call that a bug or a feature. In this specific example, I think I have to call it a feature: the "bad joke" msgs appear to confuse the system routinely, and this bundle of very low-spamprob words may be all that's saving them from getting scores near 1.0. There are a significant number of my ham that are redeemed by this kind of thing too -- a well-known poster posting from a well-known address, but going on about something that has nothing to do with the newsgroup. Sucking out 8 distinct clues about who they are and where they posted from helps them a *lot* in these cases, even if all 8 come from the "From" line. If you turn on mine_received_headers, you'll also find that Neil goes out of his way to present IP addr and machine-name info in multiple ways, triggering the same kind of effect for "bad machines" and "bad networks". So, overall, "this kind of thing" has appeared valuable to me. OTOH, we've been reduced to stripping all HTML tags else we get a mountain of high-spamprob decorations (in legit HTML mail) that are nearly 100% correlated but each counts as if a killer-good clue all by itself. So it's at best a mixed bag. I don't know of a computationally cheap way to take correlations into account, else I would have tried that before resorting to stripping HTML tags (I hate throwing info away). 
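Rob's complaint above ("there is no way of finding this out") can at least be addressed offline: a brute-force pass over a corpus can count how often token pairs co-occur and flag near-perfect overlaps, even though nothing this expensive could run at scoring time. A rough sketch -- every name here is invented, and the quadratic pair expansion per message is exactly the cost problem being discussed:

```python
from collections import Counter
from itertools import combinations

def correlated_pairs(token_sets, min_support=10, min_jaccard=0.95):
    """Find token pairs that (almost) always appear together.

    token_sets: one set of tokens per message.  A pair is flagged when
    it co-occurs at least min_support times and its Jaccard overlap
    (co-occurrences / messages containing either token) is high."""
    token_count = Counter()
    pair_count = Counter()
    for tokens in token_sets:
        token_count.update(tokens)
        # Quadratic in tokens-per-message: cheap per msg, ruinous
        # if you try to track a full 100k-token vocabulary this way.
        pair_count.update(combinations(sorted(tokens), 2))
    flagged = []
    for (a, b), together in pair_count.items():
        union = token_count[a] + token_count[b] - together
        if together >= min_support and together / union >= min_jaccard:
            flagged.append((a, b, together / union))
    return flagged
```

Run over Rob's corpus, the 'from:pieter' / 'from:stouten)' bundle would show up with overlap near 1.0. Whether to then discount such bundles is the open question; as noted above, the same correlation effect is what rescues some legitimate hams.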
From rob@hooft.net Sun Oct 13 08:00:24 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 13 Oct 2002 09:00:24 +0200 Subject: [Spambayes] Chi**2 results References: Message-ID: <3DA91A08.70605@hooft.net> Tim Peters wrote: > Looks good too . One part is *too* good: > > -> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03 > -> min -2.22045e-13; median 9.99201e-14; max 100 > ^^^^^^^^^^^^^^^^ I noticed that... But indeed, one cannot blame the program if it is calculating chi2Q with 14 digit accuracy and then subtract it from 1.0..... > I don't think any scheme can afford to throw msgs away entirely. I have to admit that I do have a "spam" folder from SA at this moment, and that I am only "scanning" the index page of this for 3 seconds per week.... That is almost as good as throwing them out completely. A good feature of spamassassin is that it turns every suspect message into text/plain. This would be a good feature for the middle-ground messages (but it should be easy to undo somehow for middle-ground-negatives). > So it's at best a mixed bag. I don't know of a computationally cheap way to > take correlations into account, else I would have tried that before > resorting to stripping HTML tags (I hate throwing info away). We'd just have to make a 100k*100k correlation matrix. Programmatically very cheap ;-) I'm currently looking at the H and S values of middle ground messages. I have seen a few H+S>1.9 so far. Advantage of the current schema is that if H+S>1.25, the message is always at least in the middle ground. H+S<<1 are quite rare with this schema, but I've seen some with H=0.05 S=0.02 and will investigate whether something can be gained (sure fp/fn) in that area. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 13 08:27:11 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 13 Oct 2002 09:27:11 +0200 Subject: [Spambayes] chi-squared versus "prob strength" References: Message-ID: <3DA9204F.5000509@hooft.net> Tim Peters wrote: > Note the default robinson_minimum_prob_strength is still 0.1, meaning that > we ignore words with spamprobs in 0.4 to 0.6. > > Since the chi-squared test is testing the hypothesis that the probs are > uniformly distributed, systematically leaving a chunk of probs "out of the > middle" may bias it. > > Rerunning my fat test with this option set to 0.0 (don't ignore any words) > gave nearly identical final results, but I didn't like the fine-grained > differences. Here is my cmp run for this. First is with 0.1, second with 0.0. Distributions are tighter. Is this due to the fact that we have more clues now, so the Chi2 distribution is more decisive? cv2s -> cv3s -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams [...] 
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

false positive percentages
    0.062  0.188  lost  +203.23%
    0.312  0.438  lost   +40.38%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.062  tied
    0.250  0.250  tied
    0.125  0.188  lost   +50.40%
    0.250  0.312  lost   +24.80%
    0.000  0.000  tied

won   0 times
tied  3 times
lost  7 times

total unique fp went from 20 to 29 lost +45.00%
mean fp % went from 0.125 to 0.18125 lost +45.00%

false negative percentages
    1.034  1.034  tied
    0.345  0.345  tied
    0.517  0.345  won  -33.27%
    0.517  0.517  tied
    1.207  1.207  tied
    0.862  0.690  won  -19.95%
    0.862  0.690  won  -19.95%
    0.345  0.345  tied
    0.517  0.517  tied
    1.034  0.862  won  -16.63%

won   4 times
tied  6 times
lost  0 times

total unique fn went from 42 to 38 won -9.52%
mean fn % went from 0.724137931034 to 0.655172413793 won -9.52%

ham mean                      ham sdev
 0.52  0.39  -25.00%           4.49  4.46  -0.67%
 0.72  0.60  -16.67%           6.62  6.59  -0.45%
 0.63  0.45  -28.57%           4.83  4.42  -8.49%
 0.60  0.41  -31.67%           4.83  4.51  -6.63%
 0.52  0.36  -30.77%           4.26  4.06  -4.69%
 0.43  0.31  -27.91%           4.21  3.82  -9.26%
 0.64  0.52  -18.75%           5.75  5.72  -0.52%
 0.68  0.51  -25.00%           5.63  5.39  -4.26%
 0.70  0.62  -11.43%           5.71  6.13  +7.36%
 0.41  0.31  -24.39%           3.65  3.24  -11.23%
ham mean and sdev for all runs
 0.59  0.45  -23.73%           5.07  4.94  -2.56%

spam mean                     spam sdev
99.20 99.32   +0.12%           6.10  5.77   -5.41%
99.70 99.71   +0.01%           3.45  3.80  +10.14%
99.55 99.68   +0.13%           3.63  3.23  -11.02%
99.38 99.44   +0.06%           6.34  6.27   -1.10%
99.14 99.19   +0.05%           7.05  7.05   +0.00%
99.40 99.47   +0.07%           4.72  5.24  +11.02%
99.42 99.50   +0.08%           5.09  5.10   +0.20%
99.41 99.51   +0.10%           4.55  4.99   +9.67%
99.48 99.62   +0.14%           3.81  3.20  -16.01%
99.31 99.39   +0.08%           6.09  5.97   -1.97%
spam mean and sdev for all runs
99.40 99.48   +0.08%           5.22  5.21   -0.19%

ham/spam mean difference: 98.81 99.03 +0.22

-- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 13 09:13:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 13 Oct 2002 04:13:47 -0400
Subject: [Spambayes] chi-squared versus "prob strength"
In-Reply-To: <3DA9204F.5000509@hooft.net>
Message-ID:

[Tim]
> Note the default robinson_minimum_prob_strength is still 0.1 ...
> ...
> Rerunning my fat test with this option set to 0.0 (don't ignore
> any words) gave nearly identical final results, but I didn't like
> the fine-grained differences.

[Rob Hooft]
> Here is my cmp run for this. First is with 0.1, second with 0.0.
> Distributions are tighter. Is this due to the fact that we have more
> clues now, so the Chi2 distribution is more decisive?

It's been my belief that bland words are at best worthless as clues, and at worst actively hurt (experiment: fiddle your favorite scheme to look *only* at the bland words; do they have predictive power?). I think this is one of the schemes where they hurt, for the reason illustrated by the tiny example at the end of my original post:

"""
>>> from chi2 import showscore as s
>>> s([.2, .8, .9])
P(chisq >= 8.27033 | v= 6) = 0.218959
P(chisq >= 3.87588 | v= 6) = 0.693468
spam prob 0.781040515476
ham prob 0.306531778646
S/(S+H) 0.71815043441
>>> s([.2, .8, .9] + [0.5] * 10)
P(chisq >= 22.1333 | v= 26) = 0.681383
P(chisq >= 17.7388 | v= 26) = 0.885068
spam prob 0.318617174026
ham prob 0.114932197304
S/(S+H) 0.734904015772
>>>

I can't love that adding a pile of 100% neutral probs intensifies the spam judgment, and under the covers the effects on S and H are seen to be dramatic. Yes, "it's even more not uniformly distributed" after adding in 10 0.5s, but that's really got nothing to do with whether the msg is ham or spam!
""" The hypothesis that the spamprobs are uniformly distributed seems irrelevant to whether a msg is ham or spam, and dumping bland words in acts to reject the hypothesis for a reason that also has nothing to do with the distinction we're *trying* to make. The bland words seem most of all to intensify the decision the scheme would have made anyway if they weren't included. That makes things more extreme, but (IMO) not for a *reasonable* reason. I think it's akin to taking scores below 0.1 and dividing them by 2, and taking scores above 0.9 and adding half their distance to 1: it makes things more extreme, but not usefully. Extremity for extremity's sake is no virtue . > -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams > [...] > -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams > > false positive percentages > 0.062 0.188 lost +203.23% > 0.312 0.438 lost +40.38% > 0.062 0.125 lost +101.61% > 0.062 0.125 lost +101.61% > 0.062 0.125 lost +101.61% > 0.062 0.062 tied > 0.250 0.250 tied > 0.125 0.188 lost +50.40% > 0.250 0.312 lost +24.80% > 0.000 0.000 tied > > won 0 times > tied 3 times > lost 7 times > > total unique fp went from 20 to 29 lost +45.00% > mean fp % went from 0.125 to 0.18125 lost +45.00% > > false negative percentages > 1.034 1.034 tied > 0.345 0.345 tied > 0.517 0.345 won -33.27% > 0.517 0.517 tied > 1.207 1.207 tied > 0.862 0.690 won -19.95% > 0.862 0.690 won -19.95% > 0.345 0.345 tied > 0.517 0.517 tied > 1.034 0.862 won -16.63% > > won 4 times > tied 6 times > lost 0 times > > total unique fn went from 42 to 38 won -9.52% > mean fn % went from 0.724137931034 to 0.655172413793 won -9.52% > > ham mean ham sdev > 0.52 0.39 -25.00% 4.49 4.46 -0.67% > 0.72 0.60 -16.67% 6.62 6.59 -0.45% > 0.63 0.45 -28.57% 4.83 4.42 -8.49% > 0.60 0.41 -31.67% 4.83 4.51 -6.63% > 0.52 0.36 -30.77% 4.26 4.06 -4.69% > 0.43 0.31 -27.91% 4.21 3.82 -9.26% > 0.64 0.52 -18.75% 5.75 5.72 -0.52% > 0.68 0.51 -25.00% 5.63 5.39 -4.26% > 0.70 0.62 -11.43% 
5.71 6.13 +7.36%
> 0.41 0.31 -24.39% 3.65 3.24 -11.23%
>
> ham mean and sdev for all runs
> 0.59 0.45 -23.73% 5.07 4.94 -2.56%

Because the ham distribution got tighter and closer to 0, you need a larger spam_cutoff now. A spam_cutoff too low probably explains both the increase in FP rate and the decrease in FN rate.

> spam mean spam sdev
> 99.20 99.32 +0.12% 6.10 5.77 -5.41%
> 99.70 99.71 +0.01% 3.45 3.80 +10.14%
> 99.55 99.68 +0.13% 3.63 3.23 -11.02%
> 99.38 99.44 +0.06% 6.34 6.27 -1.10%
> 99.14 99.19 +0.05% 7.05 7.05 +0.00%
> 99.40 99.47 +0.07% 4.72 5.24 +11.02%
> 99.42 99.50 +0.08% 5.09 5.10 +0.20%
> 99.41 99.51 +0.10% 4.55 4.99 +9.67%
> 99.48 99.62 +0.14% 3.81 3.20 -16.01%
> 99.31 99.39 +0.08% 6.09 5.97 -1.97%
>
> spam mean and sdev for all runs
> 99.40 99.48 +0.08% 5.22 5.21 -0.19%
>
> ham/spam mean difference: 98.81 99.03 +0.22

I saw the same thing (qualitatively), and it's at least curious: ham mean and sdev consistently decrease; spam mean consistently increases, but less so; and the effects on spam sdev are a mixed bag, with almost no net effect when averaged out. BTW, with max_discriminators=150, you *may* have many ham that didn't have 150 unique extreme words, and in that case no longer ignoring the bland words may have a large effect similar to the one in the example above.

From rob@hooft.net Sun Oct 13 12:37:40 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 13:37:40 +0200
Subject: [Spambayes] chi-squared versus "prob strength"
References:
Message-ID: <3DA95B04.2040806@hooft.net>

I'm currently playing with a variant on the S/(S+H) formula. I replaced it with (S-H+1)/2.

Some examples where this doesn't make much difference:

H    S    S/(H+S)  (S-H+1)/2
0.01 0.99 0.99     0.99     Typical spam.
0.99 0.01 0.01     0.01     Typical ham.
0.50 0.50 0.50     0.50     Typical half-way.
0.90 0.90 0.50     0.50     Looks both like ham and spam
0.10 0.10 0.50     0.50     Doesn't look like either
0.80 0.95 0.54     0.57     Both, but a bit more spam

But where it makes a difference is:

H    S    S/(H+S)  (S-H+1)/2
0.05 0.20 0.80     0.57
0.02 0.05 0.71     0.51

Here, the low S value tells you "I don't have any proof that it looks like spam." Just because the H value is even lower, we suddenly put this in, or close to, the realm of certainty using S/(H+S). How come? Well, we're dividing by H+S, which tells the system we're sure it is either ham or spam. If we're fair, however, these messages with H+S << 1 are neither ham nor spam. So maybe we should not divide by H+S at all? Remember, the original formula was (S-H)/(S+H). Replace this by (S-H)/1.0 and you arrive at my (S-H+1)/2, which puts messages that are neither ham nor spam close to 0.50.

Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues, and at
> worst actively hurt (experiment: fiddle your favorite scheme to look *only*
> at the bland words; do they have predictive power?). I think this is one of
> the schemes where they hurt, for the reason illustrated by tiny example at
> the end of my original post:
>
> """
> >>> from chi2 import showscore as s
> >>> s([.2, .8, .9])
> P(chisq >= 8.27033 | v= 6) = 0.218959
> P(chisq >= 3.87588 | v= 6) = 0.693468
> spam prob 0.781040515476
> ham prob 0.306531778646
> S/(S+H) 0.71815043441

(S-H+1)/2 = 0.737

> >>> s([.2, .8, .9] + [0.5] * 10)
> P(chisq >= 22.1333 | v= 26) = 0.681383
> P(chisq >= 17.7388 | v= 26) = 0.885068
> spam prob 0.318617174026
> ham prob 0.114932197304
> S/(S+H) 0.734904015772

(S-H+1)/2 = 0.602

Better, isn't it?
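The showscore numbers quoted above, and the (S-H+1)/2 annotations, can be reproduced with a short self-contained sketch of chi-squared combining. The helper names (`chi2Q`, `combine`, `score_old`, `score_new`) are mine, not spambayes's, though the `chi2Q` series is the standard closed form for the chi-squared survival function with an even number of degrees of freedom:

```python
import math

def chi2Q(x2, v):
    """P(chisq >= x2) for v degrees of freedom, v even."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Chi-squared combining: S and H for a list of word spamprobs."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H

def score_old(S, H):   # the original S/(S+H) rule
    return S / (S + H)

def score_new(S, H):   # Rob's (S-H+1)/2 rule
    return (S - H + 1.0) / 2.0

S, H = combine([.2, .8, .9])
print(round(score_old(S, H), 3), round(score_new(S, H), 3))   # 0.718 0.737

# Ten perfectly neutral 0.5 probs push S/(S+H) further toward spam,
# while (S-H+1)/2 moves back toward the 0.5 middle ground:
S, H = combine([.2, .8, .9] + [0.5] * 10)
print(round(score_old(S, H), 3), round(score_new(S, H), 3))   # 0.735 0.602
```

The second pair of calls is the point of the thread: under S/(S+H), bland words intensify whatever decision would have been made anyway; under (S-H+1)/2 they pull the score toward "unsure".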
Elsewhere you write:

> Lady with the obnoxious sig:
> Ignoring bland words:
> P(chisq >= 222.333 | v=136) = 4.23496e-006
> P(chisq >= 106.24 | v=136) = 0.972237
> spam prob 0.999995765045
> ham prob 0.0277633711662
> S/(S+H) 0.972986500253

(S-H+1)/2 = 0.986

> Including bland words:
>
> P(chisq >= 282.465 | v=220) = 0.00283528
> P(chisq >= 163.095 | v=220) = 0.998449
> spam prob 0.997164718534
> ham prob 0.00155126034776
> S/(S+H) 0.99844674524

(S-H+1)/2 = 0.997

The difference is smaller. This small addition of certainty could be due to the bland words actually contributing.

> The ham whose score rose from 0.68 to 0.87:
> Ignoring bland words:
> P(chisq >= 123.422 | v=100) = 0.0560948
> P(chisq >= 97.2217 | v=100) = 0.560026
> spam prob 0.943905161882
> ham prob 0.439974054337
> S/(S+H) 0.682071925656

(S-H+1)/2 = 0.752

> Including bland words:
> P(chisq >= 174.229 | v=172) = 0.438174
> P(chisq >= 146.746 | v=172) = 0.918976
> spam prob 0.561826411084
> ham prob 0.0810237511331
> S/(S+H) 0.873961685171

(S-H+1)/2 = 0.740

Convinced? With this rule, adding the bland words no longer does any harm.
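For reference, the robinson_minimum_prob_strength filter the thread keeps toggling is just a strength cutoff around 0.5: at the default 0.1 it ignores spamprobs strictly between 0.4 and 0.6. A minimal sketch with my own function names (not spambayes's); inverting the test gives the "bland words only" experiment Rob runs:

```python
def strong_words(probs, min_strength=0.1):
    """Keep only words whose spamprob is at least min_strength away
    from 0.5; the default drops spamprobs strictly inside 0.4..0.6."""
    return [p for p in probs if abs(p - 0.5) >= min_strength]

def bland_words(probs, min_strength=0.1):
    """The inverted test: keep only the near-neutral words."""
    return [p for p in probs if abs(p - 0.5) < min_strength]

probs = [0.2, 0.45, 0.5, 0.55, 0.8]
print(strong_words(probs))   # [0.2, 0.8]
print(bland_words(probs))    # [0.45, 0.5, 0.55]
```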
For my set, with bland words, I end up with 3 spams < 0.01; 15499 hams < 0.01 4 spams < 0.10; 15766 hams < 0.01 9 hams > 0.90; 5658 spams < 0.10 3 hams > 0.99; 5392 spams > 0.99 S/S+H left and (S-H+1)/2 right: cv3s -> cv5s -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams false positive percentages 0.188 0.062 won -67.02% 0.438 0.125 won -71.46% 0.125 0.062 won -50.40% 0.125 0.062 won -50.40% 0.125 0.062 won -50.40% 0.062 0.062 tied 0.250 0.188 won -24.80% 0.188 0.250 lost +32.98% 0.312 0.188 won -39.74% 0.000 0.000 tied won 7 times tied 2 times lost 1 times total unique fp went from 29 to 17 won -41.38% mean fp % went from 0.18125 to 0.10625 won -41.38% false negative percentages 1.034 1.207 lost +16.73% 0.345 0.517 lost +49.86% 0.345 0.862 lost +149.86% 
0.517 0.862 lost +66.73% 1.207 1.207 tied 0.690 1.379 lost +99.86% 0.690 1.034 lost +49.86% 0.345 1.034 lost +199.71% 0.517 1.034 lost +100.00% 0.862 1.552 lost +80.05%

won 0 times
tied 1 times
lost 9 times

total unique fn went from 38 to 62 lost +63.16%
mean fn % went from 0.655172413793 to 1.06896551724 lost +63.16%

ham mean ham sdev
0.39 0.58 +48.72% 4.46 4.94 +10.76%
0.60 0.60 +0.00% 6.59 5.74 -12.90%
0.45 0.60 +33.33% 4.42 4.57 +3.39%
0.41 0.57 +39.02% 4.51 4.46 -1.11%
0.36 0.61 +69.44% 4.06 4.63 +14.04%
0.31 0.41 +32.26% 3.82 4.08 +6.81%
0.52 0.66 +26.92% 5.72 5.48 -4.20%
0.51 0.69 +35.29% 5.39 5.74 +6.49%
0.62 0.70 +12.90% 6.13 5.71 -6.85%
0.31 0.44 +41.94% 3.24 3.76 +16.05%

ham mean and sdev for all runs
0.45 0.59 +31.11% 4.94 4.96 +0.40%

spam mean spam sdev
99.32 98.98 -0.34% 5.77 6.32 +9.53%
99.71 99.25 -0.46% 3.80 4.28 +12.63%
99.68 99.15 -0.53% 3.23 4.55 +40.87%
99.44 98.90 -0.54% 6.27 7.00 +11.64%
99.19 98.96 -0.23% 7.05 6.67 -5.39%
99.47 98.96 -0.51% 5.24 5.93 +13.17%
99.50 98.94 -0.56% 5.10 6.17 +20.98%
99.51 98.95 -0.56% 4.99 5.91 +18.44%
99.62 99.18 -0.44% 3.20 4.70 +46.88%
99.39 98.93 -0.46% 5.97 6.40 +7.20%

spam mean and sdev for all runs
99.48 99.02 -0.46% 5.21 5.86 +12.48%

ham/spam mean difference: 99.03 98.43 -0.60

--
Rob W.W.

From rob@hooft.net Sun Oct 13 17:06:55 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 18:06:55 +0200
Subject: [Spambayes] Bland word only score..
Message-ID: <3DA99A1F.9040506@hooft.net>

[Tim: the previous copy of this message I sent to you was too quick.]

Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues,
> and at worst actively hurt (experiment: fiddle your favorite scheme
> to look *only* at the bland words; do they have predictive power?).

Just for kicks: Yes, with the latest scheme and (S-H+1)/2, it does give a third of a standard deviation of separation on my sets.
And the best is: it doesn't have any false positives :-P [Classifier] use_chi_squared_combining: True robinson_minimum_prob_strength = 0.1 [TestDriver] spam_cutoff: 0.70 nbuckets: 200 best_cutoff_fp_weight: 10 Obviously, the robinson_minimum_prob_strength test is inverted in the code. -> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41 -> min 40.7953; median 49.9561; max 57.7839 40.0 0 40.5 1 * 41.0 0 41.5 1 * 42.0 9 * 42.5 8 * 43.0 17 * 43.5 31 * 44.0 35 * 44.5 61 * 45.0 95 ** 45.5 136 ** 46.0 186 *** 46.5 317 ***** 47.0 383 ****** 47.5 572 ******** 48.0 832 *********** 48.5 1101 *************** 49.0 1455 ******************** 49.5 3829 *************************************************** 50.0 4625 ************************************************************* 50.5 1024 ************** 51.0 520 ******* 51.5 275 **** 52.0 176 *** 52.5 108 ** 53.0 71 * 53.5 66 * 54.0 30 * 54.5 16 * 55.0 10 * 55.5 4 * 56.0 3 * 56.5 2 * 57.0 0 57.5 1 * 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 -> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25 -> min 43.2803; median 50.2241; max 59.1799 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 1 * 43.5 2 * 44.0 1 * 44.5 4 * 45.0 8 * 45.5 12 * 46.0 30 * 46.5 38 * 47.0 53 ** 47.5 65 ** 48.0 94 *** 48.5 118 *** 49.0 206 ***** 49.5 497 ************ 50.0 2580 ************************************************************ 50.5 925 ********************** 51.0 493 ************ 51.5 234 ****** 52.0 135 **** 52.5 88 *** 53.0 95 *** 53.5 43 * 54.0 22 * 54.5 22 * 55.0 17 * 55.5 7 * 56.0 3 * 56.5 2 * 57.0 1 * 57.5 1 * 58.0 0 58.5 0 59.0 3 * 59.5 0 60.0 0 -> best cutoff for all runs: 0.58 -> with weighted total 10*0 fp + 5797 fn = 5797 -> fp rate 0% fn rate 99.9% -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Oct 13 18:58:46 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 19:58:46 +0200
Subject: [Spambayes] Total cost analysis
Message-ID: <3DA9B456.2010908@hooft.net>

I checked in a new program 'cvcost.py' that analyses the total human cost to you of spam filtering, based on a result of timcv.py.

The program is called cvcost.py. The default cost for an unknown message is set to $0.20, for a fn to $1 and for a fp to $10; these numbers can be changed using command line options.

amigo[142]spambayes%% /usr/local/bin/python cvcost.py cv[2345].txt
.........................................................................................
Optimal cost is 127.2 with grey zone between 49.0 and 99.0
.........................................................................................
Optimal cost is 143.4 with grey zone between 49.0 and 98.0
.........................................................................................
Optimal cost is 149.4 with grey zone between 49.0 and 98.0
.........................................................................................
Optimal cost is 103.2 with grey zone between 49.0 and 96.0
/usr/local/bin/python cost.py cv[2345].txt 26.88s user 0.14s system 98% cpu 27.346 total

The four runs that this represents are:

cv2.txt : Tim's suggested run (min_prob_str=0.1)
cv3.txt : Same run but min_prob_str=0 (has more fp)
cv4.txt : Failed try: if H+S<0.3 prob = 0.65 (force middle ground for strange)
cv5.txt : New decision criterion: prob = (S-H+1)/2

The latter is "objectively" the best....

Rob

--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Oct 13 20:24:50 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 21:24:50 +0200
Subject: [Spambayes] Total cost analysis
References: <3DA9B456.2010908@hooft.net>
Message-ID: <3DA9C882.1030903@hooft.net>

I had
cv5.txt : New decision criterion: prob = (S-H+1)/2
    robinson_minimum_prob_strength = 0.0

Adding
cv6.txt : Same as cv5 but with
    robinson_minimum_prob_strength = 0.1

amigo[165]spambayes%% /usr/local/bin/python cvcost.py cv[56].txt
.........................................................................................
cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0
.........................................................................................
cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0

So for me, robinson_minimum_prob_strength = 0.0 gives the best result yet.

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 13 21:42:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 13 Oct 2002 16:42:32 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To: <3DA9B456.2010908@hooft.net>
Message-ID:

[Rob Hooft]
> I checked in a new program 'cvcost.py' that analyses the total human
> cost to you of human spam filtering based on a result of timcv.py

Very cool! Thank you. Everyone, note that this looks at the 'all runs' ham and spam histograms at the end of the file, so the granularity of the analysis is limited by your nbuckets setting. I usually run with nbuckets 200; maybe I should boost the default to that (it's currently 40).

> The program is called cvcost.py. The default cost for an unknown message
> is set to $0.20, for a fn to $1 and for a fp to $10; these numbers can
> be changed using command line options.

I find I can make almost any scheme "the winner" by fiddling these to extreme enough values.
In particular, by boosting the fp cost toward infinity, the all-default scheme Rulz -- even at nbuckets 200, the extreme schemes don't have fine enough granularity in the histograms to weed out the one or two (depending on scheme) extremely high-scoring false positives in my data. But I don't actually care if the Nigerian scam quote gets rejected, so like all automated analyses this has to be tempered with judgment. It's a wonderfully useful tool then! PS: I'm rerunning my fat test now with your alternative S-and-H combination scheme; I sure agree I like the effects it had in the examples you presented; we'll see whether my data agrees too ... From tim.one@comcast.net Mon Oct 14 02:05:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 13 Oct 2002 21:05:28 -0400 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: <3DA95B04.2040806@hooft.net> Message-ID: [Rob Hooft] > I'm playing currently with a variant on the S/(S+H) formula. I replaced > it with (S-H+1)/2 [and then shows specific examples where this gives intuitively more- sensible endcase results than the current rule] > ... > Better, isn't it? > ... > Convinced? I was, but more importantly my test data agreed, so I'm going to switch to this (the evidence is so consistent and solid on both our datasets that making it an option would supply a pointless choice -- losers are killed). Good show! 
S/(S+H) before, (S-H+1)/2 after (all defaults except use_chi_squared_combining in both): ham mean ham sdev 0.39 0.29 -25.64% 3.47 2.98 -14.12% 0.33 0.24 -27.27% 3.13 2.66 -15.02% 0.40 0.31 -22.50% 3.54 3.23 -8.76% 0.23 0.16 -30.43% 2.24 1.78 -20.54% 0.47 0.39 -17.02% 4.38 4.06 -7.31% 0.31 0.24 -22.58% 3.05 2.73 -10.49% 0.38 0.28 -26.32% 3.23 2.71 -16.10% 0.29 0.21 -27.59% 2.80 2.35 -16.07% 0.30 0.23 -23.33% 2.90 2.51 -13.45% 0.55 0.43 -21.82% 4.45 4.08 -8.31% ham mean and sdev for all runs 0.36 0.28 -22.22% 3.38 2.99 -11.54% spam mean spam sdev 99.93 99.95 +0.02% 1.25 1.01 -19.20% 99.94 99.96 +0.02% 1.24 1.11 -10.48% 99.98 99.99 +0.01% 0.34 0.19 -44.12% 99.92 99.93 +0.01% 1.84 1.93 +4.89% 99.93 99.94 +0.01% 1.72 1.59 -7.56% 99.88 99.90 +0.02% 1.95 1.72 -11.79% 99.86 99.88 +0.02% 2.22 2.27 +2.25% 99.91 99.94 +0.03% 1.26 0.83 -34.13% 99.90 99.92 +0.02% 1.75 1.55 -11.43% 99.96 99.97 +0.01% 0.73 0.43 -41.10% spam mean and sdev for all runs 99.92 99.94 +0.02% 1.53 1.41 -7.84% ham/spam mean difference: 99.56 99.66 +0.10 So it's even more extreme this way, but not in a way that hurts: the weird msgs in "the middle ground" are even more reliably *in* the middle ground now. For example, in my data, conference announcements, and the very difficult but rare long & chatty spam, almost always end up scoring near 0.5 now. But the regions of "extreme certainty" contain more msgs at the same time: HAM BEFORE -> Ham scores for all runs: 20000 items; mean 0.36; sdev 3.38 -> min -1.9984e-013; median 1.18333e-010; max 100 * = 319 items 0.0 19401 ************************************************************* 0.5 97 * HAM AFTER -> Ham scores for all runs: 20000 items; mean 0.28; sdev 2.99 -> min -9.99201e-014; median 6.28553e-011; max 100 * = 320 items 0.0 19492 ************************************************************* 0.5 104 * Median, mean and sdev all decreased, and about 100 more hams scored below 0.05. 
SPAM BEFORE -> Spam scores for all runs: 14000 items; mean 99.92; sdev 1.53 -> min 35.983; median 100; max 100 * = 228 items 99.0 15 * 99.5 13906 ************************************************************* SPAM AFTER -> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.41 -> min 29.6176; median 100; max 100 * = 229 items 99.0 13 * 99.5 13918 ************************************************************* The effects are milder here, but still in the right direction. The "BlackIntrepid" spam is the min-scoring spam in both cases: prob('*H*') = 0.930885 prob('*S*') = 0.523237 Chop that up any way you want, it's always going to look more like ham than spam, and it does look a lot like legit c.l.py traffic. cvcost doesn't find much bottom-line difference: chisq.txt: Optimal cost is $27.2 with grey zone between 50.0 and 74.0 chisq_altsh.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0 Given that I have two false positives that are never going to go away, and they're charged $10 each, the cost of both methods for 34,000 msgs is trivial. From tim.one@comcast.net Mon Oct 14 06:42:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 01:42:58 -0400 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) In-Reply-To: <3DA99A1F.9040506@hooft.net> Message-ID: [Rob Hooft] > [Tim: the previous copy of this message I sent to you was too quick.] Ah, replied to that privately. Bottom line: [tail end of histograms after running looking *only* at bland words] > -> best cutoff for all runs: 0.58 > -> with weighted total 10*0 fp + 5797 fn = 5797 > -> fp rate 0% fn rate 99.9% The overlap is so bad that even with 200 buckets, the best the histogram analysis could do is suggest a cutoff with a nearly 100% FN rate. 
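Both the histogram analysis's "best cutoff" line (minimizing the weighted total 10*fp + fn) and cvcost's grey-zone search boil down to scanning candidate thresholds for minimum cost. Here is a rough illustration of the two searches, using the $10/$1/$0.20 costs Rob described; the function names are mine and this is an assumed sketch of the idea, not cvcost.py's or the TestDriver's actual code:

```python
def best_cutoff(ham, spam, fp_weight=10, nbuckets=200):
    """Histogram-style search: one cutoff minimizing fp_weight*fp + fn."""
    grid = (100.0 * i / nbuckets for i in range(nbuckets + 1))
    return min((fp_weight * sum(1 for s in ham if s >= cut)   # ham called spam
                + sum(1 for s in spam if s < cut),            # spam called ham
                cut)
               for cut in grid)

def grey_zone_cost(ham, spam, lo, hi, c_fp=10.0, c_fn=1.0, c_unsure=0.2):
    """Cost if scores above hi are called spam, below lo ham, and the
    rest ('unsure') are handed to a human for review."""
    fp = sum(1 for s in ham if s > hi)
    fn = sum(1 for s in spam if s < lo)
    unsure = sum(1 for s in ham + spam if lo <= s <= hi)
    return c_fp * fp + c_fn * fn + c_unsure * unsure

def optimal_grey_zone(ham, spam, nbuckets=200):
    """cvcost-style brute force over bucket-boundary pairs; granularity
    is limited by nbuckets, as Tim notes in the thread."""
    grid = [100.0 * i / nbuckets for i in range(nbuckets + 1)]
    return min((grey_zone_cost(ham, spam, lo, hi), lo, hi)
               for lo in grid for hi in grid if lo <= hi)

# toy scores on the 0-100 histogram scale: one borderline msg on each side
cost, lo, hi = optimal_grey_zone([1, 2, 3, 50], [50, 97, 98, 99])
```

With the toy data, the cheapest choice is a grey zone covering both borderline 50-scores (two reviews at $0.20 each), rather than eating a $10 false positive or a $1 false negative.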
> -> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41 > -> min 40.7953; median 49.9561; max 57.7839 > -> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25 > -> min 43.2803; median 50.2241; max 59.1799 So whether ham or spam, nearly half the bland words point in the wrong direction. It's too much like adding in coin flips for my tastes. > I had cv5.txt : New decision criterion: prob = (S-H+1)/2 > robinson_minimum_prob_strength = 0.0 > > Adding cv6.txt : Same as cv5 but with > robinson_minimum_prob_strength = 0.1 > > > amigo[165]spambayes%% /usr/local/bin/python cvcost.py cv[56].txt > cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0 > cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0 > > So for me, robinson_minimum_prob_strength = 0.0 gives the best result > yet. It didn't help on my data: chisq.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0 bland.txt: Optimal cost is $28.2 with grey zone between 50.0 and 85.0 The difference is so small I can't swear it hurt, either. I think the difference in your case is too small to be confident too. There's *one* scheme where including the bland words helps me: there's another option use_z_combining I haven't talked about here, which implements another speculative idea from Gary. That one is, well, extremely extreme. Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam scored under 0.50. The 16 FP include my 2 that will never go away, and they score 1.00000000000 and 0.999693086732 even with the bland words. BTW, in *some* sense the z-combining score is an actual probability. With the all-default costs, cvcost sez z-combining worked even better for me (including all bland words): zcomb.txt: Optimal cost is $26.8 with grey zone between 75.0 and 90.0 The difference between that and chisq.txt's $27.00 is one "not sure" msg out of 34,000, so I'm not highly motivated to pursue it. 
But I encourage others to try it -- it may work better on harder data than mine! I'll note that it suffers its own form of "cancellation disease" (one of my very long spam scored 0.0000000000041), which the chi-squared scheme is refreshingly free of (that same spam scored 0.5 under chi combining). If you want to try it, I suggest """ [Classifier] use_z_combining: True robinson_minimum_prob_strength: 0.0 [TestDriver] nbuckets: 200 """ I'd rather that people who haven't been playing along lately try chi-combining, though, because as far as I'm concerned, the results so far say it's the best scheme we've got -- and as someone else recently suggested, it's high time to start killing off the losers again. """ [Classifier] use_chi_squared_combining: True [TestDriver] nbuckets: 200 """ I sped that up, BTW (it invokes log() up to 150x less often now). Note that chi and z combining do NOT require "the third" training pass, so cross-validation tests can be run in the default "high speed" mode (incremental training and untraining work fine with these). From rob@hooft.net Mon Oct 14 07:18:49 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Mon, 14 Oct 2002 08:18:49 +0200 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) References: Message-ID: <3DAA61C9.4090108@hooft.net> Tim Peters wrote: > There's *one* scheme where including the bland words helps me: there's > another option use_z_combining I haven't talked about here, which implements > another speculative idea from Gary. That one is, well, extremely extreme. > Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam > scored under 0.50. The 16 FP include my 2 that will never go away, and they > score 1.00000000000 and 0.999693086732 even with the bland words. BTW, in > *some* sense the z-combining score is an actual probability. I tried z-combining before going to bed last night (saw it in the CVS), but it cost me $6 more for my 21800 messages than chi2-combining (i.e. 
a whopping 0.03 cents per message) and I didn't have the time this morning before going to work to check why. The problem starts to be to find a set of corpuses that are difficult enough to score; I am quite happy with the separation I have now.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From anthony@interlink.com.au Mon Oct 14 07:31:31 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 14 Oct 2002 16:31:31 +1000 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: Message-ID: <200210140631.g9E6VWn31907@localhost.localdomain> >>> Tim Peters wrote > I was, but more importantly my test data agreed, so I'm going to switch to > this (the evidence is so consistent and solid on both our datasets that > making it an option would supply a pointless choice -- losers are killed). > Good show! Here's what my mungo-test set shows for this (before is pre-Rob Hooft's change, after is current CVS) chi2s.txt -> chi2as.txt -> tested 3490 hams & 1687 spams against 31410 hams & 15161 spams -> tested 3490 hams & 1682 spams against 31410 hams & 15166 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1679 spams against 31410 hams & 15169 spams -> tested 3490 hams & 1686 spams against 31410 hams & 15162 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1678 spams against 31410 hams & 15170 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1683 spams against 31410 hams & 15165 spams -> tested 3490 hams & 1689 spams against 31410 hams & 15159 spams -> tested 3490 hams & 1687 spams against 31410 hams & 15161 spams -> tested 3490 hams & 1682 spams against 31410 hams & 15166 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1679 spams against 31410 hams & 15169 spams -> tested 3490 hams & 1686 spams against 31410 hams & 15162 spams -> tested 3490 hams & 
1688 spams against 31410 hams & 15160 spams
-> tested 3490 hams & 1678 spams against 31410 hams & 15170 spams
-> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams
-> tested 3490 hams & 1683 spams against 31410 hams & 15165 spams
-> tested 3490 hams & 1689 spams against 31410 hams & 15159 spams

false positive percentages
0.946 0.974 lost +2.96%
0.917 0.917 tied
0.802 0.831 lost +3.62%
0.659 0.860 lost +30.50%
0.573 0.659 lost +15.01%
0.802 0.831 lost +3.62%
0.716 0.745 lost +4.05%
0.516 0.544 lost +5.43%
0.630 0.688 lost +9.21%
0.917 1.003 lost +9.38%

won 0 times
tied 1 times
lost 9 times

total unique fp went from 261 to 281 lost +7.66%
mean fp % went from 0.747851002865 to 0.805157593123 lost +7.66%

false negative percentages
0.356 0.296 won -16.85%
0.119 0.059 won -50.42%
0.237 0.237 tied
0.476 0.476 tied
0.297 0.237 won -20.20%
0.415 0.415 tied
0.596 0.477 won -19.97%
0.296 0.237 won -19.93%
0.416 0.416 tied
0.355 0.296 won -16.62%

won 6 times
tied 4 times
lost 0 times

total unique fn went from 60 to 53 won -11.67%
mean fn % went from 0.356257958499 to 0.314689990048 won -11.67%

ham mean ham sdev
3.46 3.24 -6.36% 12.12 11.96 -1.32%
3.01 2.85 -5.32% 11.48 11.39 -0.78%
3.28 3.01 -8.23% 11.45 11.22 -2.01%
3.23 3.02 -6.50% 11.43 11.27 -1.40%
3.15 2.88 -8.57% 10.65 10.37 -2.63%
3.17 2.95 -6.94% 11.30 11.07 -2.04%
3.27 3.02 -7.65% 11.29 10.94 -3.10%
3.06 2.82 -7.84% 10.51 10.20 -2.95%
3.32 3.13 -5.72% 11.37 11.18 -1.67%
3.45 3.21 -6.96% 11.75 11.59 -1.36%

ham mean and sdev for all runs
3.24 3.01 -7.10% 11.34 11.13 -1.85%

spam mean spam sdev
99.75 99.76 +0.01% 3.91 3.85 -1.53%
99.90 99.91 +0.01% 1.62 1.38 -14.81%
99.81 99.82 +0.01% 3.09 3.05 -1.29%
99.60 99.62 +0.02% 4.92 4.80 -2.44%
99.78 99.78 +0.00% 3.24 3.36 +3.70%
99.78 99.78 +0.00% 3.04 3.14 +3.29%
99.62 99.62 +0.00% 4.73 4.78 +1.06%
99.79 99.81 +0.02% 2.75 2.66 -3.27%
99.66 99.66 +0.00% 4.47 4.62 +3.36%
99.70 99.70 +0.00% 4.37 4.32 -1.14%

spam mean and sdev for all runs
99.74 99.75 +0.01% 3.75 3.75 +0.00%

ham/spam mean difference: 96.50 96.74 +0.24

Here's the histograms from the 'after' case:

-> Ham scores for all runs: 34900 items; mean 3.01; sdev 11.13
-> min -9.99201e-14; median 0.000498415; max 100
* = 448 items
0.0 27319 *************************************************************
[... 200 histogram bucket lines elided ...]
99.5 50 *

-> Spam scores for all runs: 16848 items; mean 99.75; sdev 3.75
-> min 0.00333927; median 100; max 100
* = 273 items
0.0 1 *
[... 200 histogram bucket lines elided ...]
99.5 16628 *************************************************************

-> best cutoff for all runs: 0.995
-> with weighted total 10*50 fp + 220 fn = 720
-> fp rate 0.143% fn rate 1.31%

From tim.one@comcast.net Mon Oct 14 18:34:39 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 13:34:39 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To:
Message-ID:

In order to ease "middle ground" testing, I redid the automatic histogram analysis to do total-cost minimization similar to that done by Rob's cvcost.py. Here's highly atypical sample output. It's from a tiny run so that you can see by eyeball what it means:

-> Ham scores for this pair: 10 items; mean 1.04; sdev 1.21
-> min 0.000428085; median 0.45401; max 3.12227
* = 1 items
0.0 5 *****
0.5 2 **
1.0 0
1.5 0
2.0 1 *
2.5 0
3.0 2 **
3.5 0
...

-> Spam scores for this pair: 10 items; mean 100.00; sdev 0.00
-> min 100; median 100; max 100
* = 1 items
...
99.0 0
99.5 10 **********

-> best cost $0.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 18721 cutoff pairs
-> smallest ham & spam cutoffs 0.035 & 0.035
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.995 & 0.995
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%

This is trivial because no "middle ground" is needed here: calling everything >= 0.035 spam works exactly as well as calling everything >= 0.995 spam, and there are no mistakes or unsures in either case.
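[The cutoff-pair scheme described above reduces to a simple three-way decision rule. The following is a minimal illustrative sketch, not the test-driver code; the names hamc/spamc follow the new option documentation, and the default values here are just the trivial run's 0.035 cutoffs:]

```python
def classify(score, hamc=0.035, spamc=0.035):
    """Three-way decision over a cutoff pair, 0.0 <= hamc <= spamc <= 1.0.

    score < hamc          -> "ham"
    score >= spamc        -> "spam"
    hamc <= score < spamc -> "unsure" (the middle ground)
    """
    if score < hamc:
        return "ham"
    if score >= spamc:
        return "spam"
    return "unsure"
```

[When hamc == spamc, as in the trivial run, the middle ground is empty and this degenerates to the old single spam_cutoff rule.]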
Less trivial, because the ham scores slobber all over the range:

-> Ham scores for all runs: 100 items; mean 8.09; sdev 17.25
-> min 3.24153e-007; median 0.846144; max 97.6463
* = 1 items
0.0 44 ********************************************
[... 200 histogram bucket lines elided ...]
99.5 0

-> Spam scores for all runs: 100 items; mean 99.87; sdev 0.71
-> min 94.9387; median 100; max 100
* = 2 items
...
94.0 0
94.5 1 *
95.0 0
95.5 0
96.0 1 *
96.5 1 *
97.0 0
97.5 0
98.0 0
98.5 0
99.0 1 *
99.5 96 ************************************************

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 141 cutoff pairs
-> smallest ham & spam cutoffs 0.715 & 0.98
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.945 & 0.99
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%

There is a middle ground here: saying something is "unsure" if 0.715 <= score < 0.98 works exactly as well as 0.945 <= score < 0.99, and there are 141-2 = 139 other cutoff pairs from the histogram boundaries that also achieve cost $0.80 (== 4 msgs in the middle ground, and no errors outside the middle ground).

The default nbuckets has been boosted to 200, although TestDriver.printhist() (which does this display and computation) can be passed any number of buckets "after the fact", provided you saved the histogram objects as pickles. There are two new options to support this:

"""
# After the display of a ham+spam histogram pair, you can get a listing of
# all the cutoff values (coinciding with histogram bucket boundaries) that
# minimize
#
#     best_cutoff_fp_weight * (# false positives) +
#     best_cutoff_fn_weight * (# false negatives) +
#     best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs: hamc and spamc, where
#
#     0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called "I'm not sure" -- the middle ground.
#
# Note that cvcost.py does a similar analysis.
#
# Note: You may wish to increase nbuckets, to give this scheme more
# cutoff values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight: 10.00
best_cutoff_fn_weight: 1.00
best_cutoff_unsure_weight: 0.20
"""

Note that the default values match cvcost.py's defaults.
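[The minimization those options describe can be sketched in a few lines. This is only an illustration, not the actual TestDriver code: assume the ham and spam histograms are plain lists of per-bucket counts over [0, 1], so that bucket boundaries serve as the candidate cutoff values:]

```python
def best_cutoffs(ham_counts, spam_counts,
                 fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Search all boundary pairs (hamc, spamc), hamc <= spamc, for the
    pair(s) minimizing

        fp_weight * fp + fn_weight * fn + unsure_weight * unsure

    Buckets below hamc are called ham, buckets at or above spamc are
    called spam, and buckets in between are unsure."""
    n = len(ham_counts)
    assert len(spam_counts) == n
    # Prefix sums so each candidate pair is evaluated in O(1).
    ham_pre, spam_pre = [0], [0]
    for h, s in zip(ham_counts, spam_counts):
        ham_pre.append(ham_pre[-1] + h)
        spam_pre.append(spam_pre[-1] + s)
    best, winners = None, []
    for i in range(n + 1):          # hamc at boundary i
        for j in range(i, n + 1):   # spamc at boundary j >= i
            fp = ham_pre[n] - ham_pre[j]      # ham scoring >= spamc
            fn = spam_pre[i]                  # spam scoring < hamc
            unsure = (ham_pre[j] - ham_pre[i]) + (spam_pre[j] - spam_pre[i])
            cost = fp_weight * fp + fn_weight * fn + unsure_weight * unsure
            if best is None or cost < best:
                best, winners = cost, [(i, j)]
            elif cost == best:
                winners.append((i, j))
    return best, winners
```

[The search is brute force over all boundary pairs, O(nbuckets**2) candidates, which is why raising nbuckets to 200 gives the analysis more, and finer-grained, cutoff pairs to report.]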
From tim.one@comcast.net Mon Oct 14 18:57:03 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 13:57:03 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To:
Message-ID:

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
CAUTION: For the attached histogram pair, cvcost sez:

    tcap.txt: Optimal cost is $10.0 with grey zone between 89.0 and 97.0

but the new histogram analysis says:

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 24 cutoff pairs
-> smallest ham & spam cutoffs 0.855 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%
-> largest ham & spam cutoffs 0.97 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%

and eyeballing the histograms shows that the latter is correct. I don't know why cvcost.py thinks $10.00 is the best that can be done; I suspect it's because it's skipping some cutoff pairs in order to save time.
---------------------- multipart/mixed attachment
-> Ham scores for all runs: 100 items; mean 7.21; sdev 18.87
-> min 3.34881e-009; median 0.18187; max 99.2347
* = 2 items
0.0 63 ********************************
[... 200 histogram bucket lines elided ...]
99.5 0

-> Spam scores for all runs: 100 items; mean 99.94; sdev 0.34
-> min 97.0896; median 100; max 100
* = 2 items
0.0 0
[... 200 histogram bucket lines elided ...]
99.5 97 *************************************************

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 24 cutoff pairs
-> smallest ham & spam cutoffs 0.855 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%
-> largest ham & spam cutoffs 0.97 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%

C:\Code\spambayes>tcap/u
---------------------- multipart/mixed attachment--

From bkc@murkworks.com Mon Oct 14 19:08:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 14 Oct 2002 14:08:15 -0400
Subject: [Spambayes] Comparing chi to zcombine
Message-ID: <3DAACFCE.25807.14D268A@localhost>

First, cmp.py results/chitrues.txt -> results/zcombines.txt

-> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams

false positive percentages
0.154 0.154 tied
0.231 0.077 won -66.67%
0.077 0.154 lost +100.00%
0.154 0.462 lost +200.00%
0.154 0.154 tied
0.077 0.154 lost +100.00%
0.077 0.077 tied
0.000 0.000 tied
0.231 0.538 lost +132.90%
0.000 0.154 lost +(was 0)

won 1 times
tied 4 times
lost 5 times

total unique fp went from 15 to 25 lost +66.67%
mean fp % went from 0.115384615385 to 0.192307692308 lost +66.67%

false negative percentages
0.846 1.308 lost +54.61%
1.231 1.538 lost +24.94%
1.154 1.308 lost +13.34%
0.615 0.846 lost +37.56%
0.923 1.000 lost +8.34%
1.308 1.154 won -11.77%
0.692 1.077 lost +55.64%
1.077 1.231 lost +14.30%
1.231 1.154 won -6.26%
1.231 1.077 won -12.51%

won 3 times
tied 0 times
lost 7 times

total unique fn went from 134 to 152 lost +13.43%
mean fn % went from 1.03076923077 to 1.16923076923 lost +13.43%

ham mean ham sdev
1.40 1.37 -2.14% 8.67 9.24 +6.57%
1.12 1.03 -8.04% 8.09 8.37 +3.46%
1.12 1.02 -8.93% 8.02 7.86 -2.00%
1.26 1.13 -10.32% 8.62 9.07 +5.22%
1.06 1.04 -1.89% 8.03 8.27 +2.99%
1.01 0.86 -14.85% 6.87 7.08 +3.06%
0.85 0.71 -16.47% 6.57 6.55 -0.30%
0.96 0.90 -6.25% 7.06 7.56 +7.08%
1.15 1.00 -13.04% 8.38 8.67 +3.46%
1.01 0.77 -23.76% 7.62 7.11 -6.69%

ham mean and sdev for all runs
1.09 0.98 -10.09% 7.83 8.03 +2.55%

spam mean spam sdev
99.74 99.75 +0.01% 3.59 3.85 +7.24%
99.67 99.71 +0.04% 4.17 3.95 -5.28%
99.68 99.70 +0.02% 4.12 4.43 +7.52%
99.83 99.81 -0.02% 2.68 3.16 +17.91%
99.84 99.91 +0.07% 2.20 0.96 -56.36%
99.66 99.73 +0.07% 4.29 3.92 -8.62%
99.67 99.74 +0.07% 4.68 4.28 -8.55%
99.79 99.81 +0.02% 2.98 2.52 -15.44%
99.75 99.78 +0.03% 3.24 2.85 -12.04%
99.54 99.68 +0.14% 5.07 4.96 -2.17%

spam mean and sdev for all runs
99.72 99.76 +0.04% 3.80 3.66 -3.68%

ham/spam mean difference: 98.63 98.78 +0.15

And now, the zcombine histogram

-> Ham scores for all runs: 13000 items; mean 0.98; sdev 8.03
-> min -6.66134e-14; median 0; max 100
* = 205 items
0.0 12487 *************************************************************
[... 200 histogram bucket lines elided ...]
99.5 23 *

-> Spam scores for all runs: 13000 items; mean 99.76; sdev 3.66
-> min 0; median 100; max 100
* = 210 items
0.0 5 *
[... 200 histogram bucket lines elided ...]
99.5 12794 *************************************************************

-> best cutoff for all runs: 0.985
-> with weighted total 10*25 fp + 152 fn = 402
-> fp rate 0.192% fn rate 1.17%
saving ham histogram pickle to class_hamhist.pik
saving spam histogram pickle to class_spamhist.pik

.ini for zcombine run

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit = False
use_central_limit2 = False
use_central_limit3 = False
use_tim_combining: False
use_chi_squared_combining: False
use_z_combining: True
robinson_minimum_prob_strength: 0.0

[TestDriver]
spam_cutoff: 0.985
show_false_negatives: True
show_false_positives: True
nbuckets: 200
best_cutoff_fp_weight: 10
show_spam_lo: 0.4
show_spam_hi: 0.80
show_ham_lo = 0.40
show_ham_hi = 0.80
show_charlimit: 10000
save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From popiel@wolfskeep.com Mon Oct 14 19:57:13 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Mon, 14 Oct 2002 11:57:13 -0700
Subject: [Spambayes] defaults vs. chi-square
Message-ID: <20021014185713.4604CF4D4@cashew.wolfskeep.com>

I'm being lazy today, so I haven't put this one up on my website in all its gory detail.

I did a cvs up, catching the changes to the histograms and the cost determinations. I did not catch Tim's last modification for tagging the cost computations with set/all discriminators.

cv1 is all defaults. cv2 is chi-square, but otherwise default.

"""
cv1s -> cv2s
-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[yadda yadda yadda]
-> tested 200 hams & 200 spams against 1800 hams & 1800 spams

false positive percentages
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.500 lost +(was 0)
1.000 1.000 tied
0.000 0.500 lost +(was 0)
0.000 0.000 tied
0.500 1.000 lost +100.00%
0.000 0.000 tied

won 0 times
tied 7 times
lost 3 times

total unique fp went from 4 to 7 lost +75.00%
mean fp % went from 0.2 to 0.35 lost +75.00%

false negative percentages
2.000 1.500 won -25.00%
1.500 0.500 won -66.67%
4.000 2.000 won -50.00%
2.000 1.000 won -50.00%
2.000 1.500 won -25.00%
3.000 2.000 won -33.33%
5.000 3.500 won -30.00%
3.000 1.500 won -50.00%
5.000 2.500 won -50.00%
2.000 0.500 won -75.00%

won 10 times
tied 0 times
lost 0 times

total unique fn went from 59 to 33 won -44.07%
mean fn % went from 2.95 to 1.65 won -44.07%

ham mean ham sdev
17.22 0.50 -97.10% 7.39 7.04 -4.74%
18.69 0.27 -98.56% 7.27 3.71 -48.97%
18.86 0.04 -99.79% 6.50 0.41 -93.69%
16.79 0.41 -97.56% 7.75 4.13 -46.71%
18.66 0.36 -98.07% 7.09 4.84 -31.73%
18.47 1.01 -94.53% 7.83 9.42 +20.31%
18.19 0.51 -97.20% 6.99 5.47 -21.75%
18.38 0.16 -99.13% 6.80 1.94 -71.47%
17.67 0.95 -94.62% 7.88 9.40 +19.29%
17.72 0.14 -99.21% 6.18 1.88 -69.58%

ham mean and sdev for all runs
18.07 0.44 -97.57% 7.22 5.65 -21.75%

spam mean spam sdev
75.58 98.42 +30.22% 9.15 10.85 +18.58%
76.81 99.26 +29.23% 8.53 5.56 -34.82%
74.95 97.82 +30.51% 9.44 12.18 +29.03%
76.18 98.85 +29.76% 8.64 8.90 +3.01%
76.55 98.55 +28.74% 8.84 9.65 +9.16%
76.08 98.31 +29.22% 8.69 11.21 +29.00%
75.61 97.25 +28.62% 9.72 13.12 +34.98%
76.51 98.98 +29.37% 8.30 6.15 -25.90%
75.92 98.26 +29.43% 9.62 10.37 +7.80%
75.52 99.01 +31.10% 8.76 5.46 -37.67%

spam mean and sdev for all runs
75.97 98.47 +29.62% 9.00 9.72 +8.00%

ham/spam mean difference: 57.90 98.03 +40.13
"""

Nothing too surprising, though I wonder if it would be good to mangle cmp.py to output a table for unsure like it does for fp and fn. It also looks like it's using the raw untuned numbers for fp and fn, instead of the computed best values.

The best info for cv1 (defaults):

"""
-> best cost $41.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.425 & 0.635
->     fp 0; fn 6; unsure ham 14; unsure spam 162
->     fp rate 0%; fn rate 0.3%; unsure rate 4.4%
"""

The best info for cv2 (chi-square):

"""
-> best cost $48.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.03 & 0.89
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
-> largest ham & spam cutoffs 0.03 & 0.9
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
"""

The histograms for chi-square look pretty much like all the other histograms reported here (big spikes at the ends for the ham and spam, several spread lightly (and fairly evenly) over the middle ground).
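[The quoted best costs follow directly from the stated per-message weights; `cost` below is just a hypothetical helper restating the weighted sum, not part of the test driver:]

```python
def cost(fp, fn, unsure, fp_w=10.0, fn_w=1.0, unsure_w=0.2):
    # Weighted total: $10 per false positive, $1 per false negative,
    # $0.20 per message left in the unsure middle ground.
    return fp_w * fp + fn_w * fn + unsure_w * unsure

print(cost(0, 6, 14 + 162))  # quoted best cost for cv1 (defaults) was $41.20
print(cost(3, 6, 12 + 48))   # quoted best cost for cv2 (chi-square) was $48.00
```

[For cv1: 10*0 + 1*6 + 0.2*176 = 41.20; for cv2: 10*3 + 1*6 + 0.2*60 = 48.00, matching the reports above.]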
I must say that I like chi-square best out of all the ones I've tested, since it has fairly obvious points for the cutoffs (I suspect that .05 and .90 are not too far from optimal for just about everyone), and it does have a useful middle ground.

(The false positives I get from it are fairly hopeless cases: FDIC informing customers that NextBank died, a contractor's bid containing only an encoded .pdf, info requests wrt getting a new mortgage. The false negatives are a bunch of particularly chatty spams, and one or two with empty bodies. Again, fairly hopeless.)

I'll be testing the zcombining shortly.

- Alex

From tim.one@comcast.net Mon Oct 14 20:35:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 15:35:33 -0400
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014185713.4604CF4D4@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> I'm being lazy today, so I haven't put this one up on my
> website in all its gory detail.

I confess I haven't been able to make enough time to follow all the msgs on this list carefully, let alone cruise the web mining more details. If stupid beats smart here, let's hope lazy beats ambitious too .

> I did a cvs up, catching the changes to the histograms
> and the cost determinations.

Good!

> I did not catch Tim's last modification for tagging the cost
> computations with set/all discriminators.

That's fine -- purely cosmetic, no difference in results.

> cv1 is all defaults. cv2 is chi-square, but otherwise default.
> > """ > cv1s -> cv2s > -> tested 200 hams & 200 spams against 1800 hams & 1800 spams > [yadda yadda yadda] > -> tested 200 hams & 200 spams against 1800 hams & 1800 spams > > false positive percentages > 0.500 0.500 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.500 lost +(was 0) > 1.000 1.000 tied > 0.000 0.500 lost +(was 0) > 0.000 0.000 tied > 0.500 1.000 lost +100.00% > 0.000 0.000 tied > > won 0 times > tied 7 times > lost 3 times > > total unique fp went from 4 to 7 lost +75.00% > mean fp % went from 0.2 to 0.35 lost +75.00% > > false negative percentages > 2.000 1.500 won -25.00% > 1.500 0.500 won -66.67% > 4.000 2.000 won -50.00% > 2.000 1.000 won -50.00% > 2.000 1.500 won -25.00% > 3.000 2.000 won -33.33% > 5.000 3.500 won -30.00% > 3.000 1.500 won -50.00% > 5.000 2.500 won -50.00% > 2.000 0.500 won -75.00% > > won 10 times > tied 0 times > lost 0 times > > total unique fn went from 59 to 33 won -44.07% > mean fn % went from 2.95 to 1.65 won -44.07% > > ham mean ham sdev > 17.22 0.50 -97.10% 7.39 7.04 -4.74% > 18.69 0.27 -98.56% 7.27 3.71 -48.97% > 18.86 0.04 -99.79% 6.50 0.41 -93.69% > 16.79 0.41 -97.56% 7.75 4.13 -46.71% > 18.66 0.36 -98.07% 7.09 4.84 -31.73% > 18.47 1.01 -94.53% 7.83 9.42 +20.31% > 18.19 0.51 -97.20% 6.99 5.47 -21.75% > 18.38 0.16 -99.13% 6.80 1.94 -71.47% > 17.67 0.95 -94.62% 7.88 9.40 +19.29% > 17.72 0.14 -99.21% 6.18 1.88 -69.58% > > ham mean and sdev for all runs > 18.07 0.44 -97.57% 7.22 5.65 -21.75% > > spam mean spam sdev > 75.58 98.42 +30.22% 9.15 10.85 +18.58% > 76.81 99.26 +29.23% 8.53 5.56 -34.82% > 74.95 97.82 +30.51% 9.44 12.18 +29.03% > 76.18 98.85 +29.76% 8.64 8.90 +3.01% > 76.55 98.55 +28.74% 8.84 9.65 +9.16% > 76.08 98.31 +29.22% 8.69 11.21 +29.00% > 75.61 97.25 +28.62% 9.72 13.12 +34.98% > 76.51 98.98 +29.37% 8.30 6.15 -25.90% > 75.92 98.26 +29.43% 9.62 10.37 +7.80% > 75.52 99.01 +31.10% 8.76 5.46 -37.67% > > spam mean and sdev for all runs > 75.97 98.47 +29.62% 9.00 9.72 +8.00% > > ham/spam 
mean difference: 57.90 98.03 +40.13 > """ > > Nothing too surprising, though I wonder if it would be good > to mangle cmp.py to output a table for unsure like it does > for fp and fn. It also looks like it's using the raw untuned > numbers for fp and fn, instead of the computed best values. Yes, cmp.py doesn't look at the histograms at all, it's mining the individual > ... > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.500 lost +(was 0) > 1.000 1.000 tied > 0.000 0.500 lost +(was 0) > 0.000 0.000 tied > ... output lines. Those are still based on a single value for spam_cutoff, and a single cutoff value doesn't really make sense for the "middle ground" schemes. The mean and sdev stats remain interesting for these schemes, but cmp.py's fn and fp accounts are at best misleading for the middle-ground schemes. For now, the histogram analysis is the best analytic ouput we get for such schemes. > The best info for cv1 (defaults): > > """ > -> best cost $41.20 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.425 & 0.635 > -> fp 0; fn 6; unsure ham 14; unsure spam 162 > -> fp rate 0%; fn rate 0.3%; unsure rate 4.4% > """ The all-default scheme does do very well; the practical difficulty has been that "the best" cutoff values seem extremely corpus-dependent, and even so require 3 digits of precision to express, and change depending on how much data you train on. Cutoffs that can only be determined after the fact, and only when knowing exactly what the classifications *should* have been, are impractical on several counts. Still, if you had a time machine (so could pick "the best" cutoffs later and apply them retroactively), nothing else really does better. 
> The best info for cv2 (chi-square):
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """

And this seems a lot easier to live with in a world without time machines: the middle ground spans a huge range of scores, yet contains a lot fewer msgs than under highly-corpus-tuned cv1.

> The histograms for chi-square look pretty much like all the other
> histograms reported here (big spikes at the ends for the ham and
> spam, several spread lightly (and fairly evenly) over the middle
> ground.
>
> I must say that I like chi-square best out of all the ones I've
> tested, since it has fairly obvious points for the cutoffs (I suspect
> that .05 and .90 are not too far from optimal for just about everyone),
> and it does have a useful middle ground.

I agree on all counts.

> (The false positives I get from it are fairly hopeless cases:
> FDIC informing customers that NextBank died, a contractor's bid
> containing only an encoded .pdf,

That one surprises me: assuming we threw the body away unlooked-at (we ignore MIME sections that aren't of text/* type), it's hard to get enough other clues to force a spam score so high. If possible, I'd like to see the list of clues (the "prob('word') = 0.432" thingies in the main output file, assuming you have show_false_positives enabled).

> info requests wrt getting a new mortgage. The false negatives are a
> bunch of particularly chatty spams, and one or two with empty bodies.
> Again, fairly hopeless.)

Long chatty spam has been pretty reliably scoring near 0.5 for me, which has been a real advantage of chi combining. So again I'd really like to see the list of clues.
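[For readers skimming the thread: "chi combining" turns the per-word probabilities into two chi-squared tail probabilities, one measuring how hammy the evidence is and one how spammy, then averages them so that strong-but-conflicting evidence cancels toward 0.5. That is why long chatty spam tends to land in the middle ground. A sketch along these lines, assuming word probabilities already clamped away from 0 and 1; this mirrors the idea, not necessarily the exact classifier code:]

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1].

    S is high when the clues look uniformly spammy, H when they look
    uniformly hammy; mutually cancelling evidence drives the score
    toward 0.5 instead of toward either extreme."""
    n = len(probs)
    # Probabilities must already be clamped away from 0 and 1.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

[A mixed clue list such as `[0.99]*5 + [0.01]*5` comes out almost exactly at 0.5, while a uniformly spammy list scores near 1.0 and a uniformly hammy one near 0.0.]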
> I'll be testing the zcombining shortly. I look forward to it. Note that, as above, that's another middle-ground scheme, so only the histogram analysis will be truly interesting. From bkc@murkworks.com Mon Oct 14 21:12:48 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 14 Oct 2002 16:12:48 -0400 Subject: [Spambayes] Tokenizer output text range, high bits Message-ID: <3DAAECFE.18531.1BF2CA3@localhost> I thought I'd read in the list that the tokenizer doesn't return chars with the "high bit" set, just creates a new token indicating that. So, when going through the classifier wordlist keys, I don't expect to see any keys with chars where ord(c) & 0x80 != 0 however, I am finding some. Also, finding chars whose ord() < 32. I'm not so worried about the later (as long as there aren't any nuls), but somewhat concerned about the high-bit. Unicode? I don't want to deal with that just now.. :-( Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Mon Oct 14 21:12:38 2002 From: rob@hooft.net (Rob Hooft) Date: Mon, 14 Oct 2002 22:12:38 +0200 Subject: [Spambayes] Total cost analysis References: Message-ID: <3DAB2536.8000601@hooft.net> Tim Peters wrote: > CAUTION: For the attached histogram pair, cvcost sez: > > tcap.txt: Optimal cost is $10.0 with grey zone between 89.0 and 97.0 > > but the new histogram analysis says: > > -> best cost $0.80 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at 24 cutoff pairs > -> smallest ham & spam cutoffs 0.855 & 0.995 > -> fp 0; fn 0; unsure ham 1; unsure spam 3 > -> fp rate 0%; fn rate 0%; unsure rate 2% > -> largest ham & spam cutoffs 0.97 & 0.995 > -> fp 0; fn 0; unsure ham 1; unsure spam 3 > -> fp rate 0%; fn rate 0%; unsure rate 2% > > and eyeballing the histograms shows that the latter is correct. 
I don't > know why cvcost.py thinks $10.00 is the best that can be done; I suspect > it's because it's skipping some cutoff pairs in order to save time. Yep, it only does full percentage points. It is a quick hack that should be done away with now that it is implemented in the histogram analysis. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From popiel@wolfskeep.com Mon Oct 14 21:29:59 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 13:29:59 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 15:35:33 EDT." References: Message-ID: <20021014202959.76B5BF4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >> The best info for cv2 (chi-square): >> >> """ >> -> best cost $48.00 >> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 >> -> achieved at 3 cutoff pairs >> -> smallest ham & spam cutoffs 0.03 & 0.89 >> -> fp 3; fn 6; unsure ham 12; unsure spam 48 >> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5% >> -> largest ham & spam cutoffs 0.03 & 0.9 >> -> fp 3; fn 6; unsure ham 12; unsure spam 48 >> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5% >> """ > >And this seems a lot easier to live with in a world without time machines: >the middle ground spans a huge range of scores, yet contains a lot fewer >msgs than under highly-corpus-tuned cv1. > >> The histograms for chi-square look pretty much like all the other >> histograms reported here (big spikes at the ends for the ham and >> spam, several spread lightly (and fairly evenly) over the middle >> ground. >> >> I must say that I like chi-square best out of all the ones I've >> tested, since it has fairly obvious points for the cutoffs (I suspect >> that .05 and .90 are not too far from optimal for just about everyone), >> and it does have a useful middle ground. > >I agree on all counts. 
>> (The false positives I get from it are fairly hopeless cases:
>> FDIC informing customers that NextBank died, a contractor's bid
>> containing only an encoded .pdf,
>
> That one surprises me: assuming we threw the body away unlooked-at (we
> ignore MIME sections that aren't of text/* type), it's hard to get enough
> other clues to force a spam score so high. If possible, I'd like to see the
> list of clues (the "prob('word') = 0.432" thingies in the main output file,
> assuming you have show_false_positives enabled).

Data/Ham/Set5/2745
prob = 0.685540245196
prob('*H*') = 0.535842
prob('*S*') = 0.906922
prob('content-type:application/pdf') = 0.0918367
prob('filename:fname piece:pdf') = 0.0918367
prob('subject:Electrical') = 0.155172
prob('content-type:text/plain') = 0.389566
prob('header:Received:5') = 0.389918
prob('content-type:multipart/mixed') = 0.737422
prob('content-type:multipart/alternative') = 0.948917
prob(' ') = 0.959269
prob('content-type:text/html') = 0.986282

That's the whole list of probabilities. I did fib slightly: in addition to the bid.pdf, there's a one-space-character message body represented in both plain text and HTML. Effectively null, but the classifier doesn't see it that way. It's that dual-body that's killing it.

>> info requests wrt getting a new mortgage. The false negatives are a
>> bunch of particularly chatty spams, and one or two with empty bodies.
>> Again, fairly hopeless.)
>
> Long chatty spam has been pretty reliably scoring near 0.5 for me, which has
> been a real advantage of chi combining. So again I'd really like to see the
> list of clues.

My error... I was looking at the fn output without paying attention to the listed probs. Since the fn output is based on the single cutoff (set at 0.56), it was getting some of the chatty stuff. The real fns are pretty short, and generally in odd languages or binary.
This one looks like a worm: Data/Spam/Set3/32 prob = 0.000317545970781 prob('*H*') = 0.999926 prob('*S*') = 0.000560844 prob('skip:b 70') = 0.0412844 prob('skip:a 70') = 0.0505618 prob('skip:d 70') = 0.0505618 prob('skip:e 70') = 0.0505618 prob('email name:debian-java-request') = 0.0547407 prob('email addr:lists.debian.org') = 0.0594895 prob('email name:listmaster') = 0.0599834 prob("control: couldn't decode") = 0.0652174 prob('from:email addr:t-online.de>') = 0.0652174 prob('skip:c 70') = 0.0652174 prob('skip:i 70') = 0.0652174 prob('skip:y 70') = 0.0652174 prob('skip:z 70') = 0.0652174 prob('trouble?') = 0.0753369 prob('skip:" 10') = 0.277389 prob('skip:a 20') = 0.295202 prob('content-type:text/plain') = 0.388944 prob('header:Message-Id:1') = 0.6167 prob('email') = 0.787497 prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262 prob('message-id:@lists.debian.org') = 0.844828 prob('skip:5 70') = 0.844828 And again: Data/Spam/Set3/2472 prob = 0.0029549796705 prob('*H*') = 0.999949 prob('*S*') = 0.00585924 prob('header:In-Reply-To:1') = 0.000449595 prob('skip:s 70') = 0.0412844 prob('skip:d 70') = 0.0505618 prob('skip:o 70') = 0.0505618 prob('skip:t 70') = 0.0505618 prob("control: couldn't decode") = 0.0652174 prob('skip:c 70') = 0.0652174 prob('skip:i 70') = 0.0652174 prob('skip:l 70') = 0.0652174 prob('skip:z 70') = 0.0652174 prob('from:email addr:mail.com>') = 0.23545 prob('charset:us-ascii') = 0.317057 prob('skip:n 30') = 0.355072 prob('content-type:text/plain') = 0.388944 prob('header:Message-Id:1') = 0.6167 prob('content-disposition:inline') = 0.661659 prob('content-type:multipart/mixed') = 0.696645 prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619 This one actually wasn't too long and chatty, but it seemed to hit a bunch of good words, and was half in french: Data/Spam/Set6/2011 prob = 0.00173950022128 prob('*H*') = 0.99774 prob('*S*') = 0.00121919 prob('forum') = 0.0121951 prob('url:be') = 0.0302013 prob('email 
name:debian-java-request') = 0.0341451 prob('email addr:lists.debian.org') = 0.0441114 prob('email name:listmaster') = 0.044487 prob('trouble?') = 0.0604856 prob('des') = 0.0652174 prob('cross') = 0.117486 prob('avec') = 0.155172 prob('est') = 0.155172 prob('firmwares') = 0.155172 prob('progress,') = 0.155172 prob('toute') = 0.155172 prob('...') = 0.180314 prob('occasionally') = 0.184814 prob('still') = 0.237895 prob('but') = 0.249098 prob('skip:" 10') = 0.278104 prob('site') = 0.295343 prob('already') = 0.301798 prob('charset:us-ascii') = 0.308681 prob('after') = 0.341657 prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036 prob('content-type:text/plain') = 0.390599 prob('header:Reply-To:1') = 0.60073 prob('from') = 0.604083 prob('subject:.') = 0.605015 prob('available') = 0.637633 prob('header:Mime-Version:1') = 0.646706 prob('email') = 0.785132 prob('please') = 0.83219 prob('subject:skip:W 10') = 0.908163 prob('url:') = 0.936848 I don't know what happened to the other fn < 0.03. 
Close, but not quite, is a nigerian spam (!!!): Data/Spam/Set7/352 prob = 0.0344593026264 prob('*H*') = 0.999908 prob('*S*') = 0.0688269 prob('indeed') = 0.00556242 prob('aim') = 0.012894 prob('(my') = 0.0145631 prob('manner') = 0.0180723 prob('wrote') = 0.0211545 prob('reminder') = 0.0238095 prob('nigerian') = 0.0266272 prob('december') = 0.0266272 prob('so.') = 0.0281933 prob('okay') = 0.0302013 prob('although') = 0.0350768 prob('numbered') = 0.0412844 prob('ratio') = 0.0446266 prob('opposed') = 0.0481336 prob('apparently,') = 0.0505618 prob('revert') = 0.0505618 prob('officer') = 0.0505618 prob('subsequently') = 0.0505618 prob('patience') = 0.0505618 prob('however') = 0.0524146 prob('overcome') = 0.0599022 prob('fixed') = 0.0617239 prob('infer') = 0.0652174 prob('presumed') = 0.0652174 prob('filename:fname piece:txt') = 0.0652174 prob('therefore') = 0.0838752 prob('attempts') = 0.0874263 prob('expert,') = 0.0918367 prob('calendar') = 0.0918367 prob('travelling') = 0.0918367 prob('nigeria.') = 0.0918367 prob('apparently') = 0.0929593 prob('forwarding') = 0.106987 prob('saw') = 0.107116 prob('thus') = 0.110275 prob('did') = 0.112618 prob('concern') = 0.114396 prob('especially') = 0.125537 prob('finally,') = 0.126719 prob('shall') = 0.135258 prob('worked') = 0.138554 prob('point') = 0.154593 prob('totaling') = 0.155172 prob('proposition') = 0.155172 prob('6th') = 0.155172 prob('actively') = 0.165428 prob('since') = 0.166612 prob('knows') = 0.169148 prob('which') = 0.172635 prob('necessary') = 0.182854 prob('source') = 0.183395 prob('routine') = 0.189922 prob('driven') = 0.205305 prob('got') = 0.206143 prob('reality') = 0.206601 prob('light') = 0.207284 prob('skip:h 20') = 0.211375 prob('some') = 0.214937 prob('there') = 0.219934 prob('same') = 0.227242 prob('still') = 0.238027 prob('but') = 0.254404 prob('according') = 0.254563 prob('very') = 0.256327 prob('skip:m 10') = 0.258633 prob('stand') = 0.260226 prob('died') = 0.263314 prob('branch') = 0.263314 
prob('zero') = 0.26593 prob('number') = 0.267526 prob('them') = 0.274205 prob('large') = 0.27431 prob('his') = 0.276565 prob('transaction') = 0.281659 prob('consultant') = 0.283198 prob('reason') = 0.288324 prob('dead') = 0.288434 prob('trace') = 0.29021 prob('mr.') = 0.292388 prob('part') = 0.294772 prob('when') = 0.297739 prob('ask') = 0.299886 prob('already') = 0.299963 prob('listing') = 0.310964 prob('given') = 0.311411 prob('down') = 0.311983 prob('charset:us-ascii') = 0.312457 prob('being') = 0.312739 prob('federal') = 0.695627 prob('president') = 0.697044 prob('safely') = 0.700267 prob('notification') = 0.700364 prob('information') = 0.703131 prob('skip:r 10') = 0.706302 prob('inform') = 0.707612 prob('brought') = 0.70783 prob('your') = 0.710937 prob('complete') = 0.711206 prob('content-type:application/octet-stream') = 0.718341 prob('country.') = 0.718341 prob('immediately') = 0.727163 prob('further') = 0.728674 prob('obtained') = 0.732221 prob('risk') = 0.747156 prob('content-type:multipart/mixed') = 0.751609 prob('contract') = 0.754669 prob('informed') = 0.75788 prob('business') = 0.761283 prob('internet') = 0.768097 prob('phone') = 0.774467 prob('questions') = 0.795045 prob('money,') = 0.796192 prob('bank') = 0.801151 prob('succeed') = 0.805677 prob('settled') = 0.810078 prob('month') = 0.811997 prob('claim') = 0.812913 prob('confidential') = 0.815186 prob('money.') = 0.8156 prob('our') = 0.820323 prob('please') = 0.828641 prob('months,') = 0.829218 prob('fund') = 0.83557 prob('national') = 0.835796 prob('sent') = 0.837147 prob('blood') = 0.843797 prob('asked,') = 0.844828 prob('treasury') = 0.844828 prob('address') = 0.860353 prob('reply') = 0.864689 prob('achieving') = 0.87037 prob('money') = 0.878353 prob('70%') = 0.880818 prob('million') = 0.885051 prob('corporation') = 0.891198 prob('free') = 0.90477 prob('approval') = 0.904949 prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163 prob('modalities') = 0.908163 prob('employment') = 
0.912574 prob('claim.') = 0.915225 prob('skip:y 10') = 0.922406 prob('deposit') = 0.929253 prob('wish') = 0.930416 prob('credit') = 0.941699 prob('valued') = 0.950726 prob('guaranteed') = 0.956906 prob('honored') = 0.958716 prob('message-id:@ucsu.colorado.edu') = 0.965116 prob('conservative') = 0.983271 All you folks _talking_ about the nigerian spams has turned them into ham for me! ;-) - Alex From tim.one@comcast.net Mon Oct 14 21:33:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 16:33:10 -0400 Subject: [Spambayes] Comparing chi to zcombine In-Reply-To: <3DAACFCE.25807.14D268A@localhost> Message-ID: [Brad Clements] > ... > -> best cutoff for all runs: 0.985 > -> with weighted total 10*25 fp + 152 fn = 402 > -> fp rate 0.192% fn rate 1.17% > saving ham histogram pickle to class_hamhist.pik > saving spam histogram pickle to class_spamhist.pik Note that a single cutoff value doesn't make sense for the "middle ground" methods. Since you ran this, I checked in changes to histogram analysis that compute "best" ham *and* spam cutoff points, where best minimizes a function with three distinct costs (cost of an FP, cost of an FN, cost of an "unsure" msg). You set those costs to what makes sense for your application (e.g., as I've said many times, *I'd* rather get an fp than an fn for my own use, as I'm going to review every rejection anyway, and I just want to shuffle spam out of my main inbox so it doesn't interfere with normal workflow; I may be unique in that, though). 
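[The three-cost cutoff search Tim describes can be sketched as follows. This is a minimal illustration only -- the function name and the brute-force search over observed scores are assumptions, not the actual histogram-analysis code.]

```python
def best_cutoffs(ham_scores, spam_scores,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Search (ham_cutoff, spam_cutoff) pairs for the pair minimizing
    fp_cost*fp + fn_cost*fn + unsure_cost*unsures.  Ham scoring at or
    above the spam cutoff is an fp, spam scoring below the ham cutoff
    is an fn, and everything in between is "unsure"."""
    candidates = sorted(set(ham_scores) | set(spam_scores) | {0.0, 1.0})
    best_cost, best_pair = float("inf"), None
    for i, ham_cut in enumerate(candidates):
        for spam_cut in candidates[i:]:
            fp = sum(1 for s in ham_scores if s >= spam_cut)
            fn = sum(1 for s in spam_scores if s < ham_cut)
            unsure = sum(1 for s in ham_scores + spam_scores
                         if ham_cut <= s < spam_cut)
            cost = fp_cost * fp + fn_cost * fn + unsure_cost * unsure
            if cost < best_cost:
                best_cost, best_pair = cost, (ham_cut, spam_cut)
    return best_cost, best_pair
```

[Trying only the observed scores as cutoff candidates is what distinguishes this from cvcost.py's quick hack, which stepped in full percentage points and so could miss a better pair between grid points.]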
I was able to run that analysis over the z-combining histogram you included here, but it's impossible to guess what it would have said for your chi-combining run:

-> best cost for Brad z-combining run: $301.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.6 & 0.995
-> fp 23; fn 21; unsure ham 67; unsure spam 185
-> fp rate 0.177%; fn rate 0.162%; unsure rate 0.969%

.995 is the highest bucket there was, so it couldn't draw any finer distinction among the 23 ham in the .995 bucket. Boosting nbuckets would allow a more exact analysis. OTOH, those fp are scoring so high they may be hopeless. On the third hand,

-> Spam scores for all runs: 13000 items; mean 99.76; sdev 3.66
-> min 0; median 100; max 100

*at least* half your spam scored 100 under z-combining (because the median spam score was 100), so there may well be a useful distinction remaining to be drawn within the .995 bucket.

From tim.one@comcast.net Mon Oct 14 21:45:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 16:45:18 -0400
Subject: [Spambayes] Tokenizer output text range, high bits
In-Reply-To: <3DAAECFE.18531.1BF2CA3@localhost>
Message-ID:

[Brad Clements]
> I thought I'd read in the list that the tokenizer doesn't return
> chars with the "high bit" set, just creates a new token indicating that.

"Words" are currently obtained via simple-minded split-on-whitespace after converting to lowercase. If a word has fewer than 3 chars, it's ignored. If it has between 3 and 12 chars inclusive, it's taken as-is (unconditionally -- it doesn't matter if it's all \xff or all NUL bytes or anything in-between). If it has more than 12 chars, then a "skip" metatoken is generated, *and* if it has any high-bit chars, an "8bit%" metatoken is also generated.

> So, when going through the classifier wordlist keys, I don't
> expect to see any keys with chars where ord(c) & 0x80 != 0
>
> however, I am finding some.
>
> Also, finding chars whose ord() < 32.
See above; all that is expected.

> I'm not so worried about the latter (as long as there aren't any
> nuls),

Why would you care about \x00 bytes? Python doesn't.

> but somewhat concerned about the high-bit. Unicode? I don't want to
> deal with that just now.. :-(

Any number of non-Unicode encoding schemes use high-bit characters. For example, *typical* French, German and Spanish use high-bit characters sparingly. The current scheme should work fine for French and Spanish users. German seems to contain a lot of very long words, though, so I'm less sanguine about that. Some Asian languages don't use whitespace at all, so the s-o-w scheme ends up generating lots of "skip" and "8bit%" tokens for those. I expect the current tokenizer would be pretty much useless for Asian users as a result.

From tim.one@comcast.net Mon Oct 14 22:22:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 17:22:45 -0400
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014202959.76B5BF4D4@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
>>> (The false positives I get from it are fairly hopeless cases:
>>> FDIC informing customers that NextBank died, a contractor's bid
>>> containing only an encoded .pdf,

[Tim]
>> That one surprises me: assuming we threw the body away unlooked-at (we
>> ignore MIME sections that aren't of text/* type), it's hard to get
>> enough other clues to force a spam score so high. If possible, I'd
>> like to see the list of clues (the "prob('word') = 0.432" thingies in
>> the main output file, assuming you have show_false_positives enabled).

[Alex]
> Data/Ham/Set5/2745
> prob = 0.685540245196

How did this end up getting counted as an FP? A score of 0.69 was very solidly in your middle ground.
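[Aside: the word rules Tim describes in the tokenizer thread above can be sketched roughly as below. This is only an illustration matching the shape of tokens like 'skip:b 70' seen in the clue lists here, not the actual spambayes tokenizer; the exact metatoken spellings and length bucketing are assumptions.]

```python
def tokenize_word(word):
    # Sketch of the rules: words shorter than 3 chars are dropped,
    # 3..12 char words are kept as-is (whatever bytes they contain),
    # and longer words become a "skip" metatoken -- plus an "8bit%"
    # metatoken if any character has the high bit set.
    n = len(word)
    if n < 3:
        return []
    if n <= 12:
        return [word]
    tokens = ["skip:%s %d" % (word[0], n // 10 * 10)]
    if any(ord(c) & 0x80 for c in word):
        tokens.append("8bit%")   # exact spelling of this token is a guess
    return tokens

def tokenize(text):
    # Simple-minded split-on-whitespace after lowercasing.
    return [tok for w in text.lower().split() for tok in tokenize_word(w)]
```

[So an undecoded 70-odd-character base64 line starting with "b" comes out as the single token "skip:b 70" -- which is exactly what the low-spamprob clues above look like.]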
> prob('*H*') = 0.535842
> prob('*S*') = 0.906922
> prob('content-type:application/pdf') = 0.0918367
> prob('filename:fname piece:pdf') = 0.0918367
> prob('subject:Electrical') = 0.155172
> prob('content-type:text/plain') = 0.389566
> prob('header:Received:5') = 0.389918
> prob('content-type:multipart/mixed') = 0.737422
> prob('content-type:multipart/alternative') = 0.948917
> prob(' ') = 0.959269
> prob('content-type:text/html') = 0.986282
>
> That's the whole list of probabilities.

Right, that's what I expected: if we skipped the .pdf attachment, there's very little left, and it's hard for very little to get a killer-strong spam score.

> I did fib slightly: in addition to the bid.pdf, there's a
> one-space-character message body represented in both plain text
> and HTML. Effectively null, but the classifier doesn't see it that
> way. It's that dual-body that's killing it.

As above, this just doesn't *have* a high spam score. I think you must have confused this with some other FP. The tokenizer should probably get rid of " " anyway, but that's a different experiment.

>>> The false negatives are a bunch of particularly chatty spams, and
>>> one or two with empty bodies. Again, fairly hopeless.)

>> Long chatty spam has been pretty reliably scoring near 0.5 for
>> me, which has been a real advantage of chi combining. So again I'd
>> really like to see the list of clues.

> My error... I was looking at the fn output without paying attention
> to the listed probs. Since the fn output is based on the single
> cutoff (set at 0.56),

Ah, that would also explain why the 0.69 msg above was mistaken for an FP rather than a middle-ground msg.

> it was getting some of the chatty stuff. The real fns are pretty
> short, and generally in odd languages or binary.
> > This one looks like a worm: > > Data/Spam/Set3/32 > prob = 0.000317545970781 > prob('*H*') = 0.999926 > prob('*S*') = 0.000560844 > prob('skip:b 70') = 0.0412844 > prob('skip:a 70') = 0.0505618 > prob('skip:d 70') = 0.0505618 > prob('skip:e 70') = 0.0505618 > prob('email name:debian-java-request') = 0.0547407 > prob('email addr:lists.debian.org') = 0.0594895 > prob('email name:listmaster') = 0.0599834 > prob("control: couldn't decode") = 0.0652174 > prob('from:email addr:t-online.de>') = 0.0652174 > prob('skip:c 70') = 0.0652174 > prob('skip:i 70') = 0.0652174 > prob('skip:y 70') = 0.0652174 > prob('skip:z 70') = 0.0652174 An odd thing is that you must have a lot of 'skip:z 70' (etc) tokens in your ham too, else these spamprobs wouldn't be so small. Any idea where they come from? It suggests the tokenizer is giving up on something it should really be picking apart -- but I don't have many of these in my ham, so I'm at a loss to guess where they come from. > prob('trouble?') = 0.0753369 > prob('skip:" 10') = 0.277389 > prob('skip:a 20') = 0.295202 > prob('content-type:text/plain') = 0.388944 > prob('header:Message-Id:1') = 0.6167 > prob('email') = 0.787497 > prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262 > prob('message-id:@lists.debian.org') = 0.844828 > prob('skip:5 70') = 0.844828 > > And again: > > Data/Spam/Set3/2472 > prob = 0.0029549796705 > prob('*H*') = 0.999949 > prob('*S*') = 0.00585924 > prob('header:In-Reply-To:1') = 0.000449595 > prob('skip:s 70') = 0.0412844 > prob('skip:d 70') = 0.0505618 > prob('skip:o 70') = 0.0505618 > prob('skip:t 70') = 0.0505618 > prob("control: couldn't decode") = 0.0652174 > prob('skip:c 70') = 0.0652174 > prob('skip:i 70') = 0.0652174 > prob('skip:l 70') = 0.0652174 > prob('skip:z 70') = 0.0652174 As above, you must have an awful lot of low-spamprob skip tokens in your ham. 
> prob('from:email addr:mail.com>') = 0.23545 > prob('charset:us-ascii') = 0.317057 > prob('skip:n 30') = 0.355072 > prob('content-type:text/plain') = 0.388944 > prob('header:Message-Id:1') = 0.6167 > prob('content-disposition:inline') = 0.661659 > prob('content-type:multipart/mixed') = 0.696645 > prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619 > > > This one actually wasn't too long and chatty, but it seemed > to hit a bunch of good words, and was half in french: You must have more French in your ham, then (else the French words wouldn't have low spamprobs). > Data/Spam/Set6/2011 > prob = 0.00173950022128 > prob('*H*') = 0.99774 > prob('*S*') = 0.00121919 > prob('forum') = 0.0121951 > prob('url:be') = 0.0302013 > prob('email name:debian-java-request') = 0.0341451 > prob('email addr:lists.debian.org') = 0.0441114 > prob('email name:listmaster') = 0.044487 > prob('trouble?') = 0.0604856 > prob('des') = 0.0652174 > prob('cross') = 0.117486 > prob('avec') = 0.155172 > prob('est') = 0.155172 > prob('firmwares') = 0.155172 > prob('progress,') = 0.155172 > prob('toute') = 0.155172 > prob('...') = 0.180314 > prob('occasionally') = 0.184814 > prob('still') = 0.237895 > prob('but') = 0.249098 > prob('skip:" 10') = 0.278104 > prob('site') = 0.295343 > prob('already') = 0.301798 > prob('charset:us-ascii') = 0.308681 > prob('after') = 0.341657 > prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036 > prob('content-type:text/plain') = 0.390599 > prob('header:Reply-To:1') = 0.60073 > prob('from') = 0.604083 > prob('subject:.') = 0.605015 > prob('available') = 0.637633 > prob('header:Mime-Version:1') = 0.646706 > prob('email') = 0.785132 > prob('please') = 0.83219 > prob('subject:skip:W 10') = 0.908163 > prob('url:') = 0.936848 > > I don't know what happened to the other fn < 0.03. 
Close, but not > quite, is a nigerian spam (!!!): > > Data/Spam/Set7/352 > prob = 0.0344593026264 > prob('*H*') = 0.999908 > prob('*S*') = 0.0688269 > prob('indeed') = 0.00556242 > prob('aim') = 0.012894 > prob('(my') = 0.0145631 > prob('manner') = 0.0180723 > prob('wrote') = 0.0211545 > prob('reminder') = 0.0238095 > prob('nigerian') = 0.0266272 You have lot of ham containing "Nigerian"? If so, that may be my fault for talking about my Nigerian-scam FP every chance I get . > prob('december') = 0.0266272 > prob('so.') = 0.0281933 > prob('okay') = 0.0302013 > prob('although') = 0.0350768 > prob('numbered') = 0.0412844 > prob('ratio') = 0.0446266 > prob('opposed') = 0.0481336 > prob('apparently,') = 0.0505618 > prob('revert') = 0.0505618 > prob('officer') = 0.0505618 > prob('subsequently') = 0.0505618 > prob('patience') = 0.0505618 > prob('however') = 0.0524146 > prob('overcome') = 0.0599022 > prob('fixed') = 0.0617239 > prob('infer') = 0.0652174 > prob('presumed') = 0.0652174 > prob('filename:fname piece:txt') = 0.0652174 > prob('therefore') = 0.0838752 > prob('attempts') = 0.0874263 > prob('expert,') = 0.0918367 > prob('calendar') = 0.0918367 > prob('travelling') = 0.0918367 > prob('nigeria.') = 0.0918367 > prob('apparently') = 0.0929593 > prob('forwarding') = 0.106987 > prob('saw') = 0.107116 > prob('thus') = 0.110275 > prob('did') = 0.112618 > prob('concern') = 0.114396 > prob('especially') = 0.125537 > prob('finally,') = 0.126719 > prob('shall') = 0.135258 > prob('worked') = 0.138554 > prob('point') = 0.154593 > prob('totaling') = 0.155172 > prob('proposition') = 0.155172 > prob('6th') = 0.155172 > prob('actively') = 0.165428 > prob('since') = 0.166612 > prob('knows') = 0.169148 > prob('which') = 0.172635 > prob('necessary') = 0.182854 > prob('source') = 0.183395 > prob('routine') = 0.189922 > prob('driven') = 0.205305 > prob('got') = 0.206143 > prob('reality') = 0.206601 > prob('light') = 0.207284 > prob('skip:h 20') = 0.211375 > prob('some') = 0.214937 > 
prob('there') = 0.219934 > prob('same') = 0.227242 > prob('still') = 0.238027 > prob('but') = 0.254404 > prob('according') = 0.254563 > prob('very') = 0.256327 > prob('skip:m 10') = 0.258633 > prob('stand') = 0.260226 > prob('died') = 0.263314 > prob('branch') = 0.263314 > prob('zero') = 0.26593 > prob('number') = 0.267526 > prob('them') = 0.274205 > prob('large') = 0.27431 > prob('his') = 0.276565 > prob('transaction') = 0.281659 > prob('consultant') = 0.283198 > prob('reason') = 0.288324 > prob('dead') = 0.288434 > prob('trace') = 0.29021 > prob('mr.') = 0.292388 > prob('part') = 0.294772 > prob('when') = 0.297739 > prob('ask') = 0.299886 > prob('already') = 0.299963 > prob('listing') = 0.310964 > prob('given') = 0.311411 > prob('down') = 0.311983 > prob('charset:us-ascii') = 0.312457 > prob('being') = 0.312739 > prob('federal') = 0.695627 > prob('president') = 0.697044 > prob('safely') = 0.700267 > prob('notification') = 0.700364 > prob('information') = 0.703131 > prob('skip:r 10') = 0.706302 > prob('inform') = 0.707612 > prob('brought') = 0.70783 > prob('your') = 0.710937 > prob('complete') = 0.711206 > prob('content-type:application/octet-stream') = 0.718341 Eh? A Nigerian scam with an octet-stream attachment?! That's unique! 
> prob('country.') = 0.718341 > prob('immediately') = 0.727163 > prob('further') = 0.728674 > prob('obtained') = 0.732221 > prob('risk') = 0.747156 > prob('content-type:multipart/mixed') = 0.751609 > prob('contract') = 0.754669 > prob('informed') = 0.75788 > prob('business') = 0.761283 > prob('internet') = 0.768097 > prob('phone') = 0.774467 > prob('questions') = 0.795045 > prob('money,') = 0.796192 > prob('bank') = 0.801151 > prob('succeed') = 0.805677 > prob('settled') = 0.810078 > prob('month') = 0.811997 > prob('claim') = 0.812913 > prob('confidential') = 0.815186 > prob('money.') = 0.8156 > prob('our') = 0.820323 > prob('please') = 0.828641 > prob('months,') = 0.829218 > prob('fund') = 0.83557 > prob('national') = 0.835796 > prob('sent') = 0.837147 > prob('blood') = 0.843797 > prob('asked,') = 0.844828 > prob('treasury') = 0.844828 > prob('address') = 0.860353 > prob('reply') = 0.864689 > prob('achieving') = 0.87037 > prob('money') = 0.878353 > prob('70%') = 0.880818 > prob('million') = 0.885051 > prob('corporation') = 0.891198 > prob('free') = 0.90477 > prob('approval') = 0.904949 > prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163 > prob('modalities') = 0.908163 > prob('employment') = 0.912574 > prob('claim.') = 0.915225 > prob('skip:y 10') = 0.922406 > prob('deposit') = 0.929253 > prob('wish') = 0.930416 > prob('credit') = 0.941699 > prob('valued') = 0.950726 > prob('guaranteed') = 0.956906 > prob('honored') = 0.958716 > prob('message-id:@ucsu.colorado.edu') = 0.965116 > prob('conservative') = 0.983271 > > All you folks _talking_ about the nigerian spams has turned them > into ham for me! ;-) That could be. I hardly ever mention modalities here . From popiel@wolfskeep.com Mon Oct 14 22:36:15 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 14:36:15 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 17:22:45 EDT." 
References: Message-ID: <20021014213616.10BE4F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: > >[Alex] >> Data/Ham/Set5/2745 >> prob = 0.685540245196 > >How did this end up getting counted as an FP? A score of 0.69 was very >solidly in your middle ground. You're right, I'm a twit who can't read. Okay, where did those false positives really go? >An odd thing is that you must have a lot of 'skip:z 70' (etc) tokens in your >ham too, else these spamprobs wouldn't be so small. Any idea where they >come from? It suggests the tokenizer is giving up on something it should >really be picking apart -- but I don't have many of these in my ham, so I'm >at a loss to guess where they come from. I'm not sure offhand, either. I'd have to work to track it down, though... and as mentioned earlier, today is a lazy day. My best guess is a few base64 bits that didn't get decoded properly. >You must have more French in your ham, then (else the French words wouldn't >have low spamprobs). Yes, I do, from you folks talking about French messages... this mailing list is doing a fine job of polluting my corpora with difficult messages. ;-) - Alex From popiel@wolfskeep.com Mon Oct 14 22:53:06 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 14:53:06 -0700 Subject: [Spambayes] z-combining Message-ID: <20021014215307.0D632F4D4@cashew.wolfskeep.com> Well, I did a z-combining run. @whee. It replaces my all-defaults run as cv1. chi-square remains as cv2. 
>From results.txt:

"""
ham mean                      ham sdev
 0.50   0.50    +0.00%         7.05   7.04    -0.14%
 0.26   0.27    +3.85%         3.65   3.71    +1.64%
 0.02   0.04  +100.00%         0.29   0.41   +41.38%
 0.49   0.41   -16.33%         5.44   4.13   -24.08%
 0.38   0.36    -5.26%         5.27   4.84    -8.16%
 1.03   1.01    -1.94%         9.88   9.42    -4.66%
 0.51   0.51    +0.00%         5.56   5.47    -1.62%
 0.09   0.16   +77.78%         1.26   1.94   +53.97%
 0.97   0.95    -2.06%         9.66   9.40    -2.69%
 0.12   0.14   +16.67%         1.73   1.88    +8.67%

ham mean and sdev for all runs
 0.44   0.44    +0.00%         5.90   5.65    -4.24%

spam mean                     spam sdev
98.68  98.42    -0.26%        10.66  10.85    +1.78%
99.31  99.26    -0.05%         5.62   5.56    -1.07%
97.68  97.82    +0.14%        13.94  12.18   -12.63%
98.84  98.85    +0.01%         9.00   8.90    -1.11%
98.54  98.55    +0.01%        11.71   9.65   -17.59%
97.99  98.31    +0.33%        13.48  11.21   -16.84%
96.88  97.25    +0.38%        15.83  13.12   -17.12%
99.34  98.98    -0.36%         4.95   6.15   +24.24%
98.07  98.26    +0.19%        11.74  10.37   -11.67%
99.65  99.01    -0.64%         3.04   5.46   +79.61%

spam mean and sdev for all runs
98.50  98.47    -0.03%        10.81   9.72   -10.08%

ham/spam mean difference: 98.06 98.03 -0.03
"""

z-combining loses vs. chi-square there, with looser sdevs.

Next, we have the best computations for z-combining:

"""
-> best cost $54.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 6 cutoff pairs
-> smallest ham & spam cutoffs 0.01 & 0.985
-> fp 3; fn 13; unsure ham 12; unsure spam 44
-> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
-> largest ham & spam cutoffs 0.035 & 0.985
-> fp 3; fn 13; unsure ham 12; unsure spam 44
-> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
"""

Compare with the one from chi-square:

"""
-> best cost $48.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.03 & 0.89
-> fp 3; fn 6; unsure ham 12; unsure spam 48
-> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
-> largest ham & spam cutoffs 0.03 & 0.9
-> fp 3; fn 6; unsure ham 12; unsure spam 48
-> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
"""

Looks like z-combining has real granularity problems near the top end. Trash it.
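[For reference, the chi-squared combining that wins this comparison can be sketched as below: Fisher-style combining of the per-word spamprobs, in which strong but conflicting evidence lands near 0.5. This is a hedged sketch consistent with the scheme discussed in these threads, not the actual spambayes implementation.]

```python
from math import log, exp

def chi2Q(x2, v):
    # Survival function of the chi-squared distribution with v (even)
    # degrees of freedom: the probability that a chi-squared variate
    # is at least x2.
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # Fisher-style combining: if the per-word spamprobs were uniform,
    # -2*sum(ln p) would be chi-squared with 2n degrees of freedom.
    # H measures evidence of ham, S of spam; when both are strong
    # (long, mixed messages), S - H cancels and the score lands near
    # 0.5 -- the "middle ground".
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

[Unanimous spammy clues drive the score toward 1, unanimous hammy clues toward 0, and strong conflicting clues toward 0.5 -- which is why the chi-square runs above have a usable middle ground while z-combining piles nearly everything into the extreme buckets.]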
- Alex From tim.one@comcast.net Mon Oct 14 22:59:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 17:59:14 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <20021014213616.10BE4F4D4@cashew.wolfskeep.com> Message-ID: [Tim] >> An odd thing is that you must have a lot of 'skip:z 70' (etc) >> tokens in your ham too, else these spamprobs wouldn't be so small. >> Any idea where they come from? [T. Alexander Popiel] > I'm not sure offhand, either. I'd have to work to track it down, > though... and as mentioned earlier, today is a lazy day. My best > guess is a few base64 bits that didn't get decoded properly. I cater to lazy: you had a bunch of them in the very spams you were talking about. What does the source for those look like? I *used* to get a bunch of these before we started stripping uuencoded sections, but that shouldn't be happening anymore -- unless the uuencode-finding regexp is missing a pattern that's common in your data but not in mine. Or unless the message headers are damaged to such an extent that the email package barfs on them (in which case we fall back to the raw body text). Whatever the cause, if it's a systematic problem in your data, it will be for others too. It may be unique to Perl programmers, though . From popiel@wolfskeep.com Mon Oct 14 23:09:02 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 15:09:02 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 17:59:14 EDT." References: Message-ID: <20021014220902.71797F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >[Tim] >>> An odd thing is that you must have a lot of 'skip:z 70' (etc) >>> tokens in your ham too, else these spamprobs wouldn't be so small. >>> Any idea where they come from? > >[T. Alexander Popiel] >> I'm not sure offhand, either. I'd have to work to track it down, >> though... and as mentioned earlier, today is a lazy day. 
My best >> guess is a few base64 bits that didn't get decoded properly. > >I cater to lazy: you had a bunch of them in the very spams you were talking >about. What does the source for those look like? I *used* to get a bunch >of these before we started stripping uuencoded sections, but that shouldn't >be happening anymore -- unless the uuencode-finding regexp is missing a >pattern that's common in your data but not in mine. Or unless the message >headers are damaged to such an extent that the email package barfs on them >(in which case we fall back to the raw body text). It appears to be a systematic error when a mailing list manager appends plain text to what should be a base64 encoded segment. Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME decoder, too! As a sample: """ Return-Path: bounce-debian-java=popiel=wolfskeep.com@lists.debian.org Delivery-Date: Fri, 23 Aug 2002 02:56:21 -0700 Return-Path: Delivered-To: popiel@wolfskeep.com Received: from murphy.debian.org (murphy.debian.org [65.125.64.134]) by cashew.wolfskeep.com (Postfix) with SMTP id 0EAFBF58E for ; Fri, 23 Aug 2002 02:56:21 -0700 (PDT) Received: (qmail 29739 invoked by uid 38); 23 Aug 2002 09:37:09 -0000 X-Envelope-Sender: ybqiwbt@t-online.de Received: (qmail 29162 invoked from network); 23 Aug 2002 09:36:55 -0000 Received: from adsl-065-081-092-098.sip.gsp.bellsouth.net (HELO xpfoncv) (65.81.92.98) by murphy.debian.org with SMTP; 23 Aug 2002 09:36:55 -0000 From: Cagdas Burhansan31 To: Subject: Arþiv hazýr Date: Fri, 23 Aug 2002 10:33:48 -0400 X-Mailer: Microsoft Outlook Express 5.50.4133.2400 Content-Type: text/plain Content-Transfer-Encoding: base64 Message-Id: X-Spam-Status: No, hits=0.0 required=4.7 tests= version=2.01 Resent-Message-ID: Resent-From: debian-java@lists.debian.org X-Mailing-List: archive/latest/2709 X-Loop: debian-java@lists.debian.org List-Post: List-Help: List-Subscribe: List-Unsubscribe: Precedence: list Resent-Sender: debian-java-request@lists.debian.org 
Resent-Date: Fri, 23 Aug 2002 02:56:21 -0700 (PDT) DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k aS5jb20NCg0KDQoNCg0K -- To UNSUBSCRIBE, email to debian-java-request@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org """ >Whatever the cause, if it's a systematic problem in your data, it will be >for others too. It may be unique to Perl programmers, though . Nope. In this case, it's Java programmers. ;-) - Alex From guido@python.org Mon Oct 14 23:13:46 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 14 Oct 2002 18:13:46 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Your message of "Mon, 14 Oct 2002 15:09:02 PDT." <20021014220902.71797F4D4@cashew.wolfskeep.com> References: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: <200210142213.g9EMDsP00965@pcp02138704pcs.reston01.va.comcast.net> > It appears to be a systematic error when a mailing list manager > appends plain text to what should be a base64 encoded segment. > Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME > decoder, too! > > As a sample: > > """ [...] > Content-Type: text/plain > Content-Transfer-Encoding: base64 [...] 
> > DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu > ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg > YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu > IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp > 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u > ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl > cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv > bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp > bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh > a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l > eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k > aS5jb20NCg0KDQoNCg0K > > > -- > To UNSUBSCRIBE, email to debian-java-request@lists.debian.org > with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org > """ Mailman used to do this too; I believe it's finally fixed in Mailman 2.1. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Oct 15 03:52:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 22:52:47 -0400 Subject: [Spambayes] z-combining In-Reply-To: <20021014215307.0D632F4D4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > Well, I did a z-combining run. @whee. It replaces my > all-defaults run as cv1. chi-square remains as cv2. > > From results.txt: [inconsitent effects on means across runs, small and large effects on sdevs, but overall decreases] > ... > z-combining loses vs. chi-square there, with looser sdevs. 
The sdevs actually got smaller overall:

> ham mean and sdev for all runs
> 0.44 0.44 +0.00% 5.90 5.65 -4.24%
>
> spam mean and sdev for all runs
> 98.50 98.47 -0.03% 10.81 9.72 -10.08%

The means are so far apart compared to the sdevs, though, and the concentration at the endpoints so extreme, that random overlap isn't an issue with either scheme -- the mistakes these guys make are more fundamental than random.

> Next, we have the best computations for z-combining:
>
> """
> -> best cost $54.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 6 cutoff pairs
> -> smallest ham & spam cutoffs 0.01 & 0.985
> -> fp 3; fn 13; unsure ham 12; unsure spam 44
> -> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> -> largest ham & spam cutoffs 0.035 & 0.985
> -> fp 3; fn 13; unsure ham 12; unsure spam 44
> -> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> """
>
> Compare with the one from chi-square:
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> -> fp 3; fn 6; unsure ham 12; unsure spam 48
> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> -> fp 3; fn 6; unsure ham 12; unsure spam 48
> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """
>
> Looks like z-combining has real granularity problems near
> the top end. Trash it.

It's indeed not working better for anyone so far, and it does suffer cancellation disease. OTOH, it was a quick hack to get a quick feel for how this *kind* of approach might work, and it didn't go all the way. Gary would like to "rank" the spamprobs first, but that requires another version of "the third training pass" that I just don't know how to make practical over time.
If Rob is feeling particularly adventurous, it would be interesting (in connection with z-combining) to transform the database spamprobs into unit-normalized zscores via his RMS black magic, as an extra step at the end of update_probabilities(). This wouldn't require another pass over the training data, would speed z-combining scoring a lot, and I *think* would make the inputs to this scheme much closer to what Gary would really like them to be (z-combining *pretends* the "extreme-word" spamprobs are normally distributed now; I don't have any idea how close that is to the truth). The attraction of this scheme is that it gives a single "spam probability" directly; combining distinct ham and spam indicators is still a bit of a puzzle (although a happy puzzle from my POV when both indicators suck, as happens in chi combining with large numbers of strong clues on both ends). From tim.one@comcast.net Tue Oct 15 04:03:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:03:08 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel, tracks down a source of his many "skip" tokens] > ... > It appears to be a systematic error when a mailing list manager > appends plain text to what should be a base64 encoded segment. > Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME > decoder, too! > > As a sample: > > """ > [headers] > ... > Content-Type: text/plain > Content-Transfer-Encoding: base64 > ... 
> > DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org

Ouch. That would do it, all right, here in tokenizer.py:

    for part in textparts(msg):
        # Decode, or take it as-is if decoding fails.
        try:
            text = part.get_payload(decode=True)
        except:
            yield "control: couldn't decode"
            text = part.get_payload(decode=False)

The base64 decoder will barf on that kind of msg, but you've got so many of these in your ham that even the "couldn't decode" metatoken is taken as a strong ham clue:

    prob("control: couldn't decode") = 0.0652174

I overlooked that in your msg before. So, Barry, what can we do about this? Filling the database with "skip" tokens from raw base64 is a Bad Idea, and I assume the email pkg doesn't know how to, e.g., "decode base64 up until it can't anymore, and then grab the rest as plain text". Heh -- just writing that made me want to puke. We have to do something better with this, though.
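For illustration, the "collect the base64-looking lines, decode those, and handle the excess separately" idea that comes up later in this thread can be sketched in a few lines. This is a sketch only, not the tokenizer's or the email package's code; the function name and the line-matching regexp are made up for the example, and the regexp is a heuristic (a short all-letter word would also match it):

```python
import base64
import re

# Lines consisting only of base64 alphabet characters, with optional '='
# padding at the end.  Anything else (e.g. a footer a mailing list manager
# appended) is treated as plain text.
B64_LINE = re.compile(r'^[A-Za-z0-9+/]+={0,2}$')

def forgiving_b64_decode(payload):
    """Decode the base64-looking lines of a damaged payload.

    Returns (decoded_bytes, leftover_text).  Lines that look like base64
    are joined and decoded together; everything else is returned as text.
    """
    b64_lines, leftover = [], []
    for line in payload.splitlines():
        stripped = line.strip()
        if stripped and B64_LINE.match(stripped):
            b64_lines.append(stripped)
        else:
            leftover.append(line)
    data = ''.join(b64_lines)
    data += '=' * (-len(data) % 4)   # repair missing padding, if any
    try:
        decoded = base64.b64decode(data)
    except Exception:
        decoded = b''                # still hopeless; give up quietly
    return decoded, '\n'.join(leftover)
```

With a payload like the Debian sample above (valid base64 followed by an appended list footer), the base64 part decodes cleanly and the footer comes back as leftover text instead of poisoning the decode.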
From tim.one@comcast.net Tue Oct 15 04:32:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:32:57 -0400 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: <200210140631.g9E6VWn31907@localhost.localdomain> Message-ID: [Tim, to Rob, on switching from S/(S+H) to (S-H+1)/2] > I was, but more importantly my test data agreed, so I'm going > to switch to this (the evidence is so consistent and solid on both > our datasets that making it an option would supply a pointless > choice -- losers are killed). Good show! [Anthony Baxter] > Here's what my mungo-test set shows for this (before is pre-Rob Hooft's > change, after is current CVS) This would have been a useful result, but, unfortunately, you ran it before the histogram analysis was beefed up to tell us the useful bits. If you still have the final ("all runs") ham and spam histograms from *both* runs in output files, you could post much more useful info by running them thru cvcost.py. With some pain I can run the new histo analysis for you on your "after" run, because you included the full final histograms for that:

-> best cost for Anthony's CVS run: $626.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.85 & 0.995
-> fp 50; fn 75; unsure ham 110; unsure spam 145
-> fp rate 0.143%; fn rate 0.445%; unsure rate 0.493%

It's a peculiar pair of cutoffs, reflecting that you have few low-scoring spam but (relative to others) many high-scoring ham. The analysis is limited by nbuckets=200, as 50 of your ham scored in the highest bucket:

    99.5 50 *

and so there's no way to get rid of more FP at this granularity short of calling everything ham. However, your *median* spam score was 100:

-> Spam scores for all runs: 16848 items; mean 99.75; sdev 3.75
-> min 0.00333927; median 100; max 100

meaning that at least half your spam scored 100, so there may well be useful distinctions still to be drawn if only we could peer inside the 0.995 bucket.
Your data is so nasty I think 200 buckets is too small for you; try 1000 next time? In any case, the idea that these lines were telling useful truths:

> total unique fp went from 261 to 281 lost +7.66%
> total unique fn went from 60 to 53 won -11.67%

is right out. In part, those say two things:

1. spam_cutoff was too low for the "after" run.

2. A single spam_cutoff doesn't make sense for the middle-ground methods: we're trying to *get* you a useful middle ground here, a small number of nasty msgs where we have strong reason to believe many mistakes will live.

From tim.one@comcast.net Tue Oct 15 04:54:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:54:51 -0400 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) In-Reply-To: Message-ID: FYI, I doubled the number of accurate digits in z-combining's probability -> zscore calculations. This made it even more extreme for me -- the median ham score fell to 0 on the nose. The good news is that my lowest-scoring spam's score rose, from 4.09672e-012 to 4.10227e-012. Take *that* to the bank.

-> best cost for all runs: $26.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 114 cutoff pairs
-> smallest ham & spam cutoffs 0.63 & 0.944
-> fp 2; fn 3; unsure ham 10; unsure spam 9
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559%
-> largest ham & spam cutoffs 0.868 & 0.946
-> fp 2; fn 5; unsure ham 2; unsure spam 7
-> fp rate 0.01%; fn rate 0.0357%; unsure rate 0.0265%

That's the first run of any kind I've seen where the minimum cost could be achieved in more than one way. I don't mean that there were 114 cutoff pairs that achieved it (that's normal enough), but that the two specific endpoints shown there make different tradeoffs between FN and unsures. What this doesn't show is that picking cutoffs of 0.05 and 0.95 would have been almost as cheap -- getting *close* to the minimum isn't touchy at all, but getting the absolute minimum is.
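The accounting behind these "best cost" reports is straightforward to reproduce. A rough sketch of a brute-force search over (ham_cutoff, spam_cutoff) pairs, using the $10/$1/$0.20 weights quoted above; the function name and the idea of passing an explicit candidate grid are invented for illustration, and the real cvcost.py may differ:

```python
def best_cost(ham_scores, spam_scores, cutoffs,
              c_fp=10.0, c_fn=1.0, c_unsure=0.2):
    """Search (ham_cutoff, spam_cutoff) pairs for the cheapest policy.

    Scores lie in [0, 1]: below ham_cutoff -> called ham, above
    spam_cutoff -> called spam, in between -> unsure.  Returns
    (cost, ham_cutoff, spam_cutoff) for the cheapest pair found.
    """
    best = None
    for lo in cutoffs:
        for hi in cutoffs:
            if lo > hi:
                continue
            fp = sum(1 for s in ham_scores if s > hi)    # ham called spam
            fn = sum(1 for s in spam_scores if s < lo)   # spam called ham
            unsure = (sum(1 for s in ham_scores if lo <= s <= hi) +
                      sum(1 for s in spam_scores if lo <= s <= hi))
            cost = c_fp * fp + c_fn * fn + c_unsure * unsure
            if best is None or cost < best[0]:
                best = (cost, lo, hi)
    return best
```

This also makes the "achieved at N cutoff pairs" phenomenon easy to see: many distinct (lo, hi) pairs can classify every message identically and so tie for the minimum.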
From barry@wooz.org Tue Oct 15 05:10:06 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 15 Oct 2002 00:10:06 -0400 Subject: [Spambayes] defaults vs. chi-square References: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: <15787.38174.711987.679063@gargle.gargle.HOWL> >> ... It appears to be a systematic error when a mailing list >> manager appends plain text to what should be a base64 encoded >> segment. Bad MLM, no biscuit. This confuses the MIME >> decoder. Bad MIME decoder, too! Known problem with MM2.0. MM2.1 does a better job of adding headers and footers. If it can't do it in a MIME-safe way, it won't do it. TP> So, Barry, what can we do about this? Filling the database TP> with "skip" tokens from raw base64 is a Bad Idea, and I assume TP> the email pkg doesn't know how to, e.g., "decode base64 up TP> until it can't anymore, and then grab the rest as plain text". TP> Heh -- just writing that made me want to puke. We have to do TP> something better with this, though. Upgrade to MM2.1 :) Seriously, when the email package has to decode a base64 payload, it just hands the whole string off to base64.decodestring(). Given that that function isn't very forgiving, I'm not sure what to do. Sucks. -Barry From guido@python.org Tue Oct 15 05:17:52 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 15 Oct 2002 00:17:52 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Your message of "Tue, 15 Oct 2002 00:10:06 EDT." <15787.38174.711987.679063@gargle.gargle.HOWL> References: <20021014220902.71797F4D4@cashew.wolfskeep.com> <15787.38174.711987.679063@gargle.gargle.HOWL> Message-ID: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> > TP> So, Barry, what can we do about this? Filling the database > TP> with "skip" tokens from raw base64 is a Bad Idea, and I assume > TP> the email pkg doesn't know how to, e.g., "decode base64 up > TP> until it can't anymore, and then grab the rest as plain text". 
> TP> Heh -- just writing that made me want to puke. We have to do > TP> something better with this, though. > > Upgrade to MM2.1 :) > > Seriously, when the email package has to decode a base64 payload, it > just hands the whole string off to base64.decodestring(). Given that > that function isn't very forgiving, I'm not sure what to do. Sucks. Split it up in lines first, and collect lines that match a simple regexp to recognize base64. Then feed the collected stuff to base64.decodestring(). If there's non-white excess, deal with that separately. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Oct 15 05:58:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 00:58:28 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > Split it up in lines first, and collect lines that match a simple > regexp to recognize base64. Then feed the collected stuff to > base64.decodestring(). If there's non-white excess, deal with that > separately. I'm trying to shame Barry into doing this, since he sucked me into this project and then vanished . More importantly, if he can be provoked into giving it some real thought, he could do a better job faster than I could. For example, I've just got a Message object at this point, and I don't know beans about whether it's plain, base64-encoded, qp-encoded, or whatever. The email pkg knows, though, and Barry knows how to get it to tell him without even thinking about it. Since most base64 stuff isn't damaged, we need smarter recovery code in the "except:" clause of the snippet I posted. For a start, if it failed to decode base64 stuff, it would likely be better to ignore that part entirely than to run off tokenizing it. It would be much better still to decode it anyway. From popiel@wolfskeep.com Tue Oct 15 06:04:42 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Mon, 14 Oct 2002 22:04:42 -0700 Subject: [Spambayes] z-combining In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 22:52:47 EDT." References: Message-ID: <20021015050442.23597F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> Well, I did a z-combining run. @whee. It replaces my >> all-defaults run as cv1. chi-square remains as cv2. >> z-combining loses vs. chi-square there, with looser sdevs. > >The sdevs actually got smaller overall: Remember, I had z-combining on the left, and chi-square on the right. Just to confuse you. ;-) >The means are so far apart compared to the sdevs, and the extreme >concentration at the endpoints, though, that random overlap isn't an issue >with either scheme -- the mistakes these guys make are more fundamental than >random. Yup. - Alex From barry@wooz.org Tue Oct 15 06:13:38 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 15 Oct 2002 01:13:38 -0400 Subject: [Spambayes] defaults vs. chi-square References: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15787.41986.422263.708035@gargle.gargle.HOWL> >>>>> "TP" == Tim Peters writes: TP> I'm trying to shame Barry into doing this, since he sucked me TP> into this project and then vanished . I think everyone here will agree that was the most productive email, character for character that I ever wrote. It would peg the ham meter. TP> More importantly, if he can be provoked into giving it some TP> real thought, he could do a better job faster than I could. Will do, but tomorrow. I have one more Mailman flame to extinguish tonight and then it's time to drool into my pillow for 6 hours. -Barry From rob@hooft.net Tue Oct 15 14:19:57 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 15 Oct 2002 15:19:57 +0200 Subject: [Spambayes] z-combining References: Message-ID: <3DAC15FD.3020200@hooft.net> This is a multi-part message in MIME format. 
---------------------- multipart/mixed attachment Tim Peters wrote: > If Rob is feeling particularly adventurous, it would be interesting (in > connection with z-combining) to transform the database spamprobs into > unit-normalized zscores via his RMS black magic, as an extra step at the end > of update_probabilities(). This wouldn't require another pass over the > training data, would speed z-combining scoring a lot, and I *think* would > make the inputs to this scheme much closer to what Gary would really like > them to be (z-combining *pretends* the "extreme-word" spamprobs are normally > distributed now; I don't have any idea how close that is to the truth). I'm not exactly sure what you want me to renormalize using my black magic, but I did make an interesting histogram of 250000 single-token spam probabilities... I'm hoping you're not assuming that this is normally distributed, although it looks like that is what you are trying to do when recalculating this into Z-scores. Out of the 250k tokens I put in my histogram, 93k occurred exactly once in the ham corpus of 4500 messages only, and ~75k exactly once in the spam corpus of 4500 messages only... The noise you see at the baseline is words that occur multiple times in both ham and spam; amplified in the second image where all words that occur only once or twice are removed from the histogram. A histogram of words that occur more than 30 times in total is a bit more flat, but still has many >30+0 / 0+>30 extremes. 
My strongest ham clue is "wrote:" (763+0); the second is "het" (533+0) [Dutch for "the" for words without gender, and for "it"]. On the spam side it is "8bit%:100" (0+937) and "charset:ks_c_5601-1987" (0+838). > The > attraction of this scheme is that it gives a single "spam probability" > directly; combining distinct ham and spam indicators is still a bit of a > puzzle (although a happy puzzle from my POV when both indicators suck, as > happens in chi combining with large numbers of strong clues on both ends). I don't see why this scheme could not produce an "H" value as well, and then mix it with the "S" score we're using now. This scheme looks a lot like the "S" half of earlier ones like chi2 combining. Think about what goes wrong if we used only the S half of chi2 combining: messages that look like both ham and spam come out as perfect spam, and messages that look neither like ham nor spam come out as perfect ham. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: probfreq.png Type: image/png Size: 9903 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/probfreq.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: prob2.png Type: image/png Size: 7985 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/prob2.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: balk29.png Type: image/png Size: 5900 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/balk29.png ---------------------- multipart/mixed attachment-- From grobinson@transpose.com Tue Oct 15 14:38:21 2002 From: grobinson@transpose.com (Gary Robinson) Date: Tue, 15 Oct 2002 09:38:21 -0400 Subject: [Spambayes] z Message-ID: > It's indeed not working better for anyone so far, and it does suffer > cancellation disease. OTOH, it was a quick hack to get a quick feel for how > this *kind* of approach might work, and it didn't go all the way. Gary > would like to "rank" the spamprobs first, but that requires another version > of "the third training pass" that I just don't know how to make practical > over time. Actually I think it would be complicated or even impossible to do the way it really *should* be done, because it would have to be structured so that spammy words always had a rank over .5 and hammy words had a rank under .5, while the probability of hitting a spam or a ham under a reasonable null hypothesis is the same. It would get complicated, so I recommend not bothering to try to do it right. I know I don't have time to try to work out a good way to do it now. > If Rob is feeling particularly adventurous, it would be interesting (in > connection with z-combining) to transform the database spamprobs into > unit-normalized zscores via his RMS black magic, as an extra step at the end > of update_probabilities(). This wouldn't require another pass over the > training data, would speed z-combining scoring a lot, and I *think* would > make the inputs to this scheme much closer to what Gary would really like > them to be (z-combining *pretends* the "extreme-word" spamprobs are normally > distributed now; I don't have any idea how close that is to the truth). I didn't realize that this wasn't already being done. 
Yes I would recommend that somebody do this because I don't think we're really testing the z approach completely fairly until it is. I'm not saying I believe that the z approach will turn out to be better -- I just don't know -- but it seems worth trying. Gary --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From rob@hooft.net Tue Oct 15 14:58:29 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 15 Oct 2002 15:58:29 +0200 Subject: [Spambayes] Tokenizing numbers and money Message-ID: <3DAC1F05.7030809@hooft.net> I just scanned through my 250k token list and found that a surprising number of these are numeric or almost numeric. Here is a random part of the list:

prob   nham nspam token
0.1552    1     0 3601.2
0.1552    1     0 3601.5
0.1552    1     0 3603.6
0.1552    1     0 3604.2
0.8448    0     1 3605
0.0918    2     0 3607
0.1552    1     0 3607.2
0.1552    1     0 3613
0.1552    1     0 3617
0.0918    2     0 3618
0.1552    1     0 3620.
0.8448    0     1 3621
0.1552    1     0 3624.2
0.1552    1     0 3626.5
0.1552    1     0 3627.7
0.1552    1     0 3629
0.1552    1     0 3631
[...]
0.9698    0     7 $65.00
0.8448    0     1 $369.00.
0.9698    0     7 $149.00,
0.9698    0     7 $800,000
0.8448    0     1 $30.00)
0.9587    0     5 $205.00
0.8448    0     1 $.19
0.8448    0     1 $24.00
0.9734    0     8 $800
0.9494    0     4 $37).
0.9587    0     5 $1.70
0.8448    0     1 $50,00
0.8448    0     1 $450.00.
0.9082    0     2 $1,000.00!
0.9494    0     4 $663.90
0.8448    0     1 $30...get
0.8448    0     1 $350,000
0.8448    0     1 $.275,
0.9651    0     6 $185.00
0.1552    1     0 $500,-
0.9651    0     6 $349.95.
0.8448    0     1 $2,000-
[...but also...]
0.9803    0    11 $30.00
0.9938    0    36 $319,210.00
0.9921    0    28 $25.00
0.9884    0    19 $100,000.00
0.9979    0   108 $5,000
0.9002   13   119 $500
0.9755    3   128 $50
0.9843    2   139 $25
[...and...]
0.9921    0    28 $25.00
0.8448    0     1 x5=$25.00.
0.9082    0     2 us$25.00
0.9878    0    18 5=$25.00.
0.9082    0     2 $25.00!
0.9941    0    38 $25.00.
0.9348    0     3 $25.00,

Does anyone believe that "3605" is a real spam clue, and "3607" a real ham clue? 
I think collapsing numbers into a few classes might significantly reduce the size of the database, and actually help the classification. Even though for someone doing fragrances "4711" may be a strong ham clue, I think that on the whole this is just adding noise. How about something like tokens for

num:float    (e.g. 3624.2)
num:int      (e.g. 3629)
num:intpair  (e.g. 439,443)
num:$1       (for amounts between $0.00 and $9.99)
num:$10      (for amounts between $10 and $99.99)
num:$100     (for amounts between $100 and $999.99)
num:$1000    (for amounts between $1k and $10k)
num:$huge    (for amounts >$10k)

Each of these might have "logarithm suffixes"? Is this unrealistic? Currently roughly one in six tokens in my list contains at least 3 digits in a row!

amigo[197]spambayes%% egrep -c ' .*[0-9][0-9][0-9]' balk.dat
44757
amigo[198]spambayes%% wc -l balk.dat
255907 balk.dat

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Tue Oct 15 21:05:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 16:05:33 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: [Tim] >> If Rob is feeling particularly adventurous, it would be interesting (in >> connection with z-combining) to transform the database spamprobs into >> unit-normalized zscores via his RMS black magic, as an extra >> step at the end of update_probabilities(). This wouldn't require another [Gary Robinson] > I didn't realize that this wasn't already being done. It's unclear to me what "this" means. RMS transformations? No, we're not doing those here. > Yes I would recommend that somebody do this because I don't think we're > really testing the z approach completely fairly until it is. 
You tell me whether this is this; this is the code people have been using:

    def z_spamprob(self, wordstream, evidence=False):
        from math import sqrt

        clues = self._getclues(wordstream)
        zsum = 0.0
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            zsum += normIP(prob)

        n = len(clues)
        if n:
            # We've added n zscores from a unit normal distribution.  By the
            # central limit theorem, their mean is normally distributed with
            # mean 0 and sdev 1/sqrt(n).  So the zscore of zsum/n is
            # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
            prob = normP(zsum / sqrt(n))
        else:
            prob = 0.5

normIP() maps a probability p to the real z such that the area under the unit Gaussian from -inf to z is p. normP() is the inverse, mapping real z to the area under the unit Gaussian from -inf to z. Example:

    >>> normIP(.9)
    1.2815502653713151
    >>> normP(_)
    0.8999997718215671
    >>> normIP(.1)
    -1.2815502653713149
    >>> normP(_)
    0.10000022817843296
    >>>

normP() is accurate to about 14 decimal digits; normIP() is accurate to about 6 decimal digits. The word "prob" values here are your f(w). > I'm not saying I believe that the z approach will turn out to be > better -- I just don't know -- but it seems worth trying. Happy to try, but really don't know how to proceed. There seems no reason to believe that the f(w) values lead to normIP() values that are *in fact* unit-normal distributed on a random collection of words, and I don't actually see a reason to believe that this would get closer to being true if the f(w) were ranked first. If we can define precisely what we mean by "a random collection of words", the idea that the resulting normIP() values are or aren't unit-normal distributed seems easily testable, though. From grobinson@transpose.com Tue Oct 15 21:50:43 2002 From: grobinson@transpose.com (Gary Robinson) Date: Tue, 15 Oct 2002 16:50:43 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: Urgh. Sorry. 
I am so totally swamped with work that I am only quickly looking in sometimes and I think I got a wrong impression before. Based on what you say in the message quoted below, I think you're already doing what I was hoping for, with the exception of the ranking part! I guess I was confused by the earlier message... And I also agree that it doesn't make sense to try ranking now because there are aspects to this data that mean it won't come out to a uniform distribution under a reasonable null hypothesis without more tweaking than I (or, I guess, any of us) can suggest a way to do at this point. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Tue, 15 Oct 2002 16:05:33 -0400 > To: Gary Robinson > Cc: SpamBayes > Subject: RE: [Spambayes] z > > [Tim] >>> If Rob is feeling particularly adventurous, it would be interesting (in >>> conncection with z-combining) to transform the database spamprobs into >>> unit-normalized zscores via his RMS black magic, as an extra >>> step at the endof update_probabilities(). This wouldn't require another > > [Gary Robinson] >> I didn't realize that this wasn't already being done. > > It's unclear to me what "this" means. RMS transformations? No, we're not > doing those here. > >> Yes I would recommend that somebody do this because I don't think we're >> really testing the z approach completely fairly until it is. > > You tell me whether this is this ; this is the code people have been > using: > > def z_spamprob(self, wordstream, evidence=False): > from math import sqrt > > clues = self._getclues(wordstream) > zsum = 0.0 > for prob, word, record in clues: > if record is not None: # else wordinfo doesn't know about it > record.killcount += 1 > zsum += normIP(prob) > > n = len(clues) > if n: > # We've added n zscores from a unit normal distribution. 
By the > # central limit theorem, their mean is normally distributed with > # mean 0 and sdev 1/sqrt(n). So the zscore of zsum/n is > # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n). > prob = normP(zsum / sqrt(n)) > else: > prob = 0.5 > > normIP() maps a probability p to the real z such that the area under the > unit Gaussian from -inf to z is p. normP() is the inverse, mapping real z > to the area under the unit Gaussian from -inf to z. Example: > >>>> normIP(.9) > 1.2815502653713151 >>>> normP(_) > 0.8999997718215671 >>>> normIP(.1) > -1.2815502653713149 >>>> normP(_) > 0.10000022817843296 >>>> > > normP() is accurate to about 14 decimal digits; normIP() is accurate to > about 6 decimal digits. > > The word "prob" values here are your f(w). > >> I'm not saying I believe that the z approach will turn out to be >> better -- I just don't know -- but it seems worth trying. > > Happy to try, but really don't know how to proceed. There's seems no reason > to believe that the f(w) values lead to normIP() values that are *in fact* > unit-normal distributed on a random collection of words, and I don't > actually see a reason to believe that this would get closer to being true if > the f(w) were ranked first. > > If we can define precisely what we mean by "a random collection of words", > the idea that the resulting normIP() values are or aren't unit-normal > distributed seems easily testable, though. > From tim.one@comcast.net Tue Oct 15 22:40:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 17:40:46 -0400 Subject: [Spambayes] Tokenizing numbers and money In-Reply-To: <3DAC1F05.7030809@hooft.net> Message-ID: [Rob Hooft] > I just scanned through my 250k token list and found that a surprising > number of these are numeric or almost numeric. Here is a random part of > the list: > > prob nham nspam token > 0.1552 1 0 3601.2 > 0.1552 1 0 3601.5 > 0.1552 1 0 3603.6 > ... > 0.9698 0 7 $65.00 > 0.8448 0 1 $369.00. > 0.9698 0 7 $149.00, > ... 
> [...but also...] > 0.9803 0 11 $30.00 > 0.9938 0 36 $319,210.00 > 0.9921 0 28 $25.00 > 0.9884 0 19 $100,000.00 > 0.9979 0 108 $5,000 > 0.9002 13 119 $500 > 0.9755 3 128 $50 > 0.9843 2 139 $25 > [...and...] > 0.9921 0 28 $25.00 > 0.8448 0 1 x5=$25.00. > 0.9082 0 2 us$25.00 > 0.9878 0 18 5=$25.00. > 0.9082 0 2 $25.00! > 0.9941 0 38 $25.00. > 0.9348 0 3 $25.00, > > Does anyone believe that "3605" is a real spam clue, and "3607" a real > ham clue? The question may be more whether hapaxes (unique occurrences) in general are useful clues. About half of all words are unique across all kinds of computer indices, and I expect this app will have more than most (since email has lots of artificial decorations). > I think collapsing numbers into a few classes might significantly reduce > the size of the database, Nuking hapaxes in general would probably cut it by more than half. What's special about numbers in this? > and actually help the classification. That I don't know, but there's reason to question it. We do know that each time it's been tried, fiddling the value of robinson_probability_s has had a real effect on results, and that reducing it from 1 has always helped. The effect of reducing it is to give more extreme spamprobs to rare words, so we already know that the treatment of rare words is important (or was important, in the schemes under which that experiment was tried). I don't know how numbers specifically fit into that. > Even though for someone doing fragrances "4711" may be a strong ham > clue, I think that on the whole this is just adding noise. You can try it, although it fights the "stupid beats smart" meta-rule. It's easy to think of examples in the other direction too. For example, I get an electronic order receipt with an order number, and a few days later get a shipping confirmation referencing the same number. If I trained on the order receipt between times, that "senseless number" is certainly going to help the shipping confirmation score low.
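For reference, robinson_probability_s enters through Gary Robinson's smoothing formula f(w) = (s*x + n*p(w)) / (s + n); a minimal sketch, assuming the spambayes defaults s = 0.45 and x = 0.5 (the default values are an assumption here, not quoted from the thread):

```python
# Sketch of the smoothing behind robinson_probability_s / _x, assuming
# Robinson's formula f(w) = (s*x + n*p(w)) / (s + n) with defaults
# s = 0.45 and x = 0.5 (both values are assumptions of this sketch).
def smoothed_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    # Raw spam ratio for the word, counts normalized by corpus sizes.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Shrink toward the prior x with strength s; a smaller s gives rare
    # words more extreme spamprobs, which is the effect Tim describes.
    return (s * x + n * p) / (s + n)

# A one-spam hapax lands near 5/6 and a one-ham hapax near 1/6 --
# compare the 0.8448 and 0.1552 entries in Rob's token table.
print(round(smoothed_spamprob(0, 1, 4500, 4500), 4))  # 0.8448
print(round(smoothed_spamprob(1, 0, 4500, 4500), 4))  # 0.1552
```

With these values a hapax can never score more extreme than about 0.845 / 0.155, which is exactly the clamping visible in Rob's table.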
> How about something like tokens for > > num:float (e.g. 3624.2) > num:int (e.g. 3629) > num:intpair (e.g. 439,443) > num:$1 (for amounts between $0.00 and $9.99) > num:$10 (for amounts between $10 and $99.99) > num:$100 (for amounts between $100 and $999.99) > num:$1000 (for amounts between $1k and $10k) > num:$huge (for amounts >$10k) > > Each of these might have "logarithm suffixes"? Is this unrealistic? It's realistic to try it, but more expensive than the tokenization we do now (we do nothing at all for "words" of under 13 chars now except determine their length; the split-on-whitespace business goes at C speed). > Currently roughly one in six tokens in my list contains at least 3 > digits in a row! > > amigo[197]spambayes%% egrep -c ' .*[0-9][0-9][0-9]' balk.dat > 44757 > amigo[198]spambayes%% wc -l balk.dat > 255907 balk.dat I believe that, but it doesn't suggest anything to me other than that a sixth of your tokens contain at least 3 digits in a row -- how many contain at least 3 letters in a row ? From rbodkin@statalabs.com Tue Oct 15 23:01:31 2002 From: rbodkin@statalabs.com (Ron Bodkin) Date: Tue, 15 Oct 2002 15:01:31 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client Message-ID: <200210152206.g9FM6KO27967@host12.webserver1010.com> This is a multipart message ---------------------- multipart/mixed attachment Hi all, I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. We are developing an innovative new email client, NewMonix. Like any email product, it needs to deal with spam effectively. We've been following the spambayes project with interest, and are impressed with the quality of discussion and the development going on. We're looking for a contractor to integrate spam filtering into our email client. I contacted Tim and he suggested that I post to the list. 
Our product has a Java back-end with a qt front-end, so we'd prefer to use those technologies rather than adding Python to the mix. I've attached a description of what we're looking for in a contract. For those interested in applying, please respond to Teresa Stancato (tstancato@statalabs.com). If you'd like to comment or have questions for the list, please cc my email address since I only read the archives occasionally. Thank you, Ron ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: Software Engineer-Spam.doc Type: application/doc Size: 34705 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/47110785/SoftwareEngineer-Spam.bin ---------------------- multipart/mixed attachment-- From popiel@wolfskeep.com Wed Oct 16 00:27:35 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 15 Oct 2002 16:27:35 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client In-Reply-To: Message from "Ron Bodkin" of "Tue, 15 Oct 2002 15:01:31 PDT." <200210152206.g9FM6KO27967@host12.webserver1010.com> References: <200210152206.g9FM6KO27967@host12.webserver1010.com> Message-ID: <20021015232735.2B5ADF590@cashew.wolfskeep.com> In message: <200210152206.g9FM6KO27967@host12.webserver1010.com> "Ron Bodkin" writes: > >I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. [...] >We're looking for a contractor to integrate spam filtering The answer to this is probably of general interest, so I'll ask it publicly: Are you willing to have remote contractors, or do you only want people in the Bay Area? - Alex (several hundred miles to the north, and not moving) From tim.one@comcast.net Wed Oct 16 01:11:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 20:11:01 -0400 Subject: [Spambayes] Slice o' life Message-ID: This is a multi-part message in MIME format. 
---------------------- multipart/mixed attachment In the background, I've been a guinea pig for Sean True's and Mark Hammond's experiments hooking our code up to Outlook 2000. Toward that end, over the last week I've just been shuffling most of the spam I normally get, and the truly *hard* ham, into special folders. By "truly hard ham" I mean assorted HTML newsletters, PayPal announcements, company newsletters in odd formats, order/shipping confirmations, and conference announcements. In all that's 696 spam and 86 truly hard ham so far. Then I added in about 100 "typical" msgs from assorted work sources and friends. This has been my first chance to play with mining the headers for real: """ [Tokenizer] mine_received_headers: True basic_header_tokenize: True [Classifier] use_chi_squared_combining: True """ The performance on my real-life email is nothing short of amazing! The code adds a "Hammie" field to Outlook msgs, and I fiddled my Outlook views to show the new field, and to color msgs with a hammie score > 0.05 bold green. I'll attach a jpeg with a view of the tail end of today's email so far. That view is in chronological order, and the mix of 0.0 and 1.0 is typical. There are 523 pending msgs in my inbox right now that haven't been trained on, and the highest-scoring non-spam is 0.03 (a personal email from someone I didn't train on yet) There's also one with a score of 0.01. All the rest of the non-spam score 0.00 or -0.00 in the display (yes, I should fix that ). All the spam score 1.00. I suppose it helps that one of my email accounts automagically puts "Spam:" at the front of suspected-spam msg Subject lines, but I suspect it wouldn't matter a bit if they didn't. I didn't realize it before, but this stuff is cool ! ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: hammie.jpg Type: image/jpeg Size: 82792 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/6cf680fd/hammie.jpeg ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Wed Oct 16 04:41:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 23:41:48 -0400 Subject: [Spambayes] z-combining In-Reply-To: <3DAC15FD.3020200@hooft.net> Message-ID: [Rob Hooft] > I'm not exactly sure what you want me to renormalize using my black > magic, Me neither -- it's just something to think about if you're feeling particularly adventurous . I can't make a good case for it (more on that below). > but I did make an interesting histogram of 250000 single-token > spam probabilities... I'm hoping you're not assuming that this is > normally distributed, No, and my language was unclear before. I'll try to repair that next. > although it looks like that is what you are trying to do when > recalculating this into Z-scores. Not really. Here's the scoop: the chi- and z-combining schemes are both trying to reject the same hypothesis: the extreme-word probabilities in a msg are random, and uniformly distributed across [0, 1]. Now if you run chi2.py as a main program, it computes histograms showing quite clearly (especially if you boost the # of data points) that the internal chi-squared H and S statistics are uniformly distributed if you feed in vectors of random probabilities chosen uniformly from [0, 1]. S and H are as likely then to take any value as any other, and for each x in [0, 1], S <= x with probability x, and H <= x with probability x (another way of saying S and H are uniformly distributed). If you warp the input probabilities even a little, the histograms clearly react by shifting strongly to one side. You can do the same thing with the statistic computed by z-combining.
Replace chi2.judge() like so: def judge(ps, sqrt=_math.sqrt, normIP=normIP): zsum = 0.0 for p in ps: zsum += normIP(p) n = len(ps) prob = normP(zsum / sqrt(n)) return prob, 0, 0 The last two return values are dummies so you don't have to bother changing the code that calls this. Again the output probabilities are uniformly distributed across [0, 1], if the input probabilities are randomly chosen uniformly from [0, 1] too, and again biasing the input probabilities very clearly moves the output histogram away from a uniform distribution. So both are excellent tests for rejecting the hypothesis in question. A mystery is to what extent our computed spamprobs "act enough" like uniformly distributed random values so that rejecting the hypothesis is a valid and useful and predictably relevant thing to try. I don't know. But it *should* be a debating point for both schemes, not just for the z scheme: if our computed spamprobs don't meet the preconditions for the z-scheme to make sense, they probably fail likewise for the chi-squared scheme. In practice I can't say I see any evidence of that, though: both approaches routinely make extreme judgments with very low error rates, and the specific cases where the z-scheme does worse that I've looked at are adequately explained by z's vulnerability to "cancellation disease". Still, it's possible that one or both schemes would do even better if we found some way to precondition the computed spamprobs to fit the schemes' assumptions better. Ranking is one idea Gary has in mind for that (sorting the spamprobs and reassigning to values uniformly spaced). > Out of the 250k tokens I put in my histogram, 93k occurred exactly once > in the ham corpus of 4500 messages only, and ~75k exactly once in the > spam corpus of 4500 messages only..... Ya, those are the infamous hapaxes, and they consume more than half the database.
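The uniformity property Tim describes is easy to verify numerically. A minimal simulation, substituting statistics.NormalDist for chi2.py's hand-rolled normP/normIP helpers (that substitution is an assumption of this sketch, not the project's code):

```python
# Sanity-check: if the input probabilities are uniform on [0, 1], the
# z-combined output is uniform on [0, 1] too.  statistics.NormalDist
# stands in for chi2.py's normP/normIP approximations.
import random
from math import sqrt
from statistics import NormalDist

_nd = NormalDist()      # unit Gaussian
normIP = _nd.inv_cdf    # probability -> z-score
normP = _nd.cdf         # z-score -> probability

def z_combine(ps):
    # The sum of n unit-normal z-scores, divided by sqrt(n), is again
    # unit-normal; normP maps it back to a probability.
    zsum = sum(normIP(p) for p in ps)
    return normP(zsum / sqrt(len(ps)))

random.seed(42)
scores = sorted(z_combine([random.random() for _ in range(50)])
                for _ in range(2000))
# Under the null hypothesis the empirical quartiles should sit close
# to 0.25, 0.50, and 0.75.
print([round(scores[len(scores) * q // 4], 2) for q in (1, 2, 3)])
```

Skewing the inputs (e.g. random.random() ** 2) visibly shifts the whole score distribution, which is the histogram reaction Tim describes for chi2.py.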
A worthwhile experiment I haven't gotten to is to see what would happen if update_probabilities() purged them from the database. That really can't be done right with incremental learning/unlearning, so it would require one of the slower or harder test driver modes. > The noise you see at the baseline is messages that occur multiple times > in both ham and spam; amplified in the second image where > all words that occur only once or twice are removed from the histogram. > A histogram of words that occur more than 30 times in total is a bit > more flat, but still has many >30+0 / 0+>30 extremes. FYI, the modes are approximately at 1/6 and 5/6 because of the specific values we're using for robinson_probability_s and robinson_probability_x. They act to adjust a 1-message probability guess of 0.0 (word appeared in 1 ham, no spam) up to about 1/6, and a 1-message probability guess of 1.0 (word appeared in one spam, no ham) down to about 5/6. Your histogram would be squashed closer together by raising s, or spread out more by decreasing s. The value of x essentially determines where the median lies (assuming equal #s of ham and spam). > My strongest ham clue is "wrote:" (763+0) second "het" (533+0) [Dutch > for "the" for words without gender and for "it"], at the spam side it is > "8bit%:100" (0+937) and "charset:ks_c_5601-1987" (0+838) How's it doing on Asian spam for you? Those are about the only useful Asian clues we get, but they seem to suffice on my spam. >> The attraction of this [z] scheme is that it gives a single "spam >> probability" directly; combining distinct ham and spam indicators is >> still a bit of a puzzle (although a happy puzzle from my POV when >> both indicators suck, as happens in chi combining with large numbers >> of strong clues on both ends). > I don't see why this schema could not produce a "H" value as well, Exactly how? Subtracting the current z prob from 1 would do it, I guess. > and then mix it with the "S" score we're using now. 
Why would we want to? For example, is there some weakness in chi's current H you've identified? > This schema looks a lot like the "S" half of earlier ones like chi2 > combining. If you play with the chi2.py histogram suggestions above, you'll *see* that chi's S is especially sensitive to high-spamprob words, chi's H is especially sensitive to low-spamprob words, while z's output is equally sensitive to both. Those were all Gary's intended results, and they all work as he expected in these respects. > Think about what goes wrong if we would only use the S half of chi2 > combining: messages that look like both ham and spam come out as > perfect spam, and messages that look neither like ham nor spam come > out as perfect ham. I've actually run that experiment (using only the S part of chi-combining), but not reported on it, except to Gary offline. It did very well overall on my data, but had a systematic weakness akin to one you suggest: a higher false positive rate, due to msgs where a few very strong spam words manage to overpower a larger number of strong ham words, and due to S's greater sensitivity to high-spamprob words. The z-scheme isn't systematically weak in that way: it doesn't favor one kind of clue over the other (low-spamprob words generate negative z, high-spamprob words generate positive z, and the absolute value of the distance of a spamprob from 0.5 determines z's magnitude -- it's wholly symmetric). Its weakness appears to be cancellation disease, where a msg with lots of strong ham and lots of strong spam clues gets an extreme score in the direction of the flavor of clue that appears more often. chi-combining tends to get S and H both near 1 then, and returns 0.5. From seant@iname.com Wed Oct 16 04:52:21 2002 From: seant@iname.com (Sean True) Date: Tue, 15 Oct 2002 23:52:21 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: I think I'm in agreement with Tim. This stuff is wicked cool. 
And a simple regexp filter written in Python was easy to write, and easier to maintain than all those rules in Microsoft's pseudo NL syntax. I train my classifier with the out-of-the-box parameters, and I run Outlook with it turned on all the time. Outlook may not be your mailer of choice, but it has a fine UI for sorting mail. Makes weeding the remnant spam from the mailbox of 4500+ genuine ham much faster. Hats off to Mark for doing the heavy lifting of wiring up a Python addin for Outlook. Before that I was working with a really crappy VBA macro package that almost worked. Mark has been making daily improvements in the UI and the integration. It's COOL stuff. IMHO, and in my daily practice, this stuff is ready for deployment, and deployment inside the MUA makes some sense. The user is the one who knows what spam really is. It's the stuff in the Spam folder! Even if we can provide an efficient server side version for general spam (all that mail from Nigeria), I'm not sure that it's practical (or even wise) to do it all on the server. I've also trained filters to recognize some other mail classifications, and they work quite nicely. Thanks, folks. -- Sean From tim.one@comcast.net Wed Oct 16 05:06:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 00:06:21 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: [Gary Robinson] > ... > Based on what you say in the message quoted below, I think you're > already doing what I was hoping for, with the exception of the ranking > part! Me too . If I didn't mention it before, that code snippet *does* produce uniformly distributed outputs in [0, 1] when fed artificially constructed vectors of uniformly-distributed random probs, so there's nothing wrong with the theory or this implementation of it -- so far as it goes. > I guess I was confused by the earlier message...
> > And I also agree that it doesn't make sense to try ranking now > because there are aspects to this data that mean it won't come out > to a uniform distribution under a reasonable null hypothesis > without more tweaking than I (or, I guess, any of us) can suggest > a way to do at this point. More, I wouldn't see much point to it even if it were dead easy: the chi- and z- schemes are having no problems at all making correct extreme judgments about ham and spam 99+% of the time. The cases where they're prone to mistakes mostly fall in "a middle ground", and staring at many examples strongly suggests they're just freaking hard to classify. It's hard to imagine in what sense ranking (or any other probability preconditioning) could really help here -- the mistakes aren't failures to separate the spaces when a clear separation exists. However, I think it may well be worth pursuing with your *original* scheme, because that one had trouble establishing a clear boundary between ham and spam scores, and creating "a middle ground" for it via two cutoffs ended up capturing many more correctly classified messages than the middle grounds in the chi- and z- schemes (although the z-scheme is so extreme that sometimes the best spam cutoff is over 0.995! that's in part, though, due to the combination of wanting to avoid false positives, and that cancellation disease sometimes gives ham very high z spam scores). From tim.one@comcast.net Wed Oct 16 05:33:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 00:33:01 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: [Tim] > ... > This has been my first chance to play with mining the headers for real: > > """ > [Tokenizer] > mine_received_headers: True > basic_header_tokenize: True > > [Classifier] > use_chi_squared_combining: True > """ And now I note the first systematic weakness: I scored my own "spam" folder, and discovered 5 spam with scores of 0.0. 
They all have one thing in common: they're spam that SpamAssassin didn't catch, and came to me via a python.org mailing list. It turns out that python.org, Mailman, and SpamAssassin put sooooooooo many unique "Hey, I had my fingers in this!" clues in the headers that virtually any message coming thru python.org has a relatively huge collection of killer-strong ham clues (just listing headers containing such clues): Received: from mail.python.org (mail.python.org [12.155.117.29]) ... Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) ... Received: from [168.103.194.76] (helo=wvwrbn) by mail.python.org ... Subject: [Python-Help] Mp3sa hwnf Sender: python-help-admin@python.org To: help@python.org Errors-to: python-help-admin@python.org Precedence: bulk X-BeenThere: python-help@python.org X-warning: 168.103.194.76 in blacklist at list.dsbl.org (http://dsbl.org/listing.php?168.103.194.76) X-Spam-Status: No, hits=3.8 required=5.0 tests=BASE64_ENC_TEXT,CTYPE_JUST_HTML X-Spam-Level: *** X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: Expert volunteers answer Python-related questions This was an HTML msg that appeared to be pushing a Turkish MP3 site. It's not a dead-easy msg to score, but I also got a copy from another email account, and it scored 0.64 there (instead of 0 via python.org). I guess I go back to ignoring various header lines again ... From anthony@interlink.com.au Wed Oct 16 05:36:38 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 16 Oct 2002 14:36:38 +1000 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: <200210160436.g9G4acc09914@localhost.localdomain> >>> Tim Peters wrote > And now I note the first systematic weakness: I scored my own "spam" > folder, and discovered 5 spam with scores of 0.0.
They all have one thing > in common: they're spam that SpamAssassin didn't catch, and came to me via > a python.org mailing list. This is precisely the same problem that I had with my personal mail, and I had to take the same approach - disable the header frobbing. It's really frustrating, because there _are_ a bunch of great clues in there, but there are too many ham-pointing clues as well. I'm thinking about trying something which only looks at, say, the two oldest received lines or some such - but not today... -- Anthony Baxter It's never too late to have a happy childhood. From rob@hooft.net Wed Oct 16 06:03:48 2002 From: rob@hooft.net (Rob Hooft) Date: Wed, 16 Oct 2002 07:03:48 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DACF334.3040701@hooft.net> Tim Peters wrote: > I believe that, but it doesn't suggest anything to me other than that a > sixth of your tokens contain at least 3 digits in a row -- how many contain > at least 3 letters in a row ? Roughly two thirds. I may try to tokenize the numbers. Many numbers are not hapaxes, but I've seen ham significantly harmed by numbers that happened to be spam clues. I have customers that send me their log files full of numbers! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From popiel@wolfskeep.com Wed Oct 16 07:08:47 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 15 Oct 2002 23:08:47 -0700 Subject: [Spambayes] Making Tester and TestDriver unsure Message-ID: <20021016060847.D7753F590@cashew.wolfskeep.com> I thought it would be interesting to bring the middle ground into the Tester and TestDriver, in preparation for new comparators (cmp.py and table.py) which grok the middle ground. Only so much I can do in one night, though. Have patch.
- Alex Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.48 diff -c -r1.48 Options.py *** Options.py 14 Oct 2002 17:13:47 -0000 1.48 --- Options.py 16 Oct 2002 06:08:15 -0000 *************** *** 107,112 **** --- 107,119 ---- # to work best on some data. spam_cutoff: 0.560 + # A message is considered ham iff it scores less than or equal to + # ham_cutoff. For a binary classifier, make ham_cutoff == spam_cutoff. + # If ham_cutoff < spam_cutoff, you get a classifier with a middle + # ground of unsurety. If ham_cutoff > spam_cutoff, results will + # be strange in ways that have not been fully thought out. + ham_cutoff: 0.560 + # Number of buckets in histograms. nbuckets: 200 show_histograms: True *************** *** 146,151 **** --- 153,159 ---- show_false_positives: True show_false_negatives: False + show_unsure: False # Near the end of Driver.test(), you can get a listing of the 'best # discriminators' in the words from the training sets. 
These are the *************** *** 311,322 **** --- 319,332 ---- 'show_spam_hi': float_cracker, 'show_false_positives': boolean_cracker, 'show_false_negatives': boolean_cracker, + 'show_unsure': boolean_cracker, 'show_histograms': boolean_cracker, 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, 'save_histogram_pickles': boolean_cracker, 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, + 'ham_cutoff': float_cracker, 'spam_cutoff': float_cracker, 'spam_directories': string_cracker, 'ham_directories': string_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.23 diff -c -r1.23 TestDriver.py *** TestDriver.py 14 Oct 2002 18:04:56 -0000 1.23 --- TestDriver.py 16 Oct 2002 06:08:16 -0000 *************** *** 128,133 **** --- 128,134 ---- def __init__(self): self.falsepos = Set() self.falseneg = Set() + self.unsure = Set() self.global_ham_hist = Hist() self.global_spam_hist = Hist() self.ntimes_finishtest_called = 0 *************** *** 186,191 **** --- 187,197 ---- def alldone(self): if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + + print "-> cost for all runs: $%.2f" % ( + len(self.falsepos) * options.best_cutoff_fp_weight + + len(self.falseneg) * options.best_cutoff_fn_weight + + len(self.unsure) * options.best_cutoff_unsure_weight) if options.save_histogram_pickles: for f, h in (('ham', self.global_ham_hist), *************** *** 229,234 **** --- 235,246 ---- print "-> false positive %:", t.false_positive_rate() print "-> false negative %:", t.false_negative_rate() + print "-> unsure %:", t.unsure_rate() + print "-> cost: $%.2f" % ( + t.nham_wrong * options.best_cutoff_fp_weight + + t.nspam_wrong * options.best_cutoff_fn_weight + + (t.nham_unsure + t.nspam_unsure) * + options.best_cutoff_unsure_weight) newfpos = Set(t.false_positives()) - self.falsepos 
self.falsepos |= newfpos *************** *** 250,255 **** --- 262,279 ---- if not options.show_false_negatives: newfneg = () for e in newfneg: + print '*' * 78 + prob, clues = c.spamprob(e, True) + printmsg(e, prob, clues) + + newunsure = Set(t.unsures()) - self.unsure + self.unsure |= newunsure + print "-> %d new unsure" % len(newunsure) + if newunsure: + print " new unsure:", [e.tag for e in newunsure] + if not options.show_unsure: + newunsure = () + for e in newunsure: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.5 diff -c -r1.5 Tester.py *** Tester.py 27 Sep 2002 21:18:18 -0000 1.5 --- Tester.py 16 Oct 2002 06:08:16 -0000 *************** *** 35,46 **** --- 35,49 ---- # The number of test instances correctly and incorrectly classified. self.nham_right = 0 self.nham_wrong = 0 + self.nham_unsure = 0; self.nspam_right = 0 self.nspam_wrong = 0 + self.nspam_unsure = 0; # Lists of bad predictions. self.ham_wrong_examples = [] # False positives: ham called spam. self.spam_wrong_examples = [] # False negatives: spam called ham. + self.unsure_examples = [] # Train the classifier on streams of ham and spam. Updates probabilities # before returning, and resets test results. *************** *** 85,108 **** if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! correct = is_spam_guessed == is_spam if is_spam: self.nspam_tested += 1 ! if correct: self.nspam_right += 1 ! else: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) else: self.nham_tested += 1 ! if correct: self.nham_right += 1 ! else: self.nham_wrong += 1 self.ham_wrong_examples.append(example) ! assert self.nham_right + self.nham_wrong == self.nham_tested ! 
assert self.nspam_right + self.nspam_wrong == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" --- 88,119 ---- if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! is_ham_guessed = prob <= options.ham_cutoff if is_spam: self.nspam_tested += 1 ! if is_spam_guessed: self.nspam_right += 1 ! elif is_ham_guessed: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) + else: + self.nspam_unsure += 1 + self.unsure_examples.append(example) else: self.nham_tested += 1 ! if is_ham_guessed: self.nham_right += 1 ! elif is_spam_guessed: self.nham_wrong += 1 self.ham_wrong_examples.append(example) + else: + self.nham_unsure += 1 + self.unsure_examples.append(example) ! assert self.nham_right + self.nham_wrong + self.nham_unsure \ ! == self.nham_tested ! assert self.nspam_right + self.nspam_wrong + self.nspam_unsure \ ! == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" *************** *** 112,123 **** --- 123,140 ---- """Percentage of spam mistakenly identified as ham, in 0.0..100.0.""" return self.nspam_wrong * 1e2 / self.nspam_tested + def unsure_rate(self): + return (self.nham_unsure + self.nspam_unsure) * 1e2 \ + / (self.nham_tested + self.nspam_tested) + def false_positives(self): return self.ham_wrong_examples def false_negatives(self): return self.spam_wrong_examples + def unsures(self): + return self.unsure_examples class _Example: def __init__(self, name, words): From rob@hooft.net Wed Oct 16 11:51:55 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 16 Oct 2002 12:51:55 +0200 Subject: [Spambayes] Slice o' life References: Message-ID: <3DAD44CB.103@hooft.net> Tim Peters wrote: > It turns out that python.org, Mailman, and SpamAssassin, put sooooooooo many > unique "Hey, I had my fingers this!" 
clues in the headers that virtually any > message coming thru python.org has a relatively huge collection of > killer-strong ham clues (just listing headers containing such clues): Correlations, correlations, correlations. It all boils down to correlations. Not the fact that there are correlations, but that they are very, very different from one clue to the next. All these mailman clues are correlated. And by not downweighting them, we're blinding the procedure to the other clues that do not come by the dozens... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From chk@pobox.com Wed Oct 16 16:13:03 2002 From: chk@pobox.com (Harald Koch) Date: Wed, 16 Oct 2002 11:13:03 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: Your message of "Wed, 16 Oct 2002 00:33:01 -0400". References: Message-ID: <9329.1034781183@elisabeth.cfrq.net> > And now I note the first systematic weakness: I scored my own "spam" > folder, and discovered 5 spam with scores of 0.0. They all have one thing > in common: they're spam that SpamAssassin didn't catch, and came to me via > a python.org mailing list. This is why I don't usually bother spam-filtering my email lists. I don't get much spam that way to begin with; most of my email lists have their own spam filters in place already. In the olden days, filtering lists resulted in too many fps; now it confuses the classifier. -- Harald Koch "It takes a child to raze a village." -Michael T. Fry From tim.one@comcast.net Wed Oct 16 19:49:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 14:49:03 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: <3DAD44CB.103@hooft.net> Message-ID: [Rob W.W. Hooft] > Correlations, correlations, correlations. It all boils down to > correlations. Not the fact that there are correlations, but that they > are very, very different from one clue to the next. All these mailman > clues are correlated. 
And by not downweighting them, we're blinding the > procedure to the other clues that do not come by the dozens... It's not even that they're Mailman clues, though, it's more that python.org specifically already has strong anti-spam and anti-virus measures in place. That's how these "Mailman clues" earned their very low spamprobs to begin with -- it's not that Mailman is stopping spam, it's that virtually all the Mailman lists I'm on go through python.org. So when python.org screws up, there's little anything can do on the user's end, short of ignoring python.org clues as evidence. I don't know how to automate that in a no-brainer cross-user way (and, no, I still don't think 200K x 200K matrix analysis is tractable for this ). So far as python.org goes, I expect it will eventually use the code developed here, and its false negative rate should go down then (I haven't yet seen a spam approved by python.org that *this* code scores low when the python.org header clues are ignored). From tim.one@comcast.net Wed Oct 16 19:55:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 14:55:53 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: <9329.1034781183@elisabeth.cfrq.net> Message-ID: [Tim] > I scored my own "spam" folder, and discovered 5 spam with scores of 0.0. They > all have one thing in common: they're spam that SpamAssassin didn't catch, and > came to me via a python.org mailing list. [Harald Koch] > This is why I don't usually bother spam-filtering my email lists. I > don't get much spam that way to begin with; most of my email lists have > their own spam filters in place already. In the olden days, filtering > lists resulted in too many fps; now it confuses the classifier. Unclear. I retrained my home-mail classifier to go back to the "ignore most header lines" defaults, and these low-scoring spam scored high again. Regular list traffic continued to score low, presumably because it had genuine hammish content. 
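Rather than ignoring header lines wholesale, Anthony's idea from earlier in the thread — mining only the two oldest Received lines — could be prototyped with the standard email module. oldest_received() is a hypothetical helper sketched here, not spambayes code:

```python
# Sketch of mining only the oldest couple of Received headers (the
# hops closest to the sender), skipping the trusted local hops whose
# fingerprints act as killer-strong ham clues.  Hypothetical helper,
# not part of the spambayes tokenizer.
import email

def oldest_received(raw_message, n=2):
    msg = email.message_from_string(raw_message)
    # Each relay prepends its Received header, so the headers appear
    # newest-first; the oldest hops sit at the end of the list.
    return msg.get_all('Received', [])[-n:]

raw = (
    "Received: from mail.python.org (mail.python.org [12.155.117.29])\n"
    "Received: from localhost.localdomain ([127.0.0.1])\n"
    "Received: from [168.103.194.76] (helo=wvwrbn) by mail.python.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
for value in oldest_received(raw):
    print(value)
```

The oldest hop is the one the spammer can least disguise, so tokens mined from it should correlate with the sender rather than with the list server in between.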
What suffered some was personal email, which is sometimes very brief, and where sucking up header clues about who sent it is a real help. Some solicited commercial email also suffered. The system was still highly accurate, although this is not a controlled experiment, and the database has only one week of non-random email (so I won't draw any conclusions based on this). From rob@hooft.net Wed Oct 16 20:40:45 2002 From: rob@hooft.net (Rob Hooft) Date: Wed, 16 Oct 2002 21:40:45 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DADC0BD.7050806@hooft.net> Tim Peters wrote: > You can try it, although it fights the "stupid beats smart" meta-rule. It's > easy to think of examples in the other direction too. For example, I get an > electronic order receipt with an order number, and a few days later get a > shipping confirmation referencing the same number. If I trained on the > order receipt between times, that "senseless number" is certainly going to > help the shipping confirmation score low.
>
>>How about something like tokens for
>>
>>  num:float    (e.g. 3624.2)
>>  num:int      (e.g. 3629)
>>  num:intpair  (e.g. 439,443)
>>  num:$1       (for amounts between $0.00 and $9.99)
>>  num:$10      (for amounts between $10 and $99.99)
>>  num:$100     (for amounts between $100 and $999.99)
>>  num:$1000    (for amounts between $1k and $10k)
>>  num:$huge    (for amounts >$10k)
>>
>>Each of these might have "logarithm suffixes"? Is this unrealistic?
>
> It's realistic to try it, but more expensive than the tokenization we do now > (we do nothing at all for "words" of under 13 chars now except determine > their length; the split-on-whitespace business goes at C speed). More expensive, but I didn't notice it yet. First results: It doesn't make a difference.
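Rob's actual tokenizer change isn't shown in the message, but from the token names in the results that follow (num:int4, num:signfloat6, num:money5, ...) the bucketing appears to be a shape classifier plus a digit-length suffix. Here is a hypothetical sketch — the regexes and shape names are guesses reconstructed from the result table, not the code Rob ran:

```python
import re

# Guessed shapes, ordered so more specific patterns win.  The length
# suffix (e.g. num:int4 for "3629") is inferred from the result table.
_shapes = [
    (re.compile(r'^\$[\d,]+\.\d\d$'), 'money'),      # $1,234.56
    (re.compile(r'^\$[\d,]+$'),       'money'),      # $1234
    (re.compile(r'^[-+]\d+\.\d+$'),   'signfloat'),  # -3.14
    (re.compile(r'^\d+\.\d+$'),       'float'),      # 3624.2
    (re.compile(r'^[-+]\d+$'),        'signint'),    # +439
    (re.compile(r'^\d+,\d+$'),        'intpair'),    # 439,443
    (re.compile(r'^\d+$'),            'int'),        # 3629
]

def numeric_token(word):
    """Map a numeric-looking word to a num:<shape><length> token, else None."""
    for pat, shape in _shapes:
        if pat.match(word):
            return 'num:%s%d' % (shape, len(word))
    return None
```

The point of collapsing "3629" and "4711" into the same num:int4 token is exactly the trade-off discussed above: it shrinks the vocabulary (fewer one-off "senseless numbers"), at the cost of erasing any ham clue a specific number might carry.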
cv5: original code
cv8: with "num:XXX" tokens for simple numerics

amigo[109]spambayes%% grep -A1 'all runs' cv5.txt
-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min -1.22125e-13; median 1.3603e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85483e-09; median 100; max 100

amigo[110]spambayes%% grep -A1 'all runs' cv8.txt
-> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> min -1.44329e-13; median 2.66842e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> min 7.69111e-09; median 100; max 100

cv8 now has the following tokens:

  prob nham nspam token
0.0082   27     0 num:float8
0.0088   25     0 num:signfloat6
0.0122   18     0 num:signfloat5
0.0137   16     0 num:signfloat4
0.0138  657     9 num:signint3
0.0167   13     0 num:signfloat7
0.0197   11     0 num:int12
0.0266    8     0 num:float10
0.0266    8     0 num:signfloat8
0.0302    7     0 num:signfloat9
0.0302    7     0 num:signint6
0.0413    5     0 num:signfloat10
0.0506    4     0 num:signfloat11
0.0868  265    25 num:signint5
0.0911   12     1 num:float9
0.1539  111    20 num:int10
0.1552    1     0 num:expfloat10
0.1552    1     0 num:float12
0.1552    1     0 num:signint9
0.1566   71    13 num:float7
0.1654   11     2 num:signint4
0.2085  255    67 num:int7
0.2248    4     1 num:float11
0.2656   64    23 num:int9
0.2935  164    68 num:float5
0.3138  431   197 num:float3
0.3196  194    91 num:float4
0.3550  151    83 num:int6
0.3596 1900  1067 num:int4
0.4041   65    44 num:int8
0.4255 1218   902 num:int3
0.4369  687   533 num:int5
0.4399   65    51 num:float6
0.4471    5     4 num:signint11
0.7432    4    12 num:int11
0.7752    1     4 num:signint8
0.8133  127   554 num:intpair
0.8356    1     6 num:signint10
0.9082    0     2 num:money12
0.9180    9   103 num:money5
0.9383   24   368 num:money4
0.9587    0     5 num:exclmoney12
0.9587    0     5 num:money9
0.9599    2    53 num:fracmoney9
0.9700   13   428 num:money3
0.9730    1    44 num:money10
0.9734    0     8 num:fracmoney4
0.9762    0     9 num:exclmoney4
0.9785    0    10 num:exclmoney11
0.9785    0    10 num:exclmoney9
0.9788    4   195 num:money8
0.9794    1    58 num:exclmoney8
0.9796    4   203 num:money6
0.9833    0    13 num:money11
0.9863    0    16 num:exclmoney5
0.9900    4   417 num:fracmoney6
0.9904    0    23 num:exclmoney10
0.9912    2   249 num:fracmoney7
0.9920    2   274 num:fracmoney5
0.9933    0    33 num:fracmoney8
0.9937    0    35 num:exclmoney6
0.9954    1   262 num:money7
0.9956    0    51 num:exclmoney7
0.9974    0    86 num:fracmoney11
0.9975    0    89 num:fracmoney10

Dead end? Or is the reduction in number of tokens significant? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From gward@python.net Wed Oct 16 21:10:09 2002 From: gward@python.net (Greg Ward) Date: Wed, 16 Oct 2002 16:10:09 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: <9329.1034781183@elisabeth.cfrq.net> References: <9329.1034781183@elisabeth.cfrq.net> Message-ID: <20021016201009.GA6778@cthulhu.gerg.ca> On 16 October 2002, Harald Koch said: > This is why I don't usually bother spam-filtering my email lists. I > don't get much spam that way to begin with; most of my email lists have > their own spam filters in place already. In the olden days, filtering > lists resulted in too many fps; now it confuses the classifier. Depends on the server -- I was surprised to learn that a list I follow fairly closely (optik-users@lists.sourceforge.net) got nothing but spam for most of the summer. I never knew until I looked at the archive, because SA on python.net kept all that spam out of my inbox, even if it's spam from a mailing list. (And yes, I am thinking of moving optik-users to either python.net or python.org...) Greg -- Greg Ward http://www.gerg.ca/ Vote Cthulhu -- why settle for a lesser evil? From popiel@wolfskeep.com Wed Oct 16 22:17:51 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 16 Oct 2002 14:17:51 -0700 Subject: [Spambayes] More modifications to TestDriver Message-ID: <20021016211751.4B11CF49B@cashew.wolfskeep.com> I mangled TestDriver some more to report the fp, fn, and unsure totals at the end, along with the percentages and cost.
This way I can have table.py eat the TestDriver output directly, instead of mediating it through the summary rates.py script... Another patch, from the same base as before. - Alex Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.48 diff -c -r1.48 Options.py *** Options.py 14 Oct 2002 17:13:47 -0000 1.48 --- Options.py 16 Oct 2002 21:17:54 -0000 *************** *** 107,112 **** --- 107,119 ---- # to work best on some data. spam_cutoff: 0.560 + # A message is considered ham iff it scores less than or equal to + # ham_cutoff. For a binary classifier, make ham_cutoff == spam_cutoff. + # If ham_cutoff < spam_cutoff, you get a classifier with a middle + # ground of unsurety. If ham_cutoff > spam_cutoff, results will + # be strange in ways that have not been fully thought out. + ham_cutoff: 0.560 + # Number of buckets in histograms. nbuckets: 200 show_histograms: True *************** *** 146,151 **** --- 153,159 ---- show_false_positives: True show_false_negatives: False + show_unsure: False # Near the end of Driver.test(), you can get a listing of the 'best # discriminators' in the words from the training sets. 
These are the *************** *** 311,322 **** --- 319,332 ---- 'show_spam_hi': float_cracker, 'show_false_positives': boolean_cracker, 'show_false_negatives': boolean_cracker, + 'show_unsure': boolean_cracker, 'show_histograms': boolean_cracker, 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, 'save_histogram_pickles': boolean_cracker, 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, + 'ham_cutoff': float_cracker, 'spam_cutoff': float_cracker, 'spam_directories': string_cracker, 'ham_directories': string_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.23 diff -c -r1.23 TestDriver.py *** TestDriver.py 14 Oct 2002 18:04:56 -0000 1.23 --- TestDriver.py 16 Oct 2002 21:17:54 -0000 *************** *** 128,133 **** --- 128,134 ---- def __init__(self): self.falsepos = Set() self.falseneg = Set() + self.unsure = Set() self.global_ham_hist = Hist() self.global_spam_hist = Hist() self.ntimes_finishtest_called = 0 *************** *** 187,192 **** --- 188,209 ---- if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + nham = self.global_ham_hist.n + nspam = self.global_spam_hist.n + nfp = len(self.falsepos) + nfn = len(self.falseneg) + nun = len(self.unsure) + print "-> all runs false positives:", nfp + print "-> all runs false negatives:", nfn + print "-> all runs unsure:", nun + print "-> all runs false positive %:", (nfp * 1e2 / nham) + print "-> all runs false negative %:", (nfn * 1e2 / nspam) + print "-> all runs unsure %:", (nun * 1e2 / (nham + nspam)) + print "-> all runs cost: $%.2f" % ( + nfp * options.best_cutoff_fp_weight + + nfn * options.best_cutoff_fn_weight + + nun * options.best_cutoff_unsure_weight) + if options.save_histogram_pickles: for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): *************** *** 229,234 **** 
--- 246,257 ---- print "-> false positive %:", t.false_positive_rate() print "-> false negative %:", t.false_negative_rate() + print "-> unsure %:", t.unsure_rate() + print "-> cost: $%.2f" % ( + t.nham_wrong * options.best_cutoff_fp_weight + + t.nspam_wrong * options.best_cutoff_fn_weight + + (t.nham_unsure + t.nspam_unsure) * + options.best_cutoff_unsure_weight) newfpos = Set(t.false_positives()) - self.falsepos self.falsepos |= newfpos *************** *** 250,255 **** --- 273,290 ---- if not options.show_false_negatives: newfneg = () for e in newfneg: + print '*' * 78 + prob, clues = c.spamprob(e, True) + printmsg(e, prob, clues) + + newunsure = Set(t.unsures()) - self.unsure + self.unsure |= newunsure + print "-> %d new unsure" % len(newunsure) + if newunsure: + print " new unsure:", [e.tag for e in newunsure] + if not options.show_unsure: + newunsure = () + for e in newunsure: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.5 diff -c -r1.5 Tester.py *** Tester.py 27 Sep 2002 21:18:18 -0000 1.5 --- Tester.py 16 Oct 2002 21:17:55 -0000 *************** *** 35,46 **** --- 35,49 ---- # The number of test instances correctly and incorrectly classified. self.nham_right = 0 self.nham_wrong = 0 + self.nham_unsure = 0; self.nspam_right = 0 self.nspam_wrong = 0 + self.nspam_unsure = 0; # Lists of bad predictions. self.ham_wrong_examples = [] # False positives: ham called spam. self.spam_wrong_examples = [] # False negatives: spam called ham. + self.unsure_examples = [] # Train the classifier on streams of ham and spam. Updates probabilities # before returning, and resets test results. *************** *** 85,108 **** if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! correct = is_spam_guessed == is_spam if is_spam: self.nspam_tested += 1 ! 
if correct: self.nspam_right += 1 ! else: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) else: self.nham_tested += 1 ! if correct: self.nham_right += 1 ! else: self.nham_wrong += 1 self.ham_wrong_examples.append(example) ! assert self.nham_right + self.nham_wrong == self.nham_tested ! assert self.nspam_right + self.nspam_wrong == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" --- 88,119 ---- if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! is_ham_guessed = prob <= options.ham_cutoff if is_spam: self.nspam_tested += 1 ! if is_spam_guessed: self.nspam_right += 1 ! elif is_ham_guessed: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) + else: + self.nspam_unsure += 1 + self.unsure_examples.append(example) else: self.nham_tested += 1 ! if is_ham_guessed: self.nham_right += 1 ! elif is_spam_guessed: self.nham_wrong += 1 self.ham_wrong_examples.append(example) + else: + self.nham_unsure += 1 + self.unsure_examples.append(example) ! assert self.nham_right + self.nham_wrong + self.nham_unsure \ ! == self.nham_tested ! assert self.nspam_right + self.nspam_wrong + self.nspam_unsure \ ! == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" *************** *** 112,123 **** --- 123,140 ---- """Percentage of spam mistakenly identified as ham, in 0.0..100.0.""" return self.nspam_wrong * 1e2 / self.nspam_tested + def unsure_rate(self): + return (self.nham_unsure + self.nspam_unsure) * 1e2 \ + / (self.nham_tested + self.nspam_tested) + def false_positives(self): return self.ham_wrong_examples def false_negatives(self): return self.spam_wrong_examples + def unsures(self): + return self.unsure_examples class _Example: def __init__(self, name, words): From popiel@wolfskeep.com Wed Oct 16 22:39:22 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Wed, 16 Oct 2002 14:39:22 -0700 Subject: [Spambayes] Ratios and chi-squared Message-ID: <20021016213923.0D317F49B@cashew.wolfskeep.com> I decided to see how chi-squared coped with differing ham:spam ratios in the training data. I'll also be checking the effect of training set size. In any case, here's a preview of the 5 sets (1000 ham & 1000 spam total) run, with ham_cutoff 0.05 and spam_cutoff 0.9...

"""
-> tested 50 hams & 200 spams against 200 hams & 800 spams
[ yadda yadda yadda ]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham:spam:  50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp total:       1       1       2       2       3       3       2
fp %:        0.40    0.27    0.40    0.32    0.40    0.34    0.20
fn total:       2       2       3       2       3       4       6
fn %:        0.20    0.23    0.40    0.32    0.60    1.07    2.40
unsure t:      26      24      25      33      29      26      37
unsure %:    2.08    1.92    2.00    2.64    2.32    2.08    2.96
real cost: $17.20  $16.80  $28.00  $28.60  $38.80  $39.20  $33.40
best cost: $15.60  $15.00  $19.80  $19.20  $27.80  $14.80  $14.60
h mean:      2.59    1.18    0.73    0.44    0.51    0.46    0.35
h sdev:     11.57    7.82    7.00    5.68    6.46    6.01    5.02
s mean:     99.31   98.95   98.32   97.41   96.84   96.10   93.12
s sdev:      7.03    8.50   10.20   12.75   14.43   15.70   19.33
mean diff:  96.72   97.77   97.59   96.97   96.33   95.64   92.77
k:           5.20    5.99    5.67    5.26    4.61    4.41    3.81
"""

The chi-squared combining seems much less sensitive to training set ratios than the default method. (Of course, it could just be the broad and obvious middle ground that's saving it.) I'll see what the rest of the data shows, and then do a real writeup... - Alex From tim.one@comcast.net Wed Oct 16 22:29:54 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 17:29:54 -0400 Subject: [Spambayes] chi-z combining: a worthless scheme In-Reply-To: Message-ID: If one other person thinks this is funny too, it was worth it.
Since the sum of squares of n unit-normal distributed vars follows a chi-squared distribution with n degrees of freedom, here's Yet Another test for rejecting the hypothesis that a vector of probs is uniformly distributed:

    S = 0.0
    for p in ps:
        z = normIP(p)
        S += z*z
    S = chi2Q(S, len(ps))

This works as it should: S is uniformly distributed when the input ps are uniformly distributed. But it combines the advantage of being equally sensitive to high-spamprob and low-spamprob words, with a remarkable disadvantage no other scheme to date has managed to achieve: it gives very low scores to ham *and* to spam, and very high scores to exceedingly bland msgs. Take that, BlandAssassin. From rbodkin@statalabs.com Wed Oct 16 23:59:47 2002 From: rbodkin@statalabs.com (Ron Bodkin) Date: Wed, 16 Oct 2002 15:59:47 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client Message-ID: <200210162259.g9GMxOj08543@host12.webserver1010.com> Hi Alex, While we prefer local contractors, we are open to applications from outstanding candidates who are remote. The office is in Burlingame, California. In answer to other questions we received: the contract is for forty hour work weeks. The software is being developed on Windows using cygwin for build and test scripts. Thanks! Ron ------------Original Message------------- From: "T. Alexander Popiel" To: Ron Bodkin Date: Tue, 15 Oct 2002 16:27:35 -0700 Subject: Re: [Spambayes] Wanted: contractor to work on spam control for innovative email client In message: <200210152206.g9FM6KO27967@host12.webserver1010.com> "Ron Bodkin" writes: > >I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. [...] >We're looking for a contractor to integrate spam filtering The answer to this is probably of general interest, so I'll ask it publicly: Are you willing to have remote contractors, or do you only want people in the Bay Area?
- Alex (several hundred miles to the north, and not moving) From tim.one@comcast.net Thu Oct 17 04:35:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 23:35:16 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes Message-ID: I propose to remove these options and their supporting code: use_central_limit use_central_limit2 use_central_limit3 The point of the 3 central limit schemes was (or, rather, turned out to be) to create a usable middle ground. chi-combining appears to do a better job of that, or at worst at least as good. As a (highly) practical matter, the central limit schemes are unique in requiring "a third training pass", and it's never become clear how to *do* that in an incremental way, short of saving every msg ever trained on and retraining on all whenever a new msg is added to training. So even if they did better, I don't know how to deploy them in real life. Luckily(?), they're not doing better, so that hard choice decision is easy to sidestep. use_z_combining It hasn't done better than chi-combining for anyone, and has done worse for some; it's known to be systematically vulnerable to cancellation disease. This would leave 3 combining schemes, none of which I'm willing to kill off yet: Gary's original scheme use_tim_combining use_chi_combining Note that these three are 100% compatible at the database level: they don't affect *training* at all. The only difference among them is the implementation of Bayes.spamprob() (the scoring function). A trained classifier can use any of these three freely. Indeed, it's possible (no experiments have been done on this) that a "hard" msg for one scheme could benefit via getting scored again by one or both of the others. Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm growing fonder of the non-chi schemes again. 
Rational or not, I find that the more uniform range of outcomes in [0.0, 1.0] is psychologically reassuring when using a UI that throws the scores in your face. If there are no killer objections, I'll remove the 4 schemes in question. From rob@hooft.net Thu Oct 17 05:22:02 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:22:02 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DAE3AEA.4020108@hooft.net> Tim Peters wrote: > That I don't know, but there's reason to question it. We do know that each > time it's been tried, fiddling the value of robinson_probability_s has had a > real effect on results, and that reducing it from 1 has always helped. The > effect of reducing it is to give more extreme spamprobs to rare words, so we > already know that the treatment of rare words is important (or was > important, in the schemes under which that experiment was tried). I don't > know how numbers specifically fit into that. The problem is that the final scoring has been adapted so thoroughly since those tests, that all of that should be done again. And then it becomes very difficult, because the procedure is so good now that we're all looking with a microscope at all our fp/fn's and anyway, I "agree" (that it "looks" wrong) with the filter in most of my fp/fn cases. 
I did try something:

s=0.25:
-> Ham scores for all runs: 16000 items; mean 0.51; sdev 4.70
-> min -1.33227e-13; median 1.19543e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.10; sdev 5.81
-> min 2.89463e-09; median 100; max 100

s=0.45:
-> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> min -1.44329e-13; median 2.66842e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> min 7.69111e-09; median 100; max 100

s=0.75:
-> Ham scores for all runs: 16000 items; mean 0.73; sdev 5.43
-> min -1.11022e-13; median 9.83325e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 98.95; sdev 5.68
-> min 3.83111e-05; median 100; max 100

And:

s=0.25:
-> best cost for all runs: $109.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.93
-> fp 6; fn 13; unsure ham 43; unsure spam 140
-> fp rate 0.0375%; fn rate 0.224%; unsure rate 0.839%
-> largest ham & spam cutoffs 0.49 & 0.93
-> fp 6; fn 14; unsure ham 39; unsure spam 139
-> fp rate 0.0375%; fn rate 0.241%; unsure rate 0.817%

s=0.45:
-> best cost for all runs: $112.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.975
-> fp 3; fn 15; unsure ham 42; unsure spam 295
-> fp rate 0.0187%; fn rate 0.259%; unsure rate 1.55%
-> largest ham & spam cutoffs 0.5 & 0.975
-> fp 3; fn 16; unsure ham 38; unsure spam 294
-> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.52%

s=0.75:
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.505 & 0.95
-> fp 4; fn 13; unsure ham 46; unsure spam 230
-> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%

Don't know what to think about this. Total cost looks fairly insensitive here, but the distribution over the types of cost is different. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 05:42:52 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:42:52 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAE3FCC.4060201@hooft.net> Tim Peters wrote: > I propose to remove these options and their supporting code: > > use_central_limit > use_central_limit2 > use_central_limit3 Go ahead. > use_z_combining I guess that means that no RMS magic can help here. Go ahead. > Note that these three are 100% compatible at the database level: they don't > affect *training* at all. The only difference among them is the > implementation of Bayes.spamprob() (the scoring function). A trained > classifier can use any of these three freely. Indeed, it's possible (no > experiments have been done on this) that a "hard" msg for one scheme could > benefit via getting scored again by one or both of the others. I don't expect a lot from that. You and I at least have repeatedly seen the same fp and fn's across methods. > Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm growing > fonder of the non-chi schemes again. Rational or not, I find that the more > uniform range of outcomes in [0.0, 1.0] is psychologically reassuring when > using a UI that throws the scores in your face. But it is unrealistic. Think about the original problem again: "why can't software that classifies ham/spam be very easy? Almost all spam's scream in your face that they are". With chi_squared combining we found a method that agrees with this. Most messages scream either "Ham" or "Spam", and there is very little left to doubt. You can downscale things a bit by reducing the final S,H-score in chi_squared combining before calling chi2Q. Maybe take the sqrt or something similar. That is actually realistic because of correlations. 
It may shift a few messages along the middle ground, but not have a lot of effect on separating ham and spam except broadening the distribution a bit. Maybe the better answer is that the final UI shouldn't throw the scores in your face. > If there are no killer objections, I'll remove the 4 schemes in question. Did you ever try tim combining with (S-H+1)/2? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 05:55:51 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:55:51 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DAE42D7.6040505@hooft.net> Tim Peters wrote: >>Even though for someone doing fragrances "4711" may be a strong ham >>clue, I think that over the whole this is just adding noise. > > > You can try it, although it fights the "stupid beats smart" meta-rule. Here are some more results: original tokenizer: -> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96 -> min -1.22125e-13; median 1.3603e-11; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86 -> min 6.85483e-09; median 100; max 100 with my "num:" tokens (~10000 different tokens less): -> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00 -> min -1.44329e-13; median 2.66842e-11; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74 -> min 7.69111e-09; median 100; max 100 all words with at least two digits (r'^.*\d.*\d') removed (~45000 different tokens less): -> Ham scores for all runs: 16000 items; mean 0.61; sdev 5.08 -> min -1.11022e-13; median 1.24117e-10; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.05; sdev 5.66 -> min 9.13394e-06; median 100; max 100 conclusion: The more numbers I throw out, the tighter the spam, and the wider the ham (both means go up). BTW: I just realized that the "sdev" in these lines is only determined by the few middle ground messages and the fp/fn's. 
I think this is not a good measure for the tightness of the distributions at all. At the very least we should throw out all points further away than 4 sigma in the calculation of sigma. Better still would be to give numbers like 1% of all ham scores are larger than XXX and 1% of all spam scores are smaller than YYY". Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 17 06:34:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 01:34:38 -0400 Subject: [Spambayes] Tokenizing numbers and money In-Reply-To: <3DAE3AEA.4020108@hooft.net> Message-ID: [Tim] > We do know that each time it's been tried, fiddling the value of > robinson_probability_s has had a real effect on results, and that > reducing it from 1 has always helped. The effect of reducing it > is to give more extreme spamprobs to rare words, so we already > know that the treatment of rare words is important (or was important, > in the schemes under which that experiment was tried). I don't > know how numbers specifically fit into that. [Rob Hooft] > The problem is that the final scoring has been adapted so thoroughly > since those tests, that all of that should be done again. Along with everything else <0.1 wink> -- everything is always open to question here. But I have to point out that the training and scoring code has remained absolutely regular under all schemes since abandoning Graham's original collection of deliberate biases: no special cases, no warts, no tweaks (the *tokenizer* code is a different story). The words with extreme spamprobs have the strongest effects under all schemes, and s controls how quickly or slowly a spamprob can *get* extreme relative to the # of msgs a word has been seen in. In that sense, there's some reason to believe "the best" value for s is more a function of the data than of the combining scheme. 
Make s too small and too much credence is given to accidents; make s too large and the amount of training data needed to get crisp decisions zooms.

> And then it becomes very difficult, because the procedure is so
> good now that we're all looking with a microscope at all our
> fp/fn's and anyway, I "agree" (that it "looks" wrong) with the
> filter in most of my fp/fn cases.

There's something else to vary too: nobody has looked at fiddling max_discriminators under the newer schemes, and from what I see here I think we all leave it at the default 150, which was chosen based on the death-match results pitting Gary's original scheme against Paul's scheme. It could be that max_discriminators should change.

> I did try something:
>
> s=0.25:
> -> Ham scores for all runs: 16000 items; mean 0.51; sdev 4.70
> -> min -1.33227e-13; median 1.19543e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 99.10; sdev 5.81
> -> min 2.89463e-09; median 100; max 100
>
> s=0.45:
> -> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
> -> min -1.44329e-13; median 2.66842e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
> -> min 7.69111e-09; median 100; max 100
>
> s=0.75:
> -> Ham scores for all runs: 16000 items; mean 0.73; sdev 5.43
> -> min -1.11022e-13; median 9.83325e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 98.95; sdev 5.68
> -> min 3.83111e-05; median 100; max 100

That all makes sense, right? The lower s, the more extreme spamprobs get, and the higher s the less extreme. So from top to bottom, ham means and medians increase, spam means and medians decrease (well, that last is invisible for spam at this level of precision: at least half your spam scores above 100, to 6 significant digits, under all variations), and sdevs for all increase.
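For readers outside the project: s enters through Robinson's smoothed word-probability estimate, which pulls rarely seen words toward an assumed prior and lets heavily seen words keep their raw counting estimate. A sketch of the estimator as Tim describes it (the function and variable names here are illustrative, not the classifier's actual API; x is the prior assumed for unknown words):

```python
# Robinson-style smoothed spam probability for one word.
# s controls how fast the estimate can leave the prior x as evidence
# (n, the word's total training count) accumulates; smaller s lets
# rare words reach extreme probabilities sooner.
def adjusted_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    p = spamratio / (hamratio + spamratio)   # raw counting estimate
    n = hamcount + spamcount                 # evidence for this word
    return (s * x + n * p) / (s + n)         # shrink p toward x by s/n
```

A word seen once, only in spam, gets about 0.84 with the default s=0.45 but about 0.95 with s=0.1 — exactly the "more extreme spamprobs for rare words" effect discussed above.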
> And:
>
> s=0.25:
> -> best cost for all runs: $109.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.93
> -> fp 6; fn 13; unsure ham 43; unsure spam 140
> -> fp rate 0.0375%; fn rate 0.224%; unsure rate 0.839%
> -> largest ham & spam cutoffs 0.49 & 0.93
> -> fp 6; fn 14; unsure ham 39; unsure spam 139
> -> fp rate 0.0375%; fn rate 0.241%; unsure rate 0.817%
>
> s=0.45:
> -> best cost for all runs: $112.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.495 & 0.975
> -> fp 3; fn 15; unsure ham 42; unsure spam 295
> -> fp rate 0.0187%; fn rate 0.259%; unsure rate 1.55%
> -> largest ham & spam cutoffs 0.5 & 0.975
> -> fp 3; fn 16; unsure ham 38; unsure spam 294
> -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.52%
>
> s=0.75:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.505 & 0.95
> -> fp 4; fn 13; unsure ham 46; unsure spam 230
> -> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
>
> Don't know what to think about this. Total cost looks fairly
> insensitive here, but the distribution over the types of cost is
> different.

The most interesting thing there may be a coincidence <0.3 wink>: the default s=0.45 was obtained from staring at all the reports that came in during the Graham-vs-Robinson death match (tuning s for your data was part of the task there, although it was called "a" at the time), then picking a default value that appeared to get close to minimizing the fp rate across testers. And s=.45 minimized the fp rate in your results above. With an absolute # of fp so low, though, I'm afraid that just one specific oddball ham can easily warp the conclusions to fit it best.
If I try to mentally discount that, I think the data above suggests most that a higher-than-default value for s is better for some combination of

    your test data
    this combining scheme (which did you use? chi-combining?)
    this value of robinson_minimum_prob_strength (ditto)
    this value of max_discriminators (ditto)

It's not obvious how much training data you used here either, but do note that s=0.45 was picked from 10-fold cv runs with 200 ham and 200 spam in each set (that exact setup was a requirement for participating in the death match). You appear to be using about 10x more ham and something like 2.5x more spam than that, and I think it stands to reason that low s is potentially more helpful the less training data you have (no matter what the value of s, spamprobs *eventually* approach the raw estimates obtained from counting -- if you have a lot of data, the really strong clues remain really strong clues throughout this range of s values). BTW, I was reading a paper on boosting, and one observation struck home: boosting combines many rules in a weighted-average way, where the weights are adjusted iteratively, between passes boosting the "importance" of the examples the previous iteration misclassified. What the author found was that boosting worked better overall if he fiddled it to eventually *stop* paying attention to examples that were persistently and badly misclassified. In effect, trying ever harder to fit the outliers warped the whole scheme in their direction in ever more extreme ways, but almost by definition the outliers didn't fit the scheme at all. Similarly, I believe that some of our persistent fp and fn under this scheme are simply never going to go away, and endless fiddling of parameters to try to make them go away will hurt overall performance in a doomed attempt to redeem them. The combining schemes we've got now are excellent by any measure, and I suspect it's time to leave them alone.
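The ham/unsure/spam middle ground being tuned throughout this thread reduces to two comparisons once a message has a score. A minimal sketch, following the boundary conventions of Alex's Tester.py patch earlier in this digest (spam when prob > spam_cutoff, ham when prob <= ham_cutoff); the default cutoff values below are only illustrative, not project defaults:

```python
# Three-way decision: ham / unsure / spam.
# Boundary conventions match the Tester.py patch above; cutoffs are
# illustrative (Alex's ratio runs used ham_cutoff 0.05, spam_cutoff 0.9).
def classify(prob, ham_cutoff=0.05, spam_cutoff=0.90):
    if prob > spam_cutoff:      # strictly above spam_cutoff -> spam
        return 'spam'
    if prob <= ham_cutoff:      # at or below ham_cutoff -> ham
        return 'ham'
    return 'unsure'             # the middle ground
```

Setting ham_cutoff == spam_cutoff collapses this back to the old binary classifier, which is exactly how the pre-patch behavior is preserved.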
From tim.one@comcast.net Thu Oct 17 07:29:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 02:29:13 -0400 Subject: [Spambayes] Making Tester and TestDriver unsure In-Reply-To: <20021016060847.D7753F590@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> I thought it would be interesting to bring the middle ground
> into the Tester and TestDriver,

Indeed, long overdue. Thank you! I checked in a minor variation of this patch. Everyone, note that there's a new option ham_cutoff, and the meaning of spam_cutoff has changed slightly. Also a new bool option show_unsure. From the new Options.py:

"""
[TestDriver]
...
# spam_cutoff and ham_cutoff are used in Python slice sense:
#     A msg is considered ham if its score is in 0:ham_cutoff
#     A msg is considered unsure if its score is in ham_cutoff:spam_cutoff
#     A msg is considered spam if its score is in spam_cutoff:
#
# So it's unsure iff ham_cutoff <= score < spam_cutoff.
# For a binary classifier, make ham_cutoff == spam_cutoff.
# ham_cutoff > spam_cutoff doesn't make sense.
#
# The defaults are for the all-default Robinson scheme, which makes a
# binary decision with no middle ground. The precise value that works
# best is corpus-dependent, and values into the .600's have been known
# to work best on some data.
ham_cutoff: 0.560
spam_cutoff: 0.560
...
show_unsure: False
"""

I should probably add that 0.05 and 0.95 probably aren't optimal, but may well be close to optimal, if using chi-combining.

> in preparation for new comparators (cmp.py and table.py) which grok
> the middle ground.

Only so much I can do in one night, though. Same here, I'm afraid -- I won't get to your later patch tonight.

From mal@lemburg.com Thu Oct 17 11:22:01 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 12:22:01 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB Message-ID: <3DAE8F49.5080305@lemburg.com> Is anyone interested in trying out mxBeeBase as hammie DB ?
It is pretty fast, portable and seems to work out nicely. Oh yes, and the generated DB files are much smaller than for e.g. hammie with dbm backend. You do need the latest egenix-mx-base-2.1.0b5 installed though. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From mal@lemburg.com Thu Oct 17 12:21:20 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 13:21:20 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE8F49.5080305@lemburg.com> Message-ID: <3DAE9D30.4050801@lemburg.com> M.-A. Lemburg wrote: > Is anyone interested in trying out mxBeeBase as hammie DB ? > > It is pretty fast, portable and seems to work out nicely. > Oh yes, and the generated DB files are much smaller than > for e.g. hammie with dbm backend. > > You do need the latest egenix-mx-base-2.1.0b5 installed > though. Just to put some numbers by the fishes: Teaching hammie 13000 messages from comp.lang.python gives a database size of 23MB (that's data + index). Checking a single message takes 200ms on my Athlon 1200 (this includes Python startup time). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From guido@python.org Thu Oct 17 12:53:38 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 07:53:38 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Your message of "Thu, 17 Oct 2002 06:42:52 +0200." 
<3DAE3FCC.4060201@hooft.net> References: <3DAE3FCC.4060201@hooft.net> Message-ID: <200210171153.g9HBrcN10612@pcp02138704pcs.reston01.va.comcast.net> [Tim] > > Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm > > growing fonder of the non-chi schemes again. Rational or not, I > > find that the more uniform range of outcomes in [0.0, 1.0] is > > psychologically reassuring when using a UI that throws the scores > > in your face. [Rob] > But it is unrealistic. Think about the original problem again: "why > can't software that classifies ham/spam be very easy? Almost all > spam's scream in your face that they are". With chi_squared > combining we found a method that agrees with this. Most messages > scream either "Ham" or "Spam", and there is very little left to > doubt. But in real life there are also plenty of messages that mislead or defy the human screener (if only for a second), and if these still have a significant chance of becoming a f.p. or f.n., it would be appropriate if the score reflected that uncertainty. It may be clear by now that I haven't been following recent discussions much -- but the "all outcomes are extreme" characteristic was what led us to look for an alternative to Graham's scheme, and I've come to appreciate having a gray area. > Maybe the better answer is that the final UI shouldn't throw the > scores in your face. While you're still deciding on how much value you place on f.p. vs. f.n., the score can be very helpful (as long as it has a middle ground). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Thu Oct 17 13:13:25 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 08:13:25 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: Your message of "Thu, 17 Oct 2002 13:21:20 +0200." 
<3DAE9D30.4050801@lemburg.com> References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com> Message-ID: <200210171213.g9HCDPl11730@pcp02138704pcs.reston01.va.comcast.net> > M.-A. Lemburg wrote: > > Is anyone interested in trying out mxBeeBase as hammie DB ? > > > > It is pretty fast, portable and seems to work out nicely. > > Oh yes, and the generated DB files are much smaller than > > for e.g. hammie with dbm backend. > > > > You do need the latest egenix-mx-base-2.1.0b5 installed > > though. > > Just to put some numbers by the fishes: > > Teaching hammie 13000 messages from comp.lang.python > gives a database size of 23MB (that's data + index). > > Checking a single message takes 200ms on my Athlon 1200 > (this includes Python startup time). Can you post or (better!) check in a variant of hammie with this enabled? I'd like to see this! --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Thu Oct 17 13:39:24 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 14:39:24 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com> <200210171213.g9HCDPl11730@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DAEAF7C.2060800@lemburg.com> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Guido van Rossum wrote: >>M.-A. Lemburg wrote: >> >>>Is anyone interested in trying out mxBeeBase as hammie DB ? >>> >>>It is pretty fast, portable and seems to work out nicely. >>>Oh yes, and the generated DB files are much smaller than >>>for e.g. hammie with dbm backend. >>> >>>You do need the latest egenix-mx-base-2.1.0b5 installed >>>though. >> >>Just to put some numbers by the fishes: >> >>Teaching hammie 13000 messages from comp.lang.python >>gives a database size of 23MB (that's data + index). >> >>Checking a single message takes 200ms on my Athlon 1200 >>(this includes Python startup time). 
> > > Can you post or (better!) check in a variant of hammie with this > enabled? I'd like to see this! I'd need checkin rights for that. Here's the drop-in file (I've renamed hammie.py to spambayes.py). The latest beta of egenix-mx-base is here: http://www.egenix.com/files/python/egenix-mx-base-2.1.0b5.tar.gz To install: run "python2.2 setup.py install". -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: spambayes.py Type: text/x-python Size: 13029 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/1f907181/spambayes.py ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 17 13:59:35 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 14:59:35 +0200 Subject: [Fwd: Re: [Spambayes] Proposing to remove 4 combining schemes] Message-ID: <3DAEB437.6050301@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment sorry, I forgot to CC the list on this one. -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: "Rob W. W. Hooft" Subject: Re: [Spambayes] Proposing to remove 4 combining schemes Date: Thu, 17 Oct 2002 14:42:40 +0200 Size: 2123 Url: http://mail.python.org/pipermail-21/spambayes/attachments/20021017/629a7302/SpambayesProposingtoremove4combiningschemes.txt ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 17 14:18:31 2002 From: rob@hooft.net (Rob W. W. 
Hooft) Date: Thu, 17 Oct 2002 15:18:31 +0200 Subject: [Spambayes] 5% points in statistics Message-ID: <3DAEB8A7.6010807@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment

I added 5% and 95% points to the statistics in Histogram.py. The calculation is similar to a "median": a median is the 50% point. This has as effect:

-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min 0; median 1.36141e-11; max 100
-> fivepctlo 0; fivepcthi 0.144228
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85475e-09; median 100; max 100
-> fivepctlo 96.8278; fivepcthi 100

So indeed this reveals new information about the distributions: where "sdev" for ham and spam are very similar, the fivepct{lo,hi} values show that the distributions are NOT the same width. 95% of ham is 20 times tighter than 95% of spam.

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment

Index: Histogram.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Histogram.py,v
retrieving revision 1.5
diff -u -r1.5 Histogram.py
--- Histogram.py 8 Oct 2002 18:13:49 -0000 1.5
+++ Histogram.py 17 Oct 2002 13:13:41 -0000
@@ -28,6 +28,8 @@
     # min        smallest value in collection
     # max        largest value in collection
     # median     midpoint
+    # fivepctlo  five percent of data is lower than this
+    # fivepcthi  five percent of data is higher than this
     # mean
     # var        variance
     # sdev       population standard deviation (sqrt(variance))
@@ -47,6 +49,14 @@
             self.median = data[n // 2]
         else:
             self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
+        xfivepct = 0.05 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepctlo = (data[int(xfivepct)] * (1 - frac) +
+                          data[int(xfivepct)+1] * frac)
+        xfivepct = 0.95 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepcthi = (data[int(xfivepct)] * (1 - frac) +
+                          data[int(xfivepct) + 1] * frac)
         # Compute mean.
         # Add in increasing order of magnitude, to minimize roundoff error.
         if data[0] < 0.0:
@@ -124,6 +134,8 @@
         print "-> min %g; median %g; max %g" % (self.min,
                                                 self.median,
                                                 self.max)
+        print "-> fivepctlo %g; fivepcthi %g" % (self.fivepctlo,
+                                                 self.fivepcthi)
         lo, hi = self.get_lo_hi()
         if lo > hi:
             return

---------------------- multipart/mixed attachment--

From bkc@murkworks.com Thu Oct 17 14:47:53 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 17 Oct 2002 09:47:53 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAE9D30.4050801@lemburg.com> Message-ID: <3DAE86DF.22732.FD1E5F7@localhost> On 17 Oct 2002 at 13:21, M.-A. Lemburg wrote:

> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).
>
> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

What operating system, and how much RAM do you have?

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From rob@hooft.net Thu Oct 17 15:12:06 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 16:12:06 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAEC536.10803@hooft.net> Sean True wrote:

> I hate to try to speak for Joe User (like speaking for the "common man",
> always a red flag), but I _am_ just a user of these scoring schemes. I have
> several hundred messages (commercial email) tucked away in a folder that
> score in the non-chi scheme in the range .4 to .6. That score appears to
> reflect my own real uncertainty about the value of Motley Fool newsletters.
> No snickering, please. A system like chi- looks like a very good choice for
> black and white, upstream discards offers to increase body part size.
>
> But I don't want these messages automatically discarded upstream, I want
> them labelled so that I can deal with them more efficiently.
>
> When I sort this particular folder by spam score, I get MIT club and
> Infoworld newsletters at the beginning (the good end), and the Motley
> Fool and Edgar Online at the other end, with a range of spam score from .2
> to .6. Just right. If I could color them continuously, it would be easy to
> spot the ones I want to read, now. And over time, as I change my definition
> of spam, their position in the list looks like it will vary smoothly -- and
> appropriately.
>
> This may not fit your original mission statement, but mission statements
> often don't survive contact with the enemy, err, customer.

But I agree 100%! Sorting on the spamminess/hamminess is very useful. Coloring on the spamminess/hamminess is very useful. But only in the middle ground folder. And the numeric values as such are useless, that is my humble opinion. Part of my work is to make "clean" user interfaces, and I am allergic to showing things that the user can't do anything with.

I understood the original idea of Tim to be that he wanted to see the spamminess of clearcut spam and the hamminess of clearcut ham. I don't see the point of that, but there would be an easy way to do it: remap the probabilities such that 0->0; hamcutoff->0.33; spamcutoff->0.66; 1->1 using any monotonic increasing function (e.g. three linear segments).

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From mal@lemburg.com Thu Oct 17 15:19:27 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 16:19:27 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE86DF.22732.FD1E5F7@localhost> Message-ID: <3DAEC6EF.6080304@lemburg.com> Brad Clements wrote: > On 17 Oct 2002 at 13:21, M.-A.
Lemburg wrote:

>>Just to put some numbers by the fishes:
>>
>>Teaching hammie 13000 messages from comp.lang.python
>>gives a database size of 23MB (that's data + index).
>>
>>Checking a single message takes 200ms on my Athlon 1200
>>(this includes Python startup time).
>
> What operating system, and how much RAM do you have?

SuSE Linux 8 on 1GB RAM. But why would that matter ? The process size is only 4.8MB.

-- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

From bkc@murkworks.com Thu Oct 17 15:27:32 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 17 Oct 2002 10:27:32 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAEC6EF.6080304@lemburg.com> Message-ID: <3DAE9029.2035.FF630F1@localhost> On 17 Oct 2002 at 16:19, M.-A. Lemburg wrote:

> > What operating system, and how much RAM do you have?
>
> SuSE Linux 8 on 1GB RAM. But why would that matter ? The process
> size is only 4.8MB.

Two thoughts:

1. you ran the test at least once before timing it, so Python and other stuff was probably "still in ram". Not exactly sure how Linux pages things, but on Windows this statement would most likely be true.

2. with less ram, you're more likely to need to throw out something to load Python and stuff (especially on Windows OS).

I just found the "load time" to be extremely low for a typical office worker box. You don't appear to have a typical box. If your box is typical, is your company hiring? ;-)

Note I'm not slighting Python, since the load time is a given no matter what. Just wanted to know how you achieved the low load time.

Regarding the 23 megabytes ... well, to run this on an IMAP server supporting 100 users, that's a lot of disk space.
I realize the context switching from one "user" to the next wouldn't be so bad using a database. If you were using a pickle, argh! Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Thu Oct 17 15:49:15 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 16:49:15 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAECDEB.8030206@hooft.net> Sean True wrote: > The mail folder in question is not a middle ground folder, though. It's a > collection > of mail which I decided to keep at some point, for one reason or another. > And sorting by > the spam score appears to be _very_ useful for managing the contents. Ah, that is an interesting application. But do you really use the numbers, or is the ordering sufficient? > These filters learn by example -- and if the examples are ambiguous and > conflicting, the scores > should reflect that, right? Sure. But even the extreme scoring schemes do that! > I'm mostly lobbying for two things: the importance of MUA based scoring and > filtering, and the > retention of the non-extreme scoring schemes. Whether they are _default_ or > not should be a deployment > decision based on fitness to the task. If the scoring schemes are mutually compatible like the ones Tim proposed to keep, there is no harm in keeping them. But I think that the older schemes are a lot worse in their scoring than the newer ones, so I find it questionable whether they will be useful in any application. If you want a more linear judgement array, then rescaling the numbers produced by the chi2 method to something you can always read in 2 decimal digits might be more useful than a procedure that generates a sub-optimal ordering. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 17 15:58:31 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 10:58:31 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAE3FCC.4060201@hooft.net> Message-ID: [Tim, suggests to remove use_z_combining] >> [Rob Hooft] > I guess that means that no RMS magic can help here. Go ahead. I really don't know, but I don't *see* a way. It's the normIP() results that are assumed to be unit-normal, and that happens iff the input probs are uniformly distributed. But the deviation of the latter from uniformity doesn't have any bad consequence I can detect -- to the contrary, if anything, it seems to make the ham-vs-spam decision easier. [on the 3 remaining schemes] >> Indeed, it's possible (no experiments have been done on this) that >> a "hard" msg for one scheme could benefit via getting scored again >> by one or both of the others. > I don't expect a lot from that. You and I at least have repeatedly seen > the same fp and fn's across methods. The same final decision, yes, but in at least my cases the *relative* scores across schemes are quite different. For example, even my worst FP, which scores nearly 1.0000000000000 under chi-combining, doesn't have a particularly high score under Gary-combining *when compared against* the universe of genuine-spam scores under Gary-combining. The few clues that this FP was posted by a real person count a lot under the latter. Not enough to drag it into ham territory (and nothing ever will do that), and not even enough to drag it into what could be reasonably called a middle ground for Gary-combining, but still below the mean for Gary-combining spam scores. The same is true of my other deadly-bad FP under chi-combining, but even more so.
I expect the same is true of Alex's data, because his first reaction when trying the more-extreme tim-combining (but far less extreme than chi-) was despair over how much *more* extreme his FP got. I assume they score 1.0 under chi-combining. So the idea to try here (which remains untested) would be to broaden chi's middle ground via thinking twice when Gary-combining is much less sure of a msg. This needs precise fleshing out before it can be tested, though. Note that the 3 remaining schemes all compute products of prods and of 1-prods, and the loopy bit doing that is the expensive part of scoring. Getting the 3 final measures out of that is really cheap. [on extreme vs non-extreme] > But it is unrealistic. Think about the original problem again: "why > can't software that classifies ham/spam be very easy? Almost all spam's > scream in your face that they are". With chi_squared combining we found > a method that agrees with this. Most messages scream either "Ham" or > "Spam", and there is very little left to doubt. It could be that the UI would be better off with a "ham", "spam", "unsure" string tag than with decimal digits of precision. > You can downscale things a bit by reducing the final S,H-score in > chi_squared combining before calling chi2Q. Maybe take the sqrt or > something similar. Not really attractive; sqrt would be far too gross a distortion, btw (e.g., it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev 2*sqrt(n)). > ... > Maybe the better answer is that the final UI shouldn't throw the scores > in your face. Possibly. For now it's helpful to me, since I'm a developer and really need a window on the internals. > ... > Did you ever try tim combining with (S-H+1)/2? No, but it would be an excellent idea to try it with the current default combining! 
tim-combining is unique in that its S is especially sensitive to *low*-spamprob words, and its H to high-spamprob words; when something really is spam, tim-combining isn't relying so much on having a high S value as on having a low H value, so that the ratio S/(S+H) approaches 1. Gary-combining is much more like chi-combining in these respects, and chi-combining is where the (S-H+1)/2 reformulation helped. From seant@webreply.com Thu Oct 17 14:25:54 2002 From: seant@webreply.com (Sean True) Date: Thu, 17 Oct 2002 09:25:54 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAEB040.3070302@hooft.net> Message-ID: > > [Tim] > > > >>>Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm > >>>growing fonder of the non-chi schemes again. Rational or not, I > >>>find that the more uniform range of outcomes in [0.0, 1.0] is > >>>psychologically reassuring when using a UI that throws the scores > >>>in your face. > >> > > > > [Rob] > > > >>But it is unrealistic. Think about the original problem again: "why > >>can't software that classifies ham/spam be very easy? Almost all > >>spam's scream in your face that they are". With chi_squared > >>combining we found a method that agrees with this. Most messages > >>scream either "Ham" or "Spam", and there is very little left to > >>doubt. > > > > > > But in real life there are also plenty of messages that mislead or > > defy the human screener (if only for a second), and if these still > > have a significant chance of becoming a f.p. or f.n., it would be > > appropriate if the score reflected that uncertainty. > > But it does: between one and two percent of all messages deviates > significantly from 0.0 and 100.0; those are the ones we as humans take > more than split second to judge. > > > While you're still deciding on how much value you place on > > f.p. vs. f.n., the score can be very helpful (as long as it has a > > middle ground). 
>
> Sure, but for Joe User, this "should" be uninteresting.
>
> Rob

I hate to try to speak for Joe User (like speaking for the "common man", always a red flag), but I _am_ just a user of these scoring schemes. I have several hundred messages (commercial email) tucked away in a folder that score in the non-chi scheme in the range .4 to .6. That score appears to reflect my own real uncertainty about the value of Motley Fool newsletters. No snickering, please. A system like chi- looks like a very good choice for black and white, upstream discards offers to increase body part size.

But I don't want these messages automatically discarded upstream, I want them labelled so that I can deal with them more efficiently.

When I sort this particular folder by spam score, I get MIT club and Infoworld newsletters at the beginning (the good end), and the Motley Fool and Edgar Online at the other end, with a range of spam score from .2 to .6. Just right. If I could color them continuously, it would be easy to spot the ones I want to read, now. And over time, as I change my definition of spam, their position in the list looks like it will vary smoothly -- and appropriately.

This may not fit your original mission statement, but mission statements often don't survive contact with the enemy, err, customer.

-- Sean

From rob@hooft.net Thu Oct 17 16:25:54 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 17:25:54 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAED682.3090905@hooft.net> I wrote about the huge certainties in chi2 combining:

>>You can downscale things a bit by reducing the final S,H-score in
>>chi_squared combining before calling chi2Q. Maybe take the sqrt or
>>something similar.

Tim wrote:
> Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
> it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev
> 2*sqrt(n)).

I tried it anyway.
Here are some results:

Normal:

-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min 0; median 1.36141e-11; max 100
-> fivepctlo 0; fivepcthi 0.144228
* = 253 items
 0.0 15415 *************************************************************
 0.5    84 *
 1.0    54 *
 1.5    30 *
 2.0    30 *
 2.5    17 *
 3.0    19 *
 3.5    19 *
 4.0    12 *
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85475e-09; median 100; max 100
-> fivepctlo 96.8278; fivepcthi 100
* = 87 items
95.5    46 *
96.0    17 *
96.5    14 *
97.0    16 *
97.5    21 *
98.0    38 *
98.5    35 *
99.0    92 **
99.5  5300 *************************************************************
-> best cost for all runs: $102.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.495 & 0.96
-> fp 3; fn 14; unsure ham 40; unsure spam 253
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%

==================

Dividing the log-products and n by 2:

-> Ham scores for all runs: 16000 items; mean 0.76; sdev 5.07
-> min 0; median 1.19013e-05; max 99.9998
-> fivepctlo 0; fivepcthi 1.54439
* = 242 items
 0.0 14736 *************************************************************
 0.5   316 **
 1.0   134 *
 1.5   103 *
 2.0    74 *
 2.5    60 *
 3.0    37 *
 3.5    35 *
 4.0    34 *
-> Spam scores for all runs: 5800 items; mean 98.71; sdev 5.97
-> min 0.000221093; median 100; max 100
-> fivepctlo 92.9253; fivepcthi 100
* = 83 items
95.5    27 *
96.0    21 *
96.5    35 *
97.0    38 *
97.5    40 *
98.0    59 *
98.5    82 *
99.0   122 **
99.5  5005 *************************************************************
-> best cost for all runs: $104.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.92
-> fp 3; fn 14; unsure ham 43; unsure spam 259
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%

=============================================

Dividing the log-products and n by 4:

-> Ham scores for all runs: 16000 items; mean 1.32; sdev 5.49
-> min 0; median 0.0140483; max 99.9378
-> fivepctlo 1.11022e-14; fivepcthi 6.09162
* = 206 items
 0.0 12557 *************************************************************
 0.5   880 *****
 1.0   511 ***
 1.5   298 **
 2.0   223 **
 2.5   176 *
 3.0   135 *
 3.5   113 *
 4.0    91 *
-> min 0.0626454; median 99.9953; max 100
-> fivepctlo 87.8576; fivepcthi 100
* = 71 items
95.5    38 *
96.0    54 *
96.5    55 *
97.0    59 *
97.5    70 *
98.0   150 ***
98.5   142 **
99.0   280 ****
99.5  4331 *************************************************************
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.855
-> fp 4; fn 13; unsure ham 46; unsure spam 230
-> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
-> largest ham & spam cutoffs 0.485 & 0.855
-> fp 4; fn 14; unsure ham 42; unsure spam 229
-> fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%

As I expected, this significantly broadens the extremes at only very little cost. What this does statistically is downweighting all clues, thereby taking care of a "standard" correlation between clues. This may be functionally equivalent to raising the value of s.

This is the /4 code for reference:

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.38
diff -u -r1.38 classifier.py
--- classifier.py 14 Oct 2002 02:20:35 -0000 1.38
+++ classifier.py 17 Oct 2002 15:24:55 -0000
@@ -516,7 +516,10 @@
         S = ln(S) + Sexp * LN2
         H = ln(H) + Hexp * LN2

-        n = len(clues)
+        S = S/4.0
+        H = H/4.0
+
+        n = len(clues)//4
         if n:
             S = 1.0 - chi2Q(-2.0 * S, 2*n)
             H = 1.0 - chi2Q(-2.0 * H, 2*n)

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 16:33:36 2002 From: rob@hooft.net (Rob W. W.
Hooft) Date: Thu, 17 Oct 2002 17:33:36 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAED850.6060107@hooft.net> Sean True wrote: > I'm not passionate about this in particular, but having a score that looks > like a gaussian > when I have a gaussian feel about the scored messages makes sense to me. Right! But the idea of spambayes is to make a binary classification between spam and ham. We have discovered that there is a middle ground which can be explored. But why would the "ham" behave in a Gaussian way under such a model? Ham is one of the two extremes, and most ham is very easy to recognize as such. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From mal@lemburg.com Thu Oct 17 16:42:24 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 17:42:24 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE9029.2035.FF630F1@localhost> Message-ID: <3DAEDA60.20801@lemburg.com> Brad Clements wrote: > On 17 Oct 2002 at 16:19, M.-A. Lemburg wrote: > > >>>What operating system, and how much RAM do you have? >> >>SuSE Linux 8 on 1GB RAM. But why would that matter ? The process >>size is only 4.8MB. > > Two thoughts: > > 1. you ran the test at least once before timing it, so Python and other stuff was probably > "still in ram" Not exactly sure how Linux pages things, but on Windows this statement > would most likely be true. The times come directly from the system's time command and are user + system times (not wall clock). And yes, things were most probably still in memory since I always run the tests a few times and then take the numbers from the last test. > 2. with less ram, you're more likely to need to throw out something to load Python and > stuff (especially on Windows OS). True. > I just found the "load time" to be extremely low for a typical office worker box. You don't > appear to have a typical box. 
Hmm, this is a standard SuSE installation and not even an up-to-date
machine (1.2GHz is only half the speed of today's boxes). I am running
Reiser FS if that makes any difference.

> If your box is typical, is your company hiring? ;-)

Unfortunately, not. Bad times these days...

> Note I'm not slighting Python, since the load time is a given no matter what. Just
> wanted to know how you achieved the low load time.

Could be that the file system is using some smart caching technique
which makes the dozens of stat calls at Python startup time rather fast.

> Regarding the 23 megabytes . well, to run this on an IMAP server supporting 100
> users. That's a lot of disk space. I realize the context switching from one "user" to the
> next wouldn't be so bad using a database. If you were using a pickle, argh!

I suppose that you can easily create and use multiple spam databases,
e.g. have a central one for the whole company which only masks standard
spam and then use smaller ones per user which override the settings in
the main one if needed. Sort of like:

    md = open(maindict)
    ud = open(userdict)
    value = ud.get(key)
    if value is None:
        value = md[key]

The database size only increases as more words find their way into it.
I'm not sure, but perhaps it's possible to filter the entries and remove
meaningless ones (those with ~50% spam level). No idea. This time I'm a
user, not a developer ;-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From tim.one@comcast.net Thu Oct 17 16:48:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 11:48:02 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAEAF7C.2060800@lemburg.com>
Message-ID:

[M.-A. Lemburg]
> I'd need checkin rights for that.
Sorry, Marc-Andre, that excuse just went away.

mind-your-whitespace-and-we'll-get-along-just-fine-ly y'rs - tim

From tim.one@comcast.net Thu Oct 17 16:59:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 11:59:31 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAE9D30.4050801@lemburg.com>
Message-ID:

[M.-A. Lemburg, on mxBeeBase]
> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).

Note that at least half the words in the database are almost certainly
unique, and so of no actual use. Pruning the database, and especially
over time, is something that needs work here.

> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

For contrast, I run tests using a plain Python dict for "a database",
and reading up msgs stored one per file, but doing many (on the order of
1e5) scorings per run. On a slower 866MHz Pentium box with 256MB RAM,
this scores about 80 msgs/second, or about 12.5ms per msg (under 2.3 CVS
Python, which is zippier than 2.2.2). Firing up the system once per msg
is a real expense; keeping it running in the background all the time is
a real expense of a different kind.

From mal@lemburg.com Thu Oct 17 17:12:05 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 18:12:05 +0200
Subject: [Spambayes] Using mxBeeBase as hammie DB
References:
Message-ID: <3DAEE155.6060202@lemburg.com>

Tim Peters wrote:
> [M.-A. Lemburg]
>
>>I'd need checkin rights for that.
>
> Sorry, Marc-Andre, that excuse just went away.

Oh dear, I knew that would happen ;-) Will I ever be a plain user ?

> mind-your-whitespace-and-we'll-get-along-just-fine-ly y'rs - tim

Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From tim.one@comcast.net Thu Oct 17 17:12:52 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 12:12:52 -0400
Subject: [Spambayes] 5% points in statistics
In-Reply-To: <3DAEB8A7.6010807@hooft.net>
Message-ID:

[Rob W. W. Hooft]
> I added 5% and 95% points to the statistics in Histogram.py. The
> calculation is similar to a "median": a median is the 50% point.

That's a fine idea! I would like to generalize it, and allow specifying
an arbitrary list of percentile points (e.g., in that sense, you've
hard-coded the list 5 95 and the code already hard-coded 50).

> This has as effect:
>
> -> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
> -> min 0; median 1.36141e-11; max 100
> -> fivepctlo 0; fivepcthi 0.144228
> -> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
> -> min 6.85475e-09; median 100; max 100
> -> fivepctlo 96.8278; fivepcthi 100
>
> So indeed this reveals new information about the distributions: where
> "sdev" for ham and spam are very similar, the fivepct{lo,hi} values show
> that the distributions are NOT the same width. 95% of ham is 20 times
> tighter than 95% of spam.

At least on that data. The sdev is a lot easier to make sense of under
schemes where score distributions look "kinda normal" (or normalish
Weibull, whatever). The same is true of the histograms, for that matter
-- when the histos approximate two solid bars at 0.1 and 1.0, they're
really not helpful. Percentile points make reasonable sense for all
distributions.

From mal@lemburg.com Thu Oct 17 17:19:32 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 18:19:32 +0200
Subject: [Spambayes] Using mxBeeBase as hammie DB
References:
Message-ID: <3DAEE314.6040903@lemburg.com>

Tim Peters wrote:
> [M.-A.
Lemburg, on mxBeeBase] > >>Just to put some numbers by the fishes: >> >>Teaching hammie 13000 messages from comp.lang.python >>gives a database size of 23MB (that's data + index). > > > Note that at least half the words in the database are almost certainly > unique, and so of no actual use. Pruning the database, and especially over > time, is something that needs work here. Is there some way to do this automagically ? >>Checking a single message takes 200ms on my Athlon 1200 >>(this includes Python startup time). > > For contrast, I run tests using a plain Python dict for "a database", and > reading up msgs stored one per file, but doing many (on the order of 1e5) > scorings per run. On a slower 866MHz Pentium box with 256MB RAM, this > scores about 80 msgs/second, or about 12.5ms per msg (under 2.3 CVS Python, > which is zippier than 2.2.2). Firing up the system once per msg is a real > expense; keeping it running in the background all the time is a real expense > of a different kind. I suppose a PCGI style approach would be best here: you use a small C program as client (used for filtering by e.g. procmail) which then talks to a long-running daemon process. The alternative would be to wrap up the whole spambayes package into a mxCGIPython kind of frozen application which then uses an on-disk dictionary. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From popiel@wolfskeep.com Thu Oct 17 17:39:22 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 17 Oct 2002 09:39:22 -0700 Subject: [Spambayes] Full chi-squared ratio/training analysis Message-ID: <20021017163922.1685FF4CD@cashew.wolfskeep.com> Well, I promised a full writeup yesterday, so here it is. 
;-)

Quick summary: The chi-squared method is still sensitive to the
differing ham:spam ratios. Low ham:spam is still better than high
ham:spam, and the sweet spot seems to have moved down to about 1:2
ham:spam (or maybe even 1:3... but my data doesn't have the granularity
to tell). People wanting to use this on their real mailfeeds may want
to train with only a subset of their ham mail.

Also, chi-squared is relatively unaffected by _quantity_ of training
data (same as before). More training data (in the ranges I can provide)
brings at best modest improvement, and that is inconsistent.

Finally, chi-squared seems to do decently with 0.05/0.95 cutoffs,
regardless of ratio. While not perfectly ideal, the costs are generally
close to ideal (at worst about 1.5 times ideal).

Have some tables:

Chi-squared, 0.05-0.90 cutoffs, 5 sets:
-> tested 50 hams & 200 spams against 200 hams & 800 spams
[...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   1       1       2       2       3       3       2
fp %:       0.40    0.27    0.40    0.32    0.40    0.34    0.20
fn total:   2       2       3       2       3       4       6
fn %:       0.20    0.23    0.40    0.32    0.60    1.07    2.40
unsure t:   26      24      25      33      29      26      37
unsure %:   2.08    1.92    2.00    2.64    2.32    2.08    2.96
real cost:  $17.20  $16.80  $28.00  $28.60  $38.80  $39.20  $33.40
best cost:  $15.60  $15.00  $19.80  $19.20  $27.80  $14.80  $14.60
h mean:     2.59    1.18    0.73    0.44    0.51    0.46    0.35
h sdev:     11.57   7.82    7.00    5.68    6.46    6.01    5.02
s mean:     99.31   98.95   98.32   97.41   96.84   96.10   93.12
s sdev:     7.03    8.50    10.20   12.75   14.43   15.70   19.33
mean diff:  96.72   97.77   97.59   96.97   96.33   95.64   92.77
k:          5.20    5.99    5.67    5.26    4.61    4.41    3.81

Chi-squared, 0.05-0.90 cutoffs, 8 sets:
-> tested 50 hams & 200 spams against 350 hams & 1400 spams
[...]
-> tested 200 hams & 50 spams against 1400 hams & 350 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   1       2       3       2       3       3       2
fp %:       0.25    0.33    0.38    0.20    0.25    0.21    0.12
fn total:   2       3       2       2       4       6       10
fn %:       0.12    0.21    0.17    0.20    0.50    1.00    2.50
unsure t:   37      33      39      41      46      44      39
unsure %:   1.85    1.65    1.95    2.05    2.30    2.20    1.95
real cost:  $19.40  $29.60  $39.80  $30.20  $43.20  $44.80  $37.80
best cost:  $16.00  $25.80  $29.20  $25.00  $24.00  $19.60  $18.80
h mean:     1.77    0.78    0.66    0.42    0.49    0.48    0.36
h sdev:     9.19    6.90    6.85    5.58    6.16    5.97    4.91
s mean:     99.42   99.03   98.73   97.96   96.88   96.09   93.81
s sdev:     6.02    7.63    8.39    11.27   14.29   15.85   20.02
mean diff:  97.65   98.25   98.07   97.54   96.39   95.61   93.45
k:          6.42    6.76    6.44    5.79    4.71    4.38    3.75

Chi-squared, 0.05-0.90 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       3       4       5       4       2
fp %:       0.40    0.40    0.30    0.32    0.33    0.23    0.10
fn total:   5       6       4       5       6       7       9
fn %:       0.25    0.34    0.27    0.40    0.60    0.93    1.80
unsure t:   41      37      38      39      45      42      49
unsure %:   1.64    1.48    1.52    1.56    1.80    1.68    1.96
real cost:  $33.20  $43.40  $41.60  $52.80  $65.00  $55.40  $38.80
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

Chi-squared, 0.05-0.95 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       3       3       2       2       2
fp %:       0.40    0.40    0.30    0.24    0.13    0.11    0.10
fn total:   5       6       4       5       6       7       9
fn %:       0.25    0.34    0.27    0.40    0.60    0.93    1.80
unsure t:   49      44      49      46      54      58      53
unsure %:   1.96    1.76    1.96    1.84    2.16    2.32    2.12
real cost:  $34.80  $44.80  $43.80  $44.20  $36.80  $38.60  $39.60
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

Chi-squared, 0.02-0.98 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       2       3       2       2       1
fp %:       0.40    0.40    0.20    0.24    0.13    0.11    0.05
fn total:   4       4       3       3       3       5       6
fn %:       0.20    0.23    0.20    0.24    0.30    0.67    1.20
unsure t:   60      63      63      62      67      70      73
unsure %:   2.40    2.52    2.52    2.48    2.68    2.80    2.92
real cost:  $36.00  $46.60  $35.60  $45.40  $36.40  $39.00  $30.60
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

The first three tables show the effects of ratio and training set size
over otherwise consistent parameters. The last three tables show the
effects of ratio and cutoffs. These results _ARE_ comparable to my
earlier ratio tests; I'm using the same training data, the same random
seed, etc. I ought to rerun the old tests to get the best cost info for
them, though...
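The "real cost" and "best cost" rows use the list's standard cost measure (per-fp cost $10.00, per-fn cost $1.00, per-unsure cost $0.20, as the result listings elsewhere in this digest state). A minimal sketch of that arithmetic, with a function name of my own choosing:

```python
def run_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Dollar cost of one test run: false positives are expensive,
    false negatives cheap, and each unsure costs a bit of review time."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# The 50-200 column of the first table above: 1 fp, 2 fn, 26 unsure
print("$%.2f" % run_cost(1, 2, 26))  # -> $17.20
```

As a cross-check, Rob's "Normal" run quoted elsewhere in this digest (fp 3; fn 14; unsure ham 40; unsure spam 253) comes to $102.60 under the same weights, matching its reported best cost.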
I have this up on my website at:

http://www.wolfskeep.com/~popiel/spambayes/chi

- Alex

From nas@python.ca Thu Oct 17 17:48:14 2002
From: nas@python.ca (Neil Schemenauer)
Date: Thu, 17 Oct 2002 09:48:14 -0700
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAE9D30.4050801@lemburg.com>
References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com>
Message-ID: <20021017164814.GA3731@glacier.arctrix.com>

M.-A. Lemburg wrote:
> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).
>
> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

$ time python2.3 neilfilter.py wordprobs.cdb ~/Maildir/ ~/Maildir/ < test.msg
real    0m0.139s
user    0m0.090s
sys     0m0.020s
$ ls -s wordprobs.cdb
4556 wordprobs.cdb

The database was trained on about 3800 messages. test.msg is about 1 kB
in size. My machine is an Athlon 1700+ with 512 MB of RAM.

Neil

From popiel@wolfskeep.com Thu Oct 17 17:54:28 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 17 Oct 2002 09:54:28 -0700
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: Message from "M.-A. Lemburg" of "Thu, 17 Oct 2002 17:42:24 +0200." <3DAEDA60.20801@lemburg.com>
References: <3DAE9029.2035.FF630F1@localhost> <3DAEDA60.20801@lemburg.com>
Message-ID: <20021017165428.E228CF4CD@cashew.wolfskeep.com>

In message: <3DAEDA60.20801@lemburg.com>
            "M.-A. Lemburg" writes:
>> I just found the "load time" to be extremely low for a typical office
>> worker box. You don't appear to have a typical box.
>
>Hmm, this is a standard SuSE installation and not even an up-to-date
>machine (1.2GHz is only half the speed of today's boxes). I am running
>Reiser FS if that makes any difference.

*snort*

At home, I'm running on a 300MHz PII. At work, I've got a 350MHz PII
and a 450MHz PIII. Getting that latter box required effort akin to
pulling teeth. And I'm a coder!
Office workers often get utter crap for hardware. - Alex From neale@woozle.org Thu Oct 17 18:08:51 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 10:08:51 -0700 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAEE314.6040903@lemburg.com> References: <3DAEE314.6040903@lemburg.com> Message-ID: So then, "M.-A. Lemburg" is all like: > I suppose a PCGI style approach would be best here: you use a small > C program as client (used for filtering by e.g. procmail) which then > talks to a long-running daemon process. You might be interested in hammiesrv.py, which is an XML-RPC hammie server. I'll check in hammiecli.py, which we've been using at work, right after I send this message out. BTW, I have a buttload of email from folks at work now, I'll post some results of testing it all against my wordlist RSN. So many cool projects, so little time. Neale From tim.one@comcast.net Thu Oct 17 19:10:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 14:10:14 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <200210171153.g9HBrcN10612@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ... > It may be clear by now that I haven't been following recent discussions > much -- but the "all outcomes are extreme" characteristic was what led > us to look for an alternative to Graham's scheme, and I've come to > appreciate having a gray area. Rob covered this well, but I want to hammer the point home, because I expect most people here have been overwhelmed by the tech talk: given enough training data, Graham's combining scheme was *always* extreme. chi-combining is extreme "only" about 99% of the time, and the ~1% of the time it isn't extreme turns out to contain most of its mistakes. It's the closest thing to a laser beam we've got, and it's really quite amazing. Playing with chi2.py as a main program can be very instructive. 
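For readers who want to see the laser-beam behavior without checking out the project, the core of the chi-combining computation can be condensed as below. This is a simplified standalone sketch, not classifier.py itself: the real code additionally tracks exponents (the Sexp/Hexp terms visible in Rob's diff) to avoid floating-point underflow on long clue lists, and draws its per-clue probabilities from the trained database.

```python
from math import exp, log

def chi2Q(x2, v):
    """Survival function (1 - CDF) of the chi-squared distribution,
    for an even number of degrees of freedom v, via the standard
    series expansion."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Chi-combining sketch: fold per-clue spam probabilities (each
    strictly between 0 and 1) into one score in 0..1.  Conflicting or
    neutral clues land near 0.5 -- the 'middle ground'."""
    n = len(probs)
    S = sum(log(1.0 - p) for p in probs)  # log-product of (1 - p)
    H = sum(log(p) for p in probs)        # log-product of p
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)      # evidence the msg is spam
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)      # evidence the msg is ham
    return (S - H + 1.0) / 2.0

print(round(chi_combine([0.9]), 6))        # -> 0.9
print(round(chi_combine([0.99] * 20), 6))  # -> 1.0
print(round(chi_combine([0.5] * 20), 6))   # -> 0.5
```

Note the single-clue case: the combined score is just that clue's own probability, one quick sanity check that the combining favors no outcome in particular.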
If you fiddle its judge() function to do Graham-combining, the
histograms it prints show that G-combining makes an extreme judgement
most of the time even when fed collections of random (uniformly
distributed) probabilities: it infers certainty out of thin air. But
the S and H statistics that go into chi-combining are uniform in the
face of random input: an S value of 0.001 is as likely as a value of
0.999 is as likely as a value of 0.500 (etc): given random input,
they're unbiased, favoring no outcome in particular. These reflect real
life too: chi-combining knows when it's confused, Graham-combining
doesn't, and both behaviors show up in the real-life score
distributions. That chi-combining is rarely confused is really a great
strength; that Graham-combining is (almost) never confused is its great
weakness.

From tim.one@comcast.net Thu Oct 17 19:28:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 14:28:20 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To:
Message-ID:

[Sean True]
> I hate to try to speak for Joe User (like speaking for the "common man",
> always a red flag), but I _am_ just a user of these scoring schemes.
> I have several hundred messages (commercial email) tucked away in a
> folder that score in the non-chi scheme in the range .4 to .6. That
> score appears to reflect my own real uncertainty about the value of
> Motley Fool newsletters. No snickering, please. A system like chi-
> looks like a very good choice for black and white, upstream discards
> offers to increase body part size.

Try rescoring your msgs using chi-combining before simply guessing that
it's inappropriate for you. You don't have to retrain your database,
you just have to change the spamprob() used for scoring. In my email so
far, chi-combining is correctly and extremely certain that most of my
commercial ham is in fact ham.
The worst it does is on the one HTML newsletter I have from Strong
Investments so far, of which there are no examples in my training data.
It gets a score of about 0.6, very solidly in its "middle ground". A UI
that makes most sense under a middle-ground scheme would shuffle "I'm
pretty sure it's spam" into a Spam folder, and the ones it knows it's
confused about into an Unsure folder. The rest ("I'm pretty sure it's
ham") would be left in your inbox. We can't really do that with the
default combining scheme because about half your email would end up in
the Unsure folder -- the separation it makes between populations is too
fuzzy.

> But I don't want these messages automatically discarded upstream, I want
> them labelled so that I can deal with them more efficiently.

Try chi-combining first. There's no requirement that extreme msgs get
discarded here, that's simply a choice the emailmeister at python.org is
likely to make. I've said many times that I personally will never use a
scheme that discards a msg without my review, so short of a major
personality transplant you can be sure I'm not going to put anything in
this project that requires such trust.

> When I sort this particular folder by spam score, I get MIT club and
> Infoworld newsletters at the beginning (the good end), and the
> Motley Fool and Edgar Online at the other end, with a range of spam
> score from .2 to .6. Just right. If I could color them continuously, it
> would be easy to spot the ones I want to read, now. And over time, as I
> change my definition of spam, their position in the list looks like it
> will vary smoothly -- and appropriately.
>
> This may not fit your original mission statement, but mission statements
> often don't survive contact with the enemy, err, customer.

I was the one who said I wasn't willing to kill off the non-extreme
methods (yet), because there are *many* customers here, and the union of
what they want can't be had with a single scheme.
But a fan of a particular scheme gets a lot more credibility after
they've tried the alternatives and given them thought based on actual
experience.

From guido@python.org Thu Oct 17 19:29:59 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 14:29:59 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 11:19:44 PDT."
References:
Message-ID: <200210171829.g9HITxI21925@odiug.zope.com>

Neale's hammie client and server seem to me to be wasting some effort.
Currently, what happens, is:

    cli sends the entire message to svr
    svr parses and scores the message
    svr inserts the X-Hammie-Disposition header in the message
    svr sends the message, thus modified, back
    cli prints the returned, modified, message to stdout

What would make more sense from the POV of minimizing traffic and
minimizing work done in the server:

    cli parses the message
    cli sends the list of tokens to svr
    svr scores the list of tokens
    svr returns the text to be inserted in the X-Hammie-Disposition header
    cli inserts the X-Hammie-Disposition in the message
    cli prints the message to stdout

(I like to minimize traffic as well as the work done by the server;
minimizing traffic is always a good idea, while minimizing server work
means less load on a shared server -- if the clients run on separate
machines, the combined CPU power of the clients is much more than that
of the server.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip@pobox.com Wed Oct 16 22:02:38 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 16 Oct 2002 16:02:38 -0500
Subject: [Spambayes] Slice o' life
In-Reply-To:
References:
Message-ID: <15789.54254.340082.342114@montanaro.dyndns.org>

Tim> It turns out that python.org, Mailman, and SpamAssassin, put
Tim> sooooooooo many unique "Hey, I had my fingers in this!"
clues in the Tim> headers that virtually any message coming thru python.org has a Tim> relatively huge collection of killer-strong ham clues (just listing Tim> headers containing such clues): Why not tweak the code to call the guts of unheader.py? Something like unheader.py -p 'X-Mailman|List-|Errors-to|Sender' should get rid of most of the header fluff. Skip From agmsmith@rogers.com Thu Oct 17 19:58:32 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 14:58:32 EDT (-0400) Subject: [Spambayes] Client/server model In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com> Message-ID: <10385763431-BeMail@CR593174-A> Guido van Rossum wrote: > What would make more sense from the POV of minimizing traffic and > minimizing work done in the server: > > cli parses the message > cli sends the list of tokens to svr I'd want the server to do tokenization for consistency reasons. Particularly if you are also spam filtering news articles and not just e-mail messages. Also, the server can have all that mail parsing code (discarding attachments, decoding BASE64 etc), making the client simpler. > svr scores the list of tokens > svr returns the text to be inserted in the X-Hammie-Disposition header I'm returning the spam ratio in my server (using BeOS inter-program communication, though I suppose I could use the package which extends the BMessage system to the Internet, but the spam database is really a per-user thing so that isn't useful). I let the client decide if it's over their own threshold limit or not (ok, that may be a bad design choice). I'm also returning the list of words and their individual scores, but that's mostly for debugging (and wastes a lot of space - 150 words at a time!). The client (a plug-in filter for the BeMail package) also does the sound effects (saying "Spam" or "Genuine" as each message comes in). 
> cli inserts the X-Hammie-Disposition in the message
> cli prints the message to stdout
>
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

Actually, it turns out that my server approach really isn't needed for
speed reasons. It just takes a fraction of a second to load and parse
the spam database (a 0.5MB (stripped of unique strings after initial
training on 1500 messages / 21000 words) text file with words and
numbers). But still it's nice to have it separate from other programs
so that it is more modular.

- Alex

From tim.one@comcast.net Thu Oct 17 20:05:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 15:05:30 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: <3DAEC536.10803@hooft.net>
Message-ID:

[Rob W. W. Hooft]
> ...
> I understood the original idea of Tim as that he wanted to see the
> spamminess of clearcut spam and the hamminess of clearcut ham. I don't
> see the point of that,

Neither do I, but I'm not following your meaning so that shouldn't be
too surprising. For my own use with a middle ground, I want
near-certain spam shuffled into a Spam folder, near-certain Ham left in
my inbox, and the middle ground shuffled into an Unsure folder.

From guido@python.org Thu Oct 17 20:05:25 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 15:05:25 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 14:58:32 EDT."
<10385763431-BeMail@CR593174-A> References: <10385763431-BeMail@CR593174-A> Message-ID: <200210171905.g9HJ5PD22148@odiug.zope.com> > > What would make more sense from the POV of minimizing traffic and > > minimizing work done in the server: > > > > cli parses the message > > cli sends the list of tokens to svr > > I'd want the server to do tokenization for consistency reasons. > Particularly if you are also spam filtering news articles and not > just e-mail messages. I don't understand this. > Also, the server can have all that mail parsing code (discarding > attachments, decoding BASE64 etc), making the client simpler. But discarding attachments in the client would reduce the traffic to the server tremendously! Maybe your server has more available CPU power than your client though? > > svr scores the list of tokens > > svr returns the text to be inserted in the X-Hammie-Disposition header > > I'm returning the spam ratio in my server (using BeOS inter-program > communication, though I suppose I could use the package which extends > the BMessage system to the Internet, but the spam database is really > a per-user thing so that isn't useful). I let the client decide if > it's over their own threshold limit or not (ok, that may be a bad design > choice). I'm also returning the list of words and their individual > scores, but that's mostly for debugging (and wastes a lot of space - > 150 words at a time!). The client (a plug-in filter for the BeMail > package) also does the sound effects (saying "Spam" or "Genuine" as > each message comes in). Cool. :-) > > cli inserts the X-Hammie-Disposition in the message > > cli prints the message to stdout > > > > (I like to minimize traffic as well as the work done by the server; > > minimizing traffic is always a good idea, while minimizing server work > > means less load on a shared server -- if the clients run on separate > > machines, the combined CPU power of the clients is much more than that > > of the server.) 
> > Actually, it turns out that my server approach really isn't needed
> for speed reasons. It just takes a fraction of a second to load and
> parse the spam database (a 0.5MB (stripped of unique strings after
> initial training on 1500 messages / 21000 words) text file with
> words and numbers). But still it's nice to have it separate from
> other programs so that it is more modular.

Fractions of seconds add up. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From agmsmith@rogers.com Thu Oct 17 20:10:55 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Thu, 17 Oct 2002 15:10:55 EDT (-0400)
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171905.g9HJ5PD22148@odiug.zope.com>
Message-ID: <11128972051-BeMail@CR593174-A>

Guido van Rossum wrote:
> > I'd want the server to do tokenization for consistency reasons.
> > Particularly if you are also spam filtering news articles and not
> > just e-mail messages.
>
> I don't understand this.

So that everybody tokenizes the incoming messages in the same way,
particularly the same way as that used earlier during training.

Also, I'd have the server keep track of spam from other sources, such
as UseNet news. Is there anywhere else where spam messages show up that
might need to be included, or is it just mail and news?

- Alex

From guido@python.org Thu Oct 17 20:19:59 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 15:19:59 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 15:10:55 EDT." <11128972051-BeMail@CR593174-A>
References: <11128972051-BeMail@CR593174-A>
Message-ID: <200210171919.g9HJJxs22230@odiug.zope.com>

> > > I'd want the server to do tokenization for consistency reasons.
> > > Particularly if you are also spam filtering news articles and not
> > > just e-mail messages.
> >
> > I don't understand this.
> > So that everybody tokenizes the incoming messages in the same way,
> particularly the same way as that used earlier during training.

The hammie-client approach has a separate client program that's invoked
each time, and that takes care of the uniform parsing.

> Also, I'd have the server keep track of spam from other sources,
> such as UseNet news. Is there anywhere else where spam messages
> show up that might need to be included, or is it just mail and
> news?

Not that I know of.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net Thu Oct 17 20:25:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 15:25:38 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: <3DAED682.3090905@hooft.net>
Message-ID:

[Rob W. W. Hooft, divides S and H and n by various things before
computing chi2Q]
> Normal:
> -> best cost for all runs: $102.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.495 & 0.96
> -> fp 3; fn 14; unsure ham 40; unsure spam 253
> -> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%
> Dividing the log-products and n by 2:
> -> best cost for all runs: $104.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.49 & 0.92
> -> fp 3; fn 14; unsure ham 43; unsure spam 259
> -> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%
> Dividing the log-products and n by 4:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.855
> -> fp 4; fn 13; unsure ham 46; unsure spam 230
> -> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
> -> largest ham & spam cutoffs 0.485 & 0.855
> -> fp 4; fn 14; unsure ham 42; unsure spam 229
> -> fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%
> As I expected, this significantly broadens the extremes at only very
> little cost.
But what's the point?  By your own cost measure, it didn't do you any good, and in fact it raised your FP rate by the time you got to 4.

> What this does statistically is downweighting all clues thereby taking
> care of a "standard" correlation between clues.  This may be
> functionally equivalent to raising the value of s.

I doubt the latter, but if it's true I'd much rather get there by raising s, which is symmetric and comprehensible.  Fudging H, S and n introduces strange biases, because the info you're feeding into chi2Q no longer follows a chi-squared distribution after fudging, and chi2Q may as well be some form of biased random-number generator then.

> This is the /4 code for reference:
>
> Index: classifier.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
> retrieving revision 1.38
> diff -u -r1.38 classifier.py
> --- classifier.py	14 Oct 2002 02:20:35 -0000	1.38
> +++ classifier.py	17 Oct 2002 15:24:55 -0000
> @@ -516,7 +516,10 @@
>          S = ln(S) + Sexp * LN2
>          H = ln(H) + Hexp * LN2
>
> -        n = len(clues)
> +        S = S/4.0
> +        H = H/4.0
> +
> +        n = len(clues)//4
>          if n:
>              S = 1.0 - chi2Q(-2.0 * S, 2*n)
>              H = 1.0 - chi2Q(-2.0 * H, 2*n)

Fiddle chi2.judge() to play with this.
Here's the straight H distribution (S is similar) on vectors of 52 random probs:

52 random probs H
10000 items; mean 0.50; sdev 0.29
-> min 0.000119708; median 0.500356; max 0.999988
* = 9 items
0.00 498 ********************************************************
0.05 494 *******************************************************
0.10 504 ********************************************************
0.15 546 *************************************************************
0.20 484 ******************************************************
0.25 470 *****************************************************
0.30 494 *******************************************************
0.35 491 *******************************************************
0.40 505 *********************************************************
0.45 513 *********************************************************
0.50 504 ********************************************************
0.55 474 *****************************************************
0.60 500 ********************************************************
0.65 502 ********************************************************
0.70 501 ********************************************************
0.75 542 *************************************************************
0.80 517 **********************************************************
0.85 443 **************************************************
0.90 514 **********************************************************
0.95 504 ********************************************************

Do the same but divide everything by 4 first (as you showed), and H is no longer uniformly distributed:

52 random probs H/4 & n//4
10000 items; mean 0.52; sdev 0.18
-> min 0.0144875; median 0.527973; max 0.973816
* = 17 items
0.00 4 *
0.05 47 ***
0.10 116 *******
0.15 238 **************
0.20 303 ******************
0.25 498 ******************************
0.30 631 **************************************
0.35 781 **********************************************
0.40 900 *****************************************************
0.45 933 *******************************************************
0.50 967 *********************************************************
0.55 1017 ************************************************************
0.60 893 *****************************************************
0.65 812 ************************************************
0.70 699 ******************************************
0.75 519 *******************************
0.80 339 ********************
0.85 208 *************
0.90 87 ******
0.95 8 *

The bias also shifts according to the number of extreme words in a msg modulo 4, getting more lopsided the larger n%4:

53 random probs H/4 & n//4
10000 items; mean 0.55; sdev 0.18
-> min 0.030539; median 0.554048; max 0.975847
* = 17 items
0.00 3 *
0.05 24 **
0.10 74 *****
0.15 133 ********
0.20 261 ****************
0.25 420 *************************
0.30 558 *********************************
0.35 706 ******************************************
0.40 822 *************************************************
0.45 936 ********************************************************
0.50 995 ***********************************************************
0.55 1007 ************************************************************
0.60 989 ***********************************************************
0.65 866 ***************************************************
0.70 804 ************************************************
0.75 642 **************************************
0.80 396 ************************
0.85 247 ***************
0.90 106 *******
0.95 11 *

54 random probs H/4 & n//4
10000 items; mean 0.57; sdev 0.17
-> min 0.0562266; median 0.579539; max 0.984772
* = 17 items
0.00 0
0.05 14 *
0.10 47 ***
0.15 97 ******
0.20 201 ************
0.25 327 ********************
0.30 478 *****************************
0.35 643 **************************************
0.40 744 ********************************************
0.45 868 ****************************************************
0.50 981 **********************************************************
0.55 1020 ************************************************************
0.60 1004 ************************************************************
0.65 968 *********************************************************
0.70 894 *****************************************************
0.75 750 *********************************************
0.80 532 ********************************
0.85 298 ******************
0.90 112 *******
0.95 22 **

55 random probs H/4 & n//4
10000 items; mean 0.60; sdev 0.17
-> min 0.0477139; median 0.61042; max 0.971135
* = 19 items
0.00 1 *
0.05 7 *
0.10 26 **
0.15 84 *****
0.20 153 *********
0.25 270 ***************
0.30 359 *******************
0.35 452 ************************
0.40 659 ***********************************
0.45 819 ********************************************
0.50 919 *************************************************
0.55 1022 ******************************************************
0.60 1108 ***********************************************************
0.65 1088 **********************************************************
0.70 959 ***************************************************
0.75 792 ******************************************
0.80 661 ***********************************
0.85 412 **********************
0.90 186 **********
0.95 23 **

So, sorry, but overall this strikes me as the kind of thing we worked like hell to get away from in Paul's scheme: strange and inconsistent biases that don't actually help, but at least cancel each other out when you get lucky.  Extremity merely for the sake of extremity was no virtue, and neither is its converse.
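Tim's uniformity claim is easy to check in a few lines.  This is a toy sketch, not spambayes itself: `chi2Q` below follows the same truncated-series form as spambayes' chi2.py, but `h_score` and the driver loop are invented for illustration.  For uniform random probs, -2*sum(ln p) is chi-squared with 2n degrees of freedom, so 1 - chi2Q(...) is itself uniform; dividing the statistic and n by 4, as in Rob's patch, squeezes the scores toward the middle:

```python
import math
import random

def chi2Q(x2, v):
    """Survival function of a chi-squared variable; v must be even.
    Same series as used in spambayes' chi2.py."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def h_score(probs, divisor=1):
    """1 - chi2Q(-2*ln(prod(probs)), 2n), optionally fudged by `divisor`
    the way Rob's /4 patch fudges S, H and n."""
    n = len(probs)
    H = sum(math.log(p) for p in probs)  # ln of the product
    H /= divisor
    n //= divisor
    return 1.0 - chi2Q(-2.0 * H, 2 * n)

random.seed(42)
straight, fudged = [], []
for _ in range(2000):
    ps = [random.random() for _ in range(52)]
    straight.append(h_score(ps))
    fudged.append(h_score(ps, divisor=4))

def sdev(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Uniform[0,1] has sdev ~0.289; the fudged scores are visibly squeezed
# toward the middle, matching the histograms above (sdev ~0.18 there).
print("straight sdev %.3f, fudged sdev %.3f" % (sdev(straight), sdev(fudged)))
```

The squeeze happens because the fudged statistic keeps the right mean for 2*(n//4) degrees of freedom but has only a quarter of the variance a true chi-squared variable would have, which is exactly the "biased random-number generator" effect Tim describes.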
From python-spambayes@discworld.dyndns.org Thu Oct 17 20:46:47 2002
From: python-spambayes@discworld.dyndns.org (Charles Cazabon)
Date: Thu, 17 Oct 2002 13:46:47 -0600
Subject: [Spambayes] Client/server model
In-Reply-To: <11128972051-BeMail@CR593174-A>; from agmsmith@rogers.com on Thu, Oct 17, 2002 at 03:10:55PM -0400
References: <200210171905.g9HJ5PD22148@odiug.zope.com> <11128972051-BeMail@CR593174-A>
Message-ID: <20021017134647.A3293@discworld.dyndns.org>

Alexander G. M. Smith wrote:
> Also, I'd have the server keep track of spam from other sources, such as
> UseNet news.  Is there anywhere else where spam messages show up that might
> need to be included, or is it just mail and news?

Apparently some enterprising spammers are now experimenting with delivering spam via the netbios messaging protocol (i.e. winpopup) so that the message comes up on the user's screen in a dialog box.  It doesn't get through firewalls, but they're more interested in the home user than corporate users anyway.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From mal@lemburg.com Thu Oct 17 21:02:12 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 22:02:12 +0200
Subject: [Spambayes] Client/server model
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <3DAF1744.8080703@lemburg.com>

Guido van Rossum wrote:
> Neale's hammie client and server seem to me to be wasting some
> effort.
> Currently, what happens, is:
>
>     cli sends the entire message to svr
>
>     svr parses and scores the message
>     svr inserts the X-Hammie-Disposition header in the message
>     svr sends the message, thus modified, back
>
>     cli prints the returned, modified, message to stdout
>
> What would make more sense from the POV of minimizing traffic and
> minimizing work done in the server:
>
>     cli parses the message
>     cli sends the list of tokens to svr
>
>     svr scores the list of tokens
>     svr returns the text to be inserted in the X-Hammie-Disposition header
>
>     cli inserts the X-Hammie-Disposition in the message
>     cli prints the message to stdout
>
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

This may be true if you have clients on different CPUs, but if you are on the same machine (client talking to daemon), then Neale's model is certainly the better one.  In fact, making the client as tiny as possible would save more CPU time.  I'm thinking of the situation where you have a mail server which uses procmail to do the filtering for many different users having their account on that machine.  Another scenario would be to build the C client directly into the MTA being used for the delivery.  The only situation where the fat client would be better is that of distributed mail servers, but that seems like a rather uncommon setup.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From seant@iname.com Thu Oct 17 21:02:57 2002
From: seant@iname.com (Sean True)
Date: Thu, 17 Oct 2002 16:02:57 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: 
Message-ID: 

> [Sean True]
> > I hate to try to speak for Joe User (like speaking for the "common man",
> > always a red flag), but I _am_ just a user of these scoring schemes.
> > I have several hundred messages (commercial email) tucked away in a
> > folder that score in the non-chi scheme in the range .4 to .6.  That
> > score appears to reflect my own real uncertainty about the value of
> > Motley Fool newsletters.  No snickering, please.  A system like chi-
> > looks like a very good choice for black and white, upstream discards of
> > offers to increase body part size.
>
> Try rescoring your msgs using chi-combining before simply guessing that it's
> inappropriate for you.  You don't have to retrain your database, you just
> have to change the spamprob() used for scoring.
>
> In my email so far, chi-combining is correctly and extremely certain that
> most of my commercial ham is in fact ham.  The worst it does is on the one
> HTML newsletter I have from Strong Investments so far, of which there are no
> examples in my training data.  It gets a score of about 0.6, very solidly in
> its "middle ground".

I changed to chi-scoring.  That was easy.  I rescored the /News filter, which is not used for any training whatsoever.  The spread of scores is now 0 to .99, as advertised.  The scores for the financial newsletters are now much higher, which does not meet my qualitative assessment.  If I'm not sure it's spam, I'd prefer a score that matched that.  All in all, for exposed scoring, I still prefer the old scores, but not enough to keep complaining about.
And since getting whacked by Tim for lack of intellectual rigor and laziness is familiar, rousing, but not all that much fun, I think I'll go back to doing systems engineering and just _use_ the results.

> A UI that makes most sense under a middle-ground scheme would shuffle "I'm
> pretty sure it's spam" into a Spam folder, and the ones it knows it's
> confused about into an Unsure folder.  The rest ("I'm pretty sure it's ham")
> would be left in your inbox.  We can't really do that with the default
> combining scheme because about half your email would end up in the Unsure
> folder -- the separation it makes between populations is too fuzzy.

Depends on whether one wants the machine making the decision, or wants help making the decision oneself.  The ability to label a message, and then sort using the label is really handy if you spend your time classifying mail.

> (yet), because there are *many* customers here, and the union of what they
> want can't be had with a single scheme.  But a fan of a particular scheme
> gets a lot more credibility after they've tried the alternatives and given
> them thought based on actual experience.

Whack, whack, whack.  Always a pleasure, Mr. Peters.  I'm off to rescore all the mail in my Outlook folders, which takes about an hour on an Athlon 2000 XP.

-- Sean

From rob@hooft.net Thu Oct 17 21:06:18 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 22:06:18 +0200
Subject: [Spambayes] Proposing to remove 4 combining schemes
References: 
Message-ID: <3DAF183A.8070600@hooft.net>

Tim Peters wrote:
> But what's the point?  By your own cost measure, it didn't do you any good,
> and in fact it raised your FP rate by the time you got to 4.

There was some discussion about the judgments being too strict.  I was trying to find a statistically sound way to reduce correlations such that results would be less sure.
I explained that here:

>> What this does statistically is downweighting all clues thereby taking
>> care of a "standard" correlation between clues.

To which you said:

> Fudging H, S and n introduces
> strange biases, because the info you're feeding into chi2Q no longer follows
> a chi-squared distribution after fudging, and chi2Q may as well be some form
> of biased random-number generator then.

That is not exactly true.  What I am assuming is that if there is one clue in a message that says 0.8, there are probably more of those.  That is the correlation we're discussing.  A clue rarely comes alone.  The effect of that is that my joke messages with "From: xxx@yyy (by way of ppp@qqq)" get a very strong and repeated signal from the From: line, and your filtered mailman list is much too sure about hamminess.  This is solved by my hack: it practically divides the number of clues by 2 or 4.

> Fiddle chi2.judge() to play with this.  Here's the straight H distribution
> (S is similar) on vectors of 52 random probs:

OK:

Index: chi2.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/chi2.py,v
retrieving revision 1.7
diff -u -r1.7 chi2.py
--- chi2.py	16 Oct 2002 21:31:19 -0000	1.7
+++ chi2.py	17 Oct 2002 20:04:17 -0000
@@ -145,7 +145,7 @@
     for i in range(5000):
         ps = [random() for j in range(50)]
-        s1, h1, score1 = judge(ps + [bias] * warp)
+        s1, h1, score1 = judge((ps + [bias] * warp)*4)
         s.add(s1)
         h.add(h1)
         score.add(score1)

(i.e. adding correlated data points)  Results in:

Result for random vectors of 50 probs, + 0 forced to 0.99

H
5000 items; mean 0.47; sdev 0.38
-> min 1.26528e-11; median 0.444004; max 1
-> fivepctlo 0.000293787; fivepcthi 0.999102
* = 19 items
0.00 1125 ************************************************************
0.05 291 ****************
0.10 230 *************
0.15 182 **********
0.20 157 *********
0.25 146 ********
0.30 119 *******
0.35 135 ********
0.40 129 *******
0.45 121 *******
0.50 120 *******
0.55 131 *******
0.60 128 *******
0.65 152 ********
0.70 128 *******
0.75 167 *********
0.80 172 **********
0.85 208 ***********
0.90 239 *************
0.95 920 *************************************************

S
5000 items; mean 0.50; sdev 0.39
-> min 2.81657e-11; median 0.497487; max 1
-> fivepctlo 0.0005459; fivepcthi 0.999608
* = 18 items
0.00 1049 ***********************************************************
0.05 286 ****************
0.10 195 ***********
0.15 166 **********
0.20 138 ********
0.25 163 **********
0.30 129 ********
0.35 128 ********
0.40 123 *******
0.45 128 ********
0.50 123 *******
0.55 129 ********
0.60 114 *******
0.65 133 ********
0.70 149 *********
0.75 142 ********
0.80 201 ************
0.85 183 ***********
0.90 265 ***************
0.95 1056 ***********************************************************

(S-H+1)/2
5000 items; mean 0.51; sdev 0.34
-> min 3.71508e-09; median 0.515657; max 1
-> fivepctlo 0.00540499; fivepcthi 0.996936
* = 12 items
0.00 651 *******************************************************
0.05 240 ********************
0.10 214 ******************
0.15 180 ***************
0.20 184 ****************
0.25 173 ***************
0.30 163 **************
0.35 181 ****************
0.40 208 ******************
0.45 243 *********************
0.50 217 *******************
0.55 182 ****************
0.60 216 ******************
0.65 157 **************
0.70 185 ****************
0.75 190 ****************
0.80 191 ****************
0.85 225 *******************
0.90 282 ************************
0.95 718 ************************************************************

So: chi2 will be fairly sure even about random data if it is correlated.

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

From gward@python.net Thu Oct 17 21:52:08 2002
From: gward@python.net (Greg Ward)
Date: Thu, 17 Oct 2002 16:52:08 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <20021017205208.GB14491@cthulhu.gerg.ca>

On 17 October 2002, Guido van Rossum said:
> Neale's hammie client and server seem to me to be wasting some
> effort.  Currently, what happens, is:
>
>     cli sends the entire message to svr
>
>     svr parses and scores the message
>     svr inserts the X-Hammie-Disposition header in the message
>     svr sends the message, thus modified, back
>
>     cli prints the returned, modified, message to stdout

Arrggh.  That's exactly how SpamAssassin's spamc/spamd work, and it's a pain-in-the-ass for anyone who wants to access spamd in an unusual way.

> What would make more sense from the POV of minimizing traffic and
> minimizing work done in the server:
>
>     cli parses the message
>     cli sends the list of tokens to svr
>
>     svr scores the list of tokens
>     svr returns the text to be inserted in the X-Hammie-Disposition header
>
>     cli inserts the X-Hammie-Disposition in the message
>     cli prints the message to stdout

If there are multiple client implementations, then spreading work across clients also means duplicating code.  Yuck.  Based on my experience with SA, I think I'd prefer a model like this:

  cli sends message headers
  svr parses the headers
  cli sends message body OR individual attachments [ie.
the protocol needs some state so the client can say, "I'm sending you the headers now", or "I'm sending you the entire body now", or "I'm sending one attachment now"]
  svr parses the message body/attachments/whatever
  cli tells the server what it wants: eg. "give me the X-Hammie-Disposition header", or "give me just the score", or "give me the top-N scoring words and their probabilities"
  svr gives the client what it wants

Yes, I know this is more complex.  But it's how I wish SA's spamd protocol worked!

Greg
--
Greg Ward                                 http://www.gerg.ca/
I have the power to HALT PRODUCTION on all TEENAGE SEX COMEDIES!!

From rob@hooft.net Thu Oct 17 21:49:26 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 22:49:26 +0200
Subject: [Spambayes] optimal max_discriminators for chi2
Message-ID: <3DAF2256.30509@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment
I did a series of runs:

=========================
[Classifier]
use_chi_squared_combining: True
robinson_minimum_prob_strength = 0.0
robinson_probability_s = 0.45
max_discriminators = XXXXXX

[TestDriver]
spam_cutoff: 0.70
nbuckets: 200
best_cutoff_fp_weight: 10
show_false_positives: True
show_false_negatives: True
show_best_discriminators: 50
show_spam_lo = 0.00
show_spam_hi = 0.80
show_ham_lo = 0.40
show_ham_hi = 1.00
show_charlimit: 5000
============

With XXXXXX between 15 and 300.  Attached are plots of the 95th percentile ham, the 5th percentile spam, and the total cost (vertical) against max_discriminators (horizontal).  Please note again that my ham is much tighter than my spam: vertical scales are from 0 to 0.16 and from 89 to 100, respectively (almost a factor of 100!).  The cost plot shows "no trend at all", but the variation is not large.  I'd almost conclude "anything goes", but based on the spam-5% value I'd like to stick with values over ~40.

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: ham95.png
Type: image/png
Size: 6748 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/ham95.png
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: spam5.png
Type: image/png
Size: 6330 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/spam5.png
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: cost.png
Type: image/png
Size: 8545 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/cost.png
---------------------- multipart/mixed attachment--

From tim.one@comcast.net Thu Oct 17 22:03:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 17:03:30 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAEE314.6040903@lemburg.com>
Message-ID: 

[Tim]
>> Pruning the database, and especially over time, is something that
>> needs work here.

[M.-A. Lemburg]
> Is there some way to do this automagically ?

No; that's part of what "needs work here" means.  In addition, some fields in the WordInfo records probably aren't needed, or at best are too big (like saving an 8-byte double for a timestamp).  It's also unknown how pruning will affect accuracy over time, esp. since training is done on a batch-of-words-per-msg basis, but unless the tokenstream for each msg is saved, expiring words from the database will yield a state that doesn't match any real-life combination of training msgs.  Feel free to solve all that in your spare time.
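Tim's point about expiry leaving an inconsistent state can be illustrated with a toy model.  None of this is the real spambayes WordInfo layout; the dict-of-(spam, ham)-count pairs and both function names are invented stand-ins to show the bookkeeping problem:

```python
# Toy word database: token -> (spamcount, hamcount), counting each
# token at most once per message, with "hapaxes" (words seen exactly
# once overall) expired afterwards.

def train(db, tokens, is_spam):
    # count each distinct token once per trained message
    for tok in set(tokens):
        spam, ham = db.get(tok, (0, 0))
        db[tok] = (spam + 1, ham) if is_spam else (spam, ham + 1)

def prune_hapaxes(db):
    # drop words whose total count is 1; return how many were removed
    doomed = [t for t, (s, h) in db.items() if s + h == 1]
    for t in doomed:
        del db[t]
    return len(doomed)

db = {}
train(db, "free viagra now".split(), is_spam=True)
train(db, "lunch at noon".split(), is_spam=False)
train(db, "free lunch".split(), is_spam=False)

removed = prune_hapaxes(db)
# "viagra", "now", "at" and "noon" are gone; "free" and "lunch" survive.
# The surviving counts no longer correspond to any set of complete
# training messages -- the mismatch Tim describes: without the saved
# token streams you can only forget words, not untrain messages.
```

The same problem bites any expiry policy (by count, by timestamp, or otherwise) as long as only per-word aggregates are stored.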
From tim.one@comcast.net Thu Oct 17 22:32:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 17:32:46 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: 
Message-ID: 

[Sean True]
> I changed to chi-scoring.  That was easy.
> I rescored the /News filter, which is not used for any training
> whatsoever.  The spread of scores is now 0 to .99, as advertised.

That's a bit odd: in all test reports to date, the median spam score under chi-combining was 1.00, and that matches what I've seen on my personal email too (a large majority of spam scores 1.00, to the precision of the Hammie display).

> The scores for the financial newsletters are now much higher,

Meaning 0.99, or ...?  The rule of thumb under chi-combining so far has been to say a thing is spam if its score exceeds 0.95, else to call it a "middle ground" msg (with "it's ham" under 0.05).

> which does not meet my qualitative assessment.

The scores will change over time, of course -- the system learns what it's taught.

> If I'm not sure it's spam, I'd prefer a score that matched that.

Under chi-combining, a score under .95 (as a rule of thumb so far) does mean "I'm not sure it's spam".  So quantifying this would be helpful.

> All in all, for exposed scoring, I still prefer the old scores,
> but not enough to keep complaining about.

You're allowed to -- as I said in the msg that started this thread, I too re-found a fondness for the fuzzy scores when using the same UI you're using.  I'm not sure how much that has to do with the reality of the scoring, and how much to do with the UI, but we have lots of test results saying that chi-combining does objectively better *if* you need to pick your cutoffs in advance.  If you're happy to live with fuzzy and shifting cutoffs (which is allowed, but I expect will be as much a minority position as my position that I'd rather have false positives than false negatives), the all-default scheme may work just as well.
> And since getting whacked by Tim for lack of intellectual rigor and
> laziness is familiar, rousing, but not all that much fun, I think
> I'll go back to doing systems engineering and just _use_ the results.

I think you're confusing me with the geometric mean of "the MREC group" <0.9 wink>, Sean.  I simply challenged you to use a scheme before strongly dissing it.

>> A UI that makes most sense under a middle-ground scheme would
>> shuffle "I'm pretty sure it's spam" into a Spam folder, and the
>> ones it knows it's confused about into an Unsure folder.  The
>> rest ("I'm pretty sure it's ham") would be left in your inbox.
>> We can't really do that with the default combining scheme because
>> about half your email would end up in the Unsure folder -- the
>> separation it makes between populations is too fuzzy.

> Depends on whether one wants the machine making the decision, or wants
> help making the decision oneself.

I only want help, but shuffling msgs off to Spam and Unsure folders for later review is exactly the help I want.  For example, I don't want to be bothered *at all* with probable spam until I put my brain in "spam mode" and go on a mass-delete spree dedicated to reviewing probable spam.  Until then, I don't want it in my inbox at all.

> The ability to label a message, and then sort using the label is really
> handy if you spend your time classifying mail.

With respect to labels specifically measuring spamness, I expect we're destined never to agree on this.  I found sorting Outlook displays by Hammie score (yes, I've tried rescoring under all schemes) to be intellectually interesting, but not the way I'd want to work in real life.  Spam vs non-spam is a boring decision I want to spend as little time on as possible; personal email vs work email is an example of an interesting decision.

> ...
> I'm off to rescore all the mail in my Outlook folders, which
> takes about an hour on an Athlon 2000 XP.
This is something else to look into: scoring thru the Outlook wrappers is *much* slower than scoring msg-per-plain-text-file (which is what I do during large tests, which routinely score over 100,000 times).  I score about 80 msgs per second the latter way.  When scoring about 500 Outlook Inbox msgs, I take a much-needed bathroom break.

From nas@python.ca Thu Oct 17 22:36:09 2002
From: nas@python.ca (Neil Schemenauer)
Date: Thu, 17 Oct 2002 14:36:09 -0700
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <20021017213608.GA4467@glacier.arctrix.com>

Guido van Rossum wrote:
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

Why do we want to do it on the server at all?  I think having the DB and classifier on the client would work better.

  Neil

From rob@hooft.net Thu Oct 17 22:38:41 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 23:38:41 +0200
Subject: [Spambayes] Proposing to remove 4 combining schemes
References: 
Message-ID: <3DAF2DE1.5090404@hooft.net>

Tim Peters wrote:
> [Sean True]
>> If I'm not sure it's spam, I'd prefer a score that matched that.
>
> Under chi-combining, a score under .95 (as a rule of thumb so far) does mean
> "I'm not sure it's spam".  So quantifying this would be helpful.

My gut feeling says: under ideal combining, a score under .95 means "I'm less than 95% sure this is spam".

> I only want help, but shuffling msgs off to Spam and Unsure folders for
> later review is exactly the help I want.  For example, I don't want to be
> bothered *at all* with probable spam until I put my brain in "spam mode" and
> go on a mass-delete spree dedicated to reviewing probable spam.
> Until then, I don't want it in my inbox at all.

The first time I used SpamAssassin, I used it in label-only mode.  That gave some relief.  After using it for a month, I was confident enough to make a procmail rule to move spam into a spam folder without showing it to me.  I was amazed by the amount of rest that has created.  I did not realize that the spam was having such a psychological effect on me.  This is definitely what I'd want from spambayes.  I'd only read my "incoming ham".  Once a week I'd go into unsure mode, and do some selection work.  Once a month I can probably go into spam-curse mode, and do the mass deletion Tim talks about.  But Sean's "sort on score" idea is also very useful.  I think it'd speed up the manual scanning/deletion process.

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

From skip@pobox.com Thu Oct 17 22:43:27 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 17 Oct 2002 16:43:27 -0500
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <15791.12031.16824.172208@montanaro.dyndns.org>

    Guido> Neale's hammie client and server seem to me to be wasting some
    Guido> effort.  Currently, what happens, is:

    [current behavior]

    Guido> What would make more sense from the POV of minimizing traffic and
    Guido> minimizing work done in the server:

    [proposed behavior]

SpamAssassin has a spamc/spamd pair which works like hammiecli/hammiesvr.  The strongest reason I see to push all the processing into the server program is that hammiecli can degenerate into a little C program which will beat the pants off anything which starts up the Python interpreter.  Spamc has no Perl bits in it (though it does know about the headers spamd adds to the message).  I rather suspect that by simply changing the port to which spamc connects and simplifying the code executed after the message is returned, it could replace hammiecli.
Skip From rob@hooft.net Thu Oct 17 22:49:07 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 23:49:07 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAF3053.6040103@hooft.net> Tim Peters wrote: > [Rob] >>Did you ever try tim combining with (S-H+1)/2? > > > No, but it would be an excellent idea to try it with the current default > combining! tim-combining is unique in that its S is especially sensitive to > *low*-spamprob words, and its H to high-spamprob words; when something > really is spam, tim-combining isn't relying so much on having a high S value > as on having a low H value, so that the ratio S/(S+H) approaches 1. > Gary-combining is much more like chi-combining in these respects, and > chi-combining is where the (S-H+1)/2 reformulation helped. tim combining: -> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66 -> min 0.109175; median 12.3561; max 76.0553 -> fivepctlo 1.35543; fivepcthi 31.4327 -> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70 -> min 21.351; median 85.6889; max 99.8161 -> fivepctlo 64.4615; fivepcthi 98.8117 -> best cost for all runs: $110.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.625 -> fp 5; fn 16; unsure ham 35; unsure spam 187 -> fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02% default combining: -> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32 -> min 0.137212; median 27.2524; max 65.3836 -> fivepctlo 11.7696; fivepcthi 38.3897 -> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74 -> min 33.8547; median 74.3976; max 99.7559 -> fivepctlo 59.9773; fivepcthi 96.4292 -> best cost for all runs: $106.20 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.585 -> fp 5; fn 16; unsure ham 35; unsure spam 166 -> fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922% default combining with P-Q instead of (P-Q)/(P+Q): -> Ham 
scores for all runs: 16000 items; mean 21.49; sdev 8.73 -> min 0.123198; median 21.7049; max 68.8251 -> fivepctlo 7.34536; fivepcthi 35.6937 -> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00 -> min 29.348; median 79.2283; max 99.786 -> fivepctlo 61.9311; fivepcthi 97.3078 -> best cost for all runs: $103.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.615 -> fp 3; fn 16; unsure ham 37; unsure spam 250 -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32% It is all so close together in the final "cost" result that it is very difficult to judge from the statistics. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From neale@woozle.org Thu Oct 17 23:22:34 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 15:22:34 -0700 Subject: [Spambayes] Client/server model In-Reply-To: <20021017205208.GB14491@cthulhu.gerg.ca> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: So then, Greg Ward is all like: > If there are multiple client implementations, then spreading work across > clients also means duplicating code. Yuck. Based on my experience with > SA, I think I'd prefer a model like this: > > cli sends message headers > svr parses the headers > cli sends message body OR individual attachments > [ie. the protocol needs some state so the client can say, > "I'm sending you the headers now", or "I'm sending you the > entire body now", or "I'm sending one attachment now"] > svr parses the message body/attachments/whatever > cli tells the server what it wants: eg. 
"give me the > X-Hammie-Disposition header", or "give me just the score", or > "give me the top-N scoring words and their probabilities" > svr gives the client what it wants I'm not sure that the tokenizer would be too amenable to splitting the header from the body, although if someone can think of a way to do that, it certainly would rock my world, as it'd make this technique *way* more accessible to $FIRM's embedded product. But if you just want the score, you can do that. Easy squeezy:

    #! /usr/bin/env python
    import xmlrpclib
    import sys

    RPCBASE = "http://localhost:65000"

    msg = sys.stdin.read()
    x = xmlrpclib.ServerProxy(RPCBASE)
    m = xmlrpclib.Binary(msg)
    score = x.score(m)
    print "You get", score, "points."

You can even pass a second (true) argument to x.score to get back a list of the contributing words. I wrote hammiecli to show how easy it is to use hammiesrv. You don't have to do it my way though--feel free to write your own 6 lines of code :) Neale From neale@woozle.org Thu Oct 17 23:31:56 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 15:31:56 -0700 Subject: [Spambayes] Re: Client/server model In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com> References: <200210171829.g9HITxI21925@odiug.zope.com> Message-ID: So then, Guido van Rossum is all like: > Neale's hammie client and server seem to me to be wasting some > effort. Currently, what happens, is: > > [ server tokenizes ] I did it this way so you could write your own hammiecli as nothing more than an XML-RPC call. So, like, you could easily integrate hammie checking without having to know how to tokenize. And as I pointed out in another message, you can call the .score() method if you don't want the whole message back. I wrote hammiecli to run from my .procmailrc. > What would make more sense from the POV of minimizing traffic and > minimizing work done in the server: > > [ client tokenizes ] That makes sense too. It depends on how you're going to use the thing, I guess.
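The server side of that exchange is nearly as small. Here is a hypothetical sketch of a hammiesrv-style scorer, written against the modern stdlib module xmlrpc.server so it runs today; the toy score() heuristic is invented and merely stands in for the real classifier:

```python
# Minimal XML-RPC scoring server in the spirit of hammiesrv.
# Hypothetical sketch: the real hammiesrv's API and scoring differ.
from xmlrpc.server import SimpleXMLRPCServer


def score(msg_blob, include_clues=False):
    # Accept either an xmlrpclib.Binary (as hammiecli sends) or a plain string.
    text = msg_blob.data.decode("latin-1") if hasattr(msg_blob, "data") else msg_blob
    # Toy stand-in for the classifier: count SHOUTY tokens.
    shouty = [tok for tok in text.split() if tok.isupper() and len(tok) > 2]
    prob = min(1.0, 0.1 + 0.2 * len(shouty))
    # Like the client example above, a second true argument returns clues too.
    return (prob, shouty) if include_clues else prob


def serve(port=65000):
    # Register score() under the name the client calls, then block forever.
    server = SimpleXMLRPCServer(("localhost", port), logRequests=False)
    server.register_function(score, "score")
    server.serve_forever()
```

A procmail-driven client then needs nothing beyond the ServerProxy call shown earlier.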
So I'll make .score() and .filter() accept a tokenized list as well as a string. Then you can call me the Burger King, 'cause you can Have It Your Way. :^) Neale From popiel@wolfskeep.com Thu Oct 17 23:38:23 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 17 Oct 2002 15:38:23 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message from Rob Hooft of "Thu, 17 Oct 2002 23:38:41 +0200." <3DAF2DE1.5090404@hooft.net> References: <3DAF2DE1.5090404@hooft.net> Message-ID: <20021017223823.785E5F4CD@cashew.wolfskeep.com> In message: <3DAF2DE1.5090404@hooft.net> Rob Hooft writes: >Tim Peters wrote: >> [Sean True] > >>>If I'm not sure it's spam, I'd prefer a score that matched that. >> >> >> Under chi-combining, a score under .95 (as a rule of thumb so far) does mean >> "I'm not sure it's spam". So quantifying this would be helpful. > >My gut feeling says: under ideal combining, a score under .95 means "I'm >less than 95% sure this is spam". Ah, here's the basic problem... the final score we're generating has very little to do with a percentage, or any human concept of assurance. Heck, the final number isn't even a percentage of how much the message looks like ham or spam, since we're combining _those_ two numbers in very non-percentage-like ways. On the other hand, end users are quite likely to put inappropriate interpretations like this on the numbers, if they see them... so in any final presentation of this system, I'd _STRONGLY_ discourage showing the numbers. Just the three categories 'spam', 'ham', and 'unknown' should be sufficient. People who are not statisticians tend to make a lot of silly interpretations of numbers, particularly when those numbers are percentages (or look like percentages). If I tell people "I'm 75% sure these dice are loaded", the vast majority of them will expect that they will roll particular values 75% of the time.
(Translation to spambayes: for every message in some set of messages, a classifier says it's 75% sure that the message is spam... and people think that about 3/4 of those messages will be spam. As a simple disproof, consider if all the messages are identical.) People just don't grok that surety has very little to do with distribution of results. They also tend to go for all sorts of logical fallacies like a statement implying its converse, excluded middles, etc. The score we've got is just a number in the range 0 to 1 which has interesting discriminatory properties. It's not linear with any concept of surety, and it's not linear with similarity to spam or ham, either. People not immersed in how it's generated and/or buried in test results over decent sized corpora are sure (there's that troubling word again) to misinterpret it. >But Sean's "sort on score" idea is also very useful. I think it'd speed >up the manual scanning/deletion process. Having looked at the results from the show_unsure config option, I tend to disagree... position in the list doesn't seem to have any correlation with spam vs. ham. - Alex From nas@python.ca Thu Oct 17 23:54:49 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 17 Oct 2002 15:54:49 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF2DE1.5090404@hooft.net> References: <3DAF2DE1.5090404@hooft.net> Message-ID: <20021017225449.GA4778@glacier.arctrix.com> Rob Hooft wrote: > The first time I used SpamAssassin, I used it in label-only mode. That > gave some relief. After using it for a month, I was confident enough to > make a procmail rule to move spam into a spam folder without showing it > to me. I was amazed by the amount of rest that has created. I did not > realize that the spam was having such a psychological effect on me. That matches my experience with setting up a spam filter. After installation, I found it much easier to deal with messages in my inbox (mail not from lists). 
The psychological effect was larger than I had expected as well. I look at the spam mailbox last and only if I have the time and energy. Neil From agmsmith@rogers.com Fri Oct 18 00:04:05 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 19:04:05 EDT (-0400) Subject: [Spambayes] Sort on Score Usefulness for Manual Updates In-Reply-To: <3DAF2DE1.5090404@hooft.net> Message-ID: <25119245050-BeMail@CR593174-A> Rob Hooft wrote: > But Sean's "sort on score" idea is also very useful. I think it'd speed > up the manual scanning/deletion process. It does. The BeOS version adds an attribute with the spam score to each e-mail (each is a separate file in BeOS). It's then easy to show the attribute as an extra column in a normal directory window and then click on the heading to sort it. Then I can quickly junk the spam and also quickly spot the marginal ones (ham or spam that's close to the threshold). I then manually add those ones as examples to the database (right click on them, open-with the spam classifier program, it asks if they are spam or genuine, and that's it). Then I go back to sort by thread and read the mail. Relatively quick, and effective at keeping the database up to date. - Alex From guido@python.org Fri Oct 18 00:06:34 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 19:06:34 -0400 Subject: [Spambayes] Client/server model In-Reply-To: Your message of "Thu, 17 Oct 2002 15:22:34 PDT." References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> > I'm not sure that the tokenizer would be too amenable to splitting the > header from the body, although if someone can think of a way to do that, > it certainly would rock my world, as it'd make this technique *way* more > accessible to $FIRM's embedded product. The email package makes this a breeze AFAIK. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From agmsmith@rogers.com Fri Oct 18 00:09:02 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 19:09:02 EDT (-0400) Subject: [Spambayes] Client/server model In-Reply-To: Message-ID: <25415735222-BeMail@CR593174-A> Neale Pickett wrote: > I'm not sure that the tokenizer would be too amenable to splitting the > header from the body, although if someone can think of a way to do that, > it certainly would rock my world, as it'd make this technique *way* more > accessible to $FIRM's embedded product. That's a feature I've been asked for. Just classify by the header alone. The idea being that it would only download the header from the mail server, and immediately delete the message on the server if it looked like spam. I'm a bit nervous about implementing it in case it is a false positive and thus irretrievably deletes the message. - Alex From neale@woozle.org Fri Oct 18 00:52:55 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 16:52:55 -0700 Subject: [Spambayes] Client/server model In-Reply-To: <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> Message-ID: So then, Guido van Rossum is all like: > > I'm not sure that the tokenizer would be too amenable to splitting > > the header from the body, although if someone can think of a way to > > do that, it certainly would rock my world, as it'd make this > > technique *way* more accessible to $FIRM's embedded product. > > The email package makes this a breeze AFAIK. Yeah but I don't think anybody's done any tests to see if classifying on headers alone still gets good results. At least, I assume that's the reasoning behind the "send me the headers, and I'll tell you if I need the body" approach... 
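Guido's point about the email package can be made concrete. A sketch of splitting a raw message into headers and body parts, using the current email API (the 2002-era interface was close to this); the real tokenizer would of course do far more with each piece:

```python
# Split a raw RFC 2822 message into headers and body parts with the
# stdlib email package, per Guido's suggestion.  Sketch only.
import email


def split_message(raw_text):
    msg = email.message_from_string(raw_text)
    headers = list(msg.items())          # [(name, value), ...] pairs
    if msg.is_multipart():
        # Collect the payload of every leaf part.
        bodies = [part.get_payload() for part in msg.walk()
                  if not part.is_multipart()]
    else:
        bodies = [msg.get_payload()]
    return headers, bodies
```

A header-only classifier of the kind Alexander describes would simply ignore the second half of the return value.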
From tim.one@comcast.net Fri Oct 18 01:21:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:21:06 -0400 Subject: [Spambayes] Ham or spam? Message-ID: Running the classifier over the day's email load so far turned up one with a low (chi-combining) score of 0.01 (I'm skipping most of the header lines, so it's not the python.org clues driving it so low): """ Return-path: Path: news.baymountain.net!uunet!ash.uu.net!easynews.net!newsfeed3.easynews.net!mango.news.easynet.net!easynet.net!proxad.net!feeder2-1.proxad.net!news2-1.free.fr!not-for-mail Received: from bright08.icomcast.net (bright08-qfe0.icomcast.net [172.20.4.65]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 0.8 (built May 13 2002)) with ESMTP id <0H45002HAFBN1S@msgstore01.icomcast.net> for tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu, 17 Oct 2002 19:16:35 -0400 (EDT) Received: from mtain02 (bright-LB.icomcast.net [172.20.3.155]) by bright08.icomcast.net (8.11.6/8.11.6) with ESMTP id g9HNGlG22324 for <@msgstore01.icomcast.net:tim.one@comcast.net>; Thu, 17 Oct 2002 19:16:47 -0400 (EDT) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain02.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.4 (built Aug 5 2002)) with ESMTP id <0H45004FMFBH05@mtain02.icomcast.net> for tim.one@comcast.net (ORCPT tim.one@comcast.net); Thu, 17 Oct 2002 19:16:29 -0400 (EDT) Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) id 182JsN-00070C-00; Thu, 17 Oct 2002 19:16:23 -0400 X-Trace: 1034896234 news2-1.free.fr 1396 62.212.104.101 Date: Fri, 18 Oct 2002 01:10:19 +0200 From: Meles MELES Subject: Barre de progression Sender: python-list-admin@python.org To: python-list@python.org Errors-to: python-list-admin@python.org Message-id: <5jfnoa.94c.ln@farfadet.home.org> Organization: Guest of ProXad - France X-Complaints-to: abuse@proxad.net MIME-version: 1.0 Content-type:
text/plain; charset=iso-8859-15 Content-transfer-encoding: 8BIT NNTP-posting-date: 18 Oct 2002 01:10:34 MEST Precedence: bulk X-BeenThere: python-list@python.org User-Agent: KNode/0.7.1 Newsgroups: comp.lang.python Lines: 18 NNTP-posting-host: 62.212.104.101 X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: General discussion list for the Python programming language Xref: news.baymountain.net comp.lang.python:186832 Bonsoir à tous, je suis à la recherche d'exemple d'implémentation d'une barre de progression en mode console (ou, à défaut en mode graphique) un peu du style de celle de urpmi lors de l'installation d'un paquet avec la mandrake. Si en plus, à la fin de celle ci, le pourcentage de travail effectué pouvait s'afficher, ce serai le bonheur. L'idéal pour moi serait de voir un code tout fait pour m'en inspirer, au pire un peu de docs derait l'affaire. Cordialement """ I read French well enough to know she's asking for directions to the nearest alligator breeding museum, but what I don't know is whether it's ham or spam. Whaddyathink? It's the only judgment made today that isn't obviously correct (nor incorrect, for that matter) to my English eyes. From guido@python.org Fri Oct 18 01:43:33 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 20:43:33 -0400 Subject: [Spambayes] Ham or spam? In-Reply-To: Your message of "Thu, 17 Oct 2002 20:21:06 EDT." References: Message-ID: <200210180043.g9I0hYS12884@pcp02138704pcs.reston01.va.comcast.net> > Bonsoir à tous, > je suis à la recherche d'exemple d'implémentation d'une barre de progression > en mode console (ou, à défaut en mode graphique) un peu du style de celle de > urpmi lors de l'installation d'un paquet avec la mandrake. Si en plus, à la > fin de celle ci, le pourcentage de travail effectué pouvait s'afficher, ce serai le bonheur.
> > L'idéal pour moi serait de voir un code tout fait pour m'en inspirer, au > pire un peu de docs derait l'affaire. > > Cordialement > """ > > I read French well enough to know she's asking for directions to the > nearest alligator breeding museum, but what I don't know is whether > it's ham or spam. Whaddyathink? It's the only judgment made today > that isn't obviously correct (nor incorrect, for that matter) to my > English eyes. Definitely ham; she's asking how to implement a progress bar in the style of Mandrake's urpmi tool. Bonus points for showing the percentage of work done. Now, is a score of .01 typically ham or spam? --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Fri Oct 18 01:43:52 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:43:52 -0400 Subject: [Spambayes] Client/server model In-Reply-To: <25415735222-BeMail@CR593174-A> Message-ID: [Alexander G. M. Smith] > That's a feature I've been asked for. Just classify by the header > alone. The idea being that it would only download the header from > the mail server, and immediately delete the message on the server > if it looked like spam. I'm a bit nervous about implementing it > in case it is a false positive and thus irretrievably deletes the > message. I'd be very nervous about that. You may want to ask Eric Raymond if he got anywhere with this -- at one time he intended to set up a "header score server" in connection with, or as an offshoot of, his bogofilter project. From tim.one@comcast.net Fri Oct 18 01:59:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:59:32 -0400 Subject: [Spambayes] Client/server model In-Reply-To: Message-ID: [Neale Pickett] > Yeah but I don't think anybody's done any tests to see if classifying on > headers alone still gets good results. A while back I reported on an experiment that looked only at Subject lines: no other headers, and nothing in the body.
It did very heavy tokenization of subject lines (word unigrams, and word bigrams, and folding case, and preserving case, and splitting on whitespace, and sucking out alphanumeric runs, and tokenizing runs of pure punctuation). Using the default combining, the bottom line was -> best cutoff for all runs: 0.575 -> with weighted total 10*65 fp + 486 fn = 1136 -> fp rate 0.325% fn rate 3.47% That's much worse than we do by taking the body into account too, but in absolute terms it's not too shabby! Staring at the results caused me to add the least likely part of that gimmick to our regular tokenizer: generating tokens for runs of pure punctuation in Subject lines. It's obvious in retrospect: spam often has over-the-top PUNCTUATION!!! $$$$$$, and the one that delighted me the most was long runs of blanks. Those come from Subject lines that stuff a short random string at the end of the line to fool dumb filters, separated from the ***SCREAMING PART*** by a long run of blanks. I added one to the Subject line here for illustration . From tim.one@comcast.net Fri Oct 18 02:15:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 21:15:58 -0400 Subject: [Spambayes] Ham or spam? In-Reply-To: <200210180043.g9I0hYS12884@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Tim, displays his profound knowledge of French] [Guido] > Definitely ham; she's asking on how to implement a progress bar in the > style of Mandrake's urpmi tool. Bonus points for showing the > percentage of work done. Thanks! > Now, is a score of .01 typically ham or spam? Under chi-combining (which was used here), it's in the "I'm sure it's ham" range. The median score for ham under chi-combining is too small to express in 2 digits, though, so 0.01 is a fairly large ham score, indicating slight uncertainty. 
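The heavy subject-line tokenization Tim described a few messages up (unigrams, bigrams, both cases, whitespace splitting, alphanumeric runs, punctuation runs) might look roughly like this; a guess at the shape from his description, not the actual spambayes tokenizer:

```python
# Toy multi-view subject tokenizer (sketch only; spambayes differs).
import re


def tokenize_subject(subject):
    tokens = []
    words = subject.split()                     # split on whitespace
    for w in words:
        tokens.append("subj:" + w)              # case-preserving unigram
        if w.lower() != w:
            tokens.append("subj:" + w.lower())  # case-folded unigram
    for a, b in zip(words, words[1:]):          # word bigrams
        tokens.append("subj:%s %s" % (a, b))
    for run in re.findall(r"[A-Za-z0-9]+", subject):
        tokens.append("subj:alnum:" + run)      # alphanumeric runs
    for run in re.findall(r"[^\sA-Za-z0-9]+", subject):
        tokens.append("subj:punc:" + run)       # runs of pure punctuation
    return tokens
```

A subject like "FREE!!! money" would yield, among others, a punctuation-run token for the "!!!", which is exactly the kind of clue Tim found useful enough to fold into the regular tokenizer.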
Since starting this on my own email, there is (exactly) one spam (of 843 so far) that's scored under 0.05, which is the high end of chi's "I'm sure it's ham" range: """ python, A friend of yours, Michael (michael_suswanto@yahoo.com) thought you might like to check out this web page. http://www.newmarketingsite.com/2848/ -- The coolest site in town """ SpamAssassin didn't stop this either, but did find strong spam clues in the headers that we're ignorant about: X-Spam-Status: No, hits=4.2 required=5.0 tests=FROM_NAME_NO_SPACES,FROM_BIGISP,NO_REAL_NAME,FORGED_YAHOO_RCVD X-Spam-Level: **** On the other side, no ham so far has scored in the "I'm sure it's spam" range. My highest-scoring ham is at 0.76 (in chi's "middle ground" range), and is a short "happy new year" msg left over from January, from a friend I exchange email with once every 3 years. From tim.one@comcast.net Fri Oct 18 06:54:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 01:54:20 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message-ID: I removed the 4 schemes in question. The log msg is attached, as this affected lots of code (mostly in an "it's gone" sense). If anyone has a real use for use_tim_combining, speak up, else I expect to drop that too (it was really another attempt to get a better middle ground, but chi-combining beats it for that). Modified Files: Options.py README.txt TestDriver.py classifier.py Removed Files: clgen.py clpik.py rmspik.py Log Message: Removed 4 combining schemes: use_central_limit use_central_limit2 use_central_limit3 use_z_combining The central limit schemes aimed at getting a useful middle ground, but chi-combining has proved to work better for that. The chi scheme doesn't require the troublesome "third training pass" either.
z-combining was more like chi-combining, and worked well, but not as well as chi-combining; z-combining proved vulnerable to "cancellation disease", to which chi-combining seems all but immune. Removed supporting option zscore_ratio_cutoff. Removed various data attributes of class Bayes, unique to the central limit schemes. __getstate__ and __setstate__ had never been updated to save or restore them, so old pickles will still work fine. Removed method Bayes.compute_population_stats(), which constituted "the third training pass" unique to the central limit schemes. There's scant chance this will ever be needed again, since it was never clear how to make the 3-pass schemes practical over time. Gave the still-default combining scheme's method the name gary_spamprob, and made spamprob an alias for that by default. This allows naming each combining scheme explicitly in case you want to test using more than one (the others are named tim_spamprob and chi2_spamprob). In gary_spamprob, simplified the scaling of (P-Q)/(P+Q) into 0 .. 1, replacing the whole shebang with P/(P+Q). Same result, but a little faster. Removed files clgen.py, clpik.py, and rmspik.py. These were data generation and analysis tools unique to the central limit schemes. From tim.one@comcast.net Fri Oct 18 07:59:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 02:59:41 -0400 Subject: [Spambayes] 5% points in statistics In-Reply-To: <3DAEB8A7.6010807@hooft.net> Message-ID: Inspired by Rob's patch, there's a new option: """ [TestDriver] # Histogram analysis also displays percentiles. For each percentile p # in the list, the score S such that p% of all scores are <= S is given. # Note that percentile 50 is the median, and is displayed (along with the # min score and max score) independent of this option.
percentiles: 5 25 75 95 """ Example output from the starts of histogram displays: -> Ham scores for all runs: 100 items; mean 6.23; sdev 16.47 -> min 2.51688e-008; median 0.19102; max 85.9665 -> percentiles: 5% 0.000538997; 25% 0.0281789; 75% 2.81561; 95% 45.2147 -> Spam scores for all runs: 100 items; mean 99.97; sdev 0.26 -> min 97.3715; median 100; max 100 -> percentiles: 5% 99.9512; 25% 100; 75% 100; 95% 100 From that alone you can deduce that this tiny 10-fold cv run using chi-combining nailed all the spam (min spam score was over 95), nailed at least 75% of the ham (75% of all ham scores were under 2.82 < 5), and that no ham scored in the spam zone (max ham score was < 86). BTW, it's a curious thing that *all* schemes have been better at nailing spam than ham with very little training data, going all the way down to training on just one of each. I still don't know where the cutoff point is in my data (i.e., by the time I run my fat test, the roles are reversed: it's better at nailing ham than spam). From msergeant@startechgroup.co.uk Fri Oct 18 10:45:19 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: 18 Oct 2002 10:45:19 +0100 Subject: [Spambayes] Client/server model In-Reply-To: <11128972051-BeMail@CR593174-A> References: <11128972051-BeMail@CR593174-A> Message-ID: <1034934319.23385.26.camel@felony.int.star.co.uk> ---------------------- multipart/signed attachment On Thu, 2002-10-17 at 20:10, Alexander G. M. Smith wrote: > Guido van Rossum wrote: > > > I'd want the server to do tokenization for consistency reasons. > > > Particularly if you are also spam filtering news articles and not > > > just e-mail messages. > > > > I don't understand this. > > So that everybody tokenizes the incoming messages in the same way, > particularly the same way as that used earlier during training. What does it matter?
The worst thing that happens is that the client gets the wrong answer back, in which case it's a good excuse to get the client upgraded ;-) > Also, I'd have the server keep track of spam from other sources, > such as UseNet news. Is there anywhere else where spam messages > show up that might need to be included, or is it just mail and > news? I'm waiting for spammers to start spamming web based forums. It's probably harder than usenet since most have local moderation systems in place, but I suspect it's only a matter of time. Matt. ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 232 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021018/c81ca11c/attachment.bin ---------------------- multipart/signed attachment-- From msergeant@startechgroup.co.uk Fri Oct 18 10:52:21 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: 18 Oct 2002 10:52:21 +0100 Subject: [Spambayes] Client/server model In-Reply-To: <20021017205208.GB14491@cthulhu.gerg.ca> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: <1034934741.23383.34.camel@felony.int.star.co.uk> ---------------------- multipart/signed attachment On Thu, 2002-10-17 at 21:52, Greg Ward wrote: > On 17 October 2002, Guido van Rossum said: > > Neale's hammie client and server seem to me to be wasting some > > effort. Currently, what happens, is: > > > > cli sends the entire message to svr > > > > svr parses and scores the message > > svr inserts the X-Hammie-Disposition header in the message > > svr sends the message, thus modified, back > > > > cli prints the returned, modified, message to stdout > > Arrggh. That's exactly how SpamAssassin's spamc/spamd work, and it's > a pain-in-the-ass for anyone who wants to access spamd in an unusual > way.
FWIW this was fixed for Ask Bjorn Hansen, who wanted to be able to use spamd "in an unusual way". Here's how his spamassassin plugin does it for qpsmtpd: print "REPORT_IFSPAM SPAMC/1.0" to spamd's socket. print the message to spamd's socket. shutdown the sending end of the socket. get *just* the spam headers back from the socket (that's all that is sent). All done. See http://cvs.perl.org/viewcvs/qpsmtpd/plugins/spamassassin?rev=1.2&content-type=text/vnd.viewcvs-markup (though I'm not even sure I like that mechanism, but I digress). ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 232 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021018/a586400f/attachment.bin ---------------------- multipart/signed attachment-- From skip@pobox.com Fri Oct 18 14:16:59 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 18 Oct 2002 08:16:59 -0500 Subject: [Spambayes] timcv.py? Message-ID: <15792.2507.448518.596654@montanaro.dyndns.org> I've been busy with other stuff for a couple weeks and have only vaguely noticed all the changes happening. I've been using a somewhat simplified version of Neale's runtest.sh script. Is timcv.py still the core program used for testing? Skip From tim.one@comcast.net Fri Oct 18 17:07:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 12:07:26 -0400 Subject: [Spambayes] timcv.py? In-Reply-To: <15792.2507.448518.596654@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've been busy with other stuff for a couple weeks and have only > vaguely noticed all the changes happening. Then mostly you missed a whole bunch of experimental code in classifier.py getting tested and thrown away again. The code is more like it was now than it was when you tuned out. > I've been using a somewhat simplified version of Neale's runtest.sh > script.
Is timcv.py still the core program used for testing? That or mboxtest.py (depending on your data setup) remain our two cross-validation drivers, which I recommend. timtest.py remains a grid driver, and there's no reason to use it unless you want to set up brutal tests (train on a little data and predict against a lot, N**2-N times). From tim.one@comcast.net Fri Oct 18 19:43:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 14:43:33 -0400 Subject: [Spambayes] Combining combining schemes Message-ID: I mentioned earlier that chi-combining and gary-combining have quite different ideas about "how certain" they are on my extreme FP and FN. So I checked in some new options to allow us to play with that: """ [Classifier] # Use a weighted average of chi-combining and gary-combining. use_mixed_combining: False mixed_combining_chi_weight: 0.9 """ I ran my fat test just once (10-fold CV with 20,000 ham and 14,000 spam), making parameters up off the top of my head: """ [Classifier] use_mixed_combining: True mixed_combining_chi_weight: 0.9 [TestDriver] ham_cutoff: 0.10 spam_cutoff: 0.90 nbuckets: 200 """ The bottom line is that this particular combination of settings removed all(!) false negatives, left me with my 2 very hard FP, moved all other hard ham very solidly into the middle ground, and had an unsure rate under 1%: -> all runs false positives: 2 -> all runs false negatives: 0 -> all runs unsure: 226 -> all runs false positive %: 0.01 -> all runs false negative %: 0.0 -> all runs unsure %: 0.664705882353 -> all runs cost: $65.20 The histogram analysis found that it was possible to reduce the total middle ground to 20 (out of 34,000!) 
messages at the cost of biting 3 FN: -> best cost for all runs: $27.00 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 3 cutoff pairs -> smallest ham & spam cutoffs 0.5 & 0.75 -> fp 2; fn 3; unsure ham 12; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588% -> largest ham & spam cutoffs 0.5 & 0.76 -> fp 2; fn 3; unsure ham 12; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588% I can't make more time for this right now, but I think there's clearly potential worth pursuing. -> Ham scores for all runs: 20000 items; mean 2.81; sdev 2.92 -> min 0.121417; median 2.54101; max 96.5433 -> percentiles: 5% 1.68334; 25% 2.20207; 75% 2.89507; 95% 3.54761 * = 111 items 0.0 6 * 0.5 41 * 1.0 420 **** 1.5 2355 ********************** 2.0 6526 *********************************************************** 2.5 6743 ************************************************************* 3.0 2789 ************************** 3.5 568 ****** 4.0 120 ** 4.5 71 * 5.0 44 * 5.5 21 * 6.0 23 * 6.5 14 * 7.0 17 * 7.5 8 * 8.0 9 * 8.5 12 * 9.0 6 * 9.5 4 * 10.0 7 * 10.5 10 * 11.0 7 * 11.5 10 * 12.0 7 * 12.5 9 * 13.0 5 * 13.5 3 * 14.0 3 * 14.5 4 * 15.0 3 * 15.5 7 * 16.0 3 * 16.5 7 * 17.0 1 * 17.5 5 * 18.0 4 * 18.5 0 19.0 0 19.5 4 * 20.0 3 * 20.5 3 * 21.0 2 * 21.5 3 * 22.0 1 * 22.5 2 * 23.0 1 * 23.5 3 * 24.0 2 * 24.5 3 * 25.0 0 25.5 1 * 26.0 5 * 26.5 0 27.0 2 * 27.5 3 * 28.0 3 * 28.5 1 * 29.0 3 * 29.5 1 * 30.0 1 * 30.5 0 31.0 1 * 31.5 2 * 32.0 0 32.5 2 * 33.0 3 * 33.5 1 * 34.0 2 * 34.5 0 35.0 0 35.5 1 * 36.0 2 * 36.5 2 * 37.0 1 * 37.5 2 * 38.0 0 38.5 2 * 39.0 1 * 39.5 1 * 40.0 2 * 40.5 1 * 41.0 1 * 41.5 0 42.0 2 * 42.5 2 * 43.0 1 * 43.5 0 44.0 2 * 44.5 0 45.0 0 45.5 2 * 46.0 0 46.5 1 * 47.0 2 * 47.5 0 48.0 0 48.5 2 * 49.0 3 * 49.5 3 * 50.0 1 * A resume from a "an experienced engineer/mathematician/modeler who has built models and done computational mathematics in Python". 
50.5 0 51.0 3 * TOOLS Europe '99 conference announcement A word-free post listing 3 URLs; we've argued before about whether it's ham or spam; I think it's ham Someone posting a reply they got from MSN Hotmail Customer support in response to a complaint about fetish porn spam on c.l.py 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 1 * "If you are interested in saving money ..." 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 1 * questions about the job and real estate markets in France 60.0 1 * HTML "Please unsubscribe me" 60.5 0 61.0 0 61.5 0 62.0 1 * asking for advice on how to break into others' computers 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 1 * long emotional msg the day after the 911 terrorist attack 69.5 0 70.0 0 70.5 0 71.0 0 71.5 1 * Job announcement from Industrial Light & Magic. Hurt in part because split-on-whitespace left "Python-savvy" as one word. 72.0 0 72.5 0 73.0 1 * asking for help with a webmaster-ish program; it's in the middle ground of both schemes: prob('*gary_score*') = 0.532758 prob('*chi_score*') = 0.751966 73.5 0 74.0 0 74.5 1 * inappropriate two-word "confirm 438765" followed by "Get Your Private, Free E-mail from ..."
75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 1 * lady with the long, obnoxious employer-generated sig; gary-combining looks on this one much more kindly (but still outside a reasonable middle ground for it); chi is only slightly unsure prob('*gary_score*') = 0.597568 prob('*chi_score*') = 0.986116 prob('*H*') = 0.0277634 prob('*S*') = 0.999996 prob('*Q*') = 0.542133 prob('*P*') = 0.805009 95.0 0 95.5 0 96.0 0 96.5 1 * Nigerian scam quote gary-combining again has a much milder judgment, but chi is off the charts prob = 0.965433332477 prob('*gary_score*') = 0.654334 prob('*chi_score*') = 1 prob('*H*') = 7.07788e-008 prob('*S*') = 1 prob('*Q*') = 0.466239 prob('*P*') = 0.882573 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all runs: 14000 items; mean 98.32; sdev 1.55 -> min 31.4614; median 98.3667; max 99.9601 -> percentiles: 5% 97.1931; 25% 97.9872; 75% 98.7541; 95% 99.657 Note that > 95% of spam scored higher than the Nigerian "ham"! (its score is lower than spam's 5-percentile score) * = 76 items ... [all 0] ...
30.5 0 31.0 1 * "Hello, my Name is BlackIntrepid" prob = 0.314614377139 prob('*gary_score*') = 0.480559 prob('*chi_score*') = 0.296176 prob('*H*') = 0.930885 prob('*S*') = 0.523237 prob('*Q*') = 0.684254 prob('*P*') = 0.633036 31.5 0 32.0 0 32.5 0 33.0 0 33.5 1 * uuencoded text body we throw away unlooked at 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 1 * giant base64-encoded text file; gary- and chi- both score it near 0.50 49.5 0 50.0 1 * Website Programmers Available Now!; full of tech talk 50.5 2 * webmaster link directory the spam with dozens of killer spam clues hiding in meta tags we don't look at 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 1 * 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 1 * 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 1 * 74.0 0 74.5 0 75.0 0 75.5 0 76.0 1 * 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 1 * 83.5 0 84.0 0 84.5 1 * 85.0 2 * 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 1 * 89.0 3 * 89.5 1 * 90.0 2 * 90.5 1 * 91.0 1 * 91.5 0 92.0 16 * 92.5 3 * 93.0 3 * 93.5 2 * 94.0 2 * 94.5 6 * 95.0 6 * 95.5 20 * 96.0 76 * 96.5 269 **** 97.0 838 ************ 97.5 2329 ******************************* 98.0 4600 ************************************************************* 98.5 3792 ************************************************** 99.0 1045 ************** 99.5 964 ************* From popiel@wolfskeep.com Sat Oct 19 05:44:50 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Fri, 18 Oct 2002 21:44:50 -0700 Subject: [Spambayes] Mixed combining Message-ID: <20021019044450.73D33F5B4@cashew.wolfskeep.com> I did two runs of the mixed combining. Data is not yet indexed on my website; perhaps tomorrow. By my results, mixed spamprob is effectively neutral compared to straight chi-squared. The best cost is better, but how to achieve those costs is no clearer than before. The fp & fn counts are lower, but at a cost of about half again more unsures. I guess it all depends on how you assign your costs. Anyway, here's the tables: Mixed, .9 chi-squared, 0.10-0.90 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 3 3 3 3 2 2 fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 fn total: 5 6 4 5 6 7 9 fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 unsure t: 46 44 45 42 52 51 52 unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 Mixed, .9 chi-squared, 0.05-0.95 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] 
-> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 2 2 2 2 2 1 fp %: 0.40 0.27 0.20 0.16 0.13 0.11 0.05 fn total: 4 4 3 3 3 3 4 fn %: 0.20 0.23 0.20 0.24 0.30 0.40 0.80 unsure t: 69 71 70 73 83 82 89 unsure %: 2.76 2.84 2.80 2.92 3.32 3.28 3.56 real cost: $37.80 $38.20 $37.00 $37.60 $39.60 $39.40 $31.80 best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 And, for reference, pure chi-squared, 0.05-0.95 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 3 3 3 2 2 2 fp %: 0.40 0.40 0.30 0.24 0.13 0.11 0.10 fn total: 5 6 4 5 6 7 9 fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 unsure t: 49 44 49 46 54 58 53 unsure %: 1.96 1.76 1.96 1.84 2.16 2.32 2.12 real cost: $34.80 $44.80 $43.80 $44.20 $36.80 $38.60 $39.60 best cost: $28.60 $28.40 $34.00 $35.60 $34.60 $30.60 $28.60 h mean: 1.31 0.58 0.50 0.46 0.51 0.48 0.36 h sdev: 8.51 6.47 6.46 6.25 6.44 6.12 4.97 s mean: 99.25 98.92 98.60 98.17 97.25 96.73 94.66 s sdev: 6.75 8.05 9.04 10.76 13.47 14.49 18.20 mean diff: 97.94 98.34 98.10 97.71 96.74 96.25 94.30 k: 6.42 6.77 6.33 5.74 4.86 4.67 4.07 Enjoy. - Alex From tim.one@comcast.net Sat Oct 19 06:03:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:03:42 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF183A.8070600@hooft.net> Message-ID: [Tim] >> But what's the point? By your own cost measure, it didn't do >> you any good, and in fact it raised your FP rate by the time you >> got to 4. 
[Rob Hooft] > There was some discussion about the judgments being too strict. I was > trying to find a statistically sound way to reduce correlations such > that results would be less sure. > > I explained that here: > > >>What this does statistically is downweighting all clues thereby taking > >>care of a "standard" correlation between clues. I understood that was your intent. That's why I asked what the point in the *end* was. By your own cost measure, it didn't help; etc -- at best it was neutral, and taking the results at face value it hurt a little. >> Fudging H, S and n introduces strange biases, because the info >> you're feeding into chi2Q no longer follows a chi-squared >> distribution after fudging, and chi2Q may as well be some form >> of biased random-number generator then. > That is not exactly true. What I am assuming is that if there is one > clue in a message that says 0.8, there are probably more of those. That > is the correlation we're discussing. A clue rarely comes alone. Effect > of that is that my joke messages with "From: xxx@yyy (by way of > ppp@qqq)" gets a very strong and repeated signal from the From: line, Except that those correlations act in your favor, correct? You called the bad jokes ham, and their very high spam scores relative to the rest of your ham suggested that these correlated "From" clues were the only things saving them from being false positives. > and your filtered mailman list is much too sure about hamminess. Yes, and if we don't strip HTML tags, every scheme has been much too sure about spaminess. But these last two are the only two cases I've seen where correlations were harmful: all other cases I've looked at are akin to your "bad joke" example, where correlations were helpful in making the correct decision. > This is solved by my hack: it practically divides the number of clues by > 2 or 4. I'm not sure what it does, but until it demonstrably helps something I'm not keen to pursue it. 
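The exchange above keeps referring to H, S, chi2Q, and the final score without ever restating the formulas in one place. Here is a minimal sketch of chi-combining as described in this thread; the function names are mine, and the real classifier additionally clamps word probabilities away from 0 and 1 and handles empty clue lists, which this sketch omits:

```python
import math

def chi2Q(x2, v):
    # P(chi-squared with v degrees of freedom >= x2); v must be even.
    # This is the standard closed-form series for even degrees of freedom.
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    # H and S measure how non-random the clue probabilities look from the
    # ham and spam sides; the final score is (S - H + 1) / 2, so a message
    # where both indicators fire lands near the 0.5 middle ground.
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

The chi-squared connection is that -2 times the sum of logs of n uniform random probabilities follows a chi-squared distribution with 2n degrees of freedom, which is why fudging H, S, or n (as debated above) breaks the statistical interpretation.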
Note that the new use_mixed_combining gimmick can be used to get any scoring behavior between default-combining and chi-combining, using a simple weighted average of two schemes that have been widely tested with good results. > ... > --- chi2.py 16 Oct 2002 21:31:19 -0000 1.7 > +++ chi2.py 17 Oct 2002 20:04:17 -0000 > @@ -145,7 +145,7 @@ > > for i in range(5000): > ps = [random() for j in range(50)] > - s1, h1, score1 = judge(ps + [bias] * warp) > + s1, h1, score1 = judge((ps + [bias] * warp)*4) > s.add(s1) > h.add(h1) > score.add(score1) > > (i.e. adding correlated data points) Results in: > > Result for random vectors of 50 probs, + 0 forced to 0.99 > > H 5000 items; mean 0.47; sdev 0.38 > -> min 1.26528e-11; median 0.444004; max 1 > -> fivepctlo 0.000293787; fivepcthi 0.999102 > * = 19 items > 0.00 1125 ************************************************************ > 0.05 291 **************** > 0.10 230 ************* > 0.15 182 ********** > 0.20 157 ********* > 0.25 146 ******** > 0.30 119 ******* > 0.35 135 ******** > 0.40 129 ******* > 0.45 121 ******* > 0.50 120 ******* > 0.55 131 ******* > 0.60 128 ******* > 0.65 152 ******** > 0.70 128 ******* > 0.75 167 ********* > 0.80 172 ********** > 0.85 208 *********** > 0.90 239 ************* > 0.95 920 ************************************************* > .. > So: chi2 will be fairly sure even about random data if it is correlated. Well, random data isn't correlated , but I see what you mean and it is an interesting point. Whether it's of general importance (as opposed to in the few special cases we've identified by staring at the tiny minority of mistakes) to the ham-vs-spam problem I don't know. I *expect* a related question is why split-on-whitespace works better than searching for alphanumeric runs. When staring at one of his own false positives, Guido complained that, e.g., "hotels" and "hotels," were counted as two distinct clues. And s-o-w routinely creates lots of highly correlated word combinations. 
But staring only at mistakes gives no insight into what *works*, and I suspect that, e.g., counting "Python" and "Python?" and "Python." and "Python," (etc) as distinct clues actively helps my c.l.py ham. Ditto counting "erection" and "Viagra" as distinct, and "Nigeria" and "Nigerian", etc. Recall that we also had Matt Sergeant's testimony that lemmatization harmed performance in his Bayesian classifier, and one clear effect of lemmatization is to reduce the number of highly correlated features. So I'm not willing to believe that reducing correlation is a sensible goal in this task without strong experimental evidence to back it up; so far, all we have is indirect evidence about that, but to the extent that applies, it's not supporting the thesis that correlation is generally a bad thing here. From tim.one@comcast.net Sat Oct 19 06:10:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:10:17 -0400 Subject: [Spambayes] Mixed combining In-Reply-To: <20021019044450.73D33F5B4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > I did two runs of the mixed combining. Data is not yet indexed > on my website; perhaps tomorrow. > > By my results, mixed spamprob is effectively neutral compared to > straight chi-squared. The best cost is better, but how to achieve > those costs is no clearer than before. The fp & fn counts are > lower, but at a cost of about half again more unsures. I guess > it all depends on how you assign your costs. I've run some more experiments of my own, and I'm embarrassed to agree that indeed straight chi-squared did just as well, and that cutoffs got fuzzier under mixed combining, and that Yet Another Parameter to fiddle (the chi weight) was more Yet Another PITA (Parameter In The Ass) than anything else. Chalk it up to youthful enthusiasm -- I should follow my own advice and just give up on my two miserable FP. 
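For reference, the use_mixed_combining gimmick being retired here is nothing more exotic than a weighted average of the two schemes' scores; this one-liner is a sketch, with the default weight taken from the mixed_combining_chi_weight option quoted earlier:

```python
def mixed_score(chi_score, gary_score, chi_weight=0.9):
    # Weighted average of the two combining schemes' scores (both on [0, 1]).
    # chi_weight=0.9 matches the mixed_combining_chi_weight default above.
    return chi_weight * chi_score + (1.0 - chi_weight) * gary_score
```

With a weight of 0.9, even a gary score of 0.0 can only pull a chi score of 1.0 down to 0.9, which is consistent with the observation that the mix mostly nudged hard cases toward the unsure range rather than flipping decisions.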
> Anyway, here's the tables: > > Mixed, .9 chi-squared, 0.10-0.90 unsure: > -> tested 50 hams & 200 spams against 450 hams & 1800 spams > [...] > -> tested 200 hams & 50 spams against 1800 hams & 450 spams > ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 > fp total: 2 3 3 3 3 2 2 > fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 > fn total: 5 6 4 5 6 7 9 > fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 > unsure t: 46 44 45 42 52 51 52 > unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 > real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 > best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 > h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 > h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 > s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 > s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 > mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 > k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 This is a nice way to present summary info. Are these produced by your table2.py? If so, I know where to find that -- would you consider contributing it to the project? > ... From tim.one@comcast.net Sat Oct 19 06:20:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:20:24 -0400 Subject: [Spambayes] optimal max_discriminators for chi2 In-Reply-To: <3DAF2256.30509@hooft.net> Message-ID: [Rob Hooft] > I did a series of runs: > ========================= > [Classifier] > use_chi_squared_combining: True > robinson_minimum_prob_strength = 0.0 > robinson_probability_s = 0.45 > max_discriminators = XXXXXX > ... > With XXXXXX between 15 and 300. Attached are plots of the 95th > percentile ham, 5th percentile spam, and of the total cost vertical > against max_discriminators horizontal. Please note again that my ham is > much tighter than my spam: vertical scales are from 0 to 0.16 and from > 89 to 100, respectively (Almost a factor of 100!). The cost plot shows > "no trend at all", but the variation is not large. Thanks, Rob! 
Have you ever plotted the density of the number of "words" in your msgs? I did at one time but have forgotten the result; IIRC, a surprisingly large percentage didn't *have* 150 distinct words (but then I'm also using the default robinson_minimum_prob_strength, which renders a whole bunch of bland words invisible). The cost plot is disturbing, suggesting we're looking at random effects more than trends. Perhaps "best cost" is just too fickle a measure here, and it would be better to develop a measure of "average cost" across all cutoff pairs within the specified base (ham_cutoff, spam_cutoff) pair. > I'd almost conclude "anything goes", but based on the spam-5% value > I'd like to stick with values over ~40. This sounds sensible to me too, and my own data doesn't contradict it . I'll leave the default at 150 until there's a clear reason to change it. From tim.one@comcast.net Sat Oct 19 06:35:52 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:35:52 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF2DE1.5090404@hooft.net> Message-ID: [Rob Hooft] > My gut feeling says: under ideal combining, a score under .95 means "I'm > less than 95% sure this is spam". That may be close to what my gut feeling was at the start of this: that a spam score of p "should mean" that if I see N messages with that p value, N*p of those messages "should" be spam, and N*(1-p) of them "should be" ham. But over time I've come to believe that (an actual probability) would be a pretty useless measure in real life -- I want absolute certainty now <0.9 wink>. > ... > The first time I used SpamAssassin, I used it in label-only mode. That > gave some relief. After using it for a month, I was confident enough to > make a procmail rule to move spam into a spam folder without showing it > to me. I was amazed by the amount of rest that has created. I did not > realize that the spam was having such a psychological effect on me. 
This > is definitely what I'd want from spambayes. I'd only read my "incoming > ham". Once a week I'd go into unsure mode, and do some selection work. > Once a month I can probably go into spam-curse mode, and do the mass > deletion Tim talks about. I flipped-flopped on this until I actually rescored my mail using the mixed-combining gimmick. It was just plain annoying then to see 20 spam scoring 1.00 and 20 more scoring .99 and 20 more scoring .98, etc. Chi- and default- combining are both satisfying in their own way if you have to see the scores, the former because it does an excellent job of reflecting my own "certainty at a glance" in a vast majority of cases; I'm still not sure what's satisfying about the latter, but it's *something*. > But Sean's "sort on score" idea is also very useful. I think it'd speed > up the manual scanning/deletion process. Simulating, by hand, my ideal of "spam" and "unsure" folders, sorting on score is very helpful, but it's *also* very helpful then to see the actual scores. Especially in my Unsure folder, what's typical is that about 10% of the msgs are very close to the low end of the unsure range, and another 10% very close to the high end of the unsure range, and they're usually what you expect them to be (i.e., ham and spam, respectively). That polishes off a fifth of them very quickly. The rest are more puzzling (they're more solidly in the range where the scheme is very sure it's unsure <0.5 wink>), and I'm not sure the scores help at all then. The helpful part of that could be gotten via color-coding too, where "the helpful part" means segregating the "I'm unsure, but I think I have a good guess" parts at the ends, from the "I'm certain I'm lost" part in the middle. 
From tim.one@comcast.net Sat Oct 19 06:55:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:55:16 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF3053.6040103@hooft.net> Message-ID: [Tim, suggests that (S-H+1)/2 would be good to try with gary-combining] [Rob] > tim combining: > -> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66 > -> min 0.109175; median 12.3561; max 76.0553 > -> fivepctlo 1.35543; fivepcthi 31.4327 > -> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70 > -> min 21.351; median 85.6889; max 99.8161 > -> fivepctlo 64.4615; fivepcthi 98.8117 > -> best cost for all runs: $110.40 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.625 > -> fp 5; fn 16; unsure ham 35; unsure spam 187 > -> fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02% BTW, note that I killed this scheme off -- it was, at the time, trying to get a better middle ground, but chi-combining works better for that. 
> default combining: > -> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32 > -> min 0.137212; median 27.2524; max 65.3836 > -> fivepctlo 11.7696; fivepcthi 38.3897 > -> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74 > -> min 33.8547; median 74.3976; max 99.7559 > -> fivepctlo 59.9773; fivepcthi 96.4292 > -> best cost for all runs: $106.20 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.585 > -> fp 5; fn 16; unsure ham 35; unsure spam 166 > -> fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922% > > default combining with P-Q instead of (P-Q)/(P+Q): > -> Ham scores for all runs: 16000 items; mean 21.49; sdev 8.73 > -> min 0.123198; median 21.7049; max 68.8251 > -> fivepctlo 7.34536; fivepcthi 35.6937 > -> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00 > -> min 29.348; median 79.2283; max 99.786 > -> fivepctlo 61.9311; fivepcthi 97.3078 > -> best cost for all runs: $103.40 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.615 > -> fp 3; fn 16; unsure ham 37; unsure spam 250 > -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32% > > It is all so close together in the final "cost" result that it is very > difficult to judge from the statistics. Then let's take the stats at face value: these are large runs, so if it doesn't make a clear difference here, it's unlikely to make a clear difference anywhere. IIRC, you were inspired to try S-H under chi-combining by staring at mistakes where a modest S value was paired with a very low H value, leading to S/(S+H) approaching 1 despite that S was far from certain on its own. But gary-combining is much less extreme in both its S and H measures, so it's less of a *potential* problem there.
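Rob's tables compare default (gary) combining against a variant using P-Q in place of (P-Q)/(P+Q). As best I can reconstruct from this thread, the scheme works as sketched below; this is my reconstruction of Robinson's method from the discussion, not the project's exact code, and the real classifier clamps word probabilities away from 0 and 1:

```python
import math

def gary_score(probs):
    # P and Q are one minus the geometric means of (1-p) and p respectively
    # (computed via logs to avoid underflow on long messages); the final
    # score rescales the indicator S = (P-Q)/(P+Q) from [-1, 1] onto [0, 1].
    n = len(probs)
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0
```

The "P-Q instead of (P-Q)/(P+Q)" variant in the quoted stats would simply drop the P+Q normalization, i.e. return (1 + P - Q) / 2, which compresses scores toward the middle when both P and Q are small.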
It *may* account for the two FP that got redeemed in your last run, though -- knowing their internal S and H values would help (oops -- they're called P and Q inside the default scheme, but same thing). From tim.one@comcast.net Sat Oct 19 07:11:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 02:11:49 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <20021017223823.785E5F4CD@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > The score we've got is just a number in the range 0 to 1 which has > interesting discriminatory properties. It's not linear with any > concept of surety, and it's not linear with similarity to spam or > ham, either. Ah, but it is linear with 1 minus the probability that -2 times the natural log of the geometric mean of 1-p_i for a vector of random probabilities p would exceed 1 minus -2 times the natural log of the geometric mean of 1-p_i for the estimated spamprobs in the message, minus 1 minus the probability that -2 times the natural log of the geometric mean of p_i for a vector of random probabilities p would exceed 1 minus -2 times the natural log of the geometric mean of p_i for the estimated spamprobs in the message. > People not immersed in how it's generated and/or buried in test results > over decent sized corpora are sure (there's that troubling word again) > to misinterpret it. Even given a clear explanation like the above? I vote we put that in the user docs, and strongly imply that anyone to whom that isn't obvious from mere inspection is an idiot who deserves all the spam they get. [Rob] >> But Sean's "sort on score" idea is also very useful. I think it'd speed >> up the manual scanning/deletion process. [Alex] > Having looked at the results from the show_unsure config option, > I tend to disagree... position in the list doesn't seem to have > any correlation with spam vs. ham. Are you sure?
I've got a GUI that sorts email by "Hammie score" now, and there's a clear correlation by eyeball *adjacent to the endpoints* of the unsure range. The middle of the unsure range is a jumble, though, and predictably so since long messages suffering cancellation disease in particular predictably score very close to 0.5 under chi-combining. Where Graham-combining would score them at 0.0 or 1.0 depending on which flavor of clue just happened to appear more often, chi scores them more like 0.49999 or 0.50001. It's still a coin toss, but of an exceedingly tiny coin. From tim.one@comcast.net Sat Oct 19 07:17:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 02:17:01 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <20021017225449.GA4778@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > That matches my experience with setting up a spam filter. After > installation, I found it much easier to deal with messages in my inbox > (mail not from lists). The psychological effect was larger than I had > expected as well. I look at the spam mailbox last and only if I have > the time and energy. Heh. I find myself eagerly awaiting my next batch of spam now, just to see how it scores. When an hour goes by without new spam, I nervously check all the cables and start pinging my POP3 servers. You obviously need an attitude adjustment. From tim.one@comcast.net Sat Oct 19 08:56:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 03:56:56 -0400 Subject: [Spambayes] Scoring spam discussions Message-ID: Here's an interesting result: I've got a separate folder for my email discussing this project. It currently contains 1,316 msgs, and *none* of them have been trained on. Just for fun, I scored them with my growing-but-still-small "Tim's email" classifier. Result: almost all scored 0.00 under chi-combining, including the ones you expected would score as spam.
Non-zero scores: 0.01 7 0.02 6 0.03 1 0.04 5 ------------- 0.05 is a paranoid "I'm sure it's ham" chi cutoff 0.06 3 0.08 1 0.09 1 ------------- 0.10 is a conservative chi ham_cutoff 0.11 2 0.12 1 0.13 1 pvt 30KB email from someone including a full listing of all their FP on a run ------------- 0.30 is a fine chi ham_cutoff on my c.l.py data 0.40 1 strange brief note from an acm.org spam-filter developer 0.66 1 A msg from me, from PythonLabs email discussions that took place before any code was written. That last was a forwarded Asian spam, with a bunch of my comments, and the Subject line is: Subject: [PythonLabs] =?ks_c_5601-1987?B?Rlc6ICixpLDtKbDmwO+75yC48LTPxc24tSC8rbrxvbo=?= It turns out that the MIME structure in this msg is damaged, and the email package gave up after parsing the headers. The high spam score (which is nevertheless solidly in chi's middle ground, thanks to finding clues that I sent this msg), was mostly due to all the gibberish in the Subject line (Rob, avert your eyes ): 'subject:[' 0.206009 'subject:PythonLabs' 0.228589 'subject:-' 0.356645 'subject:?' 0.681345 'subject:1987' 0.844828 'subject:] =?' 0.844828 'subject:ks_c_5601' 0.844828 'subject:skip:7 20' 0.844828 'subject:skip:R 10' 0.844828 'subject:=?=' 0.978469 'subject:+' 0.980349 If I repair the MIME by hand, so that it sees my comments (as well as the forwarded spam), the chi score falls to 0.03. The forwarded spam in isolation scores 1.00. My comments in isolation broke the Pentium's underflow trap . damn-this-stuff-works-good-ly y'rs - tim From agmsmith@rogers.com Sat Oct 19 15:17:54 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sat, 19 Oct 2002 10:17:54 EDT (-0400) Subject: [Spambayes] Client/server model and Headers Only In-Reply-To: Message-ID: <5417381978-BeMail@CR593174-A> Tim Peters wrote: > [Alexander G. M. Smith] > > That's a feature I've been asked for. Just classify by the header > > alone. 
The idea being that it would only download the header from > the mail server, and immediately delete the message on the server > if it looked like spam. I'm a bit nervous about implementing it > in case it is a false positive and thus irretrievably deletes the > message. > > I'd be very nervous about that. You may want to ask Eric Raymond if he got > anywhere with this -- at one time he intended to set up a "header score > server" in connection with, or as an offshoot of, his bogofilter project. No answer yet, but I did add some mail-parsing code to my program that trains on and uses only the headers. It seems to work surprisingly well even though it is just using headers. My database file is now full of e-mail mailbox names from those spammers that use CC:, lots of message IDs, and IP addresses. About a third of the "words" start with a number. The example data is just slapped together from some recent messages I had (probably should try to get a longer history for the ham). Genuine (ham): 374 examples used in training. Spam: 401 examples. Genuine: 100 test messages which came in the last week and a half. Spam: 45 tests. Gary-combining method, simplistic word tokenizing. Genuines: .0862202 to .51863, all under the 0.56 threshold, zero false positives. Spams: .471454 to .726808, giving 23 false negatives under the 0.56 threshold, or 11 under 0.52. Summing up, it can get rid of half the spam by just looking at the headers. Applying the same messages to the full examination (whole e-mail text just examined for words, not parsed into parts, resulting database is 3X larger) I get: Genuine: .147655 to .830371, 6 false positives over the 0.56 threshold. Spam: .600419 to .993935, 0 false negatives. Hmmmm. I'll have to be more careful in selecting my example messages; I usually get better performance with my actual working database. Better tests needed, more later... - Alex From popiel@wolfskeep.com Sat Oct 19 18:32:46 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Sat, 19 Oct 2002 10:32:46 -0700 Subject: [Spambayes] Mixed combining In-Reply-To: Message from Tim Peters of "Sat, 19 Oct 2002 01:10:17 EDT." References: Message-ID: <20021019173246.592DCF4D6@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> Anyway, here's the tables: >> >> Mixed, .9 chi-squared, 0.10-0.90 unsure: >> -> tested 50 hams & 200 spams against 450 hams & 1800 spams >> [...] >> -> tested 200 hams & 50 spams against 1800 hams & 450 spams >> ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 >> fp total: 2 3 3 3 3 2 2 >> fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 >> fn total: 5 6 4 5 6 7 9 >> fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 >> unsure t: 46 44 45 42 52 51 52 >> unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 >> real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 >> best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 >> h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 >> h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 >> s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 >> s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 >> mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 >> k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 > >This is a nice way to present summary info. Are these produced by your >table2.py? If so, I know where to find that -- would you consider >contributing it to the project? Yes, it's produced by table2.py. Feel free to use it. I just created a sourceforge account for myself (username popiel), just in case you feel the urge to add me as a developer. - Alex From popiel@wolfskeep.com Sat Oct 19 18:37:45 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 19 Oct 2002 10:37:45 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message from Tim Peters of "Sat, 19 Oct 2002 02:11:49 EDT." 
References: Message-ID: <20021019173745.E948EF4D6@cashew.wolfskeep.com> In message: Tim Peters writes: >[Rob] >>> But Sean's "sort on score" idea is also very useful. I think it'd speed >>> up the manual scanning/deletion process. > >[Alex] >> Having looked at the results from the show_unsure config option, >> I tend to disagree... position in the list doesn't seem to have >> any correlation with spam vs. ham. > >Are you sure? I've got a GUI that sorts email by "Hammie score" now, and >there's a clear correlation by eyeball *adjacent to the endpoints* of the >unsure range. Well, no, I'm not sure. There was a bit more ham towards the bottom of the range, and a bit more spam towards the top... but I had high scoring ham and low scoring spam right near the endpoints, too. And to make it all worse, each section of the show_unsure output would only have 4-5 messages, so seeing the trends is hard. ;-) - Alex From popiel@wolfskeep.com Sun Oct 20 00:25:37 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 19 Oct 2002 16:25:37 -0700 Subject: [Spambayes] Mixed combining In-Reply-To: Message from "T. Alexander Popiel" of "Sat, 19 Oct 2002 10:32:46 PDT." <20021019173246.592DCF4D6@cashew.wolfskeep.com> References: <20021019173246.592DCF4D6@cashew.wolfskeep.com> Message-ID: <20021019232537.DDFEAF4D6@cashew.wolfskeep.com> In message: <20021019173246.592DCF4D6@cashew.wolfskeep.com> "T. Alexander Popiel" writes: > >Yes, it's produced by table2.py. Feel free to use it. Of course, if you do include it in the project, I'll feel obliged to submit patches to it to correct the horribly out of date (and incorrect) header comments. And clean it up in general. Bah. If people are going to actually _use_ it, I'll have to make it presentable, at least... 
- Alex From tim.one@comcast.net Sun Oct 20 07:03:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 20 Oct 2002 02:03:27 -0400 Subject: [Spambayes] Mixed combining In-Reply-To: <20021019173246.592DCF4D6@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel, on his nice table2.py] > Yes, it's produced by table2.py. Feel free to use it. I just created > a sourceforge account for myself (username popiel), just in case you > feel the urge to add me as a developer. I thought I would be delighted to -- and it turned out I was. Yet another correct prediction. You can cut your SF CVS teeth by adding table2.py to the project, if you like. If there are any problems dealing with the mechanics of remote CVS, feel free to ask (either here, or directly to me if you can afford to wait). Welcome again! From anthony@interlink.com.au Sun Oct 20 13:50:05 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sun, 20 Oct 2002 22:50:05 +1000 Subject: [Spambayes] expiration ideas. Message-ID: <200210201250.g9KCo5C03237@localhost.localdomain> Just thinking again about expiration, and wondering if the following would work: When training new data (say a new week's worth), train it with a new classifier ("interim"). Once it's trained, merge the interim classifier's wordinfo into your master classifier wordinfo by adding the new spamcounts and hamcounts to the master wordinfo blob, then recalc probabilities. Keep the "interim" wordinfo around (gzipped, datestamped) until your expiration time is up - then undo the earlier merge, subtracting the spamcount/hamcounts. Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll play with this tomorrow (ah, leave....) Anthony From agmsmith@rogers.com Sun Oct 20 17:04:14 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 12:04:14 EDT (-0400) Subject: [Spambayes] expiration ideas.
In-Reply-To: <200210201250.g9KCo5C03237@localhost.localdomain> Message-ID: <2124500893-BeMail@CR593174-A> Anthony Baxter wrote: > Keep the "interim" wordinfo around (gzipped, datestamped) until your > expiration time is up - then undo the earlier merge, subtracting > the spamcount/hamcounts. > > Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll > play with this tomorrow (ah, leave....) Sounds reasonable. But I'd rather keep around the whole messages so that I can change tokenizing schemes. Or perhaps use one of those future inter-word relation schemes. The total space is several times (ten times) more than a word list (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is still almost trivial on today's computers and huge disk drives to store the complete messages. So, you have to ask yourself if a 10X space (and tokenizing time) savings is worth it. - Alex From popiel@wolfskeep.com Sun Oct 20 17:52:28 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 20 Oct 2002 09:52:28 -0700 Subject: [Spambayes] expiration ideas. In-Reply-To: Message from "Alexander G. M. Smith" of "Sun, 20 Oct 2002 12:04:14 EDT." <2124500893-BeMail@CR593174-A> References: <2124500893-BeMail@CR593174-A> Message-ID: <20021020165228.51EF0F519@cashew.wolfskeep.com> In message: <2124500893-BeMail@CR593174-A> "Alexander G. M. Smith" writes: >Anthony Baxter wrote: >> Keep the "interim" wordinfo around (gzipped, datestamped) until your >> expiration time is up - then undo the earlier merge, subtracting >> the spamcount/hamcounts. >> >> Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll >> play with this tomorrow (ah, leave....) >Sounds reasonable. But I'd rather keep around the whole messages so >that I can change tokenizing schemes. Or perhaps use one of those >future inter-word relation schemes.
Whether you want to keep whole messages or just the wordlists depends entirely on whether you want to fully retrain when you switch tokenization schemes vs. keeping the old database and just adding new stuff with the new tokenization. If you keep the database through tokenizations, then you want a record of what actually got added during a prior training, instead of what would have been added if the current tokenization had been used. Thus, the word lists are better for database integrity. Of course, if you fully retrain every time you switch tokenizers, then keeping the entire messages is the only way to support arbitrary changes in the tokenizer. It's a question of approach... Personally, I'm keeping all messages for all time, so it doesn't matter much one way or another. - Alex PS. We can really confuse folks if Alex and Alex start holding regular debates on the list... From tim@fourstonesforum.com Sun Oct 20 15:15:43 2002 From: tim@fourstonesforum.com (Four Stones Forum) Date: Sun, 20 Oct 2002 09:15:43 -0500 Subject: [Spambayes] from a new member Message-ID: I've recently become aware of the Spambayes project, and I'm quite interested, so I subscribed to the mailing list, and I've been reading for a while, trying to get my head around the solution you're working on. I think I (kinda) have the idea now, and I figured I'd post to introduce myself and to ask a few questions. I'm Tim Stone, I've never worked on an open source project before, though I've used lots of open source stuff, I've been in the IT industry since 1975 (which makes me a geezer), I know 30-odd languages (Python isn't one of them), I've worked on all kinds of stuff under lots of different architectures... so enough about me. First of all: I HATE SPAM. It is an insidious evil, and I'm glad to see some truly progressive thinking about how to deal with it, not only deal with the mail, but deal with the PROBLEM.
Second of all: I run a website (www.fourstonesExpressions.com) that has a mailing list (I say these words at risk of having this mail rejected by simplistic filters) that I feel to be completely legit. It's a completely voluntary opt-in, there are no checkboxes with 'Yes' defaults, etc. etc. I don't sell the list or give it away. I only send mailings occasionally, perhaps 3 or 4 times a year. I've only ever had one opt-out. I think that speaks well of how the list is run. I say all that to say this: I take great pains in my mailings to ensure that things like spam-assassin don't label my mailings. Spam-assassin is very popular, and it does some great things. It also documents its reasoning in incoming mail's headers, so you can see how it arrived at its conclusion about your mail. This allows me to optimize my mailings by simply sending one to myself and seeing how SA rates it, and then fixing the problems. It shocks me that all spammers don't do this, but I'm certainly glad that they don't, because that allows SA to work for me. However, as our ability to block spam becomes better and better, I think they'll be forced to use this strategy more and more. As someone who sends mailings that *could* be thought of as spam, these are the things that I'm sure spammers will think about. How do you defeat Spambayes? Well, if I'm a spammer, I get me a copy and train it on a vast number of spams that are like mine, then I start tweaking.... As such, I think that Spambayes will work BEST in conjunction with other technologies. One of the best ideas I've seen in the discussions thus far is to keep a PUBLIC list of urls that spammers actively promote. This should probably be done at the domain level. The keeper of this list could very well use a crawler and a Bayesian approach to rating the website itself, which is a double safety net. Otherwise a spammer could include urls that are not related to the spam, and do (at least public relations) damage to other sites.
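[Editor's note: the domain-level blocklist idea above might be sketched as below. The blocklist entries and the URL regular expression are purely illustrative assumptions, not a real published list or the project's actual tokenizer.]

```python
import re

# Hypothetical published blocklist of spammer-promoted domains.
# These entries are made-up placeholders, not real data.
SPAMMER_DOMAINS = {"example-pills.invalid", "cheap-loans.invalid"}

# Naive URL matcher; a real tokenizer would be more careful.
URL_RE = re.compile(r"https?://([^/\s:?#]+)", re.IGNORECASE)

def suspicious_domains(body):
    """Return the blocklisted domains mentioned in a message body."""
    found = set(m.lower() for m in URL_RE.findall(body))
    return found & SPAMMER_DOMAINS

body = "Order now at http://Example-Pills.invalid/buy or visit http://python.org/"
print(sorted(suspicious_domains(body)))
```

Matching at the domain level, as suggested, also sidesteps the letter-spacing and image-only tricks mentioned below: the spam still has to name a domain the recipient can visit.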
Using this in conjunction with Spambayes actually defeats several other simple (temporary) workarounds that spammers could employ, s u c h a s i n c l u d i n g s p a c e s b e t w e e n l e t t e r s, which is quite human readable, but breaks the document down into a large number of single character words, or sending spam as a single jpg. Well, that's enough for now. Is anybody working on the Bayesian crawler idea? - Tim From agmsmith@rogers.com Sun Oct 20 22:12:07 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 17:12:07 EDT (-0400) Subject: [Spambayes] Headers and Other Significant Message Parts Message-ID: <13547908442-BeMail@CR593174-A> This is a multipart message in MIME format. ---------------------- multipart/mixed attachment Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text). 40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text: Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Any text/* parts and header: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong. Any text parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words: Genuine .219169 to .696899, 9 false positives, 22.5% wrong. 
Spam .660755 to .994116, 0 false negatives, 27 had no words. So, the headers are quite useful for identifying Spam. The winners are chewing up the whole message, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex ---------------------- multipart/mixed attachment I did some more tests using AGMSBayesianSpam v1.58 for BeOS (http://www.bebits.com/app/3055) to tokenize different parts of mail messages, to see if headers were useful or if some parts could be discarded. Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text, shouldn't influence it too much). 40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text (only quoted-printable decoding): Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Message parsed into parts (parsing decodes base64 and quoted-printable, and for text converts the character set to UTF-8), plus headers (includes MIME subheaders too): Genuine .168857 to .609005, 4 false positives, 10% wrong. Spam .614564 to .994364, 0 false negatives, 0% wrong. Message parsed into parts of all kinds, no header data: Genuine .220161 to .631161, 5 false positives, 12.5% wrong. Spam .592501 to .994444, 0 false negatives, 0% wrong. Only text/* parts and headers: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong.
Just text/* parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words. Genuine .219169 to .696899, 9 false positives, 22.5% wrong. Spam .660755 to .994116, 0 false negatives, 27 spam had no words (a good sign of spam). So, the headers are quite useful for identifying Spam in general. If using just headers, there are few false positives, making them suitable for deleting spam on the server (only downloading the header). But they have many false negatives, so it isn't that useful. Harmless and half useless :-). The winners are the whole message as raw text method, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex ---------------------- multipart/mixed attachment-- From agmsmith@rogers.com Sun Oct 20 22:17:17 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 17:17:17 EDT (-0400) Subject: [Spambayes] Headers and Other Significant Message Parts In-Reply-To: <13547908442-BeMail@CR593174-A> Message-ID: <13857947920-BeMail@CR593174-A> Sorry about the mangled message (another bug found!), here it is again: I did some more tests using AGMSBayesianSpam v1.58 for BeOS (http://www.bebits.com/app/3055) to tokenize different parts of mail messages, to see if headers were useful or if some parts could be discarded. Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text, shouldn't influence it too much).
40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text (only quoted-printable decoding): Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Message parsed into parts (parsing decodes base64 and quoted-printable, and for text converts the character set to UTF-8), plus headers (includes MIME subheaders too): Genuine .168857 to .609005, 4 false positives, 10% wrong. Spam .614564 to .994364, 0 false negatives, 0% wrong. Message parsed into parts of all kinds, no header data: Genuine .220161 to .631161, 5 false positives, 12.5% wrong. Spam .592501 to .994444, 0 false negatives, 0% wrong. Only text/* parts and headers: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong. Just text/* parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words. Genuine .219169 to .696899, 9 false positives, 22.5% wrong. Spam .660755 to .994116, 0 false negatives, 27 spam had no words (a good sign of spam). So, the headers are quite useful for identifying Spam in general. If using just headers, there are few false positives, making them suitable for deleting spam on the server (only downloading the header). But they have many false negatives, so it isn't that useful. Harmless and half useless :-).
The winners are the whole message as raw text method, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex From dereks@itsite.com Sun Oct 20 23:43:21 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 15:43:21 -0700 (PDT) Subject: [Spambayes] Deployment time In-Reply-To: Message-ID: After reading The Article and briefly reviewing the code, I've decided to go out on a limb and deploy hammie.py as the server-wide spam filter for a production email server. This is with Postfix. I'll be reporting my experiences here to the list, so you algorithm geniuses get a reminder that dummies like me should be able to install and use it. If things work out well, I'll write a "success story" that can be posted on the website. I may also be able to help contribute enduser documentation. So my immediate first question is: Is there a pre-existing DBM store that I can install on my server? I'm looking for something that will catch most (>90%) of current spams, and has the kind of ham traffic you'd see at a University or Corporate America institution (i.e., words like ConfigParser and -OO shouldn't be required to classify it as ham). If not, could one of you nice guys make one available for download? (Unfortunately, other than my developer-type personal emails, I have no existing ham store of my own to work from.) Any help in this regard is greatly appreciated. And although I'm not a spam expert, I know enough Python to get by, so hopefully I'll make a helpful test candidate.
(My Python experience includes a custom Apache authentication module, a load-balanced cluster management system, and a 3D OpenGL/SDL game engine -- yes, Python is fast enough for a 3D game engine, as long as the rendering is done in C extension modules :) Thank You, Derek Simkowiak From dereks@itsite.com Mon Oct 21 04:18:52 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 20:18:52 -0700 (PDT) Subject: [Spambayes] Deployment time In-Reply-To: Message-ID: > I'll be reporting my experiences here to the list [...] First note: The filter-mode header of "X-Hammie-Disposition" seems inappropriate, and let me explain why. I, the sysadmin, know what hammie.py is. I know that I installed it, and I know that it filters for spam. I know that it is part of the SpamBayes project, and that the header is inserted into spam-like messages. However, someone else looking at the "X-Hammie-Disposition" header out of context would not know at all what that header means, what to do with it, or that they can filter on it for classifying spam. A Google search for "Hammie" does not give any results relating to the SpamBayes project, and even worse, a search for "X-Hammie-Disposition" gives no results at all. It would be much more useful to use a header that can be recognized for what it is, without having to be one of the rare individuals who knows what "hammie.py" is. I suggest something like X-SpamBayes-Disposition [or] X-Spam-Disposition [or] X-Spamfilter-Disposition ...or, better yet, to stick with the conventions that SpamAssassin has used. This would be easiest on endusers and helpdesks, since setting up filters for a SpamBayes installation would be the same as doing it for a SpamAssassin installation. Mobile users with email accounts in both kinds of domain would only need one set of filter rules. That, and the SpamAssassin headers are pretty intuitive. 
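[Editor's note: Derek's suggestion amounts to tagging each message with a header that other filters and mail clients can recognize. A minimal sketch with Python's email package follows; the header name, thresholds, and score value are illustrative assumptions, not the project's actual choices.]

```python
import email

raw = """\
From: sender@example.com
To: recipient@example.com
Subject: test

Hello there.
"""

msg = email.message_from_string(raw)

score = 0.97  # spam score from some classifier (hypothetical value)

# Illustrative thresholds; a real deployment would tune these.
if score >= 0.9:
    disposition = "Yes"
elif score <= 0.2:
    disposition = "No"
else:
    disposition = "Unsure"

msg["X-SpamBayes-Disposition"] = "%s; score=%.3f" % (disposition, score)
print(msg["X-SpamBayes-Disposition"])
```

Downstream, end users could then filter on a single well-known header, exactly as they already do for SpamAssassin's X-Spam-* headers.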
SpamAssassin has slightly different semantics for its headers, but it will be trivial to implement them in hammie.py. If the maintainer(s) are in favor of this approach, I can submit a patch in a couple of weeks. For reference, you can see how SpamAssassin tags spams at the following URL: http://spamassassin.taint.org/doc/spamassassin.html#tagging Thanks, Derek Simkowiak From anthony@interlink.com.au Mon Oct 21 05:22:47 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 14:22:47 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: <2124500893-BeMail@CR593174-A> Message-ID: <200210210422.g9L4MlK08246@localhost.localdomain> >>> "Alexander G. M. Smith" wrote > Anthony Baxter wrote: > > Keep the "interim" wordinfo around (gzipped, datestamped) until your > > expiration time is up - then undo the earlier merge, subtracting > > the spamcount/hamcounts. > Sounds reasonable. But I'd rather keep around the whole messages so > that I can change tokenizing schemes. Or perhaps use one of those > future inter-word relation schemes. That's fine, but once this stuff is deployed, how many end-users are going to want to tweak their tokeniser? I'd suggest approximately three eighths of one fifth of bugger-all :) > The total space is several times (ten times) more than a word list > (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB > bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is > still almost trivial on today's computers and huge disk drives to > store the complete messages. So, you have to ask yourself if a > 10X space (and tokenizing time) savings is worth it. For one user, fine - but in a setting where you've got multiple users, say, using an IMAP server? You'd want the stuff to happen on the server, before the end users have to run a program to download the mail, check it, and send commands to the IMAP server to move the spam out of the way...
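[Editor's note: the merge/undo-merge expiration scheme being debated in this thread can be sketched with plain dicts. The (hamcount, spamcount) tuple format is an assumption for illustration only, not the project's actual WordInfo record, and the word counts below are made up.]

```python
def merge(master, interim):
    """Fold an interim batch's word counts into the master database."""
    for word, (ham, spam) in interim.items():
        h, s = master.get(word, (0, 0))
        master[word] = (h + ham, s + spam)

def unmerge(master, interim):
    """Expire a batch by subtracting the counts it contributed."""
    for word, (ham, spam) in interim.items():
        h, s = master[word]
        h, s = h - ham, s - spam
        if h == 0 and s == 0:
            del master[word]  # word no longer seen anywhere
        else:
            master[word] = (h, s)

master = {"python": (10, 1)}
week1 = {"python": (3, 0), "viagra": (0, 5)}  # one week's training batch
merge(master, week1)
# ... keep week1 around (gzipped, datestamped); when it expires:
unmerge(master, week1)
print(master)
```

After an unmerge (and a probability recalculation, omitted here), the database again corresponds exactly to some real collection of messages, which is the appeal of the scheme.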
I also get enough email that I really don't want to be lugging around all of my old email for a couple of months... Anthony -- Anthony Baxter It's never too late to have a happy childhood. From dereks@itsite.com Mon Oct 21 05:38:58 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 21:38:58 -0700 (PDT) Subject: [Spambayes] expiration ideas. In-Reply-To: <200210210422.g9L4MlK08246@localhost.localdomain> Message-ID: > > The total space is several times (ten times) more than a word list > > (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB > > bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is > > still almost trivial on today's computers and huge disk drives to > > store the complete messages. > For one user, fine - but in a setting where you've got multiple > users, say, using an IMAP server? Many hosting companies only offer 5 or 10 megs of email space with their "basic" accounts. From anthony@interlink.com.au Mon Oct 21 05:38:14 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 14:38:14 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: Message-ID: <200210210438.g9L4cE608354@localhost.localdomain> > Many hosting companies only offer 5 or 10 megs of email space with > their "basic" accounts. *nod* I think our webmail is < 50M or so.... From tim.one@comcast.net Mon Oct 21 06:57:09 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 21 Oct 2002 01:57:09 -0400 Subject: [Spambayes] expiration ideas. In-Reply-To: <200210201250.g9KCo5C03237@localhost.localdomain> Message-ID: [Anthony Baxter] > Just thinking again about expiration, and wondering if the following > would work: > > When training new data (say a new week's worth), train it with a > new classifier ("interim"). Once it's trained, merge the interim > classifier's wordinfo into your master classifier wordinfo by adding > the new spamcounts and hamcounts to the master wordinfo blob, then > recalc probabilities. 
> > Keep the "interim" wordinfo around (gzipped, datestamped) until your > expiration time is up - then undo the earlier merge, subtracting > the spamcount/hamcounts. > > Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll > play with this tomorrow (ah, leave....) It's sure the most principled idea I've heard, in that it would always leave the database corresponding exactly with *some* real-world collection of msgs. OTOH, what's the purpose of expiration? I can think of two: 1. To reduce database size. 2. To accelerate adaptation to changes in ham and/or spam. I don't know that #2 is a real problem, and there's some reason to doubt it. Over the weekend, I tried my c.l.py ham + bruceg spam classifier on newer data Greg Ward harvested from all non-personal python.org traffic (which turns out to be partly untrue: python.org also hosts a few small & unadvertised "hobby lists" I didn't know about, and they count as "personal email" to me). Anyway, the c.l.py classifier had a very high FP rate, and especially on the "hobby list" traffic. But its FN rate was identical to that of a classifier trained from scratch on the new data: 1 FN, under chi's rules for FN. This suggests that everyone is right in believing that spam is much the same. So far as changes in ham go, it suggests that a significantly new source of ham needs to be trained on ASAP, lest it be viewed as spam. About #1, there are lots of things that haven't been tested properly, the most obvious being to purge unique words from the database immediately after training. That should cut the database size in half with one quick and easy stroke. Whether it hurts performance is unknown. At the start, my favorite gimmick was embodied in the atime attr of WordInfo records: remember the most recent time a word was used in scoring, and get rid of words that haven't been used "recently". If they're not being used, then getting rid of them can't affect accuracy.
It addresses both #1 and #2, but #1 on a revolving-door basis, and #2 in only a very weak sense. From anthony@interlink.com.au Mon Oct 21 07:06:53 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 16:06:53 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: Message-ID: <200210210606.g9L66sO08825@localhost.localdomain> >>> Tim Peters wrote > OTOH, what's the purpose of expiration? I can think of two: > > 1. To reduce database size. > > 2. To accelerate adaptation to changes in ham and/or spam. The former. I'm trying to think about how this could be deployed "in the real world". Note also that I'm not so much worried about adapting to spam as adapting to changing ham patterns. I know that my own email changes over time (for instance, until this project started, I doubt the word "Nigerian" would have been considered a strong ham indicator for me :) (somewhat off-topic, but related: I also suspect that if the spambayes code is vulnerable to being deliberately sabotaged, it'll be the tokeniser that's the weak point, not the classifier. For instance, I already have a couple of persistent FNs with message bodies entirely encoded in javascript. I don't want to think about having to decode javascript or run it to check if something's spam.) I'm somewhat nervous of the "purge all unique words" approach - one obvious failing is that it means if you _are_ doing ongoing training, you'd want to batch up a bunch of messages. I'm also not sure that deliberately perverting the real world in that way isn't going against the "stupid beats smart" meta-rule that's served us so far... but-then-maybe-stupider-beats-stupid, too. Anthony. From anthony@interlink.com.au Mon Oct 21 07:30:02 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 16:30:02 +1000 Subject: [Spambayes] expiration ideas. 
In-Reply-To: <200210210606.g9L66sO08825@localhost.localdomain> Message-ID: <200210210630.g9L6U3809108@localhost.localdomain> >>> Anthony Baxter wrote > Note also that I'm not so much worried about adapting to spam as > adapting to changing ham patterns. I know that my own email changes > over time (for instance, until this project started, I doubt the word > "Nigerian" would have been considered a strong ham indicator for me :) Another thought - if we were to ship a package with a small "starter" wordinfo dict, it would be very good if this was gradually expired out. Two reasons I can think of: the gradually adapting wordinfo will end up better representing the user's real usage, plus it means anyone out there starting with a standard wordinfo won't be vulnerable to spammers picking up words with high hamprob and deliberately inserting them into their spam. I imagine it's highly possible we'll start seeing things like 'wrote:' appearing, I'm already seeing spam with 'Re: ' in the subject (but as yet, no 'In-reply-to' headers...) Anthony From anthony@interlink.com.au Mon Oct 21 09:44:37 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 18:44:37 +1000 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: <200210210844.g9L8icP10231@localhost.localdomain> >>> Tim Peters wrote > [Rob W.W. Hooft] > > Correlations, correlations, correlations. It all boils down to > > correlations. Not the fact that there are correlations, but that they > > are very, very different from one clue to the next. All these mailman > > clues are correlated. And by not downweighting them, we're blinding the > > procedure to the other clues that do not come by the dozens... > > It's not even that they're Mailman clues, though, it's more that python.org > specifically already has strong anti-spam and anti-virus measures in place. 
> That's how these "Mailman clues" earned their very low spamprobs to begin > with -- it's not that Mailman is stopping spam, it's that virtually all the > Mailman lists I'm on go through python.org. For an additional data point - if I turn on mine_received_headers, one of the clues that shows up in a lot of very very low-prob fn's is received lines with mail.python.org. So stripping out just the mailman headers won't help. This also shows up with the footers of the messages that do make it past Greg to python-list. The .sig at the end shows up as strong ham clues. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Mon Oct 21 10:34:55 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 19:34:55 +1000 Subject: [Spambayes] cancellation disease again? Message-ID: <200210210934.g9L9Ytk10743@localhost.localdomain> I think I'm seeing what's been referred to as cancellation disease again, using chi combining. I'm getting very very long spams (like those interminable MLMs with the "5 reports") that are getting both *H* and *S* scores at or near 1, and a final score of 0.5. E.g.
the perfectly standard "send money for 5 reports" spam gets: prob = 0.500000000004 prob('*H*') = 1 prob('*S*') = 1 prob('sent:') = 0.000670741 prob('indeed') = 0.00248756 prob('place.') = 0.0025729 prob('obviously') = 0.00272893 prob('missed') = 0.0033358 prob('persistent') = 0.00378469 prob('replaced') = 0.00455005 prob('something.') = 0.00542823 prob('george') = 0.00617284 prob('think.') = 0.00617284 prob('"no') = 0.0065312 prob('happen.') = 0.00672646 prob('him.') = 0.00672646 prob('basically,') = 0.00819672 prob('key.') = 0.00850662 prob('"why') = 0.00884086 prob('correctly,') = 0.00920245 prob('sorry.') = 0.00920245 prob('"just') = 0.00959488 prob('hopes') = 0.00959488 prob('initially') = 0.00959488 prob('(so') = 0.0104895 prob('it;') = 0.0104895 prob('assumes') = 0.0110024 prob('at.') = 0.0110024 prob('everyone,') = 0.0110024 prob('myself.') = 0.0110024 prob('determining') = 0.0115681 prob('problem).') = 0.0115681 prob('falling') = 0.0121951 prob('received,') = 0.0121951 prob("don't,") = 0.012894 prob('stanford') = 0.012894 prob('struggling') = 0.012894 prob('directions') = 0.0136778 prob('jonathan') = 0.0145631 prob('portland,') = 0.0145631 prob('privately') = 0.0145631 prob('sometime') = 0.0145631 prob('saying,') = 0.0155709 prob('gained') = 0.0180723 prob('sized') = 0.0180723 prob('belief') = 0.0196507 prob('fortunately,') = 0.0196507 prob('goodness') = 0.0196507 prob('encounters') = 0.0215311 prob('scratch.') = 0.0215311 prob('trash') = 0.0215311 prob('build.') = 0.0238095 prob('exactly,') = 0.0238095 prob('invested') = 0.0238095 prob('pressed') = 0.0238095 prob('me;') = 0.0266272 prob('work...') = 0.0266272 prob('financially') = 0.973743 prob('responses.') = 0.974053 prob('money.') = 0.974232 prob('ordering') = 0.974234 prob('wolf') = 0.974514 prob('remember,') = 0.974677 prob('residual') = 0.975263 prob('guidelines') = 0.975736 prob('downline') = 0.97619 prob('investing') = 0.976423 prob('response,') = 0.976574 prob('investment') = 0.976809 
prob('goes,') = 0.976946 prob('pencil') = 0.9772 prob('me!') = 0.977468 prob('envelope') = 0.977667 prob('involved.') = 0.97782 prob('recession') = 0.978047 prob('following.') = 0.978188 prob('lately.') = 0.978188 prob('legal.') = 0.978188 prob('receive,') = 0.97834 prob('in!') = 0.978469 prob('devoted') = 0.978815 prob('orders') = 0.979431 prob('wife,') = 0.979852 prob('purchase') = 0.979994 prob('subject:YOUR') = 0.980474 prob('tested,') = 0.980495 prob('plan.') = 0.980747 prob('materials') = 0.981284 prob('friend,') = 0.981371 prob('opportunity.') = 0.981474 prob('$5,000') = 0.981496 prob('income,') = 0.981928 prob('$50,000') = 0.981962 prob('gambling') = 0.982318 prob('$25') = 0.982672 prob('chicago,') = 0.982771 prob('secrets') = 0.982771 prob('resell') = 0.982897 prob('letter,') = 0.983163 prob('#4.') = 0.983271 prob('e-mails') = 0.983483 prob('currency)') = 0.983805 prob('instructed') = 0.984241 prob('live.') = 0.984241 prob('success:') = 0.985 prob('exceedingly') = 0.985702 prob('her,') = 0.985702 prob('reach.') = 0.98603 prob('earn') = 0.986397 prob('e-mailed') = 0.986405 prob('profits.') = 0.986641 prob('e-mail,') = 0.987065 prob('profit!') = 0.987106 prob('subject:Money') = 0.987106 prob('500,000') = 0.987464 prob('invaluable') = 0.987784 prob('independent.') = 0.988086 prob('marketing,') = 0.988432 prob('crammed') = 0.988647 prob('mitchell.') = 0.988647 prob('p.o.') = 0.988647 prob('prohibiting') = 0.988989 prob("'knew'") = 0.988998 prob("so'") = 0.988998 prob('orders,') = 0.989157 prob('profitable') = 0.989427 prob('reports!') = 0.98951 prob('ordered.') = 0.990472 prob('advertise.') = 0.990959 prob('imagined.') = 0.991185 prob('originator') = 0.991185 prob('$500,000') = 0.991603 prob("1,000's") = 0.991603 prob('feet.') = 0.991603 prob('grumbled') = 0.991603 prob('50,000') = 0.991984 prob('concealed') = 0.991984 prob('year!!!') = 0.992846 prob('refinance') = 0.993653 prob('accurately!') = 0.994148 prob('cash,') = 0.994148 prob('relax,') = 0.994297 
prob('spouting') = 0.994438 prob('instructed.') = 0.994572 prob('jody') = 0.994572 prob('merciless') = 0.994572 prob('(u.s.') = 0.994822 prob('income') = 0.994933 prob('multilevel') = 0.994938 prob('ordering,') = 0.995156 prob('e-mails.') = 0.995258 prob('money!') = 0.99579 prob('message-id:@yarrina.connect.com.au') = 0.998453 I'm not sure what the best way to approach this is.... Anthony From tim.one@comcast.net Mon Oct 21 16:37:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 21 Oct 2002 11:37:57 -0400 Subject: [Spambayes] cancellation disease again? In-Reply-To: <200210210934.g9L9Ytk10743@localhost.localdomain> Message-ID: [Anthony Baxter] > I think I'm seeing what's been referred to as cancellation disease again, > using chi combining. I'm getting very very long spams (like those > interminable MLMs with the "5 reports" that are getting both *H* and > *S* scores at or near 1, and a final score of 0.5. > > E.g. the perfectly standard "send money for 5 reports" spam gets: > > prob = 0.500000000004 In "cancellation disease", "cancellation" does indeed refer to msgs with huge numbers of both low-spamprob and high-spamprob words, and that's a property of the msg in conjunction with the state of your training data -- cancellation can't be stopped. "Diseased" refers to a scheme that infers certainty when given such a msg. For example, Graham-combining is diseased in this way -- it would have scored this msg 0.0 or 1.0, and it's hard to predict which. chi-combining reliably scores such msgs smack in its middle ground, which is the best that can be done -- chi is confused, and it knows it's confused, and it tells you it's confused. > ... > I'm not sure what the best way to approach this is.... Middle-ground schemes *have* a middle ground -- that's their point . You have to be aware of their middle ground. I set up Sean/Mark's Outlook GUI to move chi middle-ground msgs into an Unsure folder. 
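A minimal sketch of the chi-squared combining under discussion (following Gary Robinson's formulation as used by spambayes; this standalone version is illustrative, not the project's actual classifier code):

```python
import math

def chi2Q(x2, v):
    # P(X >= x2) for a chi-squared variable with v degrees of freedom,
    # v even -- the closed form exp(-m) * sum(m**i / i!) with m = x2/2.
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # H ~ 1 when the word probs look uniformly hammy, S ~ 1 when they
    # look uniformly spammy.  A msg loaded with strong clues in *both*
    # directions drives H and S toward 1 together ("cancellation"),
    # and the final score lands in the 0.5 middle ground.
    n = len(probs)
    H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

Feed it twenty uniformly spammy clues and the score is near 1; feed it equal numbers of strong ham and spam clues and it returns essentially 0.5, which is exactly the behavior Anthony reports.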
For python.org use, chi middle-ground msgs will be kicked out for human review. If you lack a mechanism like that, I suppose the best you can do is pass them on (if you hate FP more than FN), or call them spam (if you hate FN more than FP), or decide that 0.000000000004 over 0.5 means spam is the best guess (if you're determined to wish away reality ). In any case, after a correct classification is known, you should add it to your training data. Over time, the word spamprobs will change accordingly. The "5 reports" spams I have in my personal-email classifier score with an internal H of 0 and an internal S of 1, for a final score of 1.

From neale@woozle.org Mon Oct 21 20:21:20 2002
From: neale@woozle.org (Neale Pickett)
Date: 21 Oct 2002 12:21:20 -0700
Subject: [Spambayes] Testing against someone else's corpora (Was: There Can Be Only One)
In-Reply-To: References: Message-ID:

I bet you thought I'd forgotten about this :)

So then, Tim Peters is all like:

> [Tim]
> >> 3. Is it possible to "seed" a database with somebody else's data and
> >> get decent results out of the box?
>
> [Neale Pickett]
> > $FIRM has a tangible interest in the answer to this question.

[snip]

> So I'd build a custom test driver on top of TestDriver, like so:
>
> d = TestDriver.Driver()
> d.train(ham, spam) # create the seed database
> for user in users:
>     d.test(user.ham, user.spam)
> d.finishtest()
> d.alldone()

[snip]

> The output will display results for each user individually, and an aggregate
> across all users. Then you'll want to stare at the output to see how well
> it does. Come back when you get that far .

Okay. Here's my test setup. I have been collecting all the spam sent to $FIRM for the past week and a half. I'm sad to report that "all the spam" means "all incoming mail that spamassassin scored over 10". For the ten days I collected it, I got 14997 spam! If this is typical, I understand better why spam filtering is such a big deal.
The ham came from a guy who's been working here since 1998. It's every message he's sent or received since then. He claims he hand-filtered spam out of it, but I know it's not that clean from timcv runs. I'm working on hand-cleaning this and the spam corpus, but it's going to take some time.

To test things, I hand-cleaned two mailboxes of co-workers, W and B. Then I ran this code:

import TestDriver
from Options import options
import msgs

users = ("B", "W")
hamdir_template = "Data/Users/%s/Ham"
spamdir_template = "Data/Users/%s/Spam"

def drive(nsets):
    print options.display()
    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
    d = TestDriver.Driver()
    d.train(msgs.HamStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
            msgs.SpamStream("%s-%d" % (spamdirs[0], nsets), spamdirs))
    for user in users:
        hamdir = hamdir_template % user
        spamdir = spamdir_template % user
        d.test(msgs.HamStream(hamdir, [hamdir]),
               msgs.SpamStream(spamdir, [spamdir]))
    d.finishtest()
    d.alldone()

drive(2)

So, here's the output:

[TestDriver]
show_histograms = True
show_best_discriminators = 30
nbuckets = 200
spam_cutoff = 0.560
pickle_basename = class
show_ham_lo = 1.0
show_false_negatives = True
best_cutoff_fn_weight = 1.00
ham_cutoff = 0.560
show_spam_hi = 0.0
show_unsure = False
show_spam_lo = 1.0
save_trained_pickles = False
show_ham_hi = 0.0
show_false_positives = True
spam_directories = Data/Spam/Set%d
percentiles = 5 25 75 95
compute_best_cutoffs_from_histograms = True
best_cutoff_fp_weight = 10.00
show_charlimit = 3000
best_cutoff_unsure_weight = 0.20
ham_directories = Data/Ham/Set%d
save_histogram_pickles = False
[CV Driver]
build_each_classifier_from_scratch = False
[Tokenizer]
mine_received_headers = False
octet_prefix_size = 5
generate_long_skips = True
count_all_header_lines = False
check_octets = False
ignore_redundant_html = False
basic_header_tokenize = True
safe_headers = abuse-reports-to date errors-to from
importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face basic_header_skip = received x-.* delivered-to date basic_header_tokenize_only = False retain_pure_html_tags = False [Classifier] use_mixed_combining = False robinson_probability_x = 0.5 robinson_minimum_prob_strength = 0.1 robinson_probability_s = 0.45 use_chi_squared_combining = False max_discriminators = 150 mixed_combining_chi_weight = 0.9 -> Training on Data/Ham/Set1-2 & Data/Spam/Set1-2 ... 400 hams & 400 spams -> Predicting Data/Users/B/Ham & Data/Users/B/Spam ... -> tested 121 hams & 23 spams against 400 hams & 400 spams -> false positive %: 7.43801652893 -> false negative %: 0.0 -> unsure %: 0.0 -> cost: $90.00 -> 9 new false positives [snip] -> 0 new false negatives -> 0 new unsure best discriminators: 'edit' 42 0.0564005 'to:skip:w 10' 43 0.370886 'header:Received:4' 44 0.00169875 'subject:PERFORCE' 44 0.00585176 'subject:change' 44 0.00585176 'subject:review' 44 0.00570342 'to:skip:B 10' 44 0.0412844 '...' 
46 0.181134 'message-id:@horus.inside.$FIRM' 46 0.00556242 'your' 46 0.758353 'affected' 47 0.00570342 'message-id:skip:h 20' 48 0.0416277 'precedence:bulk' 48 0.0429152 'header:MIME-Version:1' 49 0.346045 'url:com' 49 0.761515 'you' 49 0.650341 'content-type:plain' 50 0.177419 'from' 50 0.691328 'change' 52 0.267149 'this' 53 0.655698 'proto:http' 54 0.738164 'files' 57 0.125668 'header:Message-Id:1' 60 0.72845 'message-id:skip:2 20' 60 0.724415 'header:Message-ID:1' 64 0.298295 'from:email addr:$FIRM>' 72 0.00825756 'from:skip:w 10' 76 0.0214323 'return-path:skip:w 10' 98 0.038085 'header:Return-Path:1' 121 0.685963 'content-type:text/plain' 124 0.272913 -> Ham scores for this pair: 121 items; mean 35.83; sdev 13.31 -> min 17.1664; median 32.0866; max 71.2362 -> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 1 * 17.5 0 18.0 0 18.5 2 ** 19.0 0 19.5 2 ** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 5 ***** 22.0 5 ***** 22.5 3 *** 23.0 2 ** 23.5 2 ** 24.0 3 *** 24.5 1 * 25.0 1 * 25.5 2 ** 26.0 6 ****** 26.5 4 **** 27.0 0 27.5 3 *** 28.0 2 ** 28.5 1 * 29.0 1 * 29.5 3 *** 30.0 0 30.5 1 * 31.0 2 ** 31.5 1 * 32.0 2 ** 32.5 0 33.0 0 33.5 1 * 34.0 0 34.5 0 35.0 2 ** 35.5 1 * 36.0 1 * 36.5 0 37.0 2 ** 37.5 0 38.0 3 *** 38.5 0 39.0 0 39.5 0 40.0 2 ** 40.5 1 * 41.0 1 * 41.5 0 42.0 4 **** 42.5 1 * 43.0 3 *** 43.5 3 *** 44.0 2 ** 44.5 1 * 45.0 2 ** 45.5 1 * 46.0 3 *** 46.5 0 47.0 0 47.5 2 ** 48.0 2 ** 48.5 1 * 49.0 2 ** 49.5 0 50.0 2 ** 50.5 3 *** 51.0 0 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 0 54.0 1 * 54.5 0 55.0 2 ** 55.5 0 56.0 0 56.5 0 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 
70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for this pair: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 
98.5 0 99.0 0 99.5 0 -> best cost for this pair: $2.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> Ham scores for all in this training set: 121 items; mean 35.83; sdev 13.31 -> min 17.1664; median 32.0866; max 71.2362 -> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 1 * 17.5 0 18.0 0 18.5 2 ** 19.0 0 19.5 2 ** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 5 ***** 22.0 5 ***** 22.5 3 *** 23.0 2 ** 23.5 2 ** 24.0 3 *** 24.5 1 * 25.0 1 * 25.5 2 ** 26.0 6 ****** 26.5 4 **** 27.0 0 27.5 3 *** 28.0 2 ** 28.5 1 * 29.0 1 * 29.5 3 *** 30.0 0 30.5 1 * 31.0 2 ** 31.5 1 * 32.0 2 ** 32.5 0 33.0 0 33.5 1 * 34.0 0 34.5 0 35.0 2 ** 35.5 1 * 36.0 1 * 36.5 0 37.0 2 ** 37.5 0 38.0 3 *** 38.5 0 39.0 0 39.5 0 40.0 2 ** 40.5 1 * 41.0 1 * 41.5 0 42.0 4 **** 42.5 1 * 43.0 3 *** 43.5 3 *** 44.0 2 ** 44.5 1 * 45.0 2 ** 45.5 1 * 46.0 3 *** 46.5 0 47.0 0 47.5 2 ** 48.0 2 ** 48.5 1 * 49.0 2 ** 49.5 0 50.0 2 ** 50.5 3 *** 51.0 0 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 0 54.0 1 * 54.5 0 55.0 2 ** 55.5 0 56.0 0 56.5 0 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 
87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all in this training set: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> best cost for all in this training set: $2.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; 
unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% This doesn't look--it's all over the map. However, IANAS, nor am I a Tim, so I'll leave judgement up to you fine folks. Here's W's mail: -> Predicting Data/Users/W/Ham & Data/Users/W/Spam ... -> tested 361 hams & 0 spams against 400 hams & 400 spams -> false positive %: 1.38504155125 -> false negative %: 0.0 -> unsure %: 0.0 -> cost: $50.00 -> 5 new false positives [snip] -> 0 new false negatives -> 0 new unsure best discriminators: '2002' 129 0.805927 'our' 129 0.799565 'message-----' 134 0.0121951 'subject:' 140 0.0297372 'from:' 141 0.0563603 'watchguard' 145 0.0213021 'subject:] ' 163 0.0148575 'are' 166 0.629383 'to:' 167 0.214615 'please' 180 0.839247 'from' 184 0.691328 'subject:-' 187 0.218038 'precedence:bulk' 189 0.0429152 'url:com' 196 0.761515 'your' 206 0.758353 'x-mailer:internet mail service (5.5.2653.19)' 210 0.00556242 'proto:http' 239 0.738164 'you' 239 0.650341 'to:skip:w 10' 240 0.370886 'content-type:text' 246 0.610023 'this' 281 0.655698 'content-type:charset' 287 0.33352 'content-type:plain' 306 0.177419 'return-path:skip:w 10' 312 0.038085 'from:email addr:$FIRM>' 318 0.00825756 'from:skip:w 10' 322 0.0214323 'header:MIME-Version:1' 322 0.346045 'header:Return-Path:1' 331 0.685963 'header:Message-ID:1' 358 0.298295 'content-type:text/plain' 456 0.272913 -> Ham scores for this pair: 361 items; mean 38.74; sdev 7.88 -> min 17.2567; median 39.624; max 63.0457 -> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 5 ***** 17.5 0 18.0 1 * 18.5 3 *** 19.0 0 19.5 2 ** 20.0 0 20.5 0 21.0 0 21.5 1 * 22.0 0 
22.5 0 23.0 3 *** 23.5 2 ** 24.0 1 * 24.5 2 ** 25.0 2 ** 25.5 2 ** 26.0 3 *** 26.5 2 ** 27.0 4 **** 27.5 3 *** 28.0 0 28.5 1 * 29.0 5 ***** 29.5 6 ****** 30.0 4 **** 30.5 4 **** 31.0 2 ** 31.5 5 ***** 32.0 8 ******** 32.5 9 ********* 33.0 11 *********** 33.5 6 ****** 34.0 6 ****** 34.5 11 *********** 35.0 4 **** 35.5 3 *** 36.0 6 ****** 36.5 7 ******* 37.0 5 ***** 37.5 7 ******* 38.0 6 ****** 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 7 ******* 40.5 8 ******** 41.0 7 ******* 41.5 12 ************ 42.0 15 *************** 42.5 8 ******** 43.0 14 ************** 43.5 6 ****** 44.0 14 ************** 44.5 9 ********* 45.0 15 *************** 45.5 10 ********** 46.0 2 ** 46.5 4 **** 47.0 6 ****** 47.5 7 ******* 48.0 2 ** 48.5 3 *** 49.0 3 *** 49.5 1 * 50.0 2 ** 50.5 1 * 51.0 2 ** 51.5 2 ** 52.0 0 52.5 1 * 53.0 0 53.5 1 * 54.0 1 * 54.5 1 * 55.0 2 ** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for this pair: -> Ham scores for all in this training set: 361 items; mean 38.74; sdev 7.88 -> min 17.2567; median 39.624; max 63.0457 -> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 5 ***** 17.5 0 18.0 1 * 18.5 3 *** 
19.0 0 19.5 2 ** 20.0 0 20.5 0 21.0 0 21.5 1 * 22.0 0 22.5 0 23.0 3 *** 23.5 2 ** 24.0 1 * 24.5 2 ** 25.0 2 ** 25.5 2 ** 26.0 3 *** 26.5 2 ** 27.0 4 **** 27.5 3 *** 28.0 0 28.5 1 * 29.0 5 ***** 29.5 6 ****** 30.0 4 **** 30.5 4 **** 31.0 2 ** 31.5 5 ***** 32.0 8 ******** 32.5 9 ********* 33.0 11 *********** 33.5 6 ****** 34.0 6 ****** 34.5 11 *********** 35.0 4 **** 35.5 3 *** 36.0 6 ****** 36.5 7 ******* 37.0 5 ***** 37.5 7 ******* 38.0 6 ****** 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 7 ******* 40.5 8 ******** 41.0 7 ******* 41.5 12 ************ 42.0 15 *************** 42.5 8 ******** 43.0 14 ************** 43.5 6 ****** 44.0 14 ************** 44.5 9 ********* 45.0 15 *************** 45.5 10 ********** 46.0 2 ** 46.5 4 **** 47.0 6 ****** 47.5 7 ******* 48.0 2 ** 48.5 3 *** 49.0 3 *** 49.5 1 * 50.0 2 ** 50.5 1 * 51.0 2 ** 51.5 2 ** 52.0 0 52.5 1 * 53.0 0 53.5 1 * 54.0 1 * 54.5 1 * 55.0 2 ** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all in this training set: -> Ham scores for all runs: 482 items; mean 38.01; sdev 9.62 -> min 17.1664; median 39.2174; max 71.2362 -> percentiles: 5% 21.5969; 25% 31.7042; 75% 44.2002; 95% 51.8263 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 
16.0 0 16.5 0 17.0 6 ****** 17.5 0 18.0 1 * 18.5 5 ***** 19.0 0 19.5 4 **** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 6 ****** 22.0 5 ***** 22.5 3 *** 23.0 5 ***** 23.5 4 **** 24.0 4 **** 24.5 3 *** 25.0 3 *** 25.5 4 **** 26.0 9 ********* 26.5 6 ****** 27.0 4 **** 27.5 6 ****** 28.0 2 ** 28.5 2 ** 29.0 6 ****** 29.5 9 ********* 30.0 4 **** 30.5 5 ***** 31.0 4 **** 31.5 6 ****** 32.0 10 ********** 32.5 9 ********* 33.0 11 *********** 33.5 7 ******* 34.0 6 ****** 34.5 11 *********** 35.0 6 ****** 35.5 4 **** 36.0 7 ******* 36.5 7 ******* 37.0 7 ******* 37.5 7 ******* 38.0 9 ********* 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 9 ********* 40.5 9 ********* 41.0 8 ******** 41.5 12 ************ 42.0 19 ******************* 42.5 9 ********* 43.0 17 ***************** 43.5 9 ********* 44.0 16 **************** 44.5 10 ********** 45.0 17 ***************** 45.5 11 *********** 46.0 5 ***** 46.5 4 **** 47.0 6 ****** 47.5 9 ********* 48.0 4 **** 48.5 4 **** 49.0 5 ***** 49.5 1 * 50.0 4 **** 50.5 4 **** 51.0 2 ** 51.5 2 ** 52.0 1 * 52.5 1 * 53.0 0 53.5 1 * 54.0 2 ** 54.5 1 * 55.0 4 **** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 2 ** 58.0 0 58.5 2 ** 59.0 0 59.5 0 60.0 2 ** 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all runs: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 
5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> best cost for all runs: $2.60 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; unsure ham 6; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 2.57% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 6; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 2.57% -> all runs false positives: 14 -> all runs false negatives: 0 -> all runs unsure: 0 -> all runs false positive %: 2.90456431535 -> all runs false negative %: 0.0 -> all runs unsure %: 0.0 -> all runs cost: $140.00 The f-ps are conference 
announcements, solicited commercial email, or listserv responses. Should I set the cutoff to 0.63? Do I owe Tim and Gary $140? Sorry I can't answer these questions myself, but I've been lucky to skim subject headers on this list lately so I don't know what all this new-fangled stuff is. I realize the data is less than ideal, but it's all I can get at the moment. Aside from cleaning the training data, what should I do next? Neale From agmsmith@rogers.com Mon Oct 21 23:48:25 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Mon, 21 Oct 2002 18:48:25 EDT (-0400) Subject: [Spambayes] expiration ideas. In-Reply-To: <200210210630.g9L6U3809108@localhost.localdomain> Message-ID: <2987213371-BeMail@CR593174-A> Anthony Baxter wrote: > Another thought - if we were to ship a package with a small "starter" > wordinfo dict, it would be very good if this was gradually expired > out. Two reasons I can think of: the gradually adapting wordinfo will > end up better representing the user's real usage, plus it means anyone > out there starting with a standard wordinfo won't be vulnerable to > spammers picking up words with high hamprob and deliberately inserting > them into their spam. I imagine it's highly possible we'll start seeing > things like 'wrote:' appearing, I'm already seeing spam with 'Re: ' in > the subject (but as yet, no 'In-reply-to' headers...) That's what I do, with expiry based on the age of training messages added, not the number of times used (so it's not quite as efficient, but it doesn't need to update the database every time it checks for new mail). Plus, every time I release a new version, I include a new sample database, with fresh spam (single words removed, so it's only 185KB). That works well enough to keep the users happy. I've now added an illustrated guide on how to train the system; some people didn't realise they could do that - still need to add a big red flashing button to the mail client :-). 
- Alex From tim.one@comcast.net Thu Oct 24 03:06:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 23 Oct 2002 22:06:28 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: There's an interesting bug in the Outlook 2000 client that's absolutely nailing all the Asian spam I get, along with several other non-Asian languages. "The bug" is this, in Outlook2000/manager.py's GetBayesStreamForMessage(): body += message.Text.encode("ascii", "replace") Outlook uses Unicode internally. message.Text grabs the message body from Outlook as a Unicode string. .encode(...) is then plain Python, telling it to encode the Unicode string as a regular string, using the ascii encoding, and replacing Unicode characters that can't be represented faithfully in ascii by "a suitable replacement character". For the ascii encoding, that almost always turns out to be a question mark character, because there's almost always nothing in ascii that's truly suitable. While this may suck from a purity view, it leads to spam-clue listings like this (from a typical Asian spam): Spam Score: 1 '*H*' 0 '*S*' 1 'header:Return-Path:1' 0.611133 'header:Message-ID:1' 0.813889 '15????' 0.844828 '24????' 0.844828 '7??????' 0.844828 '&' 0.863317 'header:Mime-Version:1' 0.89556 'header:Reply-To:1' 0.90756 '10????' 0.934783 '??????!!!' 0.934783 'header:Received:2' 0.957828 '??????????)' 0.958716 '??????...' 0.965116 '????????...' 0.965116 'message-id:@cpimssmtpa05.msn.com' 0.969799 'from:email addr:korea.com>' 0.980349 '(????' 0.981928 '??.' 0.985437 'e-mail??????' 0.986322 '????,' 0.99505 '????????,' 0.995258 '??????,' 0.99545 '????????.' 0.997691 '??????????.' 0.99776 'skip:? 20' 0.998034 '????????????' 0.998192 '??????????' 0.998474 '??????' 0.998562 '????' 0.998598 '????????' 0.998672 'skip:? 
10' 0.998894

That is, languages having scant intersection with ASCII end up getting tokenized as collections of mostly question marks, and each instance of "?"*n ends up earning a high spamprob. The database burden is trivial, since there just aren't many *possible* strings consisting of nearly pure question marks, and the "skip" gimmick kicks in when a contiguous string of question marks gets long. Of course lots of '?'*n thingies in a msg are highly correlated, which in *my* personal email is helpful: spam or not, anything sent to me in a language having small intersection with ASCII may as well be spam -- there's no chance *I* can read it regardless.

If somebody would like to formalize this bug as a tokenizer option, so that non-Outlook American-English users can enjoy its benefits too, I won't object. For International Sensitivity reasons, we may have to put it in a [Dont Ask Dont Tell] .ini section .

From barry@python.org Thu Oct 24 21:51:21 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 24 Oct 2002 16:51:21 -0400
Subject: [Spambayes] Get rid of the email directory?
Message-ID: <15800.23881.157471.402911@gargle.gargle.HOWL>

We checked the email package into spambayes because we wanted to use the new api and avoid the bugs which were present in earlier versions of the library. Python 2.2.2 and Python 2.3a0 have the same, latest version of the email package now, so I think this directory isn't necessary, and may be harmful. I'd like to remove it, but it means if you're running Python 2.2.1, you'll need to upgrade.

Any objections?
-Barry

From popiel@wolfskeep.com Thu Oct 24 22:33:17 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 24 Oct 2002 14:33:17 -0700
Subject: [Spambayes] Get rid of the email directory?
In-Reply-To: Message from barry@python.org (Barry A. Warsaw) of "Thu, 24 Oct 2002 16:51:21 EDT."
<15800.23881.157471.402911@gargle.gargle.HOWL> References: <15800.23881.157471.402911@gargle.gargle.HOWL> Message-ID: <20021024213317.7D3B5F599@cashew.wolfskeep.com> In message: <15800.23881.157471.402911@gargle.gargle.HOWL> barry@python.org (Barry A. Warsaw) writes: > >We checked the email package into spambayes because we wanted to use >the new api and avoid the bugs which were present in earlier versions >of the libary. Python 2.2.2 and Python 2.3a0 have the same, latest >version of the email package now, so I think this directory isn't >necessary, and may be harmful. I'd like to remove it, but it means if >you're running Python 2.2.1, you'll need to upgrade. I'd rather you didn't remove the directory, since python 2.2.2 is not easily available for debian woody. (It appears that 2.2.2 has only been packaged in the unstable branch... which many of us assiduously avoid.) - Alex From dereks@itsite.com Thu Oct 24 23:35:14 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Thu, 24 Oct 2002 15:35:14 -0700 (PDT) Subject: [Spambayes] Get rid of the email directory? In-Reply-To: <15800.23881.157471.402911@gargle.gargle.HOWL> Message-ID: > Any objections? I object because > [...] if you're running Python 2.2.1, you'll need to upgrade. and I intend on using hammie.py in a production environment where upgrading Python would be a big deal. Please let one or two more minor-version releases come out before removing the directory. Just my $0.02. --Derek From tim.one@comcast.net Fri Oct 25 15:56:44 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 10:56:44 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: [Tim, remarks about an Outlook client "bug" that caused Asian spam to get nailed via replacing most high-bit chars with question marks, leading to clue lists like this one: ] > Spam Score: 1 > > '*H*' 0 > '*S*' 1 > 'header:Return-Path:1' 0.611133 > 'header:Message-ID:1' 0.813889 > '15????' 0.844828 > '24????' 
0.844828 > '7??????' 0.844828 > '&' 0.863317 > 'header:Mime-Version:1' 0.89556 > 'header:Reply-To:1' 0.90756 > '10????' 0.934783 > '??????!!!' 0.934783 > 'header:Received:2' 0.957828 > '??????????)' 0.958716 > '??????...' 0.965116 > '????????...' 0.965116 > 'message-id:@cpimssmtpa05.msn.com' 0.969799 > 'from:email addr:korea.com>' 0.980349 > '(????' 0.981928 > '??.' 0.985437 > 'e-mail??????' 0.986322 > '????,' 0.99505 > '????????,' 0.995258 > '??????,' 0.99545 > '????????.' 0.997691 > '??????????.' 0.99776 > 'skip:? 20' 0.998034 > '????????????' 0.998192 > '??????????' 0.998474 > '??????' 0.998562 > '????' 0.998598 > '????????' 0.998672 > 'skip:? 10' 0.998894 MarkH subsequently fixed that bug by accident , while greatly speeding the Outlook operations and making the Outlook client more robust. My Asian spam is *still* nailed, but via clue lists like this now: 'skip:\x92 40' 0.958716 'skip:\x95 40' 0.958716 'skip:\x96 30' 0.958716 'skip:\x93 30' 0.965116 'skip:\x93 50' 0.965116 '8bit%:58' 0.969799 'skip:\x82 10' 0.969799 'skip:\x83 30' 0.969799 'skip:\x8d 30' 0.969799 'skip:\x93 20' 0.969799 'subject:==?=' 0.969799 'skip:\x81 60' 0.973373 'skip:\x93 10' 0.973373 'url:jp' 0.973373 'skip:\x81 10' 0.97619 'skip:\x81 40' 0.97619 'skip:\x82 30' 0.97619 'subject:GyRCTCQ' 0.97619 'subject:iso' 0.978469 '8bit%:69' 0.980349 'skip:\x81 30' 0.980349 'skip:\x81 20' 0.981928 '8bit%:97' 0.983271 '8bit%:72' 0.988432 '8bit%:83' 0.990405 '8bit%:87' 0.990798 '8bit%:91' 0.991159 '8bit%:81' 0.99236 '8bit%:56' 0.993274 '8bit%:88' 0.994148 '8bit%:68' 0.9947 '8bit%:85' 0.9947 '8bit%:94' 0.994822 '8bit%:50' 0.994938 '8bit%:80' 0.995258 '8bit%:75' 0.99545 'subject:=?' 0.996151 '8bit%:86' 0.996562 '8bit%:93' 0.99776 '8bit%:100' 0.998375 The downside for me is that the database size took a significant hit, just because there are a lot more potential "skip" tokens than strings of question marks. 
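The two tokenizer schemes being compared come down to a few lines. A rough sketch of both ideas, with hypothetical helper names and a length bucket of 10 inferred from clues like 'skip:? 20' (an illustration, not the actual spambayes source):

```python
import re

HIGHBIT = re.compile(r"[\x80-\xff]")

def replace_nonascii(text):
    # The accidental Outlook scheme: every high-bit character becomes
    # '?', so text in a non-ASCII language tokenizes as runs of '?'.
    return HIGHBIT.sub("?", text)

def tokenize_word(word, max_len=12):
    # Short words are tokens as-is; overly long words collapse into a
    # 'skip:<first char> <length bucketed by 10>' metatoken, so very
    # few distinct skip tokens are possible and the database stays small.
    if len(word) <= max_len:
        yield word
    else:
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

# A 23-character high-bit word under both schemes combined:
print(list(tokenize_word(replace_nonascii("\xbe" * 23))))  # ['skip:? 20']
```

The bucketing is what keeps the database burden trivial: 'skip:? 20' covers every question-mark run of length 20 through 29.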
WRT correlation effects, a msg that has an 8bit% metatoken under this scheme is likely to have lots of them, but is also likely to have lots of distinct '?'*n tokens under the other scheme; in both cases, counting them all as distinct clues actually helps nail this stuff as spam. Unless someone has a strong objection, I expect to introduce a new option: """ [Tokenizer] # If true, replace high-bit characters (ord(c) >= 128) and # control characters with question marks. This allows # non-ASCII character strings to be identified with little # training and small database burden. It's appropriate only # if your ham is plain 7-bit ASCII, or nearly so, so that # the mere presence of non-ASCII character strings is known # in advance to be a strong spam indicator. replace_nonascii_chars: False """ From jeremy@zope.com Fri Oct 25 17:02:33 2002 From: jeremy@zope.com (Jeremy Hylton) Date: Fri, 25 Oct 2002 12:02:33 -0400 Subject: [Spambayes] pop3proxy bug? (resend) Message-ID: I'm resending this message because python.org rejected it the first time around. To: richiehindle@users.sourceforge.net Cc: spambayes@python.org Subject: pop3proxy bug? Reply-to: jeremy@alum.mit.edu I tried to use pop3proxy.py but it failed every time it tried to send data to the real pop server. The traceback printed by asyncore is this: error: uncaptured python exception, closing channel (exceptions.IOError:[Errno 9] Bad file descriptor [/usr/local/lib/python2.2/asyncore.py|poll|99] [/usr/local/lib/python2.2/asyncore.py|handle_read_event|396] [/usr/local/lib/python2.2/asynchat.py|handle_read|130] [/home/jeremy/src/spambayes/pop3proxy.py|found_terminator|187]) When I look at the code, I don't see how it could ever work :-(. found_terminator() is calling self.serverFile.write(), but self.serverFile was produced by calling makefile() on a socket. With makefile() you get either a readable file or a writeable file. pop3proxy.py is using makefile() with no arguments so it gets a readable file. 
There's no way to write to this file. I changed the code to use the raw server socket and sendall() instead of self.serverFile.write() and it worked. But I'm uneasy. Did you ever test this code? Jeremy From tim.one@comcast.net Fri Oct 25 17:36:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 12:36:08 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: [Tim] > ... > Unless someone has a strong objection, I expect to introduce a new option: > > """ > [Tokenizer] > # If true, replace high-bit characters (ord(c) >= 128) and > # control characters with question marks. This allows > # non-ASCII character strings to be identified with little > # training and small database burden. It's appropriate only > # if your ham is plain 7-bit ASCII, or nearly so, so that > # the mere presence of non-ASCII character strings is known > # in advance to be a strong spam indicator. > replace_nonascii_chars: False > """ This has been added, and is False by default. However, it's True by default for users of the Outlook 2000 client, since I can't remember the last time Mark or Sean asked me a question in Korean. From jeremy@alum.mit.edu Fri Oct 25 22:19:25 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 17:19:25 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment Message-ID: <15801.46429.507385.482352@slothrop.zope.com> I don't know if anyone else on Earth wants to manage their mail the same way I do. I've made some progress on hooking my mail up to spambayes, however, and wanted to report on the deployment issues. I read my mail with VM, an emacs mail reader. My mail collects on a couple of POP servers, and I fetch the mail directly from the POP servers using VM. I addressed the following issues: - Incremental training from VM folders - Scoring via a POP proxy - Management of training data using ZODB (I don't know if the last part was necessary or not, but I wanted to use ZODB.
I think it's simplified some things.) The runtime environment is fairly complicated. It's got more moving parts than I would like, but I don't know how to eliminate any of them. It's also slower than I would like, but I haven't done enough profiling to really understand why. There are a few open issues: - It was hard to use the classifier module with ZODB because of the __slots__. I ended up using the WordInfo objects unchanged, and __slots__ there helped minimize storage. But I wanted to make the Bayes class persistent and I couldn't do that because of the slots. Since there's only a single Bayes instance, I can't see why it needs to use __slots__. - I thought it would be nice if spambayes was a package, so I could separate it from my code. It can't work as a package, though, because it contains a copy of the email package. When I turned spambayes into a package, it ended up treating email as a subpackage. My apps ended up getting two copies of the email package loaded -- one from the std library and one as a subpackage of spambayes. The duplication broke a bunch of isinstance() tests. - Configuration. It would be nice to use the existing options framework and extend it with application-specific options (like the POP ports, the ZEO server location, etc.). It isn't clear what the best way to extend Options is. The different components involved in the setup are: - A ZEO server managing a ZODB database. I have a long-running ZEO server process. By using ZEO, multiple clients can access the database at the same time. Clients connect to the server using a Unix domain socket. - A persistent mail profile based on VM folders. The profile is stored in the database. A VM folder is just a Unix mailbox. A config file contains a list of folders that contain ham and a list of folders that contain spam. The profile manages these folders and a spambayes classifier. - A training program, update.py. The training program scans the folders listed in the profile.
When it finds new messages, it learns from them. When it finds that a message was deleted, it unlearns it. This process is incremental, but it depends on the mailbox module to parse the folders. The parsing is definitely slow -- especially for large folders. - A POP3 proxy I wrote my own proxy based on SocketServer.ThreadingTCPServer. I don't like the asynchat style of programming, and I was having trouble integrating pop3proxy with ZEO. They both use ZEO, but the way they use them seemed to be causing deadlocks :-(. The proxy uses the same strategy as pop3proxy, intercepting messages and adding a spam score header. I add a header like this: From: Martijn Pieters To: (Zope.Com Geeks) Cc: sa@zope.com Subject: [Zope.Com Geeks] Zope.org storage server was down.. Date: Fri, 25 Oct 2002 17:10:42 -0400 X-Spambayes: 0.001 The proxy doesn't do anything other than add the header. - A set of VM filters and tools for handling spam and training. I wrote some little elisp functions. One saves a message to the spam training folder and deletes it. Another saves a message to the ham training folder, but does not delete it. A third pipes it to a small Python script that prints out the evidence for a message. The next step is to add autofoldering rules that file spam above a certain threshold to the spam folder and messages in the middle to an unsure folder. That's a standard VM thing, but I haven't done it yet. The total code base is about 2000 lines of code, half of it in the POP proxy. I'd be happy to check it in to the spambayes project if anyone else wants to try to use parts of it. Jeremy From dereks@itsite.com Fri Oct 25 22:57:08 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 14:57:08 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: Where did you get your initial training corpses... carpals... um, collections of email? Just personal stuff lying around?
I am still after a nice "real world" hammie.db. (I'll buy a pizza for the first person to send me a good .db file, just include your address, topping list, and the phone number of your favorite local pizza joint in a private email to me.) Not having a nice .db to start out with seems like a pretty heavy barrier for [potential] new users. We need to go searching through undocumented code just to figure out how to play with it. Thanks, Derek From popiel@wolfskeep.com Fri Oct 25 22:58:35 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 14:58:35 -0700 Subject: [Spambayes] Where are we heading? Message-ID: <20021025215835.57F16F5A4@cashew.wolfskeep.com> It seems like all the work for the last week or so has been on integration of the classifier with end-user deployments (clients, mailing list filters, whathaveyou). Have we reached the point where we're no longer interested in this as a research project, but instead as a useful tool? If so, I suggest that we may want to rewrite the whole thing from scratch, after actually deciding on a usage model or two. Choosing the algorithms to use (gary-combining or chi-square?) would be good, too. What we've got now is a decent prototype, but it lacks quite a bit as a finished tool... there are a lot of issues with database storage (what should be in it, how it should be stored, etc.) and options management, just to name two of the hotspots. Personally, I'm still interested in the research aspects; once I get another two free hours to rub together, I'm going to see if I can deal with some of the mail decoding issues in the tokenizer (the unencoded mailing-list footer appended to a base64 body, to be specific). There's also a few experiments I'd like to see revisited: the time of delivery stuff might be interesting to test on multiple corpora, as an example (since my spam does not seem to be evenly spread throughout the day, unlike the original experimenter's spam). 
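If anyone wants to rerun the time-of-delivery experiment, the clue generation itself is nearly a one-liner: pull the hour out of the Date header and emit it as a metatoken, letting training decide whether 3am mail is spammy for your corpus. A sketch (a hypothetical tokenizer addition, nothing like this exists in spambayes today):

```python
import email.utils

def hour_token(date_header):
    # Map an RFC 2822 Date header to a coarse 'hour:NN' metatoken.
    # parsedate() returns the time as written, i.e. the sender's local
    # clock, which is what matters if ham clusters in waking hours
    # while spam arrives around the clock.
    parsed = email.utils.parsedate(date_header)
    if parsed is None:
        return "hour:invalid"       # an unparseable date is a clue too
    return "hour:%02d" % parsed[3]  # index 3 is tm_hour

print(hour_token("Fri, 25 Oct 2002 17:10:42 -0400"))  # hour:17
```

Whether such tokens help or just add correlated noise is exactly the kind of question the usual cross-validation runs would answer.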
- Alex From popiel@wolfskeep.com Fri Oct 25 23:21:38 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 15:21:38 -0700 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak of "Fri, 25 Oct 2002 14:57:08 PDT." References: Message-ID: <20021025222138.EC47BF5A4@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > Where did you get your initial training corpses... carpals... >um, collections of email? Just personal stuff lying around? I personally get my corpora by adding a procmail entry to save all my incoming email to a folder that I never touch, before doing any other filing on it. Then, as I process my mail, I move any spam I get into a spam folder. The spam folder acts as my spam corpus, and the everything - spam stuff acts as my ham corpus. Do this for about a month, and you should have some decent size corpora. (It took me about a month and a half to get above the 2000 ham and 2000 spam limit that Tim set for doing algorithm shootouts. :-) ) > I am still after a nice "real world" hammie.db. (I'll buy a pizza >for the first person to send me a good .db file, just include your >address, topping list, and the phone number of your favorite local pizza >joint in a private email to me.) I think sharing dbs is actually a very _BAD_ idea. Sure, it saves some initial effort, but it encourages a tendency to just take the stock db and never retrain. One of the things I like most about this system is how easily and automatically it customizes itself to your personal mail patterns... which means that spammers will have a harder time defeating it (since there's no single widespread db to defeat). > Not having a nice .db to start out with seems like a pretty heavy >barrier for [potential] new users. We need to go searching through >undocumented code just to figure out how to play with it. I agree that the documentation needs to be improved, if this is to be used by anyone other than researchers. 
I don't think that providing a starter db is the right way to make up for the lack of documentation. :-) - Alex From jeremy@alum.mit.edu Fri Oct 25 23:27:42 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 18:27:42 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.50526.459467.387029@slothrop.zope.com> >>>>> "DS" == Derek Simkowiak writes: DS> Where did you get your initial training corpses... carpals... DS> um, collections of email? Just personal stuff lying around? I started with a few messages from my existing VM folders. I've also got two training folders that I just created. I'm adding any message that wasn't classified correctly to the training folder. For example, if a ham comes in and its score isn't < 0.10, I'm training on it. Same for spam, but the min score is 0.95. I've got some new key bindings that automatically save messages in the appropriate folder. DS> I am still after a nice "real world" hammie.db. (I'll buy a DS> pizza DS> for the first person to send me a good .db file, just include DS> your address, topping list, and the phone number of your DS> favorite local pizza joint in a private email to me.) I don't think you want someone else's database. Their ham might be your spam, or vice versa. Tim has mentioned a couple of times the example of Guido's email about hotels. Guido gets a non-trivial amount of email about hotels for conferences. He would have to train his classifier to recognize messages about hotels as ham, but that probably makes it more likely he'll get spams advertising discount hotels. The details of what exactly your ham looks like are pretty personal. The spam is easy to collect, unless you don't get much spam. And if you don't get much spam, it's hardly a problem. DS> Not having a nice .db to start out with seems like a pretty DS> heavy barrier for [potential] new users.
We need to go DS> searching through undocumented code just to figure out how to DS> play with it. I agree that there are a lot of problems to be solved before potential new users can try things out. I think an initial training database is a pretty minor problem. I just spent an entire day getting the POP proxies hooked up to a training database, and I still have a bubble-gum-and-bailing-wire solution. Jeremy From jeremy@alum.mit.edu Fri Oct 25 23:30:46 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 18:30:46 -0400 Subject: [Spambayes] Where are we heading? In-Reply-To: <20021025215835.57F16F5A4@cashew.wolfskeep.com> References: <20021025215835.57F16F5A4@cashew.wolfskeep.com> Message-ID: <15801.50710.553887.279223@slothrop.zope.com> >>>>> "TAP" == T Alexander Popiel writes: TAP> It seems like all the work for the last week or so has been on TAP> integration of the classifier with end-user deployments TAP> (clients, mailing list filters, whathaveyou). Have we reached TAP> the point where we're no longer interested in this as a TAP> research project, but instead as a useful tool? I think we've reached the point where the classifier is good enough to be useful for my email. This issue is pretty much independent of the need for further research on the algorithms. I hope there is continued progress on the classifier. But there's also a big gulf between a good enough classifier and a usable spam filtering system. I'm hoping to contribute to the latter. Jeremy From dereks@itsite.com Sat Oct 26 00:06:26 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 16:06:26 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.50526.459467.387029@slothrop.zope.com> Message-ID: > I don't think you want someone else's database. Their ham might be > your spam, or vice versa. A couple of people have mentioned this, and while I see the point, I disagree. Let me explain why.
The differences between one person's ham and another individual's spam (such as the hotel conference-info example) are far less significant than the difference between one person's ham and everyone's spam. That is, the strongest indicators like "color=#FF0000" and porn-type swearwords are not likely to appear in anyone's ham. At least, not nearly as frequently as they will be found in most of the spams that are out there. I take it for granted that a general starter.db file will not be very accurate for my particular needs. But I should be able to set a fairly high cutoff value and get 80% to 90% of real-world spams correctly flagged right out of the gate -- that's heads and tails above having nothing at all, when trying to learn how this stuff works. But most importantly, training a starter.db for my specialized needs is far easier as "step two" than creating a .db from scratch is as "step one". And that is why I'm asking for a .db file. > I just spent an entire day getting the POP proxies hooked up to a > training database, and I still have a bubble-gum-and-bailing-wire > solution. I just used the Postfix-with-SpamAssassin instructions and replaced SpamAssassin with hammie.py in filter mode. For my needs, finding a nice "real world" starter corpus is what's holding me back. I'm not looking for a "documentation substitute". I'm just looking for something that will (a) tell me if I've installed the software correctly, and (b) correctly identify more than 80% of the spams that I feed it. So again, with full recognition that whatever somebody else has won't be tailored to my email lifestyle, I ask for the .db -- just to save me a few hours of ramp-up time. Once I've had a chance to dink around, and try out the software, I will know if I want to take the time necessary to collect, organize, and manually filter a highly-customized training corpus for my personalized needs.
The pizza offer still stands :) Thanks, Derek Simkowiak From tim.one@comcast.net Sat Oct 26 00:06:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:06:28 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: [Jeremy Hylton] > I don't know if anyone else on Earth wants to manage their mail the > same way I do. I've made some progress on hooking my mail up to > spambayes, however, and wanted to report on the deploment issues. Thanks for the report! > I read my mail with VM, an emacs mail reader. My mail collects on a > couple of POP servers, and I fetch the mail directly from the POP > servers using VM. > > I addressed the following issues: > > - Incremental training from VM folders > - Scoring via a POP proxy > - Management of training data using ZODB > > (I don't know if the last part was necessary or not, but I wanted to > use ZODB. I think it's simplified some things.) > > The runtime environment is fairly complicated. It's got more moving > parts than I would like, but I don't know how to eliminate any of > them. Check out the Outlook2000 directory -- there's already more code there than in the tokenizer and classifier combined. It's a remarkable and very capable GUI, but still, email clients seem universally poorly designed for programmability. > It's also slower than I would like, but I haven't done enough > profiling to really understand why. MarkH made great progress in speeding the Outlook client via finding a way to tell Outlook to deliver "batches" of msgs. It's still at best twice as slow (when bulk training or bulk classifying) as when running in "one msg per plain text file" tests, but it's at least 30 msgs/second, and I don't notice the speed drag at all when it's doing auto-filtering of incoming email. 
I do notice the increase in Outlook startup time, as it drags in several pickles and lots of Python code (including mounds of the Python win32 extensions). Just for fun, I'd suggest training in a different way: start with an empty database and forget batch training! Feed it examples from your live email. The system does better than chance after training on one ham and one spam, and it's fun & gratifying to see it get better in response to your training efforts. I've done that a few times now, and one day's worth of ham and spam (of which I admittedly get a lot in a day -- about 100 spam) has always been enough that it did better on its own then than my previous collection of by-hand Outlook rules (which I eventually reduced to one, because they made so many mistakes -- I don't use any now, except for spambayes-based "spam" and "unsure" rules). > There are a few open issues: > > - It was hard to use the classifier module with ZODB because of the > __slots__. My understanding here is that this is a problem with inheritance from ZODB's Persistent class. > I ended up using the WordInfo objects unchanged, and __slots__ there > helped minimize storage. But I wanted to make the Bayes class > persistent and I couldn't do that because of the slots. Since > there's only a single Bayes instance, I can't see why it needs to > use __slots__. There may be more than one Bayes instance (for example, I believe Sean True routinely uses several, faking N-way classification via chaining differently trained binary classifiers), but the real reason I used slots here was for their better error-detecting capabilities. This *was* very rapidly changing research code, and __slots__ caught mistakes early. Fine by me if we nuke the Bayes __slots__ now. > - It thought it would be nice if spambayes was a package, so I could > separate it from my code. It can't work as a package, though, > because it contains a copy of the email package.
When I turned > spambayes into a package, it ended up treating email as a > subpackage. My apps ended up getting two copies of the email > package loaded -- one from the std library and one as a subpackage > of spambayes. The duplication broke a bunch of isinstance() tests. As Barry pointed out yesterday, Python 2.2.2 users don't need the duplicated email pkg at all. Neither do people using CVS Python. We should nuke it. People who want to run under 2.2.1 should then work out what they need to do to fiddle their PYTHONPATH to get a backported copy loaded. > - Configuration. It would be nice to use the existing options > framework and extend it with application-specific options (like the > POP ports, the ZEO server location, etc.). It isn't clear what the > best way to extend Options is. Name one way, and it will automatically become "the best". Something that's been a minor problem in the Outlook client: as soon as you load any module in the spambayes core, it imports Options.py, and sometimes makes module compile-time decisions based on the option values then in effect. Setting the BAYESCUSTOMIZE envar after that point has no effect, since Options has already been loaded. > The different components involved in the setup are: > > - A ZEO server managing a ZODB database. And you marvel at how many moving parts you've got? > I have a long-running ZEO server process. By using ZEO, multiple > clients can access the database at the same time. Clients connect > to the server using a Unix domain socket. YAGNI for *you*, right? > - A persistent mail profile based on VM folders. > > The profile is stored in the database. A VM folder is just a Unix > mailbox. A config file contains a list of folders that contain ham > and a list of folders that contain spam. The profile manages these > folders and a spambayes classifier.
Mark added another database to the Outlook client: a mapping from (Outlook) message id to whether it's been trained on as ham or spam (and a message id is absent if neither). So far this has at least two good effects: (1) if a mistake is moved from one flavor of training folder to another, the system automatically knows to untrain it from the wrong flavor; (2) folder-based training is much faster now, as it doesn't even bother to fetch msgs it already trained on. > - A training program, update.py. > > The training program scans the folders listed in the profile. When > it finds new messages, it learns from them. When it finds that a > message was deleted, it unlearns it. I don't think you'll want that over time: If a msg has been deleted, fine, it's gone but still trained. Right now I'm carrying around many megabytes of useless spam in my Outlook store, and that has lots of bad effects: longer backup times, much longer scanpst times (the Outlook "inbox repair tool"), and very much longer times to transfer my msg store between laptop and desktop. Only a researcher wants to carry dead spam around forever. > This process is incremental, but it depends on the mailbox module > to parse the folders. The parsing is definitely slow -- especially > for large folders. Perhaps your moral equivalent to the Outlook client's msgid -> training status map would be a jeremy_msg_id -> seek offset map, along with a highwater mark offset to distinguish old from new msgs. > - A POP3 proxy > > I wrote my own proxy based on SocketServer.ThreadingTCPServer. I > don't like the asynchat style of programming, and I was having > trouble integrating pop3proxy with ZEO. They both use ZEO, but the > way they use them seemed to be causing deadlocks :-(. That's unheard of in ZEO . > The proxy uses the strategy as pop3proxy, intercepting messages and > adding a spam score header. 
I add a header like this: > > From: Martijn Pieters > To: (Zope.Com Geeks) > Cc: sa@zope.com > Subject: [Zope.Com Geeks] Zope.org storage server was down.. > Date: Fri, 25 Oct 2002 17:10:42 -0400 > X-Spambayes: 0.001 > > The proxy doesn't do anything other than add the header. Is your ultimate email reader programmable enough to "do something" with this? One prediction I made for myself turned out to be just right: moving things automagically into Probable-Spam and Unsure folders is exactly what I wanted and turns out to be exactly what I still want. Works great. > - A set of VM filters and tools for handling spam and training. > > I wrote some little elisp functions. One saves a message to the > spam training folder and deletes it. Another saves a message to > the ham training folder, but does not delete it. A third pipes it > to a small Python script that prints out the evidence for a message. > > The next step is to add autofoldering rules that file spam above a > certain threshold to the spam folder and messages in the middle to > an unsure folder. That's a standard VM thing, but I haven't done it > yet. Thank you for answering my questions so quickly. > The total code base is about 2000 lines of code, half of it in the POP > proxy. I'd be happy to check it in to the spambayes project if anyone > else wants to try to use parts of it. I'm bothered that you had no luck with the POP3 proxy already checked in. Who's using that, and why didn't it work for Jeremy? From tim.one@comcast.net Sat Oct 26 00:09:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:09:17 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Derek Simkowiak] > ... > I am still after a nice "real world" hammie.db. (I'll buy a pizza > for the first person to send me a good .db file, just include your > address, topping list, and the phone number of your favorite local pizza > joint in a private email to me.) You don't need one: just start. 
Train it on examples from your live email. It learns quickly. > Not having a nice .db to start out with seems like a pretty heavy > barrier for [potential] new users. We need to go searching through > undocumented code just to figure out how to play with it. Follow my suggestion, and you'll discover that you still don't know what to do -- it's not the lack of a prepackaged database that's stopping you. From dereks@itsite.com Sat Oct 26 00:18:55 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 16:18:55 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > > I don't think you want someone else's database. Their ham might be > > your spam, or vice versa. I just thought of another argument for a stock "starter.db". How can we test out new algorithms if the project doesn't have a control group? We have no way of knowing if someone's successful (or poor) results are an attribute of the new algorithm, or if it's an attribute of their particular sample data. Having a starter.db would both (a) make life easier for getting started, and (b) give us a well-established baseline to test against. --Derek From tim.one@comcast.net Sat Oct 26 00:34:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:34:33 -0400 Subject: [Spambayes] Where are we heading? In-Reply-To: <20021025215835.57F16F5A4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > It seems like all the work for the last week or so has been > on integration of the classifier with end-user deployments > (clients, mailing list filters, whathaveyou). Pretty much, yes. > Have we reached the point where we're no longer interested in > this as a research project, but instead as a useful tool? I'm no longer *primarily* interested in this as a research project -- the results on the corpora I'm targeting have been so good for more than a month that I couldn't measure an improvement if one were to be made. Time to move it along. 
> If so, I suggest that we may want to rewrite the whole thing > from scratch, after actually deciding on a usage model or two. There's no point to that I can see, so far as the "heavy lifting" code goes -- the classifier has had the same interface since the day it was first written, and the tokenizer has changed interface in only minor ways. IOW, there's nothing in need of refactoring there, else it would have been refactored already. Reworking the WordInfo structure is overdue, but it's hard to know what to do with that before we know more about how decisions affect accuracy over time (I'm thinking of database cleaning here). > Choosing the algorithms to use (gary-combining or chi-square?) > would be good, too. I don't mind supporting both (the differences are trivial at the code level). I do intend to get rid of mixed-combining, and want to make chi-combining the default. > What we've got now is a decent prototype, but it lacks quite a bit > as a finished tool... Like, approximately, everything <0.5 wink>. It's a 10,000 horsepower engine without seats, tires, or a steering wheel now. > there are a lot of issues with database storage (what should be in it, > how it should be stored, etc.) People won't agree on that -- let 1000 databases bloom. A clean API for database interface would be nice, *provided that* database heads would actually use it. I expect they're more likely to break into the internals "for speed". > and options management, just to name two of the hotspots. One thing I want to do too is purge useless options that are gumming up the works now. > Personally, I'm still interested in the research aspects; > once I get another two free hours to rub together, I'm going > to see if I can deal with some of the mail decoding issues in > the tokenizer (the unencoded mailing-list footer appended to > a base64 body, to be specific). Barry promised to do that "tomorrow", which was admittedly a week ago, but Barry works on his own calendar . 
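For anyone who hasn't stared at the combining code: chi-combining as referenced above runs the word probabilities through the chi-squared survival function twice, once from the ham direction and once from the spam direction, and splits the difference. A condensed sketch (the production classifier clamps probabilities and guards against log(0), which this omits):

```python
import math

def chi2q(x2, halfdf):
    # Survival function of the chi-squared distribution with 2*halfdf
    # degrees of freedom, via the standard exp(-m) * sum(m**i / i!) series.
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, halfdf):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # S approaches 1 when the clues are uniformly spammy, H when they
    # are uniformly hammy.  (S - H + 1) / 2 lands near 1 for spam, near
    # 0 for ham, and near 0.5 when the evidence conflicts -- the
    # "unsure" middle ground that makes this combiner attractive.
    n = len(probs)
    H = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), n)
    S = chi2q(-2.0 * sum(math.log(p) for p in probs), n)
    return (S - H + 1.0) / 2.0

print(round(chi_combine([0.99, 0.97, 0.95]), 3))  # strongly spammy clues
print(round(chi_combine([0.01, 0.03, 0.05]), 3))  # strongly hammy clues
```

A message with evenly conflicting clues (say, probabilities 0.2 and 0.8) scores exactly 0.5, which is why the middle ground maps so naturally onto an "unsure" folder.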
> There's also a few experiments I'd like to see revisited: the time of > delivery stuff might be interesting to test on multiple corpora, as an > example (since my spam does not seem to be evenly spread throughout > the day, unlike the original experimenter's spam). Yup, that's a good one. I'd also like to dig deeper into header-line tokenization: to date, I've gotten worse results on both my own email, and on a new "pure" collection of python.org traffic, when enabling *any* of these: mine_received_headers count_all_header_lines basic_header_tokenize I suspect but don't know this is mostly due to the dark side of the correlation effects Rob likes to worry about. For example, mine_received_headers presents IP and machine name info in several distinct and partially redundant ways, so that, e.g., email going thru python.org ends up with 6 good ham clues for that alone. I saw one spam get thru under basic_header_tokenize just because 6 different header lines happened to have the string "GMT" in them. Etc. I'm sure there's a world of info in the header lines that default tokenization is missing, but it remains unclear (to me) how to exploit it in a way that does more good than harm. From popiel@wolfskeep.com Sat Oct 26 00:37:39 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 16:37:39 -0700 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak References: Message-ID: <20021025233739.9C94DF5A4@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > I just thought of another argument for a stock "starter.db". > > How can we test out new algorithms if the project doesn't have a >control group? We have no way of knowing if someone's successful (or >poor) results are an attribute of the new algorithm, or if it's an >attribute of their particular sample data. That's why we have multiple people test anything that looks promising, and compare the variations across all the different runs. 
Since the classifications are reproducible over given corpora, we don't need control groups in the same way that biological experiments do. > Having a starter.db would both (a) make life easier for getting >started, and (b) give us a well-established baseline to test against. I disagree with (b), because changes in the tokenizer (where I suspect some of the advances will come from) will invalidate the database. - Alex From jeremy@alum.mit.edu Sat Oct 26 00:36:08 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 19:36:08 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.54632.545081.293386@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: TP> MarkH made great progress in speeding the Outlook client via TP> finding a way to tell Outlook to deliver "batches" of msgs. TP> It's still at best twice as slow (when bulk training or bulk TP> classifying) as when running in "one msg per plain text file" TP> tests, but it's at least 30 msgs/second, and I don't notice the TP> speed drag at all when it's doing auto-filtering of incoming TP> email. I do notice the increase in Outlook startup time, as it TP> drags in several pickles and lots of Python code (including TP> mounds of the Python win32 extensions). The POP proxy I'm using is a long-running process with a ZEO client connection. It just calls spamprob() for each email as it passes through the proxy. The ZEO/ZODB cache should do a good job of keeping recently used words in memory. (I'm storing the WordInfo objects in an OOBTree instead of a dict.) TP> Just for fun, I'd suggest training in a different way: start TP> with an empty database and forget batch training! Feed it TP> examples from your live email. I just started it off with a few things I was sure I didn't want to miss that don't show up in my email every day.
Examples: email from my brother and sister, order receipts from things I've bought online, etc. I also started with the 4 spams that were sitting in my INBOX. >> There are a few open issues: >> >> - It was hard to use the classifier module with ZODB because of >> the >> __slots__. TP> My understanding here is that this is a problem with inheritance TP> from ZODB's Persistent class. That's right. I'm going to fix that problem for ZODB4, but I wanted to use ZODB3 for this project. >> I ended up using the WordInfo objects unchanged, and __slots__ >> there helped minimize storage. But I wanted to make the Bayes >> class persistent and I couldn't do that because of the slots. >> Since there's only a single Bayes instance, I can't see why it >> needs to use __slots__. TP> There may be more than one Bayes instance (for example, I TP> believe Sean True routinely uses several, faking N-way TP> classification via chaining differently trained binary TP> classifiers), but the real reason I used slots here was for TP> their better error-detecting capabilities. This *was* very TP> rapidly changing research code, and __slots__ caught errors early. TP> Fine by me if we nuke the Bayes __slots__ now. Cool. TP> As Barry pointed out yesterday, Python 2.2.2 users don't need TP> the duplicated email pkg at all. Neither do people using CVS TP> Python. We should nuke it. People who want to run under 2.2.1 TP> should then work out what they need to do to fiddle their TP> PYTHONPATH to get a backported copy loaded. I agree that we should nuke it. >> - Configuration. It would be nice to use the existing options >> framework and extend it with application-specific options (like >> the POP ports, the ZEO server location, etc.). It isn't clear >> what the best way to extend Options is. TP> Name one way, and it will automatically become "the best" TP> . I haven't come up with one yet <0.5 wink>. I import Options from spambayes, then I add stuff to its all_options dict and call mergefiles a second time.
Yuck. TP> Something that's been a minor problem in the Outlook client: as TP> soon as you load any module in the spambayes core, it imports TP> Options.py, and sometimes makes module compile-time decisions TP> based on the option values then in effect. Setting the TP> BAYESCUSTOMIZE envar after that point has no effect, since TP> Options has already been loaded. This is one of the reasons I'd like something else :-). >> The different components involved in the setup are: >> >> - A ZEO server managing a ZODB database. TP> And you marvel at how many moving parts you've got ? This is one of the moving parts I'm not entirely happy with. But I hope to get to a point where I start the database and POP proxies when I boot my machine and leave them running all the time. >> I have a long-running ZEO server process. By using ZEO, multiple >> clients can access the database at the same time. Clients >> connect to the server using a Unix domain socket. TP> YAGNI for *you*, right? No. I want to be able to train the database while I'm fetching mail or scoring a particular message. Even though I'm a single user, I find it essential to have multiple processes reading and writing the classifier database concurrently. TP> Mark added another database to the Outlook client: a mapping TP> from (Outlook) message id to whether it's been trained on as ham TP> or spam (and a message id is absent if neither). So far this TP> has at least two good effects: (1) if a mistake is moved from TP> one flavor of training folder to another, the system TP> automatically knows to untrain it from the wrong flavor; (2) TP> folder-based training is much faster now, as it doesn't even TP> bother to fetch msgs it already trained on. That's an interesting point. I was thinking about "deletion from folder" as the mechanism to correct training mistakes. I think "shows up in the other folder" sounds like a good alternative. >> - A training program, update.py.
>> >> The training program scans the folders listed in the profile. >> When it finds new messages, it learns from them. When it finds >> that a message was deleted, it unlearns it. TP> I don't think you'll want that over time: If a msg has been TP> deleted, fine, it's gone but still trained. Right now I'm TP> carrying around many megabytes of useless spam in my Outlook TP> store, and that has lots of bad effects: longer backup times, TP> much longer scanpst times (the Outlook "inbox repair tool"), and TP> very much longer times to transfer my msg store between laptop TP> and desktop. Only a researcher wants to carry dead spam around TP> forever. I only train on spam when the existing classifier doesn't mark it as spam. I expect that the amount of spam I keep around won't be that big compared to all the other email that I keep :-). >> This process is incremental, but it depends on the mailbox module >> to parse the folders. The parsing is definitely slow -- >> especially for large folders. TP> Perhaps your moral equivalent to the Outlook client's msgid -> TP> training status map would be a jeremy_msg_id -> seek offset map, TP> along with a highwater mark offset to distinguish old from new TP> msgs. I was thinking that something like that would work. The mailbox module is passing the start and stop point of each message to _Subfile() before it calls the message factory. So if I hook that, I can store the location of the message in the database. Then I only need to check that the locations are still valid, which is true as long as messages aren't deleted. >> The next step is to add autofoldering rules that file spam above >> a certain threshold to the spam folder and messages in the middle >> to an unsure folder. That's a standard VM thing, but I haven't >> done it yet. TP> Thank you for answering my questions so quickly. As I learn more about auto foldering, I discover that my mail client doesn't do quite what I want. 
Once the labelled spam shows up in my INBOX, I can cause it to be deleted by a single key. Unfortunately, that interacts badly with the use of the feature to automatically guess what folder to save a message in when you want to keep it. I'll have to wait for Barry to come up with some serious elisp to move the spam to a secure location. >> The total code base is about 2000 lines of code, half of it in >> the POP proxy. I'd be happy to check it in to the spambayes >> project if anyone else wants to try to use parts of it. TP> I'm bothered that you had no luck with the POP3 proxy already TP> checked in. Who's using that, and why didn't it work for TP> Jeremy? I sent an email earlier about the problem where the proxy attempts to write to a read-only file-wrapping-a-socket. I just don't see how the code can work as written. It's even worse, though, that it uses asyncore. I found asyncore added a lot of complexity to ZEO and would rather we hadn't used it. Then add in a second asyncore app (the proxy) and you've got real trouble. The complexity seems to be multiplicative rather than additive. Jeremy From jeremy@alum.mit.edu Sat Oct 26 03:15:09 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 22:15:09 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.64173.255155.281547@slothrop.zope.com> Earlier I reported that the pop proxy was slow. It's now a lot faster, and I didn't change a stitch of code. I guess the network between me and the real POP servers was very slow this afternoon. Jeremy From tim.one@comcast.net Sat Oct 26 03:20:39 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 22:20:39 -0400 Subject: [Spambayes] Proposing to drop use_mixed_combining In-Reply-To: Message-ID: Proposing to drop the options: use_mixed_combining mixed_combining_chi_weight They haven't worked better than chi_combining for anyone yet.
From tim.one@comcast.net Sat Oct 26 03:29:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 22:29:34 -0400 Subject: [Spambayes] Proposing to drop ignore_redundant_html In-Reply-To: Message-ID: Proposing to drop the option ignore_redundant_html This has been False by default for a long time, and there are no known clients. I used it early in the project, before we stripped HTML tags, else (at the time) there was no way to get any multipart/alternative msg with a text/html part to score as ham in the c.l.py tests. Since then, A. We strip HTML tags by default (and &nbsp; character entities -- that's a change I made recently I probably didn't announce here, although I mentioned it often enough ). B. We know that sometimes multipart/alternative msgs have different content in the text/plain and text/html parts, and in particular that some spam can be identified only by staring at the HTML part. C. We no longer count multiple instances of a word in a msg multiple times during training. So if text/html and text/plain parts are in fact redundant, training isn't affected by seeing the content twice. It used to be. IOW, ignore_redundant_html has nothing going for it anymore. From tim.one@comcast.net Sat Oct 26 04:05:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 23:05:12 -0400 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message-ID: The current default combining scheme is anonymous, so this proposal amounts to two things: 1. Introduce option use_gary_combining, defaulting to False, meaning the combining scheme that's currently the default. 2. Change the default for use_chi_combining to True. 2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff to 0.90. I'll introduce a named option for #1 in any case, since an anonymous behavior is a Bad Idea regardless.
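Spelled out as an options file, the proposal above would amount to something like the following sketch. The [Classifier]/[TestDriver] section names and the use_chi_squared_combining spelling are taken from the options dumps posted elsewhere on this list; treat the exact spellings as assumptions.

```ini
[Classifier]
use_gary_combining: False
use_chi_squared_combining: True

[TestDriver]
ham_cutoff: 0.20
spam_cutoff: 0.90
```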
Both combining schemes are 100% compatible at the database level -- they don't affect training at all (you can use either scheme to *score* msgs, using the same database). In all my tests, use_chi_combining works better, because it has a small (in # of msgs) middle ground spanning a large range of scores where most mistakes live, and the boundaries of the middle ground aren't touchy. In contrast, there still seems no way to predict good cutoff values for gary_combining; they're corpus- and training-data dependent. People still seem to have some fear of chi-combining because it makes extreme judgments (median score for spam is near 1.0, and median score for ham is near 0.0, in test after test), and this reminds them of the bad behavior of Graham-combining. The difference is that Graham-combining had no middle ground as training data increased, but chi-combining does. Indeed, the more training data there is, the more certain chi-combining seems to get about just *how* confused it is . FYI, in my personal email I use chi-combining all the time now. About 1% of incoming msgs (I get about 600 per day, w/ about 100 spam) end up in my Unsure folder, using ham_cutoff 0.30 and spam_cutoff 0.80. They're about evenly mixed between ham and spam, they "make sense" to me as Unsure msgs, and training on them correctly ASAP is very effective in preventing followups (for ham) or near-duplicates (for spam) from ending up in the Unsure bucket too. Hapaxes ("words" appearing uniquely in the msg) appear to play a large role in that last happy result.
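The chi-combining behavior described above comes from combining per-word spam probabilities through the chi-squared distribution (the Robinson/Fisher approach). Here is a minimal sketch of the idea, not the project's actual code; the function names and details are my own:

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-squared distribution, for even df."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    """Combine per-word spam probabilities (each strictly in (0,1))
    into one score in [0, 1]; higher means spammier."""
    n = len(probs)
    # S: spamminess evidence -- large when the probs cluster near 1.0
    S = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H: hamminess evidence -- large when the probs cluster near 0.0
    H = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # When both S and H are large (conflicting clues), the midpoint
    # lands near 0.5 -- the "confident about being confused" middle ground.
    return (S - H + 1.0) / 2.0
```

With the cutoffs discussed above, a score at or above spam_cutoff is filed as spam, at or below ham_cutoff as ham, and anything between lands in the Unsure middle ground.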
One spam has been left in my Inbox, which SpamAssassin let thru on the mailing-list version of comp.lang.python, so that the Mailman-inserted URL at the bottom gave it some strong ham clues:

http://mail.python.org/mailman/listinfo/python-list

'url:python-list' 0.00712585
'url:mailman' 0.0120696
'url:listinfo' 0.0121098
'url:python' 0.0170755

The text of the spam was

"""
python, A friend of yours, Michael (michael_suswanto@yahoo.com) thought you might like to check out this web page.

http://www.newmarketingsite.com/2848/

-- The coolest site in town
"""

and so it got strong help from mentioning "python," too. As spam goes, it wasn't particularly disgusting . There have been no false positives in this time. This reminds me that Jim Bublitz reported that using a system "for real" day to day gave even better results than contrived tests, and while I'm not running controlled experiments on my own email, that's my (subjective) impression too. For my personal use in real life, it's been pure delight. From tim.one@comcast.net Sat Oct 26 04:06:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 23:06:06 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.64173.255155.281547@slothrop.zope.com> Message-ID: [Jeremy Hylton] > Earlier I reported that the pop proxy was slow. It's now a lot > faster, and I didn't change a stitch of code. I guess the network > between me and the real POP servers was very slow this afternoon. Well, I checked in a change to the Outlook client. I figured that would cure your speed problems . From popiel@wolfskeep.com Sat Oct 26 05:23:54 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Fri, 25 Oct 2002 21:23:54 -0700 Subject: [Spambayes] Proposing to drop use_mixed_combining In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042354.2A0E7F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >Proposing to drop the options: > > use_mixed_combining > mixed_combining_chi_weight Hear! Hear! - Alex From popiel@wolfskeep.com Sat Oct 26 05:24:14 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 21:24:14 -0700 Subject: [Spambayes] Proposing to drop ignore_redundant_html In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042414.145B8F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >Proposing to drop the option > > ignore_redundant_html Sounds good. - Alex From popiel@wolfskeep.com Sat Oct 26 05:29:07 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 21:29:07 -0700 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042907.14850F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >The current default combining scheme is anonymous, so this proposal amounts >to two things: > >1. Introduce option use_gary_combining, defaulting to False, meaning > the combining scheme that's currently the default. > >2. Change the default for use_chi_combining to True. > >2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff > to 0.90. I'm slightly surprised at the looseness of 2', but as you say, the boundaries aren't all that touchy. I'm all for the above. 
- Alex From jbublitz@nwinternet.com Sat Oct 26 16:25:09 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sat, 26 Oct 2002 08:25:09 -0700 (PDT) Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message-ID: On 26-Oct-02 Tim Peters wrote: > This reminds me that Jim Bublitz reported that using a system > "for real" day to day gave even better results than contrived > tests, and while I'm not running controlled experiments on my > own email, that's my (subjective) impression too. For my > personal use in real life, it's been pure delight. Just like Beetlejuice, if you say my name, I appear :) Previously I just did testing in chronological order and got much better results than random testing. As of Sunday I turned on my new mail system which includes the spam filter, but also replaces fetchmail, procmail, cron (for mail anyway), and some of qmail (qmail still acts as my local smtp/pop3 server). I also have a whitelist in front of the spam filter (my fps can be expensive). It's completely in Python, of course. Over 6 full days of use, 1 or 2 spams per day get through and I've had a couple of fps total. The fns/fps are less than 1%. I haven't written anything to parse the logs yet so I don't have actual stats, and for the first few days I had to restart the mail system a number of times (bugs), so there hasn't been any way to accumulate actual results except the logs. As several other people have mentioned, I'm also juggling msgs between ham and spam folders based on the results of "manual" review. The only thing I think deserves mention is the review process. Every 20 spams received, the mail system puts together a msg with a list of Subj and From lines from 20 spams in score order. Next to each msg is a checkbox []. This email msg gets sent to a user (all 2 of us) on an alternating basis. The user replies to the msg to confirm the scoring was correct (leaves the checkboxes empty), or if a score looks wrong, puts an [x] in the box. 
When the mail system receives the reply, it forwards any checked msgs back to the user. If the checked msg was really spam, the user places it in a local spam folder, along with any fns; if it *was* ham they do nothing (the mail system has already moved the msg to the ham folder temporarily). At the end of the day the mail system empties all of the local spam folders and shifts msgs around again if req'd, and then retrains on the new mail. Reviewing 20 msgs at a time takes less than a minute. Doing it via email makes it more likely the msgs will actually get reviewed. The review email is much easier than having to scan a folder and delete each unwanted msg, but it still gives users a sense of control over the process, demonstrates how much spam is actually being blocked, and offers a sense of "victory over spammers". My wife likes it anyway, and she usually hates my UIs. We still end up looking at spam subject lines because we can't afford any fps, but real mail gets through more quickly and sorting is much more accurate and done more quickly with about the absolute minimum of user activity. In a few months I might have enough confidence to trust the .99 (spam) scores without review, or else have constructed a sizable blacklist that doesn't require scoring or review. The couple of fps so far have scored around .502 (0.5 cutoff) - one was a legitimate mail from a guy who works for a company that's well represented in my spam corpus. Jim From sholden@holdenweb.com Sat Oct 26 18:48:20 2002 From: sholden@holdenweb.com (Steve Holden) Date: Sat, 26 Oct 2002 13:48:20 -0400 Subject: [Spambayes] Some minor nits ... Message-ID: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> I've just been testing the pop3proxy with Outlook Express on Win2K. If I run it under Python 2.2.1 on Windows (the ActiveState distro, if it makes a difference), everything seems to work except that the X-Hammie-Disposition header is treated as a part of the message body, presumably due to the line ending that precedes it.
Could we make the line ending controllable by some sort of option, or are there specific reasons for sticking to RFC standards here :-)? Under cygwin (python 2.2.1) I see the following asyncore error:

error: uncaptured python exception, closing channel <__main__.BayesProxy connected 127.0.0.1:1111 at 0x102026c0> (exceptions.IOError:(0, 'Error')
[/tmp/python.576/usr/lib/python2.2/asyncore.py|poll|95]
[/tmp/python.576/usr/lib/python2.2/asyncore.py|handle_read_event|392]
[/tmp/python.576/usr/lib/python2.2/asynchat.py|handle_read|130]
[pop3proxy.py|found_terminator|181])

Not sure quite what that's about, I'll take a look if I get a chance. regards ----------------------------------------------------------------------- Steve Holden http://www.holdenweb.com/ Python Web Programming http://pydish.holdenweb.com/pwp/ Previous .sig file retired to www.homeforoldsigs.com ----------------------------------------------------------------------- From jeremy@alum.mit.edu Sat Oct 26 19:04:21 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Sat, 26 Oct 2002 14:04:21 -0400 Subject: [Spambayes] Some minor nits ... In-Reply-To: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> References: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> Message-ID: <15802.55589.455847.462428@slothrop.zope.com> >>>>> "SH" == Steve Holden writes: SH> Under cygwin (python 2.2.1) I see the following asyncore error: SH> error: uncaptured python exception, closing channel SH> <__main__.BayesProxy connected 127.0.0.1:1111 at 0x102026c0> SH> (exceptions.IOError:(0, 'Error') SH> [/tmp/python.576/usr/lib/python2.2/asyncore.py|poll|95] SH> [/tmp/python.576/usr/lib/python2.2/asyncore.py|handle_read_event|392] SH> [/tmp/python.576/usr/lib/python2.2/asynchat.py|handle_read|130] SH> [pop3proxy.py|found_terminator|181]) SH> Not sure quite what that's about, I'll take a look if I get a SH> chance. This is the error I saw on Linux.
I assume that means that makefile() on Windows doesn't care whether it's opened with a read or write mode. That's hardly surprising, although it might be a Python bug. Jeremy From popiel@wolfskeep.com Sat Oct 26 21:34:09 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 26 Oct 2002 13:34:09 -0700 Subject: [Spambayes] Mining the headers Message-ID: <20021026203410.166C1F54A@cashew.wolfskeep.com> Tim mentioned three tokenizer options (mine_received_headers, count_all_header_lines, basic_header_tokenize). I hadn't played with these yet, so I ran the 8 combinations of these. Summary: both mine_received_headers and basic_header_tokenize seem good for me, but count_all_header_lines is a minor lose.

r == mine_received_headers: False
R == mine_received_headers: True
c == count_all_header_lines: False
C == count_all_header_lines: True
b == basic_header_tokenize: False
B == basic_header_tokenize: True

Other options are:

[Classifier]
use_chi_squared_combining: True

[TestDriver]
show_false_negatives: False
show_false_positives: False
show_unsure: False
ham_cutoff: 0.20
spam_cutoff: 0.90

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename:        rcb       rcB       rCb       rCB       Rcb       RcB       RCb       RCB
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:          3         3         3         3         3         3         3         3
fp %:           0.15      0.15      0.15      0.15      0.15      0.15      0.15      0.15
fn total:         12        14        16        14        12        12        12        12
fn %:           0.60      0.70      0.80      0.70      0.60      0.60      0.60      0.60
unsure t:         53        37        50        39        40        31        37        32
unsure %:       1.32      0.93      1.25      0.97      1.00      0.78      0.93      0.80
real cost:    $52.60    $51.40    $56.00    $51.80    $50.00    $48.20    $49.40    $48.40
best cost:    $48.20    $45.20    $49.20    $45.60    $37.20    $38.80    $40.60    $38.60
h mean:         0.40      0.32      0.35      0.32      0.31      0.30      0.29      0.29
h sdev:         5.39      4.71      5.12      4.68      4.55      4.47      4.47      4.43
s mean:        98.45     98.68     98.35     98.68     98.75     98.85     98.72     98.85
s sdev:         9.76      9.57     10.46      9.58      9.08      9.06      9.37      9.11
mean diff:     98.05     98.36     98.00     98.36     98.44     98.55     98.43     98.56
k:              6.47      6.89      6.29      6.90      7.22      7.28      7.11      7.28

Yes, it looks like there's good info in the headers. Counting the header lines doesn't appear to be a helpful way to get at that information, but mining the received headers and just doing basic tokenization over all the headers both seem to work, and work even better together. This is on my website at: http://www.wolfskeep.com/~popiel/spambayes/headers - Alex From gward@python.net Sat Oct 26 22:11:18 2002 From: gward@python.net (Greg Ward) Date: Sat, 26 Oct 2002 17:11:18 -0400 Subject: [Spambayes] python.org corpus updated Message-ID: <20021026211118.GA29889@cthulhu.gerg.ca> Hi all -- I've just updated the python.org email corpus to include mail harvested last week, ie. from 2002-10-19 to 2002-10-24. (The harvest was supposed to run for a full week, ie. until this morning, but for some reason it stopped Thursday evening. Oh well.) As before, I'm not going to share this with just anyone. Let me know if you're interested, and I'll let you know the URL and password. If I've never met you personally, I'll probably ask for approval from Guido/Barry/Tim first.
Oh: there are undoubtedly spams in the ham folder and vice-versa; I've done a manual pass over all of the folders, but running them through a different spam filter always finds some errors. If you download the corpus and find mis-filed messages, please let me know and I'll update the canonical corpus accordingly. Greg -- Greg Ward http://www.gerg.ca/ Never try to outstubborn a cat. From gward@python.net Sat Oct 26 22:13:36 2002 From: gward@python.net (Greg Ward) Date: Sat, 26 Oct 2002 17:13:36 -0400 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <20021026211118.GA29889@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> Message-ID: <20021026211336.GA29902@cthulhu.gerg.ca> Oh, some depressing statistics. From the Sept harvest:

  dsn     1662 messages    8662 kB
  ham     3819 messages   11249 kB
  spam    1896 messages   16692 kB
  virus    991 messages  120758 kB

(dsn = delivery status notification = bounces, delay notifications, vacation mail, etc. I only kept 10% of the actual DSNs received.) And from October:

  dsn     1006 messages    5347 kB
  ham     2851 messages    7803 kB
  spam    2841 messages   22206 kB
  virus    634 messages   73754 kB

Note that the spam:ham ratio is now 1:1. *sigh* Greg -- Greg Ward http://www.gerg.ca/ Eschew obfuscation! From rob@hooft.net Sat Oct 26 22:25:08 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 26 Oct 2002 23:25:08 +0200 Subject: [Spambayes] hammie deployment without Outlook Message-ID: <3DBB0834.7050504@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Since I'm Linux-only, I've been trying to put hammie.py to work for me. I made the attached change to hammie.py. I trained it on a few hundred recent representative ham and spam messages.
Then I created an executable file "~/bin/hammie":

----
#!/bin/sh
cd $HOME/p/spambayes
/usr/local/bin/python hammie.py -d -f
----

In .forward I put (on one line):

----
"|exec /mnt/disk2/People/Development/hooft/bin/hammie |/usr/bin/procmail"
----

and I created a file "~/.procmailrc" with contents:

----
LOGFILE=procmail.log

:0:
* ^X-Hammie-Disposition: Yes
imap/spam

:0 c
* ^X-Hammie-Disposition: Unsure
imap/unsure

:0 c
* ^X-Hammie-Disposition: No
imap/ham
----

Any other hints from people with more experience? I'll be waiting for a mozilla plugin to make interactive use of spambayes.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.29
diff -u -r1.29 hammie.py
--- hammie.py	6 Oct 2002 23:07:23 -0000	1.29
+++ hammie.py	26 Oct 2002 21:23:40 -0000
@@ -59,6 +59,7 @@
 
 # Probability at which a message is considered spam
 SPAM_THRESHOLD = options.spam_cutoff
+HAM_THRESHOLD = options.ham_cutoff
 
 # Tim's tokenizer kicks far more booty than anything I would have
 # written.  Score one for analysis ;)
@@ -227,7 +228,8 @@
             import traceback
             traceback.print_exc()
 
-    def filter(self, msg, header=DISPHEADER, cutoff=SPAM_THRESHOLD):
+    def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
+               ham_cutoff=HAM_THRESHOLD):
         """Score (judge) a message and add a disposition header.
 
         msg can be a string, a file object, or a Message object.
@@ -245,10 +247,12 @@
         elif not hasattr(msg, "add_header"):
             msg = email.message_from_string(msg)
         prob, clues = self._scoremsg(msg, True)
-        if prob < cutoff:
+        if prob < ham_cutoff:
             disp = "No"
-        else:
+        elif prob > spam_cutoff:
             disp = "Yes"
+        else:
+            disp = "Unsure"
         disp += "; %.2f" % prob
         disp += "; " + self.formatclues(clues)
         msg.add_header(header, disp)
---------------------- multipart/mixed attachment--

From skip@pobox.com Sun Oct 27 00:18:25 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 18:18:25 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: <20021026203410.166C1F54A@cashew.wolfskeep.com> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> Message-ID: <15803.8897.442152.315985@montanaro.dyndns.org> Alex> Tim mentioned three tokenizer options (mine_received_headers, Alex> count_all_header_lines, basic_header_tokenize). I hadn't played Alex> with these yet, so I ran the 8 combinations of these. I've had three other options knocking around locally which haven't seemed to help or hurt when applied to my collections: mine_date_headers, generate_time_buckets, and extract_dow. The first controls overall attention to the Date: header. The second generates tokens like time:12:3 (the third six-minute bucket of the twelfth hour). The third generates tokens like dow:0 (Monday). Should I check them in to see if they are useful for other people? (I seem to have a bit different fp & fn results than others.)
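The time-bucket and day-of-week tokens described above can be pictured with a small sketch. The token formats come from the description; the function name and everything else here are guesses, not the actual option code:

```python
import time

def date_mining_tokens(t, nbuckets=10):
    """Yield date-mining tokens for a time.struct_time t.

    'time:H:B' puts the minute into one of nbuckets buckets within
    hour H (10 buckets of 6 minutes each); 'dow:N' uses tm_wday,
    so Monday == 0.
    """
    bucket = t.tm_min * nbuckets // 60
    return ["time:%d:%d" % (t.tm_hour, bucket), "dow:%d" % t.tm_wday]
```

For example, a message dated Thursday 00:03 would yield "time:0:0" and "dow:3".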
Skip

From skip@pobox.com Sun Oct 27 00:26:39 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 18:26:39 -0500 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <20021026211336.GA29902@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> <20021026211336.GA29902@cthulhu.gerg.ca> Message-ID: <15803.9391.949518.510594@montanaro.dyndns.org>

>>>>> "Greg" == Greg Ward writes:

    Greg> And from October:
    Greg> dsn    1006 messages   5347 kB
    Greg> ham    2851 messages   7803 kB
    Greg> spam   2841 messages  22206 kB
    Greg> virus   634 messages  73754 kB
    Greg> Note that the spam:ham ratio is now 1:1. *sigh*

Greg,

Do you get email from python-list? Its source is primarily the c.l.py newsfeed, which has been down for about five days now. According to the last message I saw from Barry, it seems there's a problem between Bay Mountain and UUNet. That would seriously impact your spam:ham ratio. Take a look at the statistics at the bottom of . It looks to me like October's going to be a light month as a result. If the feed isn't reestablished soon, newsgroup messages are likely to begin expiring before they get to the list archive.

Skip

From tim.one@comcast.net Sun Oct 27 01:31:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 21:31:59 -0400 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021026211118.GA29889@cthulhu.gerg.ca> Message-ID: This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment

[Greg Ward]
> ...
> Oh: there are undoubtedly spams in the ham folder and vice-versa; I've
> done a manual pass over all of the folders, but running them through a
> different spam filter always finds some errors. If you download the
> corpus and find mis-filed messages, please let me know and I'll update
> the canonical corpus accordingly.

Thanks, Greg!
I did the usual quick cheap-ass thing, taking the new data and just splitting it in half, training on the first half and predicting against the second. This seemed discouraging at first:

-> 4 new false positives
new fp: ['pyham/02155.txt', 'pyham/01816.txt', 'pyham/02322.txt', 'pyham/02406.txt']

but I believe they're all spam. I'll attach them for your review. They correspond, respectively, to your

    ham/184Q4b-0000MJ-00  2155
    ham/184Dc6-0003e0-00  1816
    ham/184UDM-0002Cn-00  2322
    ham/184Xs0-0007q6-00  2406

A difference from last time: since python.org already takes a Draconian view of Asian-language traffic, I enabled the new (late this week) tokenizer option replace_nonascii_chars. This allows detection of "hated languages" more reliably with a smaller database and less training data.

In the other direction (training on the 2nd half of the new data and predicting on the 1st half), the FP rate zoomed:

-> 9 new false positives
new fp: ['pyham/00277.txt', 'pyham/00278.txt', 'pyham/00275.txt', 'pyham/00267.txt', 'pyham/01346.txt', 'pyham/00261.txt', 'pyham/00276.txt', 'pyham/01284.txt', 'pyham/00645.txt']

Again I believe these are all spam, and some are so outrageously spam it's hard to believe SpamAssassin let them pass! Then again, most are in a hated language .

    ham/183BtE-00072Z-00   261
    ham/183DZB-0007dJ-00   267
    ham/183Epz-0001IH-00   275
    ham/183Epz-0001II-00   276
    ham/183Epz-0001IJ-00   277
    ham/183Epz-0001IK-00   278
    ham/183aCi-00024k-00   645
    ham/183ueG-0006vd-00  1284
    ham/183xNY-0008Gi-00  1346

Take those away and there were no false positives in either direction.

There's also ham in the spam, but there are a lot more of those to dig thru, they seem to be grosser errors than last time around, and I'm tired of this now. One example:

    spam/183UWS-00060A-00  633

seems a perfectly ordinary piece of mailman-users traffic.
chi-combining is quite certain it's ham:

    prob = 3.37424532759e-012
    prob('*H*') = 1
    prob('*S*') = 6.63913e-012

OTOH, SpamAssassin seems certain it's spam:

"""
Return-Path:
Envelope-To: mailman-users@python.org
Received: from northgate.starhub.net.sg ([203.117.1.53])
	by mail.python.org with esmtp (Exim 4.05)
	id 183UWS-00060A-00
	for mailman-users@python.org; Mon, 21 Oct 2002 00:50:37 -0400
Received: from sourcevisions.net (root@cm29.omega93.scvmaxonline.com.sg [218.186.93.29])
	by northgate.starhub.net.sg (8.12.5/8.12.5) with ESMTP id g9L4oXr2016531
	for ; Mon, 21 Oct 2002 12:50:34 +0800 (SST)
Received: from localhost (whitestar@localhost)
	by sourcevisions.net (8.11.6/8.11.6) with ESMTP id g9L4qx330186
	for ; Mon, 21 Oct 2002 12:52:59 +0800
Date: Mon, 21 Oct 2002 12:52:59 +0800 (SGT)
From: Terence
To: mailman-users@python.org
Subject: Need help on email headers
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-warning: 203.117.1.53 in blacklist at unconfirmed.dsbl.org
	(http://dsbl.org/listing.php?203.117.1.53)
X-Spam-Status: Yes, hits=5.4 required=5.0
	tests=MAILTO_WITH_SUBJ,RCVD_IN_MULTIHOP_DSBL,RCVD_IN_RFCI,
	RCVD_IN_UNCONFIRMED_DSBL,SIGNATURE_SHORT_DENSE,
	SPAM_PHRASE_02_03,USER_AGENT_PINE,WEIRD_PORT
X-Spam-Flag: YES
X-Spam-Level: *****

Hello,

I see some redundant links on the email headers, listed below:

List-Help:
List-Post:
List-Subscribe: ,
List-Id: Basic Essence Mailing List
List-Unsubscribe: ,
List-Archive:

Is there any way to remove those above while only maintaining the List-Id?
Those lines are extras since we can follow the link below to configure our options I'll appreciate if anyone can help me out Thank you very much :) -- Best Regards Terence Tham E-Alias: Odin whitestar@sourcevisions.net """ There also appear to be an awful lot of "false negatives" of the form: """ This is a message from the IFL E-Mail Virus Protection Service -------------------------------------------------------------- The original e-mail attachment "Card.DOC.pif" appears to be infected by a virus and has been replaced by this=20 warning message. """ That may be virus fallout, but I don't believe it belongs in the spam corpus, right? ---------------------- multipart/mixed attachment Return-Path: =0A= Envelope-To: do-sig-request@python.org=0A= Received: from [64.14.139.219] (helo=3Doutboundc.link2buy.com)=0A= by mail.python.org with esmtp (Exim 4.05)=0A= id 184Dc6-0003e0-00=0A= for do-sig-request@python.org; Wed, 23 Oct 2002 00:59:26 -0400=0A= Received: from [10.3.220.208]=0A= by outboundc.link2buy.com (10.3.220.242) with QMQP; 22 Oct 2002 = 21:56:21 +0000=0A= Message-ID: <704023967.1035349071889.mu@link2buy.com>=0A= Date: Tue Oct 22 21:46:35 PDT 2002=0A= From: ConsumerDirect =0A= To: do-sig-request@python.org=0A= Subject: Is there a Miracle Vitamin?=0A= Mime-Version: 1.0=0A= Content-Type: text/html=0A= Content-Transfer-Encoding: 7bit=0A= X-warning: 64.14.139.219 in blacklist at dnsbl.njabl.org=0A= (spam source -- 1031416958)=0A= X-Spam-Status: No, hits=3D-4.6 required=3D5.0 = tests=3DCTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,INVALID_= DATE,MAILTO_LINK,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS=0A= X-Spam-Level:=0A= =0A= =0A= 3D"link2buy.com"
=0A= =0A=




=0A= =0A=
3D"Why
http://link2buy.com/c/u.jsp?E=3Ddo-sig-request@python.or= g&P=3DMO2501_20021022_3638&U=3D259578032
Your email address on record = is do-sig-request@python.org.

Unsubscribe = me.
=0A= =0A= ---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.218] (helo=outbounda.link2buy.com) by mail.python.org with esmtp (Exim 4.05) id 184UDM-0002Cn-00 for do-sig-request@python.org; Wed, 23 Oct 2002 18:43:00 -0400 Received: from [10.3.220.205] by outbounda.link2buy.com (10.3.20.240) with QMQP; 23 Oct 2002 15:42:29 +0000 Message-ID: <704023967.1035412948706.mu@link2buy.com> Date: Wed Oct 23 15:31:03 PDT 2002 From: ConsumerDirect To: do-sig-request@python.org Subject: Back Child Support? We Can Help! Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.218 in blacklist at bl.spamcop.net (Blocked - see http://spamcop.net/bl.shtml?64.14.139.218) X-Spam-Status: No, hits=-4.7 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,INVALID_DATE,PLING_QUERY,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS X-Spam-Level: link2buy.com
Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO4509_20021021_3632&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.214] (helo=link2buy.com) by mail.python.org with smtp (Exim 4.05) id 184Xs0-0007q6-00 for do-sig-request@python.org; Wed, 23 Oct 2002 22:37:12 -0400 Received: (qmail 3223 invoked from network); 24 Oct 2002 02:20:51 -0000 Received: from unknown (HELO app02) (64.14.139.216) by 10.3.220.32 with SMTP; 24 Oct 2002 02:20:51 -0000 Message-ID: <704023967.1035426051418.mu@link2buy.com> Date: Wed, 23 Oct 2002 19:14:17 -0700 (PDT) From: ConsumerDirect To: do-sig-request@python.org Subject: Get a Car Loan Now - Any Credit-No Commitment or Fees! Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.214 in blacklist at dnsbl.njabl.org (spam source -- 1029255112) X-Spam-Status: No, hits=-0.4 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_OSIRUSOFT_COM,RCVD_IN_SBL,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS,X_OSIRU_SPAMWARE_SITE X-Spam-Level: link2buy.com


Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO3922_20021023_3653&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.219] (helo=outboundc.link2buy.com) by mail.python.org with esmtp (Exim 4.05) id 184Q4b-0000MJ-00 for do-sig-request@python.org; Wed, 23 Oct 2002 14:17:41 -0400 Received: from [10.3.220.208] by outboundc.link2buy.com (10.3.220.242) with QMQP; 23 Oct 2002 11:14:17 +0000 Message-ID: <704023967.1035396949410.mu@link2buy.com> Date: Wed Oct 23 11:08:22 PDT 2002 From: ConsumerDirect To: do-sig-request@python.org Subject: Enjoy the Web Again Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.219 in blacklist at dnsbl.njabl.org (spam source -- 1031416958) X-Spam-Status: No, hits=-4.5 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTML_50_70,HTTP_WITH_EMAIL_IN_URL,INVALID_DATE,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS X-Spam-Level: link2buy.com
Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO2735_20021017_3587&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: jobs@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183BtE-00072Z-00 for jobs@python.org; Sun, 20 Oct 2002 04:56:52 -0400 Received: (qmail 542 invoked from network); 20 Oct 2002 07:26:47 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 07:26:47 -0000 Date: Sun, 20 Oct 2002 16:27:31 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020155600.237F.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: tutor@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183DZB-0007dJ-00 for tutor@python.org; Sun, 20 Oct 2002 06:44:17 -0400 Received: (qmail 542 invoked from network); 20 Oct 2002 07:26:47 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 07:26:47 -0000 Date: Sun, 20 Oct 2002 16:27:31 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020155600.237F.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-docs@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IH-00 for python-docs@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-list@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001II-00 for python-list@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-help@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IJ-00 for python-help@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-announce@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IK-00 for python-announce@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-list-request@python.org Received: from dav11.pav3.hotmail.com ([64.4.38.115] helo=hotmail.com) by mail.python.org with esmtp (Exim 4.05) id 183aCi-00024k-00 for python-list-request@python.org; Mon, 21 Oct 2002 06:54:36 -0400 Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Mon, 21 Oct 2002 03:54:05 -0700 X-Originating-IP: [202.105.138.19] From: "amou *^_^*" To: Subject: re Date: Mon, 21 Oct 2002 18:57:43 +0800 MIME-Version: 1.0 X-Mailer: MSN Explorer 7.00.0021.1700 Content-Type: multipart/alternative; boundary="----=_NextPart_001_0000_01C27933.BC9FFDD0" Message-ID: X-OriginalArrivalTime: 21 Oct 2002 10:54:05.0647 (UTC) FILETIME=[2CA7BDF0:01C278F0] X-Spam-Status: No, hits=-7.3 required=5.0 tests=BASE64_ENC_TEXT,MIME_ALTERNATIVE,SPAM_PHRASE_00_01,USER_IN_WHITELIST_TO X-Spam-Level: ------=_NextPart_001_0000_01C27933.BC9FFDD0 Content-Type: text/plain; charset="gb2312" Content-Transfer-Encoding: base64 tNPN+NW+tcO1vbj8tuDQxc+ioaNNU04gRXhwbG9yZXIgw+K30c/C1Ng6aHR0cDovL2V4cGxvcmVy Lm1zbi5jb20vbGNjbg== ------=_NextPart_001_0000_01C27933.BC9FFDD0 Content-Type: text/html; charset="gb2312" Content-Transfer-Encoding: quoted-printable


=

=B4=D3=CD=F8=D5=BE=B5=C3=B5=BD=B8=FC=B6= =E0=D0=C5=CF=A2=A1=A3MSN Explorer =C3=E2=B7=D1=CF=C2=D4=D8=A3=BAhttp://explorer.msn.com/lccn

------=_NextPart_001_0000_01C27933.BC9FFDD0-- ---------------------- multipart/mixed attachment Return-Path: =0A= Envelope-To: mailman-users-request@python.org=0A= Received: from abn14-168.ist-avrupa-ports.kablonet.net.tr = ([195.174.14.168] helo=3Ddell)=0A= by mail.python.org with smtp (Exim 4.05)=0A= id 183ueG-0006vd-00=0A= for mailman-users-request@python.org; Tue, 22 Oct 2002 04:44:27 -0400=0A= x-esmtp: 0 0 1=0A= Message-ID: <633420021022264341949@dell>=0A= X-EM-Version: 5, 0, 0, 19=0A= X-EM-Registration: #01B0530810E603002D00=0A= X-Priority: 1=0A= Reply-To: istanbul@sushi.co.jp=0A= X-MSMail-Priority: High=0A= From: "www.indirimim.com" =0A= To:=0A= Subject: = =3D?windows-1254?Q?ekim_ay=3DFD_=3DE7ekili=3DFElerine_kat=3DFDl=3DFDn?=3D=0A= Date: Tue, 22 Oct 2002 09:43:41 +0300=0A= MIME-Version: 1.0=0A= Content-Type: multipart/alternative; =0A= boundary=3D"----=3D_NextPart_84815C5ABAF209EF376268C8"=0A= X-SMTPExp-Version: 1, 0, 2, 10=0A= X-SMTPExp-Registration: =FF=FF=FF=FF=0A= X-Spam-Status: No, hits=3D-0.1 required=3D5.0 = tests=3DHEADER_8BITS,MIME_ALTERNATIVE,MISSING_MIMEOLE,PRIORITY_NO_NAME,RC= VD_IN_RFCI,SPAM_PHRASE_02_03,USER_IN_WHITELIST_TO,X_ESMTP,X_MSMAIL_PRIORI= TY_HIGH,X_PRIORITY_HIGH,X_SMTPEXP_REGISTRATION,X_SMTPEXP_VERSION=0A= X-Spam-Level:=0A= =0A= This message was sent using a demo of EasyMail SMTP Express. For more = information about EasyMail SMTP Express visit = http://www.quiksoft.com/easymail.=0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8=0A= Content-type: text/plain; charset=3D"windows-1254"=0A= =0A= =DDnk=FDlap Kitapevin'den=0A= =FCcretsiz kitaplar Jimmi's 'de=0A= Misafirimiz olun Tatilya'da =FCcretsiz=0A= - S=FDn=FDrs=FDz E=F0lence - =DCcretsiz Fresh Look=0A= Renkli Lensler UltraForm G=FCzellik&Zay=FDflama=0A= Merkezi Ekim 2002 hediyelerini kazanmak i=E7in formu doldurun ve = =E7ekili=FEimize kat=FDl=FDn...... 
Aram=FDza Kat=FDlan Yeni Firmalar = Lens Market=0A= =DEa=FE=FDrt=FDc=FD De=F0i=FEiminizi www.indirimim.com avantaj=FDyla = ya=FEay=FDn. 64K/Kaspersky Antivirus (AVP) =0A= Bilgisayar=FDn=FDz=FD www.indirimim.com fiyat fark=FDyla koruyun=0A= Metro Turizm=0A= =DDndirimimli seyahatler sizi bekliyor. =DDnk=FDlap Kitapevi=0A= www.indirimim.com ile kitap almak daha avantajli Solid Sa=F0l=FDk = =DCr=FCnleri =0A= www.indirimim.com m=FC=FEterilerine =F6zel indirimler =0A= =0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8=0A= Content-Type: text/html; charset=3D"windows-1254"=0A= Content-Transfer-Encoding: quoted-printable=0A= =0A= =0A= =0A= Untitled Document=0A= =0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=3DDDnk=3DFDlap Kitapevin'den
=0A= =3DFCcretsiz kitaplar
=0A=
=3D20=0A=
Jimmi's 'de
=0A= Misafirimiz olun
=0A=
=3D20=0A=
Tatilya'da =3DFCcretsiz
=0A= - S=3DFDn=3DFDrs=3DFDz E=3DF0lence -
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=3DDCcretsiz Fresh Look
=0A= Renkli Lensler
=0A=
=3D20=0A=
UltraForm = G=3DFCzellik&Zay=3DFDflama
=0A= Merkezi
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
Ekim 2002 hediyelerini kazanmak i=3DE7in = formu d=3D=0A= oldurun=3D20=0A= ve =3DE7ekili=3DFEimize = kat=3DFDl=3DFDn=3D2E=3D2E=3D2E=3D2E=3D2E=3D2E
=0A=
=0A= =0A= =3D20=0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A=
=3D20=0A=
Aram=3DFDza Kat=3DFDlan Yeni = Firmalar
=0A=
Lens Market
=0A= =3DDEa=3DFE=3DFDrt=3DFDc=3DFD De=3DF0i=3DFEiminizi = www=3D2Eindirimim=3D2Ecom avantaj=3DFD=3D=0A= yla ya=3DFEay=3DFDn=3D2E
64K/Kaspersky Antivirus (AVP) =
=0A= Bilgisayar=3DFDn=3DFDz=3DFD www=3D2Eindirimim=3D2Ecom fiyat = fark=3DFDyla koruyun=3D=0A=
=0A=
Metro Turizm
=0A= =3DDDndirimimli seyahatler sizi bekliyor=3D2E
=0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A=
=3DDDnk=3DFDlap Kitapevi
=0A= www=3D2Eindirimim=3D2Ecom ile kitap almak daha avantajli
Solid Sa=3DF0l=3DFDk = =3DDCr=3DFCnleri
=0A= www=3D2Eindirimim=3D2Ecom m=3DFC=3DFEterilerine =3DF6zel = indirimler
=0A=

 

=0A= =0A= =0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8--=0A= =0A= ---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.137] (helo=mybigbargains.com) by mail.python.org with smtp (Exim 4.05) id 183xNY-0008Gi-00 for do-sig-request@python.org; Tue, 22 Oct 2002 07:39:20 -0400 Received: (qmail 55296 invoked from network); 22 Oct 2002 11:31:39 -0000 Received: from unknown (HELO app11) (64.14.139.216) by 10.3.220.146 with SMTP; 22 Oct 2002 11:31:39 -0000 Message-ID: <704023967.1035286299797.mu@2mbb.com> Date: Tue, 22 Oct 2002 04:26:51 -0700 (PDT) From: ConsumerDirect To: do-sig-request@python.org Subject: Refinance - Get 4 free loan offers Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.137 in blacklist at dnsbl.njabl.org (spam source -- 1032063500) X-Spam-Status: No, hits=0.8 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_OSIRUSOFT_COM,RCVD_IN_SBL,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS,X_OSIRU_SPAM_SRC X-Spam-Level: 2mbb.com
Why are you receiving this email?
http://2mbb.com/c/u.jsp?E=do-sig-request@python.org&P=MO2003_20021018_3606&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment--

From tim.one@comcast.net Sun Oct 27 02:46:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 22:46:43 -0400 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <15803.9391.949518.510594@montanaro.dyndns.org> Message-ID:

[Skip Montanaro, to Greg Ward about the rising spam:ham ratio on python.org traffic]
> Do you get email from python-list?

Yes, I believe all mailing lists going thru python.org are part of this traffic. This includes some private non-tech lists, and administrative requests, which is in part why Greg is rightfully reluctant to open the corpus for free public consumption.

> Its source is primarily the c.l.py newsfeed, which has been down for
> about five days now.

Yup.

From tim.one@comcast.net Sun Oct 27 02:57:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 22:57:06 -0400 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: <20021026042907.14850F5A4@cashew.wolfskeep.com> Message-ID:

[Tim]
>> 2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff
>>     to 0.90.

[T. Alexander Popiel]
> I'm slightly surprised at the looseness of 2', but as you say,
> the boundaries aren't all that touchy.

For my own email, and on my large c.l.py test, I use cutoffs of 0.30 and 0.80 with chi-combining very happily, so the suggested defaults are conservative relative to that. But they're *just* defaults, and anyone taking a default too seriously should be shot . Certainly, they should be closer to the endpoints if just starting training.

> I'm all for the above.

Nobody has objected, so I'll make the change next (I already made the other changes threatened, BTW -- use_mixed_combining is gone, and ditto ignore_redundant_html). Anyone wedded to gary-combining, don't panic: your database is unaffected by changing this default. There's no need to retrain it.
If you want to continue using gary-combining for scoring (again, the remaining combining schemes have nothing to do with training, they're purely a scoring-time choice), you'll need to add

    [Classifier]
    use_gary_combining: True

to your .ini file.

From popiel@wolfskeep.com Sun Oct 27 03:10:48 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 26 Oct 2002 20:10:48 -0700 Subject: [Spambayes] Mining the headers In-Reply-To: Message from Skip Montanaro <15803.8897.442152.315985@montanaro.dyndns.org> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: <20021027031048.3680AF54A@cashew.wolfskeep.com>

In message: <15803.8897.442152.315985@montanaro.dyndns.org> Skip Montanaro writes:
>
> I've had three other options knocking around locally which haven't seemed to
> help or hurt when applied to my collections: mine_date_headers,
> generate_time_buckets, and extract_dow. The first controls overall attention
> to the Date: header. The second generates tokens like time:12:3 (the third
> six-minute bucket of the twelfth hour). The third generates tokens like
> dow:0 (Monday). Should I check them in to see if they are useful for other
> people? (I seem to have a bit different fp & fn results than others.)

Yes, I'd love to test them.

- Alex

From tim.one@comcast.net Sun Oct 27 04:06:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 00:06:25 -0400 Subject: [Spambayes] hammie deployment without Outlook In-Reply-To: <3DBB0834.7050504@hooft.net> Message-ID:

[Rob Hooft]
> Since I'm Linux-only, I've been trying to put hammie.py to work for me.
> I made the attached change to hammie.py.

I can't help you with hammie, but I did check your change in. Especially since I just changed the default scoring scheme to chi-combining, it's important that tools take the middle ground seriously.
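The "middle ground" handling Rob's hammie patch adds boils down to a three-way cutoff test. As a sketch (using the default cutoffs Tim proposes in this thread; the function name here is illustrative, not hammie's API):

```python
def disposition(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way disposition: scores below ham_cutoff are ham, scores
    above spam_cutoff are spam, and anything in between is left as
    Unsure for human review."""
    if prob < ham_cutoff:
        return "No"
    elif prob > spam_cutoff:
        return "Yes"
    return "Unsure"

print(disposition(0.03), disposition(0.55), disposition(0.97))
# No Unsure Yes
```

Note that a score sitting exactly on a cutoff falls into the Unsure band, which is the conservative choice for a filter whose worst failure mode is a false positive.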
All, Rob's patch taught hammie how to use the ham_cutoff option, and introduced a new X-Hammie-Disposition: Unsure disposition (in addition to its Yes and No dispositions). > ... > I'll be waiting for a mozilla plugin to make interactive use of > spambayes.... Indeed, Mark Hammond and Sean True appear to be proving that people are more willing to contribute mounds of hard work to Windows clients . From skip@pobox.com Sun Oct 27 04:17:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 23:17:20 -0500 Subject: [Spambayes] training a single message w/ hammie.py? Message-ID: <15803.26832.97712.301007@montanaro.dyndns.org> I'm getting ready to set up my procmailrc file to run hammie.py. I'm hoping to do incremental training. Given an existing persistent store that's possible, right? If incremental training is possible, it seems odd to me that no capability is provided to accept a single message from stdin that is tagged as known spam or non-spam. Before I launch into modifying hammie.py, am I missing something obvious? Also, is it possible to train with the hammiecli/hammiesrv pair? Seems not. Skip From tim.one@comcast.net Sun Oct 27 04:37:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 00:37:58 -0400 Subject: [Spambayes] Mining the headers In-Reply-To: <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've had three other options knocking around locally which > haven't seemed to help or hurt when applied to my collections: > mine_date_headers, generate_time_buckets, and extract_dow. The first > controls overall attention to the Date: header. The second generates > tokens like time:12:3 (the third six-minute bucket of the twelfth hour). > The third generates tokens like dow:0 (Monday). Should I check them > in to see if they are useful for other people? Of course. Please do. Note that many state governments have agreed to give you an extra hour tonight to do this . 
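The time-bucket and day-of-week tokens described above can be sketched as a small generator. The six-minute bucketing and the zero-indexing are read off Skip's description (the exact indexing is an assumption), and date_tokens is an illustrative name, not the tokenizer's actual function:

```python
import datetime
import email.utils

def date_tokens(date_header):
    """Yield tokens mined from an RFC 2822 Date: header.

    A sketch of the generate_time_buckets and extract_dow options as
    described here: 'time:HH:B' buckets each hour into six-minute
    slices (zero-indexed, an assumption), and 'dow:N' counts days
    with Monday == 0, as in datetime.date.weekday().
    """
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        return  # unparseable header yields no tokens
    year, month, day, hour, minute = parsed[:5]
    yield 'time:%02d:%d' % (hour, minute // 6)
    yield 'dow:%d' % datetime.date(year, month, day).weekday()
```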
> (I seem to have a bit different fp & fn results than others.) Actually, unless things have changed dramatically, you get worse results than everyone else combined. That still lacks an explanation. Have you tried chi-combining yet? From skip@pobox.com Sun Oct 27 05:37:51 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:37:51 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: <20021027031048.3680AF54A@cashew.wolfskeep.com> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> <15803.8897.442152.315985@montanaro.dyndns.org> <20021027031048.3680AF54A@cashew.wolfskeep.com> Message-ID: <15803.31663.391441.711086@montanaro.dyndns.org> >> I've had three other options knocking around locally which haven't >> seemed to help or hurt.... Should I check them in.... Alex> Yes, I'd love to test them. Done. Note that I deleted the mine_date_headers option. It was just a gatekeeper for the other two. Seemed pointless to me. Here's my latest run. The first run was the default. My dates.ini file is [Tokenizer] generate_time_buckets: True extract_dow: True The results: run1s -> datess -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ... etc ... 
false positive percentages
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 5 to 5 tied
mean fp % went from 0.25 to 0.25 tied

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    0.500  0.500  tied
    1.000  0.500  won  -50.00%
    0.500  0.500  tied
    1.500  1.500  tied
    0.000  0.000  tied
    2.000  2.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fn went from 15 to 14 won -6.67%
mean fn % went from 0.75 to 0.7 won -6.67%

ham mean                     ham sdev
  1.38    1.38   +0.00%    10.18   10.17   -0.10%
  0.42    0.43   +2.38%     3.77    3.78   +0.27%
  0.98    0.98   +0.00%     8.39    8.36   -0.36%
  0.17    0.21  +23.53%     1.05    1.52  +44.76%
  0.93    0.93   +0.00%     7.73    7.73   +0.00%
  1.40    1.40   +0.00%     8.36    8.39   +0.36%
  1.18    1.14   -3.39%     7.39    7.24   -2.03%
  0.73    0.74   +1.37%     7.54    7.54   +0.00%
  0.97    0.98   +1.03%     6.62    6.72   +1.51%
  0.79    0.79   +0.00%     7.74    7.74   +0.00%

ham mean and sdev for all runs
  0.89    0.90   +1.12%     7.32    7.32   +0.00%

spam mean                    spam sdev
 99.17   99.16   -0.01%     4.63    4.71   +1.73%
 98.65   98.66   +0.01%     6.34    6.27   -1.10%
 96.71   96.71   +0.00%    13.73   13.74   +0.07%
 96.74   96.73   -0.01%    13.46   13.46   +0.00%
 98.44   98.46   +0.02%     9.25    9.23   -0.22%
 97.35   97.36   +0.01%    12.00   11.92   -0.67%
 98.33   98.34   +0.01%     9.55    9.53   -0.21%
 97.17   97.17   +0.00%    13.68   13.68   +0.00%
 98.94   98.93   -0.01%     6.89    6.90   +0.15%
 97.46   97.45   -0.01%    13.72   13.73   +0.07%

spam mean and sdev for all runs
 97.89   97.90   +0.01%    10.87   10.86   -0.09%

ham/spam mean difference: 97.00 97.00 +0.00
filename:      run1    dates
ham:spam:  2000:2000 2000:2000
fp total:         5        5
fp %:          0.25     0.25
fn total:        15       14
fn %:          0.75     0.70
unsure t:        93       93
unsure %:      2.33     2.33
real cost:   $83.60   $82.60
best cost:   $53.80   $53.60
h mean:        0.89     0.90
h sdev:        7.32     7.32
s mean:       97.89    97.90
s sdev:       10.87    10.86
mean diff:    97.00    97.00
k:             5.33     5.34

Note that my numbers seem to be getting a lot better. My ham/spam collection has slowly gotten cleaner and I've been adding more new stuff, not to mention that the default scheme (chi2?) seems a lot more sensitive/accurate. I noticed that as I lopped off old messages, first those from 1999 and before, then those from 2000, the accuracy improved. That suggests two things to me: first, the nature of "what is spam?" has changed a bit, and second, someone ought to test this notion. ;-) thanks-to-uncle-timmy-for-the-extra-hour-ly, y'rs, Skip

From skip@pobox.com Sun Oct 27 05:40:27 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:40:27 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: References: <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: <15803.31819.353336.954786@montanaro.dyndns.org>

>> (I seem to have a bit different fp & fn results than others.)

Tim> Actually, unless things have changed dramatically, you get worse
Tim> results than everyone else combined. That still lacks an
Tim> explanation. Have you tried chi-combining yet?

Yes, now that it's the default, though I didn't make a conscious attempt to compare it with other schemes. I got a bit behind the past several weeks, and still don't really understand all of what's been changed lately (I don't even pretend to be a statistician as a pick-up line in online chat rooms!). Skip

From skip@pobox.com Sun Oct 27 05:46:09 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:46:09 -0500 Subject: [Spambayes] setup.py?
Message-ID: <15803.32161.725680.230490@montanaro.dyndns.org> I don't suppose someone would care to skim the setup.py file to see if I've gotten most stuff that needs installing would they? Pretty please? Skip From rob@hooft.net Sun Oct 27 08:11:53 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 27 Oct 2002 09:11:53 +0100 Subject: [Spambayes] More proposed hammie changes: use Options Message-ID: <3DBB9FC9.2070306@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Attached are some more changes I'd like to propose to make to hammie: * Add -D option to reverse the -d option * Make the default use of pickle/database configurable * Add a showclue-limit to limit the clues added to the Hammie-Disposition header. I found the header becoming a bit large for many of my messages. This option can be used to make it show only the strongest clues either way. * Add a section [Hammie] to the configuration file to take all these hammie configurations such that hammie doesn't always need to be run with half a dozen of options to work (I always forget one if I'm trying it interactively). Furthermore, the patch changes a lot of the ' and " signs in the default string in Options.py such that the parser in emacs/python-mode.el is now happy with it. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.59 diff -u -r1.59 Options.py --- Options.py 27 Oct 2002 05:26:01 -0000 1.59 +++ Options.py 27 Oct 2002 08:02:09 -0000 @@ -48,7 +48,7 @@ # Generate tokens just counting the number of instances of each kind of # header line, in a case-sensitive way. # -# Depending on data collection, some headers aren't safe to count. +# Depending on data collection, some headers are not safe to count. 
# For example, if ham is collected from a mailing list but spam from your # regular inbox traffic, the presence of a header like List-Info will be a # very strong ham clue, but a bogus one. In that case, set @@ -150,7 +150,7 @@ # # The idea is that if something scores < hamc, it's called ham; if # something scores >= spamc, it's called spam; and everything else is -# called "I'm not sure" -- the middle ground. +# called 'I am not sure' -- the middle ground. # # Note that cvcost.py does a similar analysis. # @@ -169,7 +169,7 @@ # Display spam when # show_spam_lo <= spamprob <= show_spam_hi -# and likewise for ham. The defaults here don't show anything. +# and likewise for ham. The defaults here do not show anything. show_spam_lo: 1.0 show_spam_hi: 0.0 show_ham_lo: 1.0 @@ -179,8 +179,8 @@ show_false_negatives: False show_unsure: False -# Near the end of Driver.test(), you can get a listing of the 'best -# discriminators' in the words from the training sets. These are the +# Near the end of Driver.test(), you can get a listing of the best +# discriminators in the words from the training sets. These are the # words whose WordInfo.killcount values are highest, meaning they most # often were among the most extreme clues spamprob() found. The number # of best discriminators to show is given by show_best_discriminators; @@ -196,7 +196,7 @@ # pickle_basename, the extension is .pik, and increasing integers are # appended to pickle_basename. By default (if save_trained_pickles is # true), the filenames are class1.pik, class2.pik, ... If a file of that -# name already exists, it's overwritten. pickle_basename is ignored when +# name already exists, it is overwritten. pickle_basename is ignored when # save_trained_pickles is false. # if save_histogram_pickles is true, Driver.train() saves a binary @@ -218,9 +218,9 @@ # training each on N-1 sets, and the predicting against the set not trained # on. 
By default, it does this in a clever way, learning *and* unlearning # sets as it goes along, so that it never needs to train on N-1 sets in one -# gulp after the first time. Setting this option true forces "one gulp -# from-scratch" training every time. There used to be a set of combining -# schemes that needed this, but now it's just in case you're paranoid . +# gulp after the first time. Setting this option true forces ''one gulp +# from-scratch'' training every time. There used to be a set of combining +# schemes that needed this, but now it is just in case you are paranoid . build_each_classifier_from_scratch: False [Classifier] @@ -230,15 +230,15 @@ max_discriminators: 150 # These two control the prior assumption about word probabilities. -# "x" is essentially the probability given to a word that's never been +# "x" is essentially the probability given to a word that has never been # seen before. Nobody has reported an improvement via moving it away # from 1/2. # "s" adjusts how much weight to give the prior assumption relative to # the probabilities estimated by counting. At s=0, the counting estimates # are believed 100%, even to the extent of assigning certainty (0 or 1) -# to a word that's appeared in only ham or only spam. This is a disaster. +# to a word that has appeared in only ham or only spam. This is a disaster. # As s tends toward infintity, all probabilities tend toward x. All -# reports were that a value near 0.4 worked best, so this doesn't seem to +# reports were that a value near 0.4 worked best, so this does not seem to # be corpus-dependent. # NOTE: Gary Robinson previously used a different formula involving 'a' # and 'x'. The 'x' here is the same as before. The 's' here is the old @@ -249,11 +249,11 @@ # When scoring a message, ignore all words with # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many -# tests over Robinson's base scheme. 
0.1 appeared to work well across +# tests over Robinsons base scheme. 0.1 appeared to work well across # all corpora. robinson_minimum_prob_strength: 0.1 -# The combining scheme currently detailed on Gary Robinon's web page. +# The combining scheme currently detailed on Gary Robinons web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme # scores (near 0.0 or 1.0), but the tail ends of the ham and spam @@ -261,15 +261,15 @@ use_gary_combining: False # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) -# follows the chi-squared distribution with 2*n degrees of freedom. That's -# the "provably most-sensitive" test Gary's original scheme was monotonic +# follows the chi-squared distribution with 2*n degrees of freedom. That is +# the "provably most-sensitive" test Garys original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where -# lots of the mistakes live. This is the best method so far on Tim's data. -# One systematic benefit is that it's immune to "cancellation disease". One -# systematic drawback is that it's sensitive to *any* deviation from a -# uniform distribution, regardless of whether that's actually evidence of +# lots of the mistakes live. This is the best method so far on Tims data. +# One systematic benefit is that it is immune to "cancellation disease". One +# systematic drawback is that it is sensitive to *any* deviation from a +# uniform distribution, regardless of whether that is actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). 
# In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95, @@ -278,6 +278,26 @@ # with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets # (original c.l.p data, his own email, and newer general python.org traffic). use_chi_squared_combining: True + +[Hammie] +# The name of the header that hammie adds to an E-mail in filter mode +header: X-Hammie-Disposition + +# The default database path used by hammie +defaultdb: hammie.db + +# The range of clues that are added to the "hammie" header in the E-mail +# All clues that have their probability smaller than this number, or larger +# than one minus this number are added to the header such that you can see +# why spambayes thinks this is ham/spam or why it is unsure. The default is +# to show all clues, but you can reduce that by setting showclue to a lower +# value, such as 0.1 (which Rob is using) +showclue: 0.5 + +# hammie can use either a database (quick to score one message) or a pickle +# (quick to train on huge amounts of messages). Set this to True to use a +# database by default. +usedb: False """ int_cracker = ('getint', None) @@ -333,6 +353,12 @@ 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, }, + 'Hammie': {'header': string_cracker, + 'defaultdb': string_cracker, + 'showclue': float_cracker, + 'usedb': boolean_cracker, + }, + } def _warn(msg): Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.30 diff -u -r1.30 hammie.py --- hammie.py 27 Oct 2002 03:59:52 -0000 1.30 +++ hammie.py 27 Oct 2002 08:02:11 -0000 @@ -22,11 +22,14 @@ Only meaningful with the -u option. -p FILE use file as the persistent store. loads data from this file if it - exists, and saves data to this file at the end. Default: %(DEFAULTDB)s + exists, and saves data to this file at the end. + Default: %(DEFAULTDB)s -d use the DBM store instead of cPickle. 
The file is larger and creating it is slower, but checking against it is much faster, - especially for large word databases. + especially for large word databases. Default: %(USEDB)s + -D + the reverse of -d: use the cPickle instead of DBM -f run as a filter: read a single message from stdin, add an %(DISPHEADER)s header, and write it to stdout. If you want to @@ -52,15 +55,21 @@ program = sys.argv[0] # For usage(); referenced by docstring above # Name of the header to add in filter mode -DISPHEADER = "X-Hammie-Disposition" +DISPHEADER = options.header # Default database name -DEFAULTDB = "hammie.db" +DEFAULTDB = options.defaultdb # Probability at which a message is considered spam SPAM_THRESHOLD = options.spam_cutoff HAM_THRESHOLD = options.ham_cutoff +# Probability limit for a clue to be added to the DISPHEADER +SHOWCLUE = options.showclue + +# Use a database? If False, use a pickle +USEDB = options.usedb + # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) from tokenizer import tokenize @@ -208,7 +217,10 @@ def formatclues(self, clues, sep="; "): """Format the clues into something readable.""" - return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) + return sep.join(["%r: %.2f" % (word, prob) + for word, prob in clues + if (word[0] == '*' or + prob <= SHOWCLUE or prob >= 1.0 - SHOWCLUE)]) def score(self, msg, evidence=False): """Score (judge) a message. 
@@ -377,7 +389,7 @@ def main(): """Main program; parse options and go.""" try: - opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:r') + opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r') except getopt.error, msg: usage(2, msg) @@ -389,7 +401,8 @@ spam = [] unknown = [] reverse = 0 - do_filter = usedb = False + do_filter = False + usedb = USEDB for opt, arg in opts: if opt == '-h': usage(0) @@ -401,6 +414,8 @@ pck = arg elif opt == "-d": usedb = True + elif opt == "-D": + usedb = False elif opt == "-f": do_filter = True elif opt == '-u': ---------------------- multipart/mixed attachment-- From Alexander@Leidinger.net Sun Oct 27 12:40:41 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Sun, 27 Oct 2002 13:40:41 +0100 Subject: [Spambayes] Bugfix for hammie.py Message-ID: <20021027134041.766a951a.Alexander@Leidinger.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Hi, it seems nobody is using multiple -u options besides me... Bye, Alexander. -- Yes, I've heard of "decaf." What's your point? http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: hammie.py.diff Type: application/octet-stream Size: 819 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021027/86c846f9/hammie.py.exe ---------------------- multipart/mixed attachment--

From toby@tarind.com Sun Oct 27 14:41:43 2002 From: toby@tarind.com (Toby Dickenson) Date: Sun, 27 Oct 2002 14:41:43 +0000 Subject: [Spambayes] Maildir folders Message-ID: <200210271441.43419@trumpet.tarind.com>

I guess no one else is using maildir folders:

diff -c -1 -r1.2 mboxutils.py
*** mboxutils.py 4 Oct 2002 19:41:36 -0000 1.2
--- mboxutils.py 27 Oct 2002 14:40:49 -0000
***************
*** 12,13 ****
--- 12,15 ----
  files
+ /foo/bar/ -- (existing directory with a cur/ subdirectory)
+     Maildir mailbox
  /foo/Mail/bar/ -- (existing directory with /Mail/ in its path)
***************
*** 83,85 ****
      # else a DirOfTxtFileMailbox.
!     if name.find("/Mail/") >= 0:
          mbox = mailbox.MHMailbox(name, _factory)
--- 85,89 ----
      # else a DirOfTxtFileMailbox.
!     if os.path.exists(name+'/cur'):
!         mbox = mailbox.Maildir(name, _factory)
!     elif name.find("/Mail/") >= 0:
          mbox = mailbox.MHMailbox(name, _factory)

From tim.one@comcast.net Sun Oct 27 17:34:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 12:34:57 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.54632.545081.293386@slothrop.zope.com> Message-ID: [Tim] >> Fine by me if we nuke the Bayes __slots__ now. [Jeremy] > Cool. Bayes.__slots__ has been nuked. Old pickles should continue to load without problem, though. > ... > I only train on spam when the existing classifier doesn't mark it as > spam. I expect that's a poor idea over time, because it ends up being too reliant on hapaxes, which in turn makes the scheme brittle.
The intent has always been to train on a random sampling of ham and spam, and in real life I'm finding it valuable to train frequently (in large part because if I get a new spam today, it seems I'm likely to get the same thing four more times over the next day, and 100 *related* spams over the next week; the hapaxes change across the variants, and particularly the specific numbers in numeric IPs in embedded URLs, but key phrases like "private gold mine" stay the same). I'm also finding it valuable to train on ham that scores above 0.01 and spam that scores below 0.99 (under chi-combining), and again seemingly because scores relying mostly on hapaxes work well over the short term but can be counterproductive over time. > I expect that the amount of spam I keep around won't be that > big compared to all the other email that I keep :-). Just don't go blaming the spambayes code when it breaks down <0.9 wink>. > ... > It's even worse, though, that it uses asyncore. I found asyncore > added a lot of complexity to ZEO and would rather we hadn't used it. > Then add in a second asyncore app (the proxy) and you've got real > trouble. The complexity seems to be multiplicative rather than > additive. Upgrade to Outlook and you can enjoy Mark's many threads. From tim.one@comcast.net Sun Oct 27 17:44:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 12:44:36 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <20021025233739.9C94DF5A4@cashew.wolfskeep.com> Message-ID: [Derek Simkowiak] >> Having a starter.db would both (a) make life easier for getting >> started, and (b) give us a well-established baseline to test against. [T. Alexander Popiel] > I disagree with (b), because changes in the tokenizer (where I suspect > some of the advances will come from) will invalidate the database. I expect virtually all future advances will come from the tokenizer now.
When staring at msgs, the classifier is the brain but the tokenizer supplies the eyes. We appear to have gotten all the drugs (biases) out of the brain, and gave it some truly kick-ass chintelligence, but it can't judge what it can't see. Obvious example: by default, the tokenizer still ignores the content of To and Cc headers, except to *count* the number of recipients. By default, we still don't "see", e.g., Undisclosed Recipients in the To header! This is so because cracking To gave great results for bogus reasons when using mixed-source corpora (which, e.g., includes people using their own personal email, when mixing in a spam or ham archive from a time when they used a different ISP with a different To address). From tim.one@comcast.net Sun Oct 27 18:02:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 13:02:07 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Derek Simkowiak] > How can we test out new algorithms if the project doesn't have a > control group? We have no way of knowing if someone's successful (or > poor) results are an attribute of the new algorithm, or if it's an > attribute of their particular sample data. Do read TESTING.txt, checked into the project. The testing framework is set up in a statistically sound way, so that even people working with a single corpus get it sliced and diced in random ways across multiple testing runs. In addition, as Alex already said, Big Changes have been made only after multiple-corpora tests reported on this list. When 10 randomized runs across each of several distinct corpora all yield similar results, it's easy to have confidence. > Having a starter.db would both (a) make life easier for getting > started, I couldn't give you a starter db that would work well for your ham. The algorithms here aren't *trying* to "identify spam" -- you want something like SpamAssassin if that's what you want. 
The algorithms here are trying to *separate* ham from spam, and the ham words are just as important to that as the spam words. I've run several experiments where a classifier trained on one corpus was used to predict against a different corpus. The false negative rate remained good (little spam snuck thru), but the false positive rate zoomed (many ham were *called* spam). In IR terms, spam recall remained good but spam precision suffered badly. This isn't surprising, either: except for foreign-language spam, spam is still using ordinary words, and the same words show up in ham too. For example, in the very msg I'm replying to, 'give' 0.648963 'skip:w 10' 0.664292 'results' 0.693332 'database.' 0.718815 'successful' 0.821229 'stock' 0.867852 'data.' 0.887295 "someone's" 0.969799 'subject:+' 0.987106 That's a decent collection of high-spamprob words. Nevertheless, chi-combining was extremely confident the msg was ham, because of a much larger number of low-spamprob words, some of which are specific to the topic being discussed on this mailing list, and some of which are specific to computer-geek chatter: 'argument' 0.0155709 'header:In-reply-to:1' 0.0158379 'subject:: [' 0.0169746 'attribute' 0.0196507 'url:mailman-21' 0.0196507 'skip:_ 40' 0.0320263 "else's" 0.0348837 '(b)' 0.0412844 'header:Errors-to:1' 0.0458968 'started,' 0.0505618 'subject:Spambayes' 0.0505618 'algorithms' 0.0652174 'subject:ZODB' 0.0652174 'subject:] ' 0.0772017 'from:derek' 0.0918367 'spambayes' 0.0918367 'header:Return-path:1' 0.0946929 'header:Message-id:1' 0.0962885 'header:MIME-version:1' 0.122459 The low-spamprob words specific to *your* ham will depend on the content of your ham in equally quirky ways. 
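The per-word probabilities in clue lists like the one above come from smoothed counts, so a word's score is pulled toward a neutral prior until it has been seen often enough to trust. A sketch of the Robinson-style s/x adjustment described in the Options.py comments; word_spamprob is an illustrative name, and s=0.45 is an assumption (the comments only say a value near 0.4 worked best):

```python
def word_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed per-word spam probability, Robinson style.

    x is the probability assumed for a never-before-seen word, and s
    is the weight given to that prior relative to the counting
    estimate.  The s=0.45 default is an assumption, not the list's
    stated value.
    """
    if spamcount == 0 and hamcount == 0:
        return x  # never seen: fall back to the prior entirely
    spamratio = spamcount / float(nspam)   # fraction of spam containing word
    hamratio = hamcount / float(nham)      # fraction of ham containing word
    counting = spamratio / (spamratio + hamratio)  # naive estimate
    n = spamcount + hamcount
    return (s * x + n * counting) / (s + n)
```

A word seen 10 times, only in spam, scores about 0.978 rather than a certain 1.0 -- exactly the "assigning certainty to a word that's appeared in only ham or only spam is a disaster" point the Options.py comments make.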
From dereks@itsite.com Sun Oct 27 19:05:46 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 27 Oct 2002 11:05:46 -0800 (PST) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > I've run several experiments where a classifier trained on one corpus > was used to predict against a different corpus. The false negative > rate remained good (little spam snuck thru), but the false positive > rate zoomed (many ham were *called* spam). In IR terms, spam recall > remained good but spam precision suffered badly. It seems like you're saying that SpamBayes will not work for an enterprise-wide deployment, since different individuals' vocabularies, writing styles, and interests vary so wildly. In the false positives you mention above, was the spam cutoff being used? (If so, what was it set to?) Or, are those "false positives" hams being assigned a spam probability >.50? I am a big fan of enterprise-wide anti-spam measures. In my mind, it makes sense to flag messages and have "default" filter rules for every workstation. It makes it much easier on the I.T. department. Requiring Python on every Windows box would immediately make SpamBayes a no-go in many businesses and Universities, simply because of the (expensive) user support that would be required. So I am concerned when you present evidence that every individual needs to do their own SpamBayes training. It is obvious and well-understood that a .db trained from a specific individual's body of emails will work better for that individual than for some other individual. So what you say above does not surprise me. But what does surprise me is the argument that every individual should do their own SpamBayes training. > The low-spamprob words specific to *your* ham will depend on the > content of your ham in equally quirky ways.
No doubt; but over a large body of emails from many different individuals, I think the "quirkies" would fall by the wayside (because any one individual's quirkies would not be very frequent over the given collection), and that the Spam-specific "quirkies" (things like color=#FF0000) would hence become the strongest identifiers for any given message. (Officially proposing the term "quirkie" to mean a strong spamprob word -- either for or against -- that is specific to a particular corpus of email.) I'm guessing that if you did your tests again, but trained against all the corpuses before doing the test, your false positive rate would drop way down. (Is that not how SpamBayes is supposed to work?) You see, I do not have access to a large corpus of email from many different individuals. All I have is my inbox, which is quite quirky indeed. But I want to set up a hammie.py installation for a small workgroup, to see what kind of performance I get, and to monitor SpamBayes' performance changes over time (as it's trained to the small workgroup's incoming messages). If I had a starter .db file that was trained against many emails from many different individuals, then I'd be able to get going. Instead, I'm stuck wondering what process I should go through to try to collect a large corpus of email that will have its ham quirkies averaged away. But I know from reading test results here that many individuals have already taken the time and effort to do that. So I am asking for someone to share that effort -- kind of like Open Source, except on SpamBayes training instead of code writing. --Derek From seant@iname.com Sun Oct 27 19:19:23 2002 From: seant@iname.com (Sean True) Date: Sun, 27 Oct 2002 14:19:23 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > I am a big fan of enterprise-wide anti-spam measures. In my mind, > it makes sense to flag messages and have "default" filter rules for every > workstation. It makes it much easier on the I.T.
department. Requiring > Python on every Windows box would immediately make SpamBayes a no-go in > many businesses and Universities, simply because of the (expensive) user > support that would be required. So I am concerned when you present > evidence that every individual needs to do their own SpamBayes training. Tim's concerns seem to center on the very individual definition of what ham is. I think I remember an earlier concession about a more common definition of what spam is. Perhaps we need a starter database that has a predefined set of spam probabilities, to which one could add one's own ham (and additional spam). I have a lot more ham than spam available, and I've been saving spam for months, against a rainy day -- or a project like this. If somebody jump-started my spam collection, I'd be a happy camper. -- Sean From tim.one@comcast.net Sun Oct 27 20:11:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 15:11:50 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Sean True] > Tim's concerns seem to center on the very individual definition > of what ham is. I think I remember an earlier concession about a more > common definition of what spam is. They're intimately related: the definition of spam that seems to work best with this system is "automated bulk advertising the user doesn't want". The language and artifacts of advertising "are different", akin to the way it's usually easy to distinguish a TV commercial from a sitcom or movie. But virtually everyone signs up for automated bulk advertising they *do* want, and it's the "X wants Y's crap" part that can't be served by a single database. You don't want ads from esmokes.com but I do; you may want marketing spam from Microsoft about .NET but I don't; etc. It's consistent in tests that all forms of marketing collateral get rated as spam unless specifically trained for an individual's "ya, but I want *that* spam" preferences.
> Perhaps we need a starter database that has a predefined set of spam
> probabilities, to which one could add one's own ham (and additional
> spam). I have a lot more ham than spam available, and I've been saving
> spam for months, against a rainy day -- or a project like this. If
> somebody jump-started my spam collection, I'd be a happy camper.

There are extensive spam archives available for the taking:

    http://www.paulgraham.com/spamarchives.html

My large comp.lang.python test trained against Bruce Guenter's 2002 spam collection (about 14,000 spam at the time). When I used that classifier against Greg Ward's later and more-general python.org corpus, it gave high false positive rates against the handful of small private "hobby" mailing lists run thru python.org. The ham in those didn't look anything like legitimate c.l.py traffic (neither to the classifier nor to human eyes), so got high spam scores: all ham uses spam words, but it's hard for spam to hit a significant # of low-spamprob words, where "low" is defined relative to an individual. Hence the FN rate doesn't suffer much, but the FP rate zooms.

There's really no point in arguing about this, as it's something that can be tested. All tests in that direction to date have been discouraging. OTOH, tests on the general python.org corpus -- barring personal email -- have been very encouraging, but only when run with a classifier trained *on* the general python.org corpus. Tech mailing lists simply don't tolerate much advertising of any kind, and it's still the case that non-Python, non-Zope conference announcements (which are a form of bulk advertising) get high scores relative to other ham.
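Tim's point that "low" spamprob is relative to an individual can be made concrete. Below is a minimal editorial sketch (not spambayes code; the word counts are invented for illustration) of the basic per-word probability -- the word's frequency in spam relative to its frequency in ham:

```python
def word_spamprob(hamcount, nham, spamcount, nspam):
    # Frequency of the word in spam relative to its frequency in ham,
    # each normalized by its corpus size.
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    return spamratio / (hamratio + spamratio)

# For a c.l.py reader, a word like "python" floods the ham corpus, so it's
# a strong ham clue; for a user whose ham never mentions it, the very same
# word (seen only in spam) becomes a certain spam clue.  Counts are
# hypothetical.
clpy_reader = word_spamprob(3000, 10000, 50, 10000)   # well under 0.1
other_user = word_spamprob(0, 10000, 50, 10000)       # exactly 1.0
```

This is one way to see why a classifier trained on one person's ham misfires on another's: the spam side changes little between users, but the low-spamprob vocabulary is exactly the part that is individual.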
From tim.one@comcast.net Sun Oct 27 20:55:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 15:55:25 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

[Derek Simkowiak]
> It seems like you're saying that SpamBayes will not work for an
> enterprise-wide deployment, since different individuals' vocabularies,
> writing styles, and interests vary so wildly.

That's a matter for testing to decide, but it's not the kind of thing I can make time to test. I doubt that their vocabularies or writing styles matter (it's the email you get, not the email you write, that's judged); what matters is what forms of advertising the individuals within the enterprise want. "Enterprise" is too vague a word to guess anything about that in general. If "the enterprise" is general tech mailing-list traffic going thru python.org, then we have strong evidence (from testing) that a single classifier will work great. If "the enterprise" is an ISP serving 1,000 individuals' private email, I expect a single classifier would have such high false positive rates as to be unacceptable.

If you have one user who *wants* porn ads, a single classifier has to be trained to accept them (and can be -- it's easy). Then all users get them. If one user signs up for a minister-by-mail scam (a real-life example reported earlier on this list), then all users get minister-by-mail scams. Etc.

> In the false positives you mention above, was the spam cutoff being
> used? (If so, what was it set to?) Or, are those "false positives" hams
> being assigned a spam probability > .50?

Different tests were done at different times with different combining schemes and different corpora. They all had in common that "false positive" scores were above a realistic middle-ground cutoff.

> I am a big fan of enterprise-wide anti-spam measures. In my mind, it
> makes sense to flag messages and have "default" filter rules for every
> workstation. It makes it much easier on the I.T. department. Requiring
> Python on every Windows box would immediately make SpamBayes a no-go in
> many businesses and Universities, simply because of the (expensive) user
> support that would be required. So I am concerned when you present
> evidence that every individual needs to do their own SpamBayes training.

Spam in the sense of "advertising I don't want to see, as opposed to advertising I do want to see" is a personal judgment. That doesn't preclude server-based approaches, but it would require knowing about (saving info about) each individual, unless "the enterprise" has a single, fixed policy about what constitutes advertising nobody in the enterprise should be allowed to receive.

> It is obvious and well-understood that a .db trained from a specific
> individual's body of emails will work better for that individual than
> for some other individual. So what you say above does not surprise me.
> But what does surprise me is the argument that every individual should
> do their own SpamBayes training.

Test it and draw your own conclusions -- nothing is hidden here.

>> The low-spamprob words specific to *your* ham will depend on the
>> content of your ham in equally quirky ways.

> No doubt; but over a large body of emails from many different
> individuals, I think the "quirkies" would fall by the wayside (because
> any one individual's quirkies would not be very frequent over the given
> collection), and that the Spam-specific "quirkies" (things like
> color=#FF0000) would hence become the strongest identifiers for any
> given message.

In that case the ham quirkies become too weak to let that individual's favored forms of advertising thru. By the way, if you think #FF0000 is a killer-strong spam clue, you don't have young relatives sending you HTML birthday greetings <0.6 wink>.

> (Officially proposing the term "quirkie" to mean a strong spamprob word
> -- either for or against -- that is specific to a particular corpus of
> email.)

Consider it adopted -- I like it!
> I'm guessing that if you did your tests again, but trained against all
> the corpuses before doing the test, your false positive rate would drop
> way down. (Is that not how SpamBayes is supposed to work?)

Training on ham does improve the FP rate. But if I have to train it to allow the forms of bulk advertising you want to see, then a single classifier can't block those forms of advertising for anyone else. In the python.org context, the only community-accepted advertising is highly specific to Python and Zope, so a single classifier works fine. In the context of my personal email, the only advertising I want to see is from the companies I do business with, and I indeed needed to train carefully on several examples each of marketing email from various *specific* financial institutions, companies, and special-interest newsletters *I* like to see. I've even trained it to accept "Joke of the Day" spam, because I often like the jokes, despite that the rest of those spams are trying to sell me the usual range of crap from human growth hormone to miracle diets. You don't want to see that stuff, and that I've trained my classifier to accept marketing blurbs from Strong isn't going to help you get marketing blurbs from Oppenheimer.

> You see, I do not have access to a large corpus of email from many
> different individuals. All I have is my inbox, which is quite quirky
> indeed.

So start with that.

> But I want to set up a hammie.py installation for a small workgroup, to
> see what kind of performance I get, and to monitor SpamBayes'
> performance changes over time (as it's trained to the small workgroup's
> incoming messages).

Then start with that.

> If I had a starter .db file that was trained against many emails from
> many different individuals, then I'd be able to get going.

Just start and see what happens.
You're simply not going to get a DB from anyone trained on personal email, because there are too many clues about individual identities in the database, including things like passwords and account numbers, and email addresses of friends and relatives.

> Instead, I'm stuck wondering what process I should go through to try to
> collect a large corpus of email that will have its ham quirkies averaged
> away.

You don't need a large corpus; the system learns quickly; just start.

> But I know from reading test results here that many individuals have
> already taken the time and effort to do that. So I am asking for someone
> to share that effort -- kind of like Open Source, except on SpamBayes
> training instead of code writing.

I could give you a classifier trained on comp.lang.python traffic plus Bruce G's 2002 spam collection. Indeed, I used to make such a thing available on SourceForge. Few people bothered to try it, and those who did reported poor results on their personal email, so I got rid of it. I don't believe anyone tried it in the context of corporate email. I won't believe that you're going to try it until you report that you've already started and are getting poor results.

From tim.one@comcast.net Sun Oct 27 21:36:16 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:36:16 -0500
Subject: [Spambayes] Maildir folders
In-Reply-To: <200210271441.43419@trumpet.tarind.com>
Message-ID:

[Toby Dickenson]
> I guess no one else is using maildir folders:

Thanks for the patch, Toby. I checked it in but can't test it; you should, since I changed

> !         if os.path.exists(name+'/cur'):

to use os.path.join().

From tim.one@comcast.net Sun Oct 27 21:41:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:41:17 -0500
Subject: [Spambayes] Bugfix for hammie.py
In-Reply-To: <20021027134041.766a951a.Alexander@Leidinger.net>
Message-ID:

[Alexander Leidinger]
> it seems nobody is using multiple -u options besides me...
hammie users appear missing in action today -- I checked your patch in (and thank you!), but haven't tested it.

From tim.one@comcast.net Sun Oct 27 21:50:03 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:50:03 -0500
Subject: [Spambayes] More proposed hammie changes: use Options
In-Reply-To: <3DBB9FC9.2070306@hooft.net>
Message-ID:

[Rob Hooft]
> Attached are some more changes I'd like to propose to make to hammie:

I'd just check this in, if I were you. One suggestion: option names have to be unique across all sections now, so tiny option names like "header" would be better spelled, e.g., hammie_header_name.

From rob@hooft.net Sun Oct 27 22:12:30 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 27 Oct 2002 23:12:30 +0100
Subject: [Spambayes] More proposed hammie changes: use Options
References:
Message-ID: <3DBC64CE.7040100@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>> Attached are some more changes I'd like to propose to make to hammie:
>
> I'd just check this in, if I were you. One suggestion: option names have
> to be unique across all sections now, so tiny option names like "header"
> would be better spelled, e.g., hammie_header_name.

Done. I was just waiting for one review like this. New options and their defaults are now:

    [Hammie]
    hammie_header_name: X-Hammie-Disposition
    persistant_storage_file: hammie.db
    clue_mailheader_cutoff: 0.5
    persistant_use_database: False

Note that this was crafted such that nothing changes for people that are using hammie.py already, except if they changed the defaults in the source (which would result in collisions now).

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 27 22:31:36 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 17:31:36 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To: <15803.31663.391441.711086@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> Done. Note that I deleted the mine_date_headers option. It was just a
> gatekeeper for the other two. Seemed pointless to me. Here's my latest
> run. The first run was the default. My dates.ini file is
>
> [Tokenizer]
> generate_time_buckets: True
> extract_dow: True

Skip, I think there's a bug in the extract_dow code. On a quick python.org test, here are the dow tokens left behind in the database:

                   #ham  #spam  spamprob
    'dow:0'           2      7  0.890542594688
    'dow:1'           3      7  0.854937008074
    'dow:2'         725     71  0.220827483069
    'dow:3'        1038    261  0.420993872704
    'dow:4'         845    234  0.444677806501
    'dow:5'         126    196  0.81766035841
    'dow:6'           0    137  0.998363041106
    'dow:invalid'  2741    946  0.499472081328

Those only trained on half a week's traffic, so it's not surprising that half the days are virtually empty. What is surprising is that every ham trained on, and all but 2 of the spam, generated a dow:invalid token. Because the

    for fmt in self.date_formats:

loop has no early exit, its "else:" clause always executes. If I repair that, dow:invalid becomes a mild spam clue:

    'dow:invalid'     2     33  0.97338283678

I say it's "mild" just because it's infrequent in absolute terms. I'll check that change in anyway, and run a better test.

From dereks@itsite.com Sun Oct 27 22:46:53 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 14:46:53 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> can be -- it's easy). Then all users get them. If one user signs up for a
> minister-by-mail scam (a real-life example reported earlier on this list),
> then all users get minister-by-mail scams. Etc.

I'm a little slow, so forgive me if this is... repetitive. But your argument sounds like something of a showstopper to my intended use of SpamBayes, and I want to make sure this behaviour is clearly documented in the archives.

Consider a group of people who all use the same mail server. I'm thinking of a university, or customers of one of those $20/month email services, or a 1000-person company.
Now consider the sysadmin who wants to use SpamBayes for the purpose of flagging spam on that mail server, such that users can set up a generic filter rule that is easily supported by the organization's Help Desk.

The way I understand it, if any _one_ person in the group of people likes to get advertisements, porn mails, hotel conference info, and/or minister-by-mail, and SpamBayes is trained on all incoming mail, then everybody in the group will have their filtering rendered useless.

In other words, Bayesian filtering (as popularized by the article "A Plan for Spam") is only good for individuals, or small groups of individuals who all like the same kinds of ham.

I can't help but feel that I'm missing something. In this setting, it seems like training on hams is quite destructive to the goal of flagging Spam.

What if we pretend that all hams have exactly .5 probability? That is, any given ham cannot be identified as either being a spam or not being a spam -- all hams are just random noise. Then we train against a huge collection of spam, like Bruce G.'s stuff.

Each word in the database gets a "spam likelihood" rating, depending on what percentage of the time it shows up in the spams. A word that shows up in every single spam gets a "1.0", and every word that does not appear in the spam at all gets a "0.0". We throw out ueber-common words like a, and, the, it, just like Google does for its searches, as a matter of efficiency.

Then every email is rated word-by-word. The scores for all the words are then averaged together. So an email with many words commonly found in spam gets a high rating... (?)

Um, I've overstepped my understanding of the problem, so I'll just stop there. But to you algorithm geniuses, I plead for a way to filter spam that depends only on previously-seen Spam, and that does not depend on what ham looks like.
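Taken literally, the spam-only scoring scheme Derek describes could look like the following editorial sketch (the names are hypothetical, and this is not anything in the spambayes codebase):

```python
def spam_only_score(message_words, spam_fraction):
    # spam_fraction maps word -> fraction of training spams containing it;
    # a word never seen in any spam rates 0.0, per Derek's description.
    ratings = [spam_fraction.get(w, 0.0) for w in message_words]
    # Average the per-word ratings; an empty message scores 0.0.
    return sum(ratings) / len(ratings) if ratings else 0.0

table = {'cash': 0.8, 'prize': 0.6, 'meeting': 0.1}
score = spam_only_score(['cash', 'prize'], table)    # (0.8 + 0.6) / 2 = 0.7
```

Note that with no ham counts in the denominator, words that are merely common everywhere ("a", "the") would rate near 1.0 too, which is the weakness Tim's reply dwells on.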
Thanks,
Derek

From tim.one@comcast.net Sun Oct 27 22:57:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 17:57:05 -0500
Subject: [Spambayes] More proposed hammie changes: use Options
In-Reply-To: <3DBC64CE.7040100@hooft.net>
Message-ID:

[Tim]
>> I'd just check this in, if I were you. One suggestion: option names ...

[Rob Hooft]
> Done. I was just waiting for one review like this.

Happy to oblige.

> New options and their defaults are now:
>
> [Hammie]
> hammie_header_name: X-Hammie-Disposition
> persistant_storage_file: hammie.db
> clue_mailheader_cutoff: 0.5
> persistant_use_database: False

Good! NOTE that I did s/persistant/persistent/g later.

> Note that this was crafted such that nothing changes for people that are
> using hammie.py already, except if they changed the defaults in the
> source (which would result in collisions now).

Indeed you did a careful job -- it's appreciated.

From dereks@itsite.com Sun Oct 27 23:27:04 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 15:27:04 -0800 (PST)
Subject: [Spambayes] Maildir folders
In-Reply-To:
Message-ID:

>> !         if os.path.exists(name+'/cur'):
>
> to use os.path.join().

os.path.join() will ignore all preceding entries as soon as one of the entries starts with a '/'. (Furthermore, '/' is not cross-platform; try to stick to os.sep.) So make sure it's

    os.path.join(name, 'cur')

From tim.one@comcast.net Sun Oct 27 23:43:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 18:43:18 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

[Tim]
>> If one user signs up for a minister-by-mail scam (a real-life example
>> reported earlier on this list), then all users get minister-by-mail
>> scams. Etc.

[Derek Simkowiak]
> I'm a little slow, so forgive me if this is... repetitive. But your
> argument sounds like something of a showstopper to my intended use of
> SpamBayes, and I want to make sure this behaviour is clearly documented
> in the archives.

Arguments don't count for much here: you can set up a test and measure results. That's the only way to know. I've told you my best guess, but guesses here are often wrong.

> Consider a group of people who all use the same mail server. I'm
> thinking of a university, or customers of one of those $20/month email
> services, or a 1000-person company.
>
> Now consider the sysadmin who wants to use SpamBayes for the purpose of
> flagging spam on that mail server, such that users can set up a generic
> filter rule that is easily supported by the organization's Help Desk.

It's not an application I've got in mind, and not one that I've tested or intend to test. Other people here are interested in this, but they don't appear to be around today.

> The way I understand it, if any _one_ person in the group of people
> likes to get advertisements, porn mails, hotel conference info, and/or
> minister-by-mail, and SpamBayes is trained on all incoming mail, then
> everybody in the group will have their filtering rendered useless.

Minus the hyperbole, yes, unless you've done whatever it takes to inject some recipient-specific smarts. If it passes on porn spam to me, how could it possibly block it for you otherwise? For a start, it would have to know that you and I are different. And that's got nothing to do with Bayesian filters, or any other technicality: if different people call different things spam (and they do -- that's a fact), and any scheme that doesn't know the difference between people necessarily treats all people the same (that sure *seems* to be a fact), then if it lets my porn spam through, you get it too, or if it blocks my porn spam for you, it blocks it for me too. Either way one of us is left unhappy.
> In other words, Bayesian filtering (as popularized by the article "A
> Plan for Spam") is only good for individuals, or small groups of
> individuals who all like the same kinds of ham.

I think that's too extreme a conclusion. For example, python.org serves up tech lists for tens of thousands of users, and we have strong evidence that a single classifier will work fine there. Tech lists have a *shared* notion of what's spam, though.

> I can't help but feel that I'm missing something. In this setting, it
> seems like training on hams is quite destructive to the goal of flagging
> Spam.

The algorithm doesn't try to flag spam, it tries to *separate* ham from spam, and the characteristics of both populations feed into that. We've come full circle, and I'll repeat that SpamAssassin may be more to your liking. It does try to flag spam largely independent of any notion of ham, although from what I've seen of SpamAssassin admins they spend a lot of time crafting "positive rules" to try to let through things *their* site considers to be ham. Whitelists seem very effective for that, and I expect some form of whitelist would help a large deployment of the spambayes code too. OTOH, different people also want different whitelists.

> What if we pretend that all hams have exactly .5 probability, that is,
> any given ham cannot be identified as either being a spam, or not being
> a spam. That is, all hams are just random noise.
>
> Then we train against a huge collection of spam, like Bruce G.'s stuff.
>
> Each word in the database gets a "spam likelihood" rating, depending on
> what percentage of the time it shows up in the spams. A word that shows
> up in every single spam gets a "1.0", and every word that does not
> appear in the spam at all gets a "0.0".

I don't know, and it doesn't seem to make sense in the statistical framework the spambayes project is built around. You could test it, though, by fiddling our codebase.
For example, replace update_probabilities like so:

    def update_probabilities(self):
        """Update the word probabilities in the spam database.

        This computes a new probability for every word in the database,
        so can be expensive. learn() and unlearn() update the
        probabilities each time by default. They have an optional
        argument that allows skipping this step when feeding in many
        messages, and in that case you should call
        update_probabilities() after feeding the last message and before
        calling spamprob().
        """
        nspam = float(self.nspam or 1)
        S = options.robinson_probability_s
        StimesX = S * options.robinson_probability_x
        for word, record in self.wordinfo.iteritems():
            spamcount = record.spamcount
            assert spamcount <= nspam
            prob = spamcount / nspam
            # Now do Robinson's Bayesian adjustment.
            # ...
            prob = (StimesX + spamcount * prob) / (S + spamcount)
            if record.spamprob != prob:
                record.spamprob = prob
                self.wordinfo[word] = record

BTW, there's no need to train on ham at all then (doing so would have no effect on computed spamprobs).

> We throw out ueber-common words like a, and, the, it, just like Google
> does for its searches, as a matter of efficiency.

It's not really a matter of efficiency, it's more that since "a" appears in virtually every spam *and* ham, the spamprob of "a" will be approximately 1.0 if you ignore hamcounts (it's approximately 0.5 now). Note too that any ham that just happens to mention "money" will also have a very high-spamprob word. Words you used in this email:

    'filtering'  0.844828
    'plead'      0.844828
    'is...'      0.844828
    'company.'   0.895746
    'spam.'      0.899585
    'scam'       0.908163
    'like.'      0.934783
    'flagging'   0.958716
    'rated'      0.983271
    'porn'       0.988998

will have even higher spamprobs than those, because there will be no hamcounts to counteract them. Indeed, all words will have higher spamprobs than they have now.

> Then every email is rated word-by-word. The scores for all the words are
> then averaged together. So an email with many words commonly found in
> spam gets a high rating... (?)

Here's a spam I picked at random from my personal collection. Which words in this can you hope to get a high spam rating?

"""
Hi i read your profile and you live in my area. Maybe we could chat on line or even meet for a coffee. If you would like to come and chat with me i will be on line most of the night at http://www.designerlove.com/?rid=love2 My screen name is "PenPal" Log in and i'll be in the chat section. Hope to see you soon.
"""

As a matter of fact, none of those words are *common* in spam, except for words like "and", "the" and "on". My classifier nails it anyway (score of 0.97), because while words like "chat" appear in a small percentage of my spam, they appear in even less of my ham (peeking inside a fat c.l.py classifier, 'chat' appeared in 20 of 18,000 ham, and 164 of 12,600 spam: it's rare by any measure; what matters here is that it's *relatively* rarer in my ham than in my spam, and likewise for 'coffee.', and so on).

> Um, I've overstepped my understanding of the problem, so I'll just stop
> there. But to you algorithm geniuses, I plead for a way to filter spam
> that depends only on previously-seen Spam, and that does not depend on
> what ham looks like.

Why do you think you get so much spam? One reason is that one-size-fits-all schemes don't work well. I'd like to plead for world peace too, while the algorithm geniuses are at it.

From tim.one@comcast.net Sun Oct 27 23:57:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 18:57:10 -0500
Subject: [Spambayes] Maildir folders
In-Reply-To:
Message-ID:

[Derek Simkowiak]
> os.path.join() will ignore all preceding entries as soon as one of the
> entries starts with a '/'. (Furthermore, '/' is not cross-platform; try
> to stick to os.sep.)
>
> So make sure it's
>
> os.path.join(name, 'cur')

Yup, that's what it is.
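Derek's two claims about os.path.join can be checked directly. A small sketch, using posixpath (the POSIX flavor of os.path from the standard library) so the behavior is deterministic regardless of the platform it runs on:

```python
import posixpath

# join() supplies the separator itself, so components needn't carry one:
good = posixpath.join('Maildir', 'cur')      # 'Maildir/cur'

# The gotcha Derek notes: an absolute component discards everything
# joined before it.
gotcha = posixpath.join('Maildir', '/cur')   # '/cur'
```

This is exactly why `os.path.join(name, 'cur')` is the safe spelling of `name + '/cur'`: the separator is inserted for you, in the form the host platform expects.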
From skip@pobox.com Mon Oct 28 00:08:35 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 27 Oct 2002 18:08:35 -0600
Subject: [Spambayes] Mining the headers
In-Reply-To:
References: <15803.31663.391441.711086@montanaro.dyndns.org>
Message-ID: <15804.32771.26821.701868@montanaro.dyndns.org>

    Tim> Skip, I think there's a bug in the extract_dow code.

Thanks for catching it. for: ... else: isn't a construct I use often, so it's not entirely surprising that I muffed it.

How did you generate the table of tokens in your note?

    Tim>                    #ham  #spam  spamprob
    Tim> 'dow:0'               2      7  0.890542594688
    Tim> 'dow:1'               3      7  0.854937008074
    Tim> 'dow:2'             725     71  0.220827483069
    Tim> 'dow:3'            1038    261  0.420993872704
    Tim> 'dow:4'             845    234  0.444677806501
    Tim> 'dow:5'             126    196  0.81766035841
    Tim> 'dow:6'               0    137  0.998363041106
    Tim> 'dow:invalid'      2741    946  0.499472081328

The only tokens I've ever seen are in the summaries.

Skip

From tim.one@comcast.net Mon Oct 28 01:07:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 20:07:01 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To:
Message-ID:

About:

    [Tokenizer]
    generate_time_buckets: True
    extract_dow: True

Across my c.l.py test (10-fold cv; mixed source; 20,000 c.l.py ham + 14,000 bruceg spam), it didn't change the FP, FN or unsure rates, but there's nothing that's ever going to get rid of my 2 remaining FP and 2 remaining FN. There's evidence that bruceg got spam more often on weekends than c.l.py got ham on weekends, mostly because c.l.py traffic drops on weekends.
Here, in decreasing order of spamprob, but from just 1 of the 10 classifiers built during the test:

    'dow:invalid'    57   426  0.913973614869
    'dow:5'        1611  1562  0.580722462655
    'dow:6'        1599  1413  0.557982096725
    'dow:1'        2982  1701  0.449007117316
    'dow:0'        2738  1480  0.435736703465
    'dow:3'        3067  1661  0.436204501487
    'dow:4'        2860  1535  0.433990360589
    'dow:2'        3086  1642  0.431861726944

NOTE: Since I'm running with the default robinson_minimum_prob_strength == 0.1, all words with spamprob between 0.4 and 0.6 are ignored. Therefore only the 'dow:invalid' token *could* have had an effect on this test.

Time buckets show higher spamprobs in the hours most of America is asleep. Again this appears to have more to do with the fact that there's less c.l.py traffic then than with an increase in spam then -- but for purposes of prediction, any regularity in spam *or* ham is exploitable:

    hh:mm   #h   #s  spamprob
     0.00  133   66  0.415
     0.10  108   74  0.495
     0.20  114   81  0.504
     0.30   93   85  0.566
     0.40  103   87  0.547
     0.50  102   93  0.566
     1.00   82   62  0.519
     1.10   85   89  0.599
     1.20   83   70  0.546
    -------------------------- above .60 starting roughly here
     1.30   79   89  0.616
     1.40  106   84  0.531
     1.50   74   88  0.629
     2.00   60   65  0.607
     2.10   67   99  0.678
     2.20   60   76  0.644
     2.30   79   89  0.616
     2.40   45   75  0.703
     2.50   81   81  0.588
     3.00   55   67  0.635
     3.10   58   99  0.709
     3.20   52   66  0.644
     3.30   66   81  0.636
     3.40   64   81  0.643
     3.50   62   81  0.651
     4.00   45   68  0.683
     4.10   47   53  0.616
     4.20   45   57  0.643
     4.30   45   85  0.729
     4.40   56   49  0.555
     4.50   46   57  0.638
     5.00   32   83  0.786
     5.10   47   77  0.700
     5.20   42   56  0.655
     5.30   50   49  0.583
     5.40   44   55  0.640
     5.50   48   63  0.652
     6.00   52   76  0.676
     6.10   46   48  0.598
     6.20   42   57  0.659
     6.30   53   59  0.613
     6.40   56   52  0.570
     6.50   41   65  0.693
     7.00   49   56  0.620
    -------------------------- and ending roughly here
     7.10   58   53  0.566
     7.20   69   50  0.509
     7.30   75   64  0.549
     7.40   83   65  0.528
     7.50   94   57  0.464
     8.00   97   48  0.414
     8.10  113   69  0.466
     8.20  109   76  0.499
     8.30  141   70  0.415
     8.40  112   50  0.390
     8.50  117   58  0.415
     9.00  120   55  0.396
     9.10  137   57  0.373
     9.20  154   57  0.346
     9.30  171   57  0.323
     9.40  141   55  0.358
     9.50  170   49  0.292
    10.00  159   81  0.421
    10.10  182   76  0.374
    10.20  200   73  0.343
    10.30  176   69  0.359
    10.40  132   81  0.467
    10.50  163   67  0.370
    11.00  184   92  0.417
    11.10  174   66  0.352
    11.20  181   58  0.314
    11.30  169   73  0.382
    11.40  170   69  0.367
    11.50  167   60  0.339
    12.00  191   95  0.416
    12.10  182   73  0.365
    12.20  128   63  0.413
    12.30  156   70  0.391
    12.40  153   82  0.434
    12.50  170  106  0.471
    13.00  157   78  0.415
    13.10  149   77  0.425
    13.20  160   82  0.423
    13.30  140   71  0.420
    13.40  172   66  0.354
    13.50  192   64  0.323
    14.00  169   99  0.456
    14.10  170   90  0.431
    14.20  203   69  0.327
    14.30  168   89  0.431
    14.40  192   78  0.367
    14.50  199   63  0.312
    15.00  200   68  0.327
    15.10  195   59  0.302
    15.20  183   71  0.357
    15.30  198   67  0.326
    15.40  193   78  0.366
    15.50  195   60  0.306
    16.00  181   81  0.390
    16.10  176   72  0.369
    16.20  194   98  0.419
    16.30  177   70  0.361
    16.40  175   81  0.398
    16.50  185   88  0.405
    17.00  187   79  0.377
    17.10  167   67  0.365
    17.20  165   80  0.409
    17.30  185   74  0.364
    17.40  179   82  0.396
    17.50  166   90  0.437
    18.00  136   65  0.406
    18.10  141   77  0.438
    18.20  165   65  0.360
    18.30  168   74  0.386
    18.40  148   96  0.481
    18.50  144   70  0.410
    19.00  140   83  0.459
    19.10  130   93  0.505
    19.20  139   67  0.408
    19.30  111   79  0.504
    19.40  129   64  0.415
    19.50  128   80  0.472
    20.00  121   71  0.456
    20.10  129   71  0.440
    20.20  124   63  0.421
    20.30  127   95  0.517
    20.40  140   86  0.467
    20.50  131   78  0.460
    21.00  142   84  0.458
    21.10  144   87  0.463
    21.20  143   79  0.441
    21.30  143   82  0.450
    21.40  142   93  0.483
    21.50  132   98  0.515
    22.00  135   70  0.426
    22.10  133   69  0.426
    22.20  136   98  0.507
    22.30  121   95  0.529
    22.40  128   82  0.478
    22.50  117   70  0.461
    23.00  121   60  0.415
    23.10  142   61  0.381
    23.20  146   74  0.420
    23.30  115   77  0.489
    23.40  107   71  0.487
    23.50  114   80  0.501

So, overall, in this data there are mild indicators based on DOW and on time-bucket tokens. The c.l.py test was already getting all the info it could use, though, and they're too mild to make much of a dent in a chi-score unless there are very few total clues.
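The effect of robinson_minimum_prob_strength that Tim notes -- tokens scoring between 0.4 and 0.6 are simply ignored -- can be sketched as a filter. An editorial sketch (the dow spamprobs are rounded from Tim's table above; the function name is hypothetical):

```python
def strong_clues(spamprobs, min_strength=0.1):
    # Keep only tokens whose spamprob lies at least min_strength away
    # from the neutral 0.5; the 0.4-0.6 dead zone is discarded.
    return {tok: p for tok, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

dow = {'dow:invalid': 0.914, 'dow:5': 0.581, 'dow:6': 0.558,
       'dow:1': 0.449, 'dow:0': 0.436, 'dow:3': 0.436,
       'dow:4': 0.434, 'dow:2': 0.432}
surviving = strong_clues(dow)
```

Run on the dow spamprobs, only 'dow:invalid' survives the filter, matching Tim's remark that it was the only token that *could* have affected the test.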
From dereks@itsite.com Mon Oct 28 01:12:00 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 17:12:00 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> Why do you think you get so much spam? One reason is that
> one-size-fits-all schemes don't work well.

Tim,

Thanks for your responses today, they've been very helpful. I hold out hope, though, that one day the heuristics will be available to magically know what I want :)

Seriously speaking, my gut says that all the information we need is in the spam collections like Bruce's. Somebody just needs to figure out how to mine it, methinks.

> I'd like to plead for world peace too, while the algorithm geniuses are
> at it.

Pfft, that one's easy. It's the implementation that kills ya! :)

From tim.one@comcast.net Mon Oct 28 01:20:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 20:20:32 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To: <15804.32771.26821.701868@montanaro.dyndns.org>
Message-ID:

    Tim> Skip, I think there's a bug in the extract_dow code.

[Skip]
> Thanks for catching it.

You're welcome.

> for: ... else: isn't a construct I use often, so it's not entirely
> surprising that I muffed it.

It was just one "break" away from perfection -- a loop needs an early exit, else "else:" is a bug (hmm -- I wonder whether PyChecker knows that rule!).

> How did you generate the table of tokens in your note?
>
> Tim>                #ham  #spam  spamprob
> Tim> 'dow:0'           2      7  0.890542594688
> Tim> 'dow:1'           3      7  0.854937008074
> Tim> 'dow:2'         725     71  0.220827483069
> Tim> 'dow:3'        1038    261  0.420993872704
> Tim> 'dow:4'         845    234  0.444677806501
> Tim> 'dow:5'         126    196  0.81766035841
> Tim> 'dow:6'           0    137  0.998363041106
> Tim> 'dow:invalid'  2741    946  0.499472081328
>
> The only tokens I've ever seen are in the summaries.

I do that mostly by hand.
Here's a little Python program I didn't bother to check in:

"""
import cPickle as pickle

#f = file('outlook2000/default_bayes_database.pck', 'rb')
#f = file('fat.pik', 'rb')
f = file('class1.pik', 'rb')
c = pickle.load(f)
f.close()

w = c.wordinfo

def root(prefix):
    for k, r in w.iteritems():
        if k.startswith(prefix):
            print `k`, r.hamcount, r.spamcount, r.spamprob
"""

Run that via, e.g.,

    python -i pik.py

It then loads the trained classifier pickle of your choice into 'c', its wordinfo dict into 'w', and leaves you in an interactive session where you can play around. The utility root() function prints

    token hamcount spamcount spamprob

for every token beginning with a given string. So, in this case, I did

    root('dow:')

and pasted a screen scrape into the email. Note that the option

    [TestDriver]
    save_trained_pickles: True

will leave behind classifier pickles for each classifier trained during a test run.

So there's not much to it! Spend a few minutes studying the classes in Classifier: their instance data members are very simple (esp. since we got rid of a ton of combining schemes), and the whole thing will make a lot more sense to you then. The classifier's data structures are very easy to rummage around in, and there are very few of them.

perfection-is-reached-when-there's-nothing-left-to-throw-away-ly y'rs - tim

From popiel@wolfskeep.com Mon Oct 28 03:15:17 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 27 Oct 2002 19:15:17 -0800
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: Message from Derek Simkowiak
References:
Message-ID: <20021028031518.E6F8CF4D4@cashew.wolfskeep.com>

In message: Derek Simkowiak writes:

>> can be -- it's easy). Then all users get them. If one user signs up for a
>> minister-by-mail scam (a real-life example reported earlier on this list),
>> then all users get minister-by-mail scams. Etc.
>
> I'm a little slow, so forgive me if this is... repetitive.
> But your argument sounds like it's something of a showstopper to my intended
> use of SpamBayes, and I want to make sure this behaviour is clearly
> documented in the archives.
>
> Consider a group of people who all use the same mail server.
> I'm thinking of a university, or customers of one of those $20/month email
> services, or a 1000-person company.
>
> Now consider the sysadmin who wants to use SpamBayes for the
> purpose of flagging spam on that mail server, such that users can set up a
> generic filter rule that is easily supported by the organization's Help
> Desk.

Okay. There are many distinct ways to set this up. Two ways of interest are:

1) Set up a common database for filtering all mail coming into the system. This will be subject to all the problems and limitations that Tim is talking about.

2) Set up separate databases for each user, filtering only their mail. This will take a lot more space, but will probably have _MUCH_ better results, assuming that you teach your users how to train the beast.

With the new parameterization of the Hammie headers, you could even run both of these, generating two separate headers to filter on. With this sort of combo approach, people could set their clients up to put mail in their normal inbox if _either_ of the headers said ham, put it in unsure if both of the headers said unsure, and put it in spam otherwise (at least one header said spam, and the other didn't say ham). This might solve the 'one guy wants farmgirls and horses spam' problem of #1. Dunno. I don't receive mail (with spam) on a shared server, so there's no way I could test this sort of thing.

> In other words, Bayesian filtering (as popularized by the article
> "A Plan for Spam") is only good for individuals, or small groups of
> individuals who all like the same kinds of ham.

This sort of filtering (I don't call it Bayesian anymore) is probably only good for small high-commonality groups if there's only a single database.
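The two-header combo policy described above reduces to a tiny routing rule. A sketch, assuming each classifier has already stamped its verdict ('ham', 'unsure', or 'spam') into its own header (function and label names are illustrative, not the spambayes API):

```python
def route(common_verdict, personal_verdict):
    """Route a message given the verdicts from a site-wide classifier
    and a per-user classifier, per the combo policy described above."""
    verdicts = (common_verdict, personal_verdict)
    if 'ham' in verdicts:
        return 'inbox'      # either header says ham -> normal inbox
    if verdicts == ('unsure', 'unsure'):
        return 'unsure'     # both are on the fence
    return 'spam'           # at least one says spam, and neither says ham
```

The first branch is what lets one user's wanted "spam" through without forcing it on everyone else.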
Using a separate database for each (and no common database) clearly will work (we've got several individuals now doing that). Combining separate databases and a common database is uncharted territory, which may have things of interest.

- Alex

From seant@iname.com Mon Oct 28 03:38:45 2002
From: seant@iname.com (Sean True)
Date: Sun, 27 Oct 2002 22:38:45 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: <20021028031518.E6F8CF4D4@cashew.wolfskeep.com>
Message-ID:

> > In other words, Bayesian filtering (as popularized by the article
> > "A Plan for Spam") is only good for individuals, or small groups of
> > individuals who all like the same kinds of ham.

In the "real world", I suspect there are many companies who are probably perfectly happy to block (or label) all "pony and farmgirl" messages, independent of whether some lonely farm guy thinks they are ham. And for corporate email, I think such practices are not unreasonable (granted, I've tried hard to not work places like that, or be in charge of MIS when I was). There are legal standards that may require an employer to make a best effort to keep the "pony and farmgirl" message away from those who might be offended by even having to _label_ it as spam.

If users can nominate mail that violates community standards, and some MIS person agrees, a single filter might well be kept that would be a substantial help to a large body of people. As usual, for those with special needs (customer service), or special privileges (the vice president for reading naughty mail), more personalized filters could be made available.

-- Sean

From tim.one@comcast.net Mon Oct 28 05:49:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 00:49:08 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment

[Derek Simkowiak]
> Thanks for your responses today, they've been very helpful. I
> hold out hope, though, that one day the heuristics will be available to
> magically know what I want :)
>
> Seriously speaking, my gut says that all the information we need
> is in the spam collections like Bruce's. Somebody just needs to figure
> out how to mine it, methinks.

I think you're missing a basic point, but not due to lack of repetition . Let's get real concrete. Go to this site:

    http://www.esmokes.com/

I buy cigarettes from that site, and I get glitzy HTML promotional email from them about once a week. I want that email. You don't (or so I guess), and *any* filter trained on any spam collection on Earth is-- if it's worth anything at all --going to say that's spam. I'll attach one of their emails for your perusal.

This isn't a question of classification technology so much as it's a question of personal preference, and so long as you're determined that everyone must use the same classifier, personal preference goes out the window. That's a bad use of technology, IMO -- I'm not interested in treating everyone like interchangeable cogs. Buy a server with enough disk space so everyone can have their own classifier, and do whatever else it takes to give people a system they'll truly love instead of merely endure.

The spambayes system had no trouble learning that *I* want this crap because it found many lexical clues nearly unique to email from this particular vendor:

    'esmokes.com' 0.0412844
    'subject:eSmokes.com' 0.0412844
    'url:esmokes' 0.0412844
    'carton' 0.0505618
    'esmokes.com!' 0.0505618
    'esmokes.com,' 0.0505618
    'from:email addr:esmokes.com' 0.0505618
    'message-id:@mail-server' 0.0505618
    'url:brandid' 0.0505618
    'url:side' 0.0505618
    'url:template1' 0.0505618
    'url:vadcamp' 0.0505618
    'carton!' 0.0652174

and there are about 25 other low-spamprob words (but with higher spamprobs than those) common in this vendor's (but no other's) email too. The detection of other kinds of spam wasn't injured at all. Even so, email from this place still scores between 0.03 and 0.17 for me (which are high ham scores under chi-combining, but well within my "I'm sure it's ham" range).

>> I'd like to plead for world peace too, while the algorithm geniuses
>> are at it .

> Pfft, that one's easy. It's the implementation that kills ya! :)

In my case, it will be the cigarettes .

---------------------- multipart/mixed attachment--

From anthony@interlink.com.au Mon Oct 28 05:58:50 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 16:58:50 +1100
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID: <200210280558.g9S5woq01988@localhost.localdomain>

>>> Tim Peters wrote
> I think you're missing a basic point, but not due to lack of repetition
> . Let's get real concrete. Go to this site:
>
> http://www.esmokes.com/
>
> I buy cigarettes from that site, and I get glitzy HTML promotional email
> from them about once a week. I want that email. You don't (or so I guess),
> and *any* filter trained on any spam collection on Earth is-- if it's worth
> anything at all --going to say that's spam. I'll attach one of their emails
> for your perusal.

Something to bear in mind, though, is that a site like this is unlikely to be a spammer. Reputable businesses don't spam often, and certainly they don't do it twice :) So when _other_ peddlers of Timmy's noxious habit send spam, he's unlikely to want their particular spam. For him, the many ham-clues from esmokes.com should outweigh the marketing-speak in their email.

Something I've been working on over the last week is the notion that spam data can be shared between users (to a certain extent) while ham data is user-specific.
This example doesn't (to me) seem to show that this isn't still a worthwhile goal to pursue... Anthony -- Anthony Baxter It's never too late to have a happy childhood. From tim.one@comcast.net Mon Oct 28 06:09:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 01:09:53 -0500 Subject: [Spambayes] An interesting example of bad correlation In-Reply-To: Message-ID: I just got two copies of this spam from python.org: """ Ol=E1 me chamo Marquinho. Acabei de lan=E7ar um site na WEB que fala = sobre o povo brasileiro e meu projeto... L=E1 voc=EA vai ver minhas fotos. Vo= c=EA pode divulgar o potencial de sua cidade. Al=E9m disso voc=EA pode concorre= r a uma web cam. dia 27 de dezembro. Visite! e vote no meu site! Preciso de apoio... http://www.nossobrasil.kit.net Se n=E3o quiser mais receber nossa informa=E7=E3o favor somente respo= nda. NossoBrasil.kit.net NossoBrasil.kit.net """ One of them showed up in my "I'm sure it's spam" folder, with a score= of 0.96. The other showed up in my "I'm confused" folder, with a score = of 0.75. What's the difference? The former was addressed to webmaster@python.org, and the latter to help@python.org, and the latt= er is a (privately archived) mailing list so Mailman put its fingers on it. = Despite that I *thought* I was ignoring all Mailman headers, I was . B= ut it turns out Mailman does other stuff that reflects in the headers, addi= ng this stuff that didn't exist in the copy I got via webmaster: 'header:Errors-to:1' 0.045086 'subject:Python' 0.0644291 'subject:] ' 0.0772537 'subject:[' 0.147731 'subject:Help' 0.270936 'subject:-' 0.286281 The original didn't have an Errors-to header. The last 5(!) are due = to the [Python-Help] inserted into the Subject line. I believe spam that isn't caught by python.org, and comes thru on a m= ailing list, is my biggest source of Unsure msgs. 
From tim.one@comcast.net Mon Oct 28 06:25:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 01:25:48 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: <200210280558.g9S5woq01988@localhost.localdomain>
Message-ID:

[Anthony Baxter, discovers the joys of cigarettes-by-web]
> Something to bear in mind, though, is that a site like this is unlikely
> to be a spammer. Reputable businesses don't spam often, and certainly
> they don't do it twice :)

If you stare at the email from them I attached, you're not going to find any lexical clue that they're a reputable business, and the spambayes code certainly didn't before I trained on half a dozen msgs from them. If we shared classifiers, this would be called spam.

> So when _other_ peddlers of Timmy's noxious habit send spam, he's
> unlikely to want their particular spam.

This is true, and I indeed get a lot of shady "avoid state sales tax!" cig spam from other sources. They're nailed as spam.

> For him, the many ham-clues from esmokes.com should outweigh the
> marketing-speak in their email.

There is hope there: marketing email from a firm with an actual marketing dept is dead serious about establishing brand identification, so puts in endless repetitions of their company name and slogans. That's what makes esmokes so easy to distinguish from other cig spam, and the same is true of marketing blurbs from Microsoft, Sun, Expedia, Amazon, Fidelity, etc etc. They're all considered spam (and strongly so) before training on them, though, as they're dripping with the language of advertising.

> Something I've been working on over the last week is the notion that
> spam data can be shared between users (to a certain extent) while
> ham data is user-specific. This example doesn't (to me) seem to
> show that this isn't still a worthwhile goal to pursue...
Not at all -- it's an example of why sharing one classifier *completely* is unlikely to work well, and is more an after-the-fact rationalization attempting to explain why all tests in that *direction* have delivered discouraging results. I expect spam stats are quite sharable, and the same tests that showed high FP rates when using a single classifier across multiple tech-list corpora did not show significant increases in FN rates.

From anthony@interlink.com.au Mon Oct 28 07:08:01 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 18:08:01 +1100
Subject: [Spambayes] options.skip_max_word_size.
Message-ID: <200210280708.g9S781602374@localhost.localdomain>

I noticed a bunch of really nice ham clues were getting skipped in some of my personal email's 'unsure' bucket. They were words like 'interconnection' and other longer techie-words. I added an option skip_max_word_size and tried boosting it to 20 (from the default of 12).

cmp.py shows this (skip_max_word_size 12 on left, 20 on right):

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.000  0.277  lost  +(was 0)
    0.559  0.838  lost  +49.91%
    0.836  1.114  lost  +33.25%
    0.279  0.836  lost  +199.64%

won   0 times
tied  0 times
lost  4 times

total unique fn went from 6 to 11 lost +83.33%
mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%

ham mean                          ham sdev
   0.67    0.58  -13.43%             4.54    3.89  -14.32%
   0.45    0.38  -15.56%             2.64    2.24  -15.15%
   0.68    0.67   -1.47%             4.44    4.57   +2.93%
   0.48    0.45   -6.25%             3.52    3.47   -1.42%

ham mean and sdev for all runs
   0.57    0.52   -8.77%             3.86    3.64   -5.70%

spam mean                         spam sdev
  98.30   98.76   +0.47%             8.61    7.95   -7.67%
  97.47   97.61   +0.14%            10.67   10.78   +1.03%
  98.51   98.43   -0.08%             9.13   10.93  +19.72%
  97.58   97.27   -0.32%            10.90   12.08  +10.83%

spam mean and sdev for all runs
  97.97   98.02   +0.05%             9.88   10.56   +6.88%

ham/spam mean difference: 97.40 97.50 +0.10

Unfortunately, cmp.py skips the important bit. My 'unsure' numbers went from 164 to 135! I'm not sure if this is just something that's an artifact of my own data, or more general - if others could try it as well, it would be good.

Anthony

From tim.one@comcast.net Mon Oct 28 07:32:11 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 02:32:11 -0500
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <200210280708.g9S781602374@localhost.localdomain>
Message-ID:

[Anthony Baxter]
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like
> 'interconnection' and other longer techie-words. I added an option
> skip_max_word_size and tried boosting it to 20 (from the default of 12).
>
> cmp.py shows this (skip_max_word_size 12 on left, 20 on right)
...
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
>     0.000  0.277  lost  +(was 0)
>     0.559  0.838  lost  +49.91%
>     0.836  1.114  lost  +33.25%
>     0.279  0.836  lost  +199.64%
>
> won   0 times
> tied  0 times
> lost  4 times
>
> total unique fn went from 6 to 11 lost +83.33%
> mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%
...
> Unfortunately, cmp.py skips the important bit. My 'unsure' numbers
> went from 164 to 135!

Under the default costs, this would be judged close to a wash: 5 new fn @ $1 was a loss of $5, while 29 fewer unsure @ $.20 was a gain of $5.80. table.py would show this more clearly, and the histogram analysis (which table.py summarizes) would tell us whether you could have gotten just as good an improvement by changing your ham_cutoff and spam_cutoff values (it's impossible to guess that from what you posted).

> I'm not sure if this is just something that's an artifact of my
> own data, or more general - if others could try it as well, it
> would be good.
It's something I haven't tried under chi-combining yet, so I will, but not right now. In previous tests, boosting to 13 didn't have significant effect on error rates but did boost the database size. This was before we had a usable notion of middle ground, though, so I've no idea what effect those older tests may have had on the unsure rate.

From rob@hooft.net Mon Oct 28 08:44:34 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Mon, 28 Oct 2002 09:44:34 +0100
Subject: [Spambayes] options.skip_max_word_size.
References: <200210280708.g9S781602374@localhost.localdomain>
Message-ID: <3DBCF8F2.1060907@hooft.net>

Anthony Baxter wrote:
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like 'interconnection'
> and other longer techie-words. I added an option skip_max_word_size and
> tried boosting it to 20 (from the default of 12).

Here are my results:

nows122[179]spambayes%% python ./table.py skip12.txt skip20.txt
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
[...]
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

filename:     skip12     skip20
ham:spam: 16000:5800 16000:5800
fp total:         12         13
fp %:           0.07       0.08
fn total:          7          7
fn %:           0.12       0.12
unsure t:        178        184
unsure %:       0.82       0.84
real cost:   $162.60    $173.80
best cost:   $106.20    $109.60
h mean:         0.51       0.52
h sdev:         4.87       4.92
s mean:        99.42      99.39
s sdev:         5.22       5.34
mean diff:     98.91      98.87
k:              9.80       9.64

Regards,

Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From anthony@interlink.com.au Mon Oct 28 08:12:38 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 19:12:38 +1100
Subject: [Spambayes] table.py patch to produce averages at end of line.
Message-ID: <200210280813.g9S8CcL02773@localhost.localdomain>

The following simple patch produces a final column in table.py of averages for all files and all measures.
This is useful if you're doing tests with very small amounts of data, and want to run the test multiple times with different seeds to check that your results are actually meaningful. For instance (ignore the actual results, they won't make sense outside of the context of the testing I'm doing):

filename:  002a_100 002c_100 002b_100 002d_100
ham:spam:  400:1000 400:1000 400:1000 400:1000
fp total:       127      195      104      245         167
fp %:         31.75    48.75    26.00    61.25       41.94
fn total:         0        0        0        0           0
fn %:          0.00     0.00     0.00     0.00        0.00
unsure t:       282      162      287       86         204
unsure %:     20.14    11.57    20.50     6.14       14.59
real cost: $1326.40 $1982.40 $1097.40 $2467.20    $1718.35
best cost:  $231.00  $244.20  $228.00  $249.80     $238.25
h mean:       81.07    78.72    77.23    79.29       79.08
h sdev:       20.09    30.80    24.59    34.25       27.43
s mean:       99.94    99.94    99.93    99.99       99.95
s sdev:        0.71     0.90     1.01     0.09        0.68
mean diff:    18.87    21.22    22.70    20.70       20.87
k:             0.91     0.67     0.89     0.60        0.77

Not sure if this is generally useful enough to anyone else for it to be checked in - any opinions?

Anthony

Index: table.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/table.py,v
retrieving revision 1.4
diff -u -r1.4 table.py
--- table.py	26 Oct 2002 15:30:23 -0000	1.4
+++ table.py	28 Oct 2002 08:06:07 -0000
@@ -122,6 +122,9 @@
 meand = "mean diff:"
 kval = "k: "
 
+tfptot = tfpper = tfntot = tfnper = tuntot = tunper = trcost = tbcost = \
+thmean = thsdev = tsmean = tssdev = tmeand = tkval = 0
+
 for filename in sys.argv[1:]:
     filename = windowsfy(filename)
     (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost,
@@ -147,20 +150,51 @@
     rat2 = rat2[0:(len(ratio) + 8)]
     ratio += " %7s" % ("%d:%d" % (htest, stest))
     fptot += "%8d" % fp
+    tfptot += fp
     fpper += "%8.2f" % fpp
+    tfpper += fpp
     fntot += "%8d" % fn
+    tfntot += fn
     fnper += "%8.2f" % fnp
+    tfnper += fnp
     untot += "%8d" % un
+    tuntot += un
     unper += "%8.2f" % unp
+    tunper += unp
     rcost += "%8s" % ("$%.2f" % cost)
+    trcost += cost
     bcost += "%8s" % ("$%.2f" % bestcost)
+    tbcost += bestcost
     hmean += "%8.2f" % hamdevall[0]
+    thmean += hamdevall[0]
     hsdev += "%8.2f" % hamdevall[1]
+    thsdev += hamdevall[1]
     smean += "%8.2f" % spamdevall[0]
+    tsmean += spamdevall[0]
     ssdev += "%8.2f" % spamdevall[1]
+    tssdev += spamdevall[1]
     meand += "%8.2f" % (spamdevall[0] - hamdevall[0])
+    tmeand += (spamdevall[0] - hamdevall[0])
     k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1])
     kval += "%8.2f" % k
+    tkval += k
+
+nfiles = len(sys.argv[1:])
+if nfiles:
+    fptot += "%12d" % (tfptot/nfiles)
+    fpper += "%12.2f" % (tfpper/nfiles)
+    fntot += "%12d" % (tfntot/nfiles)
+    fnper += "%12.2f" % (tfnper/nfiles)
+    untot += "%12d" % (tuntot/nfiles)
+    unper += "%12.2f" % (tunper/nfiles)
+    rcost += "%12s" % ("$%.2f" % (trcost/nfiles))
+    bcost += "%12s" % ("$%.2f" % (tbcost/nfiles))
+    hmean += "%12.2f" % (thmean/nfiles)
+    hsdev += "%12.2f" % (thsdev/nfiles)
+    smean += "%12.2f" % (tsmean/nfiles)
+    ssdev += "%12.2f" % (tssdev/nfiles)
+    meand += "%12.2f" % (tmeand/nfiles)
+    kval += "%12.2f" % (tkval/nfiles)
 
 print fname
 if len(fnam2.strip()) > 0:

From jeremy@alum.mit.edu Mon Oct 28 15:08:04 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 28 Oct 2002 10:08:04 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
References:
Message-ID: <15805.21204.928940.814710@slothrop.zope.com>

>>>>> "DS" == Derek Simkowiak writes:

DS> In other words, Bayesian filtering (as popularized by the article
DS> "A Plan for Spam") is only good for individuals, or small groups
DS> of individuals who all like the same kinds of ham.

A single classifier is only good for individuals or for groups / lists where there is a uniform notion of what is ham and what is spam. The general approach to filtering could certainly be used for a large institution, but seems to require some tailoring to an individual's ham. There are a bunch of interesting UI and systems issues to resolve for such usage.
Jeremy

From agmsmith@rogers.com Mon Oct 28 15:45:26 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Mon, 28 Oct 2002 10:45:26 EST (-0500)
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <200210280708.g9S781602374@localhost.localdomain>
Message-ID: <10727137924-BeMail@CR593174-A>

Anthony Baxter wrote:
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like 'interconnection'
> and other longer techie-words. I added an option skip_max_word_size and
> tried boosting it to 20 (from the default of 12).

I took the naive approach and allow words up to 50 bytes long. I picked that because I saw some uuencoded data with 60 bytes per line. Also, while looking up the spelling of supercalifragilisticexpialidocious, I found pneumonoultramicroscopicsilicovolcanoconiosis mentioned as the longest word in English, according to some web site* which referred back to the Oxford English Dictionary. So, 50 seems like a nice safe value.

- Alex

*: http://www.dictionary.com/doctor/faq/l/longestword.html

From skip@pobox.com Mon Oct 28 16:31:57 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 28 Oct 2002 10:31:57 -0600
Subject: [Spambayes] incremental training strategies
Message-ID: <15805.26237.16266.425547@montanaro.dyndns.org>

I am now running hammie.py from my procmailrc file, but not yet doing any filtering based on the results. I trained it on my current setup (7000 hams, 5000 spams). Should I:

* train it on every message which passes through my inbox

* only train it on messages which it incorrectly classifies

* some other scheme

? Or is that not yet known?
Skip

From nas@python.ca Mon Oct 28 16:44:05 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 28 Oct 2002 08:44:05 -0800
Subject: [Spambayes] incremental training strategies
In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org>
References: <15805.26237.16266.425547@montanaro.dyndns.org>
Message-ID: <20021028164405.GA22741@glacier.arctrix.com>

Skip Montanaro wrote:
> I am now running hammie.py from my procmailrc file, but not yet doing any
> filtering based on the results. I trained it on my current setup (7000
> hams, 5000 spams). Should I:
>
> * train it on every message which passes through my inbox
>
> * only train it on messages which it incorrectly classifies
>
> * some other scheme
>
> ? Or is that not yet known?

I've trained twice since I started using "neilfilter.py" two months ago. One of those times it was because I updated the classifier and tokenizer code. I don't see the benefit of elaborate incremental updates.

Neil

From tim.one@comcast.net Mon Oct 28 17:01:15 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 12:01:15 -0500
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <3DBCF8F2.1060907@hooft.net>
Message-ID:

On skip_max_word_size, my c.l.py test, 10-fold CV, ham_cutoff=0.20 and spam_cutoff=0.80:

-> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto]

filename:       max12       max20
ham:spam: 20000:14000 20000:14000
fp total:           2           2    the same
fp %:            0.01        0.01
fn total:           0           0    the same
fn %:            0.00        0.00
unsure t:         103         100    slight decrease
unsure %:        0.30        0.29
real cost:     $40.60      $40.00    slight improvement with these cutoffs
best cost:     $27.00      $27.40    best possible got slightly worse
h mean:          0.28        0.27
h sdev:          2.99        2.92
s mean:         99.94       99.93
s sdev:          1.41        1.47
mean diff:      99.66       99.66
k:              22.65       22.70

"Best possible" in max20 would have been to boost ham_cutoff to 0.50(!), and drop spam_cutoff a little to 0.78.
This would have traded away most of the unsures in return for letting 3 spam through:

-> smallest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 11; unsure spam 11
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%

Best possible in max12 was much the same:

-> largest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 12; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%

The classifier pickle size increased by about 1.5 MB (~8.4% bigger).

Anthony, you didn't respond to the question about whether you could have gotten a similar improvement simply by changing cutoff values. The data you posted showed a large decrease in unsures at the expense of a large boost in your FN rate. It's quite plausible that exactly the same would have happened if you raised ham_cutoff. See my results above, where boosting ham cutoff from 0.20 to 0.50 would get rid of 80% of my unsures at the cost of letting 3 (vs 0) spam thru.

From dereks@itsite.com Mon Oct 28 17:16:03 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Mon, 28 Oct 2002 09:16:03 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> > is in the spam collections like Bruce's. Somebody just needs to figure
> > out how to mine it, methinks.
>
> I think you're missing a basic point, but not due to lack of repetition
> .

I'm not missing the basic point, I'm disagreeing with it. (You can stop with the lengthy examples of one guy who wants commercial mails from some particular company or subject domain -- I get it, really, I do.)

I may personally consider messages from you to be "spam" (not as Unsolicited Bulk Email, but simply as unwanted messages). But I don't think it would be the job of a general-purpose installation-wide spam identifier to know that about me, as you seem to suggest. I would want a tool like SpamBayes to flag emails as being like the ones in Bruce's collection.
If I like to get mails similar to those, then nowhere am I obligated to filter those flagged messages into my "Trash" folder. If I like to get messages similar to those, but only if they come from Company X, then I can set up my filters to do that, too. But for the vast majority of people, just knowing that a particular email has Bruce-spam-like content would be enough to want to filter it into a lower-priority folder, or even directly into Trash. At least, I see it as the job of the postmaster to provide a flag that could be used like that.

To summarize: I think it's the job of a spam filter (or "flagger") to identify those messages universally accepted as being spam -- whether or not any one person likes that kind of mail. And although for any given spam there is _somebody_ on Earth who would want to read it, it would be up to them to set up their client-app filter rules to work how they want them to -- even if that includes running a local installation of SpamBayes to do personalized (high-resolution) filtering.

> This isn't a question of classification technology so much as it's a
> question of personal preference, and so long as you're determined that
> everyone must use the same classifier, personal preference goes out the
> window.

Yes, and that's exactly what I'm asking for. I think that for installation-wide filters (I'll use the term 'flagger' from here on since no spam filtering should ever take place at a server -- for both legal and privacy reasons) personal preference is irrelevant. It's irrelevant practically by definition.

> That's a bad use of technology, IMO -- I'm not interested in treating
> everyone like interchangeable cogs.

I think there are a great many people interested in having all spam messages treated like interchangeable cogs. "Spam" meaning a message that would be universally accepted as being a "spam". I've seen many people on this list use Bruce's spam for their training.
But undoubtedly there is a message in his collection that would be of interest to at least *someone* on this list. Does that invalidate his collection as being a spam training repository? I would say no, it does not, because his collection is of the type "universally accepted as spam". That is the type of message I would like to see flagged at Universities, ISPs, and companies. And to do that, I don't think ham training can be in the picture, since somebody's "ham" is another person's "spam", and training on people's "ham" can only weaken what is considered "universally accepted as spam".

--Derek

From gward@python.net Mon Oct 28 17:16:20 2002
From: gward@python.net (Greg Ward)
Date: Mon, 28 Oct 2002 12:16:20 -0500
Subject: [Spambayes] python.org corpus updated
In-Reply-To:
References: <20021026211118.GA29889@cthulhu.gerg.ca>
Message-ID: <20021028171620.GA31109@cthulhu.gerg.ca>

On 26 October 2002, Tim Peters said:
> -> 4 new false positives
> new fp: ['pyham/02155.txt', 'pyham/01816.txt', 'pyham/02322.txt',
> 'pyham/02406.txt']
>
> but I believe they're all spam. I'll attach them for your review. They
> correspond, respectively, to your

Can't really blame SpamAssassin for missing these -- they were all sent to a Mailman -request address, which is explicitly whitelisted on python.org (I don't want to reject unsubscribe requests from people who happen to be on too many RBLs). Moved 'em to spam folder.

> -> 9 new false positives
> new fp: ['pyham/00277.txt', 'pyham/00278.txt', 'pyham/00275.txt',
> 'pyham/00267.txt', 'pyham/01346.txt', 'pyham/00261.txt',
> 'pyham/00276.txt', 'pyham/01284.txt', 'pyham/00645.txt']
>
> Again I believe these are all spam, and some are so outrageously spam it's
> hard to believe SpamAssassin let them pass! Then again, most are in a hated
> language .
> > ham/183BtE-00072Z-00 261 > ham/183DZB-0007dJ-00 267 > ham/183Epz-0001IH-00 275 > ham/183Epz-0001II-00 276 > ham/183Epz-0001IJ-00 277 > ham/183Epz-0001IK-00 278 These should have been dead easy: subject encoded in iso-2022-jp (which is *now* a banned charset on python.org, but wasn't when this harvest started), and are "To: a@a.a". Unfortunately Exim can be made very picky about addresses in sender headers ("From", "Reply-to", "Sender"), but I don't think it has anything for rigorous checking of recipient headers. Hmmm. > ham/183aCi-00024k-00 645 > ham/183ueG-0006vd-00 1284 > ham/183xNY-0008Gi-00 1346 These slipped through because they are to "-request" addresses. > Take those away and there were no false positives in either direction. Wow, awesome. > One example: > > spam/183UWS-00060A-00 633 > > seems a perfectly ordinary piece of mailman-users traffic. chi-combining is > quite certain it's ham: > > prob = 3.37424532759e-012 > prob('*H*') = 1 > prob('*S*') = 6.63913e-012 > > OTOH, SpamAssassin seems certain it's spam: Well, actually, it only scored 5.4. SA doesn't have any formal notion of certainty, but I'm pretty comfortable in stating that scores from 3.0 to 10.0 is the informal SA zone of uncertainty. Blame me: I think I forgot to manually review low-scoring messages in the spam folder for FPs. I'll do that before regenerating the tarballs. > There also appear to be an awful lot of "false negatives" of the form: > > """ > This is a message from the IFL E-Mail Virus Protection Service > -------------------------------------------------------------- > > The original e-mail attachment > > "Card.DOC.pif" > > appears to be infected by a virus and has been replaced by this=20 > warning message. > """ > > That may be virus fallout, but I don't believe it belongs in the spam > corpus, right? 
Correct -- I usually put all that stuff in the virus folder, because I'd like to see all virus-related junk mail stopped, and I think it should be done with different tools from spam detectors. Again, my fault for not manually reviewing the spam folder. Greg -- Greg Ward http://www.gerg.ca/ I'm on a strict vegetarian diet -- I only eat vegetarians. From tim.one@comcast.net Mon Oct 28 17:24:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 12:24:46 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I am now running hammie.py from my procmailrc file, but not yet doing any > filtering based on the results. I trained it on my current setup (7000 > hams, 5000 spams). Should I: > > * train it on every message which passes through my inbox > > * only train it on messages which it incorrectly classifies > > * some other scheme > > ? Or is that not yet known? Experiment . Note that chi-combining has a very real middle ground, and you're not used to that yet: you should certainly train it on msgs it says it's unsure about. For my personal email, I've trained on about 1,000 ham and 1,500 spam. As an experiment, I'm going to stop training now, except for Unsure msgs and mistakes; however, I haven't yet seen a mistake beyond one spam python.org let thru (it let thru more than that, but all the rest of those wound up in my Unsure folder despite the "I've been thru python.org" ham clues; the one that fooled both of us is hopeless). From popiel@wolfskeep.com Mon Oct 28 17:28:55 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Mon, 28 Oct 2002 09:28:55 -0800 Subject: [Spambayes] incremental training strategies In-Reply-To: Message from Skip Montanaro <15805.26237.16266.425547@montanaro.dyndns.org> References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <20021028172855.53969F53E@cashew.wolfskeep.com> In message: <15805.26237.16266.425547@montanaro.dyndns.org> Skip Montanaro writes: > >I am now running hammie.py from my procmailrc file, but not yet doing any >filtering based on the results. I trained it on my current setup (7000 >hams, 5000 spams). Should I: > > * train it on every message which passes through my inbox > > * only train it on messages which it incorrectly classifies > > * some other scheme > >? Or is that not yet known? > >Skip Speaking from a theoretical purity standpoint, I suspect that training it on everything that came through would be 'cleaner'... but I have no idea if in practise it would work any better than just training on the mistakes and unsure. Try out variations, and post results? - Alex From gward@python.net Mon Oct 28 17:31:13 2002 From: gward@python.net (Greg Ward) Date: Mon, 28 Oct 2002 12:31:13 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021028171620.GA31109@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> <20021028171620.GA31109@cthulhu.gerg.ca> Message-ID: <20021028173113.GA31162@cthulhu.gerg.ca> OK, I've done a manual pass over all "low-scoring" messages in the spam folder, and moved a bunch of stuff around. Revised tarballs of the Oct 2002 python.org corpus are now online. Greg -- Greg Ward http://www.gerg.ca/ "Question authority!" "Oh yeah? Says who?" From popiel@wolfskeep.com Mon Oct 28 17:30:23 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 09:30:23 -0800 Subject: [Spambayes] table.py patch to produce averages at end of line. 
In-Reply-To: Message from Anthony Baxter <200210280813.g9S8CcL02773@localhost.localdomain> References: <200210280813.g9S8CcL02773@localhost.localdomain> Message-ID: <20021028173023.EB363F53E@cashew.wolfskeep.com> In message: <200210280813.g9S8CcL02773@localhost.localdomain> Anthony Baxter writes: >The following simple patch produces a final column in table.py of >averages for all files and all measures. This is useful if you're >doing tests with very small amounts of data, and want to run the >test multiple times with different seeds to check that your results >are actually meaningful. >Not sure if this is generally useful enough to anyone else for it to >be checked in - any opinions? Make it a command line option, and I'm sure it'll be welcome. ;-) - Alex From tim.one@comcast.net Mon Oct 28 17:42:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 12:42:15 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021028171620.GA31109@cthulhu.gerg.ca> Message-ID: [Greg Ward] > ... > Correct -- I usually put all that stuff in the virus folder, because I'd > like to see all virus-related junk mail stopped, and I think it should > be done with different tools from spam detectors. I don't expect that *this* spam detector is going to do well at viruses anyway -- although that hasn't been tested. > Again, my fault for not manually reviewing the spam folder. Before you do (or maybe while ), let's think about what we're trying to accomplish: whether or not you ban various charsets, and whether or not you look at blacklists, and whether or not etc, all tests run against c.l.py and python.org traffic have said that a spambayes classifier catches virtually all the spam and has very low fp rates. So, at this point, what is the purpose of more testing? What's the goal here from your POV? I can run any number of tests over any number of coming months, but it's begun to feel simply redundant from my POV.
We're not learning anything new here, just confirming that this approach works great for tech mailing lists, and even for python.org's private hobby lists (provided the classifier is trained on them too). From guido@python.org Mon Oct 28 17:38:18 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 12:38:18 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Your message of "Mon, 28 Oct 2002 09:16:03 PST." References: Message-ID: <200210281738.g9SHcIL20111@pcp02138704pcs.reston01.va.comcast.net> > But for the vast majority of people, just knowing that a > particular email has Bruce-spam-like content would be enough to want > to filter it into a lower-priority folder, or even directly into > Trash. At least, I see it as the job of the postmaster to provide a > flag that could be used like that. > > To summarize: I think it's the job of a spam filter (or "flagger") > to identify those messages universally accepted as being spam -- > whether or not any one person likes that kind of mail. And although > for any given spam there is _somebody_ on Earth who would want to > read it, it would be up to them to set up their client-app filter > rules to work how they want them to -- even if that includes running > a local installation of SpamBayes to do personalized > (high-resolution) filtering. That would be a laudable goal, but the techniques pursued here don't work like that. They can only do a good job if you train them on *both* spam and non-spam. That's how the math of a Bayesian classifier works, alas. Someone can probably prove that you can't reduce the false positives more without knowing what *your* non-spam looks like. It sounds like SpamAssassin might be your best bet if you don't want to train on your non-spam (and even SpamAssassin requires an elaborate "whitelist" setup to avoid flagging the most flagrant spammish-looking non-spam).
--Guido van Rossum (home page: http://www.python.org/~guido/) From popiel@wolfskeep.com Mon Oct 28 17:50:04 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 09:50:04 -0800 Subject: [Spambayes] Timestamp analysis Message-ID: <20021028175004.64C20F53E@cashew.wolfskeep.com> This set of runs took me a lot longer than expected; first I had a couple errors in my scripts causing result files to collide, then I wanted to do it again saving pickles for probing, and finally I discovered that the day-of-week stuff was failing (getting dow:invalid) for nearly all my mail. I have not yet fixed the latter, so the day-of-week results are invalid for the concept, but valid for the implementation. Also, the implementation of generate_time_buckets seems to use 10 minute time buckets, not 6 minute buckets as the code comments suggest. Overall, looking at the date in detail, unrelated to anything else, seems neutral. Almost perfectly so; at most, there was a one unsure difference, which is not significant. In the table below,

r) mine_received_headers: False   basic_header_tokenize: False
R) mine_received_headers: True    basic_header_tokenize: True
t) generate_time_buckets: False
T) generate_time_buckets: True
d) extract_dow: False
D) extract_dow: True

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams [...]
filename:     rtd     rtD     rTd     rTD     Rtd     RtD     RTd     RTD
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:       3       3       3       3       3       3       3       3
fp %:        0.15    0.15    0.15    0.15    0.15    0.15    0.15    0.15
fn total:      12      12      12      12      12      12      12      12
fn %:        0.60    0.60    0.60    0.60    0.60    0.60    0.60    0.60
unsure t:      53      53      54      54      31      31      31      31
unsure %:    1.32    1.32    1.35    1.35    0.78    0.78    0.78    0.78
real cost: $52.60  $52.60  $52.80  $52.80  $48.20  $48.20  $48.20  $48.20
best cost: $48.20  $48.20  $48.20  $48.20  $38.80  $38.80  $38.80  $38.80
h mean:      0.40    0.40    0.40    0.40    0.30    0.30    0.30    0.30
h sdev:      5.39    5.39    5.38    5.38    4.47    4.47    4.48    4.48
s mean:     98.45   98.46   98.46   98.46   98.85   98.85   98.85   98.85
s sdev:      9.76    9.76    9.76    9.75    9.06    9.06    9.06    9.05
mean diff:  98.05   98.06   98.06   98.06   98.55   98.55   98.55   98.55
k:           6.47    6.47    6.48    6.48    7.28    7.28    7.28    7.28

I have not yet posted this on my website... - Alex From tim.one@comcast.net Mon Oct 28 18:11:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:11:18 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Attached is a plot of # of spams sent per 10-minute bucket (based on Skip's Date header cracking), vs time-of-day, across my subset of BruceG's 2002 spam collection. The idea that *his* spam is mostly sent overnight is clearly bogus. Someone who stops looking at email at 5pm and doesn't look again until 8am could sure get that impression, though. The wiggly red line is a one-hour moving average. An obvious conclusion is that many spammers have day jobs, and send out huge spikes at the beginning and end of their lunch hours, but struggle with software problems in between -- IOW, they're us . ---------------------- multipart/mixed attachment A non-text attachment was scrubbed...
Name: spamtime.png Type: image/png Size: 14910 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021028/bb18b31e/spamtime.png ---------------------- multipart/mixed attachment-- From popiel@wolfskeep.com Mon Oct 28 18:11:24 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 10:11:24 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak References: Message-ID: <20021028181124.D69E8F53E@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > To summarize: I think it's the job of a spam filter (or "flagger") >to identify those messages universally accepted as being spam -- whether or >not any one person likes that kind of mail. I'm reasonably sure there is no consensus on the definition of spam, so the concept of 'universally accepted' spam is flawed at its root. Some people restrict it to unsolicited commercial email; some consider any marketing message to be spam. Some don't care if it's commercial or not. Worse, for the lowest-common-denominator UCE definition, knowledge of the individual users is required (whether they solicited it or not). As such, I'd say your ideal universal flagger concept is unrealizable. Even if the concept is sound, I think that the classifiers we're working with are a bad fit for your concept, since at their core they need to know something about what's good as well as what's bad. Otherwise, you end up saying stuff is spam because it used the words 'you', 'there', 'some', 'the', etc... the incidentals of the language, with no real import on the message. > I've seen many people on this list use Bruce's spam for their >training. But undoubtedly there is a message in his collection that would >be of interest to at least *someone* on this list. Does that invalidate >his collection as being a spam training repository?
I have avoided using _any_ outside source of spam, precisely because I don't trust their judgement on my mail. If there's a classification error, I want it to be traceable only to me, not to some other person's potentially warped ideas about mail. (Note that this is not to say that I think Bruce's collection is bad or warped... I haven't looked at it, so cannot say. I'm just paranoid about my mail.) > I would say no, it does not, because his collection is of the type >"universally accepted as spam". That is the type of message I would like >to see flagged at Universities, ISPs, and companies. > > And to do that, I don't think ham training can be in the picture, >since somebody's "ham" is another person's "spam", and training on >people's "ham" can only weaken what is considered "universally accepted as >spam". I'll run some experiments (I've been doing the most with ham:spam ratio, anyway), but I suspect that without any ham the spambayes classifier will fail horribly. - Alex From guido@python.org Mon Oct 28 18:20:56 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 13:20:56 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: Your message of "Mon, 28 Oct 2002 13:11:18 EST." References: Message-ID: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> > Attached is a plot of # of spams sent per 10-minute bucket (based on > Skip's Date header cracking), vs time-of-day, across my subset of > BruceG's 2002 spam collection. The idea that *his* spam is mostly > sent overnight is clearly bogus. Someone who stops looking at email > at 5pm and doesn't look again until 8am could sure get that > impression, though. > > The wiggly red line is a one-hour moving average. An obvious > conclusion is that many spammers have day jobs, and send out huge > spikes at the beginning and end of their lunch hours, but struggle > with software problems in between -- IOW, they're us . The Date header reflects local time at the spammer's box, right?
Could it be local time on a box to which the spammer connects to send his mail? And would that box necessarily have the same local time? In the graph, is there a difference between the narrow black bars and the slightly wider blue/gray bars with black outlines? --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@pobox.com Mon Oct 28 18:25:44 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:25:44 -0600 Subject: [Spambayes] incremental training strategies In-Reply-To: References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <15805.33064.884324.694879@montanaro.dyndns.org> Tim> you should certainly train it on msgs it says it's unsure about. Sounds like a plan. So far, I haven't seen any of these. Skip From jbublitz@nwinternet.com Mon Oct 28 17:34:58 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 10:34:58 -0700 (PST) Subject: [Spambayes] incremental training strategies In-Reply-To: <20021028172855.53969F53E@cashew.wolfskeep.com> Message-ID: On 28-Oct-02 T. Alexander Popiel wrote: > In message: <15805.26237.16266.425547@montanaro.dyndns.org> > Skip Montanaro writes: >> I am now running hammie.py from my procmailrc file, but not yet >> doing any filtering based on the results. I trained it on my >> current setup (7000 hams, 5000 spams). Should I: >> * train it on every message which passes through my inbox >> * only train it on messages which it incorrectly classifies >> * some other scheme >>? Or is that not yet known? > Speaking from a theoretical purity standpoint, I suspect that > training it on everything that came through would be > 'cleaner'... but I have no idea if in practise it would work any > better than just training on the mistakes and unsure. > Try out variations, and post results? I ran tests in chronological order where I trained on 4000 of each type of msg and then:

a. Tested 8000 msgs of each type without retraining
b. Tested 8000 msgs of each type, retraining on all new msgs after each batch of 100 spam/100 ham

b gave clearly better results by nearly an order of magnitude, but that's only 1% or 2% vs. 0.1% or 0.2% at most, so in absolute terms the effect might not be huge depending on mail volume. In theory a closed-loop system should give more accurate results, but it also requires some measures to make sure the retraining data is clean or performance will probably degrade more quickly than if you never retrain at all. Jim From tim.one@comcast.net Mon Oct 28 18:41:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:41:45 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > The Date header reflects local time at the spammer's box, right? > Could it be local time on a box to which the spammer connects to send > his mail? And would that box necessarily have the same local time? Barry may know; I don't. I have to suspect that the answer is "it depends on the mail client". > In the graph, is there a difference between the narrow black bars and > the slightly wider blue/gray bars with black outlines? No, there are six vertical bars per labelled hour (one for each of Skip's 144 10-minute buckets), and I don't know why Excel decided to color some differently. Under a magnified view, it appears that it tried to make them strictly alternate, but every now and again put two blue ones next to each other. I may have left some X-axis minor-tick display option set to an unfortunate value, as the instances of doubled blue bars appear to be more-than-less regularly spaced.
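The 10-minute bucketing that Skip's Date-header cracking and Tim's plot rely on can be sketched as follows. This is an illustrative reconstruction, not the actual spambayes tokenizer code; the function name `time_bucket` is made up. As Skip notes elsewhere in the thread, the timezone offset is deliberately ignored, so buckets reflect the sender's local clock.

```python
# Sketch of 10-minute Date-header bucketing (144 buckets per day).
# Hypothetical helper; not the real spambayes implementation.
from email.utils import parsedate_tz

def time_bucket(date_header):
    """Map an RFC 2822 Date header to one of 144 ten-minute buckets,
    using the sender's local time (the timezone offset is ignored)."""
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None  # invalid dates -- the source of Skip's spike at 0
    hour, minute = parsed[3], parsed[4]
    return hour * 6 + minute // 10  # 24 hours * 6 buckets/hour = 144

print(time_bucket("Tue, 24 Sep 2002 15:33:56 -0500"))  # 93
```

Counting messages per bucket over a corpus and plotting the 144 counts reproduces the kind of time-of-day graphs Skip and Tim describe.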
From skip@pobox.com Mon Oct 28 18:42:08 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:42:08 -0600 Subject: [Spambayes] incremental training strategies In-Reply-To: <20021028172855.53969F53E@cashew.wolfskeep.com> References: <15805.26237.16266.425547@montanaro.dyndns.org> <20021028172855.53969F53E@cashew.wolfskeep.com> Message-ID: <15805.34048.389444.16035@montanaro.dyndns.org> Alex> Speaking from a theoretical purity standpoint, I suspect that Alex> training it on everything that came through would be Alex> 'cleaner'... but I have no idea if in practise it would work any Alex> better than just training on the mistakes and unsure. Yeah, but theory and practice often disagree. ;-) The biggest problem I see in training it on every message you encounter is you are likely to make mistakes, generally of the inattentiveness or fumble-fingered variety. That's fine when you're testing the algorithm. You migrate the message to the other pool, then test again. It's a bit different proposition if you are training messages on-the-fly, then delete them (or even if you don't delete them). How do you realize you misclassified a message? If you realize you misclassified a message, how do you undo the effect of the misclassification, particularly if you no longer have the message laying around? From the standpoint of minimizing human error, once you have a decent hammie.db file, it seems to me that only training on either unsure or incorrect messages is likely to be the best way to improve it. Skip From seant@iname.com Mon Oct 28 18:47:36 2002 From: seant@iname.com (Sean True) Date: Mon, 28 Oct 2002 13:47:36 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <20021028181124.D69E8F53E@cashew.wolfskeep.com> Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of T.
Alexander Popiel > Sent: Monday, October 28, 2002 1:11 PM > To: Derek Simkowiak > Cc: spambayes@python.org; popiel@wolfskeep.com > Subject: Re: [Spambayes] progress on POP+VM+ZODB deployment > > > In message: > Derek Simkowiak writes: > > > > To summarize: I think it's the job of a spam filter (or "flagger") > >to identify those messages universally accepted as being spam -- > whether or > >not any one person likes that kind of mail. > > I'm reasonably sure there is no consensus on the definition of spam, > so the concept of 'universally accepted' spam is flawed at its root. > Some people restrict it to unsolicited commercial email; some consider > any marketing message to be spam. Some don't care if it's commercial > or not. Worse, for the lowest-common-denominator UCE definition, > knowledge of the individual users is required (whether they solicited > it or not). > > As such, I'd say your ideal universal flagger concept is unrealizable. At the risk of being repetitive, for many large email systems, sponsored by large companies, spam is "things that don't contribute to productivity". Being able to preemptively -- and intelligently -- filter out porn, get rich quick, and Nigerian scam mail may be of real interest to people who administer 10000+ seat email systems. This may not be the preferred way to use these filters, from _our_ point of view, but it will likely be an interesting one to the MIS manager in charge of keeping system usage reasonable. -- Sean From guido@python.org Mon Oct 28 18:49:15 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 13:49:15 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: Your message of "Mon, 28 Oct 2002 13:41:45 EST." References: Message-ID: <200210281849.g9SInFr20784@pcp02138704pcs.reston01.va.comcast.net> > > In the graph, is there a difference between the narrow black bars and > > the slightly wider blue/gray bars with black outlines?
> > No, there are six vertical bars per labelled hour (one for each of > Skip's 144 10-minute buckets), and I don't know why Excel decided to > color some differently. Under a magnified view, it appears that it > tried to make them strictly alternate, but every now and again put > two blue ones next to each other. I may have left some X-axis > minor-tick display option set to an unfortunate value, as the > instances of doubled blue bars appear to be more-than-less regularly > spaced. Maybe it's a simple roundoff problem -- the bars could be approximately 2.5 pixels wide, and this gets rounded to 2 or 3 depending on circumstance. I'll ignore the difference then. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Oct 28 18:53:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:53:21 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.34048.389444.16035@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Yeah, but theory and practice often disagree. ;-) In theory, training on every msg is, over time, the same as training on a small random sampling of msgs. > The biggest problem I see in training it on every message you encounter > is you are likely to make mistakes, generally of the inattentiveness or > fumble-fingered variety. > > That's fine when you're testing the algorithm. You migrate the message > to the other pool, then test again. It's a bit different proposition if > you are training messages on-the-fly, then delete them (or even if you > don't delete them). How do you realize you misclassified a message? I save my personal training ham and spam in their own distinct folders, and use Mark's GUI to score them too, then tell Outlook to sort them by score. Mistakes very reliably end up "at the wrong end" of the display. 
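The review trick described above -- score every trained message and sort by score, so mistakes surface at the "wrong end" -- could be sketched like this. The helper name `suspected_mistakes` and the thresholds are invented for illustration; they are not from the Outlook add-in.

```python
# Sketch: surface likely mistraining by flagging ham with high spam
# probability and spam with low spam probability. Data is made up.
def suspected_mistakes(scored, ham_threshold=0.9, spam_threshold=0.1):
    """scored: list of (msg_id, label, spam_probability) tuples."""
    suspects = [(mid, label, p) for mid, label, p in scored
                if (label == "ham" and p >= ham_threshold)
                or (label == "spam" and p <= spam_threshold)]
    # Sort descending so high-scoring "ham" tops the review list.
    return sorted(suspects, key=lambda t: t[2], reverse=True)

scored = [("m1", "ham", 0.02), ("m2", "ham", 0.97),    # m2 looks mistrained
          ("m3", "spam", 0.99), ("m4", "spam", 0.05)]  # m4 looks mistrained
print(suspected_mistakes(scored))  # [('m2', 'ham', 0.97), ('m4', 'spam', 0.05)]
```

Sorting the full training set by score, as described, achieves the same effect visually in a mail client.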
> If you realize you misclassified a message, how do you undo the effect of > the misclassification, Mark added a msg_id -> training_status database to the Outlook client. If I move a mistake into the other training folder, it automatically "does the right thing" (realizes that the msg was trained in the other direction, untrains it from that category, and retrains it for the correct category). > particularly if you no longer have the message laying around? Then you're hosed. > From the standpoint of minimizing human error, once you have a decent > hammie.db file, it seems to me that only training on either unsure or > incorrect messages is likely to be the best way to improve it. I don't believe it, but it hasn't been tested. The problem I foresee is scores that rely too much on accidental hapaxes. This will appear to work great over the short term. When other messages appear containing the same accidental rare strings, their classification will be a coin toss to a proportional extent. From skip@pobox.com Mon Oct 28 18:53:50 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:53:50 -0600 Subject: [Spambayes] Re: Spam vs time-of-day In-Reply-To: References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <15805.34750.447765.195346@montanaro.dyndns.org> ---------------------- multipart/mixed attachment Tim> Attached is a plot of # of spams sent per 10-minute bucket (based Tim> on Skip's Date header cracking), vs time-of-day, across my subset Tim> of BruceG's 2002 spam collection. The idea that *his* spam is Tim> mostly sent overnight is clearly bogus. Someone who stops looking Tim> at email at 5pm and doesn't look again until 8am could sure get Tim> that impression, though. "*his*" refers to Bruce, right? My contention after plotting time buckets was the same: that spam was generally sent at a continuous rate. Ham, on the other hand, does have a strong diurnal pattern. I posted a gnuplot graph to that effect back at the end of September. 
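The msg_id -> training_status bookkeeping described above (untrain from the old category, retrain in the new one when a message is moved) could be sketched as follows. All names here are hypothetical, and plain word counts stand in for the real classifier state.

```python
# Minimal sketch of untrain-then-retrain bookkeeping, keyed by msg_id.
# Illustrative only; not the actual Outlook add-in code.
from collections import Counter

class Trainer:
    def __init__(self):
        self.status = {}                          # msg_id -> "ham" | "spam"
        self.counts = {"ham": Counter(), "spam": Counter()}

    def train(self, msg_id, words, label):
        prev = self.status.get(msg_id)
        if prev == label:
            return                                 # already trained this way
        if prev is not None:                       # moved folders: untrain first
            self.counts[prev].subtract(words)
        self.counts[label].update(words)
        self.status[msg_id] = label

t = Trainer()
t.train("m1", ["cheap", "pills"], "ham")           # oops, mistrained
t.train("m1", ["cheap", "pills"], "spam")          # move undoes and retrains
print(t.counts["ham"]["cheap"], t.counts["spam"]["cheap"])  # 0 1
```

Without the status map (or the original message), the stale counts from the mistraining cannot be backed out -- which is the "you're hosed" case above.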
That's what convinced me to try mining information from the Date: header. For completeness, I've attached my original graph. I believe the x-axis is the 6-minute bucket offset, starting from midnight. The large spike at 0 is an artifact of my simpleminded Date header scanning. Invalid dates probably wound up with a value of 0. Buckets were calculated using local time. That way I didn't penalize Anthony Baxter and other folks who happen not to live in the US. Skip ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: hour.png Type: image/png Size: 7616 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021028/3548e9e9/hour.png ---------------------- multipart/mixed attachment-- From jbublitz@nwinternet.com Mon Oct 28 17:43:43 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 10:43:43 -0700 (PST) Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID: On 28-Oct-02 Tim Peters wrote: > [Greg Ward] >> ... >> Correct -- I usually put all that stuff in the virus folder, >> because I'd like to see all virus-related junk mail stopped, >> and I think it should be done with different tools from spam >> detectors. > I don't expect that *this* spam detector is going to do well at > viruses anyway -- although that hasn't been tested. Works great for me - I put all tagged/scrubbed or virgin virus msgs in my spam corpus from the start and haven't had a problem. I don't virus scan (Linux) but some of my ISPs do. The email module has some problems with them though, because some of the virus taggers mung the boundaries or attachments. Viruses look like spam to me. Jim From popiel@wolfskeep.com Mon Oct 28 18:59:00 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Mon, 28 Oct 2002 10:59:00 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from "Sean True" References: Message-ID: <20021028185901.711BEF53E@cashew.wolfskeep.com> In message: "Sean True" writes: > >At this risk of being repetitive, for many large email systems, sponsored >by large companies, spam is "things that don't contribute to productivity". Uh oh... better filter out this list, then... it is distracting me from work! ;-) - Alex From tim.one@comcast.net Mon Oct 28 19:03:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:03:08 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID: [Jim Bublitz] > Works great for me - I put all tagged/scrubbed or virgin virus msgs > in my spam corpus from the start and haven't had a problem. I don't > virus scan (Linux) but some of my ISPs do. The email module has some > problems with them though, because some of the virus taggers mung > the boundaries or attachments. > > Viruses looks like spam to me. How do you tokenize? We ignore MIME sections that aren't text/*, except for generating metatokens from the MIME armor (content-type, content-disposition, charset and filename parameter values). There's another option to suck up the first 5 decoded bytes of octet-stream sections, but enabling that hasn't made any difference in my tests. IOW, a typical virus generates a very small set of tokens, the way we tokenize. We're also missing src=cid: clues from iframe tags. From tim.one@comcast.net Mon Oct 28 19:29:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:29:30 -0500 Subject: [Spambayes] RE: Spam vs time-of-day In-Reply-To: <15805.34750.447765.195346@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > "*his*" refers to Bruce, right? Right. > My contention after plotting time buckets was the same: that spam was > generally sent at a continuous rate. 
No, your graph and mine both show that it falls off in the early-morning hours. Offline I did a chi-squared test against the hypothesis that the spam was evenly distributed, and the probability that random data could be so skewed was < 1e-18. But ham falls off much more. > Ham, on the other hand, does have a strong diurnal pattern. Very. > I posted a gnuplot graph to that effect back at the end of September. > That's what convinced me to try mining information from the Date: header. > For completeness, I've attached my original graph. I believe the x-axis > is the 6-minute bucket offset, starting from midnight. Your buckets span 10 minutes. The comment in the code is confused about this too. That's why your graph and mine both have 144 points on the X axis (24 * 6 = 144; you have six *buckets* per hour, and each spans 10 minutes). > The large spike at 0 is an artifact of my simpleminded Date header > scanning. Invalid dates probably wound up with a value of 0. And at that time, *every* Date header generated a dow:invalid token (as well as the correct token, when possible). That's been repaired since then. > Buckets were calculated using local time. That way I didn't penalize > Anthony Baxter and other folks who happen not to live in the US. I'm unsure what "were calculated using local time" means. Does the checked in code do that or not? I took what the checked-in code produced at face value (after untangling the hour.bucket_number format into hour.minute). I doubt that it matters, though. Most c.l.py traffic in my corpus is sent from the U.S., and in any case enabling these things didn't help my results (the spamprobs were too mild to make a difference). From barry@wooz.org Mon Oct 28 19:33:14 2002 From: barry@wooz.org (Barry A. 
Warsaw) Date: Mon, 28 Oct 2002 14:33:14 -0500 Subject: [Spambayes] Spam vs time-of-day References: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15805.37114.807848.125972@gargle.gargle.HOWL>

>>>>> "TP" == Tim Peters writes:

>> The Date header reflects local time at the spammer's box,
>> right? Could it be local time on a box to which the spammer
>> connects to send his mail? And would that box necessarily have
>> the same local time?

TP> Barry may know; I don't. I have to suspect that the answer is
TP> "it depends on the mail client".

Actually, just "it depends" would be the correct answer. :) Of course, given the right mail client, just about anything can be shoved into a Date header (and often is). How far messages with bogus or even missing Date headers will make it along the delivery path is dependent on all the tools in the chain. Many mail clients will add Date headers, and I can't imagine such headers would reflect anything other than local time on the box composing the message. Because RFC 2822 requires exactly one Date header, an SMTPd would be within its rights to reject a message from a client that was missing Date, although I think all but qmail probably just add one if it's missing. I'd bet that in 99% of the situations it would have the same local time as the composing machine.

BTW, RFC 2822 has this to say about Date:

3.6.1. The origination date field

[...] The origination date specifies the date and time at which the creator of the message indicated that the message was complete and ready to enter the mail delivery system. For instance, this might be the time that a user pushes the "send" or "submit" button in an application program. In any case, it is specifically not intended to convey the time that the message is actually transported, but rather the time at which the human or other creator of the message has put the message into its final form, ready for transport.
(For example, a portable computer user who is not connected to a network might queue a message for delivery. The origination date is intended to contain the date and time that the user queued the message, not the time when the user connected to the network to send the message.)

So I think it's safe to treat Date as the moment in time when the human hit "send".

-Barry

From tim.one@comcast.net Mon Oct 28 19:36:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:36:14 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.33064.884324.694879@montanaro.dyndns.org> Message-ID:

> Tim> you should certainly train it on msgs it says it's unsure about.

[Skip]
> Sounds like a plan. So far, I haven't seen any of these.

If you're just starting, I suggest fiddling ham_cutoff to a low value and spam_cutoff to a high value. For example, 0.05 and 0.95. That should get you some valuable practice with unsures.

From popiel@wolfskeep.com Mon Oct 28 20:43:33 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 12:43:33 -0800 Subject: [Spambayes] Training without ham Message-ID: <20021028204333.3A1BDF53E@cashew.wolfskeep.com>

Summary: Ham is required in the training set, as expected.

Okay, I made some modifications to my standard ham:spam ratio tests for this one. Previously, when I tested the ham:spam ratio, I just used the --ham-keep and --spam-keep options to timcv.py, meaning that the entire mail stream (both training and testing) was shaped to the given ratio. For this set of tests, I mangled timcv.py to use those options for the training set only, and test with all the ham and spam in the given bucket. (I'm thinking of rerunning the less extreme ratio tests, to see if this changes the sweet-spot interpretation.)

Have a table:

-> tested 200 hams & 200 spams against 180 hams & 1620 spams
[...]
-> tested 200 hams & 200 spams against 0 hams & 1800 spams

filename: 20-180 15-185 10-190 5-195 2-198 1-199 0-200
ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total: 36 46 68 95 381 455 2000
fp %: 1.80 2.30 3.40 4.75 19.05 22.75 100.00
fn total: 5 6 4 2 0 0 0
fn %: 0.25 0.30 0.20 0.10 0.00 0.00 0.00
unsure t: 143 175 279 536 1307 1374 0
unsure %: 3.58 4.38 6.97 13.40 32.67 34.35 0.00
real cost: $393.60 $501.00 $739.80 $1059.20 $4071.40 $4824.80 $20000.00
best cost: $296.00 $386.20 $421.00 $425.60 $452.20 $465.20 $800.00
h mean: 5.48 6.65 10.39 18.54 54.99 61.84 100.00
h sdev: 17.80 19.52 23.29 27.58 30.36 27.06 0.00
s mean: 99.58 99.58 99.64 99.74 99.88 99.92 100.00
s sdev: 5.66 5.65 5.21 4.05 2.04 1.71 0.00
mean diff: 94.10 92.93 89.25 81.20 44.89 38.08 0.00
k: 4.01 3.69 3.13 2.57 1.39 1.32 --NaN--

Okay, looking at this, there is a very clear degradation as the amount of ham drops. This degradation is not just with the default cutoffs (.02 and .9), but with the prescient best cutoffs, too. A quick peek in the actual run output shows that the best cutoffs get progressively closer to .995 and 1.0 throughout. In fact, the ideal spam cutoff is 1.0 for all runs with less than 15 ham trained from each bucket, effectively eliminating the spam category and calling all spam unsure just to lower costs. Also note that with no ham in the training set, *EVERYTHING* is called spam (with sane cutoffs) or unsure (with .995 and 1.0). In either case, there is no distinguishing ham from spam.

So yes, spambayes is worthless without ham in the training corpus.

- Alex

From tim.one@comcast.net Mon Oct 28 20:51:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 15:51:20 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID:

[Derek Simkowiak]
> I'm not missing the basic point, I'm disagreeing with it.
(You
> can stop with the lengthy examples of one guy who wants commercial mails
> from some particular company or subject domain -- I get it, really, I do.)

Good!

> I may personally consider messages from you to be "spam" (not as
> Unsolicited Bulk Email, but simply as unwanted messages). But I don't
> think it would be the job of a general-purpose installation-wide spam
> identifier to know that about me, as you seem to suggest.

Then you're willing to settle for very little, and I'm glad you're not running my installation.

> I would want a tool like SpamBayes to flag emails as being like
> the ones in Bruce's collection. If I like to get mails similar to those,
> then nowhere am I obligated to filter those flagged messages into my
> "Trash" folder. If I like to get messages similar to those, but only if
> they come from Company X, then I can set up my filters to do that, too.
>
> But for the vast majority of people, just knowing that a
> particular email has Bruce-spam-like content would be enough to want to
> filter it into a lower-priority folder, or even directly into Trash. At
> least, I see it as the job of the postmaster to provide a flag that could
> be used like that.
>
> To summarize: I think it's the job of a spam filter (or "flagger")
> to identify those messages universally accepted as being spam -- whether or
> not any one person likes that kind of mail. And although for any given
> spam there is _somebody_ on Earth who would want to read it, it would be
> up to them to set up their client-app filter rules to work how they want
> them to -- even if that includes running a local installation of SpamBayes
> to do personalized (high-resolution) filtering.

In that case, try this code and see what happens. Use all defaults, because they still favor mixed-source corpora so won't suck out "too many" clues specific to your machines or your recipients.
Generate a starter database from your own email, and then teach it from the complaints your friendly workgroup makes. Put some elbow grease into this!

> ...
> I think there are a great many people interested in having all
> spam messages treated like interchangeable cogs. "Spam" meaning a message
> that would be universally accepted as being a "spam".

I'll leave that argument to you and your users now.

> I've seen many people on this list use Bruce's spam for their
> training.

I know of two.

> But undoubtedly there is a message in his collection that would
> be of interest to at least *someone* on this list. Does that invalidate
> his collection as being a spam training repository?

Of course not, but I've removed messages from his spam corpus that don't fit an appropriate definition of spam for comp.lang.python purposes. There are other messages I'd remove from his spam corpus if training for my personal purposes. There are some messages that need to be removed for any purposes, because they were plainly misclassified.

> I would say no, it does not, because his collection is of the type
> "universally accepted as spam". That is the type of message I would like
> to see flagged at Universities, ISPs, and companies.
>
> And to do that, I don't think ham training can be in the picture,
> since somebody's "ham" is another person's "spam", and training on
> people's "ham" can only weaken what is considered "universally accepted as
> spam".

Set up a test and measure results. I expect it will detect "BruceG spam" quite reliably, but that it will also call many other msgs spam. The variety in spam is, I expect, much larger than you presently imagine, and BruceG's collection includes msgs like this:

"""
Tim,

It was great to talk to you today I should have the propsal done by tommorrow

Take Care,
Susan
"""

In fact, it contains *many* msgs like that. They are in fact spam, but I doubt you would claim that this msg would be "universally recognized as spam".
If you don't want msgs "like that" classified as spam, and won't train on ham too to give it a fighting chance, then you've got weeks of work of your own to do to try to remove msgs like that from BruceG's (or anyone else's) spam collection before training. Our codebase will help you do that, BTW: this kind of spam usually does score as spam, but on the low end of the spam scale. It's statistically unusual compared to the bulk of the spam.

From jbublitz@nwinternet.com Mon Oct 28 20:04:27 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 13:04:27 -0700 (PST) Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID:

On 28-Oct-02 Tim Peters wrote:
> How do you tokenize? We ignore MIME sections that aren't text/*,
> except for generating metatokens from the MIME armor
> (content-type, content-disposition, charset and filename
> parameter values). There's another option to suck up the first 5
> decoded bytes of octet-stream sections, but enabling that hasn't
> made any difference in my tests.

> IOW, a typical virus generates a very small set of tokens, the
> way we tokenize. We're also missing src=cid: clues from iframe
> tags.

When I tried spambayes I didn't have any problems with virus msgs either (before the chi-combining scheme, but that should work better if anything), so I don't think tokenizing is that critical. Viruses do tend to score towards the middle and have been some of my earlier fn problems, IMO for exactly the reasons you state. The ISP-tagged viruses do have more words though. Given sufficient training these schemes appear able to classify almost anything.

I only tokenize anything that's Content-Type: text/* (or headers). I end up (after 30K+ msgs) with a 2.5MB text file db of (token, prob) - about 150K tokens. The complete code was posted to the list a week or two ago.
Tokenizing is:

    # > 50% Asian language spam, some English/Asian language
    # mixed ham
    TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

    # remove mixed alphanumerics or strictly numeric:
    # eg: HM6116, 555N, 1234 (also Windows98, 1337, h4X0r)
    # removes about 500K tokens (also most boundaries, msg IDs,
    # some date/time info)
    pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
    pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
    num_re = re.compile(r"^[0-9]+")

>> in the method that actually tokenizes (headers and everything) <<

    tokens = TOKEN_RE.findall(str(data))
    if not len(tokens):
        return

    # added the first 'if' in the loop to reduce
    # total # of tokens by >75%
    deletes = 0
    for token in tokens:
        if (len(token) > 20) \
           or (pn1_re.search(token) != None) \
           or (pn2_re.search(token) != None) \
           or (num_re.search(token) != None) \
           or token in ignore:
            deletes += 1
            continue
        if token in self:
            self[token] += 1
        else:
            self[token] = 1

    # count tokens, not msgs
    self.count += len(tokens) - deletes

"ignore" is just a list of strings that scoring puts in the headers - about 10 words. The regexes above strip the actual score, so "99" or "0.99" won't be strong indicators after training. The "if" stmt could probably be cleaned up - I was just adding and subtracting different stuff for the best performance and settled on what's there.

Jim

From nas@python.ca Mon Oct 28 21:27:47 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 28 Oct 2002 13:27:47 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: Message-ID: <20021028212747.GB23637@glacier.arctrix.com>

It seems to me that the spambayes approach works best when integrated into the user's MUA. That's unfortunate because there are so many different MUAs out there and most of them are not easily extendable. I suppose you could use IMAP and have both the ham and spam folders on the server.
Neil

From tim.one@comcast.net Mon Oct 28 21:33:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 16:33:59 -0500 Subject: [Spambayes] Training without ham In-Reply-To: <20021028204333.3A1BDF53E@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> Summary: Ham is required in the training set, as expected.
> ...
> So yes, spambayes is worthless without ham in the training corpus.

Ya, but that doesn't prove we need to train on spam. I posted a variant update_probabilities yesterday, which ignored hamcounts when computing spamprobs. What I didn't report on was trying that, after fiddling a combining method to merely compute the average spamprob in a msg. Histogram analysis consistently suggested that my best strategy was to set ham_cutoff at 0.0 then, and spam_cutoff at 1.0; i.e., to call *everything* "unsure". The (possibly surprising) reason can be deduced from this (from a 10-fold randomized CV run over 2000 of each):

-> Spam scores for all runs: 2000 items; mean 19.54; sdev 8.85
-> min 3.97721; median 19.6394; max 71.6909
-> percentiles: 5% 4.65238; 25% 14.8778; 75% 23.7485; 95% 34.8339

-> Ham scores for all runs: 2000 items; mean 24.17; sdev 7.97
-> min 4.2792; median 23.4837; max 73.9471
-> percentiles: 5% 12.0403; 25% 19.4017; 75% 28.2031; 95% 37.6717

IOW, ham scores *higher* than spam for spamness under this measure, although the overlap is extreme. I wasn't much motivated to pursue this.

From popiel@wolfskeep.com Mon Oct 28 23:31:34 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 15:31:34 -0800 Subject: [Spambayes] Training without ham In-Reply-To: Message from Tim Peters References: Message-ID: <20021028233134.17645F53E@cashew.wolfskeep.com>

In message: Tim Peters writes:
>[T. Alexander Popiel]
>> Summary: Ham is required in the training set, as expected.
>> ...
>> So yes, spambayes is worthless without ham in the training corpus.
>
>Ya, but that doesn't prove we need to train on spam.
You are an evil man, Tim. Just for that, I present the following:

Summary: We need to train on spam, too.

Methodology is identical to my no-ham test, except that I'm using very little spam instead of very little ham.

-> tested 200 hams & 200 spams against 1620 hams & 180 spams
[...]
-> tested 200 hams & 200 spams against 1800 hams & 0 spams

filename: 180-20 185-15 190-10 195-5 198-2 199-1 200-0
ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total: 1 2 0 0 0 0 0
fp %: 0.05 0.10 0.00 0.00 0.00 0.00 0.00
fn total: 68 77 118 291 672 1223 2000
fn %: 3.40 3.85 5.90 14.55 33.60 61.15 100.00
unsure t: 318 378 554 1011 1160 707 0
unsure %: 7.95 9.45 13.85 25.27 29.00 17.68 0.00
real cost: $141.60 $172.60 $228.80 $493.20 $904.00 $1364.40 $2000.00
best cost: $92.60 $98.40 $127.40 $209.40 $371.00 $607.80 $800.00
h mean: 0.29 0.28 0.21 0.11 0.06 0.04 0.00
h sdev: 4.29 4.20 3.04 1.84 1.30 1.19 0.00
s mean: 90.53 88.71 81.88 62.52 37.17 20.21 0.00
s sdev: 22.22 23.77 28.84 33.27 29.01 25.77 0.00
mean diff: 90.24 88.43 81.67 62.41 37.11 20.17 0.00
k: 3.40 3.16 2.56 1.78 1.22 0.75 --NaN--

This is almost a perfect mirror image of the problem on the other end, including the cutoffs approaching 0.0 and 0.005. I won't bother with more detail on this one.

Tim, you're evil.

- Alex

From anthony@interlink.com.au Tue Oct 29 01:25:51 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 29 Oct 2002 12:25:51 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. Message-ID: <200210290125.g9T1Ppw09085@localhost.localdomain>

So I hacked on timcv.py and msgs.py to add options 'spam-test', 'spam-train', 'ham-test' and 'ham-train', to allow you to set the training set size separately to the testing set size. I haven't checked this in because it will break everyone's test scripts - --spam= will no longer be distinct, and getopt will gripe. Let me know if I should check this in anyway - I think it's useful, but YMMV.
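[A small aside on why getopt would gripe - this is an illustrative sketch, not the actual timcv.py option handling: Python's getopt module accepts any unique prefix of a long option name, so once both 'spam-test' and 'spam-train' exist, a bare '--spam=' no longer names either one and parsing fails.]

```python
# Illustrative sketch (not the real timcv.py code): with both
# --spam-test and --spam-train registered, "--spam" matches two
# long options, so getopt raises "not a unique prefix".
import getopt

longopts = ["spam-test=", "spam-train=", "ham-test=", "ham-train="]

try:
    getopt.getopt(["--spam=250"], "", longopts)
except getopt.GetoptError as err:
    print(err)   # option --spam not a unique prefix

# Spelled out in full, the new options parse fine:
opts, args = getopt.getopt(["--spam-train=250"], "", longopts)
print(opts)      # [('--spam-train', '250')]
```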
I wanted to see what would happen to the fp numbers when you were testing against a full-sized spam corpus, but a very very small number of ham messages. Here's some of what I found. The filenames are hamtrain_hamtest numbers - so 002_100 tested 100 hams against 3x2 spams (my personal corpus has 4 data sets in it).

What I was trying to determine here was whether it would be feasible to ship with a pre-canned set of spam wordinfo data, but no ham, and get the user to feed in a bunch of their own ham at the start. And if so, how much ham does the user have to feed in to start getting useful results. I've snipped the 'ratio' numbers, as they didn't reflect reality here.

For each of these tests, the system trained on 350 spam per set, and tested on 250 spam per set. It then trained on a small number of ham per set (the left hand side of the filename gives the number) and tested on 100 hams per set. The righthand-most set is the numbers for the 'full' set - 4 sets, each with around 2500 ham and 400 spam.

For all these tests, ham_cutoff was 0.27 and spam_cutoff was 0.99. They're just numbers I picked, after earlier testing. I'm more interested in how the data changes as the amount of training data changes.
filename: 001:100 002:100 003:100 005:100 010:100 015:100 020:100 full
fp total: 189 167 152 104 69 53 42 0
fp %: 47.44 41.94 38.19 26.12 17.25 13.33 10.69 0.00
fn total: 0 0 0 0 0 0 0 7
fn %: 0.00 0.00 0.00 0.00 0.03 0.03 0.00 0.49
unsure t: 215 204 187 209 191 176 148 164
unsure %: 15.41 14.59 13.39 14.95 13.66 12.57 10.59 1.30
real cost: $1940.65 $1718.35 $1565.00 $1086.85 $728.50 $568.87 $457.40 $39.80
best cost: $244.10 $238.25 $235.90 $227.65 $222.10 $218.13 $215.65 $35.20
h mean: 88.62 79.08 71.88 61.70 48.45 40.65 33.62 0.56
h sdev: 16.26 27.43 33.44 35.76 37.99 37.23 36.79 3.79
s mean: 99.93 99.95 99.95 99.88 99.87 99.83 99.81 97.91
s sdev: 0.94 0.68 0.80 1.59 1.94 2.02 2.27 10.12
mean diff: 11.32 20.87 28.07 38.18 51.41 59.18 66.19 97.35
k: 0.66 0.77 0.83 1.02 1.29 1.51 1.70 7.00

The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are actually averages of 4 different runs for each, with different -s options on each one (same set of 4 -s used for each, tho). Otherwise the variation was just too damn high. It's still a little 'bloopy' - the unsure bounces around a bit, but it's not bad.

The next set was just to confirm some of my prejudices - how much did upping the number of test hams change things? Testing with the same setup as before - this time, training with 25 hams from each of 3 sets, and testing with 100, 200, 300, 400, 500 hams. Once again, for reference, the 'full' data is the right hand column. Obviously, for this test the key thing to look at are the percentages, means, medians and the like, not the totals.
filename: 025_100 025_200 025_300 025_400 025_500 full
fp total: 50 70 91 159 150 0
fp %: 12.50 8.75 7.58 9.94 7.50 0.00
fn total: 0 0 0 0 0 7
fn %: 0.00 0.00 0.00 0.00 0.00 0.49
unsure t: 142 252 360 556 592 164
unsure %: 10.14 14.00 16.36 21.38 19.73 1.30
real cost: $528.40 $750.40 $982.00 $1701.20 $1618.40 $39.80
best cost: $216.80 $225.20 $230.20 $243.80 $242.00 $35.20
h mean: 35.31 29.04 26.69 32.89 27.12 0.56
h sdev: 37.76 35.33 33.94 36.29 34.21 3.79
s mean: 99.82 99.75 99.72 99.82 99.72 97.91
s sdev: 1.73 2.86 3.11 1.73 3.11 10.12
mean diff: 64.51 70.71 73.03 66.93 72.60 97.35
k: 1.63 1.85 1.97 1.76 1.95 7.00

Finally, decided to go again, this time with ham-test set to 500, and ham-train set to different numbers (going up much higher this time). This time, I couldn't be bothered screenscraping the summarised averages from each, so here's the data for each of the 4 runs at each ham-train setting, with averages on the right.

filename: 001a_500 001c_500 001b_500 001d_500 avg
fp total: 1576 1170 919 871 1134
fp %: 78.80 58.50 45.95 43.55 56.70
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 154 812 1079 1130 793
unsure %: 5.13 27.07 35.97 37.67 26.46
real cost: $15790.80 $11862.40 $9405.80 $8936.00 $11498.75
best cost: $516.00 $438.00 $389.40 $379.80 $430.80
h mean: 86.07 91.31 87.46 87.14 87.99
h sdev: 31.85 16.40 16.67 16.51 20.36
s mean: 100.00 99.95 99.97 99.95 99.97
s sdev: 0.08 0.96 0.39 0.71 0.54
mean diff: 13.93 8.64 12.51 12.81 11.97
k: 0.44 0.50 0.73 0.74 0.60

filename: 010a_500 010c_500 010b_500 010d_500 avg
fp total: 310 297 308 346 315
fp %: 15.50 14.85 15.40 17.30 15.76
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 953 987 921 986 961
unsure %: 31.77 32.90 30.70 32.87 32.06
real cost: $3290.60 $3168.40 $3264.20 $3657.20 $3345.10
best cost: $274.00 $270.60 $270.40 $281.20 $274.05
h mean: 46.54 47.73 45.74 50.05 47.52
h sdev: 36.64 35.98 37.15 37.73 36.88
s mean: 99.87 99.82 99.90 99.91 99.88
s sdev: 1.19 2.88 1.25 1.10 1.60
mean diff: 53.33 52.09 54.16 49.86 52.36
k: 1.41 1.34 1.41 1.28 1.36

filename: 020a_500 020c_500 020b_500 020d_500 avg
fp total: 243 173 202 71 172
fp %: 12.15 8.65 10.10 3.55 8.61
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 791 691 690 489 665
unsure %: 26.37 23.03 23.00 16.30 22.18
real cost: $2588.20 $1869.20 $2158.00 $807.80 $1855.80
best cost: $261.40 $246.60 $247.60 $226.40 $245.50
h mean: 38.01 31.48 33.09 19.77 30.59
h sdev: 37.36 35.42 36.67 29.44 34.72
s mean: 99.84 99.77 99.88 99.67 99.79
s sdev: 1.40 2.90 1.71 3.52 2.38
mean diff: 61.83 68.29 66.79 79.90 69.20
k: 1.60 1.78 1.74 2.42 1.89

filename: 030a_500 030c_500 030b_500 030d_500 avg
fp total: 173 150 155 133 152
fp %: 8.65 7.50 7.75 6.65 7.64
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 624 533 571 580 577
unsure %: 20.80 17.77 19.03 19.33 19.23
real cost: $1854.80 $1606.60 $1664.20 $1446.00 $1642.90
best cost: $248.20 $242.80 $239.80 $237.80 $242.15
h mean: 29.65 24.97 27.12 25.45 26.80
h sdev: 35.74 33.86 35.35 33.55 34.62
s mean: 99.81 99.74 99.88 99.77 99.80
s sdev: 1.89 2.93 1.78 2.69 2.32
mean diff: 70.16 74.77 72.76 74.32 73.00
k: 1.86 2.03 1.96 2.05 1.98

filename: 040a_500 040c_500 040b_500 040d_500 avg
fp total: 121 114 109 81 106
fp %: 6.05 5.70 5.45 4.05 5.31
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 522 443 470 404 459
unsure %: 17.40 14.77 15.67 13.47 15.33
real cost: $1314.40 $1228.60 $1184.00 $890.80 $1154.45
best cost: $241.00 $235.60 $229.80 $231.20 $234.40
h mean: 23.62 20.82 21.42 17.71 20.89
h sdev: 33.35 32.06 32.15 29.90 31.87
s mean: 99.77 99.72 99.85 99.69 99.76
s sdev: 2.14 3.17 1.96 3.26 2.63
mean diff: 76.15 78.90 78.43 81.98 78.87
k: 2.15 2.24 2.30 2.47 2.29

filename: 060a_500 060c_500 060b_500 060d_500 avg
fp total: 54 80 74 60 67
fp %: 2.70 4.00 3.70 3.00 3.35
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 272 320 281 235 277
unsure %: 9.07 10.67 9.37 7.83 9.23
real cost: $594.40 $865.00 $796.20 $647.00 $725.65
best cost: $224.20 $230.60 $223.60 $223.80 $225.55
h mean: 11.95 15.23 14.08 11.38 13.16
h sdev: 25.25 28.86 27.35 25.31 26.69
s mean: 99.60 99.63 99.78 99.75 99.69
s sdev: 3.75 3.75 2.41 2.45 3.09
mean diff: 87.65 84.40 85.70 88.37 86.53
k: 3.02 2.59 2.88 3.18 2.92

filename: 100a_500 100c_500 100b_500 100d_500 avg
fp total: 40 56 61 40 49
fp %: 2.00 2.80 3.05 2.00 2.46
fn total: 0 0 0 1 0
fn %: 0.00 0.00 0.00 0.10 0.03
unsure t: 190 215 203 185 198
unsure %: 6.33 7.17 6.77 6.17 6.61
real cost: $438.00 $603.00 $650.60 $438.00 $532.40
best cost: $219.60 $225.20 $221.00 $218.40 $221.05
h mean: 8.03 10.37 9.74 8.33 9.12
h sdev: 21.41 24.44 23.46 22.11 22.86
s mean: 99.55 99.50 99.70 99.69 99.61
s sdev: 4.17 4.53 3.16 3.35 3.80
mean diff: 91.52 89.13 89.96 91.36 90.49
k: 3.58 3.08 3.38 3.59 3.41

filename: 150a_500 150c_500 150b_500 150d_500 avg
fp total: 34 36 33 50 38
fp %: 1.70 1.80 1.65 2.50 1.91
fn total: 1 0 1 1 0
fn %: 0.10 0.00 0.10 0.10 0.08
unsure t: 151 152 114 124 135
unsure %: 5.03 5.07 3.80 4.13 4.51
real cost: $371.20 $390.40 $353.80 $525.80 $410.30
best cost: $217.40 $221.00 $216.40 $219.60 $218.60
h mean: 6.43 6.84 5.56 6.50 6.33
h sdev: 19.43 20.21 18.38 20.33 19.59
s mean: 99.50 99.43 99.58 99.48 99.50
s sdev: 4.57 4.95 4.27 4.83 4.65
mean diff: 93.07 92.59 94.02 92.98 93.17
k: 3.88 3.68 4.15 3.70 3.85

filename: 200a_500 200c_500 200b_500 200d_500 avg
fp total: 20 24 16 10 17
fp %: 1.00 1.20 0.80 0.50 0.88
fn total: 1 1 1 1 1
fn %: 0.10 0.10 0.10 0.10 0.10
unsure t: 123 136 101 109 117
unsure %: 4.10 4.53 3.37 3.63 3.91
real cost: $225.60 $268.20 $181.20 $122.80 $199.45
best cost: $213.40 $217.60 $174.40 $114.20 $179.90
h mean: 4.69 5.31 3.95 3.88 4.46
h sdev: 16.53 17.87 15.37 14.41 16.05
s mean: 99.45 99.38 99.50 99.45 99.44
s sdev: 4.81 5.29 5.02 5.10 5.05
mean diff: 94.76 94.07 95.55 95.57 94.99
k: 4.44 4.06 4.69 4.90 4.52

filename: 250a_500 250c_500 250b_500 250d_500 avg
fp total: 13 14 11 8 11
fp %: 0.65 0.70 0.55 0.40 0.58
fn total: 1 1 1 0 0
fn %: 0.10 0.10 0.10 0.00 0.08
unsure t: 117 118 93 118 111
unsure %: 3.90 3.93 3.10 3.93 3.72
real cost: $154.40 $164.60 $129.60 $103.60 $138.05
best cost: $145.20 $158.40 $123.20 $93.40 $130.05
h mean: 3.93 4.03 3.41 4.00 3.84
h sdev: 14.76 15.08 14.07 14.78 14.67
s mean: 99.41 99.33 99.49 99.62 99.46
s sdev: 5.11 5.55 5.13 3.99 4.95
mean diff: 95.48 95.30 96.08 95.62 95.62
k: 4.81 4.62 5.00 5.09 4.88

filename: 300a_500 300c_500 300b_500 300d_500 avg
fp total: 9 8 7 9 8
fp %: 0.45 0.40 0.35 0.45 0.41
fn total: 2 2 1 1 1
fn %: 0.20 0.20 0.10 0.10 0.15
unsure t: 112 107 97 89 101
unsure %: 3.73 3.57 3.23 2.97 3.38
real cost: $114.40 $103.40 $90.40 $108.80 $104.25
best cost: $105.00 $96.00 $84.20 $102.00 $96.80
h mean: 3.43 3.25 3.14 2.99 3.20
h sdev: 13.57 13.03 13.41 13.16 13.29
s mean: 99.34 99.30 99.48 99.58 99.42
s sdev: 5.36 5.73 5.19 4.11 5.10
mean diff: 95.91 96.05 96.34 96.59 96.22
k: 5.07 5.12 5.18 5.59 5.24

filename: 350a_500 350c_500 350b_500 350d_500 avg
fp total: 8 5 4 4 5
fp %: 0.40 0.25 0.20 0.20 0.26
fn total: 2 1 1 3 1
fn %: 0.20 0.10 0.10 0.30 0.17
unsure t: 101 100 89 94 96
unsure %: 3.37 3.33 2.97 3.13 3.20
real cost: $102.20 $71.00 $58.80 $61.80 $73.45
best cost: $93.00 $65.60 $53.80 $54.60 $66.75
h mean: 2.88 2.81 2.65 2.63 2.74
h sdev: 12.12 11.84 11.90 11.54 11.85
s mean: 99.33 99.28 99.43 99.34 99.34
s sdev: 5.48 5.83 5.41 5.93 5.66
mean diff: 96.45 96.47 96.78 96.71 96.60
k: 5.48 5.46 5.59 5.54 5.52

filename: 400a_500 400c_500 400b_500 400d_500 avg
fp total: 6 5 3 6 5
fp %: 0.30 0.25 0.15 0.30 0.25
fn total: 2 3 2 1 2
fn %: 0.20 0.30 0.20 0.10 0.20
unsure t: 94 96 83 80 88
unsure %: 3.13 3.20 2.77 2.67 2.94
real cost: $80.80 $72.20 $48.60 $77.00 $69.65
best cost: $73.40 $64.40 $44.20 $71.00 $63.25
h mean: 2.57 2.51 2.39 2.30 2.44
h sdev: 11.24 11.11 11.20 10.67 11.05
s mean: 99.29 99.23 99.38 99.46 99.34
s sdev: 5.60 6.21 5.62 4.70 5.53
mean diff: 96.72 96.72 96.99 97.16 96.90
k: 5.74 5.58 5.77 6.32 5.85

filename: 450a_500 450c_500 450b_500 450d_500 avg
fp total: 5 4 3 5 4
fp %: 0.25 0.20 0.15 0.25 0.21
fn total: 2 4 2 3 2
fn %: 0.20 0.40 0.20 0.30 0.28
unsure t: 81 88 77 88 83
unsure %: 2.70 2.93 2.57 2.93 2.78
real cost: $68.20 $61.60 $47.40 $70.60 $61.95
best cost: $63.40 $53.60 $42.80 $60.20 $55.00
h mean: 2.14 2.19 2.08 2.79 2.30
h sdev: 10.05 10.12 10.34 12.08 10.65
s mean: 99.21 99.09 99.32 99.56 99.30
s sdev: 5.98 6.91 5.89 5.16 5.99
mean diff: 97.07 96.90 97.24 96.77 96.99
k: 6.06 5.69 5.99 5.61 5.84

filename: 500a_500 500c_500 500b_500 500d_500 avg
fp total: 5 3 3 5 4
fp %: 0.25 0.15 0.15 0.25 0.20
fn total: 2 4 2 1 2
fn %: 0.20 0.40 0.20 0.10 0.23
unsure t: 78 88 74 76 79
unsure %: 2.60 2.93 2.47 2.53 2.63
real cost: $67.60 $51.60 $46.80 $66.20 $58.05
best cost: $63.00 $45.60 $41.80 $59.60 $52.50
h mean: 2.08 2.07 1.80 2.00 1.99
h sdev: 9.82 9.83 9.35 9.79 9.70
s mean: 99.19 99.06 99.21 99.39 99.21
s sdev: 6.11 6.98 6.51 5.31 6.23
mean diff: 97.11 96.99 97.41 97.39 97.22
k: 6.10 5.77 6.14 6.45 6.11

filename: 600a_500 600c_500 600b_500 600d_500 avg
fp total: 4 2 3 4 3
fp %: 0.20 0.10 0.15 0.20 0.16
fn total: 3 3 2 1 2
fn %: 0.30 0.30 0.20 0.10 0.23
unsure t: 79 77 70 79 76
unsure %: 2.63 2.57 2.33 2.63 2.54
real cost: $58.80 $38.40 $46.00 $56.80 $50.00
best cost: $51.60 $33.60 $40.80 $50.60 $44.15
h mean: 1.84 1.69 1.56 1.77 1.71
h sdev: 9.10 8.68 8.52 9.11 8.85
s mean: 99.13 98.98 99.15 99.25 99.13
s sdev: 6.40 7.29 6.80 5.74 6.56
mean diff: 97.29 97.29 97.59 97.48 97.41
k: 6.28 6.09 6.37 6.56 6.33

filename: 700a_500 700c_500 700b_500 700d_500 avg
fp total: 4 2 2 2 2
fp %: 0.20 0.10 0.10 0.10 0.12
fn total: 3 3 2 3 2
fn %: 0.30 0.30 0.20 0.30 0.28
unsure t: 75 70 70 62 69
unsure %: 2.50 2.33 2.33 2.07 2.31
real cost: $58.00 $37.00 $36.00 $35.40 $41.60
best cost: $51.60 $34.00 $33.40 $32.00 $37.75
h mean: 1.64 1.53 1.41 1.35 1.48
h sdev: 8.51 8.13 7.99 7.30 7.98
s mean: 99.07 98.92 99.07 99.23 99.07
s sdev: 6.60 7.60 6.99 6.45 6.91
mean diff: 97.43 97.39 97.66 97.88 97.59
k: 6.45 6.19 6.52 7.12 6.57

filename: 1000a_500 1000c_500 1000b_500 1000d_500 avg
fp total: 2 0 2 1 1
fp %: 0.10 0.00 0.10 0.05 0.06
fn total: 4 3 3 1 2
fn %: 0.40 0.30 0.30 0.10 0.28
unsure t: 68 66 61 56 62
unsure %: 2.27 2.20 2.03 1.87 2.09
real cost: $37.60 $16.20 $35.20 $22.20 $27.80
best cost: $35.20 $15.60 $33.80 $20.20 $26.20
h mean: 1.11 1.10 1.11 1.27 1.15
h sdev: 6.63 6.33 7.02 7.78 6.94
s mean: 98.80 98.71 98.90 99.26 98.92
s sdev: 7.62 8.13 7.66 5.63 7.26
mean diff: 97.69 97.61 97.79 97.99 97.77
k: 6.86 6.75 6.66 7.31 6.89

filename: 1500a_500 1500c_500 1500b_500 1500d_500 avg
fp total: 1 0 1 0 0
fp %: 0.05 0.00 0.05 0.00 0.03
fn total: 4 4 3 7 4
fn %: 0.40 0.40 0.30 0.70 0.45
unsure t: 76 71 77 74 74
unsure %: 2.53 2.37 2.57 2.47 2.48
real cost: $29.20 $18.20 $28.40 $21.80 $24.40
best cost: $28.00 $14.80 $23.80 $11.80 $19.60
h mean: 0.86 0.77 0.79 0.89 0.83
h sdev: 5.54 4.86 5.43 5.58 5.35
s mean: 98.39 98.31 98.42 98.45 98.39
s sdev: 8.81 9.31 9.04 9.10 9.06
mean diff: 97.53 97.54 97.63 97.56 97.56
k: 6.80 6.88 6.75 6.65 6.77

filename: 2000a_500 2000c_500 2000b_500 2000d_500 avg
fp total: 0 0 0 0 0
fp %: 0.00 0.00 0.00 0.00 0.00
fn total: 5 4 3 6 4
fn %: 0.50 0.40 0.30 0.60 0.45
unsure t: 81 82 86 75 81
unsure %: 2.70 2.73 2.87 2.50 2.70
real cost: $21.20 $20.40 $20.20 $21.00 $20.70
best cost: $20.60 $14.00 $14.40 $13.00 $15.50
h mean: 0.73 0.60 0.68 0.66 0.67
h sdev: 4.74 3.80 4.63 4.48 4.41
s mean: 98.06 98.02 98.13 98.11 98.08
s sdev: 9.77 9.94 9.60 10.14 9.86
mean diff: 97.33 97.42 97.45 97.45 97.41
k: 6.71 7.09 6.85 6.67 6.83

filename: 2500a_500 2500c_500 2500b_500 2500d_500 avg
fp total: 0 0 0 0 0
fp %: 0.00 0.00 0.00 0.00 0.00
fn total: 5 3 3 6 4
fn %: 0.50 0.30 0.30 0.60 0.43
unsure t: 87 92 92 82 88
unsure %: 2.90 3.07 3.07 2.73 2.94
real cost: $22.40 $21.40 $21.40 $22.40 $21.90
best cost: $15.40 $13.80 $14.40 $18.80 $15.60
h mean: 0.64 0.55 0.58 0.57 0.58
h sdev: 4.32 3.59 3.87 4.07 3.96
s mean: 97.79 97.76 97.90 97.72 97.79
s sdev: 10.38 10.50 10.02 10.85 10.44
mean diff: 97.15 97.21 97.32 97.15 97.21
k: 6.61 6.90 7.01 6.51 6.76

filename:
full_500 (trained on 3 sets of 2700 each, tested against 500)
fp total: 0 0
fp %: 0.00 0.00
fn total: 5 5
fn %: 0.50 0.50
unsure t: 89 89
unsure %: 2.97 2.97
real cost: $22.80 $22.80
best cost: $20.60 $20.60
h mean: 0.61 0.61
h sdev: 4.31 4.31
s mean: 97.77 97.77
s sdev: 10.63 10.63
mean diff: 97.16 97.16
k: 6.50 6.50

Note that the '0/5/89' fp/fn/unsure could be switched into 1/7/48 by adjusting the ham_cutoff to 0.33 and spam_cutoff to 0.90. I'm not re-running the above series of tests for that, though!

Here's the summary-summary table:

ham-train  bestcost  realcost    fp%   fn%  unsure%
        1    430.80  11498.75  56.70  0.00    26.46
       10    274.05   3345.10  15.76  0.03    32.06
       20    245.50   1855.80   8.61  0.03    22.18
       30    242.15   1642.90   7.64  0.00    19.23
       40    234.40   1154.45   5.31  0.00    15.33
       60    225.55    725.65   3.35  0.03     9.23
      100    221.05    532.40   2.46  0.03     6.61
      150    218.60    410.30   1.91  0.08     4.51
      200    179.90    199.45   0.88  0.10     3.91
      250    130.05    138.05   0.58  0.08     3.72
      300     96.80    104.25   0.41  0.15     3.38
      350     66.75     73.45   0.26  0.17     3.20
      400     63.25     69.65   0.25  0.20     2.94
      450     61.95     61.95   0.21  0.28     2.78
      500     52.50     58.05   0.20  0.23     2.63
      600     44.15     50.00   0.16  0.23     2.54
      700     37.75     41.60   0.12  0.28     2.31
     1000     26.20     27.80   0.06  0.28     2.09
     1500     19.60     24.40   0.03  0.45     2.48
     2000     15.50     20.70   0.00  0.45     2.70
     2500     15.60     21.90   0.00  0.43     2.94
     2700     20.60     22.80   0.00  0.50     2.97

It seems like most of the wins come once you get up around 350, the number of spam trained on. The unsure bucket actually gets a bit worse as more ham is added - looking at the histograms, various bits of spam are dragged downwards.

Anthony

From tim.one@comcast.net Tue Oct 29 03:58:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 22:58:55 -0500
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014220902.71797F4D4@cashew.wolfskeep.com>
Message-ID: 

[T.
Alexander Popiel [mailto:popiel@wolfskeep.com] Sent: Monday, October 14, 2002 6:09 PM]

> It appears to be a systematic error when a mailing list manager
> appends plain text to what should be a base64 encoded segment.
> Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME
> decoder, too!

This is ironic: it turns out that the cause is that the MIME decoder was *too* forgiving, in a twisted, relevant sense:

> As a sample:
>
> """
> ...
> Content-Type: text/plain
> Content-Transfer-Encoding: base64
> ...
>
> DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org
> """

I tried like hell to provoke this problem with base64 msgs, and couldn't. It turns out that the final "real base64" line was the key:

> aS5jb20NCg0KDQoNCg0K

Because this section didn't happen to need any '=' padding, the base64 decoder didn't know that it was over, and went on to take the entire remainder of the text as if it were base64 too.
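The lenient-versus-strict split described above is easy to reproduce with Python's base64 module. A minimal sketch (not the actual spambayes fix): decode strictly line by line, and stop at the first line that fails validation, keeping the remainder as plain text instead of folding it into the base64 data.

```python
import base64
import binascii

def split_base64_section(text):
    """Strictly decode leading base64 lines; return (payload, trailing_text).

    Rather than letting the lenient decoder silently swallow a trailing
    plain-text section (discarding characters it doesn't like), stop at
    the first line that isn't valid base64 and keep the rest as text.
    """
    decoded = bytearray()
    lines = text.splitlines()
    for i, line in enumerate(lines):
        try:
            # validate=True refuses characters outside the base64 alphabet
            decoded.extend(base64.b64decode(line, validate=True))
        except binascii.Error:
            return bytes(decoded), "\n".join(lines[i:])
    return bytes(decoded), ""

body = "aGVsbG8h\n-- appended plain text"
# The lenient decoder folds the trailer's letters into the base64 data and
# then chokes on the resulting improper padding:
try:
    base64.b64decode(body)
except binascii.Error:
    pass  # pseudo-base64 remainder -> padding error, as described above
print(split_base64_section(body))  # (b'hello!', '-- appended plain text')
```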
Until it sees a string of '=' marks, it will accept darned near everything, and simply ignore characters that don't make sense for base64. In the end, the error it raises comes from treating the remainder of the msg as pseudo-base64 too, which leads to an improperly padded base64 string.

I believe I've fixed this now, by falling back to a stricter(!) approach when the builtin approach fails. In cases where the base64 section is terminated by a string of '=', the builtin approach doesn't fail, and in those cases we lose the plain text part. If it falls back to the stricter approach, we don't lose the plain text part. Perhaps I should lose the plain text part in this case too?

BTW, looks like your example was foreign-language MP3 spam. It scores like so for me:

0.99970963814
'*H*' 0.000577077329346
'*S*' 0.99999635361

From tim.one@comcast.net Tue Oct 29 04:55:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 23:55:27 -0500
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: 
Message-ID: 

[Tim, claims to have fixed the "plain text follows a base64 section" decoding glitch]

Just FYI, this had minor good effects on my c.l.py test (10-fold CV):

filename:         cv        tcap
ham:spam: 20000:14000 20000:14000
fp total:          2           2
fp %:           0.01        0.01
fn total:          0           0
fn %:           0.00        0.00
unsure t:        103          97
unsure %:       0.30        0.29
real cost:    $40.60      $39.40
best cost:    $27.00      $26.80
h mean:         0.28        0.26
h sdev:         2.99        2.89
s mean:        99.94       99.94
s sdev:         1.41        1.44
mean diff:     99.66       99.68
k:             22.65       23.02

Hmm! That "after" run there also had replace_nonascii_chars: True, a second difference. Sorry about that; it's not worth it (to me) to separate those out.
The percentiles for this large-training test have gotten very interesting: -> Ham scores for all runs: 20000 items; mean 0.26; sdev 2.89 -> min 0; median 6.37101e-011; max 100 -> percentiles: 5% 0; 25% 2.22045e-014; 75% 8.15779e-007; 95% 0.0358985 -> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.44 -> min 29.8279; median 100; max 100 -> percentiles: 5% 100; 25% 100; 75% 100; 95% 100 Histogram analysis still suggests it would be cheaper to let some FN go through: -> best cost for all runs: $26.80 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.775 -> fp 2; fn 3; unsure ham 11; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559% From rob@hooft.net Tue Oct 29 09:48:45 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Tue, 29 Oct 2002 10:48:45 +0100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. References: <200210290125.g9T1Ppw09085@localhost.localdomain> Message-ID: <3DBE597D.2040504@hooft.net> Anthony Baxter wrote: > So I hacked on timcv.py and msgs.py to add options 'spam-test', > 'spam-train', 'ham-test' and 'ham-train', to allow you to set > the training set size separately to the testing set size. > I haven't checked this in because it will break everyone's > test scripts - --spam= will no longer be distinct, and getopt > will gripe. Let me know if I should check this in anyway - I > think it's useful, but YMMV. Can you fix the backward compatibility by adding --spam and --ham and --spam-keep and --ham-keep options that do both? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From anthony@interlink.com.au Tue Oct 29 09:50:16 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 29 Oct 2002 20:50:16 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: <3DBE597D.2040504@hooft.net> Message-ID: <200210290950.g9T9oHU11606@localhost.localdomain> >>> "Rob W.W. 
Hooft" wrote > Anthony Baxter wrote: > > So I hacked on timcv.py and msgs.py to add options 'spam-test', > > 'spam-train', 'ham-test' and 'ham-train', to allow you to set > > the training set size separately to the testing set size. > > I haven't checked this in because it will break everyone's > > test scripts - --spam= will no longer be distinct, and getopt > > will gripe. Let me know if I should check this in anyway - I > > think it's useful, but YMMV. > > Can you fix the backward compatibility by adding --spam and --ham > and --spam-keep and --ham-keep options that do both? Nah - getopt doesn't like it. Anthony From popiel@wolfskeep.com Tue Oct 29 18:41:24 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 29 Oct 2002 10:41:24 -0800 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: Message from Anthony Baxter <200210290125.g9T1Ppw09085@localhost.localdomain> References: <200210290125.g9T1Ppw09085@localhost.localdomain> Message-ID: <20021029184124.1A3D9F595@cashew.wolfskeep.com> In message: <200210290125.g9T1Ppw09085@localhost.localdomain> Anthony Baxter writes: >So I hacked on timcv.py and msgs.py to add options 'spam-test', >'spam-train', 'ham-test' and 'ham-train', to allow you to set >the training set size separately to the testing set size. >I haven't checked this in because it will break everyone's >test scripts - --spam= will no longer be distinct, and getopt >will gripe. Let me know if I should check this in anyway - I >think it's useful, but YMMV. I'd like to have it. :-) >The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are >actually averages of 4 different runs for each, with different >-s options on each one (same set of 4 -s used for each, tho). >Otherwise the variation was just too damn high. It's still a little >'bloopy' - the unsure bounces around a bit, but it's not bad. Cool. Good to see someone more thorough than I am... I've been getting(?) sloppy. 
I'm not a real statistician, and it shows.

>Here's the summary-summary table:
>ham-train bestcost realcost fp% fn% unsure%
> 1 430.80 11498.75 56.70 0.00 26.46
> 10 274.05 3345.10 15.76 0.03 32.06
> 20 245.50 1855.80 8.61 0.03 22.18
> 30 242.15 1642.90 7.64 0.00 19.23
> 40 234.40 1154.45 5.31 0.00 15.33
> 60 225.55 725.65 3.35 0.03 9.23
> 100 221.05 532.40 2.46 0.03 6.61
> 150 218.60 410.30 1.91 0.08 4.51
> 200 179.90 199.45 0.88 0.10 3.91
> 250 130.05 138.05 0.58 0.08 3.72
> 300 96.80 104.25 0.41 0.15 3.38
> 350 66.75 73.45 0.26 0.17 3.20
> 400 63.25 69.65 0.25 0.20 2.94
> 450 61.95 61.95 0.21 0.28 2.78
> 500 52.50 58.05 0.20 0.23 2.63
> 600 44.15 50.00 0.16 0.23 2.54
> 700 37.75 41.60 0.12 0.28 2.31
> 1000 26.20 27.80 0.06 0.28 2.09
> 1500 19.60 24.40 0.03 0.45 2.48
> 2000 15.50 20.70 0.00 0.45 2.70
> 2500 15.60 21.90 0.00 0.43 2.94
> 2700 20.60 22.80 0.00 0.50 2.97
>
>It seems like most of the wins come once you get up around 350, the
>number of spam trained on. The unsure bucket actually gets a bit worse
>as more ham is added - looking at the histograms, various bits of spam
>are dragged downwards.

Beautiful. It looks like the excess ham only starts hurting unsures after about 1000 (or about 3:1).

- Alex

From popiel@wolfskeep.com Tue Oct 29 18:54:22 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 29 Oct 2002 10:54:22 -0800
Subject: [Spambayes] max word size
Message-ID: <20021029185422.2675EF595@cashew.wolfskeep.com>

Changing the max word size (for generating skip tokens) doesn't seem to have much effect on my data. Have table... it pretty much says it all.

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename: skip10 skip11 skip12 skip13 skip14 skip20 skip50 ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 fp total: 4 3 3 3 3 4 4 fp %: 0.20 0.15 0.15 0.15 0.15 0.20 0.20 fn total: 12 10 12 11 12 10 10 fn %: 0.60 0.50 0.60 0.55 0.60 0.50 0.50 unsure t: 52 55 53 55 53 52 54 unsure %: 1.30 1.38 1.32 1.38 1.32 1.30 1.35 real cost: $62.40 $51.00 $52.60 $52.00 $52.60 $60.40 $60.80 best cost: $49.20 $49.00 $48.20 $48.40 $48.40 $49.40 $50.00 h mean: 0.42 0.41 0.40 0.40 0.38 0.39 0.39 h sdev: 5.47 5.42 5.39 5.35 5.22 5.30 5.22 s mean: 98.44 98.45 98.45 98.46 98.46 98.48 98.48 s sdev: 9.87 9.79 9.76 9.72 9.75 9.71 9.69 mean diff: 98.02 98.04 98.05 98.06 98.08 98.09 98.09 k: 6.39 6.45 6.47 6.51 6.55 6.53 6.58 It doesn't look like there's any significance in there, even with the extreme sizes... - Alex From richie@entrian.com Tue Oct 29 21:04:01 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 29 Oct 2002 21:04:01 +0000 Subject: [Spambayes] Re: pop3proxy bug? In-Reply-To: References: Message-ID: Hi Jeremy, Sorry for the delay - I've been away since Friday. > Did you ever test this code? Yes, of course I tested it - I've been using it to retrieve all my email since the day I wrote it! 8-) The problem is probably down to platform-dependent behaviour - I'm running on Windows 98 and it works like a charm for me. I'll give it a go on Linux over the next day or two and see what happens. > I changed the code to use the raw server socket and sendall() instead > of self.serverFile.write() and it worked. But I'm uneasy. That's a perfectly reasonable fix. Once I've reproduced the problem on Linux, I'll apply, test and commit that fix - thanks. -- Richie Hindle richie@entrian.com From richie@entrian.com Tue Oct 29 21:04:27 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 29 Oct 2002 21:04:27 +0000 Subject: [Spambayes] Some minor nits ... 
In-Reply-To: <001a01c27d17$dfec99a0$6300000a@holdenweb.com>
References: <001a01c27d17$dfec99a0$6300000a@holdenweb.com>
Message-ID: 

Hi Steve,

> the X-Hammie-Disposition header is treated as a part of the message body

This was a bug, now fixed. In trying to deal with non-conforming emails, I was converting all your emails into non-conforming ones. Nice. 8-)

> Under cygwin (python 2.2.1) I see the following asyncore error:

See my reply to Jeremy - I'll look at this this week.

--
Richie Hindle
richie@entrian.com

From jeremy@alum.mit.edu Tue Oct 29 21:40:33 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 29 Oct 2002 16:40:33 -0500
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To: 
References: 
Message-ID: <15807.81.616489.126728@slothrop.zope.com>

>>>>> "RH" == Richie Hindle writes:

>> Did you ever test this code?

RH> Yes, of course I tested it - I've been using it to retrieve all
RH> my email since the day I wrote it! 8-)

Sorry for the testy response. I didn't realize that you could do what you were doing with makefile().

RH> The problem is probably down to platform-dependent behaviour -
RH> I'm running on Windows 98 and it works like a charm for me.
RH> I'll give it a go on Linux over the next day or two and see what
RH> happens.

You could also create two files with makefile(), just like SocketServer.

Jeremy

From richie@entrian.com Tue Oct 29 22:10:52 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 29 Oct 2002 22:10:52 +0000
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To: <15807.81.616489.126728@slothrop.zope.com>
References: <15807.81.616489.126728@slothrop.zope.com>
Message-ID: 

Hi Jeremy,

> Jeremy: Did you ever test this code?
> Richie: Yes, of course I tested it
> Jeremy: Sorry for the testy response.

'testy', very good!

> I didn't realize that you could do what you were doing with makefile().
It probably shouldn't be allowed, but I guess the line in socket.py that says: self.mode = mode # Not actually used in this version means that someone somewhere is aware of this. > You could also create two files with makefile(), just like > SocketServer. Thanks for the suggestion - that's probably the neatest fix. -- Richie Hindle richie@entrian.com From anthony@interlink.com.au Wed Oct 30 07:36:46 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 30 Oct 2002 18:36:46 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: <20021029184124.1A3D9F595@cashew.wolfskeep.com> Message-ID: <200210300736.g9U7alm19317@localhost.localdomain> >>> "T. Alexander Popiel" wrote > >So I hacked on timcv.py and msgs.py to add options 'spam-test', > >'spam-train', 'ham-test' and 'ham-train', to allow you to set > >the training set size separately to the testing set size. > >I haven't checked this in because it will break everyone's > >test scripts - --spam= will no longer be distinct, and getopt > >will gripe. Let me know if I should check this in anyway - I > >think it's useful, but YMMV. > I'd like to have it. :-) I figured out a backwards compatible way to do it - make the new options --SpamTrain --SpamTest &c. I'll check it in shortly. > Cool. Good to see someone more thorough than I am... I've > been getting(?) sloppy. I'm not a real statistician, and > it shows. 
Neither am I - I just know enough to hurt myself :)

> >Here's the summary-summary table:
> >ham-train bestcost realcost fp% fn% unsure%
> > 1 430.80 11498.75 56.70 0.00 26.46
> > 10 274.05 3345.10 15.76 0.03 32.06
> > 20 245.50 1855.80 8.61 0.03 22.18
> > 30 242.15 1642.90 7.64 0.00 19.23
> > 40 234.40 1154.45 5.31 0.00 15.33
> > 60 225.55 725.65 3.35 0.03 9.23
> > 100 221.05 532.40 2.46 0.03 6.61
> > 150 218.60 410.30 1.91 0.08 4.51
> > 200 179.90 199.45 0.88 0.10 3.91
> > 250 130.05 138.05 0.58 0.08 3.72
> > 300 96.80 104.25 0.41 0.15 3.38
> > 350 66.75 73.45 0.26 0.17 3.20
> > 400 63.25 69.65 0.25 0.20 2.94
> > 450 61.95 61.95 0.21 0.28 2.78
> > 500 52.50 58.05 0.20 0.23 2.63
> > 600 44.15 50.00 0.16 0.23 2.54
> > 700 37.75 41.60 0.12 0.28 2.31
> > 1000 26.20 27.80 0.06 0.28 2.09
> > 1500 19.60 24.40 0.03 0.45 2.48
> > 2000 15.50 20.70 0.00 0.45 2.70
> > 2500 15.60 21.90 0.00 0.43 2.94
> > 2700 20.60 22.80 0.00 0.50 2.97
> >
> >It seems like most of the wins come once you get up around 350, the
> >number of spam trained on. The unsure bucket actually gets a bit worse
> >as more ham is added - looking at the histograms, various bits of spam
> >are dragged downwards.
>
> Beautiful. It looks like the excess ham only starts hurting
> unsures after about 1000 (or about 3:1).

fns also get worse after about 2:1, and most of the wins in the fp are there by the time you get to 3:1. So I'd say from this something like 2:1 or 3:1 ham:spam is a good number. But, as always, YMMV.

The 'best cost' column shows something different, but it's overly weighting fp's vs everything else (for my tastes). (yes, I can tweak it, but chose not to for this test).

--
Anthony Baxter
It's never too late to have a happy childhood.
From skip@pobox.com Tue Oct 29 04:19:54 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 22:19:54 -0600 Subject: [Spambayes] RE: Spam vs time-of-day In-Reply-To: References: <15805.34750.447765.195346@montanaro.dyndns.org> Message-ID: <15806.3178.674153.634773@montanaro.dyndns.org> Tim> Your buckets span 10 minutes. The comment in the code is confused Tim> about this too. That's why your graph and mine both have 144 Tim> points on the X axis (24 * 6 = 144; you have six *buckets* per Tim> hour, and each spans 10 minutes). Yeah, after seeing this several times I'm beginning to think I made a mistake. ;-) >> The large spike at 0 is an artifact of my simpleminded Date header >> scanning. Invalid dates probably wound up with a value of 0. Tim> And at that time, *every* Date header generated a dow:invalid token Tim> (as well as the correct token, when possible). That's been Tim> repaired since then. Not really. The graph was generated by a shell pipeline using suitable non-spambayes tools (awk, sed, gnuplot, etc). My dow:invalid mistake came later. >> Buckets were calculated using local time. That way I didn't penalize >> Anthony Baxter and other folks who happen not to live in the US. Tim> I'm unsure what "were calculated using local time" means. Simply that I ignored timezone information. If the Date: header was Date: Mon, 28 Oct 2002 14:29:30 -0500 the send time was taken to be 14:29, local time. The -0500 was ignored. Tim> Does the checked in code do that or not? Yes, the checked in code just uses a regular expression which matches HH:MM:SS preceded and followed by a space. Nothing else in the Date: header is considered for this particular token. 
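A matcher of the shape Skip describes might look like the following sketch; the exact pattern and the token name are assumptions here, not copied from the checked-in tokenizer.

```python
import re

# Assumed shape of Skip's matcher: HH:MM:SS must be preceded and followed
# by a space; everything else in the Date: header is ignored.
TIME_RE = re.compile(r' (\d\d):(\d\d):(\d\d) ')

def time_token(date_header):
    """Return an hour-bucket token for a Date: header, or None."""
    m = TIME_RE.search(date_header)
    if m:
        return "time:" + m.group(1)   # e.g. bucket on the hour
    return None

print(time_token("Mon, 28 Oct 2002 14:29:30 -0500"))  # time:14
```

Note that, as in the thread above, the timezone offset is simply never looked at: the regexp only sees the local HH:MM:SS.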
Skip

From skip@pobox.com Thu Oct 31 01:59:13 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 30 Oct 2002 19:59:13 -0600
Subject: [Spambayes] X-Hammie-Disposition split suggestion
Message-ID: <15808.36465.890458.583477@montanaro.dyndns.org>

The X-Hammie-Disposition header contains multiple bits of information. I'm not sure what the *H* and *S* chunks are for (overall hammieness?), but I think it would be worthwhile to put the individual word probabilities in a separate header. That way, I could tell my mailer to display the much smaller X-Hammie-Disposition header and suppress display of the (for example) X-Hammie-Word-Probabilities header by default, e.g.:

X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00
X-Hammie-Word-Probabilities:'rbl':0.07; 'script':0.07; 'to:2**1':0.09;
    'osirusoft':0.10; 'url:org':0.15; 'subject:; ':0.15; 'cgi':0.20;
    'sorry':0.22; 'mailing':0.23; 'list:':0.24; 'skip:" 10':0.27;
    'skip:r 20':0.28; 'subject:SPAM':0.30; 'called':0.31; 'body':0.33;
    'rcvd_in_dsbl':0.34; 'open':0.35; 'being':0.35; 'version':0.36;
    'from:':0.36; 'skip:u 10':0.37; ...

If something in the X-Hammie-Disposition header jumps out at you, you can display all the message's headers.

Make sense? If so, I'll be happy to modify hammie.py.

Skip

From tim.one@comcast.net Thu Oct 31 02:18:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 30 Oct 2002 21:18:08 -0500
Subject: [Spambayes] Database reduction
Message-ID: 

There's a semi-standard trick for database size reduction I haven't pursued and don't intend to pursue. Those keenly interested in reducing database size may wish to pursue it.

Currently, the classifier's wordinfo dict is indexed by strings S. There's no bound on how many unique strings may appear, and so also no bound on how large the database may grow. A cheesy but probably-effective trick is to pick an integer N for all time, and index a wordinfo structure by hash(S) % N instead of by S.
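The scheme can be sketched in a few lines. N matches the 100003 from Tim's patch later in the thread, but the table layout is illustrative, and crc32 stands in for hash(S) because Python's builtin hash() is no longer stable across runs.

```python
import zlib

# A sketch of the hash(S) % N trick: word counts live in a fixed-size
# table, so the database can never exceed N records and the strings
# themselves are never stored.
N = 100003  # fixed "for all time"; also the hard cap on record count

def bucket(word):
    # A stable hash is needed if the table is persisted; crc32 stands in
    # for the hash(S) of the original suggestion.
    return zlib.crc32(word.encode("utf-8")) % N

counts = [[0, 0] for _ in range(N)]   # [hamcount, spamcount] per bucket

def train(words, is_spam):
    for w in set(words):              # count each word once per message
        counts[bucket(w)][int(is_spam)] += 1

train(["viagra", "free", "viagra"], is_spam=True)
train(["python", "wordinfo"], is_spam=False)
# Distinct words can collide into one record -- the price of bounded size.
```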
Since strings are no longer stored:

+ Good: Space for storing strings isn't needed.
+ Bad: You can't get words out again (for, e.g., clue lists).

Since hash(S) is a many-to-one mapping:

+ Bad: Words get combined more-than-less randomly.

Since mod N is also a many-to-one mapping:

+ Bad: As above.
+ Good: N is a solid upper bound on the maximum number of wordinfo records, and so you can know exactly how big a classifier can get.
+ Good: You could drop the dict and use hash(S)%N to index a contiguous structure directly, like an mmap'ed file (http://crm114.sf.net/ uses that specific trick after multiple layers of hashing, into distinct files of one-byte clamped ham and spam counts).

Since database size would be bounded:

+ Good: There's less obvious need to prune the database over time (a main point of pruning is to reclaim space for words that aren't being used anymore -- or ever).
+ Bad: If the database is never pruned, it will adapt more slowly to changes in the nature of ham and spam.

I suppose the scariest thing is combining words "at random". It's possible, e.g., that Python would get mapped to the same record as Viagra. And the smaller N is, the more certain "bad stuff like that" *will* happen. We won't know until someone tries it and measures results; my intuition is that unless you get silly with N, it won't hurt much, as most words are approximately worthless anyway. Think about what happens when N=1 for the other side of this coin.

From tim.one@comcast.net Thu Oct 31 02:33:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 30 Oct 2002 21:33:37 -0500
Subject: [Spambayes] X-Hammie-Disposition split suggestion
In-Reply-To: <15808.36465.890458.583477@montanaro.dyndns.org>
Message-ID: 

[Skip Montanaro]
> The X-Hammie-Disposition header contains multiple bits of
> information. I'm not sure what the *H* and *S* chunks are for
> (overall hammieness?),

chi-combining computes two scores internally, one for ham-ness (H) and the other for spam-ness (S).
That's what *H* and *S* tell you. The final score is (S-H+1)/2. > but I think it would be worthwhile to put the individual word > probabilities in a separate header. Or drop them altogether. Geeks may find this stuff morbidly interesting, and spambayes developers need to see this stuff when a msg gets a surprising score, but I doubt anyone else has any earthly use for it. It's also a bit like giving away pieces of your private key in public-key cryptosystem: "well, Mister Spammer, you can't guess what's spam and ham to me without breaking into my database, but here are the 150 best & worst guesses you made, along with exactly how good they were". > That way, I could tell my mailer to display the much smaller > X-Hammie-Disposition header and suppress display of the (for > example) X-Hammie-Word-Probabilities header by default, e.g.: > > X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00 I suggest dropping the *H* and *S* here too. In the Outlook client, we've also switched to feeding the end user int(round(score * 100.0)), i.e. an integer in 0 .. 100 inclusive. There's really no need to bother pretty users' heads with the mysteries of floating point . > X-Hammie-Word-Probabilities:'rbl':0.07; 'script':0.07; 'to:2**1':0.09; > 'osirusoft':0.10; 'url:org':0.15; 'subject:; ':0.15; 'cgi':0.20; > 'sorry':0.22; 'mailing':0.23; 'list:':0.24; 'skip:" 10':0.27; > 'skip:r 20':0.28; 'subject:SPAM':0.30; 'called':0.31; 'body':0.33; > 'rcvd_in_dsbl':0.34; 'open':0.35; 'being':0.35; 'version':0.36; > 'from:':0.36; 'skip:u 10':0.37; ... > > If something in the X-Hammie-Disposition header jumps out at you, you can > display all the message's headers. > > Make sense? If so, I'll be happy to modify hammie.py. I'm not a hammie user, but I know my sisters. That leaves me more neutral than I may sound, as one of my sisters doubtless has no idea "headers" exist. She pays to download them, though! 
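Tim's (S-H+1)/2 combination is easy to sanity-check against the *H*/*S* pair he quoted for the MP3 spam earlier in the thread:

```python
def final_score(S, H):
    # Tim's combination of the internal spam-ness S and ham-ness H,
    # both in [0, 1], into a single score in [0, 1]
    return (S - H + 1) / 2

# the *H*/*S* pair quoted for the foreign-language MP3 spam above
H, S = 0.000577077329346, 0.99999635361
print(round(final_score(S, H), 11))  # matches the 0.99970963814 reported
```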
From tim.one@comcast.net Thu Oct 31 03:47:35 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 30 Oct 2002 22:47:35 -0500 Subject: [Spambayes] Database reduction In-Reply-To: Message-ID: [Tim] > ... > A cheesy but probably-effective trick is to pick an integer N for > all time, and index a wordinfo structure by hash(S) % N instead of by S. FYI, if you want to pursue this, here's a start (there's not much to it if you just want to see what happens): Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.45 diff -c -u -r1.45 classifier.py --- classifier.py 27 Oct 2002 17:11:00 -0000 1.45 +++ classifier.py 31 Oct 2002 03:33:40 -0000 @@ -40,6 +40,9 @@ PICKLE_VERSION = 1 +def HashSet(words): + return [n % 100003 for n in map(hash, Set(words))] + class WordInfo(object): __slots__ = ('atime', # when this record was last used by scoring(*) 'spamcount', # # of spams in which this word appears @@ -320,11 +323,11 @@ # adjustment following keeps them in a sane range, and one # that naturally grows the more evidence there is to back up # a probability. 
- hamcount = record.hamcount + hamcount = min(record.hamcount, nham) assert hamcount <= nham hamratio = hamcount / nham - spamcount = record.spamcount + spamcount = min(record.spamcount, nspam) assert spamcount <= nspam spamratio = spamcount / nspam @@ -397,7 +400,7 @@ wordinfo = self.wordinfo wordinfoget = wordinfo.get now = time.time() - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is None: record = wordinfo[word] = WordInfo(now) @@ -419,7 +422,7 @@ self.nham -= 1 wordinfoget = self.wordinfo.get - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is not None: if is_spam: @@ -440,7 +443,7 @@ wordinfoget = self.wordinfo.get now = time.time() - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is None: prob = unknown Since N is 100003 there, no more than 100003 "words" can exist in the database. On my large c.l.py test, about 325,000 unique words exist, so at least 225,000 words get folded into other words. 
Accuracy does suffer: filename: cv tcap ham:spam: 20000:14000 20000:14000 fp total: 2 4 fp %: 0.01 0.02 fn total: 0 1 fn %: 0.00 0.01 unsure t: 97 179 unsure %: 0.29 0.53 real cost: $39.40 $76.80 best cost: $26.80 $58.20 h mean: 0.26 0.42 h sdev: 2.89 3.39 s mean: 99.94 99.70 s sdev: 1.44 3.20 mean diff: 99.68 99.28 k: 23.02 15.07 although the distros remain highly skewed: -> Ham scores for all runs: 20000 items; mean 0.42; sdev 3.39 -> min 0; median 2.4147e-006; max 100 -> percentiles: 5% 2.22045e-014; 25% 1.57802e-009; 75% 0.000878141; 95% 0.588783 -> Spam scores for all runs: 14000 items; mean 99.70; sdev 3.20 -> min 17.485; median 100; max 100 -> percentiles: 5% 99.9864; 25% 100; 75% 100; 95% 100 -> best cost for all runs: $58.20 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 2 cutoff pairs -> smallest ham & spam cutoffs 0.435 & 0.93 -> fp 2; fn 7; unsure ham 27; unsure spam 129 -> fp rate 0.01%; fn rate 0.05%; unsure rate 0.459% -> largest ham & spam cutoffs 0.44 & 0.93 -> fp 2; fn 7; unsure ham 27; unsure spam 129 -> fp rate 0.01%; fn rate 0.05%; unsure rate 0.459% There's not much point digging into "what went wrong" in the new error cases, since the list of clues is worthless; e.g., here's the list for one of the new FP: Data/Ham/Set1/64316.txt prob = 0.838786702949 prob('*H*') = 0.130012 prob('*S*') = 0.807586 prob(744) = 0.0228281 prob(34690) = 0.0918367 prob(87505) = 0.0970545 prob(91999) = 0.304589 prob(29591) = 0.328993 prob(70192) = 0.371651 prob(46915) = 0.634625 prob(60034) = 0.646331 prob(49959) = 0.648468 prob(63366) = 0.686216 prob(66610) = 0.702733 prob(25331) = 0.731237 prob(81757) = 0.747858 prob(13278) = 0.751421 prob(89046) = 0.758242 prob(13498) = 0.773519 prob(5337) = 0.779329 prob(26879) = 0.805219 prob(50301) = 0.912593 prob(26426) = 0.918411 prob(35130) = 0.943716 I can say that all the new errors were difficult cases before this too, and often popped in out of my FP and FN sets over the weeks. Have fun . 
From tim.one@comcast.net Thu Oct 31 05:51:00 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 00:51:00 -0500 Subject: [Spambayes] Spam Clues: Share source code securely, inexpensively Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Great spam! Fooled my personal spambayes, and python.org. It did everything right. BTW, this email and its attachment was auto-generated by MarkH's Outlook client code, by hitting the "Show spam clues for current msg" button while looking at the spam. Since HTML is (I think) disabled on this list for no good reason, you won't get the intended effect. But it's still great spam . Spam Score: 0.000848613 '*H*' 0.999804 '*S*' 0.00150092 'url:python-list' 0.00653675 'header:X-Complaints-to:1' 0.0103695 'url:mailman' 0.0140407 'url:listinfo' 0.0171948 'url:python' 0.019608 'repository' 0.0196507 'message-id:@posting.google.com' 0.0215311 'subject:skip:i 10' 0.0302013 'algorithm' 0.0348837 'repository.' 0.0412844 'replaced' 0.0412844 'facility' 0.0412844 'encrypted' 0.0412844 '(b)' 0.0412844 'url:org' 0.0474286 'header:Errors-to:1' 0.059578 'get,' 0.0652174 'ssl' 0.0918367 'someone,' 0.0918367 'approach.' 0.0918367 'header:Organization:1' 0.0967191 'header:Return-path:1' 0.0986292 'header:Message-id:1' 0.102295 'url:mail' 0.127291 'header:Received:4' 0.147359 'feature' 0.147511 'communicate' 0.155172 'standard.' 0.155172 'preferable' 0.155172 'converted' 0.155172 'stored' 0.162027 'resources,' 0.164415 'code' 0.187502 'person.' 0.197597 'web' 0.213598 'downloaded' 0.221874 'remote' 0.242271 'server,' 0.251262 'site,' 0.260551 '(d)' 0.267484 'development' 0.2861 'which' 0.286755 'source' 0.293831 'when' 0.311512 'standard' 0.314052 'purposes' 0.325776 'single' 0.32782 'relationship' 0.34209 'using' 0.349124 'skip:l 10' 0.349551 'used' 0.353388 'browser' 0.358337 'software' 0.360879 'skip:u 10' 0.363596 'days.' 
0.378856 'documents' 0.379023 'with' 0.383038 'what' 0.38355 'set' 0.383969 'there' 0.390437 'say' 0.392813 'that' 0.395213 'note,' 0.399936 '"free"' 0.399936 'document,' 0.399936 'now' 0.603761 'even' 0.607558 'government' 0.612155 'based' 0.613413 'account' 0.618173 'pay' 0.634181 'secure.' 0.637489 'securely' 0.637489 'give' 0.644942 'skip:w 10' 0.650571 'system' 0.677844 'special' 0.679297 'is:' 0.684005 'site' 0.684034 'secure' 0.68629 'place.' 0.71577 'unlike' 0.71577 'highly' 0.720868 'sites,' 0.757669 'high' 0.759307 'offshore' 0.767183 'professional' 0.770397 'fax' 0.773719 'information' 0.781197 'cost' 0.783448 'offers' 0.833401 'required.' 0.836352 'received' 0.839223 'encrypted,' 0.844828 'subject:source' 0.844828 'format,' 0.866464 'cheap' 0.866464 'permits' 0.934783 'emailing' 0.969799 'dirt' 0.973373 'inexpensive' 0.9947 Message Stream: Return-path: Path: news.baymountain.com!uunet!ash.uu.net!prodigy.com!news.cc.ukans.edu!logbridg e.uoregon.edu!newsfeed.stanford.edu!postnews1.google.com!not-for-mail Received: from bright14. (bright14-qfe0.icomcast.net [172.20.4.103]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4T00D57ZFLRO@msgstore01.icomcast.net> for tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 00:33:21 -0500 (EST) Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155]) by bright14. 
(8.11.6/8.11.6) with ESMTP id g9V5XZq28319 for <@msgstore01.icomcast.net:tim.one@comcast.net>; Thu, 31 Oct 2002 00:33:35 -0500 (EST) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain03.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4T00GR0ZFXIU@mtain03.icomcast.net> for tim.one@comcast.net (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 00:33:33 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) id 1877xW-00008g-00; Thu, 31 Oct 2002 00:33:34 -0500 X-Trace: posting.google.com 1036042065 27240 127.0.0.1 (31 Oct 2002 05:27:45 GMT) Date: Wed, 30 Oct 2002 21:27:45 -0800 From: post@ironcitadel.com (ICWeb) Subject: Share source code securely, inexpensively Sender: python-list-admin@python.org To: python-list@python.org Errors-to: python-list-admin@python.org Message-id: <99c3d303.0210302127.15dee024@posting.google.com> Organization: http://groups.google.com/ X-Complaints-to: groups-abuse@google.com Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 8bit NNTP-posting-date: 31 Oct 2002 05:27:45 GMT Precedence: bulk X-BeenThere: python-list@python.org Newsgroups: comp.lang.python Lines: 31 NNTP-posting-host: 66.166.68.234 X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: General discussion list for the Python programming language Xref: news.baymountain.com comp.lang.python:187588 When doing source code development using offshore resources, or when team members are geographically distributed, this site offers a very inexpensive and secure approach. It can be used as an adjunct to a source code control system (manually) when a small team does not have access to a secure web based source code control repository. 
For professional and client relationship purposes this can be preferable to emailing code around or using a "free" document repository which is insecure. There is a good site, www.ironcitadel.com, which permits the secure storage and communication of documents. Unlike most "free" sites, this site is: (a) Securely encrypted - all document uploads/downloads are 128bit SSL encrypted; all documents stored are encrypted using the new Rinjdael algorithm which has replaced Triple-DES as the US government high security standard. (b) Has a fax feature - if you just have a hardcopy of a document, say a handwritten note, you can fax it in and the document is received by a fax server, converted to TIFF format, and stored encrypted. (c) All documents can be downloaded (also uploaded) just via a standard web browser - no special software required. (d) Cost is dirt cheap - $5/month and you don't even pay until the end of your first 30 days. Considering what you get, that is CHEAP! If you want to communicate securely with someone, just set up an account and give the login/password to a single other person. All communications are now highly secure. If you want to store secure information remotely, where all the information is at a remote facility that is encrypted, then www.ironcitadel.com is the place. -- http://mail.python.org/mailman/listinfo/python-list ---------------------- multipart/mixed attachment An embedded message was scrubbed... 
From: ICWeb Subject: Share source code securely, inexpensively Date: Thu, 31 Oct 2002 00:27:45 -0500 Size: 2999 Url: http://mail.python.org/pipermail/spambayes/attachments/20021031/2dc295c9/attachment.txt ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 31 06:37:44 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 31 Oct 2002 07:37:44 +0100 Subject: [Spambayes] X-Hammie-Disposition split suggestion References: Message-ID: <3DC0CFB8.2010900@hooft.net> Tim Peters wrote: > >>That way, I could tell my mailer to display the much smaller >>X-Hammie-Disposition header and suppress display of the (for >>example) X-Hammie-Word-Probabilities header by default, e.g.: >> >> X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00 > > > I suggest dropping the *H* and *S* here too. In the Outlook client, we've > also switched to feeding the end user int(round(score * 100.0)), i.e. an > integer in 0 .. 100 inclusive. There's really no need to bother pretty > users' heads with the mysteries of floating point . Hm. Sure, the *H* and *S* could be moved to the "debugging" header, which should be switched by an option (with default off). But I am actually "bothered" ;-) by having only two digits. For chuckles, I'd like to have an indication for the "0" and "100" scores how far they are away from the actual 0 and 100 (as a 10-log). Something like "0 (4)" could mean "0.0000XXX" and "100 (5)" could mean "0.99999XXX". Again, this would probably only be for hackers.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 31 06:47:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 01:47:26 -0500 Subject: [Spambayes] FW: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 Message-ID: FYI, for those not on the checkin list. 
-----Original Message----- From: spambayes-checkins-bounces@python.org [mailto:spambayes-checkins-bounces@python.org] On Behalf Of Tim Peters Sent: Thursday, October 31, 2002 1:43 AM To: spambayes-checkins@python.org Subject: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30231 Modified Files: tokenizer.py Log Message: A new mini-phase of body tokenization scours HTML for common virus clues, variations of src=cid: and height=0 width=0 [Guido] > This gets us awfully close to SA's "precompiled list of clues to look > for" approach. :-( We're throwing away *all* HTML tags now, and missing a lot of info because of that. As I said about this one, virus/worm msgs of this nature often have no other content, period. The classifier can't score what it can't see. Feel free to design a principled approach to tokenizing HTML tags that still allows some HTML messages to avoid getting called spam. In the absence of that, I've got no qualms about adding special cases that help. For goodness' sake, it was a massive special-case hack to *strip* HTML tags to begin with -- think of this as a minor unhack of that. From bkc@murkworks.com Thu Oct 31 17:55:52 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 31 Oct 2002 12:55:52 -0500 Subject: [Spambayes] FW: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 In-Reply-To: <200210311622.g9VGMAC07720@odiug.zope.com> References: Your message of "Thu, 31 Oct 2002 01:47:26 EST." Message-ID: <3DC127A8.14297.28B44F20@localhost> On 31 Oct 2002 at 11:22, Guido van Rossum wrote: > > A new mini-phase of body tokenization scours HTML for common virus clues, > > variations of src=cid: and height=0 width=0
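The kind of mini-phase the checkin log describes, emitting synthetic tokens when virus-style HTML clues such as src=cid: references and zero-sized frames appear in the body, might look roughly like this. The patterns and token names below are illustrative guesses, not the actual tokenizer.py code:

```python
import re

# Illustrative patterns for the virus clues the checkin mentions:
# embedded-content references and zero-sized frames/images.
# These are NOT the real regexps from spambayes' tokenizer.py.
SRC_CID = re.compile(r'src=["\']?cid:', re.IGNORECASE)
ZERO_HEIGHT = re.compile(r'height\s*=\s*["\']?0\b', re.IGNORECASE)
ZERO_WIDTH = re.compile(r'width\s*=\s*["\']?0\b', re.IGNORECASE)

def virus_clue_tokens(body):
    """Emit synthetic tokens for suspicious HTML constructs, so the
    classifier can still score them even though HTML tags are
    otherwise stripped before tokenization."""
    tokens = []
    if SRC_CID.search(body):
        tokens.append('virus clue:src=cid')
    if ZERO_HEIGHT.search(body) and ZERO_WIDTH.search(body):
        tokens.append('virus clue:zero size')
    return tokens
```

The point Tim makes above is exactly this: since such worm messages often have no other content, scanning for these constructs before tag-stripping gives the classifier something to see.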