[Spambayes] Mining the headers

Skip Montanaro skip@pobox.com
Sun Oct 27 05:37:51 2002


    >> I've had three other options knocking around locally which haven't
    >> seemed to help or hurt....  Should I check them in....

    Alex> Yes, I'd love to test them.

Done.  Note that I deleted the mine_date_headers option.  It was just a
gatekeeper for the other two, which seemed pointless to me.  Here's my latest
run.  The first run (run1) used the defaults.  The second (dates) used this
dates.ini file:

    [Tokenizer]
    generate_time_buckets: True
    extract_dow: True
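
For anyone who hasn't looked at what these two options do, here is a minimal
sketch of the kind of tokens they could produce from a Date header.  It is not
the actual tokenizer code: the function name date_tokens, the token spellings,
and the 10-minute bucket width are all assumptions made for illustration.

```python
from email.utils import parsedate_to_datetime


def date_tokens(date_header):
    """Return hypothetical time-bucket and day-of-week tokens for a Date header."""
    try:
        dt = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable dates get their own tokens, since a mangled Date
        # header may itself be a clue.
        return ["time:invalid", "dow:invalid"]
    # Assumed bucket width: 10 minutes, so 05:37 lands in bucket 3 of hour 05.
    time_token = "time:%02d:%d" % (dt.hour, dt.minute // 10)
    # datetime.weekday() counts Monday as 0 and Sunday as 6.
    dow_token = "dow:%d" % dt.weekday()
    return [time_token, dow_token]


if __name__ == "__main__":
    print(date_tokens("Sun, 27 Oct 2002 05:37:51 -0600"))
    print(date_tokens("not a date"))
```

The idea is that the classifier sees the sending hour and weekday as ordinary
clues alongside the body words, so it can learn whatever correlation with spam
exists, if any.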

The results:

    run1s -> datess
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ... etc ...

    false positive percentages
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 5 to 5 tied          
    mean fp % went from 0.25 to 0.25 tied          

    false negative percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        0.500  0.500  tied          
        1.000  0.500  won    -50.00%
        0.500  0.500  tied          
        1.500  1.500  tied          
        0.000  0.000  tied          
        2.000  2.000  tied          

    won   1 times
    tied  9 times
    lost  0 times

    total unique fn went from 15 to 14 won     -6.67%
    mean fn % went from 0.75 to 0.7 won     -6.67%

    ham mean                     ham sdev
       1.38    1.38   +0.00%       10.18   10.17   -0.10%
       0.42    0.43   +2.38%        3.77    3.78   +0.27%
       0.98    0.98   +0.00%        8.39    8.36   -0.36%
       0.17    0.21  +23.53%        1.05    1.52  +44.76%
       0.93    0.93   +0.00%        7.73    7.73   +0.00%
       1.40    1.40   +0.00%        8.36    8.39   +0.36%
       1.18    1.14   -3.39%        7.39    7.24   -2.03%
       0.73    0.74   +1.37%        7.54    7.54   +0.00%
       0.97    0.98   +1.03%        6.62    6.72   +1.51%
       0.79    0.79   +0.00%        7.74    7.74   +0.00%

    ham mean and sdev for all runs
       0.89    0.90   +1.12%        7.32    7.32   +0.00%

    spam mean                    spam sdev
      99.17   99.16   -0.01%        4.63    4.71   +1.73%
      98.65   98.66   +0.01%        6.34    6.27   -1.10%
      96.71   96.71   +0.00%       13.73   13.74   +0.07%
      96.74   96.73   -0.01%       13.46   13.46   +0.00%
      98.44   98.46   +0.02%        9.25    9.23   -0.22%
      97.35   97.36   +0.01%       12.00   11.92   -0.67%
      98.33   98.34   +0.01%        9.55    9.53   -0.21%
      97.17   97.17   +0.00%       13.68   13.68   +0.00%
      98.94   98.93   -0.01%        6.89    6.90   +0.15%
      97.46   97.45   -0.01%       13.72   13.73   +0.07%

    spam mean and sdev for all runs
      97.89   97.90   +0.01%       10.87   10.86   -0.09%

    ham/spam mean difference: 97.00 97.00 +0.00
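
The percentages in the "won" columns above are plain relative changes from the
first run to the second.  A minimal sketch of that arithmetic, where the helper
name pct_change is mine and not something from the comparison script:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # total unique fn went from 15 to 14
    print("%+.2f%%" % pct_change(15, 14))
    # the one fn run that changed: 1.000 -> 0.500
    print("%+.2f%%" % pct_change(1.0, 0.5))
```

A negative number means the second run had fewer errors, which is why the fn
rows that changed are marked "won".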

Here's the cost table:

    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ... yeah, yeah, yeah, enough already! ...
    filename:     run1   dates
    ham:spam:  2000:2000      
                       2000:2000
    fp total:        5       5
    fp %:         0.25    0.25
    fn total:       15      14
    fn %:         0.75    0.70
    unsure t:       93      93
    unsure %:     2.33    2.33
    real cost:  $83.60  $82.60
    best cost:  $53.80  $53.60
    h mean:       0.89    0.90
    h sdev:       7.32    7.32
    s mean:      97.89   97.90
    s sdev:      10.87   10.86
    mean diff:   97.00   97.00
    k:            5.33    5.34
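
The "real cost" and "k" rows can be reproduced from the other rows in the
table.  The sketch below assumes weights of $10 per false positive, $1 per
false negative, and $0.20 per unsure, and assumes k is the mean difference
divided by the sum of the two sdevs.  Both are inferred from the numbers
above rather than quoted from the table script, and the function names are
hypothetical.  "best cost" depends on per-message scores that aren't shown
here, so it is left out.

```python
# Assumed weights, inferred from the table: they reproduce both cost figures.
FP_WEIGHT = 10.0
FN_WEIGHT = 1.0
UNSURE_WEIGHT = 0.2


def real_cost(fp, fn, unsure):
    """Weighted cost of a run's errors and unsures, in dollars."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


def separation_k(mean_diff, ham_sdev, spam_sdev):
    """Assumed definition: score separation relative to combined spread."""
    return mean_diff / (ham_sdev + spam_sdev)


if __name__ == "__main__":
    print("run1  cost $%.2f  k %.2f" % (real_cost(5, 15, 93),
                                        separation_k(97.00, 7.32, 10.87)))
    print("dates cost $%.2f  k %.2f" % (real_cost(5, 14, 93),
                                        separation_k(97.00, 7.32, 10.86)))
```

Under those assumptions the whole $1.00 difference in real cost comes from the
single false negative that went away.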

Note that my numbers seem to be getting a lot better.  My ham/spam
collection has slowly gotten cleaner and I've been adding more new stuff,
not to mention that the default scheme (chi2?) seems a lot more
sensitive/accurate.  I noticed that as I lopped off old messages, first
those from 1999 and before, then those from 2000, the accuracy improved.
That suggests two things to me: first, the nature of "what is spam?" has
changed a bit, and second, someone ought to test this notion. ;-)

thanks-to-uncle-timmy-for-the-extra-hour-ly, y'rs,

Skip