[Spambayes] new option: generate_long_skips

Skip Montanaro skip@pobox.com
Mon, 30 Sep 2002 17:07:49 -0500


I just checked in a new option for the tokenizer: generate_long_skips.  The
default is True.  When reviewing my false positives I noticed that one was
overwhelmingly dominated by long-skip tokens (which scored very high) because
it contained an Excel spreadsheet attachment.  To test the option I compared
two configurations.  "cutoff.ini" is

    [TestDriver]
    spam_cutoff: 0.4

while "noskips.ini" is

    [Tokenizer]
    generate_long_skips: False

    [TestDriver]
    spam_cutoff: 0.4
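
For anyone who wants to check what the two files actually set, here is a minimal sketch that reads the same text with the Python 3 standard library configparser module.  It is not the project's own option-loading code; the load_options helper is hypothetical, and the fallback of True for generate_long_skips only mirrors the default stated above.

```python
import configparser

CUTOFF_INI = """\
[TestDriver]
spam_cutoff: 0.4
"""

NOSKIPS_INI = """\
[Tokenizer]
generate_long_skips: False

[TestDriver]
spam_cutoff: 0.4
"""

def load_options(text):
    """Hypothetical helper: pull the two options discussed here out of ini text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        # Falls back to True (the stated default) when the option is absent.
        "generate_long_skips": parser.getboolean(
            "Tokenizer", "generate_long_skips", fallback=True),
        "spam_cutoff": parser.getfloat("TestDriver", "spam_cutoff"),
    }

if __name__ == "__main__":
    print(load_options(CUTOFF_INI))
    print(load_options(NOSKIPS_INI))
```

The only difference between the two runs is the generate_long_skips value; spam_cutoff is 0.4 in both.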

I am currently running a test with 10 sets of 200 messages per set.  With 5
sets, turning the long skips off gave two more fn's and two fewer fp's:

    cutoffs -> noskipss
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
    -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

    false positive percentages
        1.500  1.000  won    -33.33%
        4.000  4.000  tied          
        3.000  2.500  won    -16.67%
        1.500  1.500  tied          
        1.000  1.000  tied          

    won   2 times
    tied  3 times
    lost  0 times

    total unique fp went from 22 to 20 won     -9.09%
    mean fp % went from 2.2 to 2.0 won     -9.09%

    false negative percentages
        2.000  2.500  lost   +25.00%
        1.000  1.000  tied          
        0.500  0.500  tied          
        1.500  1.500  tied          
        2.000  2.500  lost   +25.00%

    won   0 times
    tied  3 times
    lost  2 times

    total unique fn went from 14 to 16 lost   +14.29%
    mean fn % went from 1.4 to 1.6 lost   +14.29%

    ham mean                     ham sdev
      22.12   21.85   -1.22%        6.01    5.74   -4.49%
      23.46   23.25   -0.90%        7.31    6.93   -5.20%
      23.50   23.38   -0.51%        6.64    6.51   -1.96%
      23.54   23.32   -0.93%        6.88    6.87   -0.15%
      23.08   22.79   -1.26%        6.77    6.62   -2.22%

    ham mean and sdev for all runs
      23.14   22.92   -0.95%        6.76    6.57   -2.81%

    spam mean                    spam sdev
      72.49   71.95   -0.74%       13.82   14.02   +1.45%
      71.34   70.61   -1.02%       13.70   13.45   -1.82%
      73.12   72.58   -0.74%       12.88   12.80   -0.62%
      72.40   72.01   -0.54%       12.71   12.65   -0.47%
      70.71   70.10   -0.86%       13.91   13.74   -1.22%

    spam mean and sdev for all runs
      72.01   71.45   -0.78%       13.44   13.37   -0.52%

    ham/spam mean difference: 48.87 48.53 -0.34
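
The totals and percentage changes in the output above can be rechecked by hand.  The sketch below does that arithmetic in Python, assuming 200 hams and 200 spams are scored per run (as the "tested" lines show) and that each misclassified message is counted in exactly one run.  The summarize function is a hypothetical helper, not part of the comparison script.

```python
def summarize(before_pcts, after_pcts, msgs_per_run=200):
    """Turn per-run error percentages into total counts and a relative change."""
    before = round(sum(before_pcts) / 100 * msgs_per_run)
    after = round(sum(after_pcts) / 100 * msgs_per_run)
    change = (after - before) / before * 100
    return before, after, round(change, 2)

# Per-run false positive percentages: with long skips, then without.
fp = summarize([1.5, 4.0, 3.0, 1.5, 1.0], [1.0, 4.0, 2.5, 1.5, 1.0])

# Per-run false negative percentages: with long skips, then without.
fn = summarize([2.0, 1.0, 0.5, 1.5, 2.0], [2.5, 1.0, 0.5, 1.5, 2.5])

if __name__ == "__main__":
    print("fp:", fp)  # (22, 20, -9.09)
    print("fn:", fn)  # (14, 16, 14.29)
```

So the swing is two messages in each direction out of 1000 hams and 1000 spams.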

I think turning it off might be helpful for people whose ham tends to include
the occasional legitimate binary attachment.  In any case, it's easier for
people to test the option if I check it in.  We can always remove it if it
turns out not to be generally useful.  Checking it in now also makes it easier
for me to add the date mining without the two changes interfering.
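
To make the idea concrete, here is a rough sketch of how an option like this could gate the long-skip tokens.  It is not the checked-in tokenizer code: the function name, the 12-character word limit, and the exact token format are all assumptions made for illustration.

```python
def tokenize_word(word, generate_long_skips=True, max_word_size=12):
    """Yield tokens for a single word (illustrative only, not the real tokenizer)."""
    n = len(word)
    if 3 <= n <= max_word_size:
        # Ordinary word: use it as a token unchanged.
        yield word
    elif n > max_word_size and generate_long_skips:
        # Overlong word: summarize it by its first character and its
        # length rounded down to a multiple of 10.
        yield "skip:%c %d" % (word[0], n // 10 * 10)
    # Words shorter than 3 characters, and overlong words when the
    # option is off, produce no token at all.

if __name__ == "__main__":
    blob = "A" * 57  # stands in for a run of encoded attachment data
    print(list(tokenize_word("hello")))
    print(list(tokenize_word(blob)))
    print(list(tokenize_word(blob, generate_long_skips=False)))
```

With the option off, a message carrying a large binary attachment contributes none of these tokens, so they cannot dominate its score.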

Note the very low spam_cutoff.  An earlier run suggested that value, and after
I switched to it I got a dramatic drop in fn's.

Skip