[spambayes-dev] removing punctuation redux

Skip Montanaro skip at pobox.com
Wed Oct 29 11:27:47 EST 2003


Just as I was nodding off to sleep last night I realized I hadn't done the
punctuation removal right.  I was stripping punctuation after testing word
length, so (for example) a nine-letter word containing four punctuation
characters was evaluated as a "long" word.  I redid it this morning so that
punctuation stripping occurs before the length test, and got different
results (a bit better overall, I think).  Here's the summary (minus the
first block).  All rounds scored 500 against 4500 (using the same training
sets as yesterday).
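
The ordering problem can be shown with a minimal sketch. This is an
illustration only, not the code in the attached diff: the function names
are hypothetical, the cutoff of 12 characters is an assumption chosen to
match the nine-letters-plus-four-punctuation example, and it is written
for current Python rather than the Python of the time.

```python
import string

# Assumed cutoff: tokens longer than this count as "long" words.
# Illustrative value, not read from the Spambayes source.
MAX_WORD_LEN = 12

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punct(word):
    """Return word with all ASCII punctuation removed."""
    return word.translate(_STRIP_PUNCT)


def classify_length_first(word):
    """Mistaken order: test length on the raw word, then strip."""
    is_long = len(word) > MAX_WORD_LEN
    return strip_punct(word), is_long


def classify_strip_first(word):
    """Corrected order: strip punctuation, then test length."""
    stripped = strip_punct(word)
    return stripped, len(stripped) > MAX_WORD_LEN


# Nine letters plus four dots: 13 raw characters, 9 after stripping.
word = "f.r.e.e.offer"
print(classify_length_first(word))  # ('freeoffer', True)
print(classify_strip_first(word))   # ('freeoffer', False)
```

Both orderings yield the same stripped token; only the "long word"
decision differs.  The summary of the rerun follows.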

    stds.txt -> puncts.txt

    false positive percentages
        0.200  0.200  tied          
        0.200  0.400  lost  +100.00%
        0.200  0.200  tied          
        0.400  0.400  tied          
        1.000  1.000  tied          
        0.400  0.400  tied          
        0.000  0.000  tied          
        0.200  0.200  tied          
        0.400  0.400  tied          
        0.400  0.400  tied          

    won   0 times
    tied  9 times
    lost  1 times

    total unique fp went from 17 to 18 lost    +5.88%
    mean fp % went from 0.34 to 0.36 lost    +5.88%

    false negative percentages
        2.000  2.000  tied          
        0.800  1.000  lost   +25.00%
        2.200  2.200  tied          
        2.200  2.400  lost    +9.09%
        2.200  2.000  won     -9.09%
        1.200  1.200  tied          
        2.000  1.800  won    -10.00%
        2.000  1.800  won    -10.00%
        1.600  1.600  tied          
        1.200  0.800  won    -33.33%

    won   4 times
    tied  4 times
    lost  2 times

    total unique fn went from 87 to 84 won     -3.45%
    mean fn % went from 1.74 to 1.68 won     -3.45%

    ham mean                     ham sdev
       4.70    4.35   -7.45%       14.04   13.40   -4.56%
       6.12    6.06   -0.98%       16.43   16.55   +0.73%
       4.14    4.01   -3.14%       12.86   12.62   -1.87%
       4.64    4.56   -1.72%       14.35   14.23   -0.84%
       5.49    5.31   -3.28%       17.21   16.99   -1.28%
       5.21    5.03   -3.45%       16.10   15.81   -1.80%
       3.74    3.62   -3.21%       12.19   12.03   -1.31%
       3.38    3.30   -2.37%       11.70   11.51   -1.62%
       5.50    5.46   -0.73%       17.07   17.09   +0.12%
       5.44    5.27   -3.13%       15.85   15.57   -1.77%

    ham mean and sdev for all runs
       4.84    4.70   -2.89%       14.93   14.74   -1.27%

    spam mean                    spam sdev
      89.47   89.61   +0.16%       20.40   20.36   -0.20%
      91.15   91.22   +0.08%       17.17   17.12   -0.29%
      91.04   90.97   -0.08%       20.00   20.22   +1.10%
      89.39   89.74   +0.39%       20.88   20.73   -0.72%
      89.56   89.70   +0.16%       19.92   19.72   -1.00%
      90.72   90.78   +0.07%       18.41   18.48   +0.38%
      90.97   91.09   +0.13%       18.66   18.54   -0.64%
      91.66   91.85   +0.21%       18.20   17.90   -1.65%
      91.58   91.44   -0.15%       18.03   18.24   +1.16%
      92.24   92.25   +0.01%       16.18   16.22   +0.25%

    spam mean and sdev for all runs
      90.78   90.87   +0.10%       18.86   18.83   -0.16%

    ham/spam mean difference: 85.94 86.17 +0.23
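
The percentage columns above appear to be the relative change from the
first value to the second, and the last line an absolute difference
between the spam and ham means.  A minimal sketch of that arithmetic,
assuming this reading is right (the helper names are made up for
illustration and are not taken from cmp.py):

```python
def pct_change(old, new):
    """Relative change from old to new, as a percentage of old.

    Equal values count as no change, which also covers the 0.000 -> 0.000
    row.  A zero baseline with a nonzero new value has no defined
    percentage and raises ZeroDivisionError.
    """
    if old == new:
        return 0.0
    return (new - old) / old * 100.0


def verdict(old, new):
    """For error rates, a lower second value is a win."""
    if new < old:
        return "won"
    if new > old:
        return "lost"
    return "tied"


# total unique fp: 17 -> 18
print(verdict(17, 18), f"{pct_change(17, 18):+.2f}%")  # lost +5.88%

# total unique fn: 87 -> 84
print(verdict(87, 84), f"{pct_change(87, 84):+.2f}%")  # won -3.45%

# ham/spam mean difference: spam mean minus ham mean, before and after
before = 90.78 - 4.84
after = 90.87 - 4.70
print(f"{before:.2f} {after:.2f} {after - before:+.2f}")  # 85.94 86.17 +0.23
```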

The new context diff is attached.

Would someone please try this with their training database?  The shell
script I used to run the test is:

    #!/bin/sh

    # point SBDIR at Spambayes root directory
    SBDIR=$HOME/src/spambayes

    # everything below here should be okay
    TIMCV=$SBDIR/testtools/timcv.py
    RATE=$SBDIR/testtools/rates.py
    CMP=$SBDIR/testtools/cmp.py

    # $1 and $2 name the baseline and trial runs; each needs a matching .ini
    base=$1
    trial=$2

    # run timcv.py with identical arguments under each configuration
    BAYESCUSTOMIZE=$base.ini ; export BAYESCUSTOMIZE
    python "$TIMCV" -n 10 -s 12345 > "$base.txt"

    BAYESCUSTOMIZE=$trial.ini ; export BAYESCUSTOMIZE
    python "$TIMCV" -n 10 -s 12345 > "$trial.txt"

    # summarize each run into <name>s.txt
    python "$RATE" "$base.txt"
    python "$RATE" "$trial.txt"

    # compare the two summaries
    python "$CMP" "${base}s.txt" "${trial}s.txt" > "$base-$trial.txt"

So I created an empty std.ini and a punct.ini with this:

    [Tokenizer]
    remove_punctuation: True

and executed

    sh runtest std punct
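
For anyone who would rather script the setup, here is a minimal sketch
that writes the two configuration files described above.  The helper name
write_configs is hypothetical, and the remove_punctuation option is only
meaningful with the attached patch applied.

```python
from pathlib import Path


def write_configs(directory="."):
    """Create an empty std.ini and a punct.ini that enables the option."""
    d = Path(directory)
    std = d / "std.ini"
    punct = d / "punct.ini"
    # Baseline run: no overrides at all.
    std.write_text("")
    # Trial run: turn on the option added by the patch.
    punct.write_text("[Tokenizer]\nremove_punctuation: True\n")
    return std, punct
```

Calling write_configs() in the directory that holds the runtest script
leaves both files where "sh runtest std punct" expects them.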

Thx,

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 2408 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031029/871de894/sb-0001.obj

