[spambayes-dev] removing punctuation - no apparent benefit

Skip Montanaro skip at pobox.com
Tue Oct 28 17:26:29 EST 2003


I tried removing punctuation from words as we discussed a week or so ago.
No overall change, though per/hap.s we just ne.ed to see more mes*sages with
em%bedded p-unctua-tion.  In general, the ham means decreased slightly and
the spam means increased slightly while the standard deviations of both
groups decreased a little.
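
For anyone who wants to reproduce the idea without applying the diff, here is a
minimal sketch of what "removing punctuation from words" could look like. It is
not the code from the attached sb.diff: the names strip_punctuation and
tokenize are hypothetical, it assumes Python 3, and it only handles ASCII
punctuation on whitespace-split words.

```python
import string

# Hypothetical helper, NOT the change in sb.diff: a translation table that
# deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(word):
    """Return word with ASCII punctuation removed.

    If the word is nothing but punctuation, return it unchanged so the
    token is not lost entirely (an assumption of this sketch).
    """
    stripped = word.translate(_PUNCT_TABLE)
    return stripped or word


def tokenize(text):
    """Yield lowercased, whitespace-split words with punctuation removed."""
    for word in text.split():
        yield strip_punctuation(word.lower())


if __name__ == "__main__":
    sample = "per/hap.s we just ne.ed to see more mes*sages"
    print(list(tokenize(sample)))
```

The point of collapsing "mes*sages" and "messages" into one token is that the
obfuscated form then shares the clue statistics of the plain word instead of
being scored as a rare, separate token.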

    stds.txt -> puncts.txt
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
    -> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams

    false positive percentages
        0.200  0.200  tied          
        0.200  0.200  tied          
        0.200  0.200  tied          
        0.400  0.400  tied          
        1.000  1.000  tied          
        0.400  0.400  tied          
        0.000  0.000  tied          
        0.200  0.200  tied          
        0.400  0.400  tied          
        0.400  0.400  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 17 to 17 tied          
    mean fp % went from 0.34 to 0.34 tied          

    false negative percentages
        2.000  2.000  tied          
        0.800  0.800  tied          
        2.200  2.200  tied          
        2.200  2.200  tied          
        2.200  2.200  tied          
        1.200  1.200  tied          
        2.000  2.000  tied          
        2.000  2.000  tied          
        1.600  1.600  tied          
        1.200  1.200  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fn went from 87 to 87 tied          
    mean fn % went from 1.74 to 1.74 tied          

    ham mean                     ham sdev
       4.70    4.53   -3.62%       14.04   13.69   -2.49%
       6.12    6.13   +0.16%       16.43   16.48   +0.30%
       4.14    4.07   -1.69%       12.86   12.73   -1.01%
       4.64    4.62   -0.43%       14.35   14.34   -0.07%
       5.49    5.41   -1.46%       17.21   17.16   -0.29%
       5.21    5.16   -0.96%       16.10   16.01   -0.56%
       3.74    3.70   -1.07%       12.19   12.12   -0.57%
       3.38    3.34   -1.18%       11.70   11.68   -0.17%
       5.50    5.48   -0.36%       17.07   17.05   -0.12%
       5.44    5.38   -1.10%       15.85   15.82   -0.19%

    ham mean and sdev for all runs
       4.84    4.78   -1.24%       14.93   14.86   -0.47%

    spam mean                    spam sdev
      89.47   89.57   +0.11%       20.40   20.37   -0.15%
      91.15   91.24   +0.10%       17.17   17.02   -0.87%
      91.04   91.03   -0.01%       20.00   20.03   +0.15%
      89.39   89.54   +0.17%       20.88   20.76   -0.57%
      89.56   89.63   +0.08%       19.92   19.91   -0.05%
      90.72   90.77   +0.06%       18.41   18.36   -0.27%
      90.97   91.03   +0.07%       18.66   18.60   -0.32%
      91.66   91.78   +0.13%       18.20   18.08   -0.66%
      91.58   91.61   +0.03%       18.03   18.03   +0.00%
      92.24   92.30   +0.07%       16.18   16.18   +0.00%

    spam mean and sdev for all runs
      90.78   90.85   +0.08%       18.86   18.81   -0.27%

    ham/spam mean difference: 85.94 86.07 +0.13
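
The summary figures can be checked by hand from the "all runs" rows. The sketch
below assumes the percentage columns are simple relative changes, (after -
before) / before, and that the mean difference is spam mean minus ham mean. It
works from the rounded values printed above, so it is only a sanity check, not
how the comparison script computes them; pct_change is a hypothetical helper.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Rounded "all runs" values copied from the tables above.
ham_mean_before, ham_mean_after = 4.84, 4.78
spam_mean_before, spam_mean_after = 90.78, 90.85

# Separation between the spam and ham score means, before and after.
gap_before = spam_mean_before - ham_mean_before
gap_after = spam_mean_after - ham_mean_after

if __name__ == "__main__":
    print("ham mean change:  %+.2f%%" % pct_change(ham_mean_before, ham_mean_after))
    print("spam mean change: %+.2f%%" % pct_change(spam_mean_before, spam_mean_after))
    print("mean difference:  %.2f -> %.2f (%+.2f)"
          % (gap_before, gap_after, gap_after - gap_before))
```

A wider gap between the means, with slightly smaller standard deviations, is
movement in the right direction, but here it is too small to change any fp or
fn counts.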

The training database is just the collection of ham and spam I currently use
every day.

Attached is the context diff to tokenizer.py and Options.py should anyone
care to check my work for mistakes or tweak it.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 2675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031028/f47f3299/sb.obj