[spambayes-dev] image cracking results

Mark Hammond mhammond at skippinet.com.au
Fri Dec 22 03:56:19 CET 2006


I wrote:
> I've finally managed to get something working with the
> Outlook addin and
> Skip's cool new ocrad stuff.  the results look promising! :)

Here are a few more details on what I am doing.

To make things work with the image cracking code, I took the route of having
the Outlook addin generate a valid multipart message when there are images.
If there are no images, we return the same as we did in the past (ie, a
singlepart message with text and HTML in the normal "body"), so where
possible, the tokens generated for a message will be the same.  When there
are images, the tokens will now be different - due to the extra image
cracking tokens (obviously), but also due to the different MIME-related
tokens that will now be seen by the standard tokenizer.
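
As a rough illustration of the single-part vs multipart split described
above, here is a minimal sketch using the standard library email package.
It is not the addin's actual code: build_message, its arguments, the
text/plain content type and the "mixed" container are all assumptions made
for the example.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(body, images):
    """Hypothetical helper, not the addin's real code.

    body   -- text and HTML of the message, already combined into one string
    images -- list of (filename, subtype, raw_bytes) tuples
    """
    if not images:
        # No images: stay single-part so the tokens match the old scheme.
        return MIMEText(body, "plain")

    # Images present: wrap the body and each image in a multipart container.
    msg = MIMEMultipart("mixed")
    msg.attach(MIMEText(body, "plain"))
    for filename, subtype, data in images:
        # Passing the subtype explicitly avoids guessing it from the bytes.
        part = MIMEImage(data, _subtype=subtype)
        part.add_header("Content-Disposition", "attachment", filename=filename)
        msg.attach(part)
    return msg


plain = build_message("hello", [])
with_gif = build_message("hello", [("spam.gif", "gif", b"GIF89a")])
print(plain.is_multipart(), with_gif.is_multipart())
print([part.get_content_type() for part in with_gif.walk()])
```

The second message carries extra Content-Type and Content-Disposition
headers for each part, which is where the additional MIME-related tokens
would come from.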

This is a fairly subtle change, but could be significant to the classifier.
For the purposes of comparison, I exported all ham and spam twice: once using
the "old" scheme (ie, before images were handled), and once using the new
scheme with the image options disabled (importantly, the new scheme *does*
still include the image data in the message).  The idea is to test only the
impact of the new MIME structure, without looking at image content.

I *think* these results are OK, but they are a little strange.  Below is the
result of cmp.py comparing the "old" scheme with the "new" scheme - note that
on false negatives we won 6 times, lost 4 times, and never tied, with the
best win by 29%, but the worst loss by 25%.  The "+900.00%" in "ham sdev"
also looks extreme (although it is only a move from 0.01 to 0.10), but as I
mentioned, I'm not very good at reading these.  One thing I noticed is that
the fact a message has a .gif attached is now a significant spam clue - I
expect those new tokens account for the significant swings in the results.
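
For anyone else who finds these tables hard to read, the percentages are
just the relative change from the first column to the second.  The sketch
below reproduces that arithmetic for the false negative rows; it is my own
reconstruction from the numbers in the output, not code taken from cmp.py,
and the function names are made up for the example.

```python
def relative_change(before, after):
    """Percentage change from before to after, or None if before is 0.

    The table prints "+(was 0)" where this returns None.
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


def verdict(before, after):
    # For error rates, lower is better.
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# (old, new) false negative percentages from the ten runs in the table.
fn_rates = [
    (6.897, 4.906), (6.777, 6.367), (5.206, 6.526), (5.323, 6.655),
    (6.397, 6.430), (6.391, 4.727), (5.587, 5.204), (5.415, 5.769),
    (6.186, 5.495), (6.470, 6.239),
]

verdicts = [verdict(old, new) for old, new in fn_rates]
changes = [relative_change(old, new) for old, new in fn_rates]

for (old, new), result, change in zip(fn_rates, verdicts, changes):
    print("%9.3f %6.3f  %-5s %+7.2f%%" % (old, new, result, change))

print("won %d, tied %d, lost %d" % (
    verdicts.count("won"), verdicts.count("tied"), verdicts.count("lost")))
```

The same function gives the +900.00% for the ham sdev row that went from
0.01 to 0.10, which shows how a tiny absolute change can produce a huge
percentage when the starting value is close to zero.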

Does anyone have comments about this?

Cheers, and Happy Holidays!

Mark

<snip false positive percentages - all zero>

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.897  4.906  won    -28.87%
    6.777  6.367  won     -6.05%
    5.206  6.526  lost   +25.36%
    5.323  6.655  lost   +25.02%
    6.397  6.430  lost    +0.52%
    6.391  4.727  won    -26.04%
    5.587  5.204  won     -6.86%
    5.415  5.769  lost    +6.54%
    6.186  5.495  won    -11.17%
    6.470  6.239  won     -3.57%

won   6 times
tied  0 times
lost  4 times

total unique fn went from 331 to 319 won     -3.63%
mean fn % went from 6.06478720218 to 5.8317685488 won     -3.84%

ham mean                     ham sdev
   0.04    0.01  -75.00%        0.53    0.11  -79.25%
   0.01    0.01   +0.00%        0.12    0.13   +8.33%
   0.00    0.00 +(was 0)        0.01    0.10 +900.00%
   0.00    0.00 +(was 0)        0.02    0.08 +300.00%
   0.00    0.03 +(was 0)        0.00    0.66 +(was 0)
   0.00    0.00 +(was 0)        0.06    0.01  -83.33%
   0.05    0.02  -60.00%        1.05    0.28  -73.33%
   0.10    0.03  -70.00%        1.67    0.47  -71.86%
   0.01    0.05 +400.00%        0.14    0.87 +521.43%
   0.02    0.08 +300.00%        0.29    1.36 +368.97%

ham mean and sdev for all runs
   0.02    0.02   +0.00%        0.66    0.59  -10.61%

spam mean                    spam sdev
  89.52   91.50   +2.21%       25.85   23.32   -9.79%
  88.98   89.37   +0.44%       25.99   25.72   -1.04%
  91.25   89.82   -1.57%       23.36   25.43   +8.86%
  90.74   89.59   -1.27%       23.72   25.61   +7.97%
  89.76   89.78   +0.02%       26.01   25.75   -1.00%
  89.98   90.99   +1.12%       25.19   23.12   -8.22%
  90.89   89.93   -1.06%       23.97   24.20   +0.96%
  91.34   90.33   -1.11%       23.41   24.35   +4.02%
  89.88   90.23   +0.39%       25.39   24.58   -3.19%
  88.73   90.43   +1.92%       26.01   25.24   -2.96%

spam mean and sdev for all runs
  90.11   90.19   +0.09%       24.94   24.77   -0.68%

ham/spam mean difference: 90.09 90.17 +0.08


