[spambayes-dev] A URL experiment

Skip Montanaro skip at pobox.com
Wed Dec 31 10:53:27 EST 2003


    Tim> Note that this part of the patch can't be helping much:

    Tim> +             num_pcs = url.count("%")
    Tim> +             if num_pcs:
    Tim> +                 pushclue("url:%d %%s" % num_pcs)

    Tim> That is, raw counts are almost never useful -- if I have a URL in a
    Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
    Tim> 39 (or 41) escapes.  Pumping out log2(a_count) usually does more
    Tim> good.  

<aside type="slight">

"url:has user" seems to be fairly spammy for me:

    % spamcounts -r -d ~/tmp/hammie.db '^url:has user'
    db: /Users/skip/tmp/hammie.db
    token,nspam,nham,spam prob
    url:has user,42,4,0.91016660508

</aside>

Okay, here are the raw counts of URL percent signs (npcs) as they appear in
my current ham/spam database:

    npcs    nspam   nham
    1       21      46  
    2       4       1   
    3       2       2   
    4       1       2   
    5       0       1   
    6       2       2   
    7       1       1   
    8       0       2   
    14      2       0   
    15      0       1   
    16      1       0   
    18      1       0   
    23      1       0   
    24      1       0   
    28      1       0   
    30      1       0   
    38      2       0   
    40      1       0   
    42      1       0   
    74      1       0   
    75      1       0   
    84      1       0   
    97      1       0   
    103     1       0   
    109     1       0   
    191     1       0   

I redid my patch to generate tokens like so:

    pushclue("url:%%%d" % int(log2(num_pcs)))
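
As a standalone illustration of what that bucketing does, here is a minimal
sketch. The function name percent_escape_token is hypothetical, and log2 is
taken from the standard library's math module (Python 3.3+), which is an
assumption; the patch's own log2 helper is not shown here.

```python
import math

def percent_escape_token(url):
    """Return a log2-bucketed token for the '%' count in url, or None."""
    num_pcs = url.count("%")
    if not num_pcs:
        return None
    # int(log2(n)) maps 1 -> 0, 2-3 -> 1, 4-7 -> 2, 8-15 -> 3, and so on,
    # so nearby counts share one token instead of each getting its own.
    return "url:%%%d" % int(math.log2(num_pcs))

# URLs with 39, 40 and 41 escapes all land in the same bucket.
for n in (39, 40, 41):
    print(n, percent_escape_token("http://example.com/" + "%" * n))
```

With raw counts those three URLs would each produce a different token; with
the bucketed form, evidence from one can count against the others.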

Converting the first column to int(log(n,2)) then rebuilding the database
gives: 

    log2(npcs)  nspam   nham
    0           21      46
    1           6       3
    2           4       2
    3           2       2
    4           5       0
    5           3       0
    6           2       0
    7           1       0

The new cv test results are essentially the same (I still have just five
sets):

stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams

false positive percentages
    0.000  0.000  tied          
    0.400  0.400  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.08 to 0.08 tied          

false negative percentages
    3.333  3.333  tied          
    5.000  5.000  tied          
    7.333  7.333  tied          
    5.667  5.667  tied          
    4.000  4.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 76 to 76 tied          
mean fn % went from 5.06666666667 to 5.06666666667 tied          

ham mean                     ham sdev
   1.64    1.64   +0.00%        8.44    8.45   +0.12%
   0.99    0.99   +0.00%        8.29    8.29   +0.00%
   2.82    2.82   +0.00%       12.52   12.52   +0.00%
   1.58    1.58   +0.00%        8.29    8.29   +0.00%
   1.30    1.30   +0.00%        8.04    8.04   +0.00%

ham mean and sdev for all runs
   1.66    1.66   +0.00%        9.30    9.30   +0.00%

spam mean                    spam sdev
  93.80   93.83   +0.03%       19.39   19.31   -0.41%
  90.56   90.59   +0.03%       24.31   24.26   -0.21%
  89.24   89.28   +0.04%       27.03   27.04   +0.04%
  89.27   89.27   +0.00%       25.51   25.50   -0.04%
  92.72   92.74   +0.02%       21.67   21.67   +0.00%

spam mean and sdev for all runs
  91.12   91.14   +0.02%       23.81   23.79   -0.08%

ham/spam mean difference: 89.46 89.48 +0.02
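
The summary lines in that output can be cross-checked against the per-set
percentages. The sketch below assumes each message is scored in exactly one
of the five sets, so per-set error counts simply add up; the set sizes come
from the "tested 250 hams & 300 spams" lines, and the variable names are
made up for illustration.

```python
HAMS_PER_SET = 250
SPAMS_PER_SET = 300

# Per-set percentages copied from the output above.
fp_pcts = [0.000, 0.400, 0.000, 0.000, 0.000]
fn_pcts = [3.333, 5.000, 7.333, 5.667, 4.000]

def error_counts(pcts, per_set):
    """Turn rounded percentages back into whole-message counts."""
    return [round(p / 100.0 * per_set) for p in pcts]

fp_counts = error_counts(fp_pcts, HAMS_PER_SET)
fn_counts = error_counts(fn_pcts, SPAMS_PER_SET)

# Mean of the per-set rates, computed from the exact counts.
mean_fp = 100.0 * sum(fp_counts) / (HAMS_PER_SET * len(fp_counts))
mean_fn = 100.0 * sum(fn_counts) / (SPAMS_PER_SET * len(fn_counts))

print("fp:", fp_counts, sum(fp_counts), mean_fp)
print("fn:", fn_counts, sum(fn_counts), mean_fn)
```

The totals come out to 1 false positive and 76 false negatives, matching
the "total unique" lines for both runs.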

Skip


