[spambayes-dev] A URL experiment

Tim Peters tim.one at comcast.net
Tue Dec 30 20:59:44 EST 2003


[Skip Montanaro]
> I tried a somewhat different approach (patch is attached) and got
> similar results (all ties at the more gross level, slight increase in
> spam mean and slight decrease in spam sdev, no change to ham at all
> (*)):

3-way compare on my data:

filename:   before   after    skip
ham:spam: 1510:520 1510:520 1510:520
fp total:        1       1       1
fp %:         0.07    0.07    0.07
fn total:        3       3       3
fn %:         0.58    0.58    0.58
unsure t:       39      39      39
unsure %:     1.92    1.92    1.92
real cost:  $20.80  $20.80  $20.80
best cost:  $17.60  $17.40  $17.80
h mean:       0.36    0.36    0.36
h sdev:       4.77    4.78    4.77
s mean:      97.07   97.11   97.08
s sdev:      11.09   11.05   11.03
mean diff:   96.71   96.75   96.72
k:            6.10    6.11    6.12

The "best cost" measure actually got marginally worse, but not significantly
so.
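
(For reference, the "real cost" figure is the usual weighted sum, assuming
the default weights of $10 per false positive, $1 per false negative, and
$0.20 per unsure:  1*$10 + 3*$1 + 39*$0.20 = $20.80, identical in all three
columns here.)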

Note that this part of the patch can't be helping much:

+             num_pcs = url.count("%")
+             if num_pcs:
+                 pushclue("url:%d %%s" % num_pcs)

That is, raw counts are almost never useful -- if I have a URL in a spam
that embeds 40 escapes, that does nothing to indict a URL with 39 (or 41)
escapes.  Pumping out log2(a_count) usually does more good.  I *expect* the
approach in my patch would work better, though, because it generates lots
of correlated tokens:  there are good reasons to escape some punctuation
characters in URLs, but the only good reason to escape a letter or digit is
to obfuscate.  Let the classifier see these things, and it will learn that
on its own, as appropriate, for each escape code; then a URL escaping
several letters or digits will get penalized more the more heavily it
employs this kind of obfuscation.
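
To make the contrast concrete, here's a rough illustrative sketch -- not
the actual spambayes tokenizer code; the function and the exact token
strings are made up -- of the two flavors of clue generation:

    import re
    from math import log

    def url_escape_clues(url):
        # Illustrative only -- not the real tokenizer.  Collect clues
        # about %XX escapes in a URL.
        clues = []
        escapes = re.findall(r"%([0-9a-fA-F]{2})", url)

        # Log2-bucketed count:  URLs with 32..63 escapes all generate
        # the same token, so 40 escapes and 39 escapes reinforce each
        # other instead of being unrelated clues.
        if escapes:
            clues.append("url:%d log2 escapes" % int(log(len(escapes), 2)))

        # One token per escape code:  the classifier can learn for
        # itself that escaping a letter or digit (obfuscation) is
        # spammy while escaping punctuation is often legitimate; a URL
        # escaping many letters or digits then picks up many spammy
        # clues at once.
        for code in escapes:
            clues.append("url:escape %s" % code.lower())

        return clues

Feeding it something like http://example.com/%41%42%2Fpath would yield one
bucketed-count clue plus one clue apiece for the escape codes 41, 42 and 2f.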

> (*) Operational question: Given that my training data is somewhat
> small at the moment (roughly 1000-1500 each of ham and spam), would I
> be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs
> each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)?

If you ask me <wink>, cross-validation should *always* be done with a
minimum of 10 sets, regardless of how much data you have.  There are many
reasons for this.  One is the statistical reliability of the grand averages
at the end:  they're subject to central-limit theorem constraints, and the
more sets the more reliable they are, growing with the square root of the
number of sets.  Another is that it's extremely important to see run-by-run
comparisons (how many runs won, lost, tied), and just about any
distribution of those numbers is achievable by chance with few sets.  IOW,
"9 won, 1 tied, 0 lost" is very much harder to account for by chance than
"4 won, 1 tied, 0 lost"; likewise, "1 won, 8 tied, 1 lost" is much less
likely to be produced by a significant (good or bad) change than "1 won, 3
tied, 1 lost".
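
For the mechanical part of the question, here's an illustrative sketch --
not the spambayes test harness; the helper names are invented -- of
shuffling a message collection into 10 cross-validation sets and tallying
run-by-run wins, ties and losses between two configurations:

    import random

    def split_into_sets(msgs, nsets=10, seed=42):
        # Shuffle, then deal the messages round-robin into nsets
        # roughly equal cross-validation sets.
        msgs = list(msgs)
        random.Random(seed).shuffle(msgs)
        return [msgs[i::nsets] for i in range(nsets)]

    def win_tie_lose(costs_a, costs_b):
        # Judge each paired run by its total cost (lower is better) and
        # count how many runs config A won, tied, or lost; assumes the
        # two lists pair up run-for-run.
        won = sum(1 for a, b in zip(costs_a, costs_b) if a < b)
        lost = sum(1 for a, b in zip(costs_a, costs_b) if a > b)
        tied = len(costs_a) - won - lost
        return won, tied, lost

With 10 sets you get 10 paired runs to compare, and a lopsided tally like
"9 won, 1 tied, 0 lost" is much harder to dismiss as chance than anything
you can get out of 5 runs.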

Note, though, that cross-validation is modeling the performance of a
train-on-everything strategy, and in random time order to boot.  If that's
not how you train, the results may be irrelevant to what you'll see in real
life.  It should be good enough to weed out really bad ideas-- and highlight
really good ones --regardless, though.



