[spambayes-dev] A URL experiment

Tim Peters tim.one at comcast.net
Sun Jan 4 18:51:53 EST 2004


[Tim]
>> I *expect* the approach in my patch would work better, though
>> (generating lots of correlated tokens -- there are good reasons to
>> escape some punctuation characters in URLs, but the only good
>> reason to escape a letter or digit is to obfuscate; let the
>> classifier see these things, and it will learn that on its own,
>> as appropriate, for each escape code; then a URL escaping several
>> letters or digits will get penalized more the more heavily it
>> employs this kind of obfuscation).
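
Concretely, the idea is something like this minimal sketch (not the
actual patch, and the token spelling here is made up):

    import re

    ESCAPE_RE = re.compile(r'%([0-9A-Fa-f]{2})')

    def escape_tokens(url):
        # Emit one token per %XX escape found in the URL, so the
        # classifier can learn a spamprob for each escape code on
        # its own.
        for hexpair in ESCAPE_RE.findall(url):
            yield 'url escape:%' + hexpair.upper()

    >>> list(escape_tokens('http://example.com/%68%65%6C%6C%6F'))
    ['url escape:%68', 'url escape:%65', 'url escape:%6C',
     'url escape:%6C', 'url escape:%6F']

A URL escaping many letters or digits then generates many such tokens,
each with its own learned spamprob -- that's where the correlated clues
come from.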

[Skip Montanaro]
> My problem with that approach is the stuff the spammers escape can be
> essentially random, as in the bogus URL you received.  I think you
> might get scads of hapaxes (or at least low-count escapes).  Stuff
> with high-counts will be legitimate (%20 and so forth).

There won't be scads of hapaxes, because the number of escape codes is
finite (small, even -- only 256 make sense).  I *expect* that only 62 of
those will be interesting (attempts to obfuscate letters and digits), but
there's no need to try to out-think that, and just sucking up every escape
code without prejudice lets the classifier learn to be smarter than I am.
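
The counting is simple (a quick sketch, assuming we care only about
ASCII letters and digits):

    import string

    # Two hex digits -> at most 256 distinct %XX escape codes.
    all_codes = ['%%%02X' % i for i in range(256)]

    # Codes that decode to a letter or digit; there's no good reason
    # to escape these, so they're the obfuscation suspects.
    suspects = ['%%%02X' % ord(c)
                for c in string.ascii_letters + string.digits]

    print(len(all_codes), len(suspects))  # 256 62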

The pre-judgment here comes from the *belief* that this is a case where
generating multiple correlated clues will help more than it hurts.
Especially with smaller databases, multiple clues do a lot more toward
forcing a decision than a single clue can do.
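
For instance, here's a toy illustration of chi-squared combining in the
style SpamBayes uses (a from-memory sketch, not the production code):
five moderately spammy clues push the final score further from 0.5 than
one strong clue does.

    import math

    def chi2Q(x2, v):
        # prob(chisq with v degrees of freedom >= x2); v must be even.
        m = x2 / 2.0
        term = total = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def score(probs):
        # Combine per-token spamprobs into one score in [0.0, 1.0].
        n = len(probs)
        S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs),
                        2 * n)
        H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        return (S - H + 1.0) / 2.0

    print(score([0.95]))      # one strong clue   -> 0.95
    print(score([0.85] * 5))  # five weaker clues -> ~0.98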

> Conclusions obviously await some eyeballing of databases.

Yup!

> ...
> The random time order isn't so important to me at the moment, because
> all the messages I'm using are recent (received within the past month
> or so). The "train on everything" aspect is more interesting.  I find
> the cross-validation tests never perform as well as in real life. ;-)

I expect that's because the CV tests *do* lose time-ordering.

> ...
> There's the rub.  What might be really good ideas at this point will
> probably only result in very small changes in performance because the
> baseline system is currently so good.

That's OK -- accumulating many tiny improvements is as good as finding a
single small improvement <wink>.  That's a sure way to make ongoing
progress too,
and is the *usual* fate of mature statistical systems.  A question remaining
is whether each tiny improvement is worth the costs it incurs (in processing
time, database size, and code complexity).  I think this one does well on
all three counts: it triggers only in a specific context, can add at most a
few hundred tokens total to the database (the escape codes are finite), and
the code is simple.
