[Spambayes] URLSlurper (Was: Latest spammer trick stymied -QUESTION)

Mark Hammond mhammond at skippinet.com.au
Mon May 5 12:34:29 EDT 2003


> Time to pull out the testing data again...

OK - I will bite.  These short URL messages are still coming strong, and
usually still ending up squarely as "unsure".

Eg, the most recent I can find has a body of:
"""
Brother and Sister
Nephew and Aunt
Nasty relatives performing intrafamilial party in bedroom

http://peipeisoq.incestuals.com/
"""

To my mind, if you are a "customer" of such spam, you know exactly what it
is offering - so presumably it is as effective as any other porn spam -
possibly more so due to the "teasing" qualities.

> In the interests of testing, I've done this.  The code isn't based at
> all on the stuff that Richard sent (it was easier to just do it from
> scratch), but does follow the same rules (more-or-less).

*sigh* - while running the Outlook "export" program, I still see some errors
decomposing Outlook messages:

Failed to get message text for 'ADV: The Star Trek Cast NAKED! - This months
special feature. -  Must Be 18 Years or Older': string payload expected:
<type 'list'>
Failed to get message text for 'ADV: Exclusive Hot Young Centerfold Girls! -
Must Be 18 Years or Older': string payload expected: <type 'list'>
Failed to get message text for '[PMX:#] Be impressed!': string payload
expected: <type 'list'>
Failed to get message text for 'My father and his bitches...': string
payload expected: <type 'list'>

Someone must look into this one of these days <wink>.  And to think it
appears profitable to send spam advertising the Star Trek cast naked <frown>.
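
For reference, "string payload expected: <type 'list'>" is the kind of
complaint that shows up when a multipart message (whose payload is a list of
sub-parts) gets treated as if it held a single text string.  A minimal sketch
of pulling the text out of such a message by walking its parts is below.  It
uses the current standard-library email package, not the actual Outlook
export code, and extract_text is a hypothetical helper name:

```python
from email import message_from_string


def extract_text(msg):
    """Return the concatenated text of every text/* leaf part of msg."""
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            # The payload here is a list of sub-parts, not a string;
            # walk() will visit the sub-parts for us.
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message: fall back to latin-1.
            chunks.append(payload.decode("latin-1", errors="replace"))
    return "\n".join(chunks)
```

Because only leaf parts are decoded, a list payload is never handed to code
that expects a string.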

> urls that will timeout (60s a pop!), and not a good idea to run if you
> pay per kb downloaded, or something like that).

I admit I am a little suspicious of this test code though - it *seems* to be
slurping URLs even when there are plenty of discriminators.  I may be wrong
though, and have no time to check.
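
To make the concern concrete: the expectation is that URLs are only fetched
when the message cannot already be classified from its own tokens.  A purely
hypothetical sketch of such a guard follows.  The function name, the use of
the overall score as the test, and both cutoff values are assumptions for
illustration, not what the test code actually does:

```python
HAM_CUTOFF = 0.20   # assumed value, for illustration only
SPAM_CUTOFF = 0.90  # assumed value, for illustration only


def should_slurp(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return True only when score falls in the "unsure" band.

    A message already scored as clear ham or clear spam has enough
    discriminators, so fetching its URLs would be wasted effort.
    """
    return ham_cutoff <= score < spam_cutoff
```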

Anyway, after waiting ages for things to complete, and making some changes
to the test code, I have some results.  For me, it seems a loss!

timtest_outs.txt -> urlslurper_outs.txt
-> <stat> tested 423 hams & 1430 spams against 462 hams & 1493 spams
...
false positive percentages
...
won   0 times
tied 20 times
lost  0 times

total unique fp went from 8 to 8 tied
mean fp % went from 0.159381556389 to 0.159381556389 tied

false negative percentages
...
won   0 times
tied 14 times
lost  6 times
...

Which seems surprising.  I also note my ham/spam imbalance is getting high,
and it *seems* spambayes isn't doing the job it used to (far more "unsures"
than seems reasonable).  Unfortunately (or fortunately) it is still doing
well enough that I don't have the time to investigate further.
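
For anyone reading the won/tied/lost lines above: they presumably compare the
error rate of each test run before and after the change, pair by pair.  A
minimal sketch of that kind of tally is below.  It is not the actual
comparison script, tally is a hypothetical name, and the numbers in the usage
lines are made up rather than taken from my run:

```python
def tally(before, after):
    """Count paired runs where the 'after' error rate won, tied or lost.

    A lower error rate after the change counts as a win.
    """
    counts = {"won": 0, "tied": 0, "lost": 0}
    for b, a in zip(before, after):
        if a < b:
            counts["won"] += 1
        elif a > b:
            counts["lost"] += 1
        else:
            counts["tied"] += 1
    return counts


# Made-up false negative percentages for three runs.
before = [1.0, 2.0, 0.5]
after = [1.0, 2.5, 0.5]
result = tally(before, after)
```

With those made-up rates, two runs tie and one loses, which is the same shape
as the false negative result above: no run improved.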

Another option is that my changes to the urlslurper (eg, only fetching
text/html) broke the test.  Can you re-run from CVS and make sure you
still see a win?  I'd *love* to see a win (even though implementing anything
like this for Outlook would be a challenge ;)
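
For clarity, "only fetching text/html" means something along these lines.
This is a rough sketch using the current standard library, not the change
that was actually made to the test code.  The names is_html and slurp and the
max_bytes limit are assumptions, and the 60 second timeout just mirrors the
figure quoted above:

```python
import urllib.request


def is_html(content_type):
    """True if a Content-Type header value names text/html."""
    if not content_type:
        return False
    # Drop any parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def slurp(url, timeout=60, max_bytes=100000):
    """Fetch url and return its text, or None if it is not text/html.

    Network errors, timeouts and malformed URLs also yield None.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if not is_html(resp.headers.get("Content-Type")):
                return None
            charset = resp.headers.get_content_charset() or "latin-1"
            return resp.read(max_bytes).decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        return None
```

Anything that is not text/html (images, archives and so on) is skipped
without reading the body, which also helps if you pay per kb downloaded.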

Mark.