[Spambayes] An interesting example of bad correlation

Tim Peters tim.one@comcast.net
Mon Oct 28 06:09:53 2002


I just got two copies of this spam from python.org:

"""
Olá me chamo Marquinho. Acabei de lançar um site na WEB que fala sobre o
povo brasileiro e meu projeto... Lá você vai ver minhas fotos. Você pode
divulgar o potencial de sua cidade. Além disso você pode concorrer a uma web
cam. dia 27 de dezembro.

Visite! e vote no meu site! Preciso de apoio...

http://www.nossobrasil.kit.net



Se não quiser mais receber nossa informação favor somente responda.

NossoBrasil.kit.net


NossoBrasil.kit.net
"""

One of them showed up in my "I'm sure it's spam" folder, with a score of
0.96.  The other showed up in my "I'm confused" folder, with a score of
0.75.  What's the difference?  The former was addressed to
webmaster@python.org and the latter to help@python.org.  The latter is a
(privately archived) mailing list, so Mailman put its fingers on it.  I
*thought* I was ignoring all Mailman headers, and I was <wink>.  But it
turns out Mailman does other stuff that shows up in the headers, adding
these clues (token, then spamprob) that didn't exist in the copy I got via
webmaster:

'header:Errors-to:1'           0.045086
'subject:Python'               0.0644291
'subject:] '                   0.0772537
'subject:['                    0.147731
'subject:Help'                 0.270936
'subject:-'                    0.286281

The original didn't have an Errors-to header.  The last 5(!) are due to the

    [Python-Help]

inserted into the Subject line.
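
To make it concrete how one bracketed tag turns into five clues, here is a
minimal sketch of subject tokenizing.  The two regexps are assumptions picked
to reproduce the tokens listed above (words, plus runs of non-word
characters); the real tokenizer may differ in detail.  The subject text
"Visite" is hypothetical, since the spam's actual Subject line isn't shown.

```python
import re

# Assumed patterns, chosen only to reproduce the five subject clues above.
WORD_RE = re.compile(r"[\w$.%]+")      # word-ish runs
PUNCT_RUN_RE = re.compile(r"\W+")      # runs of non-word chars, spaces included


def subject_tokens(subject):
    """Return the set of 'subject:'-prefixed tokens for one Subject line."""
    tokens = set()
    for word in WORD_RE.findall(subject):
        tokens.add("subject:" + word)
    for run in PUNCT_RUN_RE.findall(subject):
        tokens.add("subject:" + run)
    return tokens


# Hypothetical subjects: the same message direct, and via the mailing list.
direct = subject_tokens("Visite")
via_list = subject_tokens("[Python-Help] Visite")

# Tokens that exist only because of the inserted list tag.
added = via_list - direct
print(sorted(added))
```

Under these assumptions the difference is exactly the five subject tokens in
the table: two words and three punctuation runs, with "] " carrying the
trailing space.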

I believe spam that isn't caught by python.org, and then comes through on a
mailing list, is my biggest source of Unsure msgs.
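
Here is a rough sketch of why six extra hammy clues can drag an otherwise
confident spam score down toward Unsure.  It assumes chi-squared combining of
the per-token spamprobs; the post doesn't say which combining scheme produced
the 0.96 and 0.75 scores, so treat that as an assumption.  The base_clues
list is made up for illustration.  Only the six mailman_clues values come
from the table above.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combined_score(probs):
    """Combine per-token spamprobs into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)   # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0


# Hypothetical spammy clues standing in for the message body (not real data).
base_clues = [0.99, 0.98, 0.97, 0.96, 0.95, 0.93, 0.91, 0.90]

# The six clues that appeared only in the mailing-list copy.
mailman_clues = [0.045086, 0.0644291, 0.0772537, 0.147731, 0.270936, 0.286281]

without_mailman = chi_combined_score(base_clues)
with_mailman = chi_combined_score(base_clues + mailman_clues)
print(without_mailman, with_mailman)
```

The exact numbers depend entirely on the made-up base clues, but the
direction doesn't: the list-added clues raise the ham evidence while doing
almost nothing for the spam evidence, so the combined score falls.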