Ham or Spam? (was RE: [Spambayes] RE: Central Limit Theorem??!! :))

Greg Ward gward@python.net
Fri, 27 Sep 2002 14:07:35 -0400


On 27 September 2002, Tim Peters said:
> FYI, another preliminary observation is that the logarithmic central-limit
> scheme seems (in my data) to be very unsure (high sdevs away from both
> means) about:
> 
> + Msgs in German.
> + Msgs in French.
> + Msgs in Spanish.

Good, I think!  Two of my persistent FPs are:

  * a message from CIRA (the Canadian Internet Registration Authority),
    which is half in French
  * a one-line "subscribe" sent to <mumble>-request@python.org,
    with a loooong disclaimer in Spanish

Both of these are killed by perfectly ordinary, vanilla French and
Spanish words, because I get very little real email in French and none
in Spanish.  (Oddly enough, I get very little spam in French too.  In
fact, most of the spam that gets past SpamAssassin to me is in German,
which highlights another of SA's weaknesses and a *potential* strength
of the statistical approach.  I used to get a lot of spam in Portuguese,
but I added some SA tests that score messages from Brazil and Portugal a
bit higher.  Seems to have put a stop to that.)
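
To see how ordinary words end up as killer clues, here is a minimal sketch
of a Graham-style per-word estimate.  It is an assumption that the
classifier in this run works this way; the function name, the ham bias of
2, the 0.5 for unseen words, and the counts in the usage lines are all
illustrative.  Only the 0.01/0.99 clamps are taken from the probs listed in
this message.

```python
def word_spam_prob(ham_count, spam_count, n_ham, n_spam,
                   ham_bias=2.0, lo=0.01, hi=0.99):
    """Graham-style per-word spam probability, clamped to [lo, hi].

    ham_count / spam_count: messages of each kind containing the word.
    n_ham / n_spam: total messages of each kind in the training data.
    All parameter names and defaults are assumptions for illustration.
    """
    ham_ratio = min(1.0, ham_bias * ham_count / n_ham)
    spam_ratio = min(1.0, spam_count / n_spam)
    if ham_ratio + spam_ratio == 0:
        return 0.5  # word never seen: treat as neutral (assumed)
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(lo, min(hi, p))


# Hypothetical counts: a French word seen in 3 spams and in no ham at all.
french_word = word_spam_prob(ham_count=0, spam_count=3, n_ham=1000, n_spam=1000)

# Hypothetical counts: an ISP header token seen in 5 hams and in no spam.
isp_token = word_spam_prob(ham_count=5, spam_count=0, n_ham=1000, n_spam=1000)
```

With no ham at all in a language, even a handful of spams is enough to pin
every common word of that language at the upper clamp.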

I bet Italian would suffer the same fate except that I have a week's
worth of traffic from zope-it@zope.org in my ham collection.

Incidentally, here are the probs for the CIRA message -- interestingly,
this is one of the few messages in the gward-ham corpus that was *not*
sent to gward@python.net, so it still has many traces of my ISP in the
Received headers.  None of my spam went through my ISP, so all those
Received clues score 0.01 -- but they still weren't enough to save it.

Data/Ham/Set9/1014826921.10087.cthulhu.gerg.ca:2,S
prob = 1.0
prob('membres') = 0.01
prob('to:videotron.ca') = 0.01
prob('received:greg') = 0.01
prob('cher') = 0.01
prob('positions.') = 0.01
prob('received:videotron.ca') = 0.01
prob('header:Return-path:1') = 0.01
prob('received:bellnexxia.net') = 0.01
prob('nominations') = 0.01
prob('re:') = 0.01
prob('received:24.201') = 0.01
prob('received:pop.videotron.ca') = 0.01
prob('received:24.201.245') = 0.01
prob('avis') = 0.01
prob('received:24.201.245.36') = 0.01
prob('received:sc1.videotron.ca') = 0.01
prob('return-path:members') = 0.01
prob('received:vl-ms-mr002.sc1.videotron.ca') = 0.01
prob('des') = 0.9875
prob('sera') = 0.99
prob('comit\xe9') = 0.99
prob('qui') = 0.99
prob('enabling') = 0.99
prob('voter') = 0.99
prob('mai') = 0.99
prob('besoin') = 0.99
prob('mises') = 0.99
prob('apr\xe8s') = 0.99
prob('ligne') = 0.99
prob('mars') = 0.99
prob('candidat') = 0.99
prob('election') = 0.99
prob('vous') = 0.99
prob('aussi') = 0.99
prob('une') = 0.99
prob('trois') = 0.99
prob('entre') = 0.99
prob('ces') = 0.99
prob('avril') = 0.99
prob('leur') = 0.99
prob('directors.') = 0.99
prob('objet') = 0.99
prob('faire') = 0.99
prob('nominated') = 0.99
prob('personnes') = 0.99
prob('veuillez') = 0.99
prob('elected') = 0.99
prob('toute') = 0.99
prob('received:sympatico.ca') = 0.99
prob('possession,') = 0.99
prob('poser') = 0.99
prob('peut') = 0.99

Would be interesting to score just the English half of that message.
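
As a rough check on why 18 strong ham clues still lose, here is a sketch
that feeds the clue values listed above through Graham-style combining,
P = prod(p) / (prod(p) + prod(1 - p)).  That this run combines clues this
way is an assumption (the central-limit schemes under discussion combine
them differently), and the function name is made up for illustration.

```python
import math


def combine(probs):
    """Graham-style combining, done in log space to avoid underflow."""
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the score is effectively 0
        return 0.0
    # prod(p) / (prod(p) + prod(1 - p)) == 1 / (1 + exp(log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))


# Clue values copied from the CIRA listing: 18 at 0.01, one at 0.9875,
# and 33 at 0.99.
cira_clues = [0.01] * 18 + [0.9875] + [0.99] * 33
score = combine(cira_clues)
```

Under this rule each 0.01 clue cancels exactly one 0.99 clue, so the 18 ham
clues leave 15 clues at 0.99 and one at 0.9875 unopposed, which is far more
than enough to push the score to the printed 1.0.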

Oh, here are the probs for the "subscribe" request with Spanish
disclaimer:

Data/Ham/Set10/17rK7U-0007e7-00:2,S
prob = 1.0
prob('subject:subscribe') = 0.01
prob('confidencial') = 0.01
prob('autorizado') = 0.01
prob('contenida') = 0.01
prob('fon') = 0.01
prob('mensaje.') = 0.01
prob('message-id:skip:A 40') = 0.01
prob('est\xe1') = 0.99
prob('cual') = 0.99
prob('mensajes') = 0.99
prob('pueda') = 0.99
prob('hacer') = 0.99
prob('diego') = 0.99
prob('llegar') = 0.99
prob('informaci\xf3n') = 0.99
prob('alguna') = 0.99
prob('exclusivo') = 0.99
prob('empresa') = 0.99
prob('motivo') = 0.99
prob('telef\xf3nica') = 0.99
prob('gesti\xf3n') = 0.99
prob('haber') = 0.99
prob('received:167]') = 0.99
prob('propias') = 0.99
prob('ud.') = 0.99
prob('todas') = 0.99
prob('mismo,') = 0.99
prob('son') = 0.99
prob('personales') = 0.99
prob('pueden') = 0.99
prob('responsable') = 0.99

This message has got to win some sort of prize for the worst wheat-to-chaff
ratio: it takes 4.6 kB of signature, MIME junk, and disclaimer to say
"subscribe".  Here are the headers:

"""
Return-Path: <DQuevedo@uniFON.com.ar>
Envelope-To: python-list-request@python.org
Received: from [200.16.211.167] (helo=estcp.tcp.com.ar)
        by mail.python.org with esmtp (Exim 4.05)
        id 17rK7U-0007e7-00
        for python-list-request@python.org; Tue, 17 Sep 2002 11:18:32 -0400
Received: by noticias.unifon.com.ar with Internet Mail Service (5.5.2653.19)
        id <S6X3XHKB>; Tue, 17 Sep 2002 12:15:11 -0300
Message-ID: <A128D751272CD411BC9200508BC2194D019B6215@escpl.tcp.com.ar>
From: "Quevedo, Diego" <DQuevedo@uniFON.com.ar>
To: "'python-list-request@python.org'" <python-list-request@python.org>
Subject: subscribe
Date: Tue, 17 Sep 2002 12:15:33 -0300
Return-Receipt-To: "Quevedo, Diego" <DQuevedo@uniFON.com.ar>
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: multipart/alternative;
        boundary="----_=_NextPart_001_01C25E5D.1189FAE0"
"""

and here is the text/plain part:

"""
subscribe


Diego Quevedo
Gestión de Red
Tel: (54-11) 4324-9103
Cel: 15-5132-0135
dquevedo@unifon.com.ar
unifon








. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
ADVERTENCIA

La información contenida en este mensaje y cualquier archivo anexo al mismo,
son para uso exclusivo del destinatario y pueden contener información
confidencial o propietaria, cuya divulgación es sancionada por la ley.

Si Ud. No es uno de los destinatarios consignados o la persona responsable
de hacer llegar este mensaje a los destinatarios consignados, no está
autorizado a divulgar, copiar, distribuir o retener información (o parte de
ella) contenida en este mensaje. Por favor notifíquenos respondiendo al
remitente, borre el mensaje original y borre las copias (impresas o grabadas
en cualquier medio magnético) que pueda haber realizado del mismo.

Todas las opiniones contenidas en este mail son propias del autor del
mensaje y no necesariamente coinciden con las de Telefónica Comunicaciones
Personales S.A. o alguna empresa asociada.

Los mensajes electrónicos pueden ser alterados, motivo por el cual
Telefónica Comunicaciones Personales S.A. no aceptará ninguna obligación
cualquiera sea el resultante de este mensaje.

Muchas Gracias.
"""

And I'm not even showing the text/html part -- good *grief*!

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
Yield to temptation; it may not pass your way again.