From seandarcy at hotmail.com  Thu Apr  1 17:47:17 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Thu Apr  1 17:47:21 2004
Subject: [spambayes-dev] Dibbler.py error in training 
Message-ID: <BAY9-F553G494n8BxTL00005417@hotmail.com>

I'm  training using the web interface. When I click train I got the 
following:

500 Server error

Traceback (most recent call last):

  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in 
found_terminator
    getattr(plugin, name)(**params)

  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in 
onReview
    fromCache=True)

  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 200, in 
takeMessage
    types.StringsTypes):

NameError: global name 'types' is not defined


So, I used CVS today. Now I get:

500 Server error

Traceback (most recent call last):

  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in 
found_terminator
    getattr(plugin, name)(**params)

  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in 
onReview
    fromCache=True)

  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 213, in 
takeMessage
    fromcorpus.removeMessage(msg)

  File "/usr/lib/python2.3/site-packages/spambayes/FileCorpus.py", line 151, 
in removeMessage
    Corpus.Corpus.removeMessage(self, message, observer_flags)

  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 147, in 
removeMessage
    obs.onRemoveMessage(message, observer_flags)

  File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in 
onRemoveMessage
    if flags.find(NO_TRAINING_FLAG) < 0:

AttributeError: 'NoneType' object has no attribute 'find'


sean



From skip at pobox.com  Thu Apr  1 18:41:28 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Apr  1 18:41:42 2004
Subject: [spambayes-dev] Dibbler.py error in training 
In-Reply-To: <BAY9-F553G494n8BxTL00005417@hotmail.com>
References: <BAY9-F553G494n8BxTL00005417@hotmail.com>
Message-ID: <16492.43176.91868.294903@montanaro.dyndns.org>


    sean> So, I used CVS today. Now I get:
    ...
    sean>   File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in 
    sean> onRemoveMessage
    sean>     if flags.find(NO_TRAINING_FLAG) < 0:

    sean> AttributeError: 'NoneType' object has no attribute 'find'

This looks like a bug in onRemoveMessage().  I don't know what the meaning
of a flags value of None is supposed to be so I can't fix it, but it's clear
that the flags.find() call has to be conditional on flags not being None.
Tony added that code in the past week or so.  I trust he will know the
correct fix.
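
A minimal sketch of the guard described above, assuming (the thread has not
confirmed this at this point) that a flags value of None just means "no flags
were passed". The constant's value and the function name are placeholders,
not the actual storage.py code.

```python
# Hypothetical stand-in for the constant used in storage.py; the real
# value is not shown in this thread.
NO_TRAINING_FLAG = "n"


def training_allowed(flags):
    """Return True unless the 'no training' flag is present.

    Assumption: flags is either None (no flags passed) or a string of
    flag characters.
    """
    if flags is None:
        # Nothing was passed, so there is no flag to look for.
        return True
    return flags.find(NO_TRAINING_FLAG) < 0
```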

Skip

From skip at pobox.com  Fri Apr  2 11:33:38 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Apr  2 11:33:50 2004
Subject: [spambayes-dev] sb_bnfilter.py/sb_bnserver.py
In-Reply-To: <16492.45455.798964.179426@montanaro.dyndns.org>
References: <1080424390.4065.24.camel@porsche.hq.simlog.com>
	<16487.22766.448116.900778@montanaro.dyndns.org>
	<4068267B.5030206@videotron.ca>
	<16490.18079.594164.489427@montanaro.dyndns.org>
	<16492.45455.798964.179426@montanaro.dyndns.org>
Message-ID: <16493.38370.576646.51972@montanaro.dyndns.org>


    Skip> Toby's timeout changes coupled with a change to the PATH setting
    Skip> in my procmailrc file seem to have fixed the problems I was
    Skip> having.

I've been running Toby's sb_bnfilter.py with this setup since yesterday.
It's processed around 1500 messages with no hiccups (no procmail.log
messages so far).  This is looking good.  After a bit more exercise I think
we should consider it as a replacement for sb_filter.py.

Skip

From kennypitt at hotmail.com  Fri Apr  2 13:35:35 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Apr  2 13:36:51 2004
Subject: [spambayes-dev] Dibbler.py error in training
In-Reply-To: <16492.43176.91868.294903@montanaro.dyndns.org>
Message-ID: <BAY16-DAV60lgmbQe3O0000e51b@hotmail.com>

Skip Montanaro wrote:
>     sean> So, I used CVS today. Now I get:
>     ...
>     sean>   File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in
>     sean> onRemoveMessage
>     sean>     if flags.find(NO_TRAINING_FLAG) < 0: 
> 
>     sean> AttributeError: 'NoneType' object has no attribute 'find'
> 
> This looks like a bug in onRemoveMessage().  I don't know what the
> meaning of a flags value of None is supposed to be so I can't fix it,
> but it's clear that the flags.find() call has to be conditional on
> flags not being None.  Tony added that code in the past week or so.
> I trust he will know the correct fix.

I just checked in a fix for this.  flags=None was supposed to represent
that no flags had been passed.  I changed the code to use integer bit
values that could be OR'd together if we ever add more flags in the
future, and it now defaults to flags=0 for no flags.
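
As a rough illustration of the integer bit-flag approach, here is a sketch
with made-up names and values; it is not the code that was checked in. Only
NO_TRAINING_FLAG appears in the traceback above, and SOME_FUTURE_FLAG is
purely hypothetical.

```python
# Hypothetical flag values; the names and numbers are illustrative only.
NO_FLAGS = 0
NO_TRAINING_FLAG = 1 << 0
SOME_FUTURE_FLAG = 1 << 1  # placeholder for a flag that might be added later


def on_remove_message(message, flags=NO_FLAGS):
    """Return True if the message should be untrained on removal."""
    # A bitwise AND works for any integer, including the default 0,
    # so there is no None case left to guard against.
    return not (flags & NO_TRAINING_FLAG)
```

Flags combine with OR, for example NO_TRAINING_FLAG | SOME_FUTURE_FLAG, and
each one is tested independently with AND.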

This error only seems to occur if you retrain a message that was trained
into the wrong corpus (either correcting a training mistake, or
retraining a false positive or false negative with a train-on-everything
strategy).  onRemoveMessage is not called on the unknown corpus.

-- 
Kenny Pitt


From seandarcy at hotmail.com  Fri Apr  2 18:47:29 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Fri Apr  2 18:47:33 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY9-F58uCXLFSquj2g0006e2ad@hotmail.com>


>I just checked in a fix for this.  flags=None was supposed to represent
>that no flags had been passed.  I changed the code to use integer bit
>values that could be OR'd together if we ever add more flags in the
>future, and it now defaults to flags=0 for no flags.
>
>This error only seems to occur if you retrain a message that was trained
>into the wrong corpus (either correcting a training mistake, or
>retraining a false positive or false negative with a train-on-everything
>strategy).  onRemoveMessage is not called on the unknown corpus.

Updated from cvs. New error message:


Training...
500 Server error

Traceback (most recent call last):

  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in 
found_terminator
    getattr(plugin, name)(**params)

  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in 
onReview
    fromCache=True)

  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 209, in 
takeMessage
    if opt in notate_opt and \

AttributeError: 'NoneType' object has no attribute 'startswith'


Did I get the fix from cvs?  Maybe Sourceforge just didn't update it.

cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/spambayes co 
spambayes
U spambayes/contrib/sb_bnfilter.py
U spambayes/spambayes/Corpus.py
U spambayes/spambayes/FileCorpus.py
U spambayes/spambayes/storage.py


sean



From kennypitt at hotmail.com  Mon Apr  5 09:44:40 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr  5 09:45:56 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY16-DAV23x3G5aYY6000102e3@hotmail.com>

sean darcy wrote:
>> I just checked in a fix for this. [snip]
> 
> Updated from cvs. New error message:
> 
> 
> 
> Training...
> 500 Server error
> 
> Traceback (most recent call last):
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
>     461, in found_terminator getattr(plugin, name)(**params)
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
>     391, in onReview fromCache=True)
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
>     209, in takeMessage if opt in notate_opt and \
> 
> AttributeError: 'NoneType' object has no attribute 'startswith'
> 
> 
> Did I get the fix from cvs?  Maybe Sourceforge just didn't update it.
> 
> cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/spambayes
> co spambayes 
> U spambayes/contrib/sb_bnfilter.py
> U spambayes/spambayes/Corpus.py
> U spambayes/spambayes/FileCorpus.py
> U spambayes/spambayes/storage.py

Well, anonymous CVS does have a little bit of a delay but it looks like
you got all of the affected files, and it looks like this error is in a
different location.  I'll try to make time to take a look if someone
else doesn't beat me to it.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Mon Apr  5 10:02:35 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr  5 10:03:53 2004
Subject: [spambayes-dev] Dibbler.py error in training
In-Reply-To: <BAY16-DAV23x3G5aYY6000102e3@hotmail.com>
Message-ID: <BAY16-DAV30hi0inEiP000104cf@hotmail.com>

Kenny Pitt wrote:
>> 
>> Training...
>> 500 Server error
>> 
>> Traceback (most recent call last):
>> 
>>   File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
>>     461, in found_terminator getattr(plugin, name)(**params)
>> 
>>   File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
>>     391, in onReview fromCache=True)
>> 
>>   File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
>>     209, in takeMessage if opt in notate_opt and \
>> 
>> AttributeError: 'NoneType' object has no attribute 'startswith'
>> ...
> 
> Well, anonymous CVS does have a little bit of a delay but it looks
> like you got all of the affected files, and it looks like this error
> is in a different location.  I'll try to make time to take a look if
> someone else doesn't beat me to it.

I went ahead and took a look at this.  It was a different problem
accidentally introduced a little while ago while fixing a previous bug.
I checked in another fix in Corpus.py for it.  Look for revision 1.18 to
come through anon cvs.

We really appreciate these problem reports.  Everyone uses the software
differently so it is often impossible for the developers to catch
problems with every combination of option settings.

-- 
Kenny Pitt


From ta-meyer at ihug.co.nz  Mon Apr  5 18:48:56 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Apr  5 18:49:13 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BC9@its-xchg4.massey.ac.nz>


From ta-meyer at ihug.co.nz  Mon Apr  5 18:52:15 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Apr  5 18:52:33 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>

[Sorry about the blank one of these.  Did the shortcut for 'send' instead of
'save'].

The log file has a rather confusing message when incremental training is
*disabled*:

"""
SpamBayes: Watching (for incremental training) in 'Personal Folders/Spam'
"""

This gets logged because the spam folder is (from what I can tell) still
watched for incremental training, but when an item gets added to it, if the
train_manual_spam option is False, nothing happens.

Would it not be better to only add the hook if the train_manual_spam option
is True?  Or is there some other reason that the spam folder has to be
hooked?
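
Roughly what that could look like, as a sketch only: the function name and
the add_hook callable are hypothetical, and this assumes the hook has no
other purpose than incremental training (which is exactly the open question
above).

```python
import logging

log = logging.getLogger("spambayes.sketch")


def hook_spam_folder(folder_name, train_manual_spam, add_hook):
    """Hook the spam folder only when manual-spam training is enabled.

    add_hook is a hypothetical callable that registers the folder watcher.
    Returns True if the hook was installed.
    """
    if not train_manual_spam:
        # No hook and no log message when incremental training is off.
        return False
    add_hook(folder_name)
    log.info("Watching (for incremental training) in %r", folder_name)
    return True
```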

=Tony Meyer


From seandarcy at hotmail.com  Mon Apr  5 19:25:11 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Mon Apr  5 19:25:18 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY9-F30JKVApwOLZCH00031eeb@hotmail.com>


>I went ahead and took a look at this.  It was a different problem
>accidentally introduced a little while ago while fixing a previous bug.
>I checked in another fix in Corpus.py for it.  Look for revision 1.18 to
>come through anon cvs.

Did that. Sadly:


Training...
500 Server error

Traceback (most recent call last):

  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in 
found_terminator
    getattr(plugin, name)(**params)

  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in 
onReview
    fromCache=True)

  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 209, in 
takeMessage
    if (notate_opt is not None) and (opt in notate_opt) and \

AttributeError: 'NoneType' object has no attribute 'startswith'


>We really appreciate these problem reports.  Everyone uses the software
>differently so it is often impossible for the developers to catch
>problems with every combination of option settings.
>
>--
>Kenny Pitt

Thanks for the kind words - but I really appreciate you guys for doing all 
the work on this.

sean



From davejameson at comcast.net  Mon Apr  5 20:17:31 2004
From: davejameson at comcast.net (Dave Jameson)
Date: Mon Apr  5 20:19:58 2004
Subject: [spambayes-dev] spamBayes ideas
Message-ID: <LOEKJKCFHJKNNPMEMICKIEGOCBAA.davejameson@comcast.net>

Hello,
First let me say thank you for all your hard work on this project - it is
fantastic! I have recommended it to many people who have found it to be
everything I claimed ;-)

I am a product planner for a very large software project so hopefully my
ideas aren't too lame.


1.	I have noticed lately that many spammers are moving to add fake HTML tags
in the middle of the words to screw the parsers up, much in the same way
that people obfuscate their email addresses on web pages to beat the
spambots. (E.G. from a spam received today -
www.lif<kdhpzam>eisimpo</gortcxld>rtant.biz<br><br>). I was thinking if you
could database valid HTML tags (perhaps learned and pre-populated?) so that
new unknown tags would count as spam probability. This would primarily mean
inverting the way < > tags are handled compared to other words, that is
assuming spam, learning ham. In the above example <br> would be ham and the
others spam. You could even set a property file to allow x number of false
tags to score the whole email as spam. In the above example spam there were
11 fake tags.


2.	The last one is a bit fancy but here goes. On possible spam, measure the
recovered vs. bad and look at the scores. With an algorithm you should be
able to auto adjust the thresholds - just a thought.
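
To make idea 1 concrete, here is a small sketch that counts tags whose names
are not in a known list. The tag list, the regular expression, and the
function name are all assumptions made for illustration; this is not how the
SpamBayes tokenizer handles HTML.

```python
import re

# A deliberately tiny, hand-picked set of real HTML tag names; a real
# list would be much longer (or learned, as suggested above).
KNOWN_TAGS = {"a", "b", "br", "div", "font", "html", "body", "i", "img",
              "p", "span", "table", "td", "tr"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def count_unknown_tags(html):
    """Count tags whose names are not in KNOWN_TAGS."""
    return sum(1 for name in TAG_RE.findall(html)
               if name.lower() not in KNOWN_TAGS)


sample = "www.lif<kdhpzam>eisimpo</gortcxld>rtant.biz<br><br>"
print(count_unknown_tags(sample))  # prints 2
```

In the sample, <br> is recognised and the two made-up tags are counted, so a
per-message count like this could feed a score or a threshold.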


HTH,
Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040405/2bed38cc/attachment.html
From kennypitt at hotmail.com  Tue Apr  6 09:13:39 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr  6 09:14:59 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY16-DAV43wj3ErkXW00011169@hotmail.com>

sean darcy wrote:
>> I went ahead and took a look at this.  It was a different problem
>> accidentally introduced a little while ago while fixing a previous
>> bug. I checked in another fix in Corpus.py for it.  Look for
>> revision 1.18 to come through anon cvs.
> 
> Did that. Sadly:
> 
> Training...
> 500 Server error
> 
> Traceback (most recent call last):
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
>     461, in found_terminator getattr(plugin, name)(**params)
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
>     391, in onReview fromCache=True)
> 
>   File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
>     209, in takeMessage if (notate_opt is not None) and (opt in
> notate_opt) and \ 
> 
> AttributeError: 'NoneType' object has no attribute 'startswith'

Oops, looks like I misread the original error message.  The fix I put in
is probably a useful safeguard, but not the one that was causing the
problem.

In looking more closely, though, something seems a little odd here.  The
offending object that is coming back None appears to be the msg[header]
reference.  If I'm not mistaken, that means that either the Subject: or
To: header is missing entirely from the message, which is very unusual.
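
The behaviour in question can be reproduced with the standard email package.
The sketch below uses the modern Python 3 API rather than the Python 2.3 code
in the traceback, and the sample messages are made up.

```python
from email import message_from_string

# One message with an empty Subject: header, one with no Subject: at all.
empty_subject = message_from_string("From: a@example.com\nSubject:\n\nbody\n")
no_subject = message_from_string("From: a@example.com\n\nbody\n")

print(repr(empty_subject["Subject"]))  # an empty string
print(repr(no_subject["Subject"]))     # None

try:
    # Calling a string method on the missing header fails the same way
    # as in the traceback.
    no_subject["Subject"].startswith("spam")
except AttributeError as exc:
    print(exc)
```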

Could you, by chance, attach a copy of the message that is causing the
error?  A copy of it should appear as a file in one of the cache
directories below the directory containing your training database, or
you could just view the message source from Review Messages and
copy-and-paste it.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Tue Apr  6 09:37:21 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr  6 09:38:38 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>
Message-ID: <BAY16-DAV26h09qAavG00010f44@hotmail.com>

Tony Meyer wrote:
> The log file has a rather confusing message when incremental training
> is *disabled*:
> 
> """
> SpamBayes: Watching (for incremental training) in 'Personal
> Folders/Spam' """
> 
> This gets logged because the spam folder is (from what I can tell)
> still watched for incremental training, but when an item gets added
> to it, if the train_manual_spam option is False, nothing happens.
> 
> Would it not be better to only add the hook if the train_manual_spam
> option is True?  Or is there some other reason that the spam folder
> has to be hooked?

Haven't looked at the code to see if anything else is going on in the
hook function, but at the very least it seems like we should check the
train_manual_spam option and not generate the log message if incremental
training is disabled.

-- 
Kenny Pitt


From seandarcy at hotmail.com  Tue Apr  6 14:22:33 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Tue Apr  6 14:22:41 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY9-F12sWoHLwo6vLx00061c01@hotmail.com>


----Original Message Follows----
From: "Kenny Pitt" <kennypitt@hotmail.com>
To: "'sean darcy'" <seandarcy@hotmail.com>,<skip@pobox.com>
CC: <spambayes-dev@python.org>
Subject: RE: [spambayes-dev] Dibbler.py error in training
Date: Tue, 6 Apr 2004 09:13:39 -0400
...................................
>Oops, looks like I misread the original error message.  The fix I put in
>is probably a useful safeguard, but not the one that was causing the
>problem.
>
>In looking more closely, though, something seems a little odd here.  The
>offending object that is coming back None appears to be the msg[header]
>reference.  If I'm not mistaken, that means that either the Subject: or
>To: header is missing entirely from the message, which is very unusual.

It's not that unusual for the Subject header to be missing. Looking over 
past emails, I've found some "ham" posts that had no subject. In any event, 
some of the posts to be trained do have no Subject - all spam.

Here's an example from "tokens" on the untrained message page:

Tokens for: (none) (15)

Word 	Probability 	Times in ham 	Times in spam
content-type:text/plain 	0.288326 	1576 	556
from:addr:qziwpklwit 	- 	0 	0
from:addr:musician.org 	0.844828 	0 	1
from:no real name:2**0 	0.186886 	825 	165
to:none 	0.878691 	2 	14
cc:none 	0.351951 	979 	463
sender:none 	0.410456 	978 	593
reply-to:none 	0.271479 	746 	242
x-mailer:none 	0.417812 	832 	520
message-id:@mta13.srv.hcvlny.cv.net 	0.844828 	0 	1
header:Date:1 	0.500287 	1742 	1519
header:Received:3 	0.77726 	215 	654
header:Message-id:1 	0.907877 	144 	1238
header:From:1 	0.500718 	1739 	1519
header:Return-path:1 	0.940104 	95 	1302

Here's the message source:

Return-path: <qziwpklwit@musician.org>
Received: from mta13.srv.hcvlny.cv.net (mta13.srv.hcvlny.cv.net 
[167.206.5.82])
	by mstr9.srv.hcvlny.cv.net
	(iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003))
	with ESMTP id <0HVC00G0PB4QME@mstr9.srv.hcvlny.cv.net>; Mon,
	29 Mar 2004 08:36:26 -0500 (EST)
Received: from f94006.upc-f.chello.nl (f94006.upc-f.chello.nl [80.56.94.6])
	by mta13.srv.hcvlny.cv.net
	(iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003))
	with SMTP id <0HVC00ISEAU5TL@mta13.srv.hcvlny.cv.net>; Mon,
	29 Mar 2004 08:34:03 -0500 (EST)
Received: from 123.224.24.65 by 80.56.94.6 with qdtrhun [1
Date: Mon, 29 Mar 2004 08:34:03 -0500 (EST)
Date-warning: Date header was inserted by mta13.srv.hcvlny.cv.net
From: qziwpklwit@musician.org
Message-id: <0HVC00IM1B0CTL@mta13.srv.hcvlny.cv.net>
Content-transfer-encoding: 7BIT
X-Spambayes-Classification: unsure
X-Spambayes-Spam-Probability: 0.84
X-Spambayes-Level: ********
X-Spambayes-MailId: 1080858684-6

>Could you, by chance, attach a copy of the message that is causing the
>error?

The untrained message page has about 60 messages. How do I know which one is 
the problem?

>A copy of it should appear as a file in one of the cache
>directories below the directory containing your training database, or
>you could just view the message source from Review Messages and
>copy-and-paste it.

You've lost me. Here's my spambayes data directory:

ls
bayescustomize.ini      _pop3proxy.log            pop3proxy-spam-cache
bayescustomize.ini~     pop3proxy.log-1           pop3proxy-unknown-cache
bayescustomize.ini.bak  pop3proxy.log-evolution   spambayes.messageinfo.db
hammie.db               pop3proxy.log-evolution~  start.info
pop3proxy-ham-cache     pop3proxy.log-mozilla


When I grep for the odd "From" name I get nothing:
grep -R qziwpklwit  *

I'm looking for spam in all the wrong places.

>--
>Kenny Pitt


sean



From kennypitt at hotmail.com  Tue Apr  6 15:55:46 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr  6 15:57:06 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY16-DAV59vpFVjCRG00011536@hotmail.com>

sean darcy wrote:
>> In looking more closely, though, something seems a little odd here. 
>> The offending object that is coming back None appears to be the
>> msg[header] reference.  If I'm not mistaken, that means that either
>> the Subject: or To: header is missing entirely from the message,
>> which is very unusual. 
> 
> It's not that unusual for the Subject header to be missing. Looking
> over past emails, I've found some "ham" posts that had no subject. In
> any event, some of the posts to be trained do have  no Subject - all
> spam.   

Well, it's certainly not unusual for the Subject: header to be empty but
I didn't realize that it was legal to leave out the header entirely.
Guess I'll have to go back and re-read the spec! <wink>

Anyway, I checked in a new fix (Corpus.py 1.19) to guard against missing
headers, so give that a try when it comes through and let us know the
results.
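
For anyone following along, a guard of this kind might look like the sketch
below. The helper name and the sample prefix are hypothetical; this is not
the actual Corpus.py 1.19 change.

```python
from email import message_from_string


def header_starts_with(msg, header, prefix):
    """Return True only if the header exists and starts with prefix."""
    value = msg[header]  # None when the header is absent
    if value is None:
        return False
    return value.startswith(prefix)


msg = message_from_string("From: a@example.com\n\nbody\n")
print(header_starts_with(msg, "Subject", "spam,"))  # False
```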

>> Could you, by chance, attach a copy of the message that is causing
>> the error?
> 
> The untrained message page has about 60 messages. How do I know which
> one is the problem? 

Click the "Defer" heading to make sure that is the default for all
messages, then select a classification for only one message at a time to
see which one dies.  You can then go back to Review Messages and click
the subject of that message to display the message source.

>> A copy of it should appear as a file in one of the cache
>> directories below the directory containing your training database, or
>> you could just view the message source from Review Messages and
>> copy-and-paste it.
> 
> You've lost me. Here's my spambayes data directory:
> 
> ls
> bayescustomize.ini      _pop3proxy.log            pop3proxy-spam-cache
> bayescustomize.ini~     pop3proxy.log-1           pop3proxy-unknown-cache
> bayescustomize.ini.bak  pop3proxy.log-evolution   spambayes.messageinfo.db
> hammie.db               pop3proxy.log-evolution~  start.info
> pop3proxy-ham-cache     pop3proxy.log-mozilla

The pop3proxy-unknown-cache subdirectory contains copies of e-mails that
haven't been trained yet, up to the expiration age which I believe
defaults to 7 days.  No worries, though.  The message source you
included in the message was what I was interested in.
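
If you ever do want to hunt through a cache directory for this, a script
along these lines might help. It is only a sketch: it assumes each file in
the directory is a single plain-text message, which may not match how the
cache is actually stored (compressed files, for instance, are not handled).

```python
import os
from email import message_from_binary_file


def find_messages_missing_headers(cache_dir, headers=("Subject", "To")):
    """Yield (filename, missing_headers) for files in cache_dir.

    Assumption: each regular file in cache_dir holds one plain-text
    message; compressed or otherwise encoded cache files are not handled.
    """
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fp:
            msg = message_from_binary_file(fp)
        # A header lookup returns None when the header is absent.
        missing = [h for h in headers if msg[h] is None]
        if missing:
            yield name, missing
```

Calling list(find_messages_missing_headers("pop3proxy-unknown-cache")) from
the data directory would list the candidates.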

-- 
Kenny Pitt


From tameyer at ihug.co.nz  Wed Apr  7 02:59:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr  7 02:59:20 2004
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes
	Corpus.py, 1.18, 1.19
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305CE452B@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BE7@its-xchg4.massey.ac.nz>

> The real culprit seems to be msg[header], so check that for None 
> instead.  It seems odd for a message to be missing a Subject: or
> To: header, but this is spam after all and malformed messages are
> not unusual.

I believe this is the correct fix.  Sorry I didn't manage to look at this
previously (or just think of this case in the first place and allow for it
in the code I wrote), but I've been flat out this week.

Thanks heaps for doing the work to fix it Kenny :)

=Tony Meyer


From seandarcy at hotmail.com  Thu Apr  8 00:00:01 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Thu Apr  8 00:00:06 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID: <BAY9-F27Ijn04x29xDb0006c894@hotmail.com>


>----Original Message Follows----
>From: "Kenny Pitt" <kennypitt@hotmail.com>
>To: "'sean darcy'" <seandarcy@hotmail.com>,<skip@pobox.com>
>CC: <spambayes-dev@python.org>
>Subject: RE: [spambayes-dev] Dibbler.py error in training
>Date: Tue, 6 Apr 2004 15:55:46 -0400
>
>Anyway, I checked in a new fix (Corpus.py 1.19) to guard against missing
>headers, so give that a try when it comes through and let us know the
>results.

Tada! It worked.

Thanks for all the help.

sean



From ta-meyer at ihug.co.nz  Thu Apr  8 02:47:31 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Thu Apr  8 02:47:42 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305A4976A@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>

Sorry about the slow response.

> I've just installed it on my Win2K machine with Outlook 2000 
> and it fails upon Outlook startup. Message is pretty 
> unhelpful ("failed to load, close Outlook & restart", or 
> something similar). Needless to say, restart did not help. 
> I've unchecked it in the COM add-ins, re-checked but to no 
> avail.  So I've uninstalled it and reinstalled previous 
> version (09).  

I think that maybe this is to do with the version of Outlook that is used to
build the installer.  There is a comment somewhere that says that Outlook
2000 should be used, and I only have Outlook 2002, so used that.  Nothing
much else has changed with Outlook, so I'm fairly confident that that's the
problem.  It does mean that Mark needs to build the installer, though, or
someone else with OL2K.

Apart from the ActivePython problem (and assuming the above is correct), I
think that we're ready to put the release out.  I'm happy to do the build &
everything, if Mark is available to build the binaries (there's been a
lot of pywin32 activity lately, so he might have the time).

Mark (if you're reading this) - what are your thoughts about putting a
release out?  Are you able to use the code that Thomas posted to patch the
ActivePython pythoncom.dll, or do we just require a newer ActivePython
install to use new spambayes releases?  (Once the new ActivePython is out,
of course!).

> What do you need in order to analyse the problem?

About 2 extra hours in each day <wink>.

=Tony Meyer


From theller at python.net  Thu Apr  8 05:46:39 2004
From: theller at python.net (Thomas Heller)
Date: Thu Apr  8 05:46:48 2004
Subject: [spambayes-dev] Re: 1.0b1 Release candidates
References: <1ED4ECF91CDED24C8D012BCF2B034F1305A4976A@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>
Message-ID: <3c7eiyyo.fsf@python.net>

"Tony Meyer" <ta-meyer@ihug.co.nz> writes:

> Sorry about the slow response.
>
>> I've just installed it on my Win2K machine with Outlook 2000 
>> and it fails upon Outlook startup. Message is pretty 
>> unhelpful ("failed to load, close Outlook & restart", or 
>> something similar). Needless to say, restart did not help. 
>> I've unchecked it in the COM add-ins, re-checked but to no 
>> avail.  So I've uninstalled it and reinstalled previous 
>> version (09).  
>
> I think that maybe this is to do with the version of Outlook that is used to
> build the installer.  There is a comment somewhere that says that Outlook
> 2000 should be used, and I only have Outlook 2002, so used that.  Nothing
> much else has changed with Outlook, so I'm fairly confident that that's the
> problem.  It does mean that Mark needs to build the installer, though, or
> someone else with OL2K.
>
> Apart from the ActivePython problem (and assuming the above is correct), I
> think that we're ready to put the release out.  I'm happy to do the build &
> everything, if Mark is available to build the binaries (there's been a
> lot of pywin32 activity lately, so he might have the time).
>
> Mark (if you're reading this) - what are your thoughts about putting a
> release out?  Are you able to use the code that Thomas posted to patch the
> ActivePython pythoncom.dll,

... and should that code go into py2exe (maybe with an additional change
to py2exe so that it can specify the LCID for the resources) ...

> or do we just require a newer ActivePython
> install to use new spambayes releases?  (Once the new ActivePython is out,
> of course!).

IIUC, nobody needs the new ActiveState Python to release spambayes, but
the existing AS Python dll conflicts with the py2exe'd binaries.

(Besides: The existing AS Visual Python plugin for MS Visual Studio also
relies on the pywin32 registry entries, so it might take some time to
sort this out).

>
>> What do you need in order to analyse the problem?
>
> About 2 extra hours in each day <wink>.

Now that's a great idea - where can I get them ;-) ?

>
> =Tony Meyer

Thomas


From seandarcy at hotmail.com  Fri Apr  9 19:35:58 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Fri Apr  9 19:36:04 2004
Subject: [spambayes-dev] train on missing headers?
Message-ID: <BAY9-F30ParRKkcl0cu0004ec62@hotmail.com>

[ I posted this before, but it didn't show up. So if it does...]

Now that there's a fix for missing headers, I realize how much of my spam is 
in fact missing headers, esp. Subject headers. But when I look at clues, 
missing headers isn't one of them.  Most of this spam is classed as either 
unsure or ham. Maybe continued training will sort this out.

It seems to me that missing a header has lots of predictive value. Can this 
be incorporated in the spambayes tokens/clues?
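
Something like the following is what I have in mind, as a sketch only. The
"<header>:none" token form is borrowed from the to:none and cc:none entries
in the token listing I posted earlier; the function itself is hypothetical
and is not the SpamBayes tokenizer.

```python
from email import message_from_string


def missing_header_tokens(msg, headers=("subject", "to", "cc")):
    """Yield a '<header>:none' token for each listed header that is absent."""
    for header in headers:
        if msg[header] is None:  # header lookup is case-insensitive
            yield "%s:none" % header


msg = message_from_string("From: a@example.com\nTo: b@example.com\n\nbody\n")
print(list(missing_header_tokens(msg)))  # ['subject:none', 'cc:none']
```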

sean



From mhammond at skippinet.com.au  Fri Apr  9 21:02:14 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Apr  9 21:02:37 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>
Message-ID: <000601c41e97$771c60e0$0200a8c0@eden>

> Sorry about the slow response.

You should talk! :)

> I think that maybe this is to do with the version of Outlook
> that is used to
> build the installer.  There is a comment somewhere that says
> that Outlook
> 2000 should be used, and I only have Outlook 2002, so used
> that.  Nothing
> much else has changed with Outlook, so I'm fairly confident
> that that's the
> problem.

Unfortunately, the typelibs for office 2000 are hard-coded in a couple of
spots.  I fear that upgrading these to later typelibs will prevent SpamBayes
working at all for the older users.  No time to, or easy way to look at this
particular problem.

> It does mean that Mark needs to build the
> installer, though, or
> someone else with OL2K.

Which I am still struggling to do!  The binary builds and registers fine,
but silently fails to load when I start outlook.  But this time I am not
trying to do it 1 day before taking off, and will nail it :)  Most wolves
have moved away from my door, so I have a little time now.

> Apart from the ActivePython problem (and assuming the above
> is correct), I
> think that we're ready to put the release out.  I'm happy to
> do the build &
> everything, if Mark is available to build the binaries
> (there's been a
> lot of pywin32 activity lately, so he might have the time).

See above :)  I think for now I will stick with the original plan - manually
edit the resource string in the Python DLL we ship.

> Mark (if you're reading this) - what are your thoughts about putting a
> release out?  Are you able to use the code that Thomas posted
> to patch the
> ActivePython pythoncom.dll,

I'm sorry, but I seem to have missed that, and can't find it.  I've a
message or 2 from Thomas to catch up on next, but don't recall it being one
of them.

Mark.


From tim.one at comcast.net  Fri Apr  9 22:43:09 2004
From: tim.one at comcast.net (Tim Peters)
Date: Fri Apr  9 22:43:15 2004
Subject: [spambayes-dev] train on missing headers?
In-Reply-To: <BAY9-F30ParRKkcl0cu0004ec62@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEGDKCAB.tim.one@comcast.net>

[sean darcy]
> Now that there's a fix for missing headers, I realize how much of my
> spam is in fact missing headers, esp. Subject headers. But when I
> look at clues, missing headers isn't one of them.  Most of this spam
> is classed as either unsure or ham. Maybe continued training will
> sort this out.
>
> It seems to me that missing a header has lots of predictive value. Can
> this be incorporated in the spambayes tokens/clues?

You can set the option

[Tokenizer]
record_header_absence: True

to experiment with this.  I know it's helpful for me (or was, more than a
year ago, when I last tested it <wink/sigh>).


From mhammond at skippinet.com.au  Fri Apr  9 23:06:38 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Apr  9 23:06:57 2004
Subject: [spambayes-dev] RE: Incremental filtering and the spam folder
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>
Message-ID: <002001c41ea8$d8708bd0$0200a8c0@eden>

> Would it not be better to only add the hook if the
> train_manual_spam option
> is True?  Or is there some other reason that the spam folder has to be
> hooked?

I think you are correct - we could avoid the hook altogether.  However, I'm
still reluctant to change this, as it does risk breakage.  We can fix it
post 1.0.

Mark.


From matt at mondoinfo.com  Sat Apr 10 17:05:51 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Apr 10 17:08:47 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
Message-ID: <1081622811.78.614@mint-julep.mondoinfo.com>

I've lately been getting a bunch of spam that's almost entirely
nonsense except for a link or two. Perhaps not surprisingly,
SpamBayes hasn't been catching it all that well.

I could probably improve SpamBayes's performance by turning on more
header checks but on account of some peculiarities of my email, I'm
reluctant to do that. (I read various postmaster, webmaster, and ARIN
contact addresses that get almost nothing but spam but it's important
that I see what little legitimate mail goes to them.)

I don't remember who mentioned it here first, but it seemed to me
that adding a DNS lookup for URLs to the tokenizer would be a good
idea. There's hardly any limit to the number of domains a spammer can
register, but the number of networks that are willing to host a
spammer's website seems to be reasonably small. So I hacked the
tokenizer to generate tokens for the address that a URL in a message
resolves to. It generates four tokens for each address, stripping
values from the dotted-quad from right to left. That is, 10.1.2.3
would generate:

url-ip:10/8
url-ip:10.1/16
url-ip:10.1.2/24
url-ip:10.1.2.3/32

(I realize that that's not how networks are allocated these days, but
byte boundaries seemed as good an arbitrary place to make the cuts as
any other.)
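A minimal Python sketch of the token scheme described above, for anyone who wants to try it outside the tokenizer. The function name url_ip_tokens is an assumption made for illustration and is not part of SpamBayes; only the "url-ip:" token format comes from the message.

```python
def url_ip_tokens(dotted_quad):
    """Return the four byte-boundary tokens for an IPv4 address.

    Hypothetical helper: the name and the empty-list fallback for
    malformed input are assumptions, not SpamBayes behaviour.
    """
    octets = dotted_quad.split(".")
    if len(octets) != 4:
        return []
    # Keep the first i octets and label them with the prefix length 8*i.
    return ["url-ip:%s/%d" % (".".join(octets[:i]), 8 * i)
            for i in range(1, 5)]


if __name__ == "__main__":
    for token in url_ip_tokens("10.1.2.3"):
        print(token)
```

Each address always yields the same four tokens, so two unrelated domains hosted in the same /24 share three of them.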

A day's worth of unscientific testing suggested that it works pretty
well; the new tokens quickly started to show up in the classifier's
evidence.

So I set up buckets for a 5-way cross-validation set and ran
timcv.py. The only classification difference between the two runs is
that unsures dropped from 27 to 25. Here's the output from cmp.py for
those who can interpret it better than I can:


nodnss.txt -> dnss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.1 to 0.1 tied          

false negative percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied          
mean fn % went from 0.1 to 0.1 tied          

ham mean                     ham sdev
   0.27    0.22  -18.52%        3.16    2.51  -20.57%
   0.36    0.33   -8.33%        3.83    3.61   -5.74%
   0.68    0.66   -2.94%        7.28    7.21   -0.96%
   0.14    0.10  -28.57%        1.03    0.89  -13.59%
   0.31    0.30   -3.23%        2.54    2.54   +0.00%

ham mean and sdev for all runs
   0.35    0.32   -8.57%        4.13    3.97   -3.87%

spam mean                    spam sdev
  99.90   99.82   -0.08%        1.02    1.28  +25.49%
  99.74   99.83   +0.09%        2.99    1.98  -33.78%
  98.91   98.91   +0.00%        5.15    5.11   -0.78%
  98.39   98.44   +0.05%        9.37    9.35   -0.21%
  98.86   98.79   -0.07%        6.36    6.84   +7.55%

spam mean and sdev for all runs
  99.16   99.16   +0.00%        5.77    5.79   +0.35%

ham/spam mean difference: 98.81 98.84 +0.03


I suspect that the results would have been better if I had chosen
more recent spam. I think that I inadvertently chose the oldest spam
from my spam archive.

In case anyone would like to play with it, I'll append my trivial
patch. It requires pydns from:

http://sourceforge.net/projects/pydns/

I think that some lines may need to be un-wrapped by hand. The code is
governed by the option x-pick_apart_urls so you'll need to have that
turned on for it to work. If you want to do comparison testing, you'll
want that option turned on for both runs. You should note that while
an individual DNS lookup is pretty cheap, doing thousands of them
slows the test down a lot and may hammer your resolving nameserver
pretty hard.

I hacked it up in a way that suits me for testing only. Among the
things that ought to be changed if anyone wants it added to the
distributed code:

It should have its own option
The timeout should be configurable
The imports should be moved to a sane place <wink>

Regards,
Matt


*** tokenizer.py.orig   2004-04-10 12:13:20.000000000 -0500
--- tokenizer.py        2004-04-10 15:34:21.000000000 -0500
***************
*** 1052,1057 ****
--- 1052,1078 ----
              url = urllib.unquote(url)
              scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
  
+ 
+             import DNS
+             import DNS.Base
+             DNS.DiscoverNameServers()
+             r=DNS.DnsRequest(timeout=1)
+             try:
+               replies=r.req(netloc).answers
+             except DNS.Base.DNSError:
+               pass
+             else:
+               for reply in replies: # Should we limit to one A record?
+                 if reply["typename"]=="A":
+                   dottedQuad=reply["data"]
+                   pushclue("url-ip:%s/32" % dottedQuad)
+                   dottedQuadList=dottedQuad.split(".")
+                   pushclue("url-ip:%s/8" % dottedQuadList[0])
+                   pushclue("url-ip:%s.%s/16" % (dottedQuadList[0],dottedQuadList[1]))
+                   pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0],
+                     dottedQuadList[1],dottedQuadList[2]))
+ 
+ 
              # one common technique in bogus "please (re-)authorize yourself"
              # scams is to make it appear as if you're visiting a valid
              # payment-oriented site like PayPal, CitiBank or eBay, when you


From skip at pobox.com  Sat Apr 10 22:49:25 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sat Apr 10 22:49:29 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1081622811.78.614@mint-julep.mondoinfo.com>
References: <1081622811.78.614@mint-julep.mondoinfo.com>
Message-ID: <16504.45621.192141.340899@montanaro.dyndns.org>


    Matt> I don't remember who mentioned it here first, but it seemed to me
    Matt> that adding a DNS lookup for URLs to the tokenizer would be a good
    Matt> idea. There's hardly any limit to the number of domains a spammer
    Matt> can register, but the number of networks that are willing to host
    Matt> a spammer's website seems to be reasonably small. So I hacked the
    Matt> tokenizer to generate tokens for the address that a URL in a
    Matt> message resolves to. 

Matt,

Doesn't mine_received_headers work for you?  I've got lots of tokens in my
database like:

    received:65.248
    received:65.248.59
    received:65.248.59.178
    received:65.248.59.196
    received:65.248.59.35

which records all the possible fragments of the ip addresses through which
the mail moves.

Skip

From matt at mondoinfo.com  Sat Apr 10 23:19:29 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Apr 10 23:20:30 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <16504.45621.192141.340899@montanaro.dyndns.org>
References: <1081622811.78.614@mint-julep.mondoinfo.com>
	<16504.45621.192141.340899@montanaro.dyndns.org>
Message-ID: <1081652165.08.529@mint-julep.mondoinfo.com>

Dear Skip,

> Doesn't mine_received_headers work for you?  I've got lots of
> tokens in my database like:

>     received:65.248
>     received:65.248.59
>     received:65.248.59.178
>     received:65.248.59.196
>     received:65.248.59.35

> which records all the possible fragments of the ip addresses
> through which the mail moves.

I expect that it would help most of the time, but it's not what I
wanted to do. Some of the addresses that I read go through different
SMTP servers. In particular, there are two servers that receive mail
for a webmaster address, a postmaster address, and an ARIN contact
address that I read. Those addresses get almost nothing but spam, but
I need to get what little legitimate mail does get sent to them.
Using mine_received_headers, I'd have a very strong spam clue that
was really for the wrong reason. Whether that one clue would push the
legitimate mail that I get at those addresses into the wrong bucket
is hard for me to tell since I don't get enough legitimate mail sent
to them to be able to perform much of an experiment.

In addition, my unscientific poking at recent spam suggests to me
that spam is sent to my servers from a lot of different places. But
the sites spamvertized tend to be on a much smaller number of
networks. It seems that it's easier for a spammer to find a
compromised PC to relay through than it is for them to find someone
willing to host their site.

For example, looking through my logs for this evening, I find four
spams that advertise seemingly unrelated products but which have URLs
that resolve to addresses within the same /24 in China.

Regards,
Matt


From skip at pobox.com  Sat Apr 10 23:48:20 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sat Apr 10 23:48:23 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1081652165.08.529@mint-julep.mondoinfo.com>
References: <1081622811.78.614@mint-julep.mondoinfo.com>
	<16504.45621.192141.340899@montanaro.dyndns.org>
	<1081652165.08.529@mint-julep.mondoinfo.com>
Message-ID: <16504.49156.490243.142533@montanaro.dyndns.org>


    Matt> In particular, there are two servers that receive mail for a
    Matt> webmaster address, a postmaster address, and an ARIN contact
    Matt> address that I read. Those addresses get almost nothing but spam,
    Matt> but I need to get what little legitimate mail does get sent to
    Matt> them.  Using mine_received_headers, I'd have a very strong spam
    Matt> clue that was really for the wrong reason. Whether that one clue
    Matt> would push the legitimate mail that I get at those addresses into
    Matt> the wrong bucket is hard for me to tell since I don't get enough
    Matt> legitimate mail sent to them to be able to perform much of an
    Matt> experiment.

Unless those messages are extremely short, I doubt it would matter much.
It's going to be just one clue among many.  I have no trouble getting the
occasional good mail from the pychecker mailing list, which gets almost
nothing but spam these days.

    Matt> It seems that it's easier for a spammer to find a compromised PC
    Matt> to relay through than it is for them to find someone willing to
    Matt> host their site.

In which case I doubt either of these network ip classification schemes will
have much effect.

Skip

From sethg at GoodmanAssociates.com  Sun Apr 11 00:05:47 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Apr 11 00:05:49 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
Message-ID: <MHEGIFHMACFNNIMMBACAGEKAHLAA.sethg@GoodmanAssociates.com>

I have heard before that certain types of checks, such as the name of file
attachments, are not possible in the Outlook plug-in since the data format
used was pre-RFC2822 and Outlook, in fact, destroys some of the MIME
structure necessary to see these things.  I have also heard that later
versions of Outlook are more RFC2822 compliant.  I have a few questions here
that have probably been discussed among the developers already.

1) How RFC2822 compliant is the stored message format in the various
versions of Outlook subsequent to Outlook2000?  Without even looking at the
code I can tell that Outlook2000 is not compliant due to the total absence
of a References: header, which causes many people real problems who view
mailing lists by conversation thread.

2) If the later versions of Outlook are more (or perhaps even fully?)
RFC2822 compliant, would it be possible to detect the Outlook version and
enable generating the additional tokens that are available with the web
proxy?

I realize this is not a simple matter.  I was just wondering how far we are
from a more unified code base.

--

Seth Goodman


From tim.one at comcast.net  Sun Apr 11 00:51:50 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sun Apr 11 00:51:57 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
In-Reply-To: <MHEGIFHMACFNNIMMBACAGEKAHLAA.sethg@GoodmanAssociates.com>
Message-ID: <E1BCWwl-0007BC-Bx@mail.python.org>

[Seth Goodman]
> I have heard before that certain types of checks, such as the name of file
> attachments, are not possible in the Outlook plug-in

It's not that this is impossible, it's that nobody has written
Outlook-specific code necessary to do it.

> since the data format used was pre-RFC2822 and Outlook, in fact, destroys
> some of the MIME structure necessary to see these things.

Outlook destroys all MIME structure.  Our parser understands only MIME
structure.

> I have also heard that later versions of Outlook are more RFC2822
> compliant.

Outlook keeps getting better at both accepting and creating standard email,
but it doesn't store email in this format.  Our Outlook addin sees email in
the way Outlook stores it.

> I have a few questions here that have probably been discussed among the
> developers already.
>
> 1) How RFC2822 compliant is the stored message format in the various
> versions of Outlook subsequent to Outlook2000?

Outlook's storage format has nothing to do with any Internet standard
(regardless of Outlook version).  It's possible to get the original headers
as a blob of text from the Outlook message store (and we do), but that's all
of the original MIME structure Outlook preserves.

...

> 2) If the later versions of Outlook are more (or perhaps even fully?)
> RFC2822 compliant, would it be possible to detect the Outlook version and
> enable generating the additional tokens that are available with the web
> proxy?

If the antecedent were true, yes <wink>.

> I realize this is not a simple matter.  I was just wondering how far we
> are from a more unified code base.

Tokenizing anything beyond what the Outlook addin can tokenize now will
require new Outlook-specific code.


From sethg at GoodmanAssociates.com  Sun Apr 11 01:45:44 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Apr 11 01:45:46 2004
Subject: FW: [spambayes-dev] various Outlook version and RFC2822 compliance
Message-ID: <MHEGIFHMACFNNIMMBACAGEKIHLAA.sethg@GoodmanAssociates.com>

> From: Tim Peters
> Sent: Saturday, April 10, 2004 11:52 PM
>
>
> [Seth Goodman]

<...>

> > I have also heard that later versions of Outlook are more RFC2822
> > compliant.
>
> Outlook keeps getting better at both accepting and creating
> standard email,
> but it doesn't store email in this format.  Our Outlook addin
> sees email in
> the way Outlook stores it.

Those filthy buggers.  You'd think with the rest of the world using RFC2822,
or at least trying to, these guys would relent and store the messages in
that format so that just in case they ever wanted to use a message for
anything later, it would all be there.  But noooooo!


> Tokenizing anything beyond what the Outlook addin can tokenize now will
> require new Outlook-specific code.

Sounds like a lot of work and an undocumented, moving target.  A recipe for
a mess.

Too bad I'm so habituated to this mail client (and you guys have done such a
fine job of integrating SpamBayes into it).  Maybe OpenOffice will
eventually make an Outlook look-alike, maybe even with an RFC2822 storage
option.  With open source code, you could actually see how the internals
worked instead of reverse engineering it.  But for better or worse, Outlook
will probably remain the "standard" for many years to come.

--

Seth Goodman


From sethg at GoodmanAssociates.com  Sun Apr 11 01:46:51 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Apr 11 01:46:52 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
Message-ID: <MHEGIFHMACFNNIMMBACAKEKIHLAA.sethg@GoodmanAssociates.com>

> From: Skip Montanaro
> Sent: Saturday, April 10, 2004 10:48 PM
>

<...>

>     Matt> It seems that it's easier for a spammer to find a compromised PC
>     Matt> to relay through than it is for them to find someone willing to
>     Matt> host their site.
>
> In which case I doubt either of these network ip classification
> schemes will
> have much effect.

I don't know, Matt may have a point here.  I've been getting a lot of salad
spams that mostly end up in the Unsure folder and tend to score somewhat
neutral.  Many of them do not even use real words to dilute the sales pitch,
they use random combinations of letters separated by white space so there
are relatively few significant tokens.  It's not the smartest strategy, but
I've seen quite a bit of it.  In such cases, could a strong spam clue, such
as the netblock of a spamvertised web site, possibly push it from Unsure
into Spam?  I don't have a feel for Chi-squared combining so this is a
question, not an assertion.
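For readers without a feel for chi-squared combining, here is a minimal Python sketch of the idea, so the effect of one strong clue on a message with few significant tokens can be tried directly. It is a simplification: the function names are assumptions, the clue probabilities in the example are made up, and the real classifier also guards against floating-point underflow and limits which clues it uses.

```python
import math


def chi2Q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-clue spam probabilities (each strictly between 0 and 1)
    into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # S is the spam evidence, H the ham evidence; both use 2n degrees
    # of freedom.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0


if __name__ == "__main__":
    neutral = [0.5] * 5                 # made-up "salad" clues
    print(chi_combine(neutral))         # no evidence either way
    print(chi_combine(neutral + [0.97]))  # add one strong spam clue
```

With only a handful of neutral clues, a single strong clue moves the score noticeably; whether it moves it past the spam cutoff depends on the cutoff and on the other clues.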

I agree with Matt that because of the huge number of compromised windows
boxes with cable modems on providers (like Comcast) that do not restrict
outgoing port 25 connections to their smarthost, the chance of getting two
spams from the same compromised box are almost nil.  Even if you fragment
the header IP addresses in the same way that Matt suggests (maybe you
already do?), the sheer size of IP address space allocated to dynamic IP
pools at major providers is orders of magnitude larger than the IP space of
hosting services willing to host sites for enlargement products.  It seems
that the hosting service IP's are more likely to generate strong spam clues
than the source IP's of the compromised windows boxes.  Whether this would
ultimately make enough of a difference, I don't know.

--

Seth Goodman


From skip at pobox.com  Sun Apr 11 08:25:15 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Apr 11 08:25:26 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <MHEGIFHMACFNNIMMBACAKEKIHLAA.sethg@GoodmanAssociates.com>
References: <MHEGIFHMACFNNIMMBACAKEKIHLAA.sethg@GoodmanAssociates.com>
Message-ID: <16505.14635.741313.773689@montanaro.dyndns.org>


    Matt> It seems that it's easier for a spammer to find a compromised PC
    Matt> to relay through than it is for them to find someone willing to
    Matt> host their site.

    Skip> In which case I doubt either of these network ip classification
    Skip> schemes will have much effect.

    Seth> I don't know, Matt may have a point here.  I've been getting a lot
    Seth> of salad spams ....  In such cases, could a strong spam clue, such
    Seth> as the netblock of a spamvertised web site, possibly push it from
    Seth> Unsure into Spam?  

Sure, if there are few tokens, one extra token may have a large enough
effect.  That wasn't the case I was referring to.  Matt was worried about
losing the occasional good message in a sea of spam on a few important
mailing lists.  If those good messages are fairly typical (or if he's
trained on a few of them), there are probably plenty of hammy tokens in each
one, in which case throwing in a netblock isn't going to add much.

    Seth> Even if you fragment the header IP addresses in the same way that
    Seth> Matt suggests (maybe you already do?), the sheer size of IP
    Seth> address space allocated to dynamic IP pools at major providers is
    Seth> orders of magnitude larger than the IP space of hosting services
    Seth> willing to host sites for enlargement products.  

Yes, I believe mine_received_headers does fragment in the same way as Matt's
scheme (minus the /(8,16,24,32) suffix which I think is superfluous), which
was why I mentioned it in the first place.

I think with mine_received_headers enabled we're already collecting the same
information (actually more in most instances, since all Received: headers
are parsed).  Here are some examples gotten using spamcounts (post-sorted by
the spam prob) from my current database.

* mail.python.org (slightly hammy):

    % spamcounts -r 'received:12.155'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:12.155,269,387,0.40438528783
    received:12.155.117,269,387,0.40438528783
    received:12.155.117.29,269,387,0.40438528783

* pobox.com, main relay for most of my mail (again, mostly mildly
  hammy, though with some outliers):

    % spamcounts -r 'received:(208\.58|207\.8)'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:208.58.216,0,1,0.155172413793
    received:208.58.216.73,0,1,0.155172413793
    received:207.8.226.3,66,92,0.412197950796
    received:207.8.214.3,67,93,0.413216308473
    received:207.8.214,73,98,0.42129893514
    received:208.58.1.193,87,116,0.422927556996
    received:207.8,208,269,0.430284644233
    received:207.8.226,135,171,0.435415990012
    received:208.58,193,239,0.440949675391
    received:208.58.1,193,238,0.441982563175
    received:208.58.1.194,99,118,0.450429768447
    received:207.8.226.2,69,79,0.460422504704
    received:208.58.1.197,5,5,0.494310099573
    received:207.8.214.2,6,5,0.53799693756
    received:208.58.1.198,4,1,0.771713070997

* mail.mojam.com, where my mail eventually winds up (mildly spammy because I
  get lots of non-skip@mojam.com stuff there which is primarily spam):

    % spamcounts -r 'received:199.249'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:199.249.165.21,0,1,0.155172413793
    received:199.249.165.25,0,1,0.155172413793
    received:199.249,90,55,0.614718002838
    received:199.249.165,90,55,0.614718002838
    received:199.249.165.175,90,54,0.619037063122

Now I cheat and just sort all received: features by spam prob.  The highest
is 

    received:69.6,7,0,0.969798657718
    received:biz,7,0,0.969798657718

(perhaps not surprising).  Looking up some of the individual addresses in
the 69.6 block yields a bunch of "host not found" responses.  Also, not all
that surprising.

Looking at the other end of the spectrum, I see

    received:66.163,0,6,0.0348837209302

The ip's I have in that block refer to Yahoo's mail servers.  This suggests
to me they do a pretty good job keeping their relays closed to abuse.
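The spam prob column for one-sided tokens such as received:69.6 (7 spam, 0 ham) can be reproduced with the usual Bayesian adjustment toward an unknown-word prior. The sketch below assumes a prior strength of 0.45 and a prior probability of 0.5; the function name is illustrative, and the 1000/1000 message totals are placeholders since the real totals for the database are not given here. For tokens seen only in spam or only in ham the totals cancel out, which is why those rows can be checked at all.

```python
def adjusted_spam_prob(nspam, nham, total_spam, total_ham, s=0.45, x=0.5):
    """Spam probability for one token, shrunk toward the prior x.

    s is the weight given to the prior, x the probability assumed for a
    token never seen before. Both defaults are assumptions here.
    """
    spam_ratio = nspam / total_spam
    ham_ratio = nham / total_ham
    p = spam_ratio / (spam_ratio + ham_ratio)   # raw estimate
    n = nspam + nham                            # evidence behind it
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # (nspam, nham) pairs taken from the listings above.
    for nspam, nham in [(7, 0), (0, 6), (0, 1)]:
        print(nspam, nham, adjusted_spam_prob(nspam, nham, 1000, 1000))
```

The adjustment is why a token seen once in ham scores about 0.155 rather than 0.0: a single sighting is weak evidence, so the prior still carries weight.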

    Seth> It seems that the hosting service IP's are more likely to generate
    Seth> strong spam clues than the source IP's of the compromised windows
    Seth> boxes.  Whether this would ultimately make enough of a difference,
    Seth> I don't know.

Of course, whether or not this helps on any given message depends to a large
degree on how many other features the tokenizer extracts from the message.

Switching gears a bit, I suspect we could probably toss out the
received:N.N.N.N and received:N.N.N features and not lose much in the way of
accuracy since all but a few of them are hapaxes.  

    feature pattern             total           hapaxes
    ---------------             -----           -------
    received:N                   177             77 (44%)
    received:N.N                1606           1228 (76%)
    received:N.N.N              2140           1927 (90%)
    received:N.N.N.N            2548           2362 (93%)

Perhaps the same holds true for hostname-based features (received:biz,
received:creosote.python.org, etc), though it's less clear cut.  Perhaps
none of them are worth keeping:

    feature pattern             total           hapaxes
    ---------------             -----           -------
    received:a                   320            257 (80%)
    received:a.a                1046            867 (83%)
    received:a.a.a              1222           1062 (87%)
    received:a.a.a.a             682            609 (89%)

The above data are from my database which currently contains 102863 tokens.
If I removed all the three- and four-component received: features I'd reduce
the database size by about six percent.
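A rough Python sketch of how tables like the two above could be tallied from token counts. It assumes a hapax is a token seen exactly once in training (nspam + nham == 1) and that the counts are available as a dict mapping token to (nspam, nham); the function names and that input shape are assumptions for illustration, not the script actually used.

```python
from collections import Counter


def feature_pattern(token):
    """Map 'received:12.155.117' to 'received:N.N.N' and
    'received:creosote.python.org' to 'received:a.a.a'."""
    prefix, _, value = token.partition(":")
    parts = value.split(".")
    letter = "N" if all(part.isdigit() for part in parts) else "a"
    return prefix + ":" + ".".join(letter for _ in parts)


def hapax_table(counts):
    """Return {pattern: (total tokens, hapaxes)}."""
    totals, hapaxes = Counter(), Counter()
    for token, (nspam, nham) in counts.items():
        pattern = feature_pattern(token)
        totals[pattern] += 1
        if nspam + nham == 1:   # assumed definition of a hapax
            hapaxes[pattern] += 1
    return {pattern: (totals[pattern], hapaxes[pattern])
            for pattern in totals}


if __name__ == "__main__":
    sample = {
        "received:208.58.216": (0, 1),
        "received:208.58.216.73": (0, 1),
        "received:208.58": (193, 239),
        "received:207.8": (208, 269),
        "received:biz": (7, 0),
    }
    for pattern, (total, hapax) in sorted(hapax_table(sample).items()):
        print("%-20s %5d %5d" % (pattern, total, hapax))
```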

I'll restate my question.  What does Matt's proposal do that
mine_received_headers doesn't do already?

Skip

From matt at mondoinfo.com  Sun Apr 11 12:28:49 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Apr 11 12:30:00 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <16504.49156.490243.142533@montanaro.dyndns.org>
References: <1081622811.78.614@mint-julep.mondoinfo.com>
	<16504.45621.192141.340899@montanaro.dyndns.org>
	<1081652165.08.529@mint-julep.mondoinfo.com>
	<16504.49156.490243.142533@montanaro.dyndns.org>
Message-ID: <1081656741.01.529@mint-julep.mondoinfo.com>

Dear Skip,

> Unless those messages are extremely short, I doubt it would matter
> much. It's going to be just one clue among many.  I have no trouble
> getting the occasional good mail from the pychecker mailing list,
> which gets almost nothing but spam these days.

Thanks for the clue. I'll give it a try.

>> It seems that it's easier for a spammer to find a compromised PC
>> to relay through than it is for them to find someone willing to
>> host their site.

> In which case I doubt either of these network ip classification
> schemes will have much effect.

Sorry for not being clear. What I should have mentioned earlier is
that it doesn't seem to me that an unusual amount of spam comes from
the networks that host spammers' websites. I don't think that
mine_received_headers and the scheme I'm testing will generate much
of the same data.

In the last 24 hours, I've had 29 spams for which SpamBayes's
classifier used as evidence URLs' IPs in 202/8, 218/8, 219/8, and
221/8. On the ham side, the IP for mail.python.org has figured in
evidence for 15 hams.

Spammers seem to be limited in their choice of networks for hosting,
but they can't know what networks the URLs that you or I get in ham
messages will resolve to. In that respect, those IPs fit well with
what SpamBayes does: spammers have a constrained spam "vocabulary"
and can't know a random individual's limited ham "vocabulary".

Regards,
Matt


From Arnold.Lou829 at rogers.com  Sun Apr 11 12:41:45 2004
From: Arnold.Lou829 at rogers.com (Lou Arnold)
Date: Sun Apr 11 12:38:07 2004
Subject: [spambayes-dev] Compatibility with Norton Antivirus
Message-ID: <FBEKJBMGJJDKLDKIMDFBCEAFCAAA.Arnold.Lou829@rogers.com>

I have installed Norton Anti-Virus (NAV) to secure my incoming email. As I
understand things, it has a POP3 Proxy that sits between the ISP mail server
and my MS Outlook 2000.

Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV?


From pje at telecommunity.com  Sun Apr 11 12:55:05 2004
From: pje at telecommunity.com (Phillip J. Eby)
Date: Sun Apr 11 12:55:05 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <E1BCe1i-00085C-1K@mail.python.org>
Message-ID: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>

At 08:25 AM 4/11/04 -0400, spambayes-dev-request@python.org wrote:
>I'll restate my question.  What does Matt's proposal do that
>mine_received_headers doesn't do already?

It looks at URLs embedded in the message *body*.  As a simple contrast, if 
I link here to:

http://enlarge-my-spam.com?id=123456

That will produce a very *different* set of IP tokens than the Received: 
headers of this message.  And, if the same spam is sent from a thousand 
compromised PC's, they will all still have the same URL IP cues, despite 
lacking any Received: headers in common.  Yes, they'll also have tokens 
representing parts of the domain name, but spammers can cheaply change 
their domain names to avoid being recognized.

Their website IP addresses are not only harder to change, but take 
advantage of the fact that so-called "bulletproof hosting" providers are a 
"bad neighborhood" for links.  So, if you train on these tokens, then you 
could potentially nail entirely unrelated spammers who simply host with the 
same ISP.

Of course, the spammers' next move would likely be to use redirects from 
non-"bulletproof" hosts, but everything we can do to make it more difficult 
and more costly for them is a good thing.
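To make the contrast with Received: headers concrete, here is a small Python sketch that pulls the hostnames out of URLs in a message body; those hostnames are what a DNS lookup step would then resolve. The regular expression and the function name are illustrative assumptions and far cruder than a real tokenizer's URL handling. No lookup is performed here.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http/https URLs up to whitespace or a quote.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def body_url_hosts(body):
    """Return the distinct hostnames of URLs found in body, in order."""
    hosts = []
    for match in URL_RE.finditer(body):
        try:
            host = urlsplit(match.group(0)).hostname
        except ValueError:
            continue   # malformed URL, skip it
        if host and host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(body_url_hosts("Click http://enlarge-my-spam.com?id=123456 today"))
```

A thousand copies of this body sent through a thousand different relays all yield the same hostname list, whatever their Received: headers say.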


From sethg at GoodmanAssociates.com  Sun Apr 11 14:33:53 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Apr 11 14:33:54 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
Message-ID: <MHEGIFHMACFNNIMMBACACEKOHLAA.sethg@GoodmanAssociates.com>

> From: Phillip J. Eby
> Sent: Sunday, April 11, 2004 11:55 AM
>
>
> At 08:25 AM 4/11/04 -0400, spambayes-dev-request@python.org wrote:
> >[Skip]
> >I'll restate my question.  What does Matt's proposal do that
> >mine_received_headers doesn't do already?
>
> It looks at URLs embedded in the message *body*.  ...

That's _exactly_ what I was getting at.  mine_received_headers only looks at
headers, which don't contain the IPs of spamvertised sites.  Much of, if
not most, spam today comes direct-to-MX from compromised Windows boxes
operating on broadband, dynamic IP connections from providers that don't
limit customers' use of outgoing port 25 connections.

The theory, if it is worth anything, is that the total size of the IP
address space for "bad-boy" hosting service web-servers is puny compared
with the dynamic IP pools of major providers who do not block outgoing port
25 connections.  Having the token database learn the former is feasible,
while having it learn the latter is pretty hopeless.

For exactly the same reason, I would guess that the message source IP is
probably better at identifying ham than spam.  For this property alone, it
is extremely valuable.  My friends' tendency to use an occasional spammy
word is partially offset by the strong ham clues from their outgoing MTA IP
and their personal email address.  In terms of detecting spam, the token
database does a great job at detecting repetitive spam sources, but is
somewhat ill-suited for the dynamic IP phenomenon.  Rather than have the
token database learn to be a mediocre dynamic IP blacklist, it would
probably be better to use a proxy to query a real dynamic IP blacklist and
add a header for SpamBayes to mine.  However, that's outside the scope of
SpamBayes.

--

Seth Goodman


From skip at pobox.com  Sun Apr 11 18:58:30 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Apr 11 18:58:35 2004
Subject: [spambayes-dev] Compatibility with Norton Antivirus
In-Reply-To: <FBEKJBMGJJDKLDKIMDFBCEAFCAAA.Arnold.Lou829@rogers.com>
References: <FBEKJBMGJJDKLDKIMDFBCEAFCAAA.Arnold.Lou829@rogers.com>
Message-ID: <16505.52630.74591.355838@montanaro.dyndns.org>


    Lou> I have installed Norton Anti-Virus (NAV) to secure my incoming
    Lou> email. As I understand things, it has a POP3 Proxy that sits
    Lou> between the ISP mail server and my MS Outlook 2000.

    Lou> Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV?

If you configure things correctly, it should work just fine.  Your setup
might look like this:

    real         Spambayes        NAV              Your
    POP3   <---> POP3 proxy <---> POP3 proxy <---> email
    server       server           server           client

Let's pick some hypothetical names.  Your machine is "localhost".  Your real
POP3 server is mail.myisp.com.  Configure Spambayes to get mail from
mail.myisp.com on port 110 and listen to port 110 on localhost.  Configure
NAV's proxy server to get mail from Spambayes on port 110 of localhost and
listen to port 1110 on localhost.  Configure your email client to get mail
from port 1110 on localhost.
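
The port arithmetic is easy to get wrong, so here is a small sketch that
writes the chain down as data and checks that each hop talks to the one
before it.  The tuple layout and the check_chain helper are purely
illustrative; they are not part of Spambayes or NAV.

```python
# Each hop: (name, upstream_host, upstream_port, listen_port).
# Hosts and ports follow the hypothetical names used above.
CHAIN = [
    ("Spambayes POP3 proxy", "mail.myisp.com", 110, 110),
    ("NAV POP3 proxy", "localhost", 110, 1110),
]
CLIENT_PORT = 1110  # the port the email client connects to on localhost

def check_chain(chain, client_port):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    listening = set()
    prev_listen = None
    for name, up_host, up_port, listen in chain:
        if listen in listening:
            problems.append("%s: port %d is already taken on localhost"
                            % (name, listen))
        listening.add(listen)
        # Every hop after the first must read from the previous hop.
        if prev_listen is not None and (up_host, up_port) != ("localhost", prev_listen):
            problems.append("%s: should read from localhost:%d"
                            % (name, prev_listen))
        prev_listen = listen
    if client_port != prev_listen:
        problems.append("client should connect to localhost:%s" % prev_listen)
    return problems

chain_problems = check_chain(CHAIN, CLIENT_PORT)
```

Swapping the order of the two proxies only means swapping the two entries
and adjusting the ports so the check still passes.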

The reason I suggest placing Spambayes ahead of NAV is that we've seen
situations where NAV popped up a dialog box seeking input from the user
which the user apparently didn't see (maybe it didn't get raised to the top
of the stack of windows).  If NAV is upstream from Spambayes everything
grinds to a halt and it looks like Spambayes has hung.  By placing Spambayes
between the real POP3 server and NAV at least its web interface should still
be responsive.

Maybe it's a useless distinction.  It should work either way.  You might
have to configure things so NAV comes first if it can't connect to localhost
on a port other than 110.

Skip


From skip at pobox.com  Sun Apr 11 19:09:49 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Apr 11 19:09:54 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
References: <E1BCe1i-00085C-1K@mail.python.org>
	<5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
Message-ID: <16505.53309.360402.514258@montanaro.dyndns.org>


    >> I'll restate my question.  What does Matt's proposal do that
    >> mine_received_headers doesn't do already?

    Phillip> It looks at URLs embedded in the message *body*.  As a simple
    Phillip> contrast, if I link here to:

    Phillip> http://enlarge-my-spam.com?id=123456

    Phillip> That will produce a very *different* set of IP tokens than the
    Phillip> Received: headers of this message.  

Ah, okay.  I missed that in Matt's post.  If the tokenizer's
x-pick_apart_urls option is True, it picks apart URLs embedded in the body
of the message.  It's not as ip-centered as Matt's code.

Skip

From matt at mondoinfo.com  Sun Apr 11 19:25:11 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Apr 11 19:26:01 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <16505.53309.360402.514258@montanaro.dyndns.org>
References: <E1BCe1i-00085C-1K@mail.python.org>
	<5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
	<16505.53309.360402.514258@montanaro.dyndns.org>
Message-ID: <1081725547.82.472@mint-julep.mondoinfo.com>

> Ah, okay.  I missed that in Matt's post.  If the tokenizer's
> x-pick_apart_urls option is True, it picks apart URLs embedded in
> the body of the message.  It's not as ip-centered as Matt's code.

Yes, the two needn't be related. That just turned out to be the best
place in the code to do the lookup. I wanted the patch to be as
simple as possible in the hopes that someone else would like to test
it.

Here's a little more data: Since yesterday morning, SpamBayes has
scored 352 messages for me. Of those, a url-ip token has figured in
the evidence for 262 of them. Only 90 were scored without one.

Regards,
Matt


From skip at pobox.com  Sun Apr 11 19:56:14 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Apr 11 19:56:25 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1081725547.82.472@mint-julep.mondoinfo.com>
References: <E1BCe1i-00085C-1K@mail.python.org>
	<5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
	<16505.53309.360402.514258@montanaro.dyndns.org>
	<1081725547.82.472@mint-julep.mondoinfo.com>
Message-ID: <16505.56094.688745.208522@montanaro.dyndns.org>


    >> Ah, okay.  I missed that in Matt's post.  If the tokenizer's
    >> x-pick_apart_urls option is True, it picks apart URLs embedded in the
    >> body of the message.  It's not as ip-centered as Matt's code.

    Matt> Yes, the two needn't be related. That just turned out to be the
    Matt> best place in the code to do the lookup. I wanted the patch to be
    Matt> as simple as possible in the hopes that someone else would like to
    Matt> test it.

Can your mods be easily factored into the x-pick_apart_urls option?

Skip

From matt at mondoinfo.com  Sun Apr 11 20:10:01 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Apr 11 20:10:09 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <16505.56094.688745.208522@montanaro.dyndns.org>
References: <E1BCe1i-00085C-1K@mail.python.org>
	<5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com>
	<16505.53309.360402.514258@montanaro.dyndns.org>
	<1081725547.82.472@mint-julep.mondoinfo.com>
	<16505.56094.688745.208522@montanaro.dyndns.org>
Message-ID: <1081728065.93.472@mint-julep.mondoinfo.com>

Dear Skip,

> Can your mods be easily factored into the x-pick_apart_urls option?

If I understand your question correctly, the answer is that the DNS
lookup code is governed by that option now.

If my patch were ever added to the distributed code, I suspect that
it would make sense to leave it under x-pick_apart_urls and add
another option that affected only the DNS lookup code. It would be
necessary to note in the documentation that turning that option on
only had an effect if x-pick_apart_urls was also turned on but I
don't imagine that that would be a serious problem.
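
The dependency between the two options could be expressed as a small
check like the one below.  This is only a sketch: "x-lookup_url_ip" is a
made-up name for the proposed second option, and a plain dict stands in
for the real options object.

```python
def dns_lookup_enabled(options):
    """True only when the parent option and the dependent option are both on.

    'x-lookup_url_ip' is a hypothetical option name used for illustration.
    """
    return bool(options.get("x-pick_apart_urls")
                and options.get("x-lookup_url_ip"))

# The DNS option alone has no effect; it needs x-pick_apart_urls as well.
only_dns = dns_lookup_enabled({"x-lookup_url_ip": True})
both_on = dns_lookup_enabled({"x-pick_apart_urls": True,
                              "x-lookup_url_ip": True})
```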

Regards,
Matt


From tameyer at ihug.co.nz  Sun Apr 11 21:25:13 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Apr 11 21:25:34 2004
Subject: [spambayes-dev] RE: Incremental filtering and the spam folder
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E10E63@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C04@its-xchg4.massey.ac.nz>

> > Would it not be better to only add the hook if the 
> > train_manual_spam option is True?  Or is there some
> > other reason that the spam folder has to be hooked?
> 
> I think you are correct - we could avoid the hook 
> altogether.  However, I'm still reluctant to change this, as 
> it does risk breakage.  We can fix it post 1.0.

Sounds good.  So I remember, I've opened a tracker:

[ 933473 ] Unnecessary spam folder hook
<http://sourceforge.net/tracker/index.php?func=detail&aid=933473&group_id=61702&atid=498103>

I'll run with the patch until it gets checked in as well, for some
additional assurance that it'll work.

=Tony Meyer


From tameyer at ihug.co.nz  Sun Apr 11 21:36:02 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Apr 11 21:36:29 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E10E34@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B7E@its-xchg4.massey.ac.nz>

> Unfortunately, the typelibs for office 2000 are hard-coded in 
> a couple of spots.  I fear that upgrading these to later 
> typelibs will prevent SpamBayes working at all for the older 
> users.  No time to, or easy way to look at this particular problem.

In my local copies, I've changed the hard-coded codes to match the ones I
generate.  I suspect that you're right, since it didn't appear that my
binary build worked for Amir.  I have access to OL2K and OL2k2, so I could
play around with this at some point, but there doesn't seem to be much of a
need for it right now (i.e., for as long as you're willing to be the plug-in
builder).

> I'm sorry, but I seem to have missed that, and can't find it. 
> I've a message or 2 from Thomas to catch up on next, but 
> don't recall it being one of them.

This one:

<http://mail.python.org/pipermail/spambayes-dev/2004-March/002628.html>

Though note also his comment here:

<http://mail.python.org/pipermail/spambayes-dev/2004-April/002669.html>

=Tony Meyer


From mhammond at skippinet.com.au  Sun Apr 11 22:06:38 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Apr 11 22:06:56 2004
Subject: [spambayes-dev] Re: 1.0b1 Release candidates
In-Reply-To: <3c7eiyyo.fsf@python.net>
Message-ID: <07e701c42032$cad3fa80$0200a8c0@eden>

Tony:
> release out?  Are you able to use the code that Thomas
> posted to patch the ActivePython pythoncom.dll

Tony - note that the code posted by Thomas is to patch python23.dll, as
packaged by py2exe and shipped by us.  It does not touch the ActivePython
DLLs at all.

What we are doing is changing the registry location that *we* read Python
options from.  This way, we won't be reading the standard "2.3", which is
what ActivePython uses.

> ... and should that code go into py2exe (maybe with an
> additional change
> to py2exe so that it can specify the LCID for the resources) ...

Yes, I believe it should, and I agree losing that version information is no
big deal.  IIUC, you were also implying that Python 2.4 should be patched to
use a language-independent resource here?

I'll have a bash at making a py2exe patch, while I'm making the other
patches up.

Mark.


From tameyer at ihug.co.nz  Sun Apr 11 22:14:30 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Apr 11 22:15:24 2004
Subject: [spambayes-dev] Re: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E112EF@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C0A@its-xchg4.massey.ac.nz>

> Tony - note that the code posted by Thomas is to patch 
> python23.dll, as packaged by py2exe and shipped by us.  It 
> does not touch the ActivePython DLLs at all.
> 
> What we are doing is changing the registry location that *we* 
> read Python options from.  This way, we won't be reading the 
> standard "2.3", which is what ActivePython uses.

Ah, ok - I understand this now.  I'm glad I left this alone <wink>.

=Tony Meyer


From kennypitt at hotmail.com  Mon Apr 12 08:57:29 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 12 08:58:46 2004
Subject: [spambayes-dev] Compatibility with Norton Antivirus
In-Reply-To: <FBEKJBMGJJDKLDKIMDFBCEAFCAAA.Arnold.Lou829@rogers.com>
Message-ID: <BAY16-DAV25dKGwAYGa0001a95b@hotmail.com>

Lou Arnold wrote:
> I have installed Norton Anti-Virus (NAV) to secure my incoming email.
> As I understand things, it has a POP3 Proxy that sits between the ISP
> mail server and my MS Outlook 2000.
> 
> Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV?

I run NAV 2003 at home, and at least as of that version NAV no longer
operates as a POP3 proxy.  It hooks directly into the network protocol
stack as a "filter" that sees the traffic on the POP3 port before
whatever application is accessing the port.  I simply configure
SpamBayes to talk to my POP3 server and my mail client to talk to
SpamBayes, and NAV filters the mail traffic as SpamBayes reads it from
the POP3 server.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Mon Apr 12 09:06:26 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 12 09:07:48 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B7E@its-xchg4.massey.ac.nz>
Message-ID: <BAY16-DAV5MWuLn5U9T0001aaac@hotmail.com>

Tony Meyer wrote:
>> Unfortunately, the typelibs for office 2000 are hard-coded in
>> a couple of spots.  I fear that upgrading these to later
>> typelibs will prevent SpamBayes working at all for the older
>> users.  No time to, or easy way to look at this particular problem.
> 
> In my local copies, I've changed the hard-coded codes to match the
> ones I generate.  I suspect that you're right, since it didn't appear
> that my binary build worked for Amir.  I have access to OL2K and
> OL2k2, so I could play around with this at some point, but there
> doesn't seem to be much of a need for it right now (i.e., for as long
> as you're willing to be the plug-in builder).

On my system, I just grabbed copies of MSO9.DLL and MSOUTL9.OLB from an
old installation of Outlook 2K and registered them with regsvr32 on my
Outlook 2003 system.  That installed the two type libraries that
SpamBayes is looking for.  Now everything builds without errors and
creates the right genpy interface GUIDs when I run py2exe.  Haven't
actually installed and run the binary on an OL2K system to verify
operation, though.

-- 
Kenny Pitt


From rmalayter at bai.org  Mon Apr 12 11:45:51 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Mon Apr 12 11:45:57 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C8D95@cliff.bai.org>

[Seth Goodman]

> 1) How RFC2822 compliant is the stored message format in the various
> versions of Outlook subsequent to Outlook2000?  Without even looking
> at the code I can tell that Outlook2000 is not compliant due to the
> total absence of a References: header, which causes many people real
> problems who view mailing lists by conversation thread.

I've done a bit of research into this, as I am trying to find a way to
reliably reconstruct the MIME structure in the Outlook plug-in, beyond
simply synthesizing a token for attachments.

When any version of Outlook, even 2003, stores mail in a .PST file, the
messages are converted to Microsoft's "MAPI" format, which destroys the
MIME structure. The MAPI format is mostly proprietary and only partially
documented, and seems to get tweaks from version to version. This
situation is not likely to change, since MS needs to preserve some form
of backwards compatibility. Many people run different versions of
Outlook on different machines, and they would get a boatload of support
calls from people trying to open newer PST files on older Outlook
versions if they changed the format drastically.

A version of this MAPI format is what is exposed via the Outlook APIs to
the SpamBayes Outlook plug-in, and is the source of the issue with
attachments. 

Now, when you use Outlook 2003 and mail is *stored* on a Microsoft
Exchange *2003* server (not a PST file), the mail is not automatically
converted to MAPI format. It remains in RFC format in the Exchange
server database and even when it is sent to the Outlook 2003 client.
This is nice, because it drastically reduces the "format conversion"
CPU load on the Exchange server.

However, there still appears to be no way to access this RFC-compliant
message stream programmatically from within the Outlook 2003 client. The
Outlook client performs the RFC-to-MAPI format conversion on the fly.

You can get the RFC format message stream through various means on the
server-side, but this is not much help to the SpamBayes plug-in. One
thing I have been able to do is create a windows file share of the
Exchange Installable File-system (EXIFS), which basically gives you
access to a set of read-only files representing each message in RFC
format. Assuming you were to set up this file share on your Exchange
server with appropriate permissions, you could then add code to the
SpamBayes plug-in to look at the RFC-formatted message from this file
share.

This method is certainly a hack, and may not work in the future, since
MS appears to be moving away from the ExIFS. And since most users of the
SB code base do not use Exchange servers, but rather connect to standard
POP3 or IMAP servers, it is probably not worth pursuing a patch to the
general SB code base to make this work.


> 
> 2) If the later versions of Outlook are more (or perhaps even fully?)
> RFC2822 compliant, would it be possible to detect the Outlook version
> and enable generating the additional tokens that are available with
> the web proxy?
> 

Another option I was looking at would be to use a subset of the
SpamBayes POP3/IMAP filter in the Outlook client to retrieve messages in
RFC format. This way, if you left your mail on the server, you could
still use the Outlook plug-in user interface, but it would actually go
and retrieve the mail from the server via IMAP or POP3 rather than using
Outlook's API to get a message stream. If it couldn't find the message
via IMAP or POP3, that means the message is no longer on the mail server
and it would use the version provided by Outlook's API.
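
In outline, that lookup order might look like the sketch below.  Every
name here is hypothetical: the fetchers stand in for whatever IMAP or
POP3 retrieval code would be shared with the proxies, and the fallback
stands in for the existing Outlook API path.

```python
def get_message_text(message_id, server_fetchers, outlook_fallback):
    """Return (text, source), preferring the RFC-format copy on the server.

    server_fetchers: callables that take a message id and return the raw
        RFC 2822 text, or None (or raise LookupError) if it is not there.
    outlook_fallback: callable that returns Outlook's version of the message.
    """
    for fetch in server_fetchers:
        try:
            raw = fetch(message_id)
        except LookupError:
            raw = None
        if raw is not None:
            return raw, "server"
    # Not on the server any more: use what Outlook's API gives us.
    return outlook_fallback(message_id), "outlook"

# Demo with dictionaries standing in for a mail server and for Outlook.
on_server = {"<1@example.com>": "Subject: hi\r\n\r\nfull MIME text"}
in_outlook = {"<1@example.com>": "hi (MAPI view)",
              "<2@example.com>": "old (MAPI view)"}
found = get_message_text("<1@example.com>", [on_server.get], in_outlook.get)
missing = get_message_text("<2@example.com>", [on_server.get], in_outlook.get)
```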

This basically would mean there would need to be a level of integration
between the Outlook plug-in and the IMAP/POP3 proxies, and *all* Outlook
plug-in installations of SpamBayes would also be IMAP or POP3 proxy
installations.

It seems this is going to be difficult to get working, though, with the
possibility of little gain if tokenizing file attachments doesn't prove
generally useful.

So I'm going to go back to trying to synthesize a MIME header for
attachments when I have the time.
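
As a rough illustration of what synthesizing such a header could mean,
the sketch below builds a MIME part that carries only an attachment's
name and type, with no payload, using the standard email package.  The
function name and the default content type are assumptions; this is not
code from the plug-in.

```python
from email.message import Message

def synthesize_attachment_part(filename,
                               content_type="application/octet-stream"):
    """Build a header-only MIME part describing an attachment.

    Only the name and type are recorded; the payload is left empty so
    that large or binary attachments never need to be reconstructed.
    """
    part = Message()
    part.add_header("Content-Type", content_type, name=filename)
    part.add_header("Content-Disposition", "attachment", filename=filename)
    part.set_payload("")
    return part

# 'invoice.pif' is just an example name.
demo_part = synthesize_attachment_part("invoice.pif")
demo_name = demo_part.get_filename()
demo_type = demo_part.get_content_type()
```

A tokenizer that already looks at Content-Type and Content-Disposition
headers would then see the attachment name without any change to the
tokenizer itself.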

If you have any more thoughts, please let me know.

Thanks,
	Ryan

From pje at telecommunity.com  Mon Apr 12 18:47:47 2004
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Apr 12 18:48:08 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <E1BCqNF-0001re-Q7@mail.python.org>
Message-ID: <5.1.1.6.0.20040412184300.0200b790@telecommunity.com>

At 09:36 PM 4/11/04 -0400, "Phillip J. Eby" <pje@telecommunity.com> wrote:
>Of course, the spammers' next move would likely be to use redirects from
>non-"bulletproof" hosts, but everything we can do to make it more difficult
>and more costly for them is a good thing.

Oh, by the way, Slashdot ran an article today on a similar scheme:

http://slashdot.org/article.pl?sid=04/04/12/1956252


From mhammond at skippinet.com.au  Mon Apr 12 23:20:57 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Apr 12 23:21:17 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A021C8D95@cliff.bai.org>
Message-ID: <119901c42106$56b465c0$0200a8c0@eden>

> I've done a bit of research into this, as I am trying to find a way to
> reliably reconstruct the MIME structure in the Outlook plug-in, beyond
> simply synthesizing a token for attachments.

As a matter of interest, why do you want to do this?  Won't you have to
reconstruct all the attachments in this stream, just to have them pulled
apart (but promptly ignored) by the tokenizer?  For binary attachments,
including virus payload, this would seem significant.

Given the various problems and version dependencies we have extracting this
stream, it would seem much simpler to use documented stable Outlook
interfaces to synthesize the few tokens we are talking about.

While I agree it is an interesting problem, I don't see why it would be the
best way for the Outlook addin to approach it.  Is there something I am
missing?

Thanks,

Mark.
From mhammond at skippinet.com.au  Mon Apr 12 23:22:23 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Apr 12 23:22:41 2004
Subject: [spambayes-dev] Compatibility with Norton Antivirus
In-Reply-To: <BAY16-DAV25dKGwAYGa0001a95b@hotmail.com>
Message-ID: <119d01c42106$8a58c3d0$0200a8c0@eden>

> operates as a POP3 proxy.  It hooks directly into the network protocol
> stack as a "filter" that sees the traffic on the POP3 port before
> whatever application is accessing the port. 

Hmm - now that sounds like fun :)

Mark.


From rmalayter at bai.org  Tue Apr 13 00:00:16 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Tue Apr 13 00:00:21 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C8DED@cliff.bai.org>

[Mark Hammond]
>As a matter of interest, why do you want to do this?  Won't 
>you have to reconstruct all the attachments in this stream, 
>just to have them pulled apart (but promptly ignored) by the 
>tokenizer?  

Because there may be more tokenizing options that require the full
RFC2822-plus-MIME structure of the message. I also figured it would be
neat and tidy to put the Outlook plug-in on equal footing with the proxy
versions of SpamBayes. It would certainly make the two versions respond
the same when testing the same corpora.

Synthesizing tokens would scratch this particular itch a few people are
having with attachment names, but not necessarily any future itches. I
figured if I could solve the RFC-format problem easily (which I can't),
it would solve my current issue and also be better for the future of the
code base.

	-ryan-

From tameyer at ihug.co.nz  Tue Apr 13 01:59:34 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Apr 13 01:59:52 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E11030@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>

Have you tried using the x-slurp_urls option as a solution for this problem?
(I'm not saying it's a better solution, just curious if you have, and if so,
what the results were).

> In case anyone would like to play with it, I'll append my trivial
> patch. It requires pydns from:
> 
> http://sourceforge.net/projects/pydns/

This concerns me a bit.  I'd want to see really dramatic results before
something in the core distribution required non-standard libraries to be
installed.  How complex is the code that the patch is using?  Running
timcv.py was *really* slow, too - I don't know whether this was because a
lot of messages timed out, or that the DNS lookup was slow, or what, but it
worries me a bit.  Doing the DNS enquiry interactively was very quick, and
at this time of night our DNS server isn't used much at all, so quite
responsive.

Here are my results using timcv.py -n5 with two corpora.  First the cmp.py
results, then table.py output that also includes a run with the defaults.

The first one (my wife's mail for the last few months) is a win (-1 fn, -4
unsure).  The second one (my work mail for the last few months) is a loss
(two unsure move into fn in one run, the rest unchanged).

Note that in both of these the standard x-pick_apart_urls option does
nothing (good or bad) for me.

-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams
-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.279  0.279  tied
    0.279  0.279  tied
    0.559  0.559  tied
    2.266  2.266  tied
    2.521  2.241  won    -11.11%

won   1 times
tied  4 times
lost  0 times

total unique fn went from 21 to 20 won     -4.76%
mean fn % went from 1.18076754281 to 1.12474513385 won     -4.74%

ham mean                     ham sdev
   0.00    0.01 +(was 0)        0.04    0.04   +0.00%
   0.49    0.49   +0.00%        4.91    4.91   +0.00%
   0.02    0.01  -50.00%        0.12    0.11   -8.33%
   0.03    0.02  -33.33%        0.21    0.21   +0.00%
   0.01    0.01   +0.00%        0.08    0.08   +0.00%

ham mean and sdev for all runs
   0.11    0.11   +0.00%        2.21    2.21   +0.00%

spam mean                    spam sdev
  96.02   96.11   +0.09%       13.44   13.60   +1.19%
  97.15   97.31   +0.16%       11.27   11.10   -1.51%
  97.12   97.30   +0.19%       11.86   11.89   +0.25%
  94.93   94.92   -0.01%       17.08   17.53   +2.63%
  94.99   95.08   +0.09%       17.16   17.26   +0.58%

spam mean and sdev for all runs
  96.05   96.15   +0.10%       14.40   14.55   +1.04%

ham/spam mean difference: 95.94 96.04 +0.10

filename:       libbys libby_picks libby_pickms
ham:spam:     499:1785    499:1785    499:1785
fp total:            0           0           0
fp %:             0.00        0.00        0.00
fn total:           21          21          20
fn %:             1.18        1.18        1.12
unsure t:          118         119         114
unsure %:         5.17        5.21        4.99
real cost:      $44.60      $44.80      $42.80
best cost:      $11.80      $11.80      $12.00
h mean:           0.11        0.11        0.11
h sdev:           2.21        2.21        2.21
s mean:          96.04       96.05       96.15
s sdev:          14.40       14.40       14.55
mean diff:       95.93       95.94       96.04
k:                5.78        5.78        5.73
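
For anyone reading the table, the percentage and "real cost" rows can be
reproduced from the counts.  The sketch below assumes weights of $10 per
false positive, $1 per false negative and $0.20 per unsure; those weights
are an assumption here, chosen because they reproduce the figures above.

```python
# Assumed cost weights (they match the numbers in the table above).
FP_COST, FN_COST, UNSURE_COST = 10.0, 1.0, 0.2

def summarize(n_ham, n_spam, fp, fn, unsure):
    """Recompute the percentage and real-cost rows from raw counts."""
    return {
        "fp %": round(100.0 * fp / n_ham, 2),      # share of ham misfiled
        "fn %": round(100.0 * fn / n_spam, 2),     # share of spam missed
        "unsure %": round(100.0 * unsure / (n_ham + n_spam), 2),
        "real cost": round(fp * FP_COST + fn * FN_COST
                           + unsure * UNSURE_COST, 2),
    }

# First and last columns of the table (499 ham, 1785 spam).
defaults_row = summarize(499, 1785, fp=0, fn=21, unsure=118)
dns_row = summarize(499, 1785, fp=0, fn=20, unsure=114)
```

On these numbers the $1.80 drop in real cost comes from one fewer false
negative plus four fewer unsures.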

-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  9.375  lost   +19.99%
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  4 times
lost  1 times

total unique fn went from 35 to 37 lost    +5.71%
mean fn % went from 5.43654580153 to 5.74904580153 lost    +5.75%

ham mean                     ham sdev
   0.18    0.18   +0.00%        1.77    1.77   +0.00%
   0.01    0.01   +0.00%        0.17    0.17   +0.00%
   0.01    0.01   +0.00%        0.12    0.12   +0.00%
   0.03    0.01  -66.67%        0.39    0.13  -66.67%
   0.28    0.29   +3.57%        3.37    3.38   +0.30%

ham mean and sdev for all runs
   0.10    0.10   +0.00%        1.72    1.71   -0.58%

spam mean                    spam sdev
  88.89   88.89   +0.00%       25.38   25.48   +0.39%
  90.07   90.39   +0.36%       23.20   22.75   -1.94%
  87.23   87.13   -0.11%       28.96   29.35   +1.35%
  90.79   90.92   +0.14%       23.89   23.80   -0.38%
  90.31   90.67   +0.40%       25.99   25.52   -1.81%

spam mean and sdev for all runs
  89.46   89.60   +0.16%       25.59   25.52   -0.27%

ham/spam mean difference: 89.36 89.50 +0.14

filename:         exchanges   exchange_picks  exchange_pickms
ham:spam:          1391:643         1391:643         1391:643
fp total:                 0                0                0
fp %:                  0.00             0.00             0.00
fn total:                35               35               37
fn %:                  5.44             5.44             5.75
unsure t:                83               82               80
unsure %:              4.08             4.03             3.93
real cost:           $51.60           $51.40           $53.00
best cost:           $33.80           $33.80           $33.00
h mean:                0.10             0.10             0.10
h sdev:                1.72             1.72             1.71
s mean:               89.34            89.46            89.60
s sdev:               25.65            25.59            25.52
mean diff:            89.24            89.36            89.50
k:                     3.26             3.27             3.29

=Tony Meyer


From mhammond at skippinet.com.au  Tue Apr 13 02:27:13 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 13 02:27:40 2004
Subject: [spambayes-dev] ANNOUNCE: SpamBayes release 1.0b1
Message-ID: <00a701c42120$5ca69e70$0200a8c0@eden>

The SpamBayes team is pleased to announce the latest release of SpamBayes -
1.0b1.

Like the last version, this is both a release of the source code and of an
installation program for all Microsoft Windows users.

The Windows installation program will install either the Outlook add-in (for
Microsoft Outlook users), or the SpamBayes server program (for all other
mail client users, including Microsoft Outlook Express). All Windows users
(including existing users of the Outlook add-in) are encouraged to use the
installation program.

If you wish to use the source-code version, you will also need to install
Python - see README.txt in the source tree for more information.

This release fixes a number of bugs in the last release, including a bug
that could cause your PC to operate as an open mail relay in some cases.  We
recommend that all existing users upgrade. For a detailed description of
everything (well, everything we remember) that has changed since the last
release, you can view our WHAT_IS_NEW.txt file, either online, or in the
source distribution.

Get it via the 'Download' page at

    http://www.spambayes.org/download.html

Enjoy the new release and your spam-free mailbox :-)

Thanks to everyone involved in this release, particularly, and as usual,
Tony Meyer for putting most of the actual release together!

Mark.
(on behalf of the SpamBayes team)

--- What is SpamBayes? ---

The SpamBayes project is working on developing a Bayesian (of sorts)
anti-spam filter (in Python), initially based on the work of Paul Graham.
The major difference between this and other, similar projects is the
emphasis on testing newer approaches to scoring messages.

The project includes a number of different applications, all using the same
core code, ranging from a plug-in for Microsoft Outlook, to a POP3 proxy, to
various command-line tools.


From anthony at interlink.com.au  Tue Apr 13 04:48:27 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Apr 13 04:50:01 2004
Subject: [spambayes-dev] Re: [Spambayes] ANNOUNCE: SpamBayes release 1.0b1
In-Reply-To: <00a701c42120$5ca69e70$0200a8c0@eden>
References: <00a701c42120$5ca69e70$0200a8c0@eden>
Message-ID: <407BA95B.201@interlink.com.au>

Mark Hammond wrote:
> The SpamBayes team is pleased to announce the latest release of SpamBayes -
> 1.0b1.

Woohoo! Well done to everyone involved in this release process.

When we get to 1.0, it's probably worth mentioning in the
announcement that although it's a "1.0" release, it's actually
the 9th? 10th? release of the software.


and-now-onto-the-almost-mythical-"1.0"-release...

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>
It's never too late to have a happy childhood.

From sjoerd at acm.org  Tue Apr 13 06:35:28 2004
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Tue Apr 13 06:35:32 2004
Subject: [spambayes-dev] python setup.py build is failing
Message-ID: <407BC270.5040900@acm.org>

With a completely up-to-date checkout of spambayes, the command "python 
setup.py build" fails with the message
	error: file 'scripts/sb_bnfilter.py' does not exist

It seems to me somebody checked in a change to setup.py to also compile 
and install sb_bnfilter.py but forgot to check in the file itself...

-- 
Sjoerd Mullender <sjoerd@acm.org>

From skip at pobox.com  Tue Apr 13 09:41:54 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue Apr 13 09:42:01 2004
Subject: [spambayes-dev] python setup.py build is failing
In-Reply-To: <407BC270.5040900@acm.org>
References: <407BC270.5040900@acm.org>
Message-ID: <16507.60962.924060.82138@montanaro.dyndns.org>


    Sjoerd> With a completely up-to-date checkout of spambayes, the command
    Sjoerd> "python setup.py build" fails with the message
    Sjoerd>     error: file 'scripts/sb_bnfilter.py' does not exist

I noticed the change to the installation procedure float by on the checkins
list but never saw anything which indicated that sb_bnfilter.py and
sb_bnserver.py were moved from contrib to scripts.

    Sjoerd> It seems to me somebody checked in a change to setup.py to also
    Sjoerd> compile and install sb_bnfilter.py but forgot to check in the
    Sjoerd> file itself...

It was there, just not where setup.py was looking.  I cvs remove'd them from
contrib and cvs add'ed them to scripts, then modified setup.py to also
install sb_bnserver.py.

Skip

From kennypitt at hotmail.com  Tue Apr 13 10:28:39 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr 13 10:29:59 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
Message-ID: <BAY16-DAV319VMczi9y0000c994@hotmail.com>

Tony Meyer wrote:
> Have you tried using the x-slurp_urls option as a solution for this
> problem? (I'm not saying it's a better solution, just curious if you
> have, and if so, what the results were).
> 
>> In case anyone would like to play with it, I'll append my trivial
>> patch. It requires pydns from:
>> 
>> http://sourceforge.net/projects/pydns/
> 
> This concerns me a bit.  I'd want to see really dramatic results
> before something in the core distribution required non-standard
> libraries to be installed.

Any reason why socket.gethostbyname(hostname) wouldn't work?  I wrote a
patch a while back using that function to do DNS queries against a DNSBL
blacklist server and create additional tokens based on the results.
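
As a rough illustration of that approach, the sketch below builds a DNSBL
query name and turns the answer from socket.gethostbyname() into tokens.
The zone name dnsbl.example.org, the token strings and the helper names are
placeholders invented for this example; it is not the patch mentioned above.

```python
import socket


def dnsbl_query_name(ip, zone):
    """Reverse the IPv4 octets and append the DNSBL zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def dnsbl_tokens(ip, zone="dnsbl.example.org", resolve=socket.gethostbyname):
    """Return illustrative tokens saying whether ip is listed in zone.

    The zone and the token format are hypothetical.  `resolve` is
    injectable so the function can be exercised without network access.
    """
    try:
        answer = resolve(dnsbl_query_name(ip, zone))
    except OSError:
        # A failed lookup (for example NXDOMAIN) is treated as "not listed".
        return ["dnsbl:%s:unlisted" % zone]
    # DNSBLs conventionally answer with a 127.0.0.x address when listed.
    return ["dnsbl:%s:listed" % zone, "dnsbl:%s:%s" % (zone, answer)]
```

Passing a stub for `resolve` keeps experiments repeatable, which matters for
the training concern raised below.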

There are two problems with doing DNS queries during tokenization.  The
first is performance because you're having to wait for the result of
network operations instead of just manipulating local data.  My DNSBL
queries worked well, but didn't improve the overall accuracy enough to
justify the performance hit.

The second is training.  DNS lookups are by nature dynamic, so the
results generated are not necessarily the same every time you do it.
Training (in particular, correcting the training of a message that was
previously trained incorrectly) relies on the tokens that get generated
for a particular message being identical every time the message is
tokenized.  If some of the tokens rely on additional data from a DNS
query, those tokens may be different when the user gets around to
retraining the message.
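
One conceivable way around the retraining problem, sketched here purely as
an illustration, is to store the DNS-derived tokens the first time a message
is tokenized and reuse them afterwards.  The cache class, its keying by
Message-ID and the lookup callable are all assumptions made for this example;
nothing like it is claimed to exist in SpamBayes.

```python
class DnsTokenCache:
    """Remember DNS-derived tokens per message so retraining sees the same ones."""

    def __init__(self, lookup):
        self._lookup = lookup    # callable: message text -> iterable of tokens
        self._by_msgid = {}      # hypothetical store; a real one would persist

    def tokens_for(self, msgid, message_text):
        """Return the tokens recorded for msgid, doing the lookup only once."""
        if msgid not in self._by_msgid:
            self._by_msgid[msgid] = list(self._lookup(message_text))
        # Hand back a copy so callers cannot alter the stored tokens.
        return list(self._by_msgid[msgid])
```

The trade-off is extra storage per message, and stale answers are kept on
purpose so that untraining removes exactly what training added.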

-- 
Kenny Pitt


From matt at mondoinfo.com  Tue Apr 13 13:52:41 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue Apr 13 13:53:01 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <BAY16-DAV319VMczi9y0000c994@hotmail.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
	<BAY16-DAV319VMczi9y0000c994@hotmail.com>
Message-ID: <1081876416.34.1193@mint-julep.mondoinfo.com>

>>> http://sourceforge.net/projects/pydns/

[Tony Meyer]
>> This concerns me a bit.  I'd want to see really dramatic results
>> before something in the core distribution required non-standard
>> libraries to be installed.

I don't necessarily disagree. Still, even if it went into the core
distribution, it would surely be sensible to have it turned off by
default and distutils makes installing PyDNS pretty simple.

I've thought for a while that it would be good to get some DNS module
into Python's standard library but I've never thought that I had a
strong enough argument to bring it up publicly. Using it in SpamBayes
might be a start.

[Kenny Pitt]
> Any reason why socket.gethostbyname(hostname) wouldn't work?  I
> wrote a patch a while back using that function to do DNS queries
> against a DNSBL blacklist server and create additional tokens based
> on the results.

As far as I can tell, socket.gethostbyname() doesn't respect the
timeout set by socket.setdefaulttimeout(). That's apt to make the
performance hit rather worse.
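
If the resolver call does ignore the default socket timeout, as it appears
to, one workaround is to run it in a worker thread and stop waiting after a
deadline.  This is only a sketch, and it uses concurrent.futures from the
modern standard library, which did not exist in the Python of the time; the
helper name and the pool size are arbitrary choices.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(hostname, timeout=2.0, resolve=socket.gethostbyname):
    """Return the resolved address, or None on failure or after `timeout` seconds.

    The lookup itself is not cancelled; the worker thread simply finishes
    in the background and its result is discarded.
    """
    future = _pool.submit(resolve, hostname)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None   # gave up waiting
    except OSError:
        return None   # the lookup failed outright
```

Because abandoned lookups keep occupying a worker until they return, a burst
of slow names can still exhaust a small pool.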

> There are two problems with doing DNS queries during tokenization.
> The first is performance because you're having to wait for the
> result of network operations instead of just manipulating local
> data.  My DNSBL queries worked well, but didn't improve the overall
> accuracy enough to justify the performance hit.

Personally, as long as I set the timeout pretty low, I barely notice
the difference. When my mail client fetches a couple of emails,
they're scored quickly enough that I don't notice an additional
delay. If it fetches 100 or so, that's going to take a while in
either case. No doubt, other people would have different experiences.

> The second is training.  DNS lookups are by nature dynamic, so the
> results generated are not necessarily the same every time you do
> it. Training (in particular, correcting the training of a message
> that was previously trained incorrectly) relies on the tokens that
> get generated for a particular message being identical every time
> the message is tokenized.  If some of the tokens rely on additional
> data from a DNS query, those tokens may be different when the user
> gets around to retraining the message.

That's certainly a disadvantage. I think that legitimate servers
don't move around all that much, so it may turn out to be a
relatively small one but it would be nice to know for sure.

Regards,
Matt


From tameyer at ihug.co.nz  Tue Apr 13 18:58:34 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Apr 13 18:58:57 2004
Subject: [spambayes-dev] python setup.py build is failing
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E1170A@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B87@its-xchg4.massey.ac.nz>

> I noticed the change to the installation procedure float by 
> on the checkins list but never saw anything which indicated 
> that sb_bnfilter.py and sb_bnserver.py were moved from 
> contrib to scripts.
[...]
> It was there, just not where setup.py was looking.  I cvs 
> remove'd them from contrib and cvs add'ed them to scripts, 
> then modified setup.py to also install sb_bnserver.py.

Sorry, this is my fault.  I noticed that there was a new script, but didn't
notice that it wasn't in the scripts directory (or that there were actually
two).

Annoyingly, I forgot that I had made this change when I built the 1.0b1
dists, and so figured that the testing I did with 1.0b1rc1 would still be ok
(lesson to be learned there).  I'll put 1.0b1.1's on sf now.

=Tony Meyer


From tameyer at ihug.co.nz  Tue Apr 13 19:16:37 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Apr 13 19:16:49 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz>

> Here are my results using timcv.py -n5 with two corpora.  
> First cmp.py results, then a table.py with just running with 
> defaults as well.

And here are two more (they were running too slow to get out yesterday, but
completed overnight).

The first one is my non-work mail for the last few months; the second one is
the five sets that make up the SpamAssassin Public Archive (the bzip files
starting with 2003...).

Once again, the standard x-pick_apart_urls option does nothing (good or bad)
for me.  With the DNS lookup, the SAPC corpus shows a small loss, and my own
mail a more substantial one (although each wins one run).

-> <stat> tested 4692 hams & 386 spams against 18762 hams & 1537 spams
-> <stat> tested 4695 hams & 381 spams against 18759 hams & 1542 spams
-> <stat> tested 4693 hams & 383 spams against 18761 hams & 1540 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
-> <stat> tested 4684 hams & 389 spams against 18770 hams & 1534 spams
-> <stat> tested 4692 hams & 386 spams against 18762 hams & 1537 spams
-> <stat> tested 4695 hams & 381 spams against 18759 hams & 1542 spams
-> <stat> tested 4693 hams & 383 spams against 18761 hams & 1540 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
-> <stat> tested 4684 hams & 389 spams against 18770 hams & 1534 spams

false positive percentages
    0.000  0.000  tied
    0.021  0.021  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.00425985090522 to 0.00425985090522 tied

false negative percentages
    1.036  1.036  tied
    1.050  1.575  lost   +50.00%
    0.783  0.522  won    -33.33%
    1.823  2.083  lost   +14.26%
    1.285  1.799  lost   +40.00%

won   1 times
tied  1 times
lost  3 times

total unique fn went from 23 to 27 lost   +17.39%
mean fn % went from 1.19553834481 to 1.40321699713 lost   +17.37%

ham mean                     ham sdev
   0.09    0.10  +11.11%        1.73    1.72   -0.58%
   0.11    0.11   +0.00%        2.24    2.09   -6.70%
   0.12    0.12   +0.00%        2.05    2.05   +0.00%
   0.09    0.08  -11.11%        2.01    1.78  -11.44%
   0.04    0.05  +25.00%        0.88    1.19  +35.23%

ham mean and sdev for all runs
   0.09    0.09   +0.00%        1.85    1.80   -2.70%

spam mean                    spam sdev
  95.65   95.35   -0.31%       15.15   16.13   +6.47%
  95.77   95.20   -0.60%       15.18   16.83  +10.87%
  97.06   96.05   -1.04%       11.42   13.61  +19.18%
  95.32   94.61   -0.74%       16.75   18.41   +9.91%
  95.57   95.40   -0.18%       15.57   16.05   +3.08%

spam mean and sdev for all runs
  95.87   95.32   -0.57%       14.94   16.29   +9.04%

ham/spam mean difference: 95.78 95.23 -0.55

-> <stat> tested 830 hams & 380 spams against 3320 hams & 1517 spams
-> <stat> tested 830 hams & 380 spams against 3320 hams & 1517 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams
-> <stat> tested 830 hams & 380 spams against 3320 hams & 1517 spams
-> <stat> tested 830 hams & 380 spams against 3320 hams & 1517 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams
-> <stat> tested 830 hams & 379 spams against 3320 hams & 1518 spams

false positive percentages
    0.241  0.241  tied
    0.482  0.482  tied
    0.000  0.000  tied
    0.120  0.120  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 7 to 7 tied
mean fp % went from 0.168674698795 to 0.168674698795 tied

false negative percentages
    0.789  1.053  lost   +33.46%
    0.526  0.526  tied
    0.528  0.264  won    -50.00%
    0.264  0.264  tied
    1.055  1.319  lost   +25.02%

won   1 times
tied  2 times
lost  2 times

total unique fn went from 12 to 13 lost    +8.33%
mean fn % went from 0.632551034579 to 0.685182613526 lost    +8.32%

ham mean                     ham sdev
   0.67    0.61   -8.96%        6.87    6.56   -4.51%
   0.95    0.85  -10.53%        8.69    8.08   -7.02%
   0.87    0.81   -6.90%        7.10    6.79   -4.37%
   0.60    0.57   -5.00%        6.64    6.49   -2.26%
   0.48    0.42  -12.50%        4.87    4.62   -5.13%

ham mean and sdev for all runs
   0.71    0.65   -8.45%        6.94    6.60   -4.90%

spam mean                    spam sdev
  97.13   96.89   -0.25%       12.08   13.00   +7.62%
  98.59   98.50   -0.09%        8.09    8.49   +4.94%
  98.57   98.44   -0.13%        8.03    8.15   +1.49%
  98.59   98.54   -0.05%        7.51    7.68   +2.26%
  97.91   97.72   -0.19%       11.50   12.22   +6.26%

spam mean and sdev for all runs
  98.16   98.02   -0.14%        9.66   10.18   +5.38%

ham/spam mean difference: 97.45 97.37 -0.08
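
For anyone reading these listings for the first time, the won/lost/tied
column and its percentage can be reproduced with a few lines.  This is a
sketch that matches the figures shown above, not the code of cmp.py itself;
the function name is made up.

```python
def compare(before, after):
    """Classify a change in an error rate the way the listings above read.

    Lower is better for error rates, so a decrease counts as "won".
    Returns (verdict, percent_change); percent_change is None when it is
    not meaningful (a tie, or a change starting from zero).
    """
    if after == before:
        return ("tied", None)
    verdict = "won" if after < before else "lost"
    if before == 0:
        return (verdict, None)   # relative change from zero is undefined
    return (verdict, 100.0 * (after - before) / before)
```

For example, compare(1.050, 1.575) gives "lost" with a change of about
+50%, as in the second false negative row of the first listing.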

filename:        ihugs  ihug_picks ihug_pickms
ham:spam:   23454:1923  23454:1923  23454:1923
fp total:            1           1           1
fp %:             0.00        0.00        0.00
fn total:           23          23          27
fn %:             1.20        1.20        1.40
unsure t:          169         171         176
unsure %:         0.67        0.67        0.69
real cost:      $66.80      $67.20      $72.20
best cost:      $57.00      $56.60      $62.40
h mean:           0.09        0.09        0.09
h sdev:           1.89        1.85        1.80
s mean:          95.86       95.87       95.32
s sdev:          14.99       14.94       16.29
mean diff:       95.77       95.78       95.23
k:                5.67        5.70        5.26

filename:        sapcs  sapc_picks sapc_pickms
ham:spam:    4150:1897   4150:1897   4150:1897
fp total:            7           7           7
fp %:             0.17        0.17        0.17
fn total:           12          12          13
fn %:             0.63        0.63        0.69
unsure t:           99          99         100
unsure %:         1.64        1.64        1.65
real cost:     $101.80     $101.80     $103.00
best cost:      $70.60      $70.20      $70.80
h mean:           0.71        0.71        0.65
h sdev:           6.92        6.94        6.60
s mean:          98.14       98.16       98.02
s sdev:           9.72        9.66       10.18
mean diff:       97.43       97.45       97.37
k:                5.86        5.87        5.80
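
The "real cost" and "k" rows in the two tables are consistent with a cost of
$10 per false positive, $1 per false negative and $0.20 per unsure, and with
k being the mean difference divided by the sum of the two standard
deviations.  Those weights and that formula are inferred from the numbers
shown here, so treat this as a sketch rather than a description of table.py.

```python
def real_cost(fp, fn, unsure, fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Dollar cost of a run; the default weights are inferred from the tables."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


def separation_k(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Gap between the spam and ham means, in units of the summed deviations."""
    return (spam_mean - ham_mean) / (ham_sdev + spam_sdev)
```

With the ihugs column (1 fp, 23 fn, 169 unsure) real_cost comes to 66.80,
and separation_k(0.09, 1.89, 95.86, 14.99) rounds to 5.67.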


From matt at mondoinfo.com  Tue Apr 13 22:12:38 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue Apr 13 22:13:06 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz>
Message-ID: <1081905546.39.1651@mint-julep.mondoinfo.com>

Dear Tony,

> And here are two more (they were running too slow to get out
> yesterday, but completed overnight).

> Once again, the standard x-pick_apart_urls option does nothing
> (good or bad) for me.  The SAPC one is just a loss, and the other
> is a more substantial loss (although each win with one run).

Hm. Well that's probably enough evidence. A tiny win for me and a
small loss for you.

What's odd is that doesn't seem to match what I'm seeing in my inbox.
I was seeing nonsense spams there and now I'm not. Perhaps the range
of spams that DNS lookup is useful for is just too narrow.

Regards,
Matt


From tameyer at ihug.co.nz  Wed Apr 14 03:03:49 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 14 03:04:07 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz>

> Hm. Well that's probably enough evidence. A tiny win for me 
> and a small loss for you.

I don't know if it's enough, but it's likely that it's all you'll be able to
solicit here <0.1 wink>.

> What's odd is that doesn't seem to match what I'm seeing in 
> my inbox. I was seeing nonsense spams there and now I'm not. 

If you go through your spam folder and look at the clues for messages that
look like the ones that used to be there, do you see these tokens?  It could
be that the spammers sending these types of messages took a holiday this
week <0.5 wink>.  In any case, if you're happy running from source, then
there's nothing stopping you keeping the patch going for your own system -
it seems unlikely that it'll conflict with any tokenizer changes in the near
future.

> Perhaps the range of spams that DNS lookup is useful for is 
> just too narrow.

I suspect that it's that the spams that this helps to nail are already
nailed with other techniques.

I was reading some past messages today and that reminded me to suggest that
you try (if you haven't already) the x-use_bigrams option.  At least some
people have found that it's better at nailing short spams (although maybe
not quite as good at some of the more 'talky' spams).  Testing and developer
experience (I'm not sure if any users have turned the option on) do
indicate that it's a win overall.
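
For readers unfamiliar with the idea, a bigram scheme adds a token for each
pair of adjacent words alongside the single-word tokens.  The generator
below is a minimal sketch of that idea only; the "bi:" prefix is an
assumption, and the real option may choose among the resulting tokens in
ways not shown here.

```python
def with_bigrams(tokens):
    """Yield every token, plus a combined token for each adjacent pair."""
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Hypothetical naming: mark pair tokens with a "bi:" prefix.
            yield "bi:%s %s" % (previous, token)
        previous = token
```

A short message then yields nearly twice as many clues, which is one
plausible reason such a scheme might help with short spams.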

=Tony Meyer


From tdickenson at geminidataloggers.com  Wed Apr 14 05:55:06 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Wed Apr 14 05:55:10 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz>
Message-ID: <200404141055.06066.tdickenson@geminidataloggers.com>

On Tuesday 13 April 2004 06:59, Tony Meyer wrote:
> Have you tried using the x-slurp_urls option as a solution for this
> problem? (I'm not saying it's a better solution, just curious if you have,
> and if so, what the results were).

Like x-slurp_urls, enabling this option could allow host names to be used as a
bug by spammers to determine whether an email address is live. That doesn't
seem likely, but it's not impossible.

(and it would need custom dns hosting too.... so if we ever see this happening 
we would be able to expand this patch to use dns NS records as spam clues!)

> Running
> timcv.py was *really* slow, too - I don't know whether this was because a
> lot of messages timed out, or that the DNS lookup was slow, or what,

A barely related question..... which of our filtering methods allow for 
parallel filtering? sb_filter out of procmail does, provided your mta runs 
procmail concurrently :-) but sb_bnfilter will serialise them again :-(

-- 
Toby Dickenson


From tameyer at ihug.co.nz  Wed Apr 14 19:17:51 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 14 19:20:20 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E119E7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz>

> Like x-slurp_urls, enabling this option could allow host 
> names to be used as a bug by spammers to determine whether
> an email address is live. That doesn't seem likely, but it's
> not impossible.

This was discussed (a lot) back when the x-slurp_urls option was first
offered.  It's probably the main reason why even if it does live past being
an experimental option, it'll never default to True.  It's also the reason
for the x-only_slurp_base option - I can't see any way (other than
registering a domain per message) that it could then be used as a 'address
is live' indicator.  OTOH, enabling x-only_slurp_base does a lot of hurt to
the results in my testing.  If the x-slurp_urls option is ever shown to be
really effective, then it seems likely that a middle path between the two,
where it's very difficult to put any tracking information in, could be
created easily enough.

> A barely related question..... which of our filtering methods 
> allow for parallel filtering? sb_filter out of procmail does,
> provided your mta runs procmail concurrently :-) but sb_bnfilter
> will serialise them again :-(

I'm pretty sure that we don't support more than one process accessing the
database at one time at all.  As for one process filtering multiple messages
at a time, I believe sb_server can do this (i.e. if two connections are
made, to different local proxy ports, at the same time).  sb_imapfilter and
the Outlook plug-in don't.

I do have a version of the testing setup that runs on a cluster, but I
presume that's not the sort of parallel you were meaning?

=Tony Meyer


From spambayes at kungfoocoder.org  Wed Apr 14 19:40:33 2004
From: spambayes at kungfoocoder.org (Paul Wagland)
Date: Wed Apr 14 19:40:45 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz>
Message-ID: <1081986032.11188.63.camel@morsel.kungfoocoder.org>

On Thu, 2004-04-15 at 01:17, Tony Meyer wrote:
> > Like x-slurp_urls, enabling this option could allow host 
> > names to be used as a bug by spammers to determine whether
> > an email address is live. That doesn't seem likely, but it's
> > not impossible.
> 
> This was discussed (a lot) back when the x-slurp_urls option was first
> offered.  It's probably the main reason why even if it does live past being
> an experimental option, it'll never default to True.  It's also the reason
> for the x-only_slurp_base option - I can't see any way (other than
> registering a domain per message) that it could then be used as a 'address
> is live' indicator.

Just as a side issue... they only need a subdomain per message, not a
full domain. I.e. aaa.spamisevil.com is just as unique as
aaaspamisevil.com.

So, it would be fairly easy to set up to harvest "good" addresses. And,
as a bonus, if you don't care about the image being shown, just about
the e-mail address, you can return a false random response for the DNS
lookup.

Indeed, one early web site that I saw actually did cookie-less session
tracking using URL rewriting, but instead of playing with the URL, they
played with the hostname in a manner similar to aaacookieid.www.host.com

Food for thought,
Cheers,
Paul


From tameyer at ihug.co.nz  Wed Apr 14 19:46:23 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 14 19:46:36 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E92428@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C46@its-xchg4.massey.ac.nz>

> Just as a side issue... they only need a subdomain per
> message, not a full domain. I.e. aaa.spamisevil.com is just 
> as unique as aaaspamisevil.com

I was really talking about the x-slurp_urls option, rather than the DNS
lookup.  With that option's x-only_slurp_base the URL that is retrieved is
the simplest form of the url, i.e. "aaaspamisevil.com" or "massey.ac.nz".
Doing a simple HTTP request for a webpage like that does not (AFAICT) include
any information at all about who is doing the request.  This means that you
*do* need a domain per message.  It also means that if I have a spammy page
at "spam.massey.ac.nz", but "massey.ac.nz" is ham, the clues generated will
make things worse, not better.  Of course, if the root domain is
legitimately hammy and they have spammy subdomains/pages, there's a
reasonable chance that you can get the spammy people kicked off.
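
The reduction to a base URL described above can be sketched as follows.
This is an illustration only, not the x-only_slurp_base code: the function
name is made up, and the tiny suffix set stands in for the full list of
registry suffixes that a real implementation would need.

```python
from urllib.parse import urlsplit

# Illustrative only: a real implementation would need a complete list of
# suffixes under which domains are registered.
TWO_LABEL_SUFFIXES = {"ac.nz", "co.nz", "co.uk"}


def base_domain(url):
    """Drop the path, query and subdomains, keeping only the registered domain."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Keep three labels under a two-label suffix such as ac.nz, else two.
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])
```

Anything a sender might encode in the path, the query string or a subdomain
is discarded, which is why only a separate registered domain per message
would survive the reduction.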

=Tony Meyer


From matt at mondoinfo.com  Wed Apr 14 21:40:40 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Wed Apr 14 21:41:34 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz>
Message-ID: <1081990843.72.559@mint-julep.mondoinfo.com>

[me]
> Hm. Well that's probably enough evidence. A tiny win for me 
> and a small loss for you.

[Tony Meyer]
> I don't know if it's enough, but it's likely that it's all you'll
> be able to solicit here <0.1 wink>.

<0.9 chuckle>

> If you go through your spam folder and look at the clues for
> messages that look like the ones that used to be there, do you see
> these tokens?

I do. For example, I have a nonsense spam ("ostrich rimy cowlick
derange...") that has the subject "Our little secret". And its clues
include:

0.908 url-ip:221.5.250.122/32
0.908 url-ip:221.5.250/24
0.908 url-ip:221.5/16
0.965 url-ip:221/8
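
For reference, the four clue strings above follow a simple pattern: the
resolved address is cut back one octet at a time.  A sketch that reproduces
that format (the function name is made up, and this is not the patch itself):

```python
def url_ip_tokens(ip):
    """Build one token per prefix length, from the full address down to /8."""
    octets = ip.split(".")
    tokens = []
    for count in (4, 3, 2, 1):
        prefix = ".".join(octets[:count])
        tokens.append("url-ip:%s/%d" % (prefix, count * 8))
    return tokens
```

The shorter prefixes let the classifier learn about whole address blocks
even when the exact host has not been seen before.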

> It could be that the spammers sending these types of messages took
> a holiday this week <0.5 wink>.

<grin> It may also be that sending nonsense spams is a new tactic
among spammers (born of the success of SpamBayes of course) and
testing against spam even a month old won't show much advantage. I
was certainly motivated to try the url-ip thing because of the
unsures I had seen in the previous week or so.

> In any case, if you're happy running from source, then there's
> nothing stopping you keeping the patch going for your own system -
> it seems unlikely that it'll conflict with any tokenizer changes in
> the near future.

Indeed, I plan to. It doesn't seem to do me any harm. I'm mostly
miffed that the value of my Fabulously Clever Idea isn't borne out by
actual testing. I expect that Tim Peters in particular has enormous
sympathy <wink>.

> I suspect that it's that the spams that this helps to nail are
> already nailed with other techniques.

That seems like the most likely explanation.

> I was reading some past messages today and that reminded me to
> suggest that you try (if you haven't already) the x-use_bigrams
> option.  At least some people have found that it's better at
> nailing short spams (although maybe not quite as good at some of
> the more 'talky' spams).  Testing and developer experience (I'm not
> sure if any users have turned the option on) does indicate that
> it's a win overall.

Since I now have a nifty set of ten buckets, I'm glad to try out
other folks' Fabulously Clever Ideas. Here's the result:

normals.txt -> bigramss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.1 to 0.1 tied          

false negative percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied          
mean fn % went from 0.1 to 0.1 tied          

ham mean                     ham sdev
   0.27    0.28   +3.70%        3.13    2.97   -5.11%
   0.36    0.58  +61.11%        3.86    4.91  +27.20%
   0.68    0.92  +35.29%        7.28    8.16  +12.09%
   0.14    0.24  +71.43%        1.03    1.83  +77.67%
   0.31    0.30   -3.23%        2.53    2.78   +9.88%

ham mean and sdev for all runs
   0.35    0.46  +31.43%        4.13    4.71  +14.04%

spam mean                    spam sdev
  99.89   99.77   -0.12%        1.02    1.61  +57.84%
  99.74   99.89   +0.15%        2.99    1.29  -56.86%
  98.92   99.24   +0.32%        5.15    4.27  -17.09%
  98.37   98.38   +0.01%        9.43    8.39  -11.03%
  98.86   98.82   -0.04%        6.36    6.71   +5.50%

spam mean and sdev for all runs
  99.16   99.22   +0.06%        5.79    5.28   -8.81%

ham/spam mean difference: 98.81 98.76 -0.05

Alas, it seems that there's not much advantage there either. The only
classification difference seems to be that the number of unsures went
up by two.

Regards,
Matt


From mcclurgm at bellsouth.net  Thu Apr 15 00:21:59 2004
From: mcclurgm at bellsouth.net (Mark McClurg)
Date: Thu Apr 15 00:22:04 2004
Subject: [spambayes-dev] _pop3proxyspam.mbox on Desktop?
Message-ID: <407E0DE7.7060200@bellsouth.net>

I have just installed spambayes for use on XP with Outlook Express.  
I've run a few emails through for training, and all appears operational 
- I'm excited to have this program to decrease the spam I've been 
dealing with.

 I've one question though.  A file by the name of
_pop3proxyspam.mbox
has now shown up on my Desktop.

I don't see any reference to this file being placed in any directory, 
and I see no directory in the .ini file that I would modify to have this 
file written elsewhere.
Can someone explain what I can/should do - it's confusing with it 
visible on the Desktop.

Thanks!
Mark


From tameyer at ihug.co.nz  Thu Apr 15 02:21:45 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Apr 15 02:21:52 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>

> Since I now have a nifty set of ten buckets, I'm glad to try 
> out other folks' Fabulously Clever Ideas.

Always appreciated!  If you contribute nothing else to SpamBayes (and I'm
sure you will :) simply testing out other people's ideas and letting
everyone know the results helps a lot - especially since not many people
manage to get time to do this these days.  If you want to do more (it gets
addictive, trust me ;) there are all the current x- options...

> Here's the result:
[...]
> Alas, it seems that there's not much advantage there either. 
> The only classification difference seems to be that the 
> number of unsures went up by two.

I should have looked at your original cmp.py posting more closely (and have
now).  I think that you've hit the "Peters barrier", i.e. your results with
the defaults are so good that it's hard to measure whether any changes are
doing you any good or not.

Your defaults run only has one fp and one fn - to improve on this, the new
Fabulously Clever Idea would need to directly target those two messages
(without losing the rest).  Unless the improvement is all in the unsures -
since cmp.py output doesn't mention them, I can't tell how many there are in
the defaults; maybe this is where the room to improve is.  (If you still
have the rates.py output around, could you post a table.py for the defaults,
dns and bigrams outputs?)

If you run "fpfn.py ratespyoutputs.txt" (with the appropriate rates.py
output file) it'll spit out a list of the fp's and fn's (all two of them ;)
for that test.  It'd be worth taking a look at these two messages and seeing
what they are.  It might be that they are basically impossible to get right
- for example, a message from someone you've never had mail from before
quoting a spam with a single line addition - that's very difficult to
classify as ham without getting a lot of fn's, too.

=Tony Meyer


From skip at pobox.com  Thu Apr 15 10:51:35 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Apr 15 10:52:05 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1081990843.72.559@mint-julep.mondoinfo.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz>
	<1081990843.72.559@mint-julep.mondoinfo.com>
Message-ID: <16510.41335.349259.911009@montanaro.dyndns.org>


    Matt> It may also be that sending nonsense spams is a new tactic among
    Matt> spammers (born of the success of SpamBayes of course) and testing
    Matt> against spam even a month old won't show much advantage. I was
    Matt> certainly motivated to try the url-ip thing because of the unsures
    Matt> I had seen in the previous week or so.

My guess is that for the most part spammers need to move their websites only
somewhat less often than they need to move mail hosts.  If they are
connected to the web via a more-or-less respectable ISP they probably get
shut out pretty quickly.  Accordingly, month-old IP addresses may indeed not
be worth much.

Motivated mostly by my desire to keep my database size small, I routinely
(every few weeks) sort my ham and spam databases by date and whack off the
oldest 5% to 20% of the messages they contain.  This may have the side
effect of improving the IP address sensitivity.
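
The pruning step described above could be sketched roughly as follows.
This is only an illustration, not Skip's actual procedure: the
(date header, payload) pair layout and the function name prune_oldest
are assumptions made for the example.

```python
from email.utils import parsedate_to_datetime

def prune_oldest(messages, fraction):
    """Sort messages by date and drop the oldest `fraction` of them.

    `messages` is a list of (date_header, payload) pairs; that layout
    is an assumption for this sketch, not a SpamBayes data structure.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    ordered = sorted(messages, key=lambda m: parsedate_to_datetime(m[0]))
    drop = int(len(ordered) * fraction)  # number of oldest messages to discard
    return ordered[drop:]

# Hypothetical corpus of five messages; dropping 20% removes the oldest one.
corpus = [
    ("Thu, 15 Apr 2004 10:51:35 -0500", "newest"),
    ("Mon, 01 Mar 2004 08:00:00 -0500", "oldest"),
    ("Fri, 02 Apr 2004 09:30:00 -0500", "middle"),
    ("Wed, 10 Mar 2004 12:00:00 -0500", "second oldest"),
    ("Sat, 10 Apr 2004 18:45:00 -0500", "second newest"),
]
kept = prune_oldest(corpus, 0.20)
print([payload for _, payload in kept])
```

The remaining messages would then be used to retrain from scratch, so
that month-old clues stop contributing to the scores.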

Skip

From skip at pobox.com  Thu Apr 15 12:01:56 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Apr 15 12:02:09 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>
Message-ID: <16510.45556.523672.814107@montanaro.dyndns.org>


    >> Since I now have a nifty set of ten buckets, I'm glad to try out
    >> other folks' Fabulously Clever Ideas.

    Tony> Always appreciated!  If you contribute nothing else to SpamBayes
    Tony> (and I'm sure you will :) simply testing out other people's ideas
    Tony> and letting everyone know the results helps a lot - especially
    Tony> since not many people manage to get time to do this these days.

One thing I think we need to be careful of is using test data sets whose
messages are too old.  It's apparent the spammers are a moving target, so
what worked one or six months ago (or perhaps even a week ago) may not work
as well today.

Skip

From matt at mondoinfo.com  Thu Apr 15 14:48:12 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Thu Apr 15 15:00:06 2004
Subject: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>
Message-ID: <1082053205.04.641@mint-julep.mondoinfo.com>

[Tony Meyer]
> Your defaults run only has one fp and one fn - to improve on this,
> the new Fabulously Clever Idea would need to directly target those
> two messages (without losing the rest).  Unless the improvement is
> all in the unsures - since cmp.py output doesn't mention them, I
> can't tell how many there are in the defaults; maybe this is where
> the room to improve is.

There is some room to improve the unsures. With the defaults, I get
27 unsures out of 1000 messages.

> (If you still have the rates.py output around, could you post a
> table.py for the defaults, dns and bigrams outputs?)

Here you go:

filename:       normal     bigrams         dns
ham:spam:    1000:1000   1000:1000   1000:1000
fp total:            1           1           1
fp %:             0.10        0.10        0.10
fn total:            1           1           1
fn %:             0.10        0.10        0.10
unsure t:           27          29          26
unsure %:         1.35        1.45        1.30
real cost:      $16.40      $16.80      $16.20
best cost:      $10.20      $11.60       $9.60
h mean:           0.35        0.46        0.32
h sdev:           4.13        4.71        3.97
s mean:          99.16       99.22       99.16
s sdev:           5.79        5.28        5.79
mean diff:       98.81       98.76       98.84
k:                9.96        9.89       10.13
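
The "real cost" row is consistent with a simple weighted sum: $10 per
false positive, $1 per false negative and $0.20 per unsure. Those
weights are inferred from the figures in the table rather than quoted
from table.py, so treat them as an assumption. A minimal sketch:

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a test run; the weights are inferred, not official."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# (fp, fn, unsure) counts taken from the table above.
runs = {"normal": (1, 1, 27), "bigrams": (1, 1, 29), "dns": (1, 1, 26)}
for name, counts in runs.items():
    print("%-8s $%.2f" % (name, real_cost(*counts)))
```

With one fp and one fn in every column, the cost differences between
the three runs come entirely from the unsure counts.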

> If you run "fpfn.py ratespyoutputs.txt" (with the appropriate
> rates.py output file) it'll spit out a list of the fp's and fn's
> (all two of them ;) for that test.  It'd be worth taking a look at
> these two messages and seeing what they are.  It might be that they
> are basically impossible to get right - for example, a message from
> someone you've never had mail from before quoting a spam with a
> single line addition - that's very difficult to classify as ham
> without getting a lot of fn's, too.

The false positive is one I ran into in real life. It's a
confirmation of an order for a pair of headphones. There are lots of
spammy words in it and I don't think I have much other ham from that
company or on that subject. The false negative is harder to explain.
The subject is "Help your employees avoid heat-related illnesses".
It's not the most traditional sort of spam since it doesn't ask me to
buy anything now. Scoring it against my normal database, it gets
0.789. Judging from the evidence reported, it seems that's because I
live in Minneapolis and talk about the weather a lot <22 winks
celsius>.

Regards,
Matt


From matt at mondoinfo.com  Thu Apr 15 21:37:26 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Thu Apr 15 21:48:35 2004
Subject: [spambayes-dev] New results for DNS lookup in tokenizer
In-Reply-To: <1082053205.04.641@mint-julep.mondoinfo.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz>
	<1082053205.04.641@mint-julep.mondoinfo.com>
Message-ID: <1082078348.2.1077@mint-julep.mondoinfo.com>

It turns out that I was right when I speculated that using DNS
lookups would work better on more-recent spam. I re-did my spam sets
from the thousand most recent spams in my spam archive and got rather
better results:


new-pick-aparts.txt -> new-dnss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.1 to 0.1 tied          

false negative percentages
    0.500  0.000  won   -100.00%
    4.500  3.500  won    -22.22%
    1.000  0.500  won    -50.00%
    0.000  0.000  tied          
    3.000  2.500  won    -16.67%

won   4 times
tied  1 times
lost  0 times

total unique fn went from 18 to 13 won    -27.78%
mean fn % went from 1.8 to 1.3 won    -27.78%

ham mean                     ham sdev
   0.34    0.33   -2.94%        3.28    3.22   -1.83%
   0.14    0.14   +0.00%        1.39    1.36   -2.16%
   0.50    0.50   +0.00%        6.84    6.84   +0.00%
   0.48    0.31  -35.42%        3.75    2.10  -44.00%
   0.35    0.38   +8.57%        3.78    4.15   +9.79%

ham mean and sdev for all runs
   0.36    0.33   -8.33%        4.19    4.02   -4.06%

spam mean                    spam sdev
  98.00   98.49   +0.50%       10.43    8.53  -18.22%
  94.60   95.38   +0.82%       19.89   18.38   -7.59%
  97.52   97.96   +0.45%       11.40   10.63   -6.75%
  98.77   98.87   +0.10%        6.47    6.81   +5.26%
  94.78   95.38   +0.63%       18.47   17.51   -5.20%

spam mean and sdev for all runs
  96.73   97.22   +0.51%       14.37   13.33   -7.24%

ham/spam mean difference: 96.37 96.89 +0.52


In addition, unsures decreased some:


filename: new-pick-apart   new-dns
ham:spam:    1000:1000   1000:1000
fp total:            1           1
fp %:             0.10        0.10
fn total:           18          13
fn %:             1.80        1.30
unsure t:           46          40
unsure %:         2.30        2.00
real cost:      $37.20      $31.00
best cost:      $21.60      $19.80
h mean:           0.36        0.33
h sdev:           4.19        4.02
s mean:          96.73       97.22
s sdev:          14.37       13.33
mean diff:       96.37       96.89
k:                5.19        5.58


That's not an enormous win but it suggests that I probably am seeing
the improvement in my inbox that I think I'm seeing. And the
false-negatives that are eliminated are nonsense spams or spams with
lots of bland, unrelated text in them.
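
The percentages in the comparison output, such as the change in total
false negatives from 18 to 13, are plain relative differences. A small
sketch for reproducing them (the function name pct_change is made up
for this example; it is not part of cmp.py):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

print("fn total:  %+.2f%%" % pct_change(18, 13))
print("ham mean:  %+.2f%%" % pct_change(0.36, 0.33))
print("spam sdev: %+.2f%%" % pct_change(14.37, 13.33))
```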

It's very arguable that a technique that only works well on recent
spam shouldn't be included in SpamBayes until it has proven its value
over some time.

Regards,
Matt


From pekka.takala at pp.inet.fi  Fri Apr 16 09:00:46 2004
From: pekka.takala at pp.inet.fi (Pekka Takala)
Date: Fri Apr 16 08:59:49 2004
Subject: [spambayes-dev] Getting Mozilla (or Netscape) to work with
	pop3proxy (LINUX)
Message-ID: <407FD8FE.7030303@pp.inet.fi>

Here is a way to get Mozilla to work with pop3proxy for multiple users 
on the same machine. The pop3proxy software is started and stopped as 
needed, so in theory there is no limit on the number of users on the 
same machine.

The configuration needs a little patience, but it is easy to follow and 
can also be adapted to other mail clients, not just Mozilla or Netscape. 
It has been tested with Mozilla and Netscape.

1. Install the pop3proxy software normally as root to /usr/bin and 
configure it normally. Test that your setup works ABSOLUTELY NORMALLY 
when you start pop3proxy by hand and then stop it after reading your mail.

2. Locate the startup script of Mozilla. In Debian Linux systems it is 
/usr/bin/mozilla.

3. With your favorite editor, search for the line containing MOZ_PROGRAM. 
Comment it out as a reference, then make a new copy of the line.

On my system it looks like this after modification. The original line 
points to mozilla-bin, the new line to popmozilla.sh. The Netscape startup 
script works the same way, except that mozilla-bin is netscape-bin. DO NOT 
TOUCH THE REST OF THE FILE.

Remember that you need the path and name of the original binary when 
creating the popmozilla.sh

-----
MOZ_DIST_BIN="/usr/lib/mozilla"
#MOZ_PROGRAM="/usr/lib/mozilla/mozilla-bin"
MOZ_PROGRAM="/usr/lib/mozilla/popmozilla.sh"
MOZ_CLIENT_PROGRAM="/usr/lib/mozilla/mozilla-xremote-client"
-----

4. Save the file after modifications and go to the path where 
mozilla-bin is.

5. Create a new file popmozilla.sh and put this inside:

#!/bin/sh

# Check whether pop3proxy is already running; do not start it again
# if it is. The [s] keeps grep from matching its own command line.
if ps ax | grep -q "[s]b_server.py"; then
    echo "pop3proxy already running -> not starting"
else
    echo "Starting pop3proxy"
    /usr/bin/pop3proxy &
fi

# The mozilla binary; we start it here with all the arguments given by
# /usr/bin/mozilla.
/usr/lib/mozilla/mozilla-bin "$@"

# After mozilla quits, we stop the pop3proxy.

# Try SIGTERM first:
killall -15 sb_server.py
# Give it a moment to exit; if SIGTERM did not do it, SIGKILL will.
sleep 2
killall -9 sb_server.py

6. Save the file, then apply chmod 755 to it.

7. To test: start mozilla, then try to open localhost:8880 with mozilla. 
If the spambayes page comes up, pop3proxy started ok!

8. Shut down mozilla, then try ps ax | grep "[s]b_server.py" (the 
brackets keep grep from listing itself). If the process does not show, 
the script works. If it shows, something is wrong and you need to 
re-check your scripts.
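
The check in step 7 can also be done without a browser. The sketch
below only tests whether something accepts TCP connections on port 8880,
the port used in step 7; the function name proxy_ui_is_up is made up for
this example, and a successful connection does not prove that the
listener is really pop3proxy.

```python
import socket

def proxy_ui_is_up(host="localhost", port=8880, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False

if __name__ == "__main__":
    print("pop3proxy web interface reachable:", proxy_ui_is_up())
```

Run it once while mozilla is open and once after closing it; the answer
should change from True to False if popmozilla.sh is doing its job.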


This script allows multiple users to use pop3proxy, without fear of 
reading each other's private e-mails. The users only need to start 
mozilla, then configure it and that's all.

The "already running" test is there because mozilla can be started 
multiple times (e.g. when you click a link while reading news).

The script may not be full-featured, but at least it allows multiple 
users to read their e-mails without much trouble.

-- 
Pekka "Pihti" Takala
Nothing can be so bad that you cannot find something good in it!
65XXX assembler programmer/developer, linux user


From juntunen at well.com  Fri Apr 16 21:54:22 2004
From: juntunen at well.com (Thomas Juntunen)
Date: Fri Apr 16 21:56:40 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15
In-Reply-To: <E1BEVmy-00080R-0q@mail.python.org>
Message-ID: <r02010000-1033-266E536A901211D8BF83000393A6F1C8@[10.0.0.3]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/16/04, Skip Montanaro imposed order on a stream of electrons to say:

>One thing I think we need to be careful of is using test data sets whose
>messages are too old.  It's apparent the spammers are a moving target, so
>what worked one or six months ago (or perhaps even a week ago) may not work
>as well today.

You might be surprised. I've saved all the spam I've received since late 2000. In recent months, I set up SpamAssassin 2.6.1 with the default rules and ran everything (some 12K messages) through it. A guy named Terry Sullivan (who knows a _lot_ more about statistics than I do) analyzed them and presented some conclusions about spam volatility at the MIT conference this past January. He composed a summary article about it here:

http://www.qaqd.com/research/spam-e1.htm

The upshot was spam changes a lot more slowly than common thought suggests.


Thomas Juntunen

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBQICAPdFoei/9T3YdEQIf1QCgknpLGMgUAaQSChg+GNw3mL0feCoAoJEi
0CprW+cw1AISUFLI8qC0Jm3n
=lGJK
-----END PGP SIGNATURE-----

From skip at pobox.com  Fri Apr 16 23:52:53 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Apr 16 23:53:00 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15
In-Reply-To: <r02010000-1033-266E536A901211D8BF83000393A6F1C8@[10.0.0.3]>
References: <E1BEVmy-00080R-0q@mail.python.org>
	<r02010000-1033-266E536A901211D8BF83000393A6F1C8@[10.0.0.3]>
Message-ID: <16512.43541.10392.272029@montanaro.dyndns.org>


    >> One thing I think we need to be careful of is using test data sets
    >> whose messages are too old.  It's apparent the spammers are a moving
    >> target, so what worked one or six months ago (or perhaps even a week
    >> ago) may not work as well today.

    Thomas> .... A guy named Terry Sullivan (who knows a _lot_ more about
    Thomas> statistics than I do) analyzed [Thomas's data] and presented
    Thomas> some conclusions about spam volatility at the MIT conference
    Thomas> this past January. He composed a summary article about it here:

    Thomas> http://www.qaqd.com/research/spam-e1.htm

    Thomas> The upshot was spam changes a lot more slowly than common
    Thomas> thought suggests.

I'm not going to try and argue with statistics, however, if I understand the
summary article, it appears that two features in the principal component
analysis account for 86% of the properties of your data set and that all the
other features were indistinguishable from noise.  I don't know how 86%
relates to how much spam those two features would reliably detect,
especially in the presence of ham, but my guess is that it's much less than
the 99+% we need to have an effective spam filtering solution.  Looking at
how Spambayes has classified my mail since mid-December, I see 168k spams (~
60%), 87k hams (~ 31%) and 27k unsures (~ 10%).  If Spambayes was only
identifying 86% of the spams (does the PCA number imply that?), that would
be another 23k spams I'd have had to look at.  In addition, PCA doesn't seem
like it begins to address the issue of false positives and false negatives.
Who cares if it identifies 86% of the spams if it also erroneously
classifies 1% (to pick a number out of thin air) of the hams as spams?

It's clear that spammers try different things.  They have to move from one
mail host to another.  They have to cover their tracks by routing mail
through open relays.  They have to "infect" vulnerable machines to create
open relays for themselves.  They have to add hash busters.  They have to
disguise key words (like "v1@grA").  They have to gut their sales pitch and
just refer you to a URL.  They have to add word salad (both nonsense words
and real, but randomly chosen words).  They do this and lots of other stuff
to try and squeak as much spam past filters as they can.  I believe they
will continue to try other tricks.  One can hope that they are running out
of tricks to try, but I'm pessimistic.

Skip


From juntunen at well.com  Sat Apr 17 11:27:29 2004
From: juntunen at well.com (Thomas Juntunen)
Date: Sat Apr 17 11:27:57 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15
In-Reply-To: <16512.43541.10392.272029@montanaro.dyndns.org>
Message-ID: <r02010000-1033-BE13FCE4908311D89951000393A6F1C8@[10.0.0.3]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/16/04, Skip Montanaro imposed order on a stream of electrons to say:


>I don't know how 86% relates to how much spam those two features would
>reliably detect, especially in the presence of ham, but my guess is that it's
>much less than the 99+% we need to have an effective spam filtering solution.

Absolutely. I was trying to make the point that we've found spammers change their tactics much more slowly than is commonly assumed. FWIW, the single most common characteristic of my corpus, HTML/multipart with no other parts, would stop around 37% of spam by itself. If anything, this research says a two-stage system, simple SA or some such to stop the real grunge, then SB or some such to apply more selective filtering on a smaller inflow, should be workable.


>It's clear that spammers try different things.

Yep. I don't have a number handy, but consider that a message can only be munged in so many ways before it is undeliverable. The total useful permutations might be too large for a human to handle easily, but I'm betting not for a computer.


[snip description of spammer tricks]

>I believe they will continue to try other tricks.  One can hope that they are
>running out of tricks to try, but I'm pessimistic.

It's interesting you mention this. I can't say a whole lot right now, but Dr. Sullivan has devised an interesting technique that statistically looks at all the sorts of things you've mentioned. We looked at that stuff in order to try and pin down which spamware any particular spammer might be using, since all those tricks can be considered characteristics of spamware. We came to realize that they are also characteristics of the spammers themselves. Working in conjunction with some folks from Spamhaus, Sullivan is refining a technique to "fingerprint" particular spammers by their choices of URLs/domains, presentation, and so forth. This only works for spammers whose volume is high enough to overcome the "noise" inherent in email, but after letting his tool work through a corpus and group spam messages by sender, then manually checking these with WHOIS, dig and so forth, the tool is right a little over 50% of the time with no training whatsoever. I think he is planning to present something about this at some conference (CEAS?) this summer.


Anyway, all I wanted to try and make clear was there was statistical evidence that spam techniques change a lot more slowly than people usually assume. Not that this was some form of better filtering. In fact, I've been waiting for SpamBayes to get to at least a beta release so I can install it on my Apple laptop.

Thanks for the feedback!
Thomas Juntunen

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBQIE+jdFoei/9T3YdEQIQRQCgyzMCfviABf/wBKpNZId/Cw3z2xMAnikC
MTYEuD/Ri5tzgdbbNj0HPhO/
=xU9A
-----END PGP SIGNATURE-----

From sethg at GoodmanAssociates.com  Sat Apr 17 14:34:44 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sat Apr 17 14:34:46 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15
In-Reply-To: <r02010000-1033-266E536A901211D8BF83000393A6F1C8@[10.0.0.3]>
Message-ID: <MHEGIFHMACFNNIMMBACAAECCHMAA.sethg@GoodmanAssociates.com>

> From: Thomas Juntunen
> Sent: Friday, April 16, 2004 8:54 PM

<...>

> http://www.qaqd.com/research/spam-e1.htm

I would like to read this article, but the link redirects to a login page
that doesn't accept 'guest', 'anonymous' or an email address as a login.
Could you provide another link or send me a copy of the article?

From Skip's post, he mentioned principal component analysis as the technique
the author used.  If this is the same as the method by that name we use in
electrical engineering, this means decomposing a signal into a series of
Eigenvectors (orthogonal components), each with a length (the Eigenvalue)
that indicates the strength (electrical power) of that particular component.
You then throw away the components that are similar in size to those that
are known to be noise (completely random, no information content), leaving
what are called the principal components.  Under good conditions, the
principal components comprise _most_ of the information portion of the
signal, though it doesn't always come out that way.  This is but one of many
methods for breaking a signal down into orthogonal components and removing
noise.  The method has its pros and cons, which have a lot to do with the
nature of the signal and how much you know about it ahead of time.
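
As a rough illustration of the decomposition described above, here is a
small sketch that finds the strongest component of a data set and the
share of the total variance it carries (the kind of figure the "86%" in
this thread appears to be). It is not the method from the article: the
two-feature data is invented, and power iteration is used only because
it needs nothing beyond the standard library.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (one observation per row)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1)
             for j in range(dims)]
            for i in range(dims)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dominant_component(cov, iterations=200):
    """Largest Eigenvalue and its unit Eigenvector, by power iteration."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]  # arbitrary start vector
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
    return value, vec

# Hypothetical two-feature data lying close to the line y = 2x.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
cov = covariance_matrix(rows)
value, vec = dominant_component(cov)
total = sum(cov[i][i] for i in range(len(cov)))  # trace = total variance
explained = value / total
print("share of variance in the first component: %.4f" % explained)
print("direction: (%.3f, %.3f)" % (vec[0], vec[1]))
```

The sum of all Eigenvalues equals the trace of the covariance matrix,
which is why dividing by the trace gives the share of variance carried
by one component.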

I can think of several issues applying PC analysis to a text message instead
of a signal stream.  Since a text message can be parsed in different ways to
create a signal to do the Eigendecomposition on, results will depend on
whether you treat it as a bit stream, a character stream (with what
character length?) or a token stream (tokenized how?).  It would also be
possible to treat the SpamAssassin results as tokens and use only those to
create a token stream.

I need to read the article, but applying Eigendecomposition to a text
message raises a lot of questions for me.

--

Seth Goodman


From juntunen at well.com  Sun Apr 18 12:06:19 2004
From: juntunen at well.com (Thomas Juntunen)
Date: Sun Apr 18 12:07:07 2004
Subject: [spambayes-dev] URL Correction
In-Reply-To: <E1BEsI8-0002RR-Rs@mail.python.org>
Message-ID: <r02010000-1033-547BBD50915211D88988000393A6F1C8@[10.0.0.3]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey folks,

Anyone who wants to see that article on spam volatility by Terry Sullivan, be advised he has sent me a corrected URL. The other one pointed at a draft, this one is for the final version. Sorry about that!

http://www.qaqd.com/research/mit04sum.html


Thomas Juntunen

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBQIKZadFoei/9T3YdEQImCQCg31/OPjjfGv+0ayK92WqLdIFtuY0An24o
qvvkBFq5Vt9J7Vn+FKxPJlAE
=BUxZ
-----END PGP SIGNATURE-----

From juntunen at well.com  Sun Apr 18 15:23:33 2004
From: juntunen at well.com (Thomas Juntunen)
Date: Sun Apr 18 15:23:31 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 17
In-Reply-To: <E1BFEik-0005dT-Bl@mail.python.org>
Message-ID: <r02010000-1033-E21612A5916D11D89E95000393A6F1C8@[10.0.0.3]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/18/04, Seth Goodman imposed order on a stream of electrons to say:

>I can think of several issues applying PC analysis to a text message instead
>of a signal stream.  Since a text message can be parsed in different ways to
>create a signal to do the Eigendecomposition on, results will depend on
>whether you treat it as a bit stream, a character stream (with what
>character length?) or a token stream (tokenized how?).  It would also be
>possible to treat the SpamAssassin results as tokens and use only those to
>create a token stream.

So far as I understand it, Dr. Sullivan didn't analyze the message text or headers themselves, he looked at which SpamAssassin rules were triggered over time. So the triggered rules are the vectors in this case.


Thomas Juntunen

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBQILHo9Foei/9T3YdEQKgNwCg6cT33IzOO5zXawXu8Bsdh14HJ2QAn3dW
xAl1gEdAFiWxQP8z9dVgVdZ/
=q7r9
-----END PGP SIGNATURE-----

From tdickenson at geminidataloggers.com  Mon Apr 19 10:09:57 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Mon Apr 19 10:10:04 2004
Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with
	sb_bnserver.py
In-Reply-To: <200404191455.04662.tdickenson@geminidataloggers.com>
References: <1080424390.4065.24.camel@porsche.hq.simlog.com>
	<4083D158.8020809@videotron.ca>
	<200404191455.04662.tdickenson@geminidataloggers.com>
Message-ID: <200404191509.57948.tdickenson@geminidataloggers.com>

replies set to spambayes-dev@python.org

On Monday 19 April 2004 14:55, Toby Dickenson wrote:

> that strace log shows that unlink("/home/ricard/.sbbnsock-modeleT") fails
> with ENOENT, so the socket does not exist.
>
> But connect to that socket is failing with ECONNREFUSED. Thats strange.....

aha! your linux kernel must be 2.2.x ! right?

patch attached.
-- 
Toby Dickenson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: time_to_get_a_proper_unix.diff
Type: text/x-diff
Size: 1232 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/9730dde9/time_to_get_a_proper_unix.bin
From papaDoc at videotron.ca  Mon Apr 19 10:21:15 2004
From: papaDoc at videotron.ca (papaDoc)
Date: Mon Apr 19 10:21:31 2004
Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with
	sb_bnserver.py
In-Reply-To: <200404191509.57948.tdickenson@geminidataloggers.com>
References: <1080424390.4065.24.camel@porsche.hq.simlog.com>
	<4083D158.8020809@videotron.ca>
	<200404191455.04662.tdickenson@geminidataloggers.com>
	<200404191509.57948.tdickenson@geminidataloggers.com>
Message-ID: <4083E05B.4060008@videotron.ca>

Hi Toby,

>>that strace log shows that unlink("/home/ricard/.sbbnsock-modeleT") fails
>>with ENOENT, so the socket does not exist.
>>
>>But connect to that socket is failing with ECONNREFUSED. Thats strange.....
>>    
>>
>
>aha! your linux kernel must be 2.2.x ! right?
>  
>
Yes 2.2.20

I will apply the patch and let you know what happens.


Remi

-- 
/"\
\ /
 X   ASCII Ribbon Campaign
/ \  Against HTML Email


From gbrown at alumni.caltech.edu  Mon Apr 19 12:42:36 2004
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon Apr 19 12:42:46 2004
Subject: [spambayes-dev] =?iso-8859-1?q?FW=3A_Overnight_shipping_on_x=E3n?=
	=?iso-8859-1?q?ax=2Cval=EDum_and_more?=
Message-ID: <000a01c4262d$533c48a0$2301000a@Glenn>

Dear spambayes-dev:
 
Even when forwarded to myself, the SpamBayes Outlook plugin will not score
this message, delete it as spam, or show spam clues.  I've tried versions
0.80 and 1.0b1, and both have the same problem.  In 6 months, I've only seen
this one other time, which was yesterday, so I'm suspecting a new attack
triggering an internal SpamBayes failure.  I have no clue how they might be
doing this, but I bet you will find the problem intriguing.
 
I apologize for forwarding HTML email to this list, but I doubt plain text
will trigger the bug...
I'm going to be really embarrassed if the forwarded message does not exhibit
the same behaviour, but tests on my system assure me that it will, and if it
doesn't, your spam filter should prevent you from seeing this message. ;)  I
have not been able to rule out some problem specific to my system (like
maybe some database scalability limit, after ~24000 spam messages) because I
don't know any SpamBayes users I can forward this to, other than you.
 
Enjoy,
--Glenn

-----Original Message-----
From: aboveboard achilles [mailto:wazfapwjnojtk@utvinternet.ie] 
Sent: Monday, April 19, 2004 12:24 AM
To: Alan
Subject: Cc:Overnight shipping on xãnax,valíum and more


 <http://actinolite.myopen5045drugs.biz/g34/> 

abreact abundant acidulous abram acute abo acetylene abrasion abduct acs
acrylate abyssinia aborigine aback acrobatic aaa accelerate account
accusation abstain ace absence accompanist abstruse acm acculturate
acidulous actinide abscissae abbas acquire abram accusatory acanthus aching
abrupt absinthe abstention abysmal acceptor aboveground abduct across accra
achromatic abnormal abstention ac accredit access abbreviate aboriginal
abeyance acreage accrual acolyte acadia abbott accredit abbas 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/b93fec76/attachment.html
From papaDoc at videotron.ca  Mon Apr 19 13:17:09 2004
From: papaDoc at videotron.ca (Remi Ricard)
Date: Mon Apr 19 13:17:23 2004
Subject: [spambayes-dev] A new patch submitted on sourceforge (pychsum)
Message-ID: <40840995.3020307@videotron.ca>

Hi,

I submitted a new file (as a patch) on sourceforge.

This is a utility that generates a checksum of a spam email and compares 
it with the checksums of previous spam; if there is a match, you can 
flush the new spam. This file was created by Skip and sent to me by mail. 
(I hope Skip won't kill me, since I'm submitting this without asking for 
his permission.)
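
For readers who have not seen the file, the general idea can be sketched
like this. It is not Skip's code: the choice of SHA-1, the whitespace
and case normalization, and the names body_checksum and is_repeat are
all assumptions made for the illustration.

```python
import hashlib
from email import message_from_string

def body_checksum(raw_message):
    """Checksum of the message body, ignoring headers, case and spacing."""
    msg = message_from_string(raw_message)
    # Collect the text of every non-multipart part; headers are skipped.
    parts = [str(p.get_payload()) for p in msg.walk() if not p.is_multipart()]
    text = " ".join(" ".join(parts).split()).lower()
    return hashlib.sha1(text.encode("utf-8", "replace")).hexdigest()

def is_repeat(raw_message, seen):
    """True if a message with the same body checksum was seen before."""
    digest = body_checksum(raw_message)
    if digest in seen:
        return True
    seen.add(digest)
    return False

# Two hypothetical spams with different headers but the same body.
first = "From: a@example.com\nSubject: hi\n\nBuy   now!\n"
second = "From: b@example.com\nSubject: hello\n\nbuy now!\n"
seen = set()
print(is_repeat(first, seen), is_repeat(second, seen))
```

An exact checksum only catches verbatim repeats; spam with random hash
busters in the body would need a fuzzier comparison.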


Remi


P.S. I submitted this file since I had trouble finding the correct, 
latest version on my PCs.

From papaDoc at videotron.ca  Mon Apr 19 14:27:02 2004
From: papaDoc at videotron.ca (Remi Ricard)
Date: Mon Apr 19 14:27:07 2004
Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with
	sb_bnserver.py
Message-ID: <408419F6.20700@videotron.ca>

Hi Toby,

> aha! your linux kernel must be 2.2.x ! right?
>
> patch attached.  
>
< cut the patch (see previous message to get the patch)>

It is working with the patch!

I needed to set my path so that sb_bnfilter could find sb_bnserver.py,
and now it works.

Here is the timing difference between sb_filter and sb_bnfilter:

/gmc/logiciels/spambayes/scripts$ time for i in {1,2,3,4,5}; do echo 
"Running $i"; cat ~/Tmp/mail.eml  | 
/gmc/logiciels/spambayes/scripts/sb_filter.py -d 
~/Tmp/spambayes.statistic.db; done
real    1m38.150s
user    1m33.270s
sys     0m3.990s


/gmc/logiciels/spambayes/scripts$ time for i in {1,2,3,4,5}; do echo 
"Running $i"; cat ~/Tmp/mail.eml  | 
/gmc/logiciels/spambayes/scripts/sb_bnfilter.py -d 
~/Tmp/spambayes.statistic.db; done
real    0m46.275s
user    0m9.160s
sys     0m1.300s


So thanks to all of you.

From kennypitt at hotmail.com  Mon Apr 19 14:27:38 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 19 14:29:31 2004
Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?=
	=?iso-8859-1?Q?x=2Cval=EDum_and_more?=
In-Reply-To: <000a01c4262d$533c48a0$2301000a@Glenn>
Message-ID: <BAY16-DAV64ArGvnCjn00028462@hotmail.com>

This message scored fine on my SpamBayes, and produced a spam score of
0.7% thanks to the wonders of SpamBayes mailing list ham clues.  I'm
running from latest CVS source, which is basically equivalent to 1.0b1,
on Outlook 2003.  Here's the top portion of the "Show spam clues".
Could you attach log files from when you encountered this error?  The
Troubleshooting Guide under SpamBayes / Help / Troubleshooting Guide
will tell you how to find them.
 

Combined Score: 1% (0.00738917)

Internal ham score (*H*): 0.985378
Internal spam score (*S*): 0.000156295

# ham trained on: 103
# spam trained on: 169


86 Significant Tokens

token                               spamprob         #ham  #spam

'plain'                             0.0505618           4      0

'problem.'                          0.0505618           4      0

'seeing'                            0.0505618           4      0

'versions'                          0.0505618           4      0

'filter'                            0.0652174           3      0

'forwarded'                         0.0652174           3      0

'message-----'                      0.0652174           3      0

'alan'                              0.0918367           2      0

'plugin'                            0.0918367           2      0

'subject:dev'                       0.0918367           2      0

'to:addr:spambayes-dev'             0.0918367           2      0

'header:Importance:1'               0.0941901          37      6

'spambayes'                         0.134245            9      2

'subject:spambayes'                 0.135953            5      1

'html'                              0.148058            8      2

'assure'                            0.155172            1      0

'attack'                            0.155172            1      0

'limit,'                            0.155172            1      0

'myself,'                           0.155172            1      0

'received:edu'                      0.155172            1      0

'sender:addr:spambayes-dev-bounces' 0.155172            1      0

'spam,'                             0.155172            1      0

'to,'                               0.155172            1      0

'tried'                             0.162588            4      1

'skip:- 10'                         0.165055            7      2

'outlook'                           0.166134           10      3

'monday,'                           0.202339            3      1

'tests'                             0.202339            3      1

'spam'                              0.23915            14      7

'does'                              0.24132            10      5

'text'                              0.246248            6      3

'across'                            0.252149            4      2

'sent:'                             0.252149            4      2

'specific'                          0.252149            4      2

'really'                            0.260641            9      5

"i've"                              0.267806            7      4

'internal'                          0.268313            2      1

'message.'                          0.268313            2      1

"i'm"                               0.272039           15      9

'x-mailer:microsoft outlook, build 10.0.4510' 0.280132      5      3

'users'                             0.295068            9      6

'problem'                           0.29801             6      4

'from:'                             0.310408            7      5

'cc:2**0'                           0.324958            4      3

'message'                           0.336556           17     14

'database'                          0.343235            6      5

'doing'                             0.343235            6      5

'subject:'                          0.343235            6      5

'maybe'                             0.348391            7      6

'new'                               0.364136           31     29

'subject:: '                        0.364685           15     14

'header:Errors-To:1'                0.371938           31     30

'sender:no real name:2**0'          0.379611           29     29

'should'                            0.380359           16     16

'to:'                               0.380741           13     13

'able'                              0.381108           11     11

'list,'                             0.387141            3      3

'time,'                             0.387141            3      3

'score'                             0.390945            2      2

'after'                             0.395493           15     16

'sender:addr:python.org'            0.396461           27     29

'to:addr:python.org'                0.396461           27     29

'2004'                              0.399493           12     13

'you.'                              0.616696            6     16

'proto:http'                        0.622724           52    141

'april'                             0.631635            1      3

'subject:and'                       0.638645            2      6

'dear'                              0.643221            5     15

'even'                              0.643748            6     18

'account'                           0.671001            5     17

'shipping'                          0.67222             2      7

'doubt'                             0.691855            1      4

'rule'                              0.691855            1      4

'this,'                             0.691855            1      4

'abbreviate'                        0.844828            0      1

'abundant'                          0.844828            0      1

'acetylene'                         0.844828            0      1

'failure.'                          0.844828            0      1

'forwarding'                        0.844828            0      1

'subject:shipping'                  0.844828            0      1

'subject:\xed'                      0.844828            0      1

'trigger'                           0.844828            0      1

'accelerate'                        0.908163            0      2

'apologize'                         0.908163            0      2

'subjectcharset:iso-8859-1'         0.958716            0      5

'url:biz'                           0.985437            0     15
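
The figures in the report above can be checked by hand. The following is a minimal sketch, not the SpamBayes source: it assumes the classifier's default smoothing settings (an unknown-word strength of 0.45 and an unknown-word probability of 0.5, neither of which appears in the report) and the chi-squared combining rule score = (S - H + 1) / 2. The function names are made up for illustration.

```python
def token_spamprob(ham_count, spam_count, nham, nspam,
                   strength=0.45, prior=0.5):
    """Estimate one token's spam probability from its training counts.

    Assumes the token was seen at least once (otherwise the raw ratio
    below would divide by zero).
    """
    ham_ratio = ham_count / nham        # fraction of ham containing the token
    spam_ratio = spam_count / nspam     # fraction of spam containing the token
    raw = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    # Pull the raw estimate toward the prior when there is little evidence.
    return (strength * prior + n * raw) / (strength + n)


def combined_score(ham_evidence, spam_evidence):
    """Combine the internal *H* and *S* values into the final score."""
    return (spam_evidence - ham_evidence + 1.0) / 2.0


# Trained on 103 ham and 169 spam, as in the report.
plain = token_spamprob(4, 0, 103, 169)        # the 'plain' row
abbreviate = token_spamprob(0, 1, 103, 169)   # the 'abbreviate' row
final = combined_score(0.985378, 0.000156295)
print(plain, abbreviate, final)
```

Under those assumptions the three values come out close to the 0.0505618, 0.844828 and 0.00738917 shown in the report, which suggests the message was scored normally on this machine.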


  _____  

From: spambayes-dev-bounces@python.org
[mailto:spambayes-dev-bounces@python.org] On Behalf Of Glenn Brown
Sent: Monday, April 19, 2004 12:43 PM
To: spambayes-dev@python.org
Subject: [spambayes-dev] FW: Overnight shipping on xãnax,valíum and more


Even when forwarded to myself, the Spambayes' Outlook plugin will not
score this message, delete it as spam, or show spam clues.  I've tried
versions 0.80 and 1.0b1, and both have the same problem.  In 6 months,
I've only seen this one other time, which was yesterday, so I'm
suspecting a new attack triggering an internal Spambayes failure.  I
have no clue how they might be doing this, but I bet you will find the
problem intriguing.
 
I apologize for forwarding HTML email to this list, but I doubt plain
text will trigger the bug...
I'm going to be really embarrassed if the forwarded message does not
exhibit the same behaviour, but tests on my system assure me that it
will, and if it doesn't, your spam filter should prevent you from seeing
this message. ;)
 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/a60f83ec/attachment-0001.html
From pekka.takala at pp.inet.fi  Tue Apr 20 04:38:33 2004
From: pekka.takala at pp.inet.fi (Pekka Takala)
Date: Tue Apr 20 04:37:49 2004
Subject: [spambayes-dev] re: Getting Mozilla (or Netscape) to work with
	pop3proxy (LINUX)
Message-ID: <4084E189.7070405@pp.inet.fi>

And nothing is good without bug support:

Sometimes ps ax |grep -q sb_server.py finds itself (the grep process) 
even though sb_server is not actually running. Changing the line to read

ps ax |grep -v grep | grep -q sb_server.py

fixes the problem.
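
The same pitfall appears if the check is scripted in Python instead of the shell. A small sketch of why the grep process has to be excluded; the ps output lines and the helper name are invented for illustration:

```python
def server_is_running(ps_lines, name="sb_server.py"):
    """Return True if some line mentions `name` and is not a grep command."""
    for line in ps_lines:
        if name in line and "grep" not in line:
            return True
    return False


# Made-up "ps ax" output: only the grep that is searching for the server.
only_grep = ["  812 pts/0  S+  0:00 grep -q sb_server.py"]
# The same output plus a real server process.
with_server = only_grep + ["  640 ?      S   0:03 python /usr/bin/sb_server.py"]

print(server_is_running(only_grep), server_is_running(with_server))
```

Without the "grep" test, the first list would be reported as a running server, which is the false positive described above.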
-- 
Pekka "Pihti" Takala
Nothing can be so bad that you cannot find something good in it!
65XXX assembler programmer/developer, linux user


From mhammond at skippinet.com.au  Tue Apr 20 19:18:03 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 20 19:18:27 2004
Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?=
	=?iso-8859-1?Q?x=2Cval=EDum_and_more?=
In-Reply-To: <BAY16-DAV64ArGvnCjn00028462@hotmail.com>
Message-ID: <13be01c4272d$bb978ee0$0200a8c0@eden>

>> Even when forwarded to myself, the Spambayes' Outlook plugin will not
>> score this message, delete it as spam, or show spam clues.  I've
>> tried versions 0.80 and 1.0b1,

When you say not delete or show spam clues, what exactly happens?  Do you
get a message that no filterable items are selected?

Kenny:
> This message scored fine on my SpamBayes, and produced a
> spam score of 0.7% thanks to the wonders of SpamBayes mailing
> list ham clues.  I'm running from latest CVS source, which

The problem could be in the mail as received by Glenn, but once forwarded on,
it works again (as Outlook inserts what is missing).  That is, the function
IsFilterCandidate() in msgstore.py is telling us not to filter the message.

Glenn - if you can work out how, can you see if you can get "dump_props.exe"
to find the message.  It should be a matter of opening a command-prompt,
running:

dump_props.exe Overnight shipping > dump.txt

and hopefully dump.txt will be created with information on all messages in
your inbox with "Overnight shipping" in the subject.  If you can get the
information on the message out, please open a bug at sourceforge, assigning
it to me, and attaching the output.

Thanks,

Mark.


From kennypitt at hotmail.com  Wed Apr 21 10:19:54 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Apr 21 10:21:36 2004
Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?=
	=?iso-8859-1?Q?x=2Cval=EDum_and_more?=
In-Reply-To: <13be01c4272d$bb978ee0$0200a8c0@eden>
Message-ID: <BAY16-DAV26sNQptgFr0002a722@hotmail.com>

Mark Hammond wrote:
>>> Even when forwarded to myself, the Spambayes' Outlook plugin will
>>> not score this message, delete it as spam, or show spam clues.  I've
>>> tried versions 0.80 and 1.0b1,
> 
> When you say not delete or show spam clues, what exactly happens?  Do
> you get a message that no filterable items are selected?
> 
> Kenny:
>> This message scored fine on my SpamBayes, and produced a
>> spam score of 0.7% thanks to the wonders of SpamBayes mailing
>> list ham clues.  I'm running from latest CVS source, which
> 
> The problem could be in the mail as received by Glenn, but once
> forward on, it again works (as Outlook inserts what is missing).  ie,
> the function IsFilterCandidate() in msgstore.py is telling us not to
> filter the message. 
> 
> Glenn - if you can work out how, can you see if you can get
> "dump_props.exe" to find the message.  It should be a matter of
> opening a command-prompt, running:
> 
> dump_props.exe Overnight shipping > dump.txt
> 
> and hopefully dump.txt will be created with information on all
> messages in your inbox with "Overnight shipping" in the subject.  If
> you can get the information on the message out, please open a bug at
> sourceforge, assigning it to me, and attaching the output.

Mark, attached is a logfile I got from Glenn that was not CC'd to the
list.  Maybe this will give you some more ideas as to what's going
wrong.  My best guess was that one of the tokens in the database is not
pickled correctly, possibly corrupted or maybe a leftover from one of
the foreign character problems we've had, and that this particular
message just happens to be one of the few that include that token.

-- 
Kenny Pitt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes1.log
Type: application/octet-stream
Size: 20576 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040421/9bebd489/spambayes1.obj
From tameyer at ihug.co.nz  Wed Apr 21 20:09:09 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 21 20:10:29 2004
Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?=
	=?iso-8859-1?Q?x=2Cval=EDum_and_more?=
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF637D@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA0@its-xchg4.massey.ac.nz>

> Mark, attached is a logfile I got from Glenn that was not 
> CC'd to the list.  Maybe this will give you some more ideas 
> as to what's going wrong.
[...]

I had something pretty similar to this myself just this morning (traceback
at end of message).  I believe that I caused it by force quitting Outlook
when it was busy working at something.  Not long afterwards I also got the
"Bayes database has X, message database has Y" error message.  You'll notice
that Glenn also does:

"""
Bayes database initialized with 14808 spam and 3729 good messages
*** - message database has 18536 messages - bayes has 18537 - something is
screwey
"""

Perhaps something similar is at fault here?  In any case, retraining does
look like the best option, and it solved the "buttons don't do anything"
problem for me, too.  (I have the two old databases if anyone really cares,
but it doesn't seem worth looking at).

There is a lot of training there, but on the plus side retraining will mean
things work faster, the imbalance can be addressed, and lots of us (IIRC)
are happy with sub-1000-message databases.  Plus the problem will be fixed
:)

(I'm just hoping this doesn't mean that I'm now channelling all the problems
that arise here... <wink>)

=Tony Meyer

"""
Traceback (most recent call last):
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 275,
in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 280,
in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 542,
in _invokeex_
    return func(*args)
  File "D:\spambayes\Outlook2000\addin.py", line 700, in OnClick
    TrainAsHam(msgstore_message, self.manager, save_db = False)
  File "D:\spambayes\Outlook2000\addin.py", line 142, in TrainAsHam
    if train.train_message(msgstore_message, False,
manager.classifier_data):
  File "D:\spambayes\Outlook2000\train.py", line 52, in train_message
    cdata.bayes.learn(tokenize(stream), is_spam)
  File "D:\spambayes\spambayes\classifier.py", line 273, in learn
    self._add_msg(wordstream, is_spam)
  File "D:\spambayes\spambayes\classifier.py", line 375, in _add_msg
    record = self._wordinfoget(word)
  File "D:\spambayes\spambayes\storage.py", line 261, in _wordinfoget
    r = self.db.get(word)
  File "C:\Python23\Lib\shelve.py", line 111, in get
    return self[key]
  File "C:\Python23\Lib\shelve.py", line 119, in __getitem__
    value = Unpickler(f).load()
cPickle.UnpicklingError: invalid load key, '
'.
"""
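
If a single damaged record is the cause, one way to look for it is to try loading every entry and note which keys fail. The following is a rough sketch for a current Python 3 (where cPickle has become pickle), using an in-memory shelf with one deliberately broken value; the helper name is hypothetical, and a real database would have to be opened with whatever storage layer created it.

```python
import pickle
import shelve


def find_unreadable_keys(shelf):
    """Return the keys whose stored values cannot be unpickled."""
    bad = []
    for key in list(shelf.keys()):
        try:
            shelf[key]                      # forces the value to be unpickled
        except (pickle.UnpicklingError, EOFError, AttributeError,
                ImportError, IndexError, ValueError):
            bad.append(key)
    return bad


# Demo: an in-memory shelf backed by a plain dict.
demo = shelve.Shelf({})
demo["good"] = (3, 1)
# Bypass the pickling step to plant a value that starts with a newline,
# similar to the "invalid load key" in the traceback above.
demo.dict[b"bad"] = b"\nnot a pickle"

unreadable = find_unreadable_keys(demo)
print(unreadable)
```

Only the planted key should be reported; on a real database the result would show whether one token or many are affected.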


From tameyer at ihug.co.nz  Wed Apr 21 20:54:42 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 21 20:55:43 2004
Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304830103@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA7@its-xchg4.massey.ac.nz>

(I think maybe I wasn't wrong about channelling other people's problems...
<wink>)

> Here's a weird one, w/ Outlook 2000 and the addin from 
> not-so-recent-anymore CVS.  I decided to start over from 
> scratch today, so have a new(Berkeley) DB.

This is with Outlook 2002 (SP2) and the addin also from
not-so-recent-anymore CVS.  I also started over (see the spambayes-dev
message) from scratch today with a new (Berkeley) DB.  Specifically, I just
trashed the old database files (while Outlook was closed) and started
training as things arrived in the unsure folder.

> It's taking the addin from 4 to 10 seconds to score each(!) 
> message.  That's whether it's new incoming email, or via the 
> "Filter messages ..." menu item, or via a single "Show spam 
> clues".  It's mind-numbingly slow.

And, no surprise since you've read this far, I found this too.  Bizarre.

> While a message is being scored, Outlook is unresponsive to 
> keyboard or mouse input, but the process is using very little 
> CPU (typically a fraction of a percent, with very brief 
> spikes).  So it's waiting on *something*, but don't know what.
> 
> Nothing odd in the PythonWin Trace Collector display.  Ran 
> scanpst on all the relevant .pst files -- no problems.  The 
> sloth persists after restarting Outlook, and after a reboot.  
> No other Outlook operations have slowed, just SpamBayes.

All of this applies to my experience as well, although I didn't try scanpst
(I don't know if I have it, and since it didn't do Tim any good, it probably
wouldn't have helped me anyway).

> Two hours later: 

Heh ;).  I didn't spend two hours on it, though.  I remembered Tim's message
and so after about 5 minutes just started with new db's again.

>  The sloth went away then, just as mysteriously and 
> dramatically as it appeared.  Outlook remained open the entire time:
> 
>     extremely slow
>     retrain on 5 new ham and 5 new spam from scratch
>     zippy again

I started afresh (from training 1 ham and 9 spam) also, but in the same way
as before - close Outlook, move aside slow db's and start Outlook again.
Also zippy once again.

> So no clues, just bizarre symptoms.  If it happens to you, 
> don't be an idiot like I just was:  save the .db file before 
> retraining the problem away (it's the only relevant thing I 
> can think of that changed).

Normally I would have done just that, but I recalled this message (me who
struggles to remember what I had for dinner last night! <wink>) and so have
it zipped away for analysis.

So, I offer it up to anyone interested in looking into it, or offer myself
up to spend time looking into it if someone can suggest ways of doing that.
I don't really know where to start.

=Tony Meyer


From tim.one at comcast.net  Wed Apr 21 22:01:13 2004
From: tim.one at comcast.net (Tim Peters)
Date: Wed Apr 21 22:01:19 2004
Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA7@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEOJKGAB.tim.one@comcast.net>

[Tim, from a while ago]
>> Here's a weird one, w/ Outlook 2000 and the addin from
>> not-so-recent-anymore CVS.  I decided to start over from
>> scratch today, so have a new (Berkeley) DB.

[Tony Meyer]
> This is with Outlook 2002 (SP2) and the addin also from
> not-so-recent-anymore CVS.  I also started over (see the spambayes-dev
> message) from scratch today with a new (Berkeley) DB.  Specifically,
> I just trashed the old database files (while Outlook was closed) and
> started training as things arrived in the unsure folder.

>> It's taking the addin from 4 to 10 seconds to score each(!)
>> message.  That's whether it's new incoming email, or via the
>> "Filter messages ..." menu item, or via a single "Show spam
>> clues".  It's mind-numbingly slow.

> And, no surprise since you've read this far, I found this too.
> Bizarre.

I should mention that it happened two more times for me after starting over
from scratch, with very few msgs trained on each time (certainly less than
50 total).  At that point I got a new box with a gigabyte of RAM, and
switched to using a giant pickled dict instead.  Much faster scoring, no
problems, but much slower Outlook startup time and incremental training
times.

>> While a message is being scored, Outlook is unresponsive to
>> keyboard or mouse input, but the process is using very little
>> CPU (typically a fraction of a percent, with very brief
>> spikes).  So it's waiting on *something*, but don't know what.
>>
>> Nothing odd in the PythonWin Trace Collector display.  Ran
>> scanpst on all the relevant .pst files -- no problems.  The
>> sloth persists after restarting Outlook, and after a reboot.
>> No other Outlook operations have slowed, just SpamBayes.

> All of this applies to my experience as well, although I didn't try
> scanpst (I don't know if I have it, and since it didn't do Tim any
> good, it probably wouldn't have helped me anyway).

Whenever you see reference to the "Inbox Repair Tool", it means scanpst.exe.
I'm amazed that MS continues to make this thing so hard to find:  .pst files
routinely get corrupted in minor and major ways by Outlook (whether or not
SpamBayes is installed), and scanpst.exe finds at least one problem in my
.pst files every day(!).  You have scanpst.exe, but you may have to search
your disk to find it.

> Heh ;).  I didn't spend two hours on it, though.  I remembered Tim's
> message and so after about 5 minutes just started with new db's again.

>> The sloth went away then, just as mysteriously and
>> dramatically as it appeared.  Outlook remained open the entire time:
>>
>>     extremely slow
>>     retrain on 5 new ham and 5 new spam from scratch
>>     zippy again

> I started afresh (from training 1 ham and 9 spam) also, but in the
> same way as before - close Outlook, move aside slow db's and start
> Outlook again. Also zippy once again.

>> So no clues, just bizarre symptoms.  If it happens to you,
>> don't be an idiot like I just was:  save the .db file before
>> retraining the problem away (it's the only relevant thing I
>> can think of that changed).

> Normally I would have done just that, but I recalled this message (me
> who struggles to remember what I had for dinner last night! <wink>)
> and so have it zipped away for analysis.
>
> So, I offer it up to anyone interested in looking into it, or offer
> myself up to spend time looking into it if someone can suggest ways
> of doing that. I don't really know where to start.

Since I moved to a giant pickled dict, I don't care anymore <0.5 wink>.  An
interesting experiment would be to open it directly from a non-SpamBayes
Python program, and just time lookups and inserts.

There was a disturbing Python bug report against bsddb that I closed as
hopeless:

    http://www.python.org/sf/881522

This was about a huge slowdown in shelve after several thousands of keys had
been added.  There were strong hints that the huge slowdown was specific to
the combination of:

    "a modern" bsddb (after the ancient 1.85)
    Windows
    the hash flavor of bsddb

There were also hints that the BTree flavor of bsddb was faster than the
hash flavor, independent of the mystery-slowdown in the hash flavor.

Since we experienced Amazing Sloth under different versions of Outlook, and
very different OSes, my top guess has to be that the fault is in the dbhash
flavor of bsddb.
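
The timing experiment suggested above could look roughly like this. It is a sketch under assumptions: it uses the standard shelve module on a scratch file, the key names and record values are invented, and a current Python may pick a different dbm backend than the bsddb hash format discussed here, so it would not necessarily open an old database file directly.

```python
import os
import shelve
import tempfile
import time


def time_inserts_and_lookups(path, count=1000):
    """Time `count` inserts followed by `count` lookups on a shelf at `path`."""
    keys = ["token%d" % i for i in range(count)]
    db = shelve.open(path)
    try:
        start = time.perf_counter()
        for key in keys:
            db[key] = (1, 0)                # made-up (spamcount, hamcount) record
        insert_seconds = time.perf_counter() - start

        start = time.perf_counter()
        for key in keys:
            db[key]
        lookup_seconds = time.perf_counter() - start
    finally:
        db.close()
    return insert_seconds, lookup_seconds


# Run against a scratch database; point `path` at a *copy* of a slow
# database instead, since the inserts modify the file.
with tempfile.TemporaryDirectory() as tmp:
    inserts, lookups = time_inserts_and_lookups(os.path.join(tmp, "scratch"))
    print("inserts: %.4fs  lookups: %.4fs" % (inserts, lookups))
```

Comparing the two numbers for a fresh file and for a copy of the slow one would show whether inserts, lookups, or both are affected.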


From tameyer at ihug.co.nz  Thu Apr 22 01:16:11 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Apr 22 01:16:34 2004
Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF6565@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CB3@its-xchg4.massey.ac.nz>

> I should mention that it happened two more times for me after
> starting over from scratch, with very few msgs trained on 
> each time (certainly less than 50 total).

Yay, something to look forward to <wink>.  I've managed to get up to 11 ham
and 11 spam without any problems, though.

[...]
> You have scanpst.exe, but you may
> have to search your disk to find it.

Indeed I did.  It was in C:\Program Files\Common Files\System\Mapi\1033.  Of
course.

> Since I moved to a giant pickled dict, I don't care anymore
> <0.5 wink>.

I suppose I do (assuming it may happen to me again), since I don't really
want to switch to a pickled dict, because I open and close Outlook
reasonably often, and have other uses for my (much smaller) memory.

> An interesting experiment would be to open it
> directly from a non-SpamBayes Python program, and just time 
> lookups and inserts.

Lookups don't appear to be affected at all, but inserts definitely are.
I've tried really simple (just multiple insertions) tests comparing a new
database, a database around the same size (which is about 5500 keys), the
slow database, and another Berkeley db with the same data (exporting the
slow one to text and then using that to create a new db) in case it was just
some quirk of entry order or the file itself.

There doesn't seem to be any difference between the dbs with the same data,
but they are 3 to 4 times slower than either the new db or the similarly
sized one.  This is with Python 2.3.3 and bsddb or Python 2.2.3 and bsddb3.

Playing around with creating dbs of the same size doesn't seem to be getting
me any closer to creating another database with this odd effect.  I realise
that you don't have time to look into this, but any chance you have further
suggestions about how I might investigate it?

> There was a disturbing Python bug report against bsddb that I
> closed as hopeless:
> 
>    http://www.python.org/sf/881522

I read this, and it does seem like it could be related, but I'm not sure how
to test that :)

=Tony Meyer


From skip at pobox.com  Fri Apr 23 10:59:55 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Apr 23 11:00:02 2004
Subject: [spambayes-dev] Interesting way to purge old msgs w/ t-t-e
Message-ID: <16521.12139.820834.72911@montanaro.dyndns.org>


I have been running train-to-exhaustion for awhile now and like it.  The
only persistent problem I've had to deal with is how to purge old data, that
is, what old messages to delete so my database doesn't grow without bound.
The solution popped into my brain the other day: use the new reversed()
builtin.  If indicated on the tte.py command line with the --reverse flag,
it sets up the mailbox iterators to march in reverse.  This gives more
weight to more recent messages.  Coupled with the --cullext flag it allows
me to easily purge old messages which aren't used in the actual training.
Startup for each testing round is delayed slightly, but that seems to be the
only negative side effect.
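
The idea can be illustrated in a few lines. This is only a sketch: the mailbox is a plain list, and the wants_training predicate is a placeholder for whatever tte.py actually uses to decide that a message needs training, which is not shown in this message.

```python
def train_newest_first(messages, wants_training):
    """Visit messages newest first; return (trained, never_used)."""
    trained = []
    for msg in reversed(messages):          # newest message comes first
        if wants_training(msg):
            trained.append(msg)
    # Anything never used for training is a candidate for culling.
    never_used = [m for m in messages if m not in trained]
    return trained, never_used


mailbox = ["old-1", "old-2", "new-1", "new-2"]      # oldest to newest
trained, never_used = train_newest_first(
    mailbox, lambda m: m.startswith("new"))         # placeholder predicate
print(trained, never_used)
```

Because the newest messages are seen first, they are the ones that end up in the training set, and the older ones are more likely to be left over for purging.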

Skip

From jkx at pythonfr.org  Sat Apr 24 17:00:56 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Sat Apr 24 16:59:55 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
Message-ID: <200404242300.56525.jkx@pythonfr.org>

First, hi to everyone :) 

This is my first post on this mailing list. I took a 
little time this afternoon to write this piece of 
code, and I want to know what others think about 
it. 


Extract from the code 

"""
SpamBayes server compliant w/ spamassassin

spamassassin can run as a daemon for a large
scale network (spamd). To use spamd,
spamassassin provides a short piece of C code
(spamc), written to be really efficient, to make
the glue between the MTA and spamd.

SpamBayes (which is my preferred spam filter)
doesn't provide this kind of thing. It comes w/
a Python xmlrpc client/server, but forking
a Python process for each incoming mail eats too
much CPU on a Linux box.

So this piece of code is a fake spamassassin
server that uses SpamBayes for filtering.

This version has been tested w/ version 2.63
of spamc

Take care, it doesn't support:
- SSL
- BSMTP

benchmark results w/ 600 mails w/ a ~ 650Kb
trained DB:
- procmail + sb_filter.py: 70 mails/min
- procmail + spamc + this:  206 mails/min
  (TCP or unix domain sockets achieve the same perf)

important notes:
- I didn't test other servers because I need
  something that works system-wide
- this doesn't support simultaneous access, so
  I achieve the same throughput w/ maildrop
  instead of procmail.
  The reason why I didn't write this is that
  I don't know what to do:
  thread / fork / async ???
- it supports filtering for virtual hosted mailboxes
  even if this is not the default behaviour
"""
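
On the open "thread / fork / async" question, one low-effort option is a thread per connection using the standard socketserver module. The sketch below is not the attached sb_global_server.py and does not speak the spamd protocol: the client simply sends a message, closes its sending side, and reads back one line. score_message is a placeholder for a real classifier call, and all names here are hypothetical.

```python
import socket
import socketserver
import threading


def score_message(text):
    """Placeholder: a real server would ask the classifier for a score."""
    return 0.5


class ScoreHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read until the client shuts down its sending side.
        data = self.rfile.read()
        score = score_message(data.decode("latin-1"))
        self.wfile.write(("%.4f\n" % score).encode("ascii"))


def start_server(host="127.0.0.1", port=0):
    """Start a thread-per-connection server; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), ScoreHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, message):
    """Send one message and return the server's one-line reply."""
    with socket.create_connection(address) as sock:
        sock.sendall(message)
        sock.shutdown(socket.SHUT_WR)       # signal end of message
        return sock.makefile("rb").read().decode("ascii").strip()


server = start_server()
reply = ask(server.server_address, b"Subject: test\n\nhello\n")
server.shutdown()
server.server_close()
print(reply)
```

With threads, the trained database can be loaded once and shared by all handlers, though access to it would need a lock if training happens while messages are being scored.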

Any comments / blame / flame .. are welcome :) 

ByeBye 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_global_server.py
Type: application/x-python
Size: 6309 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040424/8b6058e9/sb_global_server.bin
From skip at pobox.com  Sat Apr 24 17:38:21 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sat Apr 24 17:38:46 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404242300.56525.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
Message-ID: <16522.56909.879061.103902@montanaro.dyndns.org>

    jkx> spamassassin can run as a daemon for a large scale network
    jkx> (spamd). To use spamd, spamassassin provides a short piece of C
    jkx> code (spamc), written to be really efficient, to make the glue
    jkx> between the MTA and spamd.

    jkx> SpamBayes (which is my preferred spam filter) doesn't provide this
    jkx> kind of thing. It comes w/ a Python xmlrpc client/server, but
    jkx> forking a Python process for each incoming mail eats too much CPU
    jkx> on a Linux box.

    jkx> So this piece of code is a fake spamassassin server that uses
    jkx> SpamBayes for filtering.

Take a look at sb_bnfilter.py.  It's like spamc/spamd only better.  The
daemon (sb_bnserver.py) is forked automatically and quietly exits after a
few seconds of idle time.

sb_bnfilter.py has only recently been added to the Spambayes CVS repository.
In the latest distribution I think it still turns up in the contrib
directory, but I moved it to the scripts directory a week or two ago, so it
should get installed by default when the next distribution is released.

Skip


From jkx at pythonfr.org  Sat Apr 24 18:31:56 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Sat Apr 24 18:30:54 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <16522.56909.879061.103902@montanaro.dyndns.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<16522.56909.879061.103902@montanaro.dyndns.org>
Message-ID: <200404250031.56607.jkx@pythonfr.org>

On Saturday 24 April 2004 23:38, Skip Montanaro wrote:

>     jkx> So this piece of code is a fake spamassassin server that uses
>     jkx> SpamBayes for filtering.
>
> Take a look at sb_bnfilter.py.  It's like spamc/spamd only better.  The
> daemon (sb_bnserver.py) is forked automatically and quietly exits after a
> few seconds of idle time.


Yes, I understand your point, but my code tries to do something really
different.

sb_bnserver.py forks itself and waits for the user connection; this
way it caches the db parsing and all the hammie classes it needs to
work with.


but :
1) you still need to create a python process for every incoming mail
   (sb_bnfilter). And python, even if it isn't heavyweight bloat, eats
   something like 4.5 MB of memory instead of the mere 500 KB of spamc

2) sb_bnserver needs to be launched by the user (through sb_bnfilter),
    and it is written this way, so it isn't system-wide filtering.
    spamc has some useful stuff like round-robin filtering ..
    For example, if I need to dispatch a lot of mail to mailboxes (a mailing
    list for example), will it fork n servers for every user .. and so on ?

I think sb_bn* is pretty nice for a system w/ only a few mail accounts
and should perform very well for a burst of email dispatch for a single user,
like after a fetchmail... but this isn't my goal.


I admit that my code is a bit rough, as it only does the filtering, and doesn't
provide any way of caching the db, nor simultaneous access, but this is a first try ..


/ apologies for my poor English /

From tameyer at ihug.co.nz  Sat Apr 24 22:15:20 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sat Apr 24 22:16:21 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF6CB4@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CC7@its-xchg4.massey.ac.nz>

> sb_bnfilter.py has only recently been added to the Spambayes 
> CVS repository. In the latest distribution I think it still 
> turns up in the contrib directory, but I moved it to the 
> scripts directory a week or two ago, so it should get 
> installed by default when the next distribution is released.

FYI, the latest 1.0b1.1 source dists (both .zip and .tar.gz) include your
fix, moving them to the scripts directory (because otherwise the setup.py
script didn't manage to complete; my bad for the missed testing).

=Tony Meyer


From skip at pobox.com  Sat Apr 24 22:31:37 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sat Apr 24 22:31:44 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404250031.56607.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<16522.56909.879061.103902@montanaro.dyndns.org>
	<200404250031.56607.jkx@pythonfr.org>
Message-ID: <16523.8969.140499.186982@montanaro.dyndns.org>


    jkx> 1) you still need to create a python process for every incomming
    jkx>    mail sb_bnfilter. And python, even if it not a weight bloat,
    jkx>    python eat something like 4.5Mb of memory instead of the poor
    jkx>    500Ko of spamc

The sb_bnfilter/sb_bnserver combination runs several times faster on my
machine.  It would probably be faster if you recoded sb_bnfilter.py in C.
Feel free.

    jkx> 2) sb_bnserver need to be launch by the user (thought sb_bnfilter),
    jkx>    and it is written in this way, so it isn't system-wide filering.
    jkx>    spamc as some usefull stuff like round-robin filtering ..  For
    jkx>    example, if i need to dipatch a lot of mail in mailbox (mailing
    jkx>    list for example), for every user it will fork n servers .. and
    jkx>    so on ?

I don't recall that you said you wanted a single system-wide filter.
Spambayes isn't designed that way at any rate.  It will require some
significant effort.

    jkx> I think sb_bn* is pretty nice for a system w/ only few mail
    jkx> accounts and should performs very for bursting email dispatch for a
    jkx> single user like after a fetchmail... but this isn't my goal.

Some folks have experimented with using Spambayes for system-wide filtering.
I don't know that anybody's produced any conclusive results.

That said, one approach might be to rework sb_bnserver.py to open several
unix domain sockets (one per user) and listen on all of them.  When a
connection is made on a socket spin off a new thread to handle it and use
that user's database to score the message.  If the user doesn't have a
database of their own, default to a general database.
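
A rough sketch of that approach, in case it helps the discussion.  This is
not code from sb_bnserver.py: the socket naming scheme, the score callable
and every function name below are assumptions made for illustration.

```python
import os
import selectors
import socket
import threading


def database_for(user, user_dbs, general_db):
    """Return the user's own database, or the general one as a fallback."""
    return user_dbs.get(user, general_db)


def open_listeners(socket_dir, users):
    """Bind one Unix domain socket per user and register it with a selector."""
    sel = selectors.DefaultSelector()
    for user in users:
        # Hypothetical naming scheme: <socket_dir>/<user>.sock
        path = os.path.join(socket_dir, user + ".sock")
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen()
        sel.register(listener, selectors.EVENT_READ, data=user)
    return sel


def handle(conn, user, score):
    """Read one whole message from the connection and write back its score."""
    with conn:
        chunks = []
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
        # score is a placeholder for the real classifier call.
        conn.sendall(score(user, b"".join(chunks)).encode("ascii"))


def serve_once(sel, score, timeout=None):
    """Accept each pending connection and hand it to a new thread."""
    threads = []
    for key, _ in sel.select(timeout):
        conn, _ = key.fileobj.accept()
        worker = threading.Thread(target=handle, args=(conn, key.data, score))
        worker.start()
        threads.append(worker)
    return threads
```

A real server would call serve_once in a loop.  Note that each listening
socket holds one file descriptor for the life of the process, so the number
of users is bounded by the per-process descriptor limit.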

Once you have that working, you can rewrite sb_bnfilter.py in C to reduce
memory consumption and maybe improve performance a bit.  sb_bnserver.py
could probably be sped up just by running it with psyco.

Skip

From jkx at pythonfr.org  Sat Apr 24 23:22:11 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Sat Apr 24 23:21:08 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <16523.8969.140499.186982@montanaro.dyndns.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404250031.56607.jkx@pythonfr.org>
	<16523.8969.140499.186982@montanaro.dyndns.org>
Message-ID: <200404250522.11544.jkx@pythonfr.org>


On Sunday 25 April 2004 04:31, Skip Montanaro wrote:
>     jkx> 1) you still need to create a python process for every incomming
>     jkx>    mail sb_bnfilter. And python, even if it not a weight bloat,
>     jkx>    python eat something like 4.5Mb of memory instead of the poor
>     jkx>    500Ko of spamc
>
> The sb_bnfilter/sb_bnserver combination runs several times faster on my
> machine.  It would probably be faster if you recoded sb_bnfilter.py in C.
> Feel free.

Faster than what ?
- sb_filter ?
- spamc + the code attached to my previous email ?

But why should I rewrite sb_bnfilter in C, since sb_bnserver doesn't
fit my needs ?


>     jkx> 2) sb_bnserver need to be launch by the user (thought sb_bnfilter),
>     jkx>    and it is written in this way, so it isn't system-wide filering.
>     jkx>    spamc as some usefull stuff like round-robin filtering ..  For
>     jkx>    example, if i need to dipatch a lot of mail in mailbox (mailing
>     jkx>    list for example), for every user it will fork n servers .. and
>     jkx>    so on ?
> 
> I don't recall that you said you wanted a single system-wide filter.
> Spambayes isn't designed that way at any rate.  It will require some
> significant effort.

Where is the significant effort ?
I must be missing something. Have you read the code I provided ?
It just serves as 1 single server (hammie filter) for a large number
of users. But all have their own database.
- one and only one server (not one per user !)
- every user has its own db


>     jkx> I think sb_bn* is pretty nice for a system w/ only few mail
>     jkx> accounts and should performs very for bursting email dispatch for a
>     jkx> single user like after a fetchmail... but this isn't my goal.
>
> Some folks have experimented with using Spambayes for system-wide
> filtering. I don't know that anybody's produced any conclusive results.

What you mean by system-wide filtering is: using the same hammie
filter database for all the users.

Once more .. this is not what my code is for.

My code tries to address these problems:
- spawning a python at each incoming mail (spamc)
- having one daemon (or more) per user .

> That said, one approach might be to rework sb_bnserver.py to open several
> unix domain sockets (one per user) and listen on all of them.  When a
> connection is made on a socket spin off a new thread to handle it and use
> that user's database to score the message.  If the user doesn't have a
> database of their own, default to a general database.

Do you really want to open one Unix domain socket per user ?????
I usually work w/ about 50 users right now ..
( and I wrote this code to work on ~ 1000 accounts .. ).

Another thing, I don't care about a 'general database' .. this isn't the goal;
I want a system manageable for a large number of users..


> Once you have that working, you can rewrite sb_bnfilter.py in C to reduce
> memory consumption and maybe improve performance a bit.  sb_bnserver.py
> could probably be sped up just by running it with psyco.

psyco has nothing to do with that. The trouble is 'exec a python' at each email;
this is a bad idea. That's why I use code ripped from spamassassin, because
1) it is really efficient code
2) it is quite clear code (despite too many gotos)
3) it is system-wide:
    - it uses syslogd
    - it handles errors (you don't lose mails w/ it)
    - it has round-robin capabilities ..
    - and so on ..


I wrote this for sys-admins who want to run spambayes for a large number
of users, and who want to easily manage the way mails are filtered ..
- only one spambayes server
- all incoming mails are sent (through spamc) to this server
- and every user uses their own hammie database in their home.

So even if the server fails for some strange reason, mails aren't lost .. (spamc
does that perfectly )


Bye Bye 

From skip at pobox.com  Sun Apr 25 00:32:32 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Apr 25 00:32:37 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404250522.11544.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404250031.56607.jkx@pythonfr.org>
	<16523.8969.140499.186982@montanaro.dyndns.org>
	<200404250522.11544.jkx@pythonfr.org>
Message-ID: <16523.16224.232593.68870@montanaro.dyndns.org>


    jkx> Where significant effort ? 

    jkx> I really miss something. Have you read the code i provided ?  It
    jkx> just serve as 1 single server (hammie filter) for a large number of
    jkx> users. But all have their own database.
    jkx> - one and only one server (not one per user !)
    jkx> - every user have its own db 

No, I admit I didn't read your code.  I read your mail message and must have
not fully understood what you were after.  My apologies.

    jkx> Do  you really want to open one UnixDomain socket per user ????? 

Sure, why not?  Unix domain sockets are pretty cheap.

    jkx> I usually work w/ about 50 users right now ..  ( and i wrote this
    jkx> code to do on ~ 1000 accounts .. ).

    jkx> Another thing, i don't care about 'general database' .. this isn't
    jkx> the goal i want a system managable for a large number of user..

I don't think a shared database would work except for a very close group of
users (very similar ideas of what constitutes ham and spam).  How do your
users train their databases?  I presume you are doing all this on your mail
server.  Are your users local or remote?

    >> Once you have that working, you can rewrite sb_bnfilter.py in C to
    >> reduce memory consumption and maybe improve performance a bit.
    >> sb_bnserver.py could probably be sped up just by running it with
    >> psyco.

    jkx> pscyco have nothing about that. the trouble is 'exec a python' at
    jkx> each email

I don't see 'exec a python' as a huge problem.  Presumably on a busy server
the python interpreter and all the compiled bytecode will just be sitting in
memory buffers awaiting activation.  Lots of systems do the equivalent of
'exec a python' or more on a per message basis.  Have you tried it?  Was it
too slow?

    jkx> so even it the server falls for a strange raison mails aren't lost
    jkx> .. (spamc do that perfectly )

I'd rather trust my mail's delivery to procmail.  If sb_bn*.py craps out,
procmail is there to recover the message for me.  So far that combination
has been very robust.  It processes between 2,000 and 3,000 messages daily
(about 70% spam) for me on my laptop without a hiccup.  I generally don't
even notice that it's running.

I just ran a quick test of sb_bnfilter.py on my laptop.  In a directory
containing 501 spams (between 24 and 3080 lines each, average 142 lines) I
executed:

    for f in `find . -type f` ; do
        time sb_bnfilter.py < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times.txt

The minimum real time was 0.180 seconds.  The maximum was 1.057 seconds.
The mean time was 0.260 seconds.  

I then tried it with a byte-compiled version of sb_bnfilter.py:

    for f in `find . -type f` ; do
        time python ~/local/bin/sb_bnfilter.pyc < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times2.txt

The times improved slightly: min 0.172, max 0.957, mean 0.241.

I then tried a third test, adding -A 1000 to the sb_bnfilter.py command line
in the second test to keep a single sb_bnserver.py running for the entire
test.  Results: min 0.169, max 0.841, mean 0.236.  I'd try the psyco test
but my laptop is a Mac.
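
For anyone who wants to repeat the measurement, the min/max/mean figures
can be pulled out of a times file with a few lines of Python.  This is only
a sketch: it assumes one time per line, as the sed command above leaves it
(bash's "real 0m0.260s" becomes "00.260"), and it assumes every run took
under a minute, since the minutes digit is not handled separately.

```python
def summarize(lines):
    """Return (min, max, mean) of the times found in an iterable of lines."""
    times = [float(line) for line in lines if line.strip()]
    if not times:
        raise ValueError("no timing lines found")
    return min(times), max(times), sum(times) / len(times)


# Usage (the path is the one used in the shell loop above):
#     import os
#     with open(os.path.expanduser("~/tmp/times.txt")) as f:
#         print("min %.3f, max %.3f, mean %.3f" % summarize(f))
```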

Presumably performance would also improve on a more serious mail server.
What's your target processing time per message?

Skip

From jkx at pythonfr.org  Sun Apr 25 08:50:21 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Sun Apr 25 08:49:16 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <16523.16224.232593.68870@montanaro.dyndns.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404250522.11544.jkx@pythonfr.org>
	<16523.16224.232593.68870@montanaro.dyndns.org>
Message-ID: <200404251450.21193.jkx@pythonfr.org>

On Sunday 25 April 2004 06:32, Skip Montanaro wrote:
>     jkx> Where significant effort ?
>

>
> No, I admit I didn't read your code.  I read your mail message and must
> have not fully understood what you were after.  My apologies.
>
>     jkx> Do  you really want to open one UnixDomain socket per user ?????
>
> Sure, why not?  Unix domain sockets are pretty cheap.

Simply because this is not realistic ... this will eat a bunch of sockets for
nothing .. Have you ever heard that the OS has a max open file descriptor
limit ?


>  How do your
> users train their databases?  I presume you are doing all this on your mail
> server.  Are your users local or remote?

The training will be done through cron on the Maildir folders. The users are
remote and use folders via IMAP.


>     jkx> pscyco have nothing about that. the trouble is 'exec a python' at
>     jkx> each email
>
> I don't see 'exec a python' as a huge problem.  Presumably on a busy server
> the python interpreter and all the compiled bytecode will just be sitting
> in memory buffers awaiting activation.  Lots of systems do the equivalent
> of 'exec a python' or more on a per message basis.  Have you tried it?  Was
> it too slow?


I think you should look closer at how mail delivery works !
Have you ever thought that a bunch of mails can be delivered at the same time ?
So you don't have only one 'exec python', you will have one per user
for simultaneous incoming mail.. For example filtering done through maildrop
can run (by default) 100 simultaneous filters.. so do you really think that 100 *
exec python is the same as 100 * spamc ??? (because spamc eats ~500 KB
and python ~ 4.5 MB )
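
Spelled out as arithmetic (a sketch that just multiplies the per-process
estimates quoted above; they are estimates from this thread, not
measurements, and pages shared between interpreter processes would lower
the real resident total):

```python
# Per-process memory figures as quoted in this thread (estimates only).
SPAMC_MB = 0.5
PYTHON_MB = 4.5


def total_memory_mb(per_process_mb, concurrent_filters=100):
    """Worst-case memory if every concurrent filter is a separate process."""
    return per_process_mb * concurrent_filters


print(total_memory_mb(SPAMC_MB))   # 50.0
print(total_memory_mb(PYTHON_MB))  # 450.0
```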


>     jkx> so even it the server falls for a strange raison mails aren't lost
>     jkx> .. (spamc do that perfectly )
>
>
> I just ran a quick test of sb_bnfilter.py on my laptop.  In a directory
> containing 501 spams (between 24 and 3080 lines each, average 142 lines) I
> executed:
[snip] .. 

This test doesn't give any useful information, since it uses
1) only one user
2) only one access .. so only 1 spawn per mail etc etc ..

Please test the same thing w/ ~10 users .. and measure the
number of mails passing through the system (MTA + procmail + filter)


> Presumably performance would also improve on a more serious mail server.
> What's your target processing time per message?

As little as possible, simply .. I just added a cache system to my code
(maintaining a hash of already open hammie dbs) .. and I reach something like
300 mails / min, while a test without any filtering gives me something like
600 mails / min... I think doing better would be hard but can be done
( using fork / thread or async on the server socket delivery)
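
The cache idea can be sketched like this.  It is not the code from the
server discussed here: the opener callable stands in for whatever opens a
user's hammie db, and the max_open bound with least-recently-used eviction
is an extra assumption, added so a server with ~1000 accounts does not keep
every database open at once.

```python
from collections import OrderedDict


class DatabaseCache:
    """Keep already-open per-user databases in a hash, keyed by user name."""

    def __init__(self, opener, max_open=100):
        self._opener = opener        # hypothetical: opener(user) -> open db
        self._max_open = max_open
        self._open = OrderedDict()   # user -> db, least recently used first

    def get(self, user):
        if user in self._open:
            # Cache hit: mark as most recently used and reuse the open db.
            self._open.move_to_end(user)
            return self._open[user]
        db = self._opener(user)
        self._open[user] = db
        if len(self._open) > self._max_open:
            # Evict the least recently used db, closing it if it can be closed.
            _, old = self._open.popitem(last=False)
            close = getattr(old, "close", None)
            if close is not None:
                close()
        return db
```

As a sanity check on the numbers above: 300 mails / min is 200 ms per mail
and 600 mails / min is 100 ms per mail, so the filtering step accounts for
roughly 100 ms per message in that test.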


Bye Bye .. 

From tdickenson at geminidataloggers.com  Sun Apr 25 13:21:32 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Sun Apr 25 13:21:38 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404251450.21193.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<16523.16224.232593.68870@montanaro.dyndns.org>
	<200404251450.21193.jkx@pythonfr.org>
Message-ID: <200404251821.33037.tdickenson@geminidataloggers.com>

On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote:

> >     jkx> Do  you really want to open one UnixDomain socket per user ?????
> >
> > Sure, why not?  Unix domain sockets are pretty cheap.
>
> Simply because this is not realist ... this will eat a bunch of socket for
> nothing .. Have you ever heard that OS has max open file descriptor
> limit ? ?

There is an engineering compromise here between having one big process
and one big socket covering all users, and having per-user processes
and sockets. You are right that it doesn't make sense to have one big global
process listening on 100's of sockets.

Disadvantages of having one big process include:
1. security. This big process has to be privileged enough to read a .hammiedb
from every user's home directory. In practice I guess you run it as root. The
spambayes development team doesn't have the culture to justify that kind of
trust IMO. (also IMO, nor should it)
2. functionality. spambayes assumes a per-user operational model. For example,
I think sb_global_server currently doesn't handle per-user ~/.spambayesrc.

> I think you should look closer at how mail delivery works !
> Have you ever think that you can deliver a bunch of mails at the same time
> ? So you don't have only a one 'exec python' but you will have one per user
> for simultanous incomming mail.. For example filtering done thought
> maildrop can get (by default) 100 simultanus filter.. so do you really
> think that 100 * exec python is the same as 100 * spamc ??? (cause spamc
> eat ~500 Kb and python ~ 4.5 Mb )

Yes, using python for sb_bnfilter is a short-term measure. It's a prototype.
A C version is in progress.

> my code try to face this problems: 
> - spawning a python at each incomming mail (spamc)
> - having one deamon (or more) per user . 

I agree the first of those is a problem, and needs to be fixed in sb_bn*. 
(reusing the lightweight s.a. code here is a good trick btw.)

I'm unconvinced so far that the overhead of having one daemon per user is a
bigger problem than having spambayes run in a shared daemon with higher
privileges than a normal user.

-- 
Toby Dickenson


From jkx at pythonfr.org  Sun Apr 25 15:23:48 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Sun Apr 25 15:22:43 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404251821.33037.tdickenson@geminidataloggers.com>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251450.21193.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
Message-ID: <200404252123.48303.jkx@pythonfr.org>

On Sunday 25 April 2004 19:21, Toby Dickenson wrote:
> On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote:


> There is an engineering compromise here split around having one big process
> and one big socket covering all users, compared to having per-user
> processes and sockets. You are right that it doesnt make sense to have one
> big global process listening on 100's of sockets.
>
> Disadvantages of having one big process include:
> 1. security. This big process has to be priveliged enough to read a
> .hammiedb from every users home directory. In practice I guess you run it
> as root. The spambayes development team doesnt have the culture to justify
> that kind of trust IMO. (also IMO, nor should it)

In fact, if you look deeper into the code, you will see that I use this
on virtual mail domains, with a hierarchy looking like:
/var/lib/vmail/domain/user
for example /var/lib/vmail/example.com/contact.
So I don't really have trouble w/ permissions, since all these virtual domains
are owned by a single user, 'vmail'.

But anyway in a normal setup you can:
- run the daemon as root (as done w/ spamassassin .. I don't think
  this is really risky, because by default the socket is bound to
  localhost .. etc etc ..)
- And you can easily imagine running this as another user, and tweaking
  self.dbname according to your needs..  (for example, put all
  the dbs in a single folder .. which only one account can access.
  This is a common way to do things on large systems)


> 2. functionality. spambayes assumes a per-user operational model. For
> example, I think sb_global_server currently doesnt handle per-user
> ~/.spambayesrc.

Yeah that's true .. but again I think this is only a matter of implementation.
As the filtering is done at each request .. I can imagine parsing this
configuration file too.

It's my first hack w/ spambayes. I just discovered the code yesterday ..
so I think I'm really missing some points. And that's why I'm asking for
support here.


> > I think you should look closer at how mail delivery works !
> > Have you ever think that you can deliver a bunch of mails at the same
> > time ? So you don't have only a one 'exec python' but you will have one
> > per user for simultanous incomming mail.. For example filtering done
> > thought maildrop can get (by default) 100 simultanus filter.. so do you
> > really think that 100 * exec python is the same as 100 * spamc ??? (cause
> > spamc eat ~500 Kb and python ~ 4.5 Mb )
>
> Yes, using python for sb_bnfilter is a short-term measure. Its a prototype.
> C version is in progress

Please look at the spamc code, because it tries to cover a large number of
issues (by allowing the change of username for example .. which is
really useful in my approach, or round-robin filtering too .. )

> > my code try to face this problems:
> > - spawning a python at each incomming mail (spamc)
> > - having one deamon (or more) per user .
>
> I agree the first of those is a problem, and needs to be fixed in sb_bn*.
> (reusing the lightweight s.a. code here is a good trick btw.)

Fine :) 

> I'm unconvinced so far that the overhead of having one deamon per user is a
> bigger problem than having spambayes run in a shared deamon with higher
> priveliges than a normal user.

I quite agree w/ you, but keep my example in mind; my approach
isn't so different. I run only one process for all users, and I think I
could do that w/ sb_bn*, but sb_bn* doesn't support the 'user' setting.

I really think that allowing users (on a large system) to spawn processes
is a bad idea. Not for a workstation of course, but for a filtering server with
1000 mail accounts. My approach is exactly the same as the one used by things
like virus scanners .. Do you think a hosting company will allow people to
install their own virus scanner on their account ? They don't .. why :
just because this would be too painful to administer and would spawn a bunch of
processes for nothing. My approach is the same.


I really like spambayes, but I can't use it on a large system, not because
it isn't stable enough or doesn't fit the goal, but simply because
it doesn't provide a nice way to administer a large number of accounts.


Many thanks Toby .. it's cool to hear something other than
'look at sb_bn*'.


Bye Bye .. 


From ta-meyer at ihug.co.nz  Sun Apr 25 19:50:31 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sun Apr 25 19:50:38 2004
Subject: [spambayes-dev] Testing Tools Changes
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CD1@its-xchg4.massey.ac.nz>

I recently created new testing corpora for myself and ran various tests.  As
part of this, I made various changes to the testing scripts to make things
easier.  I'd like to know if anyone thinks any of these are worth checking
in:

export.py (in the Outlook2000 directory): I added a command-line option to
skip printing the total number of messages that would be exported.  I didn't
really care what this number was, and generating it took a long time.
PRO: This number doesn't seem all that useful.
CON: This complicates a fairly simple script with another option.

export.py: I added a command-line option to only export messages that were
received via a certain account.  I wanted an automatic method of separating
out messages from a couple of accounts, and this seemed the easiest way.  It
compares the "Delivered" or "Envelope" header to the given regex and only
exports if it matches.  In addition, if the account is "Exchange", then it
only exports if it appears to be an Exchange message (missing those headers;
has the "X-Exchange-Junk" stuff).
PRO: This is a handy way to only get certain messages out of Outlook.
CON: This complicates the script a fair bit, and I haven't done any checking
to see how robust the Delivered/Envelope headers are (all I know is that all
my non-Exchange messages have one or the other of these).
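
To make the account check concrete, here is a minimal sketch of the idea.
It is not the code from export.py: the function name is made up, and the
exact header names ("Delivered-To", "Envelope-To", "X-Exchange-Message")
are taken from the follow-up later in this thread rather than from the
script itself.

```python
import re
from email import message_from_string

# Headers assumed to carry the receiving account's identifier.
ACCOUNT_HEADERS = ("Delivered-To", "Envelope-To")


def belongs_to_account(raw_message, account_pattern):
    """Return True if the message appears to have arrived via the account."""
    msg = message_from_string(raw_message)
    values = [v for name in ACCOUNT_HEADERS for v in msg.get_all(name, [])]
    if account_pattern == "Exchange":
        # Exchange-only mail: none of the account headers, but the faked-up
        # Exchange marker header is present.
        return not values and msg["X-Exchange-Message"] is not None
    return any(re.search(account_pattern, v) for v in values)
```

How robust this is depends entirely on the mail servers involved adding one
of those headers, which is the open question noted in the CON above.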

msgstore.py (in the Outlook2000 directory): When creating the 'faked up'
Exchange headers, I added a "X-Exchange-Delivery-Time" header, which holds
the data from that Outlook property.  Without this, a lot of the exported
messages couldn't be sorted by the incremental testing stuff, so ended up at
the end, which isn't really accurate.
sort+group.py: If it can't find any received headers, check for a
"X-Exchange-Delivery-Time" header, and use that instead.
PRO: This is a very simple change, and doesn't have any effect on
classification, and improves the accuracy of incremental testing.
CON: This gets added every time that we add fake headers for an Exchange
message, and there is presumably a (very small, I think) cost involved with
that - this includes day-to-day use of the plug-in, when this has no effect
at all.

mksets.py: added -H and -S command-line options to specify an alternative
pair of directories to create the sets in, rather than being fixed to
"Data/Ham" and "Data/Spam".
PRO: This is more like the other scripts.
CON: ?

incremental.py: at the moment, it uses *all* mail in Data/ - I changed it to
use the TestDriver hamdir/spamdir options only (so that you can have
multiple corpora in the Data/ directory, but test only some of it).
PRO: Makes the incremental testing more like the timcv stuff which more
people are familiar with.  Also easier to use, IMO.
CON: Changes the way the script works, so could break existing testing
setups.

fpfn.py: added a command-line flag to also print out unsures (IIRC this
script predates unsures) as well as fp and fn.
PRO: Especially when one reaches the Peters barrier and has very few fp or
fn, looking at the unsures is interesting.
CON: Complicates a very simple script (there are no command-line options at
the moment) and doesn't fit the name (but having a 'fpfnunsure.py' script that
does this seems pointless).

I also changed fpfn.py to print out each message and offer to move it to the
corresponding ham/spam set (I used it to check for misclassified messages),
but it doesn't seem like this is a good addition to the script.

I also wrote a few scripts to process the incremental.py output, using both
mkgraph.py and Excel (via COM), so that I ended up with reasonably useful
spreadsheets.  If anyone is interested in these, let me know and I'll put
them somewhere (I don't think there's any point checking them in, though).

=Tony Meyer


From mhammond at skippinet.com.au  Sun Apr 25 22:31:50 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Apr 25 22:32:12 2004
Subject: [spambayes-dev] Testing Tools Changes
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CD1@its-xchg4.massey.ac.nz>
Message-ID: <00e501c42b36$a1baaa90$0200a8c0@eden>

> export.py (in the Outlook2000 directory): I added a
> command-line option to
> skip printing the total number of messages that would be
> exported.  I didn't
> really care what this number was, and generating it took a long time.
> PRO: This number doesn't seem all that useful.
> CON: This complicates a fairly simple script with another option.
>
> export.py: I added a command-line option to only export
> messages that were
> received via a certain account.  I wanted an automatic method
> of separating
> out messages from a couple of accounts, and this seemed the
> easiest way.  It
> compares the "Delivered" or "Envelope" header to the given
> regex and only
> exports if it matches.  In addition, if the account is
> "Exchange", then it
> only exports if it appears to be an Exchange message (missing
> those headers;
> has the "X-Exchange-Junk" stuff.
> PRO: This is a handy way to only get certain messages out of Outlook.
> CON: This complicates the script a fair bit, and I haven't
> done any checking
> to see how robust the Delivered/Envelope headers are (all I
> know is that all
> my non-Exchange messages have one or the other of these).

The last change sounds a little nasty, but in general these are tools for us
to use to try and perform decent testing for Outlook users.  AFAIK, this has
never happened :)  Thus, anything that may move us in that direction is
encouraged!  (You have not referenced a msgstore.py change above though,
have you?)

> msgstore.py (in the Outlook2000 directory): When creating the
> 'faked up'
> Exchange headers, I added a "X-Exchange-Delivery-Time"
> header, which the

Can you explain that one a little more?  Would it be possible/better to
generate the correct Date header?  (I assume you are saying these messages
don't have this header)

> mksets.py: added -H and -S command-line options to specify an
> alternative
> pair of directories to create the sets in, rather than being fixed to
> "Data/Ham" and "Data/Spam".
> PRO: This is more like the other scripts.
> CON: ?

Sounds OK to me.

No real opinion on the others.

Mark.


From ta-meyer at ihug.co.nz  Sun Apr 25 22:52:02 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sun Apr 25 22:52:10 2004
Subject: [spambayes-dev] Testing Tools Changes
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E437C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BC3@its-xchg4.massey.ac.nz>

> > export.py: I added a command-line option to only export
> > messages that were
> > received via a certain account.
[...]
> The last change sounds a little nasty, but in general these 
> are tools for us to use to try and perform decent testing for
> Outlook users.  AFAIK, this has never happened :)  Thus,
> anything that may move us in that direction is
> encouraged!

:)

> (You have not referenced a msgstore.py change above though,
> have you?)

No.  For my mail, either it had a "Delivered-To:" header, an "Envelope-To:"
header (with the value being a unique identifier (like
'ta-meyer@pop.ihug.co.nz') for the account), or it was a message that never
left the local Exchange server.  I don't know how true this holds across
other mail servers, though (this is with three different mail servers).  So
I could simply check for those headers, or for the "X-Exchange-Message"
header.

I agree that this sounds a little nasty (and you should see the code
<wink>).  I lean towards not checking it in (anyone interested can hopefully
find this thread in the archives anyway), since I'm not all that sure how
common (given the hypothetical decent testing) needing this (separation by
mail account) would be.

> > msgstore.py (in the Outlook2000 directory): When creating the
> > 'faked up'
> > Exchange headers, I added a "X-Exchange-Delivery-Time"
> > header, which the
> 
> Can you explain that one a little more?  Would it be 
> possible/better to generate the correct Date header?
> (I assume you are saying these messages don't have this header)

Yes, I am saying that (or should have ;).  For example, I get headers like
this (these are all the headers for this particular message - it had no
subject):

"""
X-Exchange-Message: true
To: Meyer, Tony
X-Exchange-Delivery-Time: Fri, 23 Apr 2004 15:56:50 +1200
"""

(This obviously includes the one I added).  It probably would be better to
generate the Date header instead (or maybe a Received header?) - I was too
lazy to look up the spec for what one of those would use, so added my own.
A proper Received or Date header would allow any tokenizing options that
work with those headers to use the data, which would be a more beneficial
(assuming those options help!) change.  I could work up a patch that does
this instead, perhaps.

In terms of code, in _GetFakeHeaders, I also retrieved the
PR_MESSAGE_DELIVERY_TIME property, added the appropriate 'delivery_time =
self._GetPotentiallyLarge...' bit and then did:

"""
            from time import timezone
            from email.Utils import formatdate
            headers.append("X-Exchange-Delivery-Time: "+\
                           formatdate(int(delivery_time)-timezone, True))
"""

I formatted the date so that it matched the one that a Received header has,
because this made the change to sort+group.py simpler than leaving it as
Outlook delivered it.
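
As a rough sketch of the alternative mentioned above (generating a standard
Date header instead of a custom one), something like the following could
work.  It uses email.utils.formatdate, the current spelling of email.Utils.
The helper name fake_date_header and the assumption that the delivery time
is already a UTC epoch timestamp are illustrative only; neither comes from
msgstore.py.

```python
from email.utils import formatdate


def fake_date_header(delivery_time, localtime=True):
    """Build a standard Date header line from an epoch timestamp.

    Assumption: delivery_time is seconds since the epoch in UTC.  The
    real MAPI delivery-time property may need converting first.
    """
    # localtime=True renders the time in the local zone with a numeric
    # offset; localtime=False renders it in UTC with a -0000 offset.
    return "Date: " + formatdate(int(delivery_time), localtime=localtime)


if __name__ == "__main__":
    print(fake_date_header(1082692610))
```

A header built this way would be visible to any tokenizer option that
already looks at Date, which is the benefit described above.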

=Tony Meyer


From tdickenson at geminidataloggers.com  Mon Apr 26 03:27:32 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Mon Apr 26 03:27:38 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404252123.48303.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
	<200404252123.48303.jkx@pythonfr.org>
Message-ID: <200404260827.32385.tdickenson@geminidataloggers.com>

On Sunday 25 April 2004 20:23, Jkx@Pythonfr wrote:
> On Sunday 25 April 2004 19:21, Toby Dickenson wrote:
> > On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote:

> - And you can easily imagine running this as another user, and tweaking
>   self.dbname according to your needs (for example, putting all the
>   databases in a single folder that only one account can access;
>   this is a common way to do things on large systems).

That makes sense.

> > Yes, using Python for sb_bnfilter is a short-term measure. It's a
> > prototype. A C version is in progress.
>
> Please look at spamc code

Thanks for the tip. 

(I'm sure it won't be usable as-is; the auto-forking in sb_bnfilter is useful on
small systems where you don't want to run any daemon most of the time.)


> I really think that allowing users (in a large system) to spawn processes
> is a bad idea. Not for a workstation, of course, but for a filtering server
> with 1000 mail accounts. My approach is exactly the same as that used by things
> like virus scanners. Do you think a hosting company will allow people to
> install their own virus scanner on their account? They don't. Why?
> Because it would be too painful to administer and would spawn a bunch of
> processes for nothing. My approach is the same.

So then you have 1000 different spambayes databases? My .hammiedb is 20M, so
your big mail server needs 20G of storage for spam databases. This is sure to
affect delivery performance since there is no way to cache all of that.

> and I think I
> can do that with sb_bn*, but sb_bn* doesn't support the 'user' setting.

You can specify the database filename and socket names on the sb_bnfilter command
line. It doesn't support 'users' directly, but it provides all you need to
layer a 'users' system on top.
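
A minimal sketch of such a layer, assuming one directory per mail account.
The helper per_user_paths, the base directory and the file names are all
hypothetical; the resulting paths would then be handed to sb_bnfilter using
whatever command-line options it provides for the database and socket names.

```python
import os
import re


def per_user_paths(user, base_dir="/var/spool/spambayes"):
    """Return (database_path, socket_path) for one mail account.

    Hypothetical layout: <base_dir>/<user>/hammie.db plus a socket file.
    """
    # Refuse names that could escape base_dir (slashes, "..", dotfiles).
    if not re.fullmatch(r"[A-Za-z0-9._-]+", user) or user.startswith("."):
        raise ValueError("unsafe user name: %r" % user)
    user_dir = os.path.join(base_dir, user)
    return (os.path.join(user_dir, "hammie.db"),
            os.path.join(user_dir, "sb_bnfilter.sock"))


if __name__ == "__main__":
    print(per_user_paths("alice"))
```

Keeping the mapping in one small function means the delivery agent only has
to know the recipient's account name.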

-- 
Toby Dickenson


From jkx at pythonfr.org  Mon Apr 26 03:51:18 2004
From: jkx at pythonfr.org (Jkx@Pythonfr)
Date: Mon Apr 26 03:53:58 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404260827.32385.tdickenson@geminidataloggers.com>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
	<200404252123.48303.jkx@pythonfr.org>
	<200404260827.32385.tdickenson@geminidataloggers.com>
Message-ID: <20040426075118.GB2996@tp1.enstb.org>

On Mon, Apr 26, 2004 at 08:27:32AM +0100, Toby Dickenson wrote:

> > I really think that allowing users (in a large system) to spawn processes
> > is a bad idea. Not for a workstation, of course, but for a filtering server
> > with 1000 mail accounts. My approach is exactly the same as that used by things
> > like virus scanners. Do you think a hosting company will allow people to
> > install their own virus scanner on their account? They don't. Why?
> > Because it would be too painful to administer and would spawn a bunch of
> > processes for nothing. My approach is the same.
> 
> So then you have 1000 different spambayes databases? My .hammiedb is 20M, so 
> your big mail server needs 20G of storage for spam databases. This is sure to 
> affect delivery performance since there is no way to cache all of that.


Yes, there is no way to cache all of that, but I think the system will be
trained on a small amount of spam, so I hope the databases won't be too big.
My current SB db is around 1.2 MB. But you have pointed out something I
missed.


Is there any way to 'resize' or 'consolidate' the db?


> > and I think I
> > can do that with sb_bn*, but sb_bn* doesn't support the 'user' setting.
> 
> You can specify the database filename and socket names on the sb_bnfilter command
> line. It doesn't support 'users' directly, but it provides all you need to
> layer a 'users' system on top.

That's true, but plugging this into a Postfix delivery would be so simple...


-- 
Jérôme Kerdreux / Labo MI ENST Brest

From skip at pobox.com  Mon Apr 26 08:13:58 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Apr 26 08:14:17 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404260827.32385.tdickenson@geminidataloggers.com>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
	<200404252123.48303.jkx@pythonfr.org>
	<200404260827.32385.tdickenson@geminidataloggers.com>
Message-ID: <16524.64774.732715.536641@montanaro.dyndns.org>


    Toby> So then you have 1000 different spambayes databases? My .hammiedb
    Toby> is 20M, so your big mail server needs 20G of storage for spam
    Toby> databases. This is sure to affect delivery performance since there
    Toby> is no way to cache all of that.

The number varies.  My database is about 2.5MB and does just fine.  (See my
recent mail about using train-to-exhaustion and training backwards in the
file.)  That gets you down around 2.5GB, which is largely cacheable.
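
For comparison, the back-of-the-envelope arithmetic for the per-user
database sizes mentioned in this thread (1.2 MB, 2.5 MB and 20 MB) on a
1000-account server can be sketched as below.  The function name is made up
for illustration, and 1 GB is taken as 1000 MB to match the round figures
used here.

```python
def total_storage_gb(accounts, db_size_mb):
    # One database per account; 1 GB is treated as 1000 MB.
    return accounts * db_size_mb / 1000.0


if __name__ == "__main__":
    for size_mb in (1.2, 2.5, 20):
        print("%5.1f MB per user -> %5.1f GB for 1000 users"
              % (size_mb, total_storage_gb(1000, size_mb)))
```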

Skip

From papaDoc at videotron.ca  Mon Apr 26 21:31:39 2004
From: papaDoc at videotron.ca (Remi Ricard)
Date: Mon Apr 26 21:31:40 2004
Subject: [spambayes-dev] Openning a db
Message-ID: <1083029498.3822.15.camel@porsche>

Hi,

I'm having a problem:

When I create a db from scratch using the following command

sb_mboxtrain -d ./hammie.db -g ham.mbox -s spam.mbox

The db is created with the dbm_type="best"
in the dbmstorage.py. This will call the function 
dbmstorage.py: open_db3hash

but when I try to train again with the same command line
(I know this does nothing to the database but continue reading)
sb_mboxtrain -d ./hammie.db -g ham.mbox -s spam.mbox

Then I get the error message:
-------------------
  File "/gmc/logiciels/spambayes/spambayes/spambayes/hammie.py", line 266, in open
    return Hammie(storage.open_storage(filename, useDB, mode))
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 680, in open_storage
    return klass(data_source_name, mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 164, in __init__
    self.load()
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 189, in load
    self.dbm = dbmstorage.open(self.db_name, self.mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/dbmstorage.py", line 75, in open
    return f(db_name, mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/dbmstorage.py", line 22, in open_dbhash
    return bsddb.hashopen(*args)
  File "/usr/local/lib/python2.3/bsddb/__init__.py", line 193, in hashopen
    d.open(file, db.DB_HASH, flags, mode)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- ./hammid.db: unsupported hash version: 8')
------------------------

because the db is now opened as a dbhash,
by calling the function
dbmstorage.py: open_dbhash


As a workaround I'm forcing dbm_type to the value I want, but
I don't think this is a proper fix ;-). So this is all yours to solve....
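
One defensive pattern for this kind of mismatch is to compare what the
detector reports with the type the database was created with, and to fail
with a clear message when they disagree.  The sketch below uses Python 3's
dbm.whichdb (the successor of the whichdb module); check_db_type is a
hypothetical helper and not part of SpamBayes.

```python
import dbm


def check_db_type(path, expected):
    """Compare the detected dbm type of path with the expected one.

    Returns the detected type name on a match, or None when the file
    does not exist yet (so the caller is free to create it).
    """
    detected = dbm.whichdb(path)
    if detected is None:
        return None  # nothing there (or unreadable): nothing to compare
    if detected != expected:
        # An empty string means the format was not recognised at all.
        raise ValueError("%s: detected %r, expected %r"
                         % (path, detected, expected))
    return detected
```

A caller would run this before opening an existing database, so a wrong
guess shows up as one readable error instead of a failure deep inside the
database library.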

I'm running Red Hat 9 with python2.3 compiled from source.
The whichdb module in use really is the one from python2.3
(I confirmed this with
 echo test | strace -tt -f -o trace.txt python sb_mboxtrain.py -d hammid.db -g ham.mbox -s spam.mbox)

Remi

-- 
Remi Ricard <papaDoc@videotron.ca>


From papaDoc at videotron.ca  Tue Apr 27 08:41:03 2004
From: papaDoc at videotron.ca (papaDoc)
Date: Tue Apr 27 08:41:08 2004
Subject: [spambayes-dev] Re: [Spambayes] Spam bayes in French ?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CEC@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677CEC@its-xchg4.massey.ac.nz>
Message-ID: <408E54DF.7060409@videotron.ca>

Hi,

>Someone asked about this a wee while ago (on spambayes-dev, maybe?), but I
>don't know if anything was done or not.  I'm happy to make time to make any
>code changes necessary to make the task easier, but I can't really offer
>much in the way of translation itself (I suppose I could do a Maori
>translation at a push, but I imagine the potential users of such a
>translation number approximately 0).  Based on list traffic, a French
>version should be easily do-able.
>
>I'm guessing that someone here must have some (even non-Python) experience
>at doing this, yes?  Any volunteers to coordinate the effort?
>  
>
I volunteer to do the French translation, but I know almost no Python, so
the code would need to be set up for me if possible.

Remi

-- 
/"\
\ /
 X   ASCII Ribbon Campaign
/ \  Against HTML Email


From tim.one at comcast.net  Tue Apr 27 10:54:33 2004
From: tim.one at comcast.net (Tim Peters)
Date: Tue Apr 27 10:54:47 2004
Subject: [spambayes-dev] Local boy makes good
Message-ID: <LNBBLJKPBEHFEDALKOLCAELCKIAB.tim.one@comcast.net>

Congratulations!  Inboxer (Sean True's SpamBayes derivative) is the
"Microsoft & WUGNET Shareware Pick of the Week":

    http://www.wugnet.com/shareware/spow.asp?ID=551

This is a big deal, since it gets announced in approximately 38 billion
copies of MS's "Windows Platform News" newsletter (that's how I found out
about it).


From seant at iname.com  Tue Apr 27 11:41:39 2004
From: seant at iname.com (Sean True)
Date: Tue Apr 27 11:43:33 2004
Subject: [spambayes-dev] RE: [Spambayes] Local boy makes good
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAELCKIAB.tim.one@comcast.net>
Message-ID: <E1BIUk7-0005Eb-3u@mail.python.org>

> 
> Congratulations!  Inboxer (Sean True's SpamBayes derivative) is the
> "Microsoft & WUGNET Shareware Pick of the Week":
> 
>     http://www.wugnet.com/shareware/spow.asp?ID=551
> 
> This is a big deal, since it gets announced in approximately 
> 38 billion
> copies of MS's "Windows Platform News" newsletter (that's how 
> I found out
> about it).
> 

Impossible (obviously) without the spambayes community.

Thanks to _all_ of you.

-- Sean


From valk.beekman at xs4all.nl  Tue Apr 27 18:14:53 2004
From: valk.beekman at xs4all.nl (Valk Beekman)
Date: Tue Apr 27 18:15:03 2004
Subject: [spambayes-dev] wish from new user
Message-ID: <008201c42ca5$119f6020$6501a8c0@nl>

As a new user I would like to be able to set the word SpamBayes uses to mark
spam myself (or it should default to something like "*spam*"). The way it
works now, all correspondence with "spam" in the subject line would be
discarded by OE. Sometimes people I know send me questions about spam.

Regards & good luck with your project,

Valk Beekman
From tameyer at ihug.co.nz  Tue Apr 27 18:19:50 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Apr 27 18:20:03 2004
Subject: [spambayes-dev] wish from new user
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E4A1D@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF6@its-xchg4.massey.ac.nz>

> As a new user I would like to be able to set the
> word Spambayes uses to mark spam myself (or it
> should default to something like "*spam*" . The
> way it works now I would have all correspondence
> with  "spam"  in the subjectline discarded by OE.
> Sometimes people I know send me questions about spam.

Two points:

 1. The tag is actually "spam,", so you are safe as long as people don't put
a comma right after the word 'spam'.

 2. You can change this, you just have to manually edit your configuration
file.  The FAQ has lots of details about changing the options - you're after
the Headers section, and the header_spam_string option.
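
As an illustration of point 2, the relevant fragment of the configuration
file would look roughly like the SAMPLE text below.  Python's configparser
is used here only to show the section/option layout; SpamBayes has its own
option loader, the "option: value" syntax is assumed to be INI-style, and
the "*spam*" value is simply the example from the question.

```python
import configparser

# Hypothetical fragment of a SpamBayes configuration file.
SAMPLE = """\
[Headers]
header_spam_string: *spam*
"""

# Interpolation is disabled so the value is read back literally.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)

if __name__ == "__main__":
    print(parser["Headers"]["header_spam_string"])
```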

=Tony Meyer

---
Please always include the list (spambayes@python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


From sourceforge at metrak.com  Tue Apr 27 19:27:39 2004
From: sourceforge at metrak.com (paul sorenson)
Date: Tue Apr 27 19:27:46 2004
Subject: [spambayes-dev] no messages to review
In-Reply-To: <E1BIV1w-0005iX-Qr@mail.python.org>
References: <E1BIV1w-0005iX-Qr@mail.python.org>
Message-ID: <408EEC6B.20007@metrak.com>

I am running the SpamBayes proxy with Mozilla Thunderbird on Win XP.  I just 
installed 1.0b1 in the last couple of days.

When I attempt to review messages I see the message: "There are no 
untrained messages to display".  This is despite receiving dozens of 
emails each day.

This has been happening for some time (before this install).  Then every 
now and then it seems to recognize a whole lot of messages.  Clicking 
"previous day" followed by "next day" doesn't seem to get me back to 
where I started.

For sanity's sake I just checked that Thunderbird was pointing to the proxy, 
not my mail server.

cheers

From mhammond at skippinet.com.au  Tue Apr 27 19:40:43 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 27 19:42:12 2004
Subject: [spambayes-dev] Release 1.0?
Message-ID: <126401c42cb1$0ed80c50$0200a8c0@eden>

The last release seems to have gone OK, with only a couple of packaging
issues.  What say Tony and I just turn the crank, we call it 1.0, have a
beer and little party, and move on?

Mark.


From tim.one at comcast.net  Tue Apr 27 21:18:16 2004
From: tim.one at comcast.net (Tim Peters)
Date: Tue Apr 27 21:18:25 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <126401c42cb1$0ed80c50$0200a8c0@eden>
Message-ID: <E1BIdiQ-000125-Vc@mail.python.org>

[Mark Hammond]
> The last release seems to have gone OK, with only a couple of packaging
> issues.  What say Tony and I just turn the crank, we call it 1.0, have a
> beer and little party, and move on?

I wish we had a better database story -- but apparently not enough to give
up enough sleep to get us one.

Other than that, the only killer flaw I notice ten times a day (in the
Outlook addin) is that in the "Filter messages ..." dialog, "Start
Filtering" should be the DEFPUSHBUTTON instead of "Close".  I've got
"Automatically move pointer to the default button in a dialog box" enabled
on my laptop (I hate using touchpads!), and so my mouse pointer always flies
to the wrong button when I open that dialog.  "Start Training" would be a
more useful default button on the Training tab too.

In short, if I'm reduced to whining about petty crap like that, we're
overdue for a 1.0 release <wink>.  Fantastic work, everyone!


From skip at pobox.com  Tue Apr 27 21:22:23 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue Apr 27 21:22:34 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <126401c42cb1$0ed80c50$0200a8c0@eden>
References: <126401c42cb1$0ed80c50$0200a8c0@eden>
Message-ID: <16527.1871.288554.714725@montanaro.dyndns.org>


    Mark> What say Tony and I just turn the crank, we call it 1.0, have a
    Mark> beer and little party, and move on?

I'm hoisting a stein already. ;-)

Skip

From tameyer at ihug.co.nz  Tue Apr 27 22:31:26 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Apr 27 22:31:41 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E4A51@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF9@its-xchg4.massey.ac.nz>

> The last release seems to have gone OK, with only a couple of 
> packaging issues.  What say Tony and I just turn the crank,
> we call it 1.0, have a beer and little party, and move on?

If you give me until the weekend, then I'm fine with this.  I'd like to
incorporate a few imapfilter bug fixes that have been worked out over the
last week or so, and have another run through the open bug list to see if
there's anything there that can/should be resolved.  I'm flat out at work
until then, though.  (And a beer and party suit the weekend more anyway ;)

For the 1.1a1 release, I'd really like to:
  
 * Finish up the 'auto configure' stuff for sb_server.  Basically create a
wizard like the Outlook one that can setup SpamBayes, your mail client, and
do some initial training.  (With a limited list of clients (OE, Eudora,
Mozilla, Opera at the moment) - for the rest, you're on your own).

 * Have an imapfilter binary in the binary dist.  It's getting used by a few
more people now, so it would seem a nice option.  Maybe that'll convince
someone else to take over maintaining it, too <wink>.

 * Finish off the pop3dnd stuff that I started some time back.  This is
mostly working, and I still like the concept (training by drag-and-drop in
arbitrary mail clients), and it'd be nice to offer as an experimental
option.

 * Whack a few deprecated/experimental options.  1.0 should be in use for
a while, so they get their chance :)

Plus looking at the database issues, as always, and training techniques
(particularly figuring out a way to offer tte)...

=Tony Meyer


From mhammond at skippinet.com.au  Tue Apr 27 23:54:36 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 27 23:54:55 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF9@its-xchg4.massey.ac.nz>
Message-ID: <13fb01c42cd4$867749b0$0200a8c0@eden>

> If you give me to the weekend, then I'm fine with this.  I'd like to
> incorporate a few imapfilter bug fixes that have been worked
> out over the
> last week or so, and have another run through the open bug
> list to see if
> there's anything there that can/should be resolved.  I'm flat
> out at work
> until then, though.  (And a beer and party suit the weekend
> more anyway ;)

That sounds good to me!  Especially the party and beer bit :)  If we
restrict this to low-risk new bugs, we can still go for 1.0 as the next
release.

Mark.


From mhammond at skippinet.com.au  Wed Apr 28 00:28:26 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Apr 28 00:28:50 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <20040428011829.E8EC3A6395@dampier.southern.net.au>
Message-ID: <144501c42cd9$4203b0c0$0200a8c0@eden>

> Other than that, the only killer flaw I notice ten times a day (in the
> Outlook addin) is that in the "Filter messages ..." dialog, "Start
> Filtering" should be the DEFPUSHBUTTON instead of "Close".  I've got
> "Automatically move pointer to the default button in a dialog
> box" enabled
> on my laptop (I hate using touchpads!), and so my mouse
> pointer always flies
> to the wrong button when I open that dialog.

Fixed!

> "Start
> Training" would be a
> more useful default button on the Training tab too.

That one is harder, as the existing default button (Close) is not on the
property-page, but the parent.  Setting "Start" to DEFPUSHBUTTON gets it
drawn like it is the default, but "Close" still does too and seems to win :)

> Fantastic work, everyone!

Absolutely!  Not one of us here could have done anything without the others.
Congratulations, and thank you!

Mark.


From combover at mn.rr.com  Wed Apr 28 03:03:57 2004
From: combover at mn.rr.com (combover)
Date: Wed Apr 28 03:03:40 2004
Subject: [spambayes-dev] Possible new header parsing option...
In-Reply-To: <144501c42cd9$4203b0c0$0200a8c0@eden>
References: <144501c42cd9$4203b0c0$0200a8c0@eden>
Message-ID: <408F575D.5050103@mn.rr.com>


Was looking over SPF (http://spf.pobox.com) last weekend, and it looks 
very promising - already a handful of major domains have implemented it. 
Of course, the headers that will be associated with SPF's checks:
http://spf.pobox.com/newheader.html
will not be widely used until the major MTAs provide that option, but it 
seems to me that they could prove to be valuable tokens at the very 
least, and there might be a possibility of creating a SpamBayes plugin 
script to do the checking at the client level.

Then again, my understanding of how MTAs work and where exactly SPF 
checks would need to occur is not the best. Again, this isn't going to 
be the most useful until the majority of domains have published records, 
but would be beneficial once that point is reached.

My one concern with the specification itself, though, is: what's to stop 
spammers from forging these headers themselves? Is there a mechanism in 
the existing MTA plugins to discard any SPF headers already in place in 
a received mail? I know this is probably not the best place for those 
concerns, so maybe I'll subscribe to their dev list...

From rmalayter at bai.org  Wed Apr 28 08:28:06 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Wed Apr 28 08:28:13 2004
Subject: [spambayes-dev] Possible new header parsing option...
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C96AC@cliff.bai.org>

[combover]
>My one concern with the specification itself, though, is: 
>what's to stop spammers from forging these headers 
>themselves?
 
Nothing, as you've guessed correctly.

>Is there a mechanism in the existing MTA plugins to discard 
>any SPF headers already in place in a received mail? I know 
>this is probably not the best place for those concerns, so 
>maybe i'll subscribe to their dev list...

That would be the correct approach. If a receiving MTA checks for SPF
compliance, it should throw out all other SPF-related headers before adding
its own.
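
A minimal sketch of that discard-then-add step, using Python's email
package.  It assumes the header is named Received-SPF, as on the SPF site;
replace_spf_headers and the result strings are hypothetical, and a real MTA
plugin would do this inside its own filter hooks.

```python
from email import message_from_string


def replace_spf_headers(raw_message, own_result):
    """Drop any Received-SPF headers already present, then add our own."""
    msg = message_from_string(raw_message)
    # Deleting by name removes every occurrence and is a no-op if absent.
    del msg["Received-SPF"]
    # Only the verdict computed locally survives.
    msg["Received-SPF"] = own_result
    return msg
```

With this in place, the only Received-SPF token a classifier sees is the
one the receiving side computed, so a forged header carries no weight.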

Assuming the MTAs do this correctly, and SPF use becomes widespread (my
domain is one of only 7500 or so registered), these headers will be very
useful clues to spambayes. However, with Microsoft supporting Caller-ID
for Email, and Yahoo! supporting Domain Keys, SPF may not be the
ultimate winner as a sending-host verification standard.

I'm placing my bets on a unified standard emerging sometime in the next
few years. Spam costs Yahoo! and MS so much money they cannot afford to
bicker about this issue too long.


From kennypitt at hotmail.com  Wed Apr 28 09:21:30 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Apr 28 09:22:59 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <144501c42cd9$4203b0c0$0200a8c0@eden>
Message-ID: <BAY16-DAV3HT0NTDMo200031ba4@hotmail.com>

Mark Hammond wrote:
>> Other than that, the only killer flaw I notice ten times a day (in
>> the Outlook addin) is that in the "Filter messages ..." dialog,
>> "Start Filtering" should be the DEFPUSHBUTTON instead of "Close". 
>> I've got "Automatically move pointer to the default button in a
>> dialog box" enabled on my laptop (I hate using touchpads!), and so
>> my mouse pointer always flies to the wrong button when I open that
>> dialog. 
> 
> Fixed!

Did this get checked in?  I didn't see any notice on spambayes-checkins and
cvs update didn't apply any changes.

-- 
Kenny Pitt


From sourceforge at metrak.com  Wed Apr 28 18:54:28 2004
From: sourceforge at metrak.com (paul sorenson)
Date: Wed Apr 28 18:54:33 2004
Subject: [spambayes-dev] Re: no messages to review
In-Reply-To: <E1BIp35-0006t2-Mb@mail.python.org>
References: <E1BIp35-0006t2-Mb@mail.python.org>
Message-ID: <40903624.4050304@metrak.com>

Today the proxy decided there were messages to review.  I have buttons 
"previous day" "refresh" "next day" and ended up with 4 screenfuls of 
messages for training.

If spambayes reports no messages to train but I have been receiving 
messages, is there a simple way to check what criterion it is using?

From ta-meyer at ihug.co.nz  Wed Apr 28 21:53:15 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 28 21:53:26 2004
Subject: [spambayes-dev] Testing Tools Changes
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BD1@its-xchg4.massey.ac.nz>

[me]
> I also changed fpfn.py to print out each message and
> offer to move it to the corresponding ham/spam set (I used it
> to check for misclassified messages),
> but it doesn't seem like this is a good addition to the script.

Browsing, I notice that this has been offered before (which would have saved
me the bother):

[ 618932 ] fpfn.py: add interactivity on unix
<http://sourceforge.net/tracker/index.php?func=detail&aid=618932&group_id=61702&atid=498105>

I don't know if this makes it any more/less worthwhile including, though.

=Tony Meyer


From clare at optushome.com.au  Thu Apr 29 07:54:05 2004
From: clare at optushome.com.au (Clare Wagemans)
Date: Thu Apr 29 07:52:41 2004
Subject: [spambayes-dev] Spam Bayes falls over regularly
Message-ID: <LOBBKGMAGJCIGPLDMAKHMENPDHAA.clare@optushome.com.au>

Dear Sir

Every now and then, not only after updating, I get the message that Spam
Bayes is not working.  The box "Definite Spam" just disappears and I have to
create a new one, enable Spam Bayes and then retrain.  I think I would have
had it happen about 5 times in 6 months.

regards

Clare Wagemans


From tim.one at comcast.net  Thu Apr 29 15:21:35 2004
From: tim.one at comcast.net (Tim Peters)
Date: Thu Apr 29 15:21:35 2004
Subject: [spambayes-dev] RE: [Python-Dev] SSH problems getting into
	SourceForge's CVS?
In-Reply-To: <200404291420.i3TEKGn05101@guido.python.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEJAKJAB.tim.one@comcast.net>

If you're getting messages like this today when trying to cvs up:

   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
   @       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
   The RSA1 host key for cvs.spambayes.sourceforge.net has changed,
   and the key for the according IP address 66.35.250.209
   is unknown. This could either mean that
   DNS SPOOFING is happening or the IP address for the host
   and its host key have changed at the same time.
   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
   @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
   IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!

   [yadda yadda yadda]

it's apparently because SourceForge changed the way they like their CVS
servers to get addressed.  Here's a link to a python-dev article with a
Python script to crawl over your checkout tree and change the hidden CVS
cruft so that SF stops whining at you:

    http://mail.python.org/pipermail/python-dev/2004-April/044593.html

For example, in the root of my spambayes checkout, I ran

    cvs_chroot.py :ext:tim_one@cvs.sourceforge.net:/cvsroot/spambayes

and then SF stopped complaining.  The hostname part of my URLs used to be
"cvs.spambayes.sourceforge.net", and SF doesn't want the "spambayes." part
there anymore.  Since this string gets buried in CVS admin files in each of
your subtrees too, you really don't want to hunt them down and fiddle them
all by hand.
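
To show the idea (this is not the script from the python-dev link, just a
sketch of what such a crawler does): each checked-out directory has a
CVS/Root file holding the repository string, so the job is to walk the tree
and rewrite every one of them.  The function name rewrite_cvs_roots is made
up for illustration.

```python
import os


def rewrite_cvs_roots(top, new_root):
    """Rewrite every CVS/Root file under top; return how many were changed."""
    changed = 0
    for dirpath, dirnames, filenames in os.walk(top):
        # Only touch the admin file named Root inside a CVS directory.
        if os.path.basename(dirpath) == "CVS" and "Root" in filenames:
            with open(os.path.join(dirpath, "Root"), "w") as f:
                f.write(new_root + "\n")
            changed += 1
    return changed
```

Called from the top of a checkout with the new root string, it touches the
admin file in every subtree in one pass.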


From G.Hartmann at kamax.de  Fri Apr 30 04:30:22 2004
From: G.Hartmann at kamax.de (Hartmann, Gunther)
Date: Fri Apr 30 04:28:27 2004
Subject: [spambayes-dev] Access rights for spambayes outlook plugin
Message-ID: <54930B904AEB7D468F603F92D518796901C66CE0@kxw2kho9.kamax.de>

Dear All,

I have scanned the whole FAQ for my problem but couldn't find any hint, so I
am trying this way.

I would like to have my Inbox scanned by a colleague's SpamBayes while I am
out of the office. We are running Outlook 2000 against an Exchange server, and my
colleague opens my inbox during startup of his Outlook.

However, he can't select my inbox in the SpamBayes Manager's folder selection
box. It is displayed and one can select the checkbox, but the message at the
bottom doesn't reflect this additional selection and stays at '1 folder
selected'. Clicking on OK doesn't select this additional inbox either.

If I configure the spambayes .ini-File by adding a second inbox-identifier
it refuses to start spambayes and the log file reads 'access refused'.

So my question is: which access rights do I need to grant my colleague, and on
what? I tried the highest possible one (which is 8) on both my mailbox AND the
inbox folder, but it didn't work.

Any hints?

Mit freundlichen Grüßen / Best Regards / Saludos
Gunther Hartmann
                                              
Dr. Gunther Hartmann
Director R&D
KAMAX
Tel: +49 6633 79 162
Fax +49 6633 79 6162
mailto:g.hartmann@kamax.de
http://www.kamax.com


From agrabren at yahoo.com  Fri Apr 30 14:05:18 2004
From: agrabren at yahoo.com (Kevin Bruckert)
Date: Fri Apr 30 14:05:55 2004
Subject: [spambayes-dev] Microsoft Exchange Server integration / Web
	Interface Integration
Message-ID: <20040430180518.93111.qmail@web41506.mail.yahoo.com>

I've searched back a few months, found someone (Sean)
discussing this back in June of 2003, but no follow-up
since, nor found any useful links (in my opinion). So
let me explain my interests, and also offer my
assistance (I'm a seasoned programmer, although new to
Python... But I learn fast).

On a Microsoft Exchange 2003 server, we receive
plenty of spam. I've tried various solutions, but they
all fall short in the UI arena. What I want to do is
the following: Each user has a separate database,
although an initial global database never hurt anyone.
From there, users can either install a client module
into their Outlook, giving them an easy-to-use
feedback mechanism. Or, for many of the users,
integration into the Exchange Server web interface.
The integration into the web interface should be
smooth and easy-to-use as well, instead of having to
run between multiple places to report spam.

By running the filters on the server, mail is filtered
on entry to the system, and allows quick access to
important email while on-the-go.

I'm willing to put in as much effort as I can to do
such work, but might want a little help at various
stages to understand the existing architecture and
prevent re-working areas which are already written.

Thanks,
Kevin Bruckert



From rmalayter at bai.org  Fri Apr 30 14:22:31 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Fri Apr 30 14:22:35 2004
Subject: [spambayes-dev] Microsoft Exchange Server integration /
	WebInterface Integration
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A02411854@cliff.bai.org>

[Kevin Bruckert]
> On a Microsoft Exchange 2003 server, we receive
> plenty of spam. I've tried various solutions, but they
> all fall short in the UI arena. What I want to do is
> the following: Each user has a separate database,
> although an initial global database never hurt anyone.
> From there, users can either install a client module
> into their Outlook, giving them an easy-to-use
> feedback mechanism. Or, for many of the users,
> integration into the Exchange Server web interface.
> The integration into the web interface should be
> smooth and easy-to-use as well, instead of having to
> run between multiple places to report spam.
> 
> By running the filters on the server, mail is filtered
> on entry to the system, and allows quick access to
> important email while on-the-go.
> 
> I'm willing to put in as much effort as I can to do
> such work, but might want a little help at various
> stages to understand the existing architecture and
> prevent re-working areas which are already written.
 
The best server-side Exchange Server filter we evaluated in terms of UI
was Sunbelt Software's iHateSpam Server Edition. We ended up buying it
because it was the simplest to use and deploy, and works reasonably
well, even though it's not a Bayesian filter. The only UI is a set of
folders created in each user's inbox that contain filtered spam as well
as a whitelist and blacklist. I suggest you check it out; they have a
free demo. We get capture rates of 88% with the current version and our
threshold set to 170.

None of the other commercial or open-source Bayesian filters - which
filter more accurately - came close to iHateSpam SE in terms of
deployment ease and ease of use. Those two factors overrode all of our
other criteria.

That said, if you want to use SpamBayes in a server-side scenario, you'll
have to customize the code a lot to make it work. You might be better
off trying to use something like DSPAM on a Linux box as a gateway in
front of your Exchange server. It has per-user filtering.

Another option that worked well for us for a while was ASSP, available at
http://assp.sourceforge.net. However, it does not do per-user filtering.
It has one DB for all users on a server. We ended up abandoning it
because we couldn't get our test group to train it well, despite lots of
instruction. The performance was very good when it was only the IT
department using it, though ;-).

Regards,
	Ryan